2025-03-12

Title: Psychological Counseling Ability of Large Language Models

Authors: Fangyu Peng, Jingxin Nie
Subjects: cs.LG, cs.AI, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.07627
Pdf URL: https://arxiv.org/pdf/2503.07627
Copy Paste: [[2503.07627]] Psychological Counseling Ability of Large Language Models(https://arxiv.org/abs/2503.07627)
Keywords: large language model
Abstract: With the development of science and the continuous progress of artificial intelligence technology, Large Language Models (LLMs) have begun to be widely utilized across various fields. However, in the field of psychological counseling, the ability of LLMs have not been systematically assessed. In this study, we assessed the psychological counseling ability of mainstream LLMs using 1096 psychological counseling skill questions which were selected from the Chinese National Counselor Level 3 Examination, including Knowledge-based, Analytical-based, and Application-based question types. The analysis showed that the correctness rates of the LLMs for Chinese questions, in descending order, were GLM-3 (46.5%), GPT-4 (46.1%), Gemini (45.0%), ERNIE-3.5 (45.7%) and GPT-3.5 (32.9%). The correctness rates of the LLMs for English questions, in descending order, were ERNIE-3.5 (43.9%), GPT-4 (40.6%), Gemini (36.6%), GLM-3 (29.9%) and GPT-3.5 (29.5%). A chi-square test indicated significant differences in the LLMs' performance on Chinese and English questions. Furthermore, we subsequently utilized the Counselor's Guidebook (Level 3) as a reference for ERNIE-3.5, resulting in a new correctness rate of 59.6%, a 13.8% improvement over its initial rate of 45.8%. In conclusion, the study assessed the psychological counseling ability of LLMs for the first time, which may provide insights for future enhancement and improvement of psychological counseling ability of LLMs.

Title: FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation

Authors: Andrew Kiruluta, Eric Lundy, Andreas Lemos
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07630
Pdf URL: https://arxiv.org/pdf/2503.07630
Copy Paste: [[2503.07630]] FourierNAT: A Fourier-Mixing-Based Non-Autoregressive Transformer for Parallel Sequence Generation(https://arxiv.org/abs/2503.07630)
Keywords: transformer
Abstract: We present FourierNAT, a novel non-autoregressive Transformer (NAT) architecture that employs Fourier-based mixing in the decoder to generate output sequences in parallel. While traditional NAT approaches often face challenges with capturing global dependencies, our method leverages a discrete Fourier transform to mix token embeddings across the entire sequence dimension, coupled with learned frequency-domain gating. This allows the model to efficiently propagate context without explicit autoregressive steps. Empirically, FourierNAT achieves competitive results against leading NAT baselines on standard benchmarks like WMT machine translation and CNN/DailyMail summarization, providing significant speed advantages over autoregressive Transformers. We further demonstrate that learned frequency-domain parameters allow the model to adaptively focus on long-range or short-range dependencies, partially mitigating the well-known coherence gaps in one-pass NAT generation. Overall, FourierNAT highlights the potential of integrating spectral-domain operations to accelerate and improve parallel text generation. This approach can potentially provide great computational and time savings in inference tasks LLMs.

Title: Cross-modal Causal Relation Alignment for Video Question Grounding

Authors: Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, Liang Lin
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07635
Pdf URL: https://arxiv.org/pdf/2503.07635
Copy Paste: [[2503.07635]] Cross-modal Causal Relation Alignment for Video Question Grounding(https://arxiv.org/abs/2503.07635)
Keywords: robust
Abstract: Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Grounding (GSG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter, ii) Cross-Modal Alignment (CMA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features, iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Codes are available at this https URL.

Title: Is Pre-training Applicable to the Decoder for Dense Prediction?

Authors: Chao Ning, Wanshui Gan, Weihao Xuan, Naoto Yokoya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07637
Pdf URL: https://arxiv.org/pdf/2503.07637
Copy Paste: [[2503.07637]] Is Pre-training Applicable to the Decoder for Dense Prediction?(https://arxiv.org/abs/2503.07637)
Keywords: segmentation
Abstract: Pre-trained encoders are widely employed in dense prediction tasks for their capability to effectively extract visual features from images. The decoder subsequently processes these features to generate pixel-level predictions. However, due to structural differences and variations in input data, only encoders benefit from pre-learned representations from vision benchmarks such as image classification and self-supervised learning, while decoders are typically trained from scratch. In this paper, we introduce $\times$Net, which facilitates a "pre-trained encoder $\times$ pre-trained decoder" collaboration through three innovative designs. $\times$Net enables the direct utilization of pre-trained models within the decoder, integrating pre-learned representations into the decoding process to enhance performance in dense prediction tasks. By simply coupling the pre-trained encoder and pre-trained decoder, $\times$Net distinguishes itself as a highly promising approach. Remarkably, it achieves this without relying on decoding-specific structures or task-specific algorithms. Despite its streamlined design, $\times$Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation, achieving state-of-the-art performance particularly in monocular depth estimation. and semantic segmentation, achieving state-of-the-art results, especially in monocular depth estimation. embedding algorithms. Despite its streamlined design, $\times$Net outperforms advanced methods in tasks such as monocular depth estimation and semantic segmentation, achieving state-of-the-art performance particularly in monocular depth estimation.

Title: Mixture of Experts Made Intrinsically Interpretable

Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07639
Pdf URL: https://arxiv.org/pdf/2503.07639
Copy Paste: [[2503.07639]] Mixture of Experts Made Intrinsically Interpretable(https://arxiv.org/abs/2503.07639)
Keywords: interpretability, large language model
Abstract: Neurons in large language models often exhibit \emph{polysemanticity}, simultaneously encoding multiple unrelated concepts and obscuring interpretability. Instead of relying on post-hoc methods, we present \textbf{MoE-X}, a Mixture-of-Experts (MoE) language model designed to be \emph{intrinsically} interpretable. Our approach is motivated by the observation that, in language models, wider networks with sparse activations are more likely to capture interpretable factors. However, directly training such large sparse networks is computationally prohibitive. MoE architectures offer a scalable alternative by activating only a subset of experts for any given input, inherently aligning with interpretability objectives. In MoE-X, we establish this connection by rewriting the MoE layer as an equivalent sparse, large MLP. This approach enables efficient scaling of the hidden size while maintaining sparsity. To further enhance interpretability, we enforce sparse activation within each expert and redesign the routing mechanism to prioritize experts with the highest activation sparsity. These designs ensure that only the most salient features are routed and processed by the experts. We evaluate MoE-X on chess and natural language tasks, showing that it achieves performance comparable to dense models while significantly improving interpretability. MoE-X achieves a perplexity better than GPT-2, with interpretability surpassing even sparse autoencoder (SAE)-based approaches.

Title: BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification

Authors: Jing Zhang, Xiaowei Yu, Tong Chen, Chao Cao, Mingheng Chen, Yan Zhuang, Yanjun Lyu, Lu Zhang, Li Su, Tianming Liu, Dajiang Zhu
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2503.07640
Pdf URL: https://arxiv.org/pdf/2503.07640
Copy Paste: [[2503.07640]] BrainNet-MoE: Brain-Inspired Mixture-of-Experts Learning for Neurological Disease Identification(https://arxiv.org/abs/2503.07640)
Keywords: transformer, generative
Abstract: The Lewy body dementia (LBD) is the second most common neurodegenerative dementia after Alzheimer's disease (AD). Early differentiation between AD and LBD is crucial because they require different treatment approaches, but this is challenging due to significant clinical overlap, heterogeneity, complex pathogenesis, and the rarity of LBD. While recent advances in artificial intelligence (AI) demonstrate powerful learning capabilities and offer new hope for accurate diagnosis, existing methods primarily focus on designing "neural-level networks". Our work represents a pioneering effort in modeling system-level artificial neural network called BrainNet-MoE for brain modeling and diagnosing. Inspired by the brain's hierarchical organization of bottom-up sensory integration and top-down control, we design a set of disease-specific expert groups to process brain sub-network under different condition, A disease gate mechanism guides the specializa-tion of expert groups, while a transformer layer enables communication be-tween all sub-networks, generating a comprehensive whole-brain represen-tation for downstream disease classification. Experimental results show superior classification accuracy with interpretable insights into how brain sub-networks contribute to different neurodegenerative conditions.

Title: ConstellationNet: Reinventing Spatial Clustering through GNNs

Authors: Aidan Gao, Junhong Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07643
Pdf URL: https://arxiv.org/pdf/2503.07643
Copy Paste: [[2503.07643]] ConstellationNet: Reinventing Spatial Clustering through GNNs(https://arxiv.org/abs/2503.07643)
Keywords: robust
Abstract: Spatial clustering is a crucial field, finding universal use across criminology, pathology, and urban planning. However, most spatial clustering algorithms cannot pull information from nearby nodes and suffer performance drops when dealing with higher dimensionality and large datasets, making them suboptimal for large-scale and high-dimensional clustering. Due to modern data growing in size and dimension, clustering algorithms become weaker when addressing multifaceted issues. To improve upon this, we develop ConstellationNet, a convolution neural network(CNN)-graph neural network(GNN) framework that leverages the embedding power of a CNN, the neighbor aggregation of a GNN, and a neural network's ability to deal with batched data to improve spatial clustering and classification with graph augmented predictions. ConstellationNet achieves state-of-the-art performance on both supervised classification and unsupervised clustering across several datasets, outperforming state-of-the-art classification and clustering while reducing model size and training time by up to tenfold and improving baselines by 10 times. Because of its fast training and powerful nature, ConstellationNet holds promise in fields like epidemiology and medical imaging, able to quickly train on new data to develop robust responses.

Title: BicliqueEncoder: An Efficient Method for Link Prediction in Bipartite Networks using Formal Concept Analysis and Transformer Encoder

Authors: Hongyuan Yang, Siqi Peng, Akihiro Yamamoto
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2503.07645
Pdf URL: https://arxiv.org/pdf/2503.07645
Copy Paste: [[2503.07645]] BicliqueEncoder: An Efficient Method for Link Prediction in Bipartite Networks using Formal Concept Analysis and Transformer Encoder(https://arxiv.org/abs/2503.07645)
Keywords: transformer
Abstract: We propose a novel and efficient method for link prediction in bipartite networks, using \textit{formal concept analysis} (FCA) and the Transformer encoder. Link prediction in bipartite networks finds practical applications in various domains such as product recommendation in online sales, and prediction of chemical-disease interaction in medical science. Since for link prediction, the topological structure of a network contains valuable information, many approaches focus on extracting structural features and then utilizing them for link prediction. Bi-cliques, as a type of structural feature of bipartite graphs, can be utilized for link prediction. Although several link prediction methods utilizing bi-cliques have been proposed and perform well in rather small datasets, all of them face challenges with scalability when dealing with large datasets since they demand substantial computational resources. This limits the practical utility of these approaches in real-world applications. To overcome the limitation, we introduce a novel approach employing iceberg concept lattices and the Transformer encoder. Our method requires fewer computational resources, making it suitable for large-scale datasets while maintaining high prediction performance. We conduct experiments on five large real-world datasets that exceed the capacity of previous bi-clique-based approaches to demonstrate the efficacy of our method. Additionally, we perform supplementary experiments on five small datasets to compare with the previous bi-clique-based methods for bipartite link prediction and demonstrate that our method is more efficient than the previous ones.

Title: On the Importance of Clearsky Model in Short-Term Solar Radiation Forecasting

Authors: Cyril Voyant, Milan Despotovic, Gilles Notton, Yves-Marie Saint-Drenan, Mohammed Asloune, Luis Garcia-Gutierrez
Subjects: cs.LG, physics.ao-ph, physics.data-an
Abstract URL: https://arxiv.org/abs/2503.07647
Pdf URL: https://arxiv.org/pdf/2503.07647
Copy Paste: [[2503.07647]] On the Importance of Clearsky Model in Short-Term Solar Radiation Forecasting(https://arxiv.org/abs/2503.07647)
Keywords: robust
Abstract: Clearsky models are widely used in solar energy for many applications such as quality control, resource assessment, satellite-base irradiance estimation and forecasting. However, their use in forecasting and nowcasting is associated with a number of challenges. Synchronization errors, reliance on the Clearsky index (ratio of the global horizontal irradiance to its cloud-free counterpart) and high sensitivity of the clearsky model to errors in aerosol optical depth at low solar elevation limit their added value in real-time applications. This paper explores the feasibility of short-term forecasting without relying on a clearsky model. We propose a Clearsky-Free forecasting approach using Extreme Learning Machine (ELM) models. ELM learns daily periodicity and local variability directly from raw Global Horizontal Irradiance (GHI) data. It eliminates the need for Clearsky normalization, simplifying the forecasting process and improving scalability. Our approach is a non-linear adaptative statistical method that implicitely learns the irradiance in cloud-free conditions removing the need for an clear-sky model and the related operational issues. Deterministic and probabilistic results are compared to traditional benchmarks, including ARMA with McClear-generated Clearsky data and quantile regression for probabilistic forecasts. ELM matches or outperforms these methods, providing accurate predictions and robust uncertainty quantification. This approach offers a simple, efficient solution for real-time solar forecasting. By overcoming the stationarization process limitations based on usual multiplicative scheme Clearsky models, it provides a flexible and reliable framework for modern energy systems.

Title: The day-ahead scenario generation method for new energy based on an improved conditional generative diffusion model

Authors: Changgang Wang, Wei Liu, Yu Cao, Dong Liang, Yang Li, Jingshan Mo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07648
Pdf URL: https://arxiv.org/pdf/2503.07648
Copy Paste: [[2503.07648]] The day-ahead scenario generation method for new energy based on an improved conditional generative diffusion model(https://arxiv.org/abs/2503.07648)
Keywords: interpretability, diffusion, generative
Abstract: In the context of the rising share of new energy generation, accurately generating new energy output scenarios is crucial for day-ahead power system scheduling. Deep learning-based scenario generation methods can address this need, but their black-box nature raises concerns about interpretability. To tackle this issue, this paper introduces a method for day-ahead new energy scenario generation based on an improved conditional generative diffusion model. This method is built on the theoretical framework of Markov chains and variational inference. It first transforms historical data into pure noise through a diffusion process, then uses conditional information to guide the denoising process, ultimately generating scenarios that satisfy the conditional distribution. Additionally, the noise table is improved to a cosine form, enhancing the quality of the generated scenarios. When applied to actual wind and solar output data, the results demonstrate that this method effectively generates new energy output scenarios with good adaptability.

Title: TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster

Authors: Kanghui Ning, Zijie Pan, Yu Liu, Yushan Jiang, James Y. Zhang, Kashif Rasul, Anderson Schneider, Lintao Ma, Yuriy Nevmyvaka, Dongjin Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07649
Pdf URL: https://arxiv.org/pdf/2503.07649
Copy Paste: [[2503.07649]] TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster(https://arxiv.org/abs/2503.07649)
Keywords: interpretability, large language model
Abstract: Recently, Large Language Models (LLMs) and Foundation Models (FMs) have become prevalent for time series forecasting tasks. However, fine-tuning large language models (LLMs) for forecasting enables the adaptation to specific domains but may not generalize well across diverse, unseen datasets. Meanwhile, existing time series foundation models (TSFMs) lack inherent mechanisms for domain adaptation and suffer from limited interpretability, making them suboptimal for zero-shot forecasting. To this end, we present TS-RAG, a retrieval-augmented generation based time series forecasting framework that enhances the generalization capability and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant time series segments from a dedicated knowledge database, incorporating contextual patterns for the given time series query. Next, we develop a learnable Mixture-of-Experts (MoE)-based augmentation module, which dynamically fuses retrieved time series patterns with the TSFM's representation of the input query, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming TSFMs by up to 6.51% across diverse domains and showcasing desired interpretability.

Title: MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration

Authors: Jinguang Wang, Jingyu Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07654
Pdf URL: https://arxiv.org/pdf/2503.07654
Copy Paste: [[2503.07654]] MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration(https://arxiv.org/abs/2503.07654)
Keywords: large language model
Abstract: Quantization has been widely used to compress and accelerate inference of large language models (LLMs). Existing methods focus on exploring the per-token dynamic calibration to ensure both inference acceleration and model accuracy under 4-bit quantization. However, in autoregressive generation inference of long sequences, the overhead of repeated dynamic quantization and dequantization steps becomes considerably expensive. In this work, we propose MergeQuant, an accurate and efficient per-channel static quantization framework. MergeQuant integrates the per-channel quantization steps with the corresponding scalings and linear mappings through a Quantization Step Migration (QSM) method, thereby eliminating the quantization overheads before and after matrix multiplication. Furthermore, in view of the significant differences between the different channel ranges, we propose dimensional reconstruction and adaptive clipping to address the non-uniformity of quantization scale factors and redistribute the channel variations to the subsequent modules to balance the parameter distribution under QSM. Within the static quantization setting of W4A4, MergeQuant reduces the accuracy gap on zero-shot tasks compared to FP16 baseline to 1.3 points on Llama-2-70B model. On Llama-2-7B model, MergeQuant achieves up to 1.77x speedup in decoding, and up to 2.06x speedup in end-to-end compared to FP16 baseline.

Title: GraphT5: Unified Molecular Graph-Language Modeling via Multi-Modal Cross-Token Attention

Authors: Sangyeup Kim, Nayeon Kim, Yinhua Piao, Sun Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07655
Pdf URL: https://arxiv.org/pdf/2503.07655
Copy Paste: [[2503.07655]] GraphT5: Unified Molecular Graph-Language Modeling via Multi-Modal Cross-Token Attention(https://arxiv.org/abs/2503.07655)
Keywords: transformer
Abstract: Molecular language modeling tasks such as molecule captioning have been recognized for their potential to further understand molecular properties that can aid drug discovery or material synthesis based on chemical reactions. Unlike the common use of molecule graphs in predicting molecular properties, most methods in molecular language modeling rely heavily on SMILES sequences. This preference is because the task involves generating a sequence of multiple tokens using transformer-based models. Therefore, a main challenge is determining how to integrate graph data, which contains structural and spatial information about molecules, with text data. In addition, simply using both 1D SMILES text and 2D graph as inputs without addressing how they align and represent the molecule structure in different modalities makes it challenging to fully utilize structural knowledge about molecules. To this end, we propose GraphT5, a multi-modal framework that integrates 1D SMILES text and 2D graph representations of molecules for molecular language modeling. Specifically, we introduce a novel cross-token attention module in GraphT5 to bridge the gap arising from the fundamental differences between the two modalities of molecule representations. Cross-token attention exploits implicit information between SMILES and graphs of molecules, resulting from their interactions at a fine-grained token level that benefits molecular language modeling. Extensive experiments including molecule captioning, IUPAC name prediction tasks, and case studies show that our GraphT5 outperforms the latest baseline approaches, which validates the effectiveness of our GraphT5 in sufficiently utilizing 1D SMILES text and 2D graph representations.

Title: DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving

Authors: Xiaosong Jia, Junqi You, Zhiyuan Zhang, Junchi Yan
Subjects: cs.LG, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.07656
Pdf URL: https://arxiv.org/pdf/2503.07656
Copy Paste: [[2503.07656]] DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving(https://arxiv.org/abs/2503.07656)
Keywords: transformer
Abstract: End-to-end autonomous driving (E2E-AD) has emerged as a trend in the field of autonomous driving, promising a data-driven, scalable approach to system design. However, existing E2E-AD methods usually adopt the sequential paradigm of perception-prediction-planning, which leads to cumulative errors and training instability. The manual ordering of tasks also limits the system`s ability to leverage synergies between tasks (for example, planning-aware perception and game-theoretic interactive prediction and planning). Moreover, the dense BEV representation adopted by existing methods brings computational challenges for long-range perception and long-term temporal fusion. To address these challenges, we present DriveTransformer, a simplified E2E-AD framework for the ease of scaling up, characterized by three key features: Task Parallelism (All agent, map, and planning queries direct interact with each other at each block), Sparse Representation (Task queries direct interact with raw sensor features), and Streaming Processing (Task queries are stored and passed as history information). As a result, the new framework is composed of three unified operations: task self-attention, sensor cross-attention, temporal cross-attention, which significantly reduces the complexity of system and leads to better training stability. DriveTransformer achieves state-of-the-art performance in both simulated closed-loop benchmark Bench2Drive and real world open-loop benchmark nuScenes with high FPS.

Title: SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs

Authors: Jaewoo Song, Fangzhen Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07657
Pdf URL: https://arxiv.org/pdf/2503.07657
Copy Paste: [[2503.07657]] SplitQuantV2: Enhancing Low-Bit Quantization of LLMs Without GPUs(https://arxiv.org/abs/2503.07657)
Keywords: large language model
Abstract: The quantization of large language models (LLMs) is crucial for deploying them on devices with limited computational resources. While advanced quantization algorithms offer improved performance compared to the basic linear quantization, they typically require high-end graphics processing units (GPUs), are often restricted to specific deep neural network (DNN) frameworks, and require calibration datasets. This limitation poses challenges for using such algorithms on various neural processing units (NPUs) and edge AI devices, which have diverse model formats and frameworks. In this paper, we show SplitQuantV2, an innovative algorithm designed to enhance low-bit linear quantization of LLMs, can achieve results comparable to those of advanced algorithms. SplitQuantV2 preprocesses models by splitting linear and convolution layers into functionally equivalent, quantization-friendly structures. The algorithm's platform-agnostic, concise, and efficient nature allows for implementation without the need for GPUs. Our evaluation on the Llama 3.2 1B Instruct model using the AI2's Reasoning Challenge (ARC) dataset demonstrates that SplitQuantV2 improves the accuracy of the INT4 quantization model by 11.76%p, matching the performance of the original floating-point model. Remarkably, SplitQuantV2 took only 2 minutes 6 seconds to preprocess the 1B model and perform linear INT4 quantization using only an Apple M4 CPU. SplitQuantV2 provides a practical solution for low-bit quantization on LLMs, especially when complex, computation-intensive algorithms are inaccessible due to hardware limitations or framework incompatibilities.

Title: Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy

Authors: Wei Junhao, Yu Zhe, Sakuma Jun
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.07661
Pdf URL: https://arxiv.org/pdf/2503.07661
Copy Paste: [[2503.07661]] Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy(https://arxiv.org/abs/2503.07661)
Keywords: protect, defense, attack, robust, watermark
Abstract: Model merging is a technique that combines multiple finetuned models into a single model without additional training, allowing a free-rider to cheaply inherit specialized capabilities. This study investigates methodologies to suppress unwanted model merging by free-riders. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. In contrast, we propose a first proactive defense against model merging. Specifically, our defense method modifies the model parameters so that the model is disrupted if the model is merged with any other model, while its functionality is kept unchanged if not merged with others. Our approach consists of two modules, rearranging MLP parameters and scaling attention heads, which push the model out of the shared basin in parameter space, causing the merging performance with other models to degrade significantly. We conduct extensive experiments on image classification, image generation, and text classification to demonstrate that our defense severely disrupts merging while retaining the functionality of the post-protect model. Moreover, we analyze potential adaptive attacks and further propose a dropout-based pruning to improve our proposal's robustness.

Title: Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs

Authors: Dingkun Zhang, Shuhan Qi, Xinyu Xiao, Kehai Chen, Xuan Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07663
Pdf URL: https://arxiv.org/pdf/2503.07663
Copy Paste: [[2503.07663]] Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs(https://arxiv.org/abs/2503.07663)
Keywords: large language model
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enhanced their versatility as they integrate a growing number of modalities. Considering the heavy cost of training MLLMs, it is necessary to reuse the existing ones and further extend them to more modalities through Modality-incremental Continual Learning (MCL). However, this often comes with a performance degradation in the previously learned modalities. In this work, we revisit the MCL and investigate a more severe issue it faces in contrast to traditional continual learning, that its degradation comes not only from catastrophic forgetting but also from the misalignment between the modality-agnostic and modality-specific components. To address this problem, we propose an elegantly simple MCL paradigm called "MErge then ReAlign" (MERA). Our method avoids introducing heavy training overhead or modifying the model architecture, hence is easy to deploy and highly reusable in the MLLM community. Extensive experiments demonstrate that, despite the simplicity of MERA, it shows impressive performance, holding up to a 99.84% Backward Relative Gain when extending to four modalities, achieving a nearly lossless MCL performance.

Title: WECAR: An End-Edge Collaborative Inference and Training Framework for WiFi-Based Continuous Human Activity Recognition

Authors: Rong Li, Tao Deng, Siwei Feng, He Huang, Juncheng Jia, Di Yuan, Keqin Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07669
Pdf URL: https://arxiv.org/pdf/2503.07669
Copy Paste: [[2503.07669]] WECAR: An End-Edge Collaborative Inference and Training Framework for WiFi-Based Continuous Human Activity Recognition(https://arxiv.org/abs/2503.07669)
Keywords: transformer
Abstract: WiFi-based human activity recognition (HAR) holds significant promise for ubiquitous sensing in smart environments. A critical challenge lies in enabling systems to dynamically adapt to evolving scenarios, learning new activities without catastrophic forgetting of prior knowledge, while adhering to the stringent computational constraints of edge devices. Current approaches struggle to reconcile these requirements due to prohibitive storage demands for retaining historical data and inefficient parameter utilization. We propose WECAR, an end-edge collaborative inference and training framework for WiFi-based continuous HAR, which decouples computational workloads to overcome these limitations. In this framework, edge devices handle model training, lightweight optimization, and updates, while end devices perform efficient inference. WECAR introduces two key innovations, i.e., dynamic continual learning with parameter efficiency and hierarchical distillation for end deployment. For the former, we propose a transformer-based architecture enhanced by task-specific dynamic model expansion and stability-aware selective retraining. For the latter, we propose a dual-phase distillation mechanism that includes multi-head self-attention relation distillation and prefix relation distillation. We implement WECAR based on heterogeneous hardware using Jetson Nano as edge devices and the ESP32 as end devices, respectively. Our experiments across three public WiFi datasets reveal that WECAR not only outperforms several state-of-the-art methods in performance and parameter efficiency, but also achieves a substantial reduction in the model's parameter count post-optimization without sacrificing accuracy. This validates its practicality for resource-constrained environments.

Title: TVNet: A Novel Time Series Analysis Method Based on Dynamic Convolution and 3D-Variation

Authors: Chenghan Li, Mingchen Li, Ruisheng Diao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07674
Pdf URL: https://arxiv.org/pdf/2503.07674
Copy Paste: [[2503.07674]] TVNet: A Novel Time Series Analysis Method Based on Dynamic Convolution and 3D-Variation(https://arxiv.org/abs/2503.07674)
Keywords: robust, transformer
Abstract: With the recent development and advancement of Transformer and MLP architectures, significant strides have been made in time series analysis. Conversely, the performance of Convolutional Neural Networks (CNNs) in time series analysis has fallen short of expectations, diminishing their potential for future applications. Our research aims to enhance the representational capacity of Convolutional Neural Networks (CNNs) in time series analysis by introducing novel perspectives and design innovations. To be specific, We introduce a novel time series reshaping technique that considers the inter-patch, intra-patch, and cross-variable dimensions. Consequently, we propose TVNet, a dynamic convolutional network leveraging a 3D perspective to employ time series analysis. TVNet retains the computational efficiency of CNNs and achieves state-of-the-art results in five key time series analysis tasks, offering a superior balance of efficiency and performance over the state-of-the-art Transformer-based and MLP-based models. Additionally, our findings suggest that TVNet exhibits enhanced transferability and robustness. Therefore, it provides a new perspective for applying CNN in advanced time series analysis tasks.

Title: PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

Authors: Kwanyoung Kim, Byeongsu Sim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07677
Pdf URL: https://arxiv.org/pdf/2503.07677
Copy Paste: [[2503.07677]] PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity(https://arxiv.org/abs/2503.07677)
Keywords: robust, diffusion, transformer
Abstract: Diffusion models have shown impressive results in generating high-quality conditional samples using guidance techniques such as Classifier-Free Guidance (CFG). However, existing methods often require additional training or neural function evaluations (NFEs), making them incompatible with guidance-distilled models. Also, they rely on heuristic approaches that need identifying target layers. In this work, we propose a novel and efficient method, termed PLADIS, which boosts pre-trained models (U-Net/Transformer) by leveraging sparse attention. Specifically, we extrapolate query-key correlations using softmax and its sparse counterpart in the cross-attention layer during inference, without requiring extra training or NFEs. By leveraging the noise robustness of sparse attention, our PLADIS unleashes the latent potential of text-to-image diffusion models, enabling them to excel in areas where they once struggled with newfound effectiveness. It integrates seamlessly with guidance techniques, including guidance-distilled models. Extensive experiments show notable improvements in text alignment and human preference, offering a highly efficient and universally applicable solution.

Title: Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM

Authors: Yongqiang Yao, Jingru Tan, Kaihuan Liang, Feizhao Zhang, Yazhe Niu, Jiahao Hu, Ruihao Gong, Dahua Lin, Ningyi Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07680
Pdf URL: https://arxiv.org/pdf/2503.07680
Copy Paste: [[2503.07680]] Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM(https://arxiv.org/abs/2503.07680)
Keywords: large language model
Abstract: Training Long-Context Large Language Models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalances. Existing works mainly use data packing to alleviate this issue but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), which designs a novel batch-construction method and training recipe to address those inefficiencies. In particular, the HBP constructs multi-level data packing groups, each optimized with a distinct packing length. It assigns training samples to their optimal groups and configures each group with the most effective settings, including sequential parallelism degree and gradient checkpointing configuration. To effectively utilize multi-level groups of data, we design a dynamic training pipeline specifically tailored to HBP, including curriculum learning, adaptive sequential parallelism, and stable loss. Our extensive experiments demonstrate that our method significantly reduces training time over multiple datasets and open-source models while maintaining strong performance. For the largest DeepSeek-V2 (236B) MOE model, our method speeds up the training by 2.4$\times$ with competitive performance.

Title: A Time Series Multitask Framework Integrating a Large Language Model, Pre-Trained Time Series Model, and Knowledge Graph

Authors: Shule Hao, Junpeng Bao, Chuncheng Lu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07682
Pdf URL: https://arxiv.org/pdf/2503.07682
Copy Paste: [[2503.07682]] A Time Series Multitask Framework Integrating a Large Language Model, Pre-Trained Time Series Model, and Knowledge Graph(https://arxiv.org/abs/2503.07682)
Keywords: robust, large language model
Abstract: Time series analysis is crucial in fields like finance, transportation, and industry. However, traditional models often focus solely on temporal features, limiting their ability to capture underlying information. This paper proposes a novel time series multitask framework, called LTM, which integrates temporal features with textual descriptions to enhance analytical and predictive capabilities. LTM combines pre-trained time series model, large language model (LLM), and knowledge graph to tackle time series tasks, including forecasting, imputation, and anomaly detection. LTM achieves improved performance with a few trainable parameters. It is very efficient and practical. LTM encodes time series data into patches and enriches user-provided prompts using knowledge graphs to generate enhanced prompts. A novel feature fusion method embeds prompts into each patch encoding, which is processed by a frozen LLM, followed by a feature enhancement module and a time decoder module. During fine-tuning stage, cosine similarity between prompts and temporal patches is integrated into the loss function to boost performance. Experiments on benchmark datasets show that LTM significantly outperforms existing methods. It provides a robust and versatile solution for time series tasks.

Title: Fair Text Classification via Transferable Representations

Authors: Thibaud Leteno, Michael Perrot, Charlotte Laclau, Antoine Gourru, Christophe Gravier
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07691
Pdf URL: https://arxiv.org/pdf/2503.07691
Copy Paste: [[2503.07691]] Fair Text Classification via Transferable Representations(https://arxiv.org/abs/2503.07691)
Keywords: fair
Abstract: Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g., women and men) remains an open challenge. We propose an approach that extends the use of the Wasserstein Dependency Measure for learning unbiased neural text classifiers. Given the challenge of distinguishing fair from unfair information in a text encoder, we draw inspiration from adversarial training by inducing independence between representations learned for the target label and those for a sensitive attribute. We further show that Domain Adaptation can be efficiently leveraged to remove the need for access to the sensitive attributes in the dataset we cure. We provide both theoretical and empirical evidence that our approach is well-founded.

Title: PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models

Authors: Michael-Andrei Panaitescu-Liess, Pankayaraj Pathmanathan, Yigitcan Kaya, Zora Che, Bang An, Sicheng Zhu, Aakriti Agrawal, Furong Huang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.07697
Pdf URL: https://arxiv.org/pdf/2503.07697
Copy Paste: [[2503.07697]] PoisonedParrot: Subtle Data Poisoning Attacks to Elicit Copyright-Infringing Content from Large Language Models(https://arxiv.org/abs/2503.07697)
Keywords: defense, attack, steal, large language model
Abstract: As the capabilities of large language models (LLMs) continue to expand, their usage has become increasingly prevalent. However, as reflected in numerous ongoing lawsuits regarding LLM-generated content, addressing copyright infringement remains a significant challenge. In this paper, we introduce PoisonedParrot: the first stealthy data poisoning attack that induces an LLM to generate copyrighted content even when the model has not been directly trained on the specific copyrighted material. PoisonedParrot integrates small fragments of copyrighted text into the poison samples using an off-the-shelf LLM. Despite its simplicity, evaluated in a wide range of experiments, PoisonedParrot is surprisingly effective at priming the model to generate copyrighted content with no discernible side effects. Moreover, we discover that existing defenses are largely ineffective against our attack. Finally, we make the first attempt at mitigating copyright-infringement poisoning attacks by proposing a defense: ParrotTrap. We encourage the community to explore this emerging threat model further.

Title: Graphint: Graph-based Time Series Clustering Visualisation Tool

Authors: Paul Boniol, Donato Tiano, Angela Bonifati, Themis Palpanas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07698
Pdf URL: https://arxiv.org/pdf/2503.07698
Copy Paste: [[2503.07698]] Graphint: Graph-based Time Series Clustering Visualisation Tool(https://arxiv.org/abs/2503.07698)
Keywords: robust, interpretability
Abstract: With the exponential growth of time series data across diverse domains, there is a pressing need for effective analysis tools. Time series clustering is important for identifying patterns in these datasets. However, prevailing methods often encounter obstacles in maintaining data relationships and ensuring interpretability. We present Graphint, an innovative system based on the $k$-Graph methodology that addresses these challenges. Graphint integrates a robust time series clustering algorithm with an interactive tool for comparison and interpretation. More precisely, our system allows users to compare results against competing approaches, identify discriminative subsequences within specified datasets, and visualize the critical information utilized by $k$-Graph to generate outputs. Overall, Graphint offers a comprehensive solution for extracting actionable insights from complex temporal datasets.

Title: RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories

Authors: Huiyang Shao, Xin Xia, Yuhong Yang, Yuxi Ren, Xing Wang, Xuefeng Xiao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.07699
Pdf URL: https://arxiv.org/pdf/2503.07699
Copy Paste: [[2503.07699]] RayFlow: Instance-Aware Diffusion Acceleration via Adaptive Flow Trajectories(https://arxiv.org/abs/2503.07699)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success across various domains. However, their slow generation speed remains a critical challenge. Existing acceleration methods, while aiming to reduce steps, often compromise sample quality, controllability, or introduce training complexities. Therefore, we propose RayFlow, a novel diffusion framework that addresses these limitations. Unlike previous methods, RayFlow guides each sample along a unique path towards an instance-specific target distribution. This method minimizes sampling steps while preserving generation diversity and stability. Furthermore, we introduce Time Sampler, an importance sampling technique to enhance training efficiency by focusing on crucial timesteps. Extensive experiments demonstrate RayFlow's superiority in generating high-quality images with improved speed, control, and training efficiency compared to existing acceleration techniques.

Title: Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Authors: Lixue Gong, Xiaoxia Hou, Fanshi Li, Liang Li, Xiaochen Lian, Fei Liu, Liyang Liu, Wei Liu, Wei Lu, Yichun Shi, Shiqi Sun, Yu Tian, Zhi Tian, Peng Wang, Xun Wang, Ye Wang, Guofeng Wu, Jie Wu, Xin Xia, Xuefeng Xiao, Linjie Yang, Zhonghua Zhai, Xinyu Zhang, Qi Zhang, Yuwei Zhang, Shijia Zhao, Jianchao Yang, Weilin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07703
Pdf URL: https://arxiv.org/pdf/2503.07703
Copy Paste: [[2503.07703]] Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model(https://arxiv.org/abs/2503.07703)
Keywords: diffusion, large language model
Abstract: Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances the accuracy and richness for image description. Particularly, Seedream is integrated with a self-developed bilingual large language model as a text encoder, allowing it to learn native knowledge directly from massive data. This enable it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Beside, Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled ROPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score. In addition, it can be readily adapted to an instruction-based image editing model, such as SeedEdit, with strong editing capability that balances instruction-following and image consistency.

Title: SIRE: SE(3) Intrinsic Rigidity Embeddings

Authors: Cameron Smith, Basile Van Hoorick, Vitor Guizilini, Yue Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07739
Pdf URL: https://arxiv.org/pdf/2503.07739
Copy Paste: [[2503.07739]] SIRE: SE(3) Intrinsic Rigidity Embeddings(https://arxiv.org/abs/2503.07739)
Keywords: segmentation
Abstract: Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure - highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can learn strong geometry and motion rigidity priors from video data, with minimal supervision.

Title: SANDRO: a Robust Solver with a Splitting Strategy for Point Cloud Registration

Authors: Michael Adlerstein, João Carlos Virgolino Soares, Angelo Bratta, Claudio Semini
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.07743
Pdf URL: https://arxiv.org/pdf/2503.07743
Copy Paste: [[2503.07743]] SANDRO: a Robust Solver with a Splitting Strategy for Point Cloud Registration(https://arxiv.org/abs/2503.07743)
Keywords: robust
Abstract: Point cloud registration is a critical problem in computer vision and robotics, especially in the field of navigation. Current methods often fail when faced with high outlier rates or take a long time to converge to a suitable solution. In this work, we introduce a novel algorithm for point cloud registration called SANDRO (Splitting strategy for point cloud Alignment using Non-convex anD Robust Optimization), which combines an Iteratively Reweighted Least Squares (IRLS) framework with a robust loss function with graduated non-convexity. This approach is further enhanced by a splitting strategy designed to handle high outlier rates and skewed distributions of outliers. SANDRO is capable of addressing important limitations of existing methods, as in challenging scenarios where the presence of high outlier rates and point cloud symmetries significantly hinder convergence. SANDRO achieves superior performance in terms of success rate when compared to the state-of-the-art methods, demonstrating a 20% improvement from the current state of the art when tested on the Redwood real dataset and 60% improvement when tested on synthetic data.

Title: SegResMamba: An Efficient Architecture for 3D Medical Image Segmentation

Authors: Badhan Kumar Das, Ajay Singh, Saahil Islam, Gengyan Zhao, Andreas Maier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07766
Pdf URL: https://arxiv.org/pdf/2503.07766
Copy Paste: [[2503.07766]] SegResMamba: An Efficient Architecture for 3D Medical Image Segmentation(https://arxiv.org/abs/2503.07766)
Keywords: transformer, segmentation
Abstract: The Transformer architecture has opened a new paradigm in the domain of deep learning with its ability to model long-range dependencies and capture global context and has outpaced the traditional Convolution Neural Networks (CNNs) in many aspects. However, applying Transformer models to 3D medical image datasets presents significant challenges due to their high training time, and memory requirements, which not only hinder scalability but also contribute to elevated CO$_2$ footprint. This has led to an exploration of alternative models that can maintain or even improve performance while being more efficient and environmentally sustainable. Recent advancements in Structured State Space Models (SSMs) effectively address some of the inherent limitations of Transformers, particularly their high memory and computational demands. Inspired by these advancements, we propose an efficient 3D segmentation model for medical imaging called SegResMamba, designed to reduce computation complexity, memory usage, training time, and environmental impact while maintaining high performance. Our model uses less than half the memory during training compared to other state-of-the-art (SOTA) architectures, achieving comparable performance with significantly reduced resource demands.

Title: Better Pose Initialization for Fast and Robust 2D/3D Pelvis Registration

Authors: Yehyun Suh, J. Ryan Martin, Daniel Moyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07767
Pdf URL: https://arxiv.org/pdf/2503.07767
Copy Paste: [[2503.07767]] Better Pose Initialization for Fast and Robust 2D/3D Pelvis Registration(https://arxiv.org/abs/2503.07767)
Keywords: robust
Abstract: This paper presents an approach for improving 2D/3D pelvis registration in optimization-based pose estimators using a learned initialization function. Current methods often fail to converge to the optimal solution when initialized naively. We find that even a coarse initializer greatly improves pose estimator accuracy, and improves overall computational efficiency. This approach proves to be effective also in challenging cases under more extreme pose variation. Experimental validation demonstrates that our method consistently achieves robust and accurate registration, enhancing the reliability of 2D/3D registration for clinical applications.

Title: NimbleReg: A light-weight deep-learning framework for diffeomorphic image registration

Authors: Antoine Legouhy, Ross Callaghan, Nolah Mazet, Vivien Julienne, Hojjat Azadbakht, Hui Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07768
Pdf URL: https://arxiv.org/pdf/2503.07768
Copy Paste: [[2503.07768]] NimbleReg: A light-weight deep-learning framework for diffeomorphic image registration(https://arxiv.org/abs/2503.07768)
Keywords: segmentation
Abstract: This paper presents NimbleReg, a light-weight deep-learning (DL) framework for diffeomorphic image registration leveraging surface representation of multiple segmented anatomical regions. Deep learning has revolutionized image registration but most methods typically rely on cumbersome gridded representations, leading to hardware-intensive models. Reliable fine-grained segmentations, that are now accessible at low cost, are often used to guide the alignment. Light-weight methods representing segmentations in terms of boundary surfaces have been proposed, but they lack mechanism to support the fusion of multiple regional mappings into an overall diffeomorphic transformation. Building on these advances, we propose a DL registration method capable of aligning surfaces from multiple segmented regions to generate an overall diffeomorphic transformation for the whole ambient space. The proposed model is light-weight thanks to a PointNet backbone. Diffeomoprhic properties are guaranteed by taking advantage of the stationary velocity field parametrization of diffeomorphisms. We demonstrate that this approach achieves alignment comparable to state-of-the-art DL-based registration techniques that consume images.

Title: Evaluating LLaMA 3.2 for Software Vulnerability Detection

Authors: José Gonçalves, Miguel Silva, Bernardo Cabral, Tiago Dias, Eva Maia, Isabel Praça, Ricardo Severino, Luís Lino Ferreira
Subjects: cs.LG, cs.AI, cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2503.07770
Pdf URL: https://arxiv.org/pdf/2503.07770
Copy Paste: [[2503.07770]] Evaluating LLaMA 3.2 for Software Vulnerability Detection(https://arxiv.org/abs/2503.07770)
Keywords: large language model
Abstract: Deep Learning (DL) has emerged as a powerful tool for vulnerability detection, often outperforming traditional solutions. However, developing effective DL models requires large amounts of real-world data, which can be difficult to obtain in sufficient quantities. To address this challenge, DiverseVul dataset has been curated as the largest dataset of vulnerable and non-vulnerable C/C++ functions extracted exclusively from real-world projects. Its goal is to provide high-quality, large-scale samples for training DL models. However, during our study several inconsistencies were identified in the raw dataset while applying pre-processing techniques, highlighting the need for a refined version. In this work, we present a refined version of DiverseVul dataset, which is used to fine-tune a large language model, LLaMA 3.2, for vulnerability detection. Experimental results show that the use of pre-processing techniques led to an improvement in performance, with the model achieving an F1-Score of 66%, a competitive result when compared to our baseline, which achieved a 47% F1-Score in software vulnerability detection.

Title: Sublinear Algorithms for Wasserstein and Total Variation Distances: Applications to Fairness and Privacy Auditing

Authors: Debabrota Basu, Debarshi Chanda
Subjects: cs.LG, cs.CY, cs.DS, stat.CO
Abstract URL: https://arxiv.org/abs/2503.07775
Pdf URL: https://arxiv.org/pdf/2503.07775
Copy Paste: [[2503.07775]] Sublinear Algorithms for Wasserstein and Total Variation Distances: Applications to Fairness and Privacy Auditing(https://arxiv.org/abs/2503.07775)
Keywords: privacy, federate, fair
Abstract: Resource-efficiently computing representations of probability distributions and the distances between them while only having access to the samples is a fundamental and useful problem across mathematical sciences. In this paper, we propose a generic algorithmic framework to estimate the PDF and CDF of any sub-Gaussian distribution while the samples from them arrive in a stream. We compute mergeable summaries of distributions from the stream of samples that require sublinear space w.r.t. the number of observed samples. This allows us to estimate Wasserstein and Total Variation (TV) distances between any two sub-Gaussian distributions while samples arrive in streams and from multiple sources (e.g. federated learning). Our algorithms significantly improves on the existing methods for distance estimation incurring super-linear time and linear space complexities. In addition, we use the proposed estimators of Wasserstein and TV distances to audit the fairness and privacy of the ML algorithms. We empirically demonstrate the efficiency of the algorithms for estimating these distances and auditing using both synthetic and real-world datasets.

Title: Joint Explainability-Performance Optimization With Surrogate Models for AI-Driven Edge Services

Authors: Foivos Charalampakos, Thomas Tsouparopoulos, Iordanis Koutsopoulos
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07784
Pdf URL: https://arxiv.org/pdf/2503.07784
Copy Paste: [[2503.07784]] Joint Explainability-Performance Optimization With Surrogate Models for AI-Driven Edge Services(https://arxiv.org/abs/2503.07784)
Keywords: explainability
Abstract: Explainable AI is a crucial component for edge services, as it ensures reliable decision making based on complex AI models. Surrogate models are a prominent approach of XAI where human-interpretable models, such as a linear regression model, are trained to approximate a complex (black-box) model's predictions. This paper delves into the balance between the predictive accuracy of complex AI models and their approximation by surrogate ones, advocating that both these models benefit from being learned simultaneously. We derive a joint (bi-level) training scheme for both models and we introduce a new algorithm based on multi-objective optimization (MOO) to simultaneously minimize both the complex model's prediction error and the error between its outputs and those of the surrogate. Our approach leads to improvements that exceed 99% in the approximation of the black-box model through the surrogate one, as measured by the metric of Fidelity, for a compromise of less than 3% absolute reduction in the black-box model's predictive accuracy, compared to single-task and multi-task learning baselines. By improving Fidelity, we can derive more trustworthy explanations of the complex model's outcomes from the surrogate, enabling reliable AI applications for intelligent services at the network edge.

Title: On the Semantic Security of NTRU -- with a gentle introduction to cryptography

Authors: Liam Peet-Pare
Subjects: cs.CR, math.NT
Abstract URL: https://arxiv.org/abs/2503.07790
Pdf URL: https://arxiv.org/pdf/2503.07790
Copy Paste: [[2503.07790]] On the Semantic Security of NTRU -- with a gentle introduction to cryptography(https://arxiv.org/abs/2503.07790)
Keywords: secure, security, attack
Abstract: This paper provides an explanation of NTRU, a post quantum encryption scheme, while also providing a gentle introduction to cryptography. NTRU is a very efficient lattice based cryptosystem that appears to be safe against attacks by quantum computers. NTRU's efficiency suggests that it is a strong candidate as an alternative to RSA, ElGamal, and ECC for the post quantum world. The paper begins with an introduction to cryptography and security proofs for cryptographic schemes before explaining the NTRU cryptosystem and culminating with a proof that the original presentation of NTRU is not IND-CPA secure. We will conclude by mentioning padding schemes to NTRU that are provably IND-CCA2 secure in the random oracle model. The paper is designed to be accessible to anyone with minimal background in abstract algebra and number theory - no previous knowledge of cryptography is assumed. Given the author's lack of familiarity with the subject, this paper aims to be an expository work rather than to provide new insights to the subject matter.

Title: Self-supervised Normality Learning and Divergence Vector-guided Model Merging for Zero-shot Congenital Heart Disease Detection in Fetal Ultrasound Videos

Authors: Pramit Saha, Divyanshu Mishra, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris Papageorghiou, Yuki M. Asano, J. Alison Noble
Subjects: cs.CV, cs.AI, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07799
Pdf URL: https://arxiv.org/pdf/2503.07799
Copy Paste: [[2503.07799]] Self-supervised Normality Learning and Divergence Vector-guided Model Merging for Zero-shot Congenital Heart Disease Detection in Fetal Ultrasound Videos(https://arxiv.org/abs/2503.07799)
Keywords: privacy
Abstract: Congenital Heart Disease (CHD) is one of the leading causes of fetal mortality, yet the scarcity of labeled CHD data and strict privacy regulations surrounding fetal ultrasound (US) imaging present significant challenges for the development of deep learning-based models for CHD detection. Centralised collection of large real-world datasets for rare conditions, such as CHD, from large populations requires significant co-ordination and resource. In addition, data governance rules increasingly prevent data sharing between sites. To address these challenges, we introduce, for the first time, a novel privacy-preserving, zero-shot CHD detection framework that formulates CHD detection as a normality modeling problem integrated with model merging. In our framework dubbed Sparse Tube Ultrasound Distillation (STUD), each hospital site first trains a sparse video tube-based self-supervised video anomaly detection (VAD) model on normal fetal heart US clips with self-distillation loss. This enables site-specific models to independently learn the distribution of healthy cases. To aggregate knowledge across the decentralized models while maintaining privacy, we propose a Divergence Vector-Guided Model Merging approach, DivMerge, that combines site-specific models into a single VAD model without data exchange. Our approach preserves domain-agnostic rich spatio-temporal representations, ensuring generalization to unseen CHD cases. We evaluated our approach on real-world fetal US data collected from 5 hospital sites. Our merged model outperformed site-specific models by 23.77% and 30.13% in accuracy and F1-score respectively on external test sets.

Title: Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

Authors: Kefan Song, Jin Yao, Runnan Jiang, Rohan Chandra, Shangtong Zhang
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07806
Pdf URL: https://arxiv.org/pdf/2503.07806
Copy Paste: [[2503.07806]] Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models(https://arxiv.org/abs/2503.07806)
Keywords: fair, large language model
Abstract: As Large Language Models (LLMs) become increasingly powerful and accessible to human users, ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. However, current fairness and bias research in LLMs is limited in two aspects. First, compared to traditional group fairness in machine learning classification, it requires that the non-sensitive attributes, in this case, the prompt questions, be the same across different groups. In many practical scenarios, different groups, however, may prefer different prompt questions and this requirement becomes impractical. Second, it evaluates group fairness only for the LLM's final output without identifying the source of possible bias. Namely, the bias in LLM's output can result from both the pretraining and the finetuning. For finetuning, the bias can result from both the RLHF procedure and the learned reward model. Arguably, evaluating the group fairness of each component in the LLM pipeline could help develop better methods to mitigate the possible bias. Recognizing those two limitations, this work benchmarks the group fairness of learned reward models. By using expert-written text from arXiv, we are able to benchmark the group fairness of reward models without requiring the same prompt questions across different demographic groups. Surprisingly, our results demonstrate that all the evaluated reward models (e.g., Nemotron-4-340B-Reward, ArmoRM-Llama3-8B-v0.1, and GRM-llama3-8B-sftreg) exhibit statistically significant group unfairness. We also observed that top-performing reward models (w.r.t. canonical performance metrics) tend to demonstrate better group fairness.

Title: Training Domain Draft Models for Speculative Decoding: Best Practices and Insights

Authors: Fenglu Hong, Ravi Raju, Jonathan Lingjie Li, Bo Li, Urmish Thakker, Avinash Ravichandran, Swayambhoo Jain, Changran Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07807
Pdf URL: https://arxiv.org/pdf/2503.07807
Copy Paste: [[2503.07807]] Training Domain Draft Models for Speculative Decoding: Best Practices and Insights(https://arxiv.org/abs/2503.07807)
Keywords: large language model
Abstract: Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.

Title: AgriField3D: A Curated 3D Point Cloud and Procedural Model Dataset of Field-Grown Maize from a Diversity Panel

Authors: Elvis Kimara, Mozhgan Hadadi, Jackson Godbersen, Aditya Balu, Talukder Jubery, Yawei Li, Adarsh Krishnamurthy, Patrick S. Schnable, Baskar Ganapathysubramanian
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07813
Pdf URL: https://arxiv.org/pdf/2503.07813
Copy Paste: [[2503.07813]] AgriField3D: A Curated 3D Point Cloud and Procedural Model Dataset of Field-Grown Maize from a Diversity Panel(https://arxiv.org/abs/2503.07813)
Keywords: segmentation
Abstract: The application of artificial intelligence (AI) in three-dimensional (3D) agricultural research, particularly for maize, has been limited by the scarcity of large-scale, diverse datasets. While 2D image datasets are abundant, they fail to capture essential structural details such as leaf architecture, plant volume, and spatial arrangements that 3D data provide. To address this limitation, we present AgriField3D (this https URL), a curated dataset of 3D point clouds of field-grown maize plants from a diverse genetic panel, designed to be AI-ready for advancing agricultural research. Our dataset comprises over 1,000 high-quality point clouds collected using a Terrestrial Laser Scanner, complemented by procedural models that provide structured, parametric representations of maize plants. These procedural models, generated using Non-Uniform Rational B-Splines (NURBS) and optimized via a two-step process combining Particle Swarm Optimization (PSO) and differentiable programming, enable precise, scalable reconstructions of leaf surfaces and plant architectures. To enhance usability, we performed graph-based segmentation to isolate individual leaves and stalks, ensuring consistent labeling across all samples. We also conducted rigorous manual quality control on all datasets, correcting errors in segmentation, ensuring accurate leaf ordering, and validating metadata annotations. The dataset further includes metadata detailing plant morphology and quality, alongside multi-resolution subsampled versions (100k, 50k, 10k points) optimized for various computational needs. By integrating point cloud data of field grown plants with high-fidelity procedural models and ensuring meticulous manual validation, AgriField3D provides a comprehensive foundation for AI-driven phenotyping, plant structural analysis, and 3D applications in agricultural research.

Title: Group Fairness in Multi-Task Reinforcement Learning

Authors: Kefan Song, Runnan Jiang, Rohan Chandra, Shangtong Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07817
Pdf URL: https://arxiv.org/pdf/2503.07817
Copy Paste: [[2503.07817]] Group Fairness in Multi-Task Reinforcement Learning(https://arxiv.org/abs/2503.07817)
Keywords: fair
Abstract: This paper addresses a critical societal consideration in the application of Reinforcement Learning (RL): ensuring equitable outcomes across different demographic groups in multi-task settings. While previous work has explored fairness in single-task RL, many real-world applications are multi-task in nature and require policies to maintain fairness across all tasks. We introduce a novel formulation of multi-task group fairness in RL and propose a constrained optimization algorithm that explicitly enforces fairness constraints across multiple tasks simultaneously. We have shown that our proposed algorithm does not violate fairness constraints with high probability and with sublinear regret in the finite-horizon episodic setting. Through experiments in RiverSwim and MuJoCo environments, we demonstrate that our approach better ensures group fairness across multiple tasks compared to previous methods that lack explicit multi-task fairness constraints in both the finite-horizon setting and the infinite-horizon setting. Our results show that the proposed algorithm achieves smaller fairness gaps while maintaining comparable returns across different demographic groups and tasks, suggesting its potential for addressing fairness concerns in real-world multi-task RL applications.

Title: Strengthening the Internal Adversarial Robustness in Lifted Neural Networks

Authors: Christopher Zach
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.07818
Pdf URL: https://arxiv.org/pdf/2503.07818
Copy Paste: [[2503.07818]] Strengthening the Internal Adversarial Robustness in Lifted Neural Networks(https://arxiv.org/abs/2503.07818)
Keywords: robust
Abstract: Lifted neural networks (i.e. neural architectures explicitly optimizing over respective network potentials to determine the neural activities) can be combined with a type of adversarial training to gain robustness for internal as well as input layers, in addition to improved generalization performance. In this work we first investigate how adversarial robustness in this framework can be further strengthened by solely modifying the training loss. In a second step we fix some remaining limitations and arrive at a novel training loss for lifted neural networks, that combines targeted and untargeted adversarial perturbations.

Title: Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation

Authors: Fan Yin, Zifeng Wang, I-Hung Hsu, Jun Yan, Ke Jiang, Yanfei Chen, Jindong Gu, Long T. Le, Kai-Wei Chang, Chen-Yu Lee, Hamid Palangi, Tomas Pfister
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07826
Pdf URL: https://arxiv.org/pdf/2503.07826
Copy Paste: [[2503.07826]] Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation(https://arxiv.org/abs/2503.07826)
Keywords: large language model
Abstract: Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large language model agents in multi-turn conversations with humans. The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls. We model the complicated function interactions in multi-turn cases with graph and design novel node operations to build reliable signature paths. Motivated by context distillation, when guiding the generation of positive and negative trajectories using a teacher model, we provide reference function call sequences as positive hints in context and contrastive, incorrect function calls as negative hints. Experiments show that training with the positive trajectories with supervised fine-tuning and preference optimization against negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery, surpassing the performance of the teacher model Gemini-1.5-pro-002 by a large margin in function calling.

Title: Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan

Authors: Matthias Schöffel, Marinus Wiedner, Esteban Garces Arias, Paula Ruppert, Christian Heumann, Matthias Aßenmacher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07827
Pdf URL: https://arxiv.org/pdf/2503.07827
Copy Paste: [[2503.07827]] Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan(https://arxiv.org/abs/2503.07827)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora-hagiographical and medical texts-we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

Title: Fixing the RANSAC Stopping Criterion

Authors: Johannes Schönberger, Viktor Larsson, Marc Pollefeys
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07829
Pdf URL: https://arxiv.org/pdf/2503.07829
Copy Paste: [[2503.07829]] Fixing the RANSAC Stopping Criterion(https://arxiv.org/abs/2503.07829)
Keywords: robust
Abstract: For several decades, RANSAC has been one of the most commonly used robust estimation algorithms for many problems in computer vision and related fields. The main contribution of this paper lies in addressing a long-standing error baked into virtually any system building upon the RANSAC algorithm. Since its inception in 1981 by Fischler and Bolles, many variants of RANSAC have been proposed on top of the same original idea relying on the fact that random sampling has a high likelihood of generating a good hypothesis from minimal subsets of measurements. An approximation to the sampling probability was originally derived by the paper in 1981 in support of adaptively stopping RANSAC and is, as such, used in the vast majority of today's RANSAC variants and implementations. The impact of this approximation has since not been questioned or thoroughly studied by any of the later works. As we theoretically derive and practically demonstrate in this paper, the approximation leads to severe undersampling and thus failure to find good models. The discrepancy is especially pronounced in challenging scenarios with few inliers and high model complexity. An implementation of computing the exact probability is surprisingly simple yet highly effective and has potentially drastic impact across a large range of computer vision systems.

Title: HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations

Authors: Samir Abdaljalil, Hasan Kurban, Erchin Serpedin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07833
Pdf URL: https://arxiv.org/pdf/2503.07833
Copy Paste: [[2503.07833]] HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations(https://arxiv.org/abs/2503.07833)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly used in various contexts, yet remain prone to generating non-factual content, commonly referred to as "hallucinations". The literature categorizes hallucinations into several types, including entity-level, relation-level, and sentence-level hallucinations. However, existing hallucination datasets often fail to capture fine-grained hallucinations in multilingual settings. In this work, we introduce HalluVerse25, a multilingual LLM hallucination dataset that categorizes fine-grained hallucinations in English, Arabic, and Turkish. Our dataset construction pipeline uses an LLM to inject hallucinations into factual biographical sentences, followed by a rigorous human annotation process to ensure data quality. We evaluate several LLMs on HalluVerse25, providing valuable insights into how proprietary models perform in detecting LLM-generated hallucinations across different contexts.

Title: TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces

Authors: Guillaume Quétant, Pavlo Molchanov, Slava Voloshynovskiy
Subjects: cs.LG, cs.CV, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2503.07851
Pdf URL: https://arxiv.org/pdf/2503.07851
Copy Paste: [[2503.07851]] TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces(https://arxiv.org/abs/2503.07851)
Keywords: transformer
Abstract: We present a semi-supervised fine-tuning framework for foundation models that utilises mutual information decomposition to address the challenges of training for a limited amount of labelled data. Our approach derives two distinct lower bounds: i) for the downstream task space, such as classification, optimised using conditional and marginal cross-entropy alongside Kullback-Leibler divergence, and ii) for the latent space representation, regularised and aligned using a contrastive-like decomposition. This fine-tuning strategy retains the pre-trained structure of the foundation model, modifying only a specialised projector module comprising a small transformer and a token aggregation technique. Experiments on several datasets demonstrate significant improvements in classification tasks under extremely low-labelled conditions by effectively leveraging unlabelled data.

Title: Learning and Evaluating Hierarchical Feature Representations

Authors: Depanshu Sani, Saket Anand
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07853
Pdf URL: https://arxiv.org/pdf/2503.07853
Copy Paste: [[2503.07853]] Learning and Evaluating Hierarchical Feature Representations(https://arxiv.org/abs/2503.07853)
Keywords: transformer
Abstract: Hierarchy-aware representations ensure that the semantically closer classes are mapped closer in the feature space, thereby reducing the severity of mistakes while enabling consistent coarse-level class predictions. Towards this end, we propose a novel framework, Hierarchical Composition of Orthogonal Subspaces (Hier-COS), which learns to map deep feature embeddings into a vector space that is, by design, consistent with the structure of a given taxonomy tree. Our approach augments neural network backbones with a simple transformation module that maps learned discriminative features to subspaces defined using a fixed orthogonal frame. This construction naturally improves the severity of mistakes and promotes hierarchical consistency. Furthermore, we highlight the fundamental limitations of existing hierarchical evaluation metrics popularly used by the vision community and introduce a preference-based metric, Hierarchically Ordered Preference Score (HOPS), to overcome these limitations. We benchmark our method on multiple large and challenging datasets having deep label hierarchies (ranging from 3 - 12 levels) and compare with several baselines and SOTA. Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art hierarchical performance across all the datasets while simultaneously beating top-1 accuracy in all but one case. We also demonstrate the performance of a Vision Transformer (ViT) backbone and show that learning a transformation module alone can map the learned features from a pre-trained ViT to Hier-COS and yield substantial performance benefits.

Title: Blind Video Super-Resolution based on Implicit Kernels

Authors: Qiang Zhu, Yuxuan Jiang, Shuyuan Zhu, Fan Zhang, David Bull, Bing Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07856
Pdf URL: https://arxiv.org/pdf/2503.07856
Copy Paste: [[2503.07856]] Blind Video Super-Resolution based on Implicit Kernels(https://arxiv.org/abs/2503.07856)
Keywords: transformer
Abstract: Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at this https URL.

Title: Efficient Resource Management for Secure and Low-Latency O-RAN Communication

Authors: Zaineh Abughazzah, Emna Baccour, Ahmed Refaey, Amr Mohamed, Mounir Hamdi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.07857
Pdf URL: https://arxiv.org/pdf/2503.07857
Copy Paste: [[2503.07857]] Efficient Resource Management for Secure and Low-Latency O-RAN Communication(https://arxiv.org/abs/2503.07857)
Keywords: secure, security
Abstract: Open Radio Access Networks (O-RAN) are transforming telecommunications by shifting from centralized to distributed architectures, promoting flexibility, interoperability, and innovation through open interfaces and multi-vendor environments. However, O-RAN's reliance on cloud-based architecture and enhanced observability introduces significant security and resource management challenges. Efficient resource management is crucial for secure and reliable communication in O-RAN, within the resource-constrained environment and heterogeneity of requirements, where multiple User Equipment (UE) and O-RAN Radio Units (O-RUs) coexist. This paper develops a framework to manage these aspects, ensuring each O-RU is associated with UEs based on their communication channel qualities and computational resources, and selecting appropriate encryption algorithms to safeguard data confidentiality, integrity, and authentication. A Multi-objective Optimization Problem (MOP) is formulated to minimize latency and maximize security within resource constraints. Different approaches are proposed to relax the complexity of the problem and achieve near-optimal performance, facilitating trade-offs between latency, security, and solution complexity. Simulation results demonstrate that the proposed approaches are close enough to the optimal solution, proving that our approach is both effective and efficient.

Title: Right Reward Right Time for Federated Learning

Authors: Thanh Linh Nguyen, Dinh Thai Hoang, Diep N. Nguyen, Quoc-Viet Pham
Subjects: cs.LG, cs.AI, cs.DC, cs.GT
Abstract URL: https://arxiv.org/abs/2503.07869
Pdf URL: https://arxiv.org/pdf/2503.07869
Copy Paste: [[2503.07869]] Right Reward Right Time for Federated Learning(https://arxiv.org/abs/2503.07869)
Keywords: privacy, federate
Abstract: Critical learning periods (CLPs) in federated learning (FL) refer to early stages during which low-quality contributions (e.g., sparse training data availability) can permanently impair the learning performance of the global model owned by the model owner (i.e., the cloud server). However, strategies to motivate clients with high-quality contributions to join the FL training process and share trained model updates during CLPs remain underexplored. Additionally, existing incentive mechanisms in FL treat all training periods equally, which consequently fails to motivate clients to participate early. Compounding this challenge is the cloud's limited knowledge of client training capabilities due to privacy regulations, leading to information asymmetry. Therefore, in this article, we propose a time-aware incentive mechanism, called Right Reward Right Time (R3T), to encourage client involvement, especially during CLPs, to maximize the utility of the cloud in FL. Specifically, the cloud utility function captures the trade-off between the achieved model performance and payments allocated for clients' contributions, while accounting for clients' time and system capabilities, efforts, joining time, and rewards. Then, we analytically derive the optimal contract for the cloud and devise a CLP-aware mechanism to incentivize early participation and efforts while maximizing cloud utility, even under information asymmetry. By providing the right reward at the right time, our approach can attract the highest-quality contributions during CLPs. Simulation and proof-of-concept studies show that R3T increases cloud utility and is more economically effective than benchmarks. Notably, our proof-of-concept results show up to a 47.6% reduction in the total number of clients and up to a 300% improvement in convergence time while reaching competitive test accuracies compared with incentive mechanism benchmarks.

Title: MapQA: Open-domain Geospatial Question Answering on Map Data

Authors: Zekun Li, Malcolm Grossman, Eric (Ehsan)Qasemi, Mihir Kulkarni, Muhao Chen, Yao-Yi Chiang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2503.07871
Pdf URL: https://arxiv.org/pdf/2503.07871
Copy Paste: [[2503.07871]] MapQA: Open-domain Geospatial Question Answering on Map Data(https://arxiv.org/abs/2503.07871)
Keywords: large language model
Abstract: Geospatial question answering (QA) is a fundamental task in navigation and point of interest (POI) searches. While existing geospatial QA datasets exist, they are limited in both scale and diversity, often relying solely on textual descriptions of geo-entities without considering their geometries. A major challenge in scaling geospatial QA datasets for reasoning lies in the complexity of geospatial relationships, which require integrating spatial structures, topological dependencies, and multi-hop reasoning capabilities that most text-based QA datasets lack. To address these limitations, we introduce MapQA, a novel dataset that not only provides question-answer pairs but also includes the geometries of geo-entities referenced in the questions. MapQA is constructed using SQL query templates to extract question-answer pairs from OpenStreetMap (OSM) for two study regions: Southern California and Illinois. It consists of 3,154 QA pairs spanning nine question types that require geospatial reasoning, such as neighborhood inference and geo-entity type identification. Compared to existing datasets, MapQA expands both the number and diversity of geospatial question types. We explore two approaches to tackle this challenge: (1) a retrieval-based language model that ranks candidate geo-entities by embedding similarity, and (2) a large language model (LLM) that generates SQL queries from natural language questions and geo-entity attributes, which are then executed against an OSM database. Our findings indicate that retrieval-based methods effectively capture concepts like closeness and direction but struggle with questions that require explicit computations (e.g., distance calculations). LLMs (e.g., GPT and Gemini) excel at generating SQL queries for one-hop reasoning but face challenges with multi-hop reasoning, highlighting a key bottleneck in advancing geospatial QA systems.

Title: Topology-Preserving Loss for Accurate and Anatomically Consistent Cardiac Mesh Reconstruction

Authors: Chenyu Zhang, Yihao Luo, Yinzhe Wu, Choon Hwai Yap, Guang Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07874
Pdf URL: https://arxiv.org/pdf/2503.07874
Copy Paste: [[2503.07874]] Topology-Preserving Loss for Accurate and Anatomically Consistent Cardiac Mesh Reconstruction(https://arxiv.org/abs/2503.07874)
Keywords: segmentation
Abstract: Accurate cardiac mesh reconstruction from volumetric data is essential for personalized cardiac modeling and clinical analysis. However, existing deformation-based approaches are prone to topological inconsistencies, particularly membrane penetration, which undermines the anatomical plausibility of the reconstructed mesh. To address this issue, we introduce Topology-Preserving Mesh Loss (TPM Loss), a novel loss function that explicitly enforces topological constraints during mesh deformation. By identifying topology-violating points, TPM Loss ensures spatially consistent reconstructions. Extensive experiments on CT and MRI datasets show that TPM Loss reduces topology violations by up to 93.1% while maintaining high segmentation accuracy (DSC: 89.1%-92.9%) and improving mesh fidelity (Chamfer Distance reduction up to 0.26 mm). These results demonstrate that TPM Loss effectively prevents membrane penetration and significantly improves cardiac mesh quality, enabling more accurate and anatomically consistent cardiac reconstructions.

Title: Measuring directional bias amplification in image captions using predictability

Authors: Rahul Nair, Bhanu Tokas, Hannah Kerner
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07878
Pdf URL: https://arxiv.org/pdf/2503.07878
Copy Paste: [[2503.07878]] Measuring directional bias amplification in image captions using predictability(https://arxiv.org/abs/2503.07878)
Keywords: attack
Abstract: When we train models on biased ML datasets, they not only learn these biases but can inflate them at test time - a phenomenon called bias amplification. To measure bias amplification in ML datasets, many co-occurrence-based metrics have been proposed. Co-occurrence-based metrics are effective in measuring bias amplification in simple problems like image classification. However, these metrics are ineffective for complex problems like image captioning as they cannot capture the semantics of a caption. To measure bias amplification in captions, prior work introduced a predictability-based metric called Leakage in Captioning (LIC). While LIC captures the semantics and context of captions, it has limitations. LIC cannot identify the direction in which bias is amplified, poorly estimates dataset bias due to a weak vocabulary substitution strategy, and is highly sensitive to attacker models (a hyperparameter in predictability-based metrics). To overcome these issues, we propose Directional Predictability Amplification in Captioning (DPAC). DPAC measures directional bias amplification in captions, provides a better estimate of dataset bias using an improved substitution strategy, and is less sensitive to attacker models. Our experiments on the COCO captioning dataset show how DPAC is the most reliable metric to measure bias amplification in captions.

Title: Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Authors: Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07879
Pdf URL: https://arxiv.org/pdf/2503.07879
Copy Paste: [[2503.07879]] Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality(https://arxiv.org/abs/2503.07879)
Keywords: large language model
Abstract: Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.

Title: ReLATE: Resilient Learner Selection for Multivariate Time-Series Classification Against Adversarial Attacks

Authors: Cagla Ipek Kocal, Onat Gungor, Aaron Tartz, Tajana Rosing, Baris Aksanli
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2503.07882
Pdf URL: https://arxiv.org/pdf/2503.07882
Copy Paste: [[2503.07882]] ReLATE: Resilient Learner Selection for Multivariate Time-Series Classification Against Adversarial Attacks(https://arxiv.org/abs/2503.07882)
Keywords: attack, robust
Abstract: Minimizing computational overhead in time-series classification, particularly in deep learning models, presents a significant challenge. This challenge is further compounded by adversarial attacks, emphasizing the need for resilient methods that ensure robust performance and efficient model selection. We introduce ReLATE, a framework that identifies robust learners based on dataset similarity, reduces computational overhead, and enhances resilience. ReLATE maintains multiple deep learning models in well-known adversarial attack scenarios, capturing model performance. ReLATE identifies the most analogous dataset to a given target using a similarity metric, then applies the optimal model from the most similar dataset. ReLATE reduces computational overhead by an average of 81.2%, enhancing adversarial resilience and streamlining robust model selection, all without sacrificing performance, within 4.2% of Oracle.

Title: Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

Authors: Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, Andrea Nascetti
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07890
Pdf URL: https://arxiv.org/pdf/2503.07890
Copy Paste: [[2503.07890]] Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?(https://arxiv.org/abs/2503.07890)
Keywords: diffusion, generative, segmentation
Abstract: Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models--which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation--remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. Code will be released.

Title: Gemini Embedding: Generalizable Embeddings from Gemini

Authors: Jinhyuk Lee, Feiyang Chen, Sahil Dua, Daniel Cer, Madhuri Shanbhogue, Iftekhar Naim, Gustavo Hernández Ábrego, Zhe Li, Kaifeng Chen, Henrique Schechter Vera, Xiaoqi Ren, Shanfeng Zhang, Daniel Salz, Michael Boratko, Jay Han, Blair Chen, Shuo Huang, Vikram Rao, Paul Suganthan, Feng Han, Andreas Doumanoglou, Nithi Gupta, Fedor Moiseev, Cathy Yip, Aashi Jain, Simon Baumgartner, Shahrokh Shahi, Frank Palma Gomez, Sandeep Mariserla, Min Choi, Parashar Shah, Sonam Goenka, Ke Chen, Ye Xia, Koert Chen, Sai Meher Karthik Duddu, Yichang Chen, Trevor Walker, Wenlei Zhou, Rakesh Ghiya, Zach Gleicher, Karan Gill, Zhe Dong, Mojtaba Seyedhosseini, Yunhsuan Sung, Raphael Hoffmann, Tom Duerig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07891
Pdf URL: https://arxiv.org/pdf/2503.07891
Copy Paste: [[2503.07891]] Gemini Embedding: Generalizable Embeddings from Gemini(https://arxiv.org/abs/2503.07891)
Keywords: large language model
Abstract: In this report, we introduce Gemini Embedding, a state-of-the-art embedding model leveraging the power of Gemini, Google's most capable large language model. Capitalizing on Gemini's inherent multilingual and code understanding capabilities, Gemini Embedding produces highly generalizable embeddings for text spanning numerous languages and textual modalities. The representations generated by Gemini Embedding can be precomputed and applied to a variety of downstream tasks including classification, similarity, clustering, ranking, and retrieval. Evaluated on the Massive Multilingual Text Embedding Benchmark (MMTEB), which includes over one hundred tasks across 250+ languages, Gemini Embedding substantially outperforms prior state-of-the-art models, demonstrating considerable improvements in embedding quality. Achieving state-of-the-art performance across MMTEB's multilingual, English, and code benchmarks, our unified model demonstrates strong capabilities across a broad selection of tasks and surpasses specialized domain-specific models.

Title: Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?

Authors: Payel Das, Ching-Yun Ko, Sihui Dai, Georgios Kollias, Subhajit Chaudhury, Aurelie Lozano
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.07903
Pdf URL: https://arxiv.org/pdf/2503.07903
Copy Paste: [[2503.07903]] Can Memory-Augmented Language Models Generalize on Reasoning-in-a-Haystack Tasks?(https://arxiv.org/abs/2503.07903)
Keywords: transformer, large language model
Abstract: Large language models often expose their brittleness in reasoning tasks, especially while executing long chains of reasoning over context. We propose MemReasoner, a new and simple memory-augmented LLM architecture, in which the memory learns the relative order of facts in context, and enables hopping over them, while the decoder selectively attends to the memory. MemReasoner is trained end-to-end, with optional supporting fact supervision of varying degrees. We train MemReasoner, along with existing memory-augmented transformer models and a state-space model, on two distinct synthetic multi-hop reasoning tasks. Experiments performed under a variety of challenging scenarios, including the presence of long distractor text or target answer changes in test set, show strong generalization of MemReasoner on both single- and two-hop tasks. This generalization of MemReasoner is achieved using none-to-weak supporting fact supervision (using none and 1\% of supporting facts for one- and two-hop tasks, respectively). In contrast, baseline models overall struggle to generalize and benefit far less from using full supporting fact supervision. The results highlight the importance of explicit memory mechanisms, combined with additional weak supervision, for improving large language model's context processing ability toward reasoning tasks.

Title: Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning

Authors: Qinghao Ye, Xianhan Zeng, Fu Li, Chunyuan Li, Haoqi Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07906
Pdf URL: https://arxiv.org/pdf/2503.07906
Copy Paste: [[2503.07906]] Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning(https://arxiv.org/abs/2503.07906)
Keywords: robust
Abstract: Image captioning has long been a pivotal task in visual understanding, with recent advancements in vision-language models (VLMs) significantly enhancing the ability to generate detailed image captions. However, the evaluation of detailed image captioning remains underexplored due to outdated evaluation metrics and coarse annotations. In this paper, we introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units, termed primitive information units, and assessing them individually. Our evaluation shows that DCScore aligns more closely with human judgment than other rule-based or model-based metrics. Concurrently, DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models. Additionally, we present an automatic fine-grained feedback collection method, FeedQuill, for preference optimization based on our advanced metric, showing robust generalization capabilities across auto-generated preference data. Extensive experiments on multiple VLMs demonstrate that our method not only significantly reduces hallucinations but also enhances performance across various benchmarks, achieving superior detail captioning performance while surpassing GPT-4o.

Title: FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction

Authors: Dennis Rotondi, Fabio Scaparro, Hermann Blum, Kai O. Arras
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.07909
Pdf URL: https://arxiv.org/pdf/2503.07909
Copy Paste: [[2503.07909]] FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction(https://arxiv.org/abs/2503.07909)
Keywords: segmentation
Abstract: The concept of 3D scene graphs is increasingly recognized as a powerful semantic and hierarchical representation of the environment. Current approaches often address this at a coarse, object-level resolution. In contrast, our goal is to develop a representation that enables robots to directly interact with their environment by identifying both the location of functional interactive elements and how these can be used. To achieve this, we focus on detecting and storing objects at a finer resolution, focusing on affordance-relevant parts. The primary challenge lies in the scarcity of data that extends beyond instance-level detection and the inherent difficulty of capturing detailed object features using robotic sensors. We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline. Through our experiments, we demonstrate that our approach achieves functional element segmentation comparable to state-of-the-art 3D models and that our augmentation enables task-driven affordance grounding with higher accuracy than the current solutions.

Title: Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Authors: Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M.Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Lester James V. Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, Peerat Limkonchotiwat
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07920
Pdf URL: https://arxiv.org/pdf/2503.07920
Copy Paste: [[2503.07920]] Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia(https://arxiv.org/abs/2503.07920)
Keywords: generative
Abstract: Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.

Title: From Slices to Sequences: Autoregressive Tracking Transformer for Cohesive and Consistent 3D Lymph Node Detection in CT Scans

Authors: Qinji Yu, Yirui Wang, Ke Yan, Dandan Zheng, Dashan Ai, Dazhou Guo, Zhanghexuan Ji, Yanzhou Su, Yun Bian, Na Shen, Xiaowei Ding, Le Lu, Xianghua Ye, Dakai Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07933
Pdf URL: https://arxiv.org/pdf/2503.07933
Copy Paste: [[2503.07933]] From Slices to Sequences: Autoregressive Tracking Transformer for Cohesive and Consistent 3D Lymph Node Detection in CT Scans(https://arxiv.org/abs/2503.07933)
Keywords: transformer
Abstract: Lymph node (LN) assessment is an essential task in the routine radiology workflow, providing valuable insights for cancer staging, treatment planning and beyond. Identifying scatteredly-distributed and low-contrast LNs in 3D CT scans is highly challenging, even for experienced clinicians. Previous lesion and LN detection methods demonstrate effectiveness of 2.5D approaches (i.e, using 2D network with multi-slice inputs), leveraging pretrained 2D model weights and showing improved accuracy as compared to separate 2D or 3D detectors. However, slice-based 2.5D detectors do not explicitly model inter-slice consistency for LN as a 3D object, requiring heuristic post-merging steps to generate final 3D LN instances, which can involve tuning a set of parameters for each dataset. In this work, we formulate 3D LN detection as a tracking task and propose LN-Tracker, a novel LN tracking transformer, for joint end-to-end detection and 3D instance association. Built upon DETR-based detector, LN-Tracker decouples transformer decoder's query into the track and detection groups, where the track query autoregressively follows previously tracked LN instances along the z-axis of a CT scan. We design a new transformer decoder with masked attention module to align track query's content to the context of current slice, meanwhile preserving detection query's high accuracy in current slice. An inter-slice similarity loss is introduced to encourage cohesive LN association between slices. Extensive evaluation on four lymph node datasets shows LN-Tracker's superior performance, with at least 2.7% gain in average sensitivity when compared to other top 3D/2.5D detectors. Further validation on public lung nodule and prostate tumor detection tasks confirms the generalizability of LN-Tracker as it achieves top performance on both tasks. Datasets will be released upon acceptance.

Title: CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement

Authors: Chenrui Ma, Rongchang Zhao, Xi Xiao, Hongyang Xie, Tianyang Wang, Xiao Wang, Hao Zhang, Yanning Shen
Subjects: cs.LG, cs.CV, stat.ME
Abstract URL: https://arxiv.org/abs/2503.07938
Pdf URL: https://arxiv.org/pdf/2503.07938
Copy Paste: [[2503.07938]] CAD-VAE: Leveraging Correlation-Aware Latents for Comprehensive Fair Disentanglement(https://arxiv.org/abs/2503.07938)
Keywords: fair, generative
Abstract: While deep generative models have significantly advanced representation learning, they may inherit or amplify biases and fairness issues by encoding sensitive attributes alongside predictive features. Enforcing strict independence in disentanglement is often unrealistic when target and sensitive factors are naturally correlated. To address this challenge, we propose CAD-VAE (Correlation-Aware Disentangled VAE), which introduces a correlated latent code to capture the shared information between target and sensitive attributes. Given this correlated latent, our method effectively separates overlapping factors without extra domain knowledge by directly minimizing the conditional mutual information between target and sensitive codes. A relevance-driven optimization strategy refines the correlated code by efficiently capturing essential correlated features and eliminating redundancy. Extensive experiments on benchmark datasets demonstrate that CAD-VAE produces fairer representations, realistic counterfactuals, and improved fairness-aware image editing.

Title: STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision

Authors: Hin Wai Lui, Jeffrey L. Krichmar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07939
Pdf URL: https://arxiv.org/pdf/2503.07939
Copy Paste: [[2503.07939]] STRMs: Spatial Temporal Reasoning Models for Vision-Based Localization Rivaling GPS Precision(https://arxiv.org/abs/2503.07939)
Keywords: transformer, generative
Abstract: This paper explores vision-based localization through a biologically-inspired approach that mirrors how humans and animals link views or perspectives when navigating their world. We introduce two sequential generative models, VAE-RNN and VAE-Transformer, which transform first-person perspective (FPP) observations into global map perspective (GMP) representations and precise geographical coordinates. Unlike retrieval-based methods, our approach frames localization as a generative task, learning direct mappings between perspectives without relying on dense satellite image databases. We evaluate these models across two real-world environments: a university campus navigated by a Jackal robot and an urban downtown area navigated by a Tesla sedan. The VAE-Transformer achieves impressive precision, with median deviations of 2.29m (1.37% of environment size) and 4.45m (0.35% of environment size) respectively, outperforming both VAE-RNN and prior cross-view geo-localization approaches. Our comprehensive Localization Performance Characteristics (LPC) analysis demonstrates superior performance with the VAE-Transformer achieving an AUC of 0.777 compared to 0.295 for VIGOR 200 and 0.225 for TransGeo, establishing a new state-of-the-art in vision-based localization. In some scenarios, our vision-based system rivals commercial smartphone GPS accuracy (AUC of 0.797) while requiring 5x less GPU memory and delivering 3x faster inference than existing methods in cross-view geo-localization. These results demonstrate that models inspired by biological spatial navigation can effectively memorize complex, dynamic environments and provide precise localization with minimal computational resources.

Title: BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

Authors: Minkyun Seo, Hyungtae Lim, Kanghee Lee, Luca Carlone, Jaesik Park
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2503.07940
Pdf URL: https://arxiv.org/pdf/2503.07940
Copy Paste: [[2503.07940]] BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes(https://arxiv.org/abs/2503.07940)
Keywords: robust
Abstract: Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code is available at this https URL.

Title: Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach

Authors: Taoxu Zhao, Meisi Li, Kehao Chen, Liye Wang, Xucheng Zhou, Kunal Chaturvedi, Mukesh Prasad, Ali Anaissi, Ali Braytee
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.07943
Pdf URL: https://arxiv.org/pdf/2503.07943
Copy Paste: [[2503.07943]] Enhancing Sentiment Analysis through Multimodal Fusion: A BERT-DINOv2 Approach(https://arxiv.org/abs/2503.07943)
Keywords: extraction, transformer
Abstract: Multimodal sentiment analysis enhances conventional sentiment analysis, which traditionally relies solely on text, by incorporating information from different modalities such as images, text, and audio. This paper proposes a novel multimodal sentiment analysis architecture that integrates text and image data to provide a more comprehensive understanding of sentiments. For text feature extraction, we utilize BERT, a natural language processing model. For image feature extraction, we employ DINOv2, a vision-transformer-based model. The textual and visual latent features are integrated using proposed fusion techniques, namely the Basic Fusion Model, Self Attention Fusion Model, and Dual Attention Fusion Model. Experiments on three datasets, Memotion 7k dataset, MVSA single dataset, and MVSA multi dataset, demonstrate the viability and practicality of the proposed multimodal architecture.

Title: Text-RGBT Person Retrieval: Multilevel Global-Local Cross-Modal Alignment and A High-quality Benchmark

Authors: Yifei Deng, Zhengyu Chen, Ziheng Xu, Chenglong Li, Jin Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07950
Pdf URL: https://arxiv.org/pdf/2503.07950
Copy Paste: [[2503.07950]] Text-RGBT Person Retrieval: Multilevel Global-Local Cross-Modal Alignment and A High-quality Benchmark(https://arxiv.org/abs/2503.07950)
Keywords: robust
Abstract: The performance of traditional text-image person retrieval task is easily affected by lighting variations due to imaging limitations of visible spectrum sensors. In this work, we design a novel task called text-RGBT person retrieval that integrates complementary benefits from thermal and visible modalities for robust person retrieval in challenging environments. Aligning text and multi-modal visual representations is the key issue in text-RGBT person retrieval, but the heterogeneity between visible and thermal modalities may interfere with the alignment of visual and text modalities. To handle this problem, we propose a Multi-level Global-local cross-modal Alignment Network (MGANet), which sufficiently mines the relationships between modality-specific and modality-collaborative visual with the text, for text-RGBT person retrieval. To promote the research and development of this field, we create a high-quality text-RGBT person retrieval dataset, RGBT-PEDES. RGBT-PEDES contains 1,822 identities from different age groups and genders with 4,723 pairs of calibrated RGB and thermal images, and covers high-diverse scenes from both daytime and nighttime with a various of challenges such as occlusion, weak alignment and adverse lighting conditions. Additionally, we carefully annotate 7,987 fine-grained textual descriptions for all RGBT person image pairs. Extensive experiments on RGBT-PEDES demonstrate that our method outperforms existing text-image person retrieval methods. The code and dataset will be released upon the acceptance.

Title: EFPC: Towards Efficient and Flexible Prompt Compression

Authors: Yun-Hao Cao, Yangsong Wang, Shuzheng Hao, Zhenxing Li, Chengjun Zhan, Sichao Liu, Yi-Qi Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.07956
Pdf URL: https://arxiv.org/pdf/2503.07956
Copy Paste: [[2503.07956]] EFPC: Towards Efficient and Flexible Prompt Compression(https://arxiv.org/abs/2503.07956)
Keywords: large language model
Abstract: The emergence of large language models (LLMs) like GPT-4 has revolutionized natural language processing (NLP), enabling diverse, complex tasks. However, extensive token counts lead to high computational and financial burdens. To address this, we propose Efficient and Flexible Prompt Compression (EFPC), a novel method unifying task-aware and task-agnostic compression for a favorable accuracy-efficiency trade-off. EFPC uses GPT-4 to generate compressed prompts and integrates them with original prompts for training. During training and inference, we selectively prepend user instructions and compress prompts based on predicted probabilities. EFPC is highly data-efficient, achieving significant performance with minimal data. Compared to the state-of-the-art method LLMLingua-2, EFPC achieves a 4.8% relative improvement in F1-score with 1% additional data at a 4x compression rate, and an 11.4% gain with 10% additional data on the LongBench single-doc QA benchmark. EFPC's unified framework supports broad applicability and enhances performance across various models, tasks, and domains, offering a practical advancement in NLP.

Title: Pre-trained Models Succeed in Medical Imaging with Representation Similarity Degradation

Authors: Wenqiang Zu, Shenghao Xie, Hao Chen, Lei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07958
Pdf URL: https://arxiv.org/pdf/2503.07958
Copy Paste: [[2503.07958]] Pre-trained Models Succeed in Medical Imaging with Representation Similarity Degradation(https://arxiv.org/abs/2503.07958)
Keywords: robust
Abstract: This paper investigates the critical problem of representation similarity evolution during cross-domain transfer learning, with particular focus on understanding why pre-trained models maintain effectiveness when adapted to medical imaging tasks despite significant domain gaps. The study establishes a rigorous problem definition centered on quantifying and analyzing representation similarity trajectories throughout the fine-tuning process, while carefully delineating the scope to encompass both medical image analysis and broader cross-domain adaptation scenarios. Our empirical findings reveal three critical discoveries: the potential existence of high-performance models that preserve both task accuracy and representation similarity to their pre-trained origins; a robust linear correlation between layer-wise similarity metrics and representation quality indicators; and distinct adaptation patterns that differentiate supervised versus self-supervised pre-training paradigms. The proposed similarity space framework not only provides mechanistic insights into knowledge transfer dynamics but also raises fundamental questions about optimal utilization of pre-trained models. These results advance our understanding of neural network adaptation processes while offering practical implications for transfer learning strategies that extend beyond medical imaging applications. The code will be available once accepted.

Title: Recent Advances in Hypergraph Neural Networks

Authors: Murong Yang, Xin-Jian Xu
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2503.07959
Pdf URL: https://arxiv.org/pdf/2503.07959
Copy Paste: [[2503.07959]] Recent Advances in Hypergraph Neural Networks(https://arxiv.org/abs/2503.07959)
Keywords: generative
Abstract: The growing interest in hypergraph neural networks (HGNNs) is driven by their capacity to capture the complex relationships and patterns within hypergraph structured data across various domains, including computer vision, complex networks, and natural language processing. This paper comprehensively reviews recent advances in HGNNs and presents a taxonomy of mainstream models based on their architectures: hypergraph convolutional networks (HGCNs), hypergraph attention networks (HGATs), hypergraph autoencoders (HGAEs), hypergraph recurrent networks (HGRNs), and deep hypergraph generative models (DHGGMs). For each category, we delve into its practical applications, mathematical mechanisms, literature contributions, and open problems. Finally, we discuss some common challenges and promising research this http URL paper aspires to be a helpful resource that provides guidance for future research and applications of HGNNs.

Title: LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking

Authors: Yan Yan, Junyuan Liu, Bo-Wen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.07968
Pdf URL: https://arxiv.org/pdf/2503.07968
Copy Paste: [[2503.07968]] LabelCoRank: Revolutionizing Long Tail Multi-Label Classification with Co-Occurrence Reranking(https://arxiv.org/abs/2503.07968)
Keywords: robust
Abstract: Motivation: Despite recent advancements in semantic representation driven by pre-trained and large-scale language models, addressing long tail challenges in multi-label text classification remains a significant issue. Long tail challenges have persistently posed difficulties in accurately classifying less frequent labels. Current approaches often focus on improving text semantics while neglecting the crucial role of label relationships. Results: This paper introduces LabelCoRank, a novel approach inspired by ranking principles. LabelCoRank leverages label co-occurrence relationships to refine initial label classifications through a dual-stage reranking process. The first stage uses initial classification results to form a preliminary ranking. In the second stage, a label co-occurrence matrix is utilized to rerank the preliminary results, enhancing the accuracy and relevance of the final classifications. By integrating the reranked label representations as additional text features, LabelCoRank effectively mitigates long tail issues in multi-labeltext classification. Experimental evaluations on popular datasets including MAG-CS, PubMed, and AAPD demonstrate the effectiveness and robustness of LabelCoRank.

Title: Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection

Authors: Jiahao Xu, Zikai Zhang, Rui Hu
Subjects: cs.LG, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2503.07978
Pdf URL: https://arxiv.org/pdf/2503.07978
Copy Paste: [[2503.07978]] Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection(https://arxiv.org/abs/2503.07978)
Keywords: defense, attack, robust, federate
Abstract: The distributed nature of training makes Federated Learning (FL) vulnerable to backdoor attacks, where malicious model updates aim to compromise the global model's performance on specific tasks. Existing defense methods show limited efficacy as they overlook the inconsistency between benign and malicious model updates regarding both general and fine-grained directions. To fill this gap, we introduce AlignIns, a novel defense method designed to safeguard FL systems against backdoor attacks. AlignIns looks into the direction of each model update through a direction alignment inspection process. Specifically, it examines the alignment of model updates with the overall update direction and analyzes the distribution of the signs of their significant parameters, comparing them with the principle sign across all model updates. Model updates that exhibit an unusual degree of alignment are considered malicious and thus be filtered out. We provide the theoretical analysis of the robustness of AlignIns and its propagation error in FL. Our empirical results on both independent and identically distributed (IID) and non-IID datasets demonstrate that AlignIns achieves higher robustness compared to the state-of-the-art defense methods. The code is available at this https URL.

Title: Regulatory DNA sequence Design with Reinforcement Learning

Authors: Zhao Yang, Bing Su, Chuan Cao, Ji-Rong Wen
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2503.07981
Pdf URL: https://arxiv.org/pdf/2503.07981
Copy Paste: [[2503.07981]] Regulatory DNA sequence Design with Reinforcement Learning(https://arxiv.org/abs/2503.07981)
Keywords: generative
Abstract: Cis-regulatory elements (CREs), such as promoters and enhancers, are relatively short DNA sequences that directly regulate gene expression. The fitness of CREs, measured by their ability to modulate gene expression, highly depends on the nucleotide sequences, especially specific motifs known as transcription factor binding sites (TFBSs). Designing high-fitness CREs is crucial for therapeutic and bioengineering applications. Current CRE design methods are limited by two major drawbacks: (1) they typically rely on iterative optimization strategies that modify existing sequences and are prone to local optima, and (2) they lack the guidance of biological prior knowledge in sequence optimization. In this paper, we address these limitations by proposing a generative approach that leverages reinforcement learning (RL) to fine-tune a pre-trained autoregressive (AR) model. Our method incorporates data-driven biological priors by deriving computational inference-based rewards that simulate the addition of activator TFBSs and removal of repressor TFBSs, which are then integrated into the RL process. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types, demonstrating its ability to generate high-fitness CREs while maintaining sequence diversity. The code is available at this https URL.

Title: DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation

Authors: Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Kyungsu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.07982
Pdf URL: https://arxiv.org/pdf/2503.07982
Copy Paste: [[2503.07982]] DiffEGG: Diffusion-Driven Edge Generation as a Pixel-Annotation-Free Alternative for Instance Annotation(https://arxiv.org/abs/2503.07982)
Keywords: diffusion, segmentation
Abstract: Achieving precise panoptic segmentation relies on pixel-wise instance annotations, but obtaining such datasets is costly. Unsupervised instance segmentation (UIS) eliminates annotation requirements but struggles with adjacent instance merging and single-instance fragmentation, largely due to the limitations of DINO-based backbones which lack strong instance separation cues. Weakly-supervised panoptic segmentation (WPS) reduces annotation costs using sparse labels (e.g., points, boxes), yet these annotations remain expensive and introduce human bias and boundary errors. To address these challenges, we propose DiffEGG (Diffusion-Driven EdGe Generation), a fully annotation-free method that extracts instance-aware features from pretrained diffusion models to generate precise instance edge maps. Unlike DINO-based UIS methods, diffusion models inherently capture fine-grained, instance-aware features, enabling more precise boundary delineation. For WPS, DiffEGG eliminates annotation costs and human bias by operating without any form of manual supervision, addressing the key limitations of prior best methods. Additionally, we introduce RIP, a post-processing technique that fuses DiffEGG's edge maps with segmentation masks in a task-agnostic manner. RIP allows DiffEGG to be seamlessly integrated into various segmentation frameworks. When applied to UIS, DiffEGG and RIP achieve an average $+4.4\text{ AP}$ improvement over prior best UIS methods. When combined with weakly-supervised semantic segmentation (WSS), DiffEGG enables WPS without instance annotations, outperforming prior best point-supervised WPS methods by $+1.7\text{ PQ}$. These results demonstrate that DiffEGG's edge maps serve as a cost-effective, annotation-free alternative to instance annotations, significantly improving segmentation without human intervention. Code is available at this https URL.

Title: CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction

Authors: Zhiyuan Wu, Xibin Song, Senbo Wang, Weizhe Liu, Jiayu Yang, Ziang Cheng, Shenzhou Chen, Taizhang Shang, Weixuan Sun, Shan Luo, Pan Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08005
Pdf URL: https://arxiv.org/pdf/2503.08005
Copy Paste: [[2503.08005]] CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction(https://arxiv.org/abs/2503.08005)
Keywords: robust, diffusion
Abstract: 3D object reconstruction from single-view image is a fundamental task in computer vision with wide-ranging applications. Recent advancements in Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, challenges remain as 2D diffusion models often struggle to produce dense images with strong multi-view consistency, and LRMs tend to amplify these inconsistencies during the 3D reconstruction process. Addressing these issues is critical for achieving high-quality and efficient 3D reconstruction. In this paper, we present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view interpolation. To tackle the aforementioned challenges, we propose to integrate 2D diffusion-based view interpolation into the LRM pipeline to enhance the quality and consistency of the generated mesh. Specifically, our approach introduces a Dense View Interpolation (DVI) module, which synthesizes interpolated images between main views generated by the 2D diffusion model, effectively densifying the input views with better multi-view consistency. We also design a tilt camera pose trajectory to capture views with different elevations and perspectives. Subsequently, we employ a tri-plane-based mesh reconstruction strategy to extract robust tokens from these interpolated and original views, enabling the generation of high-quality 3D meshes with superior texture and geometry. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art approaches across various benchmarks, producing 3D content with enhanced texture fidelity and geometric accuracy.

Title: A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects

Authors: Fei Wang, Tingting Zhang, Bintong Zhao, Libao Xing, Tiantian Wang, Han Ding, Tony Xiao Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08008
Pdf URL: https://arxiv.org/pdf/2503.08008
Copy Paste: [[2503.08008]] A Survey on Wi-Fi Sensing Generalizability: Taxonomy, Techniques, Datasets, and Future Research Prospects(https://arxiv.org/abs/2503.08008)
Keywords: robust, large language model
Abstract: Wi-Fi sensing has emerged as a transformative technology that leverages ubiquitous wireless signals to enable a variety of applications ranging from activity and gesture recognition to indoor localization and health monitoring. However, the inherent dependency of Wi-Fi signals on environmental conditions introduces significant generalization challenges,variations in surroundings, human positions, and orientations often lead to inconsistent signal features, impeding robust action recognition. In this survey, we review over 200 studies on Wi-Fi sensing generalization, categorizing them along the entire sensing pipeline: device deployment, signal processing, feature learning, and model deployment. We systematically analyze state-of-the-art techniques, which are employed to mitigate the adverse effects of environmental variability. Moreover, we provide a comprehensive overview of open-source datasets such as Widar3.0, XRF55, and XRFv2, highlighting their unique characteristics and applicability for multimodal fusion and cross-modal tasks. Finally, we discuss emerging research directions, such as multimodal approaches and the integration of large language models,to inspire future advancements in this rapidly evolving field. Our survey aims to serve as a valuable resource for researchers, offering insights into current methodologies, available datasets, and promising avenues for further investigation.

Title: Exploring Bias in over 100 Text-to-Image Generative Models

Authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08012
Pdf URL: https://arxiv.org/pdf/2503.08012
Copy Paste: [[2503.08012]] Exploring Bias in over 100 Text-to-Image Generative Models(https://arxiv.org/abs/2503.08012)
Keywords: robust, generative
Abstract: We investigate bias trends in text-to-image generative models over time, focusing on the increasing availability of models through open platforms like Hugging Face. While these platforms democratize AI, they also facilitate the spread of inherently biased models, often shaped by task-specific fine-tuning. Ensuring ethical and transparent AI deployment requires robust evaluation frameworks and quantifiable bias metrics. To this end, we assess bias across three key dimensions: (i) distribution bias, (ii) generative hallucination, and (iii) generative miss-rate. Analyzing over 100 models, we reveal how bias patterns evolve over time and across generative tasks. Our findings indicate that artistic and style-transferred models exhibit significant bias, whereas foundation models, benefiting from broader training distributions, are becoming progressively less biased. By identifying these systemic trends, we contribute a large-scale evaluation corpus to inform bias research and mitigation strategies, fostering more responsible AI development. Keywords: Bias, Ethical AI, Text-to-Image, Generative Models, Open-Source Models

Title: GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals

Authors: Zhaoliang Chen, Cheng Ding, Saurabh Kataria, Runze Yan, Minxiao Wang, Randall Lee, Xiao Hu
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2503.08015
Pdf URL: https://arxiv.org/pdf/2503.08015
Copy Paste: [[2503.08015]] GPT-PPG: A GPT-based Foundation Model for Photoplethysmography Signals(https://arxiv.org/abs/2503.08015)
Keywords: transformer, generative
Abstract: This study introduces a novel application of a Generative Pre-trained Transformer (GPT) model tailored for photoplethysmography (PPG) signals, serving as a foundation model for various downstream tasks. Adapting the standard GPT architecture to suit the continuous characteristics of PPG signals, our approach demonstrates promising results. Our models are pre-trained on our extensive dataset that contains more than 200 million 30s PPG samples. We explored different supervised fine-tuning techniques to adapt our model to downstream tasks, resulting in performance comparable to or surpassing current state-of-the-art (SOTA) methods in tasks like atrial fibrillation detection. A standout feature of our GPT model is its inherent capability to perform generative tasks such as signal denoising effectively, without the need for further fine-tuning. This success is attributed to the generative nature of the GPT framework.

Title: Partial differential equation system for binarization of degraded document images

Authors: Youjin Liu, Yu Wang
Subjects: cs.CV, math.DS
Abstract URL: https://arxiv.org/abs/2503.08017
Pdf URL: https://arxiv.org/pdf/2503.08017
Copy Paste: [[2503.08017]] Partial differential equation system for binarization of degraded document images(https://arxiv.org/abs/2503.08017)
Keywords: diffusion
Abstract: In recent years, partial differential equation (PDE) systems have been successfully applied to the binarization of text images, achieving promising results. Inspired by the DH model and incorporating a novel image modeling approach, this study proposes a new weakly coupled PDE system for degraded text image binarization. In this system, the first equation is designed to estimate the background component, incorporating both diffusion and fidelity terms. The second equation estimates the foreground component and includes diffusion, fidelity, and binarization source terms. The final binarization result is obtained by applying a hard projection to the estimated foreground component. Experimental results on 86 degraded text images demonstrate that the proposed model exhibits significant advantages in handling degraded text images.

Title: Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models

Authors: Bozhi Luan, Wengang Zhou, Hao Feng, Zhe Wang, Xiaosong Li, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08019
Pdf URL: https://arxiv.org/pdf/2503.08019
Copy Paste: [[2503.08019]] Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models(https://arxiv.org/abs/2503.08019)
Keywords: robust
Abstract: As the computational needs of Large Vision-Language Models (LVLMs) increase, visual token pruning has proven effective in improving inference speed and memory efficiency. Traditional pruning methods in LVLMs predominantly focus on attention scores to determine token relevance, overlooking critical aspects such as spatial position and token similarity. To this end, we introduce AdaptPrune, a novel plug-and-play training-free pruning method that builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach. Our method is based on several observed phenomena in large models: the positional bias in the model's image attention and the redundancy of token information ignored by previous approaches. By integrating attention, spatial, and similarity information, our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions. Our method has been extensively tested across various LVLMs and benchmarks, confirming its robustness and adaptability. The results demonstrate that AdaptPrune consistently outperforms existing methods across various pruning ratios. Code is available at this https URL.

Title: In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Authors: Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long T. Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08026
Pdf URL: https://arxiv.org/pdf/2503.08026
Copy Paste: [[2503.08026]] In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents(https://arxiv.org/abs/2503.08026)
Keywords: large language model
Abstract: Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities-utterances, turns, and sessions-into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs' cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.

Title: Learning to Search Effective Example Sequences for In-Context Learning

Authors: Xiang Gao, Ankita Sinha, Kamalika Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08030
Pdf URL: https://arxiv.org/pdf/2503.08030
Copy Paste: [[2503.08030]] Learning to Search Effective Example Sequences for In-Context Learning(https://arxiv.org/abs/2503.08030)
Keywords: large language model
Abstract: Large language models (LLMs) demonstrate impressive few-shot learning capabilities, but their performance varies widely based on the sequence of in-context examples. Key factors influencing this include the sequence's length, composition, and arrangement, as well as its relation to the specific query. Existing methods often tackle these factors in isolation, overlooking their interdependencies. Moreover, the extensive search space for selecting optimal sequences complicates the development of a holistic approach. In this work, we introduce Beam Search-based Example Sequence Constructor (BESC), a novel method for learning to construct optimal example sequences. BESC addresses all key factors involved in sequence selection by considering them jointly during inference, while incrementally building the sequence. This design enables the use of beam search to significantly reduce the complexity of the search space. Experiments across various datasets and language models show notable improvements in performance.

Title: HOFAR: High-Order Augmentation of Flow Autoregressive Transformers

Authors: Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08032
Pdf URL: https://arxiv.org/pdf/2503.08032
Copy Paste: [[2503.08032]] HOFAR: High-Order Augmentation of Flow Autoregressive Transformers(https://arxiv.org/abs/2503.08032)
Keywords: transformer
Abstract: Flow Matching and Transformer architectures have demonstrated remarkable performance in image generation tasks, with recent work FlowAR [Ren et al., 2024] synergistically integrating both paradigms to advance synthesis fidelity. However, current FlowAR implementations remain constrained by first-order trajectory modeling during the generation process. This paper introduces a novel framework that systematically enhances flow autoregressive transformers through high-order supervision. We provide theoretical analysis and empirical evaluation showing that our High-Order FlowAR (HOFAR) demonstrates measurable improvements in generation quality compared to baseline models. The proposed approach advances the understanding of flow-based autoregressive modeling by introducing a systematic framework for analyzing trajectory dynamics through high-order expansion.

Title: Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations

Authors: Ishani Mondal, Jack W. Stokes, Sujay Kumar Jauhar, Longqi Yang, Mengting Wan, Xiaofeng Xu, Xia Song, Jennifer Neville
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08035
Pdf URL: https://arxiv.org/pdf/2503.08035
Copy Paste: [[2503.08035]] Group Preference Alignment: Customized LLM Response Generation from In-Situ Conversations(https://arxiv.org/abs/2503.08035)
Keywords: robust, extraction
Abstract: LLMs often fail to meet the specialized needs of distinct user groups due to their one-size-fits-all training paradigm \cite{lucy-etal-2024-one} and there is limited research on what personalization aspects each group expect. To address these limitations, we propose a group-aware personalization framework, Group Preference Alignment (GPA), that identifies context-specific variations in conversational preferences across user groups and then steers LLMs to address those preferences. Our approach consists of two steps: (1) Group-Aware Preference Extraction, where maximally divergent user-group preferences are extracted from real-world conversation logs and distilled into interpretable rubrics, and (2) Tailored Response Generation, which leverages these rubrics through two methods: a) Context-Tuned Inference (GAP-CT), that dynamically adjusts responses via context-dependent prompt instructions, and b) Rubric-Finetuning Inference (GPA-FT), which uses the rubrics to generate contrastive synthetic data for personalization of group-specific models via alignment. Experiments demonstrate that our framework significantly improves alignment of the output with respect to user preferences and outperforms baseline methods, while maintaining robust performance on standard benchmarks.

Title: Generalized Kullback-Leibler Divergence Loss

Authors: Jiequan Cui, Beier Zhu, Qingshan Xu, Zhuotao Tian, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08038
Pdf URL: https://arxiv.org/pdf/2503.08038
Copy Paste: [[2503.08038]] Generalized Kullback-Leibler Divergence Loss(https://arxiv.org/abs/2503.08038)
Keywords: robust
Abstract: In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of DKL loss, we have identified two areas for improvement. Firstly, we address the limitation of KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property along with a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Secondly, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training, and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public leaderboard -- RobustBench and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating the substantial practical merits. Our code is available at this https URL.

Title: Accurate INT8 Training Through Dynamic Block-Level Fallback

Authors: Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, Jianfei Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08040
Pdf URL: https://arxiv.org/pdf/2503.08040
Copy Paste: [[2503.08040]] Accurate INT8 Training Through Dynamic Block-Level Fallback(https://arxiv.org/abs/2503.08040)
Keywords: robust, transformer
Abstract: Transformer models have achieved remarkable success across various AI applications but face significant training costs. Low-bit training, such as INT8 training, can leverage computational units with higher throughput, and has already demonstrated its effectiveness on GPT2 models with block-level quantization. However, it struggles with modern Transformer variants incorporating GLU units. This is because those variants demonstrate complex distributions of activation outliers. To address the challenge, we propose Fallback Quantization, implementing mixed-precision GEMM that dynamically falls back 8-bit to 16-bit for activation blocks containing outliers. Experiments show that our approach is robustly competent in both fine-tuning and pretraining settings. Moreover, our method achieves a 1.57x end-to-end training speedup on RTX4090 GPUs.

Title: Structural and Statistical Texture Knowledge Distillation and Learning for Segmentation

Authors: Deyi Ji, Feng Zhao, Hongtao Lu, Feng Wu, Jieping Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08043
Pdf URL: https://arxiv.org/pdf/2503.08043
Copy Paste: [[2503.08043]] Structural and Statistical Texture Knowledge Distillation and Learning for Segmentation(https://arxiv.org/abs/2503.08043)
Keywords: segmentation
Abstract: Low-level texture feature/knowledge is also of vital importance for characterizing the local structural pattern and global statistical properties, such as boundary, smoothness, regularity, and color contrast, which may not be well addressed by high-level deep features. In this paper, we aim to re-emphasize the low-level texture information in deep networks for semantic segmentation and related knowledge distillation tasks. To this end, we take full advantage of both structural and statistical texture knowledge and propose a novel Structural and Statistical Texture Knowledge Distillation (SSTKD) framework for semantic segmentation. Specifically, Contourlet Decomposition Module (CDM) is introduced to decompose the low-level features with iterative Laplacian pyramid and directional filter bank to mine the structural texture knowledge, and Texture Intensity Equalization Module (TIEM) is designed to extract and enhance the statistical texture knowledge with the corresponding Quantization Congruence Loss (QDL). Moreover, we propose the Co-occurrence TIEM (C-TIEM) and generic segmentation frameworks, namely STLNet++ and U-SSNet, to enable existing segmentation networks to harvest the structural and statistical texture information more effectively. Extensive experimental results on three segmentation tasks demonstrate the effectiveness of the proposed methods and their state-of-the-art performance on seven popular benchmark datasets, respectively.

Title: Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection

Authors: Ying Fu Lim, Jiawen Zhu, Guansong Pang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.08045
Pdf URL: https://arxiv.org/pdf/2503.08045
Copy Paste: [[2503.08045]] Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection(https://arxiv.org/abs/2503.08045)
Keywords: security, robust, large language model
Abstract: Log Anomaly Detection (LAD) seeks to identify atypical patterns in log data that are crucial to assessing the security and condition of systems. Although Large Language Models (LLMs) have shown tremendous success in various fields, the use of LLMs in enabling the detection of log anomalies is largely unexplored. This work aims to fill this gap. Due to the prohibitive costs involved in fully fine-tuning LLMs, we explore the use of parameter-efficient fine-tuning techniques (PEFTs) for adapting LLMs to LAD. To have an in-depth exploration of the potential of LLM-driven LAD, we present a comprehensive investigation of leveraging two of the most popular PEFTs -- Low-Rank Adaptation (LoRA) and Representation Fine-tuning (ReFT) -- to tap into three prominent LLMs of varying size, including RoBERTa, GPT-2, and Llama-3, for parameter-efficient LAD. Comprehensive experiments on four public log datasets are performed to reveal important insights into effective LLM-driven LAD in several key perspectives, including the efficacy of these PEFT-based LLM-driven LAD methods, their stability, sample efficiency, robustness w.r.t. unstable logs, and cross-dataset generalization. Code is available at this https URL.

Title: SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models

Authors: Nadarasar Bahavan, Sachith Seneviratne, Saman Halgamuge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08049
Pdf URL: https://arxiv.org/pdf/2503.08049
Copy Paste: [[2503.08049]] SphOR: A Representation Learning Perspective on Open-set Recognition for Identifying Unknown Classes in Deep Learning Models(https://arxiv.org/abs/2503.08049)
Keywords: generative
Abstract: The widespread use of deep learning classifiers necessitates Open-set recognition (OSR), which enables the identification of input data not only from classes known during training but also from unknown classes that might be present in test data. Many existing OSR methods are computationally expensive due to the reliance on complex generative models or suffer from high training costs. We investigate OSR from a representation-learning perspective, specifically through spherical embeddings. We introduce SphOR, a computationally efficient representation learning method that models the feature space as a mixture of von Mises-Fisher distributions. This approach enables the use of semantically ambiguous samples during training, to improve the detection of samples from unknown classes. We further explore the relationship between OSR performance and key representation learning properties which influence how well features are structured in high-dimensional space. Extensive experiments on multiple OSR benchmarks demonstrate the effectiveness of our method, producing state-of-the-art results, with improvements up-to 6% that validate its performance.

Title: "We just did not have that on the embedded system": Insights and Challenges for Securing Microcontroller Systems from the Embedded CTF Competitions

Authors: Zheyuan Ma, Gaoxiang Liu, Alex Eastman, Kai Kaufman, Md Armanuzzaman, Xi Tan, Katherine Jesse, Robert Walls, Ziming Zhao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.08053
Pdf URL: https://arxiv.org/pdf/2503.08053
Copy Paste: [[2503.08053]] "We just did not have that on the embedded system": Insights and Challenges for Securing Microcontroller Systems from the Embedded CTF Competitions(https://arxiv.org/abs/2503.08053)
Keywords: secure, security, attack
Abstract: Microcontroller systems are integral to our daily lives, powering mission-critical applications such as vehicles, medical devices, and industrial control systems. Therefore, it is essential to investigate and outline the challenges encountered in developing secure microcontroller systems. While previous research has focused solely on microcontroller firmware analysis to identify and characterize vulnerabilities, our study uniquely leverages data from the 2023 and 2024 MITRE eCTF team submissions and post-competition interviews. This approach allows us to dissect the entire lifecycle of secure microcontroller system development from both technical and perceptual perspectives, providing deeper insights into how these vulnerabilities emerge in the first place. Through the lens of eCTF, we identify fundamental conceptual and practical challenges in securing microcontroller systems. Conceptually, it is difficult to adapt from a microprocessor system to a microcontroller system, and participants are not wholly aware of the unique attacks against microcontrollers. Practically, security-enhancing tools, such as the memory-safe language Rust, lack adequate support on microcontrollers. Additionally, poor-quality entropy sources weaken cryptography and secret generation. Additionally, our findings articulate specific research, developmental, and educational deficiencies, leading to targeted recommendations for researchers, developers, vendors, and educators to enhance the security of microcontroller systems.

Title: Unmasking the Unknown: Facial Deepfake Detection in the Open-Set Paradigm

Authors: Nadarasar Bahavan, Sanjay Saha, Ken Chen, Sachith Seneviratne, Sanka Rasnayaka, Saman Halgamuge
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08055
Pdf URL: https://arxiv.org/pdf/2503.08055
Copy Paste: [[2503.08055]] Unmasking the Unknown: Facial Deepfake Detection in the Open-Set Paradigm(https://arxiv.org/abs/2503.08055)
Keywords: robust, generative
Abstract: Facial forgery methods such as deepfakes can be misused for identity manipulation and spreading misinformation. They have evolved alongside advancements in generative AI, leading to new and more sophisticated forgery techniques that diverge from existing 'known' methods. Conventional deepfake detection methods use the closedset paradigm, thus limiting their applicability to detecting forgeries created using methods that are not part of the training dataset. In this paper, we propose a shift from the closed-set paradigm for deepfake detection. In the open-set paradigm, models are designed not only to identify images created by known facial forgery methods but also to identify and flag those produced by previously unknown methods as 'unknown' and not as unforged/real/unmanipulated. In this paper, we propose an open-set deepfake classification algorithm based on supervised contrastive learning. The open-set paradigm used in our model allows it to function as a more robust tool capable of handling emerging and unseen deepfake techniques, enhancing reliability and confidence, and complementing forensic analysis. In open-set paradigm, we identify three groups including the "unknown group that is neither considered known deepfake nor real. We investigate deepfake open-set classification across three scenarios, classifying deepfakes from unknown methods not as real, distinguishing real images from deepfakes, and classifying deepfakes from known methods, using the FaceForensics++ dataset as a benchmark. Our method achieves state of the art results in the first two tasks and competitive results in the third task.

Title: Odysseus Navigates the Sirens' Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation

Authors: Wen Luo, Feifan Song, Wei Li, Guangyue Peng, Shaohang Wei, Houfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08057
Pdf URL: https://arxiv.org/pdf/2503.08057
Copy Paste: [[2503.08057]] Odysseus Navigates the Sirens' Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation(https://arxiv.org/abs/2503.08057)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly required to generate text that is both factually accurate and diverse across various open-ended applications. However, current stochastic decoding methods struggle to balance such objectives. We introduce Dynamic Focus Decoding (DFD), a novel plug-and-play stochastic approach that resolves this trade-off without requiring additional data, knowledge, or models. DFD adaptively adjusts the decoding focus based on distributional differences across layers, leveraging the modular and hierarchical nature of factual knowledge within LLMs. This dynamic adjustment improves factuality in knowledge-intensive decoding steps and promotes diversity in less knowledge-reliant steps. DFD can be easily integrated with existing decoding methods, enhancing both factuality and diversity with minimal computational overhead. Extensive experiments across seven datasets demonstrate that DFD significantly improves performance, providing a scalable and efficient solution for open-ended text generation.

Title: Symbolic Neural Ordinary Differential Equations

Authors: Xin Li, Chengli Zhao, Xue Zhang, Xiaojun Duan
Subjects: cs.LG, math-ph
Abstract URL: https://arxiv.org/abs/2503.08059
Pdf URL: https://arxiv.org/pdf/2503.08059
Copy Paste: [[2503.08059]] Symbolic Neural Ordinary Differential Equations(https://arxiv.org/abs/2503.08059)
Keywords: interpretability
Abstract: Differential equations are widely used to describe complex dynamical systems with evolving parameters in nature and engineering. Effectively learning a family of maps from the parameter function to the system dynamics is of great significance. In this study, we propose a novel learning framework of symbolic continuous-depth neural networks, termed Symbolic Neural Ordinary Differential Equations (SNODEs), to effectively and accurately learn the underlying dynamics of complex systems. Specifically, our learning framework comprises three stages: initially, pre-training a predefined symbolic neural network via a gradient flow matching strategy; subsequently, fine-tuning this network using Neural ODEs; and finally, constructing a general neural network to capture residuals. In this process, we apply the SNODEs framework to partial differential equation systems through Fourier analysis, achieving resolution-invariant modeling. Moreover, this framework integrates the strengths of symbolism and connectionism, boasting a universal approximation theorem while significantly enhancing interpretability and extrapolation capabilities relative to state-of-the-art baseline methods. We demonstrate this through experiments on several representative complex systems. Therefore, our framework can be further applied to a wide range of scientific problems, such as system bifurcation and control, reconstruction and forecasting, as well as the discovery of new equations.

Title: Context-aware Biases for Length Extrapolation

Authors: Ali Veisi, Amir Mansourian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08067
Pdf URL: https://arxiv.org/pdf/2503.08067
Copy Paste: [[2503.08067]] Context-aware Biases for Length Extrapolation(https://arxiv.org/abs/2503.08067)
Keywords: transformer
Abstract: Transformers' ability to generalize to longer sequences than they have been trained on, known as length extrapolation, degrades as sequence length increases. Most of Relative Positional Encoding (RPE) methods address this problem by either adding constant linear biases or learning general biases, lacking the ability to specialize for different sequences. In this work, inspired by ALiBi, we propose Context-aware Biases for Length Extrapolation (Cable), that learns token-specific biases for each head in decoder-based transformers. Cable learns adaptive, context-aware biases, overcoming the limitations of fixed patterns by adding dynamic biases specific to each token in the sequence. Results show that when tested on a sequence length of 1024, a GPT-3 Medium (334M parameters) with our positional encoding, trained on a sequence length of 512, achieves better perplexity (-0.65) than a similar network with sinusoidal positional encoding trained on a sequence length of 1024. This is achieved with 48% lower memory usage, and only 3.5% higher training time. Furthermore, our method notably improves the extrapolation ability of existing RPE methods on the Edu-FineWeb10B and WikiText-103 datasets. Code is available at: this https URL

Title: Seeing Beyond Haze: Generative Nighttime Image Dehazing

Authors: Beibei Lin, Stephen Lin, Robby Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08073
Pdf URL: https://arxiv.org/pdf/2503.08073
Copy Paste: [[2503.08073]] Seeing Beyond Haze: Generative Nighttime Image Dehazing(https://arxiv.org/abs/2503.08073)
Keywords: diffusion, generative
Abstract: Nighttime image dehazing is particularly challenging when dense haze and intense glow severely degrade or completely obscure background information. Existing methods often encounter difficulties due to insufficient background priors and limited generative ability, both essential for handling such conditions. In this paper, we introduce BeyondHaze, a generative nighttime dehazing method that not only significantly reduces haze and glow effects but also infers background information in regions where it may be absent. Our approach is developed on two main ideas: gaining strong background priors by adapting image diffusion models to the nighttime dehazing problem, and enhancing generative ability for haze- and glow-obscured scene areas through guided training. Task-specific nighttime dehazing knowledge is distilled into an image diffusion model in a manner that preserves its capacity to generate clean images. The diffusion model is additionally trained on image pairs designed to improve its ability to generate background details and content that are missing in the input image due to haze effects. Since generative models are susceptible to hallucinations, we develop our framework to allow user control over the generative level, balancing visual realism and factual accuracy. Experiments on real-world images demonstrate that BeyondHaze effectively restores visibility in dense nighttime haze.

Title: Trend-Aware Supervision: On Learning Invariance for Semi-Supervised Facial Action Unit Intensity Estimation

Authors: Yingjie Chen, Jiarui Zhang, Tao Wang, Yun Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08078
Pdf URL: https://arxiv.org/pdf/2503.08078
Copy Paste: [[2503.08078]] Trend-Aware Supervision: On Learning Invariance for Semi-Supervised Facial Action Unit Intensity Estimation(https://arxiv.org/abs/2503.08078)
Keywords: robust
Abstract: With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations could act as extra supervision and raising the awareness of AU-specific facial appearance changing trends during training is the key to learning invariant AU-specific features. To this end, we propose \textbf{T}rend-\textbf{A}ware \textbf{S}upervision (TAS), which pursues three kinds of trend awareness, including intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, to achieve intensity estimation invariance. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. And under trend-aware supervision, the performance can be improved without extra computational or storage costs during inference.

Title: Advancing Sentiment Analysis: A Novel LSTM Framework with Multi-head Attention

Authors: Jingyuan Yi, Peiyang Yu, Tianyi Huang, Xiaochuan Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08079
Pdf URL: https://arxiv.org/pdf/2503.08079
Copy Paste: [[2503.08079]] Advancing Sentiment Analysis: A Novel LSTM Framework with Multi-head Attention(https://arxiv.org/abs/2503.08079)
Keywords: extraction
Abstract: This work proposes an LSTM-based sentiment classification model with multi-head attention mechanism and TF-IDF optimization. Through the integration of TF-IDF feature extraction and multi-head attention, the model significantly improves text sentiment analysis performance. Experimental results on public data sets demonstrate that the new method achieves substantial improvements in the most critical metrics like accuracy, recall, and F1-score compared to baseline models. Specifically, the model achieves an accuracy of 80.28% on the test set, which is improved by about 12% in comparison with standard LSTM models. Ablation experiments also support the necessity and necessity of all modules, in which the impact of multi-head attention is greatest to performance improvement. This research provides a proper approach to sentiment analysis, which can be utilized in public opinion monitoring, product recommendation, etc.

Title: Degradation Self-Supervised Learning for Lithium-ion Battery Health Diagnostics

Authors: J. C. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08083
Pdf URL: https://arxiv.org/pdf/2503.08083
Copy Paste: [[2503.08083]] Degradation Self-Supervised Learning for Lithium-ion Battery Health Diagnostics(https://arxiv.org/abs/2503.08083)
Keywords: transformer
Abstract: Health evaluation for lithium-ion batteries (LIBs) typically relies on constant charging/discharging protocols, often neglecting scenarios involving dynamic current profiles prevalent in electric vehicles. Conventional health indicators for LIBs also depend on the uniformity of measured data, restricting their adaptability to non-uniform conditions. In this study, a novel training strategy for estimating LIB health based on the paradigm of self-supervised learning is proposed. A multiresolution analysis technique, empirical wavelet transform, is utilized to decompose non-stationary voltage signals in the frequency domain. This allows the removal of ineffective components for the health evaluation model. The transformer neural network serves as the model backbone, and a loss function is designed to describe the capacity degradation behavior with the assumption that the degradation in LIBs across most operating conditions is inevitable and irreversible. The results show that the model can learn the aging characteristics by analyzing sequences of voltage and current profiles obtained at various time intervals from the same LIB cell. The proposed method is successfully applied to the Stanford University LIB aging dataset, derived from electric vehicle real driving profiles. Notably, this approach achieves an average correlation coefficient of 0.9 between the evaluated health index and the degradation of actual capacity, demonstrating its efficacy in capturing LIB health degradation. This research highlights the feasibility of training deep neural networks using unlabeled LIB data, offering cost-efficient means and unleashing the potential of the measured information.

Title: PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models

Authors: Kyeongkook Seo, Dong-Jun Han, Jaejun Yoo
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08085
Pdf URL: https://arxiv.org/pdf/2503.08085
Copy Paste: [[2503.08085]] PRISM: Privacy-Preserving Improved Stochastic Masking for Federated Generative Models(https://arxiv.org/abs/2503.08085)
Keywords: privacy, federate, generative
Abstract: Despite recent advancements in federated learning (FL), the integration of generative models into FL has been limited due to challenges such as high communication costs and unstable training in heterogeneous data environments. To address these issues, we propose PRISM, a FL framework tailored for generative models that ensures (i) stable performance in heterogeneous data distributions and (ii) resource efficiency in terms of communication cost and final model size. The key of our method is to search for an optimal stochastic binary mask for a random network rather than updating the model weights, identifying a sparse subnetwork with high generative performance; i.e., a ``strong lottery ticket''. By communicating binary masks in a stochastic manner, PRISM minimizes communication overhead. This approach, combined with the utilization of maximum mean discrepancy (MMD) loss and a mask-aware dynamic moving average aggregation method (MADA) on the server side, facilitates stable and strong generative capabilities by mitigating local divergence in FL scenarios. Moreover, thanks to its sparsifying characteristic, PRISM yields a lightweight model without extra pruning or quantization, making it ideal for environments such as edge devices. Experiments on MNIST, FMNIST, CelebA, and CIFAR10 demonstrate that PRISM outperforms existing methods, while maintaining privacy with minimal communication costs. PRISM is the first to successfully generate images under challenging non-IID and privacy-preserving FL environments on complex datasets, where previous methods have struggled.

Title: SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection

Authors: Hyeongseok Son, Jia He, Seung-In Park, Ying Min, Yunhao Zhang, ByungIn Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08092
Pdf URL: https://arxiv.org/pdf/2503.08092
Copy Paste: [[2503.08092]] SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection(https://arxiv.org/abs/2503.08092)
Keywords: extraction, transformer
Abstract: Most previous 3D object detection methods that leverage the multi-modality of LiDAR and cameras utilize the Bird's Eye View (BEV) space for intermediate feature representation. However, this space uses a low x, y-resolution and sacrifices z-axis information to reduce the overall feature resolution, which may result in declined accuracy. To tackle the problem of using low-resolution features, this paper focuses on the sparse nature of LiDAR point cloud data. From our observation, the number of occupied cells in the 3D voxels constructed from a LiDAR data can be even fewer than the number of total cells in the BEV map, despite the voxels' significantly higher resolution. Based on this, we introduce a novel sparse voxel-based transformer network for 3D object detection, dubbed as SparseVoxFormer. Instead of performing BEV feature extraction, we directly leverage sparse voxel features as the input for a transformer-based detector. Moreover, with regard to the camera modality, we introduce an explicit modality fusion approach that involves projecting 3D voxel coordinates onto 2D images and collecting the corresponding image features. Thanks to these components, our approach can leverage geometrically richer multi-modal features while even reducing the computational cost. Beyond the proof-of-concept level, we further focus on facilitating better multi-modal fusion and flexible control over the number of sparse features. Finally, thorough experimental results demonstrate that utilizing a significantly smaller number of sparse features drastically reduces computational costs in a 3D object detector while enhancing both overall and long-range performance.

Title: MVGSR: Multi-View Consistency Gaussian Splatting for Robust Surface Reconstruction

Authors: Chenfeng Hou, Qi Xun Yeo, Mengqi Guo, Yongxin Su, Yanyan Li, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08093
Pdf URL: https://arxiv.org/pdf/2503.08093
Copy Paste: [[2503.08093]] MVGSR: Multi-View Consistency Gaussian Splatting for Robust Surface Reconstruction(https://arxiv.org/abs/2503.08093)
Keywords: robust, segmentation
Abstract: 3D Gaussian Splatting (3DGS) has gained significant attention for its high-quality rendering capabilities, ultra-fast training, and inference speeds. However, when we apply 3DGS to surface reconstruction tasks, especially in environments with dynamic objects and distractors, the method suffers from floating artifacts and color errors due to inconsistency from different viewpoints. To address this challenge, we propose Multi-View Consistency Gaussian Splatting for the domain of Robust Surface Reconstruction (\textbf{MVGSR}), which takes advantage of lightweight Gaussian models and a {heuristics-guided distractor masking} strategy for robust surface reconstruction in non-static environments. Compared to existing methods that rely on MLPs for distractor segmentation strategies, our approach separates distractors from static scene elements by comparing multi-view feature consistency, allowing us to obtain precise distractor masks early in training. Furthermore, we introduce a pruning measure based on multi-view contributions to reset transmittance, effectively reducing floating artifacts. Finally, a multi-view consistency loss is applied to achieve high-quality performance in surface reconstruction tasks. Experimental results demonstrate that MVGSR achieves competitive geometric accuracy and rendering fidelity compared to the state-of-the-art surface reconstruction algorithms. More information is available on our project page (\href{this https URL}{this url})

Title: MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution

Authors: Xinrui Li, Jianlong Wu, Xinchuan Huang, Chong Chen, Weili Guan, Xian-Sheng Hua, Liqiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08096
Pdf URL: https://arxiv.org/pdf/2503.08096
Copy Paste: [[2503.08096]] MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution(https://arxiv.org/abs/2503.08096)
Keywords: diffusion, segmentation
Abstract: Pioneering text-to-image (T2I) diffusion models have ushered in a new era of real-world image super-resolution (Real-ISR), significantly enhancing the visual perception of reconstructed images. However, existing methods typically integrate uniform abstract textual semantics across all blocks, overlooking the distinct semantic requirements at different depths and the fine-grained, concrete semantics inherently present in the images themselves. Moreover, relying solely on a single type of guidance further disrupts the consistency of reconstruction. To address these issues, we propose MegaSR, a novel framework that mines customized block-wise semantics and expressive guidance for diffusion-based ISR. Compared to uniform textual semantics, MegaSR enables flexible adaptation to multi-granularity semantic awareness by dynamically incorporating image attributes at each block. Furthermore, we experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance, and propose a multi-stage aggregation strategy to modulate them into the T2I models. Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.

Title: Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors

Authors: Runxi Cheng, Feng Xiong, Yongxian Wei, Wanyun Zhu, Chun Yuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08099
Pdf URL: https://arxiv.org/pdf/2503.08099
Copy Paste: [[2503.08099]] Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors(https://arxiv.org/abs/2503.08099)
Keywords: data-free
Abstract: Model merging seeks to integrate task-specific expert models into a unified architecture while preserving multi-task generalization capabilities, yet parameter interference between constituent models frequently induces performance degradation. Although prior work has explored many merging strategies, resolving interference without additional data for retraining or test-time computation remains challenging. In this paper, we theoretically demonstrate that the task vectors of the linear layer constitute an approximate linear subspace for its corresponding input. Therefore, we can minimize interference under the guidance of task vectors. Based on this insight, we propose \textbf{WUDI-Merging} (\textbf{W}hoever started the interference sho\textbf{U}ld en\textbf{D} \textbf{I}t), a simple yet effective model merging method that eliminates interference without any additional data or rescaling coefficients. Comprehensive empirical evaluations across vision and language benchmarks demonstrate our method's superiority, achieving state-of-the-art performance in data-free model merging scenarios (average 10.9\% improvement versus baseline methods) while even outperforming mainstream test-time adaptation approaches by 3.3\%, and only very few computing resources are required. The code will be publicly available soon.

Title: Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning

Authors: Lizhen Xu, Xiuxiu Bai, Xiaojun Jia, Jianwu Fang, Shanmin Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08101
Pdf URL: https://arxiv.org/pdf/2503.08101
Copy Paste: [[2503.08101]] Accelerate 3D Object Detection Models via Zero-Shot Attention Key Pruning(https://arxiv.org/abs/2503.08101)
Keywords: transformer
Abstract: Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient running on edge devices. Existing pruning and distillation methods either need retraining or are designed for ViT models, which are hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification score to multiply it with the attention map to get the importance score of each key and then prune certain keys after each transformer layer according to their importance scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code can be found at this https URL.

Title: ACE: Concept Editing in Diffusion Models without Performance Degradation

Authors: Ruipeng Wang, Junfeng Fang, Jiaqi Li, Hao Chen, Jie Shi, Kun Wang, Xiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08116
Pdf URL: https://arxiv.org/pdf/2503.08116
Copy Paste: [[2503.08116]] ACE: Concept Editing in Diffusion Models without Performance Degradation(https://arxiv.org/abs/2503.08116)
Keywords: diffusion
Abstract: Diffusion-based text-to-image models have demonstrated remarkable capabilities in generating realistic images, but they raise societal and ethical concerns, such as the creation of unsafe content. While concept editing is proposed to address these issues, they often struggle to balance the removal of unsafe concept with maintaining the model's general genera-tive capabilities. In this work, we propose ACE, a new editing method that enhances concept editing in diffusion models. ACE introduces a novel cross null-space projection approach to precisely erase unsafe concept while maintaining the model's ability to generate high-quality, semantically consistent images. Extensive experiments demonstrate that ACE significantly outperforms the advancing baselines,improving semantic consistency by 24.56% and image generation quality by 34.82% on average with only 1% of the time cost. These results highlight the practical utility of concept editing by mitigating its potential risks, paving the way for broader applications in the field. Code is avaliable at this https URL

Title: Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models

Authors: Weiguo Gao, Ming Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08117
Pdf URL: https://arxiv.org/pdf/2503.08117
Copy Paste: [[2503.08117]] Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models(https://arxiv.org/abs/2503.08117)
Keywords: generative
Abstract: The increasing prevalence of synthetic data in training loops has raised concerns about model collapse, where generative models degrade when trained on their own outputs. While prior work focuses on this self-consuming process, we study an underexplored yet prevalent phenomenon: co-evolving generative models that shape each other's training through iterative feedback. This is common in multimodal AI ecosystems, such as social media platforms, where text models generate captions that guide image models, and the resulting images influence the future adaptation of the text model. We take a first step by analyzing such a system, modeling the text model as a multinomial distribution and the image model as a conditional multi-dimensional Gaussian distribution. Our analysis uncovers three key results. First, when one model remains fixed, the other collapses: a frozen image model causes the text model to lose diversity, while a frozen text model leads to an exponential contraction of image diversity, though fidelity remains bounded. Second, in fully interactive systems, mutual reinforcement accelerates collapse, with image contraction amplifying text homogenization and vice versa, leading to a Matthew effect where dominant texts sustain higher image diversity while rarer texts collapse faster. Third, we analyze stabilization strategies implicitly introduced by real-world external influences. Random corpus injections for text models and user-content injections for image models prevent collapse while preserving both diversity and fidelity. Our theoretical findings are further validated through experiments.

Title: Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

Authors: Junzhe Li, Xuerui Qiu, Linrui Xu, Liya Guo, Delin Qu, Tingting Long, Chun Fan, Ming Li
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2503.08120
Pdf URL: https://arxiv.org/pdf/2503.08120
Copy Paste: [[2503.08120]] Uni$\textbf{F}^2$ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models(https://arxiv.org/abs/2503.08120)
Keywords: diffusion, generative
Abstract: Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on $\textbf{coarse}$ facial attribute understanding, with limited capacity to handle $\textbf{fine-grained}$ facial attributes and without addressing generation capabilities. To overcome these limitations, we propose Uni$\textbf{F}^2$ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train Uni$\textbf{F}^2$ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, Uni$\textbf{F}^2$ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on Uni$\textbf{F}^2$ace-130K demonstrate that Uni$\textbf{F}^2$ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.

Title: AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification

Authors: Huy Nguyen, Kien Nguyen, Akila Pemasiri, Feng Liu, Sridha Sridharan, Clinton Fookes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08121
Pdf URL: https://arxiv.org/pdf/2503.08121
Copy Paste: [[2503.08121]] AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification(https://arxiv.org/abs/2503.08121)
Keywords: robust
Abstract: We introduce AG-VPReID, a challenging large-scale benchmark dataset for aerial-ground video-based person re-identification (ReID), comprising 6,632 identities, 32,321 tracklets, and 9.6 million frames captured from drones (15-120m altitude), CCTV, and wearable cameras. This dataset presents a real-world benchmark to investigate the robustness of Person ReID approaches against the unique challenges of cross-platform aerial-ground settings. To address these challenges, we propose AG-VPReID-Net, an end-to-end framework combining three complementary streams: (1) an Adapted Temporal-Spatial Stream addressing motion pattern inconsistencies and temporal feature learning, (2) a Normalized Appearance Stream using physics-informed techniques to tackle resolution and appearance changes, and (3) a Multi-Scale Attention Stream handling scale variations across drone altitudes. Our approach integrates complementary visual-semantic information from all streams to generate robust, viewpoint-invariant person representations. Extensive experiments demonstrate that AG-VPReID-Net outperforms state-of-the-art approaches on both our new dataset and other existing video-based ReID benchmarks, showcasing its effectiveness and generalizability. The relatively lower performance of all state-of-the-art approaches, including our proposed approach, on our new dataset highlights its challenging nature. The AG-VPReID dataset, code and models are available at this https URL.

Title: Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments

Authors: Soonwoo Kwon, Jin-Young Kim, Hyojun Go, Kyungjune Baek
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08122
Pdf URL: https://arxiv.org/pdf/2503.08122
Copy Paste: [[2503.08122]] Toward Stable World Models: Measuring and Addressing World Instability in Generative Environments(https://arxiv.org/abs/2503.08122)
Keywords: diffusion, generative
Abstract: We present a novel study on enhancing the capability of preserving the content in world models, focusing on a property we term World Stability. Recent diffusion-based generative models have advanced the synthesis of immersive and realistic environments that are pivotal for applications such as reinforcement learning and interactive game engines. However, while these models excel in quality and diversity, they often neglect the preservation of previously generated scenes over time--a shortfall that can introduce noise into agent learning and compromise performance in safety-critical settings. In this work, we introduce an evaluation framework that measures world stability by having world models perform a sequence of actions followed by their inverses to return to their initial viewpoint, thereby quantifying the consistency between the starting and ending observations. Our comprehensive assessment of state-of-the-art diffusion-based world models reveals significant challenges in achieving high world stability. Moreover, we investigate several improvement strategies to enhance world stability. Our results underscore the importance of world stability in world modeling and provide actionable insights for future research in this domain.

Title: Large Scale Multi-Task Bayesian Optimization with Large Language Models

Authors: Yimeng Zeng, Natalie Maus, Haydn Thomas Jones, Jeffrey Tao, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Ryan Marcus, Osbert Bastani, Jacob R. Gardner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08131
Pdf URL: https://arxiv.org/pdf/2503.08131
Copy Paste: [[2503.08131]] Large Scale Multi-Task Bayesian Optimization with Large Language Models(https://arxiv.org/abs/2503.08131)
Keywords: large language model
Abstract: In multi-task Bayesian optimization, the goal is to leverage experience from optimizing existing tasks to improve the efficiency of optimizing new ones. While approaches using multi-task Gaussian processes or deep kernel transfer exist, the performance improvement is marginal when scaling to more than a moderate number of tasks. We introduce a novel approach leveraging large language models (LLMs) to learn from, and improve upon, previous optimization trajectories, scaling to approximately 2000 distinct tasks. Specifically, we propose an iterative framework in which an LLM is fine-tuned using the high quality solutions produced by BayesOpt to generate improved initializations that accelerate convergence for future optimization tasks based on previous search trajectories. We evaluate our method on two distinct domains: database query optimization and antimicrobial peptide design. Results demonstrate that our approach creates a positive feedback loop, where the LLM's generated initializations gradually improve, leading to better optimization performance. As this feedback loop continues, we find that the LLM is eventually able to generate solutions to new tasks in just a few shots that are better than the solutions produced by "from scratch" by Bayesian optimization while simultaneously requiring significantly fewer oracle calls.

Title: MGHanD: Multi-modal Guidance for authentic Hand Diffusion

Authors: Taehyeon Eum, Jieun Choi, Tae-Kyun Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08133
Pdf URL: https://arxiv.org/pdf/2503.08133
Copy Paste: [[2503.08133]] MGHanD: Multi-modal Guidance for authentic Hand Diffusion(https://arxiv.org/abs/2503.08133)
Keywords: diffusion, generative
Abstract: Diffusion-based methods have achieved significant successes in T2I generation, providing realistic images from text prompts. Despite their capabilities, these models face persistent challenges in generating realistic human hands, often producing images with incorrect finger counts and structurally deformed hands. MGHanD addresses this challenge by applying multi-modal guidance during the inference process. For visual guidance, we employ a discriminator trained on a dataset comprising paired real and generated images with captions, derived from various hand-in-the-wild datasets. We also employ textual guidance with LoRA adapter, which learns the direction from `hands' towards more detailed prompts such as `natural hands', and `anatomically correct fingers' at the latent level. A cumulative hand mask which is gradually enlarged in the assigned time step is applied to the added guidance, allowing the hand to be refined while maintaining the rich generative capabilities of the pre-trained model. In the experiments, our method achieves superior hand generation qualities, without any specific conditions or priors. We carry out both quantitative and qualitative evaluations, along with user studies, to showcase the benefits of our approach in producing high-quality hand images.

Title: ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting

Authors: Junfu Guo, Yu Xin, Gaoyi Liu, Kai Xu, Ligang Liu, Ruizhen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08135
Pdf URL: https://arxiv.org/pdf/2503.08135
Copy Paste: [[2503.08135]] ArticulatedGS: Self-supervised Digital Twin Modeling of Articulated Objects using 3D Gaussian Splatting(https://arxiv.org/abs/2503.08135)
Keywords: segmentation
Abstract: We tackle the challenge of concurrent reconstruction at the part level with the RGB appearance and estimation of motion parameters for building digital twins of articulated objects using the 3D Gaussian Splatting (3D-GS) method. With two distinct sets of multi-view imagery, each depicting an object in separate static articulation configurations, we reconstruct the articulated object in 3D Gaussian representations with both appearance and geometry information at the same time. Our approach decoupled multiple highly interdependent parameters through a multi-step optimization process, thereby achieving a stable optimization procedure and high-quality outcomes. We introduce ArticulatedGS, a self-supervised, comprehensive framework that autonomously learns to model shapes and appearances at the part level and synchronizes the optimization of motion parameters, all without reliance on 3D supervision, motion cues, or semantic labels. Our experimental results demonstrate that, among comparable methodologies, our approach has achieved optimal outcomes in terms of part segmentation accuracy, motion estimation accuracy, and visual quality.

Title: FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems

Authors: Jeongsol Kim, Bryan Sangwoo Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08136
Pdf URL: https://arxiv.org/pdf/2503.08136
Copy Paste: [[2503.08136]] FlowDPS: Flow-Driven Posterior Sampling for Inverse Problems(https://arxiv.org/abs/2503.08136)
Keywords: diffusion, transformer, generative
Abstract: Flow matching is a recent state-of-the-art framework for generative modeling based on ordinary differential equations (ODEs). While closely related to diffusion models, it provides a more general perspective on generative modeling. Although inverse problem solving has been extensively explored using diffusion models, it has not been rigorously examined within the broader context of flow models. Therefore, here we extend the diffusion inverse solvers (DIS) - which perform posterior sampling by combining a denoising diffusion prior with an likelihood gradient - into the flow framework. Specifically, by driving the flow-version of Tweedie's formula, we decompose the flow ODE into two components: one for clean image estimation and the other for noise estimation. By integrating the likelihood gradient and stochastic noise into each component, respectively, we demonstrate that posterior sampling for inverse problem solving can be effectively achieved using flows. Our proposed solver, Flow-Driven Posterior Sampling (FlowDPS), can also be seamlessly integrated into a latent flow model with a transformer architecture. Across four linear inverse problems, we confirm that FlowDPS outperforms state-of-the-art alternatives, all without requiring additional training.

Title: HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views

Authors: Ethan Griffiths, Maryam Haghighat, Simon Denman, Clinton Fookes, Milad Ramezani
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.08140
Pdf URL: https://arxiv.org/pdf/2503.08140
Copy Paste: [[2503.08140]] HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views(https://arxiv.org/abs/2503.08140)
Keywords: robust, transformer
Abstract: We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based Transformer, for large-scale 3D place recognition in both ground-to-ground and ground-to-aerial scenarios across urban and forest environments. We propose an octree-based multi-scale attention mechanism that captures spatial and semantic features across granularities. To address the variable density of point distributions from spinning lidar, we present cylindrical octree attention windows to reflect the underlying distribution during attention. We introduce relay tokens to enable efficient global-local interactions and multi-scale representation learning at reduced computational cost. Our pyramid attentional pooling then synthesises a robust global descriptor for end-to-end place recognition in challenging environments. In addition, we introduce CS-Wild-Places, a novel 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests. Point clouds in CS-Wild-Places contain representational gaps and distinctive attributes such as varying point densities and noise patterns, making it a challenging benchmark for cross-view localisation in the wild. HOTFormerLoc achieves a top-1 average recall improvement of 5.5% - 11.5% on the CS-Wild-Places benchmark. Furthermore, it consistently outperforms SOTA 3D place recognition methods, with an average performance gain of 5.8% on well-established urban and forest datasets. The code and CS-Wild-Places benchmark is available at https://csiro-robotics.github.io/HOTFormerLoc .

Title: Scaling Probabilistic Circuits via Data Partitioning

Authors: Jonas Seng, Florian Peter Busch, Pooja Prasad, Devendra Singh Dhami, Martin Mundt, Kristian Kersting
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08141
Pdf URL: https://arxiv.org/pdf/2503.08141
Copy Paste: [[2503.08141]] Scaling Probabilistic Circuits via Data Partitioning(https://arxiv.org/abs/2503.08141)
Keywords: federate
Abstract: Probabilistic circuits (PCs) enable us to learn joint distributions over a set of random variables and to perform various probabilistic queries in a tractable fashion. Though the tractability property allows PCs to scale beyond non-tractable models such as Bayesian Networks, scaling training and inference of PCs to larger, real-world datasets remains challenging. To remedy the situation, we show how PCs can be learned across multiple machines by recursively partitioning a distributed dataset, thereby unveiling a deep connection between PCs and federated learning (FL). This leads to federated circuits (FCs) -- a novel and flexible federated learning (FL) framework that (1) allows one to scale PCs on distributed learning environments (2) train PCs faster and (3) unifies for the first time horizontal, vertical, and hybrid FL in one framework by re-framing FL as a density estimation problem over distributed datasets. We demonstrate FC's capability to scale PCs on various large-scale datasets. Also, we show FC's versatility in handling horizontal, vertical, and hybrid FL within a unified framework on multiple classification tasks.

Title: Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method

Authors: Fei Wang, Chengcheng Chen, Hongyu Chen, Yugang Chang, Weiming Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08144
Pdf URL: https://arxiv.org/pdf/2503.08144
Copy Paste: [[2503.08144]] Bring Remote Sensing Object Detect Into Nature Language Model: Using SFT Method(https://arxiv.org/abs/2503.08144)
Keywords: large language model
Abstract: Recently, large language models (LLMs) and visionlanguage models (VLMs) have achieved significant success, demonstrating remarkable capabilities in understanding various images and videos, particularly in classification and detection tasks. However, due to the substantial differences between remote sensing images and conventional optical images, these models face considerable challenges in comprehension, especially in detection tasks. Directly prompting VLMs with detection instructions often fails to yield satisfactory results. To address this issue, this letter explores the application of VLMs for object detection in remote sensing images. Specifically, we utilize publicly available remote sensing object detection datasets, including SSDD, HRSID, and NWPU-VHR-10, to convert traditional annotation information into natural language, thereby constructing an instruction-tuning (SFT) dataset for VLM training. We then evaluate the detection performance of different fine-tuning strategies for VLMs and obtain optimized model weights for object detection in remote sensing images. Finally, we assess the model's prior knowledge capabilities through natural language this http URL results demonstrate that, without modifying the model architecture, remote sensing object detection can be effectively achieved using natural language alone. Additionally, the model exhibits the ability to perform certain vision question answering (VQA) tasks. Our dataset and relevant code will be released soon.

Title: FilmComposer: LLM-Driven Music Production for Silent Film Clips

Authors: Zhifeng Xie, Qile He, Youjia Zhu, Qiwei He, Mengtian Li
Subjects: cs.CV, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.08147
Pdf URL: https://arxiv.org/pdf/2503.08147
Copy Paste: [[2503.08147]] FilmComposer: LLM-Driven Music Production for Silent Film Clips(https://arxiv.org/abs/2503.08147)
Keywords: generative
Abstract: In this work, we implement music production for silent film clips using LLM-driven method. Given the strong professional demands of film music production, we propose the FilmComposer, simulating the actual workflows of professional musicians. FilmComposer is the first to combine large generative models with a multi-agent approach, leveraging the advantages of both waveform music and symbolic music generation. Additionally, FilmComposer is the first to focus on the three core elements of music production for film-audio quality, musicality, and musical development-and introduces various controls, such as rhythm, semantics, and visuals, to enhance these key aspects. Specifically, FilmComposer consists of the visual processing module, rhythm-controllable MusicGen, and multi-agent assessment, arrangement and mix. In addition, our framework can seamlessly integrate into the actual music production pipeline and allows user intervention in every step, providing strong interactivity and a high degree of creative freedom. Furthermore, we propose MusicPro-7k which includes 7,418 film clips, music, description, rhythm spots and main melody, considering the lack of a professional and high-quality film music dataset. Finally, both the standard metrics and the new specialized metrics we propose demonstrate that the music generated by our model achieves state-of-the-art performance in terms of quality, consistency with video, diversity, musicality, and musical development. Project page: this https URL

Title: Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features

Authors: Hanbyul Lee, Juneho Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08148
Pdf URL: https://arxiv.org/pdf/2503.08148
Copy Paste: [[2503.08148]] Few-Shot Class-Incremental Model Attribution Using Learnable Representation From CLIP-ViT Features(https://arxiv.org/abs/2503.08148)
Keywords: generative
Abstract: Recently, images that distort or fabricate facts using generative models have become a social concern. To cope with continuous evolution of generative artificial intelligence (AI) models, model attribution (MA) is necessary beyond just detection of synthetic images. However, current deep learning-based MA methods must be trained from scratch with new data to recognize unseen models, which is time-consuming and data-intensive. This work proposes a new strategy to deal with persistently emerging generative models. We adapt few-shot class-incremental learning (FSCIL) mechanisms for MA problem to uncover novel generative AI models. Unlike existing FSCIL approaches that focus on object classification using high-level information, MA requires analyzing low-level details like color and texture in synthetic images. Thus, we utilize a learnable representation from different levels of CLIP-ViT features. To learn an effective representation, we propose Adaptive Integration Module (AIM) to calculate a weighted sum of CLIP-ViT block features for each image, enhancing the ability to identify generative models. Extensive experiments show our method effectively extends from prior generative models to recent ones.

Title: Domain Adaptation and Entanglement: an Optimal Transport Perspective

Authors: Okan Koç, Alexander Soen, Chao-Kai Chiang, Masashi Sugiyama
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08155
Pdf URL: https://arxiv.org/pdf/2503.08155
Copy Paste: [[2503.08155]] Domain Adaptation and Entanglement: an Optimal Transport Perspective(https://arxiv.org/abs/2503.08155)
Keywords: robust
Abstract: Current machine learning systems are brittle in the face of distribution shifts (DS), where the target distribution that the system is tested on differs from the source distribution used to train the system. This problem of robustness to DS has been studied extensively in the field of domain adaptation. For deep neural networks, a popular framework for unsupervised domain adaptation (UDA) is domain matching, in which algorithms try to align the marginal distributions in the feature or output space. The current theoretical understanding of these methods, however, is limited and existing theoretical results are not precise enough to characterize their performance in practice. In this paper, we derive new bounds based on optimal transport that analyze the UDA problem. Our new bounds include a term which we dub as \emph{entanglement}, consisting of an expectation of Wasserstein distance between conditionals with respect to changing data distributions. Analysis of the entanglement term provides a novel perspective on the unoptimizable aspects of UDA. In various experiments with multiple models across several DS scenarios, we show that this term can be used to explain the varying performance of UDA algorithms.

Title: Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model

Authors: Yufan Chen, Ching Ting Leung, Jianwei Sun, Yong Huang, Linyan Li, Hao Chen, Hanyu Gao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08156
Pdf URL: https://arxiv.org/pdf/2503.08156
Copy Paste: [[2503.08156]] Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model(https://arxiv.org/abs/2503.08156)
Keywords: robust, extraction, large language model
Abstract: Artificial intelligence (AI) has demonstrated significant promise in advancing organic chemistry research; however, its effectiveness depends on the availability of high-quality chemical reaction data. Currently, most published chemical reactions are not available in machine-readable form, limiting the broader application of AI in this field. The extraction of published chemical reactions into structured databases still relies heavily on manual curation, and robust automatic parsing of chemical reaction images into machine-readable data remains a significant challenge. To address this, we introduce the Reaction Image Multimodal large language model (RxnIM), the first multimodal large language model specifically designed to parse chemical reaction images into machine-readable reaction data. RxnIM not only extracts key chemical components from reaction images but also interprets the textual content that describes reaction conditions. Together with specially designed large-scale dataset generation method to support model training, our approach achieves excellent performance, with an average F1 score of 88% on various benchmarks, surpassing literature methods by 5%. This represents a crucial step toward the automatic construction of large databases of machine-readable reaction data parsed from images in the chemistry literature, providing essential data resources for AI research in chemistry. The source code, model checkpoints, and datasets developed in this work are released under permissive licenses. An instance of the RxnIM web application can be accessed at this https URL.

Title: U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers

Authors: Zhanjie Zhang, Ao Ma, Ke Cao, Jing Wang, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08157
Pdf URL: https://arxiv.org/pdf/2503.08157
Copy Paste: [[2503.08157]] U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers(https://arxiv.org/abs/2503.08157)
Keywords: diffusion, transformer
Abstract: Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from the style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based style transfer approaches. Although these methods can generate some artistic stylized images, they still exhibit obvious artifacts and disharmonious patterns, which hinder their ability to produce ultra-high quality artistic stylized images. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement, generating ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images. This dataset effectively solves the problem that the existing style transfer methods cannot produce high-quality artistic stylized images due to the size of the dataset and the quality of the images in the dataset. Finally, the extensive qualitative and quantitative experiments validate that our U-StyDiT can create higher quality stylized images compared to state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.

Title: Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation

Authors: Taojie Kuang, Qianli Ma, Athanasios V. Vasilakos, Yu Wang, Qiang (Shawn)Cheng, Zhixiang Ren
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08160
Pdf URL: https://arxiv.org/pdf/2503.08160
Copy Paste: [[2503.08160]] Concept-Driven Deep Learning for Enhanced Protein-Specific Molecular Generation(https://arxiv.org/abs/2503.08160)
Keywords: interpretability, diffusion
Abstract: In recent years, deep learning techniques have made significant strides in molecular generation for specific targets, driving advancements in drug discovery. However, existing molecular generation methods present significant limitations: those operating at the atomic level often lack synthetic feasibility, drug-likeness, and interpretability, while fragment-based approaches frequently overlook comprehensive factors that influence protein-molecule interactions. To address these challenges, we propose a novel fragment-based molecular generation framework tailored for specific proteins. Our method begins by constructing a protein subpocket and molecular arm concept-based neural network, which systematically integrates interaction force information and geometric complementarity to sample molecular arms for specific protein subpockets. Subsequently, we introduce a diffusion model to generate molecular backbones that connect these arms, ensuring structural integrity and chemical diversity. Our approach significantly improves synthetic feasibility and binding affinity, with a 4% increase in drug-likeness and a 6% improvement in synthetic feasibility. Furthermore, by integrating explicit interaction data through a concept-based model, our framework enhances interpretability, offering valuable insights into the molecular design process.

Title: OASIS: Order-Augmented Strategy for Improved Code Search

Authors: Zuchen Gao, Zizheng Zhan, Xianming Li, Erxin Yu, Haotian Zhang, Yuqun Zhang, Jing Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.08161
Pdf URL: https://arxiv.org/pdf/2503.08161
Copy Paste: [[2503.08161]] OASIS: Order-Augmented Strategy for Improved Code Search(https://arxiv.org/abs/2503.08161)
Keywords: large language model
Abstract: Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.

Title: XAI4Extremes: An interpretable machine learning framework for understanding extreme-weather precursors under climate change

Authors: Jiawen Wei, Aniruddha Bora, Vivek Oommen, Chenyu Dong, Juntao Yang, Jeff Adie, Chen Chen, Simon See, George Karniadakis, Gianmarco Mengaldo
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2503.08163
Pdf URL: https://arxiv.org/pdf/2503.08163
Copy Paste: [[2503.08163]] XAI4Extremes: An interpretable machine learning framework for understanding extreme-weather precursors under climate change(https://arxiv.org/abs/2503.08163)
Keywords: interpretability
Abstract: Extreme weather events are increasing in frequency and intensity due to climate change. This, in turn, is exacting a significant toll in communities worldwide. While prediction skills are increasing with advances in numerical weather prediction and artificial intelligence tools, extreme weather still present challenges. More specifically, identifying the precursors of such extreme weather events and how these precursors may evolve under climate change remain unclear. In this paper, we propose to use post-hoc interpretability methods to construct relevance weather maps that show the key extreme-weather precursors identified by deep learning models. We then compare this machine view with existing domain knowledge to understand whether deep learning models identified patterns in data that may enrich our understanding of extreme-weather precursors. We finally bin these relevant maps into different multi-year time periods to understand the role that climate change is having on these precursors. The experiments are carried out on Indochina heatwaves, but the methodology can be readily extended to other extreme weather events worldwide.

Title: Multimodal Generation of Animatable 3D Human Models with AvatarForge

Authors: Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08165
Pdf URL: https://arxiv.org/pdf/2503.08165
Copy Paste: [[2503.08165]] Multimodal Generation of Animatable 3D Human Models with AvatarForge(https://arxiv.org/abs/2503.08165)
Keywords: diffusion
Abstract: We introduce AvatarForge, a framework for generating animatable 3D human avatars from text or image inputs using AI-driven procedural generation. While diffusion-based methods have made strides in general 3D object generation, they struggle with high-quality, customizable human avatars due to the complexity and diversity of human body shapes, poses, exacerbated by the scarcity of high-quality data. Additionally, animating these avatars remains a significant challenge for existing methods. AvatarForge overcomes these limitations by combining LLM-based commonsense reasoning with off-the-shelf 3D human generators, enabling fine-grained control over body and facial details. Unlike diffusion models which often rely on pre-trained datasets lacking precise control over individual human features, AvatarForge offers a more flexible approach, bringing humans into the iterative design and modeling loop, with its auto-verification system allowing for continuous refinement of the generated avatars, and thus promoting high accuracy and customization. Our evaluations show that AvatarForge outperforms state-of-the-art methods in both text- and image-to-avatar generation, making it a versatile tool for artistic creation and animation.

Title: TSCnet: A Text-driven Semantic-level Controllable Framework for Customized Low-Light Image Enhancement

Authors: Miao Zhang, Jun Yin, Pengyu Zeng, Yiqing Shen, Shuai Lu, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08168
Pdf URL: https://arxiv.org/pdf/2503.08168
Copy Paste: [[2503.08168]] TSCnet: A Text-driven Semantic-level Controllable Framework for Customized Low-Light Image Enhancement(https://arxiv.org/abs/2503.08168)
Keywords: robust, diffusion, large language model
Abstract: Deep learning-based image enhancement methods show significant advantages in reducing noise and improving visibility in low-light conditions. These methods are typically based on one-to-one mapping, where the model learns a direct transformation from low light to specific enhanced images. Therefore, these methods are inflexible as they do not allow highly personalized mapping, even though an individual's lighting preferences are inherently personalized. To overcome these limitations, we propose a new light enhancement task and a new framework that provides customized lighting control through prompt-driven, semantic-level, and quantitative brightness adjustments. The framework begins by leveraging a Large Language Model (LLM) to understand natural language prompts, enabling it to identify target objects for brightness adjustments. To localize these target objects, the Retinex-based Reasoning Segment (RRS) module generates precise target localization masks using reflection images. Subsequently, the Text-based Brightness Controllable (TBC) module adjusts brightness levels based on the generated illumination map. Finally, an Adaptive Contextual Compensation (ACC) module integrates multi-modal inputs and controls a conditional diffusion model to adjust the lighting, ensuring seamless and precise enhancements accurately. Experimental results on benchmark datasets demonstrate our framework's superior performance at increasing visibility, maintaining natural color balance, and amplifying fine details without creating artifacts. Furthermore, its robust generalization capabilities enable complex semantic-level lighting adjustments in diverse open-world environments through natural language interactions.

Title: CQVPR: Landmark-aware Contextual Queries for Visual Place Recognition

Authors: Dongyue Li, Daisuke Deguchi, Hiroshi Murase
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08170
Pdf URL: https://arxiv.org/pdf/2503.08170
Copy Paste: [[2503.08170]] CQVPR: Landmark-aware Contextual Queries for Visual Place Recognition(https://arxiv.org/abs/2503.08170)
Keywords: extraction
Abstract: Visual Place Recognition (VPR) aims to estimate the location of the given query image within a database of geo-tagged images. To identify the exact location in an image, detecting landmarks is crucial. However, in some scenarios, such as urban environments, there are numerous landmarks, such as various modern buildings, and the landmarks in different cities often exhibit high visual similarity. Therefore, it is essential not only to leverage the landmarks but also to consider the contextual information surrounding them, such as whether there are trees, roads, or other features around the landmarks. We propose the Contextual Query VPR (CQVPR), which integrates contextual information with detailed pixel-level visual features. By leveraging a set of learnable contextual queries, our method automatically learns the high-level contexts with respect to landmarks and their surrounding areas. Heatmaps depicting regions that each query attends to serve as context-aware features, offering cues that could enhance the understanding of each scene. We further propose a query matching loss to supervise the extraction process of contextual queries. Extensive experiments on several datasets demonstrate that the proposed method outperforms other state-of-the-art methods, especially in challenging scenarios.

Title: Towards All-in-One Medical Image Re-Identification

Authors: Yuan Tian, Kaiyuan Ji, Rongzhao Zhang, Yankai Jiang, Chunyi Li, Xiaosong Wang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08173
Pdf URL: https://arxiv.org/pdf/2503.08173
Copy Paste: [[2503.08173]] Towards All-in-One Medical Image Re-Identification(https://arxiv.org/abs/2503.08173)
Keywords: privacy, protect
Abstract: Medical image re-identification (MedReID) is under-explored so far, despite its critical applications in personalized healthcare and privacy protection. In this paper, we introduce a thorough benchmark and a unified model for this problem. First, to handle various medical modalities, we propose a novel Continuous Modality-based Parameter Adapter (ComPA). ComPA condenses medical content into a continuous modality representation and dynamically adjusts the modality-agnostic model with modality-specific parameters at runtime. This allows a single model to adaptively learn and process diverse modality data. Furthermore, we integrate medical priors into our model by aligning it with a bag of pre-trained medical foundation models, in terms of the differential features. Compared to single-image feature, modeling the inter-image difference better fits the re-identification problem, which involves discriminating multiple images. We evaluate the proposed model against 25 foundation models and 8 large multi-modal language models across 11 image datasets, demonstrating consistently superior performance. Additionally, we deploy the proposed MedReID technique to two real-world applications, i.e., history-augmented personalized diagnosis and medical privacy protection. Codes and model is available at \href{this https URL}{this https URL}.

Title: Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation

Authors: Minyue Dai, Jingbo Wang, Ke Fan, Bin Ji, Haoyu Zhao, Junting Dong, Bo Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08180
Pdf URL: https://arxiv.org/pdf/2503.08180
Copy Paste: [[2503.08180]] Towards Synthesized and Editable Motion In-Betweening Through Part-Wise Phase Representation(https://arxiv.org/abs/2503.08180)
Keywords: robust
Abstract: Styled motion in-betweening is crucial for computer animation and gaming. However, existing methods typically encode motion styles by modeling whole-body motions, often overlooking the representation of individual body parts. This limitation reduces the flexibility of infilled motion, particularly in adjusting the motion styles of specific limbs independently. To overcome this challenge, we propose a novel framework that models motion styles at the body-part level, enhancing both the diversity and controllability of infilled motions. Our approach enables more nuanced and expressive animations by allowing precise modifications to individual limb motions while maintaining overall motion coherence. Leveraging phase-related insights, our framework employs periodic autoencoders to automatically extract the phase of each body part, capturing distinctive local style features. Additionally, we effectively decouple the motion source from synthesis control by integrating motion manifold learning and conditional generation techniques from both image and motion domains. This allows the motion source to generate high-quality motions across various styles, with extracted motion and style features readily available for controlled synthesis in subsequent tasks. Comprehensive evaluations demonstrate that our method achieves superior speed, robust generalization, and effective generation of extended motion sequences.

Title: RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware

Authors: Gonzalo Santamaría Gómez, Guillem García Subies, Pablo Gutiérrez Ruiz, Mario González Valero, Natàlia Fuertes, Helena Montoro Zamorano, Carmen Muñoz Sanz, Leire Rosado Plaza, Nuria Aldama García, David Betancur Sánchez, Kateryna Sushkova, Marta Guerrero Nieto, Álvaro Barbero Jiménez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08188
Pdf URL: https://arxiv.org/pdf/2503.08188
Copy Paste: [[2503.08188]] RigoChat 2: an adapted language model to Spanish using a bounded dataset and reduced hardware(https://arxiv.org/abs/2503.08188)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become a key element of modern artificial intelligence, demonstrating the ability to address a wide range of language processing tasks at unprecedented levels of accuracy without the need of collecting problem-specific data. However, these versatile models face a significant challenge: both their training and inference processes require substantial computational resources, time, and memory. Consequently, optimizing this kind of models to minimize these requirements is crucial. In this article, we demonstrate that, with minimal resources and in a remarkably short time, it is possible to enhance a state-of-the-art model, specifically for a given language task, without compromising its overall capabilities using a relatively small pretrained LLM as a basis. Specifically, we present our use case, RigoChat 2, illustrating how LLMs can be adapted to achieve superior results in Spanish-language tasks.

Title: Automating Violence Detection and Categorization from Ancient Texts

Authors: Alhassan Abdelhalim, Michaela Regneri
Subjects: cs.CL, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08192
Pdf URL: https://arxiv.org/pdf/2503.08192
Copy Paste: [[2503.08192]] Automating Violence Detection and Categorization from Ancient Texts(https://arxiv.org/abs/2503.08192)
Keywords: large language model
Abstract: Violence descriptions in literature offer valuable insights for a wide range of research in the humanities. For historians, depictions of violence are of special interest for analyzing the societal dynamics surrounding large wars and individual conflicts of influential people. Harvesting data for violence research manually is laborious and time-consuming. This study is the first one to evaluate the effectiveness of large language models (LLMs) in identifying violence in ancient texts and categorizing it across multiple dimensions. Our experiments identify LLMs as a valuable tool to scale up the accurate analysis of historical texts and show the effect of fine-tuning and data augmentation, yielding an F1-score of up to 0.93 for violence detection and 0.86 for fine-grained violence categorization.

Title: Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation

Authors: Wenlong Meng, Fan Zhang, Wendao Yao, Zhenyuan Guo, Yuwei Li, Chengkun Wei, Wenzhi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08195
Pdf URL: https://arxiv.org/pdf/2503.08195
Copy Paste: [[2503.08195]] Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation(https://arxiv.org/abs/2503.08195)
Keywords: security, defense, attack, robust, large language model
Abstract: Large language models (LLMs) have demonstrated significant utility in a wide range of applications; however, their deployment is plagued by security vulnerabilities, notably jailbreak attacks. These attacks manipulate LLMs to generate harmful or unethical content by crafting adversarial prompts. While much of the current research on jailbreak attacks has focused on single-turn interactions, it has largely overlooked the impact of historical dialogues on model behavior. In this paper, we introduce a novel jailbreak paradigm, Dialogue Injection Attack (DIA), which leverages the dialogue history to enhance the success rates of such attacks. DIA operates in a black-box setting, requiring only access to the chat API or knowledge of the LLM's chat template. We propose two methods for constructing adversarial historical dialogues: one adapts gray-box prefilling attacks, and the other exploits deferred responses. Our experiments show that DIA achieves state-of-the-art attack success rates on recent LLMs, including Llama-3.1 and GPT-4o. Additionally, we demonstrate that DIA can bypass 5 different defense mechanisms, highlighting its robustness and effectiveness.

Title: A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models

Authors: Miao Zhang, Zhenlong Fang, Tianyi Wang, Qian Zhang, Shuai Lu, Junfeng Jiao, Tianyu Shi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08199
Pdf URL: https://arxiv.org/pdf/2503.08199
Copy Paste: [[2503.08199]] A Cascading Cooperative Multi-agent Framework for On-ramp Merging Control Integrating Large Language Models(https://arxiv.org/abs/2503.08199)
Keywords: interpretability, large language model
Abstract: Traditional Reinforcement Learning (RL) suffers from replicating human-like behaviors, generalizing effectively in multi-agent scenarios, and overcoming inherent interpretability this http URL tasks are compounded when deep environment understanding, agent coordination and dynamic optimization are required. While Large Language Model (LLM) enhanced methods have shown promise in generalization and interoperability, they often neglect necessary multi-agent coordination. Therefore, we introduce the Cascading Cooperative Multi-agent (CCMA) framework, integrating RL for individual interactions, a fine-tuned LLM for regional cooperation, a reward function for global optimization, and the Retrieval-augmented Generation mechanism to dynamically optimize decision-making across complex driving scenarios. Our experiments demonstrate that the CCMA outperforms existing RL methods, demonstrating significant improvements in both micro and macro-level performance in complex driving environments.

Title: Route Sparse Autoencoder to Interpret Large Language Models

Authors: Wei Shi, Sihang Li, Tao Liang, Mingyang Wan, Gojun Ma, Xiang Wang, Xiangnan He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08200
Pdf URL: https://arxiv.org/pdf/2503.08200
Copy Paste: [[2503.08200]] Route Sparse Autoencoder to Interpret Large Language Models(https://arxiv.org/abs/2503.08200)
Keywords: extraction, interpretability, large language model
Abstract: Mechanistic interpretability of large language models (LLMs) aims to uncover the internal processes of information propagation and reasoning. Sparse autoencoders (SAEs) have demonstrated promise in this domain by extracting interpretable and monosemantic features. However, prior works primarily focus on feature extraction from a single layer, failing to effectively capture activations that span multiple layers. In this paper, we introduce Route Sparse Autoencoder (RouteSAE), a new framework that integrates a routing mechanism with a shared SAE to efficiently extract features from multiple layers. It dynamically assigns weights to activations from different layers, incurring minimal parameter overhead while achieving high interpretability and flexibility for targeted feature manipulation. We evaluate RouteSAE through extensive experiments on Llama-3.2-1B-Instruct. Specifically, under the same sparsity constraint of 64, RouteSAE extracts 22.5% more features than baseline SAEs while achieving a 22.3% higher interpretability score. These results underscore the potential of RouteSAE as a scalable and effective method for LLM interpretability, with applications in feature discovery and model intervention. Our codes are available at this https URL.

Title: DeepRAG: Building a Custom Hindi Embedding Model for Retrieval Augmented Generation from Scratch

Authors: Nandakishor M
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08213
Pdf URL: https://arxiv.org/pdf/2503.08213
Copy Paste: [[2503.08213]] DeepRAG: Building a Custom Hindi Embedding Model for Retrieval Augmented Generation from Scratch(https://arxiv.org/abs/2503.08213)
Keywords: transformer
Abstract: In this paper, I present our work on DeepRAG, a specialized embedding model we built specifically for Hindi language in RAG systems. While LLMs have gotten really good at generating text, their performance in retrieval tasks still depends heavily on having quality embeddings - something that's been lacking for Hindi despite being one of the world's most spoken languages. We tackled this by creating embeddings from the ground up rather than just fine-tuning existing models. Our process involved collecting diverse Hindi texts (over 2.7M samples), training a custom SentencePiece tokenizer that actually understands Hindi morphology, designing transformer architecture with Hindi-specific attention mechanisms, and optimizing with contrastive learning. Results were honestly better than I expected - we saw a 23% improvement in retrieval precision compared to the multilingual models everyone's been using. The paper details our methodology, which I think could help others working with low-resource languages where the one-size-fits-all multilingual models fall short. We've also integrated our embeddings with LangChain to build complete Hindi RAG systems, which might be useful for practitioners. While there's still tons more to explore, I believe this work addresses a critical gap for Hindi NLP and demonstrates why language-specific approaches matter.

Title: MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior

Authors: Kaiqiang Xiong, Ying Feng, Qi Zhang, Jianbo Jiao, Yang Zhao, Zhihao Liang, Huachen Gao, Ronggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08218
Pdf URL: https://arxiv.org/pdf/2503.08218
Copy Paste: [[2503.08218]] MVD-HuGaS: Human Gaussians from a Single Image via 3D Human Multi-view Diffusion Prior(https://arxiv.org/abs/2503.08218)
Keywords: diffusion
Abstract: 3D human reconstruction from a single image is a challenging problem and has been exclusively studied in the literature. Recently, some methods have resorted to diffusion models for guidance, optimizing a 3D representation via Score Distillation Sampling(SDS) or generating one back-view image for facilitating reconstruction. However, these methods tend to produce unsatisfactory artifacts (\textit{e.g.} flattened human structure or over-smoothing results caused by inconsistent priors from multiple views) and struggle with real-world generalization in the wild. In this work, we present \emph{MVD-HuGaS}, enabling free-view 3D human rendering from a single image via a multi-view human diffusion model. We first generate multi-view images from the single reference image with an enhanced multi-view diffusion model, which is well fine-tuned on high-quality 3D human datasets to incorporate 3D geometry priors and human structure priors. To infer accurate camera poses from the sparse generated multi-view images for reconstruction, an alignment module is introduced to facilitate joint optimization of 3D Gaussians and camera poses. Furthermore, we propose a depth-based Facial Distortion Mitigation module to refine the generated facial regions, thereby improving the overall fidelity of the this http URL, leveraging the refined multi-view images, along with their accurate camera poses, MVD-HuGaS optimizes the 3D Gaussians of the target human for high-fidelity free-view renderings. Extensive experiments on Thuman2.0 and 2K2K datasets show that the proposed MVD-HuGaS achieves state-of-the-art performance on single-view 3D human rendering.

Title: CL-MVSNet: Unsupervised Multi-view Stereo with Dual-level Contrastive Learning

Authors: Kaiqiang Xiong, Rui Peng, Zhe Zhang, Tianxing Feng, Jianbo Jiao, Feng Gao, Ronggang Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08219
Pdf URL: https://arxiv.org/pdf/2503.08219
Copy Paste: [[2503.08219]] CL-MVSNet: Unsupervised Multi-view Stereo with Dual-level Contrastive Learning(https://arxiv.org/abs/2503.08219)
Keywords: robust
Abstract: Unsupervised Multi-View Stereo (MVS) methods have achieved promising progress recently. However, previous methods primarily depend on the photometric consistency assumption, which may suffer from two limitations: indistinguishable regions and view-dependent effects, e.g., low-textured areas and reflections. To address these issues, in this paper, we propose a new dual-level contrastive learning approach, named CL-MVSNet. Specifically, our model integrates two contrastive branches into an unsupervised MVS framework to construct additional supervisory signals. On the one hand, we present an image-level contrastive branch to guide the model to acquire more context awareness, thus leading to more complete depth estimation in indistinguishable regions. On the other hand, we exploit a scene-level contrastive branch to boost the representation ability, improving robustness to view-dependent effects. Moreover, to recover more accurate 3D geometry, we introduce an L0.5 photometric consistency loss, which encourages the model to focus more on accurate points while mitigating the gradient penalty of undesirable ones. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that our approach achieves state-of-the-art performance among all end-to-end unsupervised MVS frameworks and outperforms its supervised counterpart by a considerable margin without fine-tuning.

Title: EgoBlind: Towards Egocentric Visual Assistance for the Blind People

Authors: Junbin Xiao, Nanxin Huang, Hao Qiu, Zhulin Tao, Xun Yang, Richang Hong, Meng Wang, Angela Yao
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2503.08221
Pdf URL: https://arxiv.org/pdf/2503.08221
Copy Paste: [[2503.08221]] EgoBlind: Towards Egocentric Visual Assistance for the Blind People(https://arxiv.org/abs/2503.08221)
Keywords: large language model
Abstract: We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,210 videos that record the daily lives of real blind users from a first-person perspective. It also features 4,927 questions directly posed or generated and verified by blind individuals to reflect their needs for visual assistance under various scenarios. We provide each question with an average of 3 reference answers to alleviate subjective evaluation. Using EgoBlind, we comprehensively evaluate 15 leading MLLMs and find that all models struggle, with the best performers achieving accuracy around 56\%, far behind human performance of 87.4\%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and provide heuristic suggestions for improvement. With these efforts, we hope EgoBlind can serve as a valuable foundation for developing more effective AI assistants to enhance the independence of the blind individuals' lives.

Title: A Grey-box Text Attack Framework using Explainable AI

Authors: Esther Chiramal, Kelvin Soh Boon Kai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08226
Pdf URL: https://arxiv.org/pdf/2503.08226
Copy Paste: [[2503.08226]] A Grey-box Text Attack Framework using Explainable AI(https://arxiv.org/abs/2503.08226)
Keywords: attack, transformer
Abstract: Explainable AI is a strong strategy implemented to understand complex black-box model predictions in a human interpretable language. It provides the evidence required to execute the use of trustworthy and reliable AI systems. On the other hand, however, it also opens the door to locating possible vulnerabilities in an AI model. Traditional adversarial text attack uses word substitution, data augmentation techniques and gradient-based attacks on powerful pre-trained Bidirectional Encoder Representations from Transformers (BERT) variants to generate adversarial sentences. These attacks are generally whitebox in nature and not practical as they can be easily detected by humans E.g. Changing the word from "Poor" to "Rich". We proposed a simple yet effective Grey-box cum Black-box approach that does not require the knowledge of the model while using a set of surrogate Transformer/BERT models to perform the attack using Explainable AI techniques. As Transformers are the current state-of-the-art models for almost all Natural Language Processing (NLP) tasks, an attack generated from BERT1 is transferable to BERT2. This transferability is made possible due to the attention mechanism in the transformer that allows the model to capture long-range dependencies in a sequence. Using the power of BERT generalisation via attention, we attempt to exploit how transformers learn by attacking a few surrogate transformer variants which are all based on a different architecture. We demonstrate that this approach is highly effective to generate semantically good sentences by changing as little as one word that is not detectable by humans while still fooling other BERT models.

Title: Modeling Variants of Prompts for Vision-Language Models

Authors: Ao Li, Zongfang Liu, Xinhua Li, Jinghui Zhang, Pengwei Wang, Hu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08229
Pdf URL: https://arxiv.org/pdf/2503.08229
Copy Paste: [[2503.08229]] Modeling Variants of Prompts for Vision-Language Models(https://arxiv.org/abs/2503.08229)
Keywords: robust
Abstract: Large pre-trained vision-language models (VLMs) offer a promising approach to leveraging human language for enhancing downstream tasks. However, VLMs such as CLIP face significant limitation: its performance is highly sensitive to prompt template design. Although prompt learning methods can address the sensitivity issue by replacing natural language prompts with learnable ones, they are incomprehensible to humans. Ensuring consistent performance across various prompt templates enables models to adapt seamlessly to diverse phrasings, enhancing their ability to handle downstream tasks without requiring extensive prompt engineering. In this work, we introduce the RobustPrompt Benchmark, a systematic benchmark to evaluate robustness to different prompt templates for VLMs. It includes a dataset with hundreds of carefully designed prompt templates, divided into six types, covering a wide variety of commonly used templates. Beside the benchmark, we propose Modeling Variants of Prompts (MVP), a simple yet effective method that mitigates sensitivity by modeling variants of prompt structures. The innovation of MVP lies in decoupling prompts into templates and class names, and using Variational Autoencoders (VAE) to model the distribution of diverse prompt structures. Experiments across 11 datasets demonstrate that MVP can greatly enhance model robustness to variations in input prompts without a drop in performance. The code is available at this https URL.

Title: EnergyFormer: Energy Attention with Fourier Embedding for Hyperspectral Image Classification

Authors: Saad Sohail, Muhammad Usama, Usman Ghous, Manuel Mazzara, Salvatore Distefano, Muhammad Ahmad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08239
Pdf URL: https://arxiv.org/pdf/2503.08239
Copy Paste: [[2503.08239]] EnergyFormer: Energy Attention with Fourier Embedding for Hyperspectral Image Classification(https://arxiv.org/abs/2503.08239)
Keywords: extraction, transformer
Abstract: Hyperspectral imaging (HSI) provides rich spectral-spatial information across hundreds of contiguous bands, enabling precise material discrimination in applications such as environmental monitoring, agriculture, and urban analysis. However, the high dimensionality and spectral variability of HSI data pose significant challenges for feature extraction and classification. This paper presents EnergyFormer, a transformer-based framework designed to address these challenges through three key innovations: (1) Multi-Head Energy Attention (MHEA), which optimizes an energy function to selectively enhance critical spectral-spatial features, improving feature discrimination; (2) Fourier Position Embedding (FoPE), which adaptively encodes spectral and spatial dependencies to reinforce long-range interactions; and (3) Enhanced Convolutional Block Attention Module (ECBAM), which selectively amplifies informative wavelength bands and spatial structures, enhancing representation learning. Extensive experiments on the WHU-Hi-HanChuan, Salinas, and Pavia University datasets demonstrate that EnergyFormer achieves exceptional overall accuracies of 99.28\%, 98.63\%, and 98.72\%, respectively, outperforming state-of-the-art CNN, transformer, and Mamba-based models. The source code will be made available at this https URL.

Title: Tangentially Aligned Integrated Gradients for User-Friendly Explanations

Authors: Lachlan Simpson, Federico Costanza, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew
Subjects: cs.LG, math.DG
Abstract URL: https://arxiv.org/abs/2503.08240
Pdf URL: https://arxiv.org/pdf/2503.08240
Copy Paste: [[2503.08240]] Tangentially Aligned Integrated Gradients for User-Friendly Explanations(https://arxiv.org/abs/2503.08240)
Keywords: explainability
Abstract: Integrated gradients is prevalent within machine learning to address the black-box problem of neural networks. The explanations given by integrated gradients depend on a choice of base-point. The choice of base-point is not a priori obvious and can lead to drastically different explanations. There is a longstanding hypothesis that data lies on a low dimensional Riemannian manifold. The quality of explanations on a manifold can be measured by the extent to which an explanation for a point lies in its tangent space. In this work, we propose that the base-point should be chosen such that it maximises the tangential alignment of the explanation. We formalise the notion of tangential alignment and provide theoretical conditions under which a base-point choice will provide explanations lying in the tangent space. We demonstrate how to approximate the optimal base-point on several well-known image classification datasets. Furthermore, we compare the optimal base-point choice with common base-points and three gradient explainability models.

Title: Aligning Text to Image in Diffusion Models is Easier Than You Think

Authors: Jaa-Yeon Lee, Byunghee Cha, Jeongsol Kim, Jong Chul Ye
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08250
Pdf URL: https://arxiv.org/pdf/2503.08250
Copy Paste: [[2503.08250]] Aligning Text to Image in Diffusion Models is Easier Than You Think(https://arxiv.org/abs/2503.08250)
Keywords: diffusion, generative
Abstract: While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations still remains. Although many approaches have attempted to address this issue by fine-tuning models using various reward models, etc., we revisit the challenge from the perspective of representation alignment-an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, is suboptimal from the standpoint of representation alignment. Instead, a better alignment can be achieved through contrastive learning that leverages both positive and negative pairs. To achieve this efficiently even with pretrained models, we introduce a lightweight contrastive fine tuning strategy called SoftREPA that uses soft text tokens. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.

Title: SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models

Authors: Hesen Chen, Junyan Wang, Zhiyu Tan, Hao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08253
Pdf URL: https://arxiv.org/pdf/2503.08253
Copy Paste: [[2503.08253]] SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models(https://arxiv.org/abs/2503.08253)
Keywords: diffusion
Abstract: Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations and ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.

Title: SoK: A cloudy view on trust relationships of CVMs -- How Confidential Virtual Machines are falling short in Public Cloud

Authors: Jana Eisoldt, Anna Galanou, Andrey Ruzhanskiy, Nils Küchenmeister, Yewgenij Baburkin, Tianxiang Dai, Ivan Gudymenko, Stefan Köpsell, Rüdiger Kapitza
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2503.08256
Pdf URL: https://arxiv.org/pdf/2503.08256
Copy Paste: [[2503.08256]] SoK: A cloudy view on trust relationships of CVMs -- How Confidential Virtual Machines are falling short in Public Cloud(https://arxiv.org/abs/2503.08256)
Keywords: security, privacy, protect
Abstract: Confidential computing in the public cloud intends to safeguard workload privacy while outsourcing infrastructure management to a cloud provider. This is achieved by executing customer workloads within so called Trusted Execution Environments (TEEs), such as Confidential Virtual Machines (CVMs), which protect them from unauthorized access by cloud administrators and privileged system software. At the core of confidential computing lies remote attestation -- a mechanism that enables workload owners to verify the initial state of their workload and furthermore authenticate the underlying hardware. hile this represents a significant advancement in cloud security, this SoK critically examines the confidential computing offerings of market-leading cloud providers to assess whether they genuinely adhere to its core principles. We develop a taxonomy based on carefully selected criteria to systematically evaluate these offerings, enabling us to analyse the components responsible for remote attestation, the evidence provided at each stage, the extent of cloud provider influence and whether this undermines the threat model of confidential computing. Specifically, we investigate how CVMs are deployed in the public cloud infrastructures, the extent to which customers can request and verify attestation evidence, and their ability to define and enforce configuration and attestation requirements. This analysis provides insight into whether confidential computing guarantees -- namely confidentiality and integrity -- are genuinely upheld. Our findings reveal that all major cloud providers retain control over critical parts of the trusted software stack and, in some cases, intervene in the standard remote attestation process. This directly contradicts their claims of delivering confidential computing, as the model fundamentally excludes the cloud provider from the set of trusted entities.

Title: DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness

Authors: Yiming Zhong, Qi Jiang, Jingyi Yu, Yuexin Ma
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.08257
Pdf URL: https://arxiv.org/pdf/2503.08257
Copy Paste: [[2503.08257]] DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness(https://arxiv.org/abs/2503.08257)
Keywords: robust, diffusion, generative
Abstract: A dexterous hand capable of grasping any object is essential for the development of general-purpose embodied intelligent robots. However, due to the high degree of freedom in dexterous hands and the vast diversity of objects, generating high-quality, usable grasping poses in a robust manner is a significant challenge. In this paper, we introduce DexGrasp Anything, a method that effectively integrates physical constraints into both the training and sampling phases of a diffusion-based generative model, achieving state-of-the-art performance across nearly all open datasets. Additionally, we present a new dexterous grasping dataset containing over 3.4 million diverse grasping poses for more than 15k different objects, demonstrating its potential to advance universal dexterous grasping. The code of our method and our dataset will be publicly released soon.

Title: Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks

Authors: Junying Wang, Hongyuan Zhang, Yuan Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08269
Pdf URL: https://arxiv.org/pdf/2503.08269
Copy Paste: [[2503.08269]] Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks(https://arxiv.org/abs/2503.08269)
Keywords: privacy, protect, attack
Abstract: Recent Customized Portrait Generation (CPG) methods, taking a facial image and a textual prompt as inputs, have attracted substantial attention. Although these methods generate high-fidelity portraits, they fail to prevent the generated portraits from being tracked and misused by malicious face recognition systems. To address this, this paper proposes a Customized Portrait Generation framework with facial Adversarial attacks (Adv-CPG). Specifically, to achieve facial privacy protection, we devise a lightweight local ID encryptor and an encryption enhancer. They implement progressive double-layer encryption protection by directly injecting the target identity and adding additional identity guidance, respectively. Furthermore, to accomplish fine-grained and personalized portrait generation, we develop a multi-modal image customizer capable of generating controlled fine-grained facial features. To the best of our knowledge, Adv-CPG is the first study that introduces facial adversarial attacks into CPG. Extensive experiments demonstrate the superiority of Adv-CPG, e.g., the average attack success rate of the proposed Adv-CPG is 28.1% and 2.86% higher compared to the SOTA noise-based attack methods and unconstrained attack methods, respectively.

Title: LangTime: A Language-Guided Unified Model for Time Series Forecasting with Proximal Policy Optimization

Authors: Wenzhe Niu, Zongxia Xie, Yanru Sun, Wei He, Man Xu, Chao Hao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08271
Pdf URL: https://arxiv.org/pdf/2503.08271
Copy Paste: [[2503.08271]] LangTime: A Language-Guided Unified Model for Time Series Forecasting with Proximal Policy Optimization(https://arxiv.org/abs/2503.08271)
Keywords: large language model
Abstract: Recent research has shown an increasing interest in utilizing pre-trained large language models (LLMs) for a variety of time series applications. However, there are three main challenges when using LLMs as foundational models for time series forecasting: (1) Cross-domain generalization. (2) Cross-modality alignment. (3) Error accumulation in autoregressive frameworks. To address these challenges, we proposed LangTime, a language-guided unified model for time series forecasting that incorporates cross-domain pre-training with reinforcement learning-based fine-tuning. Specifically, LangTime constructs Temporal Comprehension Prompts (TCPs), which include dataset-wise and channel-wise instructions, to facilitate domain adaptation and condense time series into a single token, enabling LLMs to understand better and align temporal data. To improve autoregressive forecasting, we introduce TimePPO, a reinforcement learning-based fine-tuning algorithm. TimePPO mitigates error accumulation by leveraging a multidimensional rewards function tailored for time series and a repeat-based value estimation strategy. Extensive experiments demonstrate that LangTime achieves state-of-the-art cross-domain forecasting performance, while TimePPO fine-tuning effectively enhances the stability and accuracy of autoregressive forecasting.

Title: PromptLNet: Region-Adaptive Aesthetic Enhancement via Prompt Guidance in Low-Light Enhancement Net

Authors: Jun Yin, Yangfan He, Miao Zhang, Pengyu Zeng, Tianyi Wang, Shuai Lu, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08276
Pdf URL: https://arxiv.org/pdf/2503.08276
Copy Paste: [[2503.08276]] PromptLNet: Region-Adaptive Aesthetic Enhancement via Prompt Guidance in Low-Light Enhancement Net(https://arxiv.org/abs/2503.08276)
Keywords: diffusion, large language model
Abstract: Learning and improving large language models through human preference feedback has become a mainstream approach, but it has rarely been applied to the field of low-light image enhancement. Existing low-light enhancement evaluations typically rely on objective metrics (such as FID, PSNR, etc.), which often result in models that perform well objectively but lack aesthetic quality. Moreover, most low-light enhancement models are primarily designed for global brightening, lacking detailed refinement. Therefore, the generated images often require additional local adjustments, leading to research gaps in practical applications. To bridge this gap, we propose the following innovations: 1) We collect human aesthetic evaluation text pairs and aesthetic scores from multiple low-light image datasets (e.g., LOL, LOL2, LOM, DCIM, MEF, etc.) to train a low-light image aesthetic evaluation model, supplemented by an optimization algorithm designed to fine-tune the diffusion model. 2) We propose a prompt-driven brightness adjustment module capable of performing fine-grained brightness and aesthetic adjustments for specific instances or regions. 3) We evaluate our method alongside existing state-of-the-art algorithms on mainstream benchmarks. Experimental results show that our method not only outperforms traditional methods in terms of visual quality but also provides greater flexibility and controllability, paving the way for improved aesthetic quality.

Title: OminiControl2: Efficient Conditioning for Diffusion Transformers

Authors: Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08280
Pdf URL: https://arxiv.org/pdf/2503.08280
Copy Paste: [[2503.08280]] OminiControl2: Efficient Conditioning for Diffusion Transformers(https://arxiv.org/abs/2503.08280)
Keywords: diffusion, transformer
Abstract: Fine-grained control of text-to-image diffusion transformer models (DiT) remains a critical challenge for practical deployment. While recent advances such as OminiControl and others have enabled a controllable generation of diverse control signals, these methods face significant computational inefficiency when handling long conditional inputs. We present OminiControl2, an efficient framework that achieves efficient image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps. These architectural improvements preserve the original framework's parameter efficiency and multi-modal versatility while dramatically reducing computational costs. Our experiments demonstrate that OminiControl2 reduces conditional processing overhead by over 90% compared to its predecessor, achieving an overall 5.9$\times$ speedup in multi-conditional generation scenarios. This efficiency enables the practical implementation of complex, multi-modal control for high-quality image synthesis with DiT models.

Title: Neural cyberattacks applied to the vision under realistic visual stimuli

Authors: Victoria Magdalena López Madejska, Sergio López Bernal, Gregorio Martínez Pérez, Alberto Huertas Celdrán
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.08284
Pdf URL: https://arxiv.org/pdf/2503.08284
Copy Paste: [[2503.08284]] Neural cyberattacks applied to the vision under realistic visual stimuli(https://arxiv.org/abs/2503.08284)
Keywords: attack, robust
Abstract: Brain-Computer Interfaces (BCIs) are systems traditionally used in medicine and designed to interact with the brain to record or stimulate neurons. Despite their benefits, the literature has demonstrated that invasive BCIs focused on neurostimulation present vulnerabilities allowing attackers to gain control. In this context, neural cyberattacks emerged as threats able to disrupt spontaneous neural activity by performing neural overstimulation or inhibition. Previous work validated these attacks in small-scale simulations with a reduced number of neurons, lacking real-world complexity. Thus, this work tackles this limitation by analyzing the impact of two existing neural attacks, Neuronal Flooding (FLO) and Neuronal Jamming (JAM), on a complex neuronal topology of the primary visual cortex of mice consisting of approximately 230,000 neurons, tested on three realistic visual stimuli: flash effect, movie, and drifting gratings. Each attack was evaluated over three relevant events per stimulus, also testing the impact of attacking 25% and 50% of the neurons. The results, based on the number of spikes and shift percentages metrics, showed that the attacks caused the greatest impact on the movie, while dark and fixed events are the most robust. Although both attacks can significantly affect neural activity, JAM was generally more damaging, producing longer temporal delays, and had a larger prevalence. Finally, JAM did not require to alter many neurons to significantly affect neural activity, while the impact in FLO increased with the number of neurons attacked.

Title: SegDesicNet: Lightweight Semantic Segmentation in Remote Sensing with Geo-Coordinate Embeddings for Domain Adaptation

Authors: Sachin Verma, Frank Lindseth, Gabriel Kiss
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08290
Pdf URL: https://arxiv.org/pdf/2503.08290
Copy Paste: [[2503.08290]] SegDesicNet: Lightweight Semantic Segmentation in Remote Sensing with Geo-Coordinate Embeddings for Domain Adaptation(https://arxiv.org/abs/2503.08290)
Keywords: segmentation
Abstract: Semantic segmentation is essential for analyzing highdefinition remote sensing images (HRSIs) because it allows the precise classification of objects and regions at the pixel level. However, remote sensing data present challenges owing to geographical location, weather, and environmental variations, making it difficult for semantic segmentation models to generalize across diverse scenarios. Existing methods are often limited to specific data domains and require expert annotators and specialized equipment for semantic labeling. In this study, we propose a novel unsupervised domain adaptation technique for remote sensing semantic segmentation by utilizing geographical coordinates that are readily accessible in remote sensing setups as metadata in a dataset. To bridge the domain gap, we propose a novel approach that considers the combination of an imageś location encoding trait and the spherical nature of Earthś surface. Our proposed SegDesicNet module regresses the GRID positional encoding of the geo coordinates projected over the unit sphere to obtain the domain loss. Our experimental results demonstrate that the proposed SegDesicNet outperforms state of the art domain adaptation methods in remote sensing image segmentation, achieving an improvement of approximately ~6% in the mean intersection over union (MIoU) with a ~ 27\% drop in parameter count on benchmarked subsets of the publicly available FLAIR #1 dataset. We also benchmarked our method performance on the custom split of the ISPRS Potsdam dataset. Our algorithm seeks to reduce the modeling disparity between artificial neural networks and human comprehension of the physical world, making the technology more human centric and scalable.

Title: Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges

Authors: Xiaoxiao Liu, Qingying Xiao, Junying Chen, Xiangyi Feng, Xiangbo Wu, Bairui Zhang, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08292
Pdf URL: https://arxiv.org/pdf/2503.08292
Copy Paste: [[2503.08292]] Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges(https://arxiv.org/abs/2503.08292)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems. However, there is a lack of standardized evaluation criteria to assess their effectiveness, particularly in dynamic, interactive scenarios. In this study, we systematically examine the capabilities and limitations of LLMs in managing tasks within Intelligent Outpatient Referral (IOR) systems and propose a comprehensive evaluation framework specifically designed for such systems. This framework comprises two core tasks: static evaluation, which focuses on evaluating the ability of predefined outpatient referrals, and dynamic evaluation, which evaluates capabilities of refining outpatient referral recommendations through iterative dialogues. Our findings suggest that LLMs offer limited advantages over BERT-like models, but show promise in asking effective questions during interactive dialogues.

Title: A systematic literature review of unsupervised learning algorithms for anomalous traffic detection based on flows

Authors: Alberto Miguel-Diez, Adrián Campazas-Vega, Claudia Álvarez-Aparicio, Gonzalo Esteban-Costales, Ángel Manuel Guerrero-Higueras
Subjects: cs.CR, cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2503.08293
Pdf URL: https://arxiv.org/pdf/2503.08293
Copy Paste: [[2503.08293]] A systematic literature review of unsupervised learning algorithms for anomalous traffic detection based on flows(https://arxiv.org/abs/2503.08293)
Keywords: attack
Abstract: The constant increase of devices connected to the Internet, and therefore of cyber-attacks, makes it necessary to analyze network traffic in order to recognize malicious activity. Traditional packet-based analysis methods are insufficient because in large networks the amount of traffic is so high that it is unfeasible to review all communications. For this reason, flows is a suitable approach for this situation, which in future 5G networks will have to be used, as the number of packets will increase dramatically. If this is also combined with unsupervised learning models, it can detect new threats for which it has not been trained. This paper presents a systematic review of the literature on unsupervised learning algorithms for detecting anomalies in network flows, following the PRISMA guideline. A total of 63 scientific articles have been reviewed, analyzing 13 of them in depth. The results obtained show that autoencoder is the most used option, followed by SVM, ALAD, or SOM. On the other hand, all the datasets used for anomaly detection have been collected, including some specialised in IoT or with real data collected from honeypots.

Title: D3PO: Preference-Based Alignment of Discrete Diffusion Models

Authors: Umberto Borso, Davide Paglieri, Jude Wells, Tim Rocktäschel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08295
Pdf URL: https://arxiv.org/pdf/2503.08295
Copy Paste: [[2503.08295]] D3PO: Preference-Based Alignment of Discrete Diffusion Models(https://arxiv.org/abs/2503.08295)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved state-of-the-art performance across multiple domains, with recent advancements extending their applicability to discrete data. However, aligning discrete diffusion models with task-specific preferences remains challenging, particularly in scenarios where explicit reward functions are unavailable. In this work, we introduce Discrete Diffusion DPO (D3PO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. We validate D3PO on a structured binary sequence generation task, demonstrating that the method effectively aligns model outputs with preferences while maintaining structural validity. Our results highlight that D3PO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches. Future research will explore extending D3PO to more complex generative tasks, including language modeling and protein sequence generation, as well as investigating alternative noise schedules, such as uniform noising, to enhance flexibility across different applications.

Title: Privacy for Free: Leveraging Local Differential Privacy Perturbed Data from Multiple Services

Authors: Rong Du, Qingqing Ye, Yue Fu, Haibo Hu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.08297
Pdf URL: https://arxiv.org/pdf/2503.08297
Copy Paste: [[2503.08297]] Privacy for Free: Leveraging Local Differential Privacy Perturbed Data from Multiple Services(https://arxiv.org/abs/2503.08297)
Keywords: privacy, robust
Abstract: Local Differential Privacy (LDP) has emerged as a widely adopted privacy-preserving technique in modern data analytics, enabling users to share statistical insights while maintaining robust privacy guarantees. However, current LDP applications assume a single service gathering perturbed information from users. In reality, multiple services may be interested in collecting users' data, which poses privacy burdens to users as more such services emerge. To address this issue, this paper proposes a framework for collecting and aggregating data based on perturbed information from multiple services, regardless of their estimated statistics (e.g., mean or distribution) and perturbation mechanisms. Then for mean estimation, we introduce the Unbiased Averaging (UA) method and its optimized version, User-level Weighted Averaging (UWA). The former utilizes biased perturbed data, while the latter assigns weights to different perturbed results based on perturbation information, thereby achieving minimal variance. For distribution estimation, we propose the User-level Likelihood Estimation (ULE), which treats all perturbed results from a user as a whole for maximum likelihood estimation. Experimental results demonstrate that our framework and constituting methods significantly improve the accuracy of both mean and distribution estimation.

Title: Large Language Model as Meta-Surrogate for Data-Driven Many-Task Optimization: A Proof-of-Principle Study

Authors: Xian-Rong Zhang, Yue-Jiao Gong, Jun Zhang
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2503.08301
Pdf URL: https://arxiv.org/pdf/2503.08301
Copy Paste: [[2503.08301]] Large Language Model as Meta-Surrogate for Data-Driven Many-Task Optimization: A Proof-of-Principle Study(https://arxiv.org/abs/2503.08301)
Keywords: robust, large language model
Abstract: In many-task optimization scenarios, surrogate models are valuable for mitigating the computational burden of repeated fitness evaluations across tasks. This study proposes a novel meta-surrogate framework to assist many-task optimization, by leveraging the knowledge transfer strengths and emergent capabilities of large language models (LLMs). We formulate a unified framework for many-task fitness prediction, by defining a universal model with metadata to fit a group of problems. Fitness prediction is performed on metadata and decision variables, enabling efficient knowledge sharing across tasks and adaptability to new tasks. The LLM-based meta-surrogate treats fitness prediction as conditional probability estimation, employing a unified token sequence representation for task metadata, inputs, and outputs. This approach facilitates efficient inter-task knowledge sharing through shared token embeddings and captures complex task dependencies via multi-task model training. Experimental results demonstrate the model's emergent generalization ability, including zero-shot performance on problems with unseen dimensions. When integrated into evolutionary transfer optimization (ETO), our framework supports dual-level knowledge transfer -- at both the surrogate and individual levels -- enhancing optimization efficiency and robustness. This work establishes a novel foundation for applying LLMs in surrogate modeling, offering a versatile solution for many-task optimization.

Title: $^R$FLAV: Rolling Flow matching for infinite Audio Video generation

Authors: Alex Ergasti, Giuseppe Gabriele Tarollo, Filippo Botti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08307
Pdf URL: https://arxiv.org/pdf/2503.08307
Copy Paste: [[2503.08307]] $^R$FLAV: Rolling Flow matching for infinite Audio Video generation(https://arxiv.org/abs/2503.08307)
Keywords: transformer, generative
Abstract: Joint audio-video (AV) generation is still a significant challenge in generative AI, primarily due to three critical requirements: quality of the generated samples, seamless multimodal synchronization and temporal coherence, with audio tracks that match the visual data and vice versa, and limitless video duration. In this paper, we present \arch{}, a novel transformer-based architecture that addresses all the key challenges of AV generation. We explore three distinct cross modality interaction modules, with our lightweight temporal fusion module emerging as the most effective and computationally efficient approach for aligning audio and visual modalities. Our experimental results demonstrate that \arch{} outperforms existing state-of-the-art models in multimodal AV generation tasks. Our code and checkpoints are available at this https URL.

Title: i-WiViG: Interpretable Window Vision GNN

Authors: Ivica Obadic, Dmitry Kangin, Dario Oliveira, Plamen P Angelov, Xiao Xiang Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08321
Pdf URL: https://arxiv.org/pdf/2503.08321
Copy Paste: [[2503.08321]] i-WiViG: Interpretable Window Vision GNN(https://arxiv.org/abs/2503.08321)
Keywords: interpretability
Abstract: Deep learning models based on graph neural networks have emerged as a popular approach for solving computer vision problems. They encode the image into a graph structure and can be beneficial for efficiently capturing the long-range dependencies typically present in remote sensing imagery. However, an important drawback of these methods is their black-box nature which may hamper their wider usage in critical applications. In this work, we tackle the self-interpretability of the graph-based vision models by proposing our Interpretable Window Vision GNN (i-WiViG) approach, which provides explanations by automatically identifying the relevant subgraphs for the model prediction. This is achieved with window-based image graph processing that constrains the node receptive field to a local image region and by using a self-interpretable graph bottleneck that ranks the importance of the long-range relations between the image regions. We evaluate our approach to remote sensing classification and regression tasks, showing it achieves competitive performance while providing inherent and faithful explanations through the identified relations. Further, the quantitative evaluation reveals that our model reduces the infidelity of post-hoc explanations compared to other Vision GNN models, without sacrificing explanation sparsity.

Title: Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs

Authors: Hector Kohler, Quentin Delfosse, Waris Radji, Riad Akrour, Philippe Preux
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08322
Pdf URL: https://arxiv.org/pdf/2503.08322
Copy Paste: [[2503.08322]] Evaluating Interpretable Reinforcement Learning by Distilling Policies into Programs(https://arxiv.org/abs/2503.08322)
Keywords: interpretability
Abstract: There exist applications of reinforcement learning like medicine where policies need to be ''interpretable'' by humans. User studies have shown that some policy classes might be more interpretable than others. However, it is costly to conduct human studies of policy interpretability. Furthermore, there is no clear definition of policy interpretabiliy, i.e., no clear metrics for interpretability and thus claims depend on the chosen definition. We tackle the problem of empirically evaluating policies interpretability without humans. Despite this lack of clear definition, researchers agree on the notions of ''simulatability'': policy interpretability should relate to how humans understand policy actions given states. To advance research in interpretable reinforcement learning, we contribute a new methodology to evaluate policy interpretability. This new methodology relies on proxies for simulatability that we use to conduct a large-scale empirical evaluation of policy interpretability. We use imitation learning to compute baseline policies by distilling expert neural networks into small programs. We then show that using our methodology to evaluate the baselines interpretability leads to similar conclusions as user studies. We show that increasing interpretability does not necessarily reduce performances and can sometimes increase them. We also show that there is no policy class that better trades off interpretability and performance across tasks making it necessary for researcher to have methodologies for comparing policies interpretability.

Title: Towards Scalable and Cross-Lingual Specialist Language Models for Oncology

Authors: Morteza Rohanian, Tarun Mehra, Nicola Miglino, Farhad Nooralahzadeh, Michael Krauthammer, Andreas Wicki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08323
Pdf URL: https://arxiv.org/pdf/2503.08323
Copy Paste: [[2503.08323]] Towards Scalable and Cross-Lingual Specialist Language Models for Oncology(https://arxiv.org/abs/2503.08323)
Keywords: extraction, large language model
Abstract: Clinical oncology generates vast, unstructured data that often contain inconsistencies, missing information, and ambiguities, making it difficult to extract reliable insights for data-driven decision-making. General-purpose large language models (LLMs) struggle with these challenges due to their lack of domain-specific reasoning, including specialized clinical terminology, context-dependent interpretations, and multi-modal data integration. We address these issues with an oncology-specialized, efficient, and adaptable NLP framework that combines instruction tuning, retrieval-augmented generation (RAG), and graph-based knowledge integration. Our lightweight models prove effective at oncology-specific tasks, such as named entity recognition (e.g., identifying cancer diagnoses), entity linking (e.g., linking entities to standardized ontologies), TNM staging, document classification (e.g., cancer subtype classification from pathology reports), and treatment response prediction. Our framework emphasizes adaptability and resource efficiency. We include minimal German instructions, collected at the University Hospital Zurich (USZ), to test whether small amounts of non-English language data can effectively transfer knowledge across languages. This approach mirrors our motivation for lightweight models, which balance strong performance with reduced computational costs, making them suitable for resource-limited healthcare settings. We validated our models on oncology datasets, demonstrating strong results in named entity recognition, relation extraction, and document classification.

Title: Prototype-based Heterogeneous Federated Learning for Blade Icing Detection in Wind Turbines with Class Imbalanced Data

Authors: Lele Qi, Mengna Liu, Xu Cheng, Fan Shi, Xiufeng Liu, Shengyong Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08325
Pdf URL: https://arxiv.org/pdf/2503.08325
Copy Paste: [[2503.08325]] Prototype-based Heterogeneous Federated Learning for Blade Icing Detection in Wind Turbines with Class Imbalanced Data(https://arxiv.org/abs/2503.08325)
Keywords: privacy, federate
Abstract: Wind farms, typically in high-latitude regions, face a high risk of blade icing. Traditional centralized training methods raise serious privacy concerns. To enhance data privacy in detecting wind turbine blade icing, traditional federated learning (FL) is employed. However, data heterogeneity, resulting from collections across wind farms in varying environmental conditions, impacts the model's optimization capabilities. Moreover, imbalances in wind turbine data lead to models that tend to favor recognizing majority classes, thus neglecting critical icing anomalies. To tackle these challenges, we propose a federated prototype learning model for class-imbalanced data in heterogeneous environments to detect wind turbine blade icing. We also propose a contrastive supervised loss function to address the class imbalance problem. Experiments on real data from 20 turbines across two wind farms show our method outperforms five FL models and five class imbalance methods, with an average improvement of 19.64\% in $ mF_{\beta} $ and 5.73\% in $ m $BA compared to the second-best method, BiFL.

Title: MFRS: A Multi-Frequency Reference Series Approach to Scalable and Accurate Time-Series Forecasting

Authors: Liang Yu, Lai Tu, Xiang Bai
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2503.08328
Pdf URL: https://arxiv.org/pdf/2503.08328
Copy Paste: [[2503.08328]] MFRS: A Multi-Frequency Reference Series Approach to Scalable and Accurate Time-Series Forecasting(https://arxiv.org/abs/2503.08328)
Keywords: transformer
Abstract: Multivariate time-series forecasting holds immense value across diverse applications, requiring methods to effectively capture complex temporal and inter-variable dynamics. A key challenge lies in uncovering the intrinsic patterns that govern predictability, beyond conventional designs, focusing on network architectures to explore latent relationships or temporal dependencies. Inspired by signal decomposition, this paper posits that time series predictability is derived from periodic characteristics at different frequencies. Consequently, we propose a novel time series forecasting method based on multi-frequency reference series correlation analysis. Through spectral analysis on long-term training data, we identify dominant spectral components and their harmonics to design base-pattern reference series. Unlike signal decomposition, which represents the original series as a linear combination of basis signals, our method uses a transformer model to compute cross-attention between the original series and reference series, capturing essential features for forecasting. Experiments on major open and synthetic datasets show state-of-the-art performance. Furthermore, by focusing on attention with a small number of reference series rather than pairwise variable attention, our method ensures scalability and broad applicability. The source code is available at: this https URL

Title: MINT-Demo: Membership Inference Test Demonstrator

Authors: Daniel DeAlcala, Aythami Morales, Julian Fierrez, Gonzalo Mancera, Ruben Tolosana, Ruben Vera-Rodriguez
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08332
Pdf URL: https://arxiv.org/pdf/2503.08332
Copy Paste: [[2503.08332]] MINT-Demo: Membership Inference Test Demonstrator(https://arxiv.org/abs/2503.08332)
Keywords: membership infer
Abstract: We present the Membership Inference Test Demonstrator, to emphasize the need for more transparent machine learning training processes. MINT is a technique for experimentally determining whether certain data has been used during the training of machine learning models. We conduct experiments with popular face recognition models and 5 public databases containing over 22M images. Promising results, up to 89% accuracy are achieved, suggesting that it is possible to recognize if an AI model has been trained with specific data. Finally, we present a MINT platform as demonstrator of this technology aimed to promote transparency in AI training.

Title: Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos

Authors: Soumya Shamarao Jahagirdar, Jayasree Saha, C V Jawahar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08335
Pdf URL: https://arxiv.org/pdf/2503.08335
Copy Paste: [[2503.08335]] Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos(https://arxiv.org/abs/2503.08335)
Keywords: large language model
Abstract: Learning multimodal video understanding typically relies on datasets comprising video clips paired with manually annotated captions. However, this becomes even more challenging when dealing with long-form videos, lasting from minutes to hours, in educational and news domains due to the need for more annotators with subject expertise. Hence, there arises a need for automated solutions. Recent advancements in Large Language Models (LLMs) promise to capture concise and informative content that allows the comprehension of entire videos by leveraging Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) technologies. ASR provides textual content from audio, while OCR extracts textual content from specific frames. This paper introduces a dataset comprising long-form lectures and news videos. We present baseline approaches to understand their limitations on this dataset and advocate for exploring prompt engineering techniques to comprehend long-form multimodal video datasets comprehensively.

Title: Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework

Authors: Bin Huang, Binzhong He, Yanhan Chen, Zhili Liu, Xinyue Wang, Binxuan Li, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08339
Pdf URL: https://arxiv.org/pdf/2503.08339
Copy Paste: [[2503.08339]] Diffusion Transformer Meets Random Masks: An Advanced PET Reconstruction Framework(https://arxiv.org/abs/2503.08339)
Keywords: diffusion, transformer
Abstract: Deep learning has significantly advanced PET image re-construction, achieving remarkable improvements in image quality through direct training on sinogram or image data. Traditional methods often utilize masks for inpainting tasks, but their incorporation into PET reconstruction frameworks introduces transformative potential. In this study, we pro-pose an advanced PET reconstruction framework called Diffusion tRansformer mEets rAndom Masks (DREAM). To the best of our knowledge, this is the first work to integrate mask mechanisms into both the sinogram domain and the latent space, pioneering their role in PET reconstruction and demonstrating their ability to enhance reconstruction fidelity and efficiency. The framework employs a high-dimensional stacking approach, transforming masked data from two to three dimensions to expand the solution space and enable the model to capture richer spatial rela-tionships. Additionally, a mask-driven latent space is de-signed to accelerate the diffusion process by leveraging sinogram-driven and mask-driven compact priors, which reduce computational complexity while preserving essen-tial data characteristics. A hierarchical masking strategy is also introduced, guiding the model from focusing on fi-ne-grained local details in the early stages to capturing broader global patterns over time. This progressive ap-proach ensures a balance between detailed feature preservation and comprehensive context understanding. Experimental results demonstrate that DREAM not only improves the overall quality of reconstructed PET images but also preserves critical clinical details, highlighting its potential to advance PET imaging technology. By inte-grating compact priors and hierarchical masking, DREAM offers a promising and efficient avenue for future research and application in PET imaging. The open-source code is available at: this https URL.

Title: Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs

Authors: Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, Wanli Ouyang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08342
Pdf URL: https://arxiv.org/pdf/2503.08342
Copy Paste: [[2503.08342]] Attention Reallocation: Towards Zero-cost and Controllable Hallucination Mitigation of MLLMs(https://arxiv.org/abs/2503.08342)
Keywords: large language model
Abstract: Multi-Modal Large Language Models (MLLMs) stand out in various tasks but still struggle with hallucinations. While recent training-free mitigation methods mostly introduce additional inference overhead via retrospection strategy and contrastive decoding, we propose attention reallocation (AttnReal) to mitigate hallucinations with nearly zero extra cost. Our approach is motivated by the key observations that, MLLM's unreasonable attention distribution causes features to be dominated by historical output tokens, which further contributes to hallucinated responses because of the distribution gap between different token types. Based on the observations, AttnReal recycles excessive attention from output tokens and reallocates it to visual tokens, which reduces MLLM's reliance on language priors and ensures the decoding process depends more on the visual inputs. More interestingly, we find that, by controlling the intensity of AttnReal, we can achieve a wide-range trade-off between the response faithfulness and overall performance. Comprehensive results from different benchmarks validate the effectiveness of AttnReal across six open-source MLLMs and three decoding strategies.

Title: DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos

Authors: Lorenzo Mur-Labadia, Josechu Guerrero, Ruben Martinez-Cantin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08344
Pdf URL: https://arxiv.org/pdf/2503.08344
Copy Paste: [[2503.08344]] DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos(https://arxiv.org/abs/2503.08344)
Keywords: segmentation
Abstract: Environment understanding in egocentric videos is an important step for applications like robotics, augmented reality and assistive technologies. These videos are characterized by dynamic interactions and a strong dependence on the wearer engagement with the environment. Traditional approaches often focus on isolated clips or fail to integrate rich semantic and geometric information, limiting scene comprehension. We introduce Dynamic Image-Video Feature Fields (DIV FF), a framework that decomposes the egocentric scene into persistent, dynamic, and actor based components while integrating both image and video language features. Our model enables detailed segmentation, captures affordances, understands the surroundings and maintains consistent understanding over time. DIV-FF outperforms state-of-the-art methods, particularly in dynamically evolving scenarios, demonstrating its potential to advance long term, spatio temporal scene understanding.

Title: Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis

Authors: Chanyoung Kim, Dayun Ju, Jinyeong Kim, Woojung Han, Roberto Alcover-Couso, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08346
Pdf URL: https://arxiv.org/pdf/2503.08346
Copy Paste: [[2503.08346]] Pathology-Aware Adaptive Watermarking for Text-Driven Medical Image Synthesis(https://arxiv.org/abs/2503.08346)
Keywords: robust, watermark, diffusion
Abstract: As recent text-conditioned diffusion models have enabled the generation of high-quality images, concerns over their potential misuse have also grown. This issue is critical in the medical domain, where text-conditioned generated medical images could enable insurance fraud or falsified records, highlighting the urgent need for reliable safeguards against unethical use. While watermarking techniques have emerged as a promising solution in general image domains, their direct application to medical imaging presents significant challenges. A key challenge is preserving fine-grained disease manifestations, as even minor distortions from a watermark may lead to clinical misinterpretation, which compromises diagnostic integrity. To overcome this gap, we present MedSign, a deep learning-based watermarking framework specifically designed for text-to-medical image synthesis, which preserves pathologically significant regions by adaptively adjusting watermark strength. Specifically, we generate a pathology localization map using cross-attention between medical text tokens and the diffusion denoising network, aggregating token-wise attention across layers, heads, and time steps. Leveraging this map, we optimize the LDM decoder to incorporate watermarking during image synthesis, ensuring cohesive integration while minimizing interference in diagnostically critical regions. Experimental results show that our MedSign preserves diagnostic integrity while ensuring watermark robustness, achieving state-of-the-art performance in image quality and detection accuracy on MIMIC-CXR and OIA-ODIR datasets.

Title: Design and Implementation of FourCropNet: A CNN-Based System for Efficient Multi-Crop Disease Detection and Management

Authors: H. P. Khandagale, Sangram Patil, V. S. Gavali, S. V. Chavan, P. P. Halkarnikar, Prateek A. Meshram
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08348
Pdf URL: https://arxiv.org/pdf/2503.08348
Copy Paste: [[2503.08348]] Design and Implementation of FourCropNet: A CNN-Based System for Efficient Multi-Crop Disease Detection and Management(https://arxiv.org/abs/2503.08348)
Keywords: security, robust, extraction
Abstract: Plant disease detection is a critical task in agriculture, directly impacting crop yield, food security, and sustainable farming practices. This study proposes FourCropNet, a novel deep learning model designed to detect diseases in multiple crops, including CottonLeaf, Grape, Soybean, and Corn. The model leverages an advanced architecture comprising residual blocks for efficient feature extraction, attention mechanisms to enhance focus on disease-relevant regions, and lightweight layers for computational efficiency. These components collectively enable FourCropNet to achieve superior performance across varying datasets and class complexities, from single-crop datasets to combined datasets with 15 classes. The proposed model was evaluated on diverse datasets, demonstrating high accuracy, specificity, sensitivity, and F1 scores. Notably, FourCropNet achieved the highest accuracy of 99.7% for Grape, 99.5% for Corn, and 95.3% for the combined dataset. Its scalability and ability to generalize across datasets underscore its robustness. Comparative analysis shows that FourCropNet consistently outperforms state-of-the-art models such as MobileNet, VGG16, and EfficientNet across various metrics. FourCropNet's innovative design and consistent performance make it a reliable solution for real-time disease detection in agriculture. This model has the potential to assist farmers in timely disease diagnosis, reducing economic losses and promoting sustainable agricultural practices.

Title: Robust Latent Matters: Boosting Image Generation with Sampling Error

Authors: Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08354
Pdf URL: https://arxiv.org/pdf/2503.08354
Copy Paste: [[2503.08354]] Robust Latent Matters: Boosting Image Generation with Sampling Error(https://arxiv.org/abs/2503.08354)
Keywords: robust, generative
Abstract: Recent image generation schemes typically capture image distribution in a pre-constructed latent space relying on a frozen image tokenizer. Though the performance of tokenizer plays an essential role to the successful generation, its current evaluation metrics (e.g. rFID) fail to precisely assess the tokenizer and correlate its performance to the generation quality (e.g. gFID). In this paper, we comprehensively analyze the reason for the discrepancy of reconstruction and generation qualities in a discrete latent space, and, from which, we propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., the unexpected tokens sampled, from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates the tokenizer performance to generation quality and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of tokenizer thus boosting the generation quality and convergence speed. Extensive benchmarking are conducted with 11 advanced discrete image tokenizers with 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieve a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a $\sim$400M generator. Code: this https URL.

Title: Debiased Prompt Tuning in Vision-Language Model without Annotations

Authors: Chaoquan Jiang, Yunfan Yang, Rui Hu, Jitao Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08368
Pdf URL: https://arxiv.org/pdf/2503.08368
Copy Paste: [[2503.08368]] Debiased Prompt Tuning in Vision-Language Model without Annotations(https://arxiv.org/abs/2503.08368)
Keywords: robust
Abstract: Prompt tuning of Vision-Language Models (VLMs) such as CLIP, has demonstrated the ability to rapidly adapt to various downstream tasks. However, recent studies indicate that tuned VLMs may suffer from the problem of spurious correlations, where the model relies on spurious features (e.g. background and gender) in the data. This may lead to the model having worse robustness in out-of-distribution data. Standard methods for eliminating spurious correlation typically require us to know the spurious attribute labels of each sample, which is hard in the real world. In this work, we explore improving the group robustness of prompt tuning in VLMs without relying on manual annotation of spurious features. We notice the zero - shot image recognition ability of VLMs and use this ability to identify spurious features, thus avoiding the cost of manual annotation. By leveraging pseudo-spurious attribute annotations, we further propose a method to automatically adjust the training weights of different groups. Extensive experiments show that our approach efficiently improves the worst-group accuracy on CelebA, Waterbirds, and MetaShift datasets, achieving the best robustness gap between the worst-group accuracy and the overall accuracy.

Title: nnInteractive: Redefining 3D Promptable Segmentation

Authors: Fabian Isensee, Maximilian Rokuss, Lars Krämer, Stefan Dinkelacker, Ashis Ravindran, Florian Stritzke, Benjamin Hamm, Tassilo Wald, Moritz Langenberg, Constantin Ulrich, Jonathan Deissler, Ralf Floca, Klaus Maier-Hein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08373
Pdf URL: https://arxiv.org/pdf/2503.08373
Copy Paste: [[2503.08373]] nnInteractive: Redefining 3D Promptable Segmentation(https://arxiv.org/abs/2503.08373)
Keywords: segmentation
Abstract: Accurate and efficient 3D segmentation is essential for both clinical and research applications. While foundation models like SAM have revolutionized interactive segmentation, their 2D design and domain shift limitations make them ill-suited for 3D medical images. Current adaptations address some of these challenges but remain limited, either lacking volumetric awareness, offering restricted interactivity, or supporting only a small set of structures and modalities. Usability also remains a challenge, as current tools are rarely integrated into established imaging platforms and often rely on cumbersome web-based interfaces with restricted functionality. We introduce nnInteractive, the first comprehensive 3D interactive open-set segmentation method. It supports diverse prompts-including points, scribbles, boxes, and a novel lasso prompt-while leveraging intuitive 2D interactions to generate full 3D segmentations. Trained on 120+ diverse volumetric 3D datasets (CT, MRI, PET, 3D Microscopy, etc.), nnInteractive sets a new state-of-the-art in accuracy, adaptability, and usability. Crucially, it is the first method integrated into widely used image viewers (e.g., Napari, MITK), ensuring broad accessibility for real-world clinical and research applications. Extensive benchmarking demonstrates that nnInteractive far surpasses existing methods, setting a new standard for AI-driven interactive 3D segmentation. nnInteractive is publicly available: this https URL (Napari plugin), this https URL (MITK integration), this https URL (Python backend).

Title: Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens

Authors: Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08377
Pdf URL: https://arxiv.org/pdf/2503.08377
Copy Paste: [[2503.08377]] Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens(https://arxiv.org/abs/2503.08377)
Keywords: diffusion, transformer
Abstract: Image tokenization has significantly advanced visual generation and multimodal modeling, particularly when paired with autoregressive models. However, current methods face challenges in balancing efficiency and fidelity: high-resolution image reconstruction either requires an excessive number of tokens or compromises critical details through token reduction. To resolve this, we propose Latent Consistency Tokenizer (Layton) that bridges discrete visual tokens with the compact latent space of pre-trained Latent Diffusion Models (LDMs), enabling efficient representation of 1024x1024 images using only 256 tokens-a 16 times compression over VQGAN. Layton integrates a transformer encoder, a quantized codebook, and a latent consistency decoder. Direct application of LDM as the decoder results in color and brightness discrepancies. Thus, we convert it to latent consistency decoder, reducing multi-step sampling to 1-2 steps for direct pixel-level supervision. Experiments demonstrate Layton's superiority in high-fidelity reconstruction, with 10.8 reconstruction Frechet Inception Distance on MSCOCO-2017 5K benchmark for 1024x1024 image reconstruction. We also extend Layton to a text-to-image generation model, LaytonGen, working in autoregression. It achieves 0.73 score on GenEval benchmark, surpassing current state-of-the-art methods. The code and model will be released.

Title: Twinner: Shining Light on Digital Twins in a Few Snaps

Authors: Jesus Zarzar, Tom Monnier, Roman Shapovalov, Andrea Vedaldi, David Novotny
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08382
Pdf URL: https://arxiv.org/pdf/2503.08382
Copy Paste: [[2503.08382]] Twinner: Shining Light on Digital Twins in a Few Snaps(https://arxiv.org/abs/2503.08382)
Keywords: transformer
Abstract: We present the first large reconstruction model, Twinner, capable of recovering a scene's illumination as well as an object's geometry and material properties from only a few posed images. Twinner is based on the Large Reconstruction Model and innovates in three key ways: 1) We introduce a memory-efficient voxel-grid transformer whose memory scales only quadratically with the size of the voxel grid. 2) To deal with scarcity of high-quality ground-truth PBR-shaded models, we introduce a large fully-synthetic dataset of procedurally-generated PBR-textured objects lit with varied illumination. 3) To narrow the synthetic-to-real gap, we finetune the model on real life datasets by means of a differentiable physically-based shading model, eschewing the need for ground-truth illumination or material properties which are challenging to obtain in real life. We demonstrate the efficacy of our model on the real life StanfordORB benchmark where, given few input views, we achieve reconstruction quality significantly superior to existing feedforward reconstruction networks, and comparable to significantly slower per-scene optimization methods.

Title: Recognition-Synergistic Scene Text Editing

Authors: Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, Wenjie Pei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08387
Pdf URL: https://arxiv.org/pdf/2503.08387
Copy Paste: [[2503.08387]] Recognition-Synergistic Scene Text Editing(https://arxiv.org/abs/2503.08387)
Keywords: transformer
Abstract: Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, \mymodel achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at this https URL.

Title: V-Max: Making RL practical for Autonomous Driving

Authors: Valentin Charraut, Thomas Tournaire, Waël Doulazmi, Thibault Buhet
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.08388
Pdf URL: https://arxiv.org/pdf/2503.08388
Copy Paste: [[2503.08388]] V-Max: Making RL practical for Autonomous Driving(https://arxiv.org/abs/2503.08388)
Keywords: transformer
Abstract: Learning-based decision-making has the potential to enable generalizable Autonomous Driving (AD) policies, reducing the engineering overhead of rule-based approaches. Imitation Learning (IL) remains the dominant paradigm, benefiting from large-scale human demonstration datasets, but it suffers from inherent limitations such as distribution shift and imitation gaps. Reinforcement Learning (RL) presents a promising alternative, yet its adoption in AD remains limited due to the lack of standardized and efficient research frameworks. To this end, we introduce V-Max, an open research framework providing all the necessary tools to make RL practical for AD. V-Max is built on Waymax, a hardware-accelerated AD simulator designed for large-scale experimentation. We extend it using ScenarioNet's approach, enabling the fast simulation of diverse AD datasets. V-Max integrates a set of observation and reward functions, transformer-based encoders, and training pipelines. Additionally, it includes adversarial evaluation settings and an extensive set of evaluation metrics. Through a large-scale benchmark, we analyze how network architectures, observation functions, training data, and reward shaping impact RL performance.

Title: DyArtbank: Diverse Artistic Style Transfer via Pre-trained Stable Diffusion and Dynamic Style Prompt Artbank

Authors: Zhanjie Zhang, Quanwei Zhang, Guangyuan Li, Junsheng Luan, Mengyuan Yang, Yun Wang, Lei Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08392
Pdf URL: https://arxiv.org/pdf/2503.08392
Copy Paste: [[2503.08392]] DyArtbank: Diverse Artistic Style Transfer via Pre-trained Stable Diffusion and Dynamic Style Prompt Artbank(https://arxiv.org/abs/2503.08392)
Keywords: diffusion
Abstract: Artistic style transfer aims to transfer the learned style onto an arbitrary content image. However, most existing style transfer methods can only render consistent artistic stylized images, making it difficult for users to get enough stylized images to enjoy. To solve this issue, we propose a novel artistic style transfer framework called DyArtbank, which can generate diverse and highly realistic artistic stylized images. Specifically, we introduce a Dynamic Style Prompt ArtBank (DSPA), a set of learnable parameters. It can learn and store the style information from the collection of artworks, dynamically guiding pre-trained stable diffusion to generate diverse and highly realistic artistic stylized images. DSPA can also generate random artistic image samples with the learned style information, providing a new idea for data augmentation. Besides, a Key Content Feature Prompt (KCFP) module is proposed to provide sufficient content prompts for pre-trained stable diffusion to preserve the detailed structure of the input content image. Extensive qualitative and quantitative experiments verify the effectiveness of our proposed method. Code is available: this https URL

Title: OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning

Authors: Jiawei Zhou, Lei Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2503.08398
Pdf URL: https://arxiv.org/pdf/2503.08398
Copy Paste: [[2503.08398]] OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning(https://arxiv.org/abs/2503.08398)
Keywords: large language model
Abstract: In this paper, we analyze and empirically show that the learned relevance for conventional information retrieval (IR) scenarios may be inconsistent in retrieval-augmented generation (RAG) scenarios. To bridge this gap, we introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context relevance, enabling adaptation to the diverse and evolving needs. Extensive experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, leads to a consistent improvement of 4.0% over the original retriever, consistently outperforming existing state-of-the-art retrievers by 2.1%. Additionally, our results indicate that for some tasks, an end-to-end tuned 0.2B retriever can achieve improvements that surpass those of RAG-oriented or instruction-tuned 8B large language models (LLMs), highlighting the cost-effectiveness of our approach in enhancing RAG systems.

Title: Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information

Authors: Elizaveta Kuznetsova, Ilaria Vitulano, Mykola Makhortykh, Martha Stolze, Tomas Nagy, Victoria Vziatysheva
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2503.08404
Pdf URL: https://arxiv.org/pdf/2503.08404
Copy Paste: [[2503.08404]] Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information(https://arxiv.org/abs/2503.08404)
Keywords: generative, large language model
Abstract: The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking and contribute to the broader debate on the use of automated means for veracity identification. To achieve this purpose, we use AI auditing methodology that systematically evaluates performance of five LLMs (ChatGPT 4, Llama 3 (70B), Llama 3.1 (405B), Claude 3.5 Sonnet, and Google Gemini) using prompts regarding a large set of statements fact-checked by professional journalists (16,513). Specifically, we use topic modeling and regression analysis to investigate which factors (e.g. topic of the prompt or the LLM type) affect evaluations of true, false, and mixed statements. Our findings reveal that while ChatGPT 4 and Google Gemini achieved higher accuracy than other models, overall performance across models remains modest. Notably, the results indicate that models are better at identifying false statements, especially on sensitive topics such as COVID-19, American political controversies, and social issues, suggesting possible guardrails that may enhance accuracy on these topics. The major implication of our findings is that there are significant challenges for using LLMs for factchecking, including significant variation in performance across different LLMs and unequal quality of outputs for specific topics which can be attributed to deficits of training data. Our research highlights the potential and limitations of LLMs in political fact-checking, suggesting potential avenues for further improvements in guardrails as well as fine-tuning.

Title: WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images

Authors: Yansong Guo, Jie Hu, Yansong Qu, Liujuan Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08407
Pdf URL: https://arxiv.org/pdf/2503.08407
Copy Paste: [[2503.08407]] WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images(https://arxiv.org/abs/2503.08407)
Keywords: robust, segmentation
Abstract: Recent advances in interactive 3D segmentation from 2D images have demonstrated impressive performance. However, current models typically require extensive scene-specific training to accurately reconstruct and segment objects, which limits their applicability in real-time scenarios. In this paper, we introduce WildSeg3D, an efficient approach that enables the segmentation of arbitrary 3D objects across diverse environments using a feed-forward mechanism. A key challenge of this feed-forward approach lies in the accumulation of 3D alignment errors across multiple 2D views, which can lead to inaccurate 3D segmentation results. To address this issue, we propose Dynamic Global Aligning (DGA), a technique that improves the accuracy of global multi-view alignment by focusing on difficult-to-match 3D points across images, using a dynamic adjustment function. Additionally, for real-time interactive segmentation, we introduce Multi-view Group Mapping (MGM), a method that utilizes an object mask cache to integrate multi-view segmentations and respond rapidly to user prompts. WildSeg3D demonstrates robust generalization across arbitrary scenes, thereby eliminating the need for scene-specific training. Specifically, WildSeg3D not only attains the accuracy of state-of-the-art (SOTA) methods but also achieves a $40\times$ speedup compared to existing SOTA models. Our code will be publicly available.

Title: Generalizable and Explainable Deep Learning for Medical Image Computing: An Overview

Authors: Ahmad Chaddad, Yan Hu, Yihang Wu, Binbin Wen, Reem Kateb
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08420
Pdf URL: https://arxiv.org/pdf/2503.08420
Copy Paste: [[2503.08420]] Generalizable and Explainable Deep Learning for Medical Image Computing: An Overview(https://arxiv.org/abs/2503.08420)
Keywords: robust, explainability
Abstract: Objective. This paper presents an overview of generalizable and explainable artificial intelligence (XAI) in deep learning (DL) for medical imaging, aimed at addressing the urgent need for transparency and explainability in clinical applications. Methodology. We propose to use four CNNs in three medical datasets (brain tumor, skin cancer, and chest x-ray) for medical image classification tasks. In addition, we perform paired t-tests to show the significance of the differences observed between different methods. Furthermore, we propose to combine ResNet50 with five common XAI techniques to obtain explainable results for model prediction, aiming at improving model transparency. We also involve a quantitative metric (confidence increase) to evaluate the usefulness of XAI techniques. Key findings. The experimental results indicate that ResNet50 can achieve feasible accuracy and F1 score in all datasets (e.g., 86.31\% accuracy in skin cancer). Furthermore, the findings show that while certain XAI methods, such as XgradCAM, effectively highlight relevant abnormal regions in medical images, others, like EigenGradCAM, may perform less effectively in specific scenarios. In addition, XgradCAM indicates higher confidence increase (e.g., 0.12 in glioma tumor) compared to GradCAM++ (0.09) and LayerCAM (0.08). Implications. Based on the experimental results and recent advancements, we outline future research directions to enhance the robustness and generalizability of DL models in the field of biomedical imaging.

Title: Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing

Authors: Chen Liao, Yan Shen, Dan Li, Zhongli Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08429
Pdf URL: https://arxiv.org/pdf/2503.08429
Copy Paste: [[2503.08429]] Using Powerful Prior Knowledge of Diffusion Model in Deep Unfolding Networks for Image Compressive Sensing(https://arxiv.org/abs/2503.08429)
Keywords: diffusion
Abstract: Recently, Deep Unfolding Networks (DUNs) have achieved impressive reconstruction quality in the field of image Compressive Sensing (CS) by unfolding iterative optimization algorithms into neural networks. The reconstruction quality of DUNs depends on the learned prior knowledge, so introducing stronger prior knowledge can further improve reconstruction quality. On the other hand, pre-trained diffusion models contain powerful prior knowledge and have a solid theoretical foundation and strong scalability, but it requires a large number of iterative steps to achieve reconstruction. In this paper, we propose to use the powerful prior knowledge of pre-trained diffusion model in DUNs to achieve high-quality reconstruction with less steps for image CS. Specifically, we first design an iterative optimization algorithm named Diffusion Message Passing (DMP), which embeds a pre-trained diffusion model into each iteration process of DMP. Then, we deeply unfold the DMP algorithm into a neural network named DMP-DUN. The proposed DMP-DUN can use lightweight neural networks to achieve mapping from measurement data to the intermediate steps of the reverse diffusion process and directly approximate the divergence of the diffusion model, thereby further improving reconstruction efficiency. Extensive experiments show that our proposed DMP-DUN achieves state-of-the-art performance and requires at least only 2 steps to reconstruct the image. Codes are available at this https URL.

Title: Controlling Latent Diffusion Using Latent CLIP

Authors: Jason Becker, Chris Wendler, Peter Baylies, Robert West, Christian Wressnegger
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.08455
Pdf URL: https://arxiv.org/pdf/2503.08455
Copy Paste: [[2503.08455]] Controlling Latent Diffusion Using Latent CLIP(https://arxiv.org/abs/2503.08455)
Keywords: diffusion
Abstract: Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational costs. However, while the diffusion process has moved to the latent space, the contrastive language-image pre-training (CLIP) models, as used in many image processing tasks, still operate in pixel space. Doing so requires costly VAE-decoding of latent images before they can be processed. In this paper, we introduce Latent-CLIP, a CLIP model that operates directly in the latent space. We train Latent-CLIP on 2.7B pairs of latent images and descriptive texts, and show that it matches zero-shot classification performance of similarly sized CLIP models on both the ImageNet benchmark and a LDM-generated version of it, demonstrating its effectiveness in assessing both real and generated content. Furthermore, we construct Latent-CLIP rewards for reward-based noise optimization (ReNO) and show that they match the performance of their CLIP counterparts on GenEval and T2I-CompBench while cutting the cost of the total pipeline by 21%. Finally, we use Latent-CLIP to guide generation away from harmful content, achieving strong performance on the inappropriate image prompts (I2P) benchmark and a custom evaluation, without ever requiring the costly step of decoding intermediate images.

Title: TrackOcc: Camera-based 4D Panoptic Occupancy Tracking

Authors: Zhuoguang Chen, Kenan Li, Xiuyu Yang, Tao Jiang, Yiming Li, Hang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08471
Pdf URL: https://arxiv.org/pdf/2503.08471
Copy Paste: [[2503.08471]] TrackOcc: Camera-based 4D Panoptic Occupancy Tracking(https://arxiv.org/abs/2503.08471)
Keywords: segmentation
Abstract: Comprehensive and consistent dynamic scene understanding from camera input is essential for advanced autonomous systems. Traditional camera-based perception tasks like 3D object tracking and semantic occupancy prediction lack either spatial comprehensiveness or temporal consistency. In this work, we introduce a brand-new task, Camera-based 4D Panoptic Occupancy Tracking, which simultaneously addresses panoptic occupancy segmentation and object tracking from camera-only input. Furthermore, we propose TrackOcc, a cutting-edge approach that processes image inputs in a streaming, end-to-end manner with 4D panoptic queries to address the proposed task. Leveraging the localization-aware loss, TrackOcc enhances the accuracy of 4D panoptic occupancy tracking without bells and whistles. Experimental results demonstrate that our method achieves state-of-the-art performance on the Waymo dataset. The source code will be released at this https URL.

Title: NullFace: Training-Free Localized Face Anonymization

Authors: Han-Wei Kung, Tuomas Varanka, Terence Sim, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08478
Pdf URL: https://arxiv.org/pdf/2503.08478
Copy Paste: [[2503.08478]] NullFace: Training-Free Localized Face Anonymization(https://arxiv.org/abs/2503.08478)
Keywords: privacy, robust, diffusion
Abstract: Privacy concerns around ever increasing number of cameras are increasing in today's digital age. Although existing anonymization methods are able to obscure identity information, they often struggle to preserve the utility of the images. In this work, we introduce a training-free method for face anonymization that preserves key non-identity-related attributes. Our approach utilizes a pre-trained text-to-image diffusion model without requiring optimization or training. It begins by inverting the input image to recover its initial noise. The noise is then denoised through an identity-conditioned diffusion process, where modified identity embeddings ensure the anonymized face is distinct from the original identity. Our approach also supports localized anonymization, giving users control over which facial regions are anonymized or kept intact. Comprehensive evaluations against state-of-the-art methods show our approach excels in anonymization, attribute preservation, and image quality. Its flexibility, robustness, and practicality make it well-suited for real-world applications. Code and data can be found at this https URL .

Title: Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum

Authors: Shengpeng Xiao, Yuanfang Guo, Heqi Peng, Zeming Liu, Liang Yang, Yunhong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08484
Pdf URL: https://arxiv.org/pdf/2503.08484
Copy Paste: [[2503.08484]] Generalizable AI-Generated Image Detection Based on Fractal Self-Similarity in the Spectrum(https://arxiv.org/abs/2503.08484)
Keywords: diffusion, generative
Abstract: The generalization performance of AI-generated image detection remains a critical challenge. Although most existing methods perform well in detecting images from generative models included in the training set, their accuracy drops significantly when faced with images from unseen generators. To address this limitation, we propose a novel detection method based on the fractal self-similarity of the spectrum, a common feature among images generated by different models. Specifically, we demonstrate that AI-generated images exhibit fractal-like spectral growth through periodic extension and low-pass filtering. This observation motivates us to exploit the similarity among different fractal branches of the spectrum. Instead of directly analyzing the spectrum, our method mitigates the impact of varying spectral characteristics across different generators, improving detection performance for images from unseen models. Experiments on a public benchmark demonstrated the generalized detection performance across both GANs and diffusion models.

Title: A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training

Authors: Chengcheng Yan, Jiawei Xu, Qingsong Wang, Zheng Peng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08489
Pdf URL: https://arxiv.org/pdf/2503.08489
Copy Paste: [[2503.08489]] A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training(https://arxiv.org/abs/2503.08489)
Keywords: robust
Abstract: The stochastic gradient descent (SGD) algorithm has achieved remarkable success in training deep learning models. However, it has several limitations, including susceptibility to vanishing gradients, sensitivity to input data, and a lack of robust theoretical guarantees. In recent years, alternating minimization (AM) methods have emerged as a promising alternative for model training by employing gradient-free approaches to iteratively update model parameters. Despite their potential, these methods often exhibit slow convergence rates. To address this challenge, we propose a novel Triple-Inertial Accelerated Alternating Minimization (TIAM) framework for neural network training. The TIAM approach incorporates a triple-inertial acceleration strategy with a specialized approximation method, facilitating targeted acceleration of different terms in each sub-problem optimization. This integration improves the efficiency of convergence, achieving superior performance with fewer iterations. Additionally, we provide a convergence analysis of the TIAM algorithm, including its global convergence properties and convergence rate. Extensive experiments validate the effectiveness of the TIAM method, showing significant improvements in generalization capability and computational efficiency compared to existing approaches, particularly when applied to the rectified linear unit (ReLU) and its variants.

Title: Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models

Authors: Han Cao, Lingwei Wei, Wei Zhou, Songlin Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08495
Pdf URL: https://arxiv.org/pdf/2503.08495
Copy Paste: [[2503.08495]] Enhancing Multi-Hop Fact Verification with Structured Knowledge-Augmented Large Language Models(https://arxiv.org/abs/2503.08495)
Keywords: large language model
Abstract: The rapid development of social platforms exacerbates the dissemination of misinformation, which stimulates the research in fact verification. Recent studies tend to leverage semantic features to solve this problem as a single-hop task. However, the process of verifying a claim requires several pieces of evidence with complicated inner logic and relations to verify the given claim in real-world situations. Recent studies attempt to improve both understanding and reasoning abilities to enhance the performance, but they overlook the crucial relations between entities that benefit models to understand better and facilitate the prediction. To emphasize the significance of relations, we resort to Large Language Models (LLMs) considering their excellent understanding ability. Instead of other methods using LLMs as the predictor, we take them as relation extractors, for they do better in understanding rather than reasoning according to the experimental results. Thus, to solve the challenges above, we propose a novel Structured Knowledge-Augmented LLM-based Network (LLM-SKAN) for multi-hop fact verification. Specifically, we utilize an LLM-driven Knowledge Extractor to capture fine-grained information, including entities and their complicated relations. Besides, we leverage a Knowledge-Augmented Relation Graph Fusion module to interact with each node and learn better claim-evidence representations comprehensively. The experimental results on four common-used datasets demonstrate the effectiveness and superiority of our model.

Title: Learning to Match Unpaired Data with Minimum Entropy Coupling

Authors: Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08501
Pdf URL: https://arxiv.org/pdf/2503.08501
Copy Paste: [[2503.08501]] Learning to Match Unpaired Data with Minimum Entropy Coupling(https://arxiv.org/abs/2503.08501)
Keywords: diffusion, generative
Abstract: Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which is a significant challenge to learn a joint distribution. A prominent approach to address the modality coupling problem is Minimum Entropy Coupling (MEC), which seeks to minimize the joint Entropy, while satisfying constraints on the marginals. Existing approaches to the MEC problem focus on finite, discrete distributions, limiting their application for cases involving continuous data. In this work, we propose a novel method to solve the continuous MEC problem, using well-known generative diffusion models that learn to approximate and minimize the joint Entropy through a cooperative scheme, while satisfying a relaxed version of the marginal constraints. We empirically demonstrate that our method, DDMEC, is general and can be easily used to address challenging tasks, including unsupervised single-cell multi-omics data alignment and unpaired image translation, outperforming specialized methods.

Title: CFNet: Optimizing Remote Sensing Change Detection through Content-Aware Enhancement

Authors: Fan Wu, Sijun Dong, Xiaoliang Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08505
Pdf URL: https://arxiv.org/pdf/2503.08505
Copy Paste: [[2503.08505]] CFNet: Optimizing Remote Sensing Change Detection through Content-Aware Enhancement(https://arxiv.org/abs/2503.08505)
Keywords: extraction
Abstract: Change detection is a crucial and widely applied task in remote sensing, aimed at identifying and analyzing changes occurring in the same geographical area over time. Due to variability in acquisition conditions, bi-temporal remote sensing images often exhibit significant differences in image style. Even with the powerful generalization capabilities of DNNs, these unpredictable style variations between bi-temporal images inevitably affect model's ability to accurately detect changed areas. To address issue above, we propose the Content Focuser Network (CFNet), which takes content-aware strategy as a key insight. CFNet employs EfficientNet-B5 as the backbone for feature extraction. To enhance the model's focus on the content features of images while mitigating the misleading effects of style features, we develop a constraint strategy that prioritizes the content features of bi-temporal images, termed Content-Aware. Furthermore, to enable the model to flexibly focus on changed and unchanged areas according to the requirements of different stages, we design a reweighting module based on the cosine distance between bi-temporal image features, termed Focuser. CFNet achieve outstanding performance across three well-known change detection datasets: CLCD (F1: 81.41%, IoU: 68.65%), LEVIR-CD (F1: 92.18%, IoU: 85.49%), and SYSU-CD (F1: 82.89%, IoU: 70.78%). The code and pretrained models of CFNet are publicly released at this https URL.

Title: ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews

Authors: Xian Gao, Jiacheng Ruan, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08506
Pdf URL: https://arxiv.org/pdf/2503.08506
Copy Paste: [[2503.08506]] ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews(https://arxiv.org/abs/2503.08506)
Keywords: large language model
Abstract: Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning-consistent review comments that align with human reviewers' judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews. We first introduce a novel dataset, Review-CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers-summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant-paper-aware training method. Furthermore, we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human-generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.

Title: Referring to Any Person

Authors: Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08507
Pdf URL: https://arxiv.org/pdf/2503.08507
Copy Paste: [[2503.08507]] Referring to Any Person(https://arxiv.org/abs/2503.08507)
Keywords: robust, large language model
Abstract: Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, that hinder progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks. Code is available at this https URL

Title: DISTINGUISH Workflow: A New Paradigm of Dynamic Well Placement Using Generative Machine Learning

Authors: Sergey Alyaev, Kristian Fossum, Hibat Errahmen Djecta, Jan Tveranger, Ahmed H. Elsheikh
Subjects: cs.LG, math.OC, physics.geo-ph, stat.AP
Abstract URL: https://arxiv.org/abs/2503.08509
Pdf URL: https://arxiv.org/pdf/2503.08509
Copy Paste: [[2503.08509]] DISTINGUISH Workflow: A New Paradigm of Dynamic Well Placement Using Generative Machine Learning(https://arxiv.org/abs/2503.08509)
Keywords: extraction, generative
Abstract: The real-time process of directional changes while drilling, known as geosteering, is crucial for hydrocarbon extraction and emerging directional drilling applications such as geothermal energy, civil infrastructure, and CO2 storage. The geo-energy industry seeks an automatic geosteering workflow that continually updates the subsurface uncertainties and captures the latest geological understanding given the most recent observations in real-time. We propose "DISTINGUISH": a real-time, AI-driven workflow designed to transform geosteering by integrating Generative Adversarial Networks (GANs) for geological parameterization, ensemble methods for model updating, and global discrete dynamic programming (DDP) optimization for complex decision-making during directional drilling operations. The DISTINGUISH framework relies on offline training of a GAN model to reproduce relevant geology realizations and a Forward Neural Network (FNN) to model Logging-While-Drilling (LWD) tools' response for a given geomodel. This paper introduces a first-of-its-kind workflow that progressively reduces GAN-geomodel uncertainty around and ahead of the drilling bit and adjusts the well plan accordingly. The workflow automatically integrates real-time LWD data with a DDP-based decision support system, enhancing predictive models of geology ahead of drilling and leading to better steering decisions. We present a simple yet representative benchmark case and document the performance target achieved by the DISTINGUISH workflow prototype. This benchmark will be a foundation for future methodological advancements and workflow refinements.

Title: SAS: Segment Any 3D Scene with Integrated 2D Priors

Authors: Zhuoyuan Li, Jiahao Lu, Jiacheng Deng, Hanzhi Chang, Lifan Wu, Yanzhe Liang, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08512
Pdf URL: https://arxiv.org/pdf/2503.08512
Copy Paste: [[2503.08512]] SAS: Segment Any 3D Scene with Integrated 2D Priors(https://arxiv.org/abs/2503.08512)
Keywords: diffusion, segmentation
Abstract: The open vocabulary capability of 3D models is increasingly valued, as traditional methods with models trained with fixed categories fail to recognize unseen objects in complex dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open vocabulary capability of multiple 2D models and migrate it to 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify the 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused with the guide of constructed model capabilities. Finally, the integrated 2D open vocabulary capability is transferred to 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., gaussian segmentation and instance segmentation.

Title: Segmentation-Guided CT Synthesis with Pixel-Wise Conformal Uncertainty Bounds

Authors: David Vallmanya Poch, Yorick Estievenart, Elnura Zhalieva, Sukanya Patra, Mohammad Yaqub, Souhaib Ben Taieb
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2503.08515
Pdf URL: https://arxiv.org/pdf/2503.08515
Copy Paste: [[2503.08515]] Segmentation-Guided CT Synthesis with Pixel-Wise Conformal Uncertainty Bounds(https://arxiv.org/abs/2503.08515)
Keywords: robust, segmentation
Abstract: Accurate dose calculations in proton therapy rely on high-quality CT images. While planning CTs (pCTs) serve as a reference for dosimetric planning, Cone Beam CT (CBCT) is used throughout Adaptive Radiotherapy (ART) to generate sCTs for improved dose calculations. Despite its lower cost and reduced radiation exposure advantages, CBCT suffers from severe artefacts and poor image quality, making it unsuitable for precise dosimetry. Deep learning-based CBCT-to-CT translation has emerged as a promising approach. Still, existing methods often introduce anatomical inconsistencies and lack reliable uncertainty estimates, limiting their clinical adoption. To bridge this gap, we propose STF-RUE, a novel framework integrating two key components. First, STF, a segmentation-guided CBCT-to-CT translation method that enhances anatomical consistency by leveraging segmentation priors extracted from pCTs. Second, RUE, a conformal prediction method that augments predicted CTs with pixel-wise conformal prediction intervals, providing clinicians with robust reliability indicator. Comprehensive experiments using UNet++ and Fast-DDPM on two benchmark datasets demonstrate that STF-RUE significantly improves translation accuracy, as measured by a novel soft-tissue-focused metric designed for precise dose computation. Additionally, STF-RUE provides better-calibrated uncertainty sets for synthetic CT, reinforcing trust in synthetic CTs. By addressing both anatomical fidelity and uncertainty quantification, STF-RUE marks a crucial step toward safer and more effective adaptive proton therapy. Code is available at this https URL.

Title: High-Quality 3D Head Reconstruction from Any Single Portrait Image

Authors: Jianfu Zhang, yujie Gao, Jiahui Zhan, Wentao Wang, Yiyi Zhang, Haohua Zhao, Liqing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08516
Pdf URL: https://arxiv.org/pdf/2503.08516
Copy Paste: [[2503.08516]] High-Quality 3D Head Reconstruction from Any Single Portrait Image(https://arxiv.org/abs/2503.08516)
Keywords: robust, diffusion, generative
Abstract: In this work, we introduce a novel high-fidelity 3D head reconstruction method from a single portrait image, regardless of perspective, expression, or accessories. Despite significant efforts in adapting 2D generative models for novel view synthesis and 3D optimization, most methods struggle to produce high-quality 3D portraits. The lack of crucial information, such as identity, expression, hair, and accessories, limits these approaches in generating realistic 3D head models. To address these challenges, we construct a new high-quality dataset containing 227 sequences of digital human portraits captured from 96 different perspectives, totalling 21,792 frames, featuring diverse expressions and accessories. To further improve performance, we integrate identity and expression information into the multi-view diffusion process to enhance facial consistency across views. Specifically, we apply identity- and expression-aware guidance and supervision to extract accurate facial representations, which guide the model and enforce objective functions to ensure high identity and expression consistency during generation. Finally, we generate an orbital video around the portrait consisting of 96 multi-view frames, which can be used for 3D portrait model reconstruction. Our method demonstrates robust performance across challenging scenarios, including side-face angles and complex accessories

Title: Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

Authors: Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08524
Pdf URL: https://arxiv.org/pdf/2503.08524
Copy Paste: [[2503.08524]] Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency(https://arxiv.org/abs/2503.08524)
Keywords: large language model
Abstract: Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position aware layer skipping framework is proposed to save 1.5x times operations efficiently while maintaining performance. We first observed that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which leverages a power-law decay function, $\left\lfloor L \times (\alpha^i) \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, the $D^3$ achieves success across a wide range of generation tasks for the first time. Experiments on large language models (\ie the Llama) with $7 \sim 70$ billion parameters show that $D^3$ can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance with nearly no performance drop ($<1\%$) on the GSM8K and BBH benchmarks.

Title: GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training

Authors: Tong Wei, Yijun Yang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08525
Pdf URL: https://arxiv.org/pdf/2503.08525
Copy Paste: [[2503.08525]] GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training(https://arxiv.org/abs/2503.08525)
Keywords: large language model
Abstract: Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). Yet, its efficacy in training vision-language model (VLM) agents for goal-directed action reasoning in visual environments is less established. This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse, characterized by a rapid loss of diversity in the agent's thoughts, state-irrelevant and incomplete reasoning, and subsequent invalid actions, resulting in negative rewards. To counteract thought collapse, we highlight the necessity of process guidance and propose an automated corrector that evaluates and refines the agent's reasoning at each RL step. This simple and scalable GTR (Guided Thought Reinforcement) framework trains reasoning and action simultaneously without the need for dense, per-step human labeling. Our experiments demonstrate that GTR significantly enhances the performance and generalization of the LLaVA-7b model across various visual environments, achieving 3-5 times higher task success rates compared to SoTA models with notably smaller model sizes.

Title: ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification

Authors: Mingshi Li, Dusan Grujicic, Ben Somers, Stien Heremans, Steven De Saeger, Matthew B. Blaschko
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08534
Pdf URL: https://arxiv.org/pdf/2503.08534
Copy Paste: [[2503.08534]] ChromaFormer: A Scalable and Accurate Transformer Architecture for Land Cover Classification(https://arxiv.org/abs/2503.08534)
Keywords: transformer
Abstract: Remote sensing imagery from systems such as Sentinel provides full coverage of the Earth's surface at around 10-meter resolution. The remote sensing community has transitioned to extensive use of deep learning models due to their high performance on benchmarks such as the UCMerced and ISPRS Vaihingen datasets. Convolutional models such as UNet and ResNet variations are commonly employed for remote sensing but typically only accept three channels, as they were developed for RGB imagery, while satellite systems provide more than ten. Recently, several transformer architectures have been proposed for remote sensing, but they have not been extensively benchmarked and are typically used on small datasets such as Salinas Valley. Meanwhile, it is becoming feasible to obtain dense spatial land-use labels for entire first-level administrative divisions of some countries. Scaling law observations suggest that substantially larger multi-spectral transformer models could provide a significant leap in remote sensing performance in these settings. In this work, we propose ChromaFormer, a family of multi-spectral transformer models, which we evaluate across orders of magnitude differences in model parameters to assess their performance and scaling effectiveness on a densely labeled imagery dataset of Flanders, Belgium, covering more than 13,500 km^2 and containing 15 classes. We propose a novel multi-spectral attention strategy and demonstrate its effectiveness through ablations. Furthermore, we show that models many orders of magnitude larger than conventional architectures, such as UNet, lead to substantial accuracy improvements: a UNet++ model with 23M parameters achieves less than 65% accuracy, while a multi-spectral transformer with 655M parameters achieves over 95% accuracy on the Biological Valuation Map of Flanders.

Title: DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering

Authors: Sher Badshah, Hassan Sajjad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08542
Pdf URL: https://arxiv.org/pdf/2503.08542
Copy Paste: [[2503.08542]] DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering(https://arxiv.org/abs/2503.08542)
Keywords: robust, large language model
Abstract: Evaluating Large Language Models (LLMs) free-form generated responses remains a challenge due to their diverse and open-ended nature. Traditional supervised signal-based automatic metrics fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Leveraging LLMs as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. Taking advantage of these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM-as-judges and engages a third arbitrator only in cases of disagreements. This selective arbitration prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. DAFE utilizes task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen's Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE's ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.

Title: Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling

Authors: Craig Messner, Tom Lippincott
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08550
Pdf URL: https://arxiv.org/pdf/2503.08550
Copy Paste: [[2503.08550]] Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling(https://arxiv.org/abs/2503.08550)
Keywords: large language model
Abstract: We present an ngram model-based logit scaling technique that effectively transfers extreme subword stylistic variation to large language models at inference time. We demonstrate its efficacy by tracking the perplexity of generated text with respect to the ngram interpolated and original versions of an evaluation model. Minimizing the former measure while the latter approaches the perplexity of a text produced by a target author or character lets us select a sufficient degree of adaptation while retaining fluency.

Title: An Analysis of Safety Guarantees in Multi-Task Bayesian Optimization

Authors: Jannis O. Luebsen, Annika Eichler
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2503.08555
Pdf URL: https://arxiv.org/pdf/2503.08555
Copy Paste: [[2503.08555]] An Analysis of Safety Guarantees in Multi-Task Bayesian Optimization(https://arxiv.org/abs/2503.08555)
Keywords: robust
Abstract: In many practical scenarios of black box optimization, the objective function is subject to constraints that must be satisfied to avoid undesirable outcomes. Such constraints are typically unknown and must be learned during optimization. Safe Bayesian optimization aims to find the global optimum while ensuring that the constraints are satisfied with high probability. However, it is often sample-inefficient due to the small initial feasible set, which requires expansion by evaluating the objective or constraint functions, limiting its applicability to low-dimensional or inexpensive problems. To enhance sample efficiency, additional information from cheap simulations can be leveraged, albeit at the cost of safeness guarantees. This paper introduces a novel safe multi-task Bayesian optimization algorithm that integrates multiple tasks while maintaining high-probability safety. We derive robust uniform error bounds for the multi-task case and demonstrate the effectiveness of the approach on benchmark functions and a control problem. Our results show a significant improvement in sample efficiency, making the proposed method well-suited for expensive-to-evaluate functions.

Title: ComicsPAP: understanding comic strips by picking the correct panel

Authors: Emanuele Vivoli, Artemis Llabrés, Mohamed Ali Soubgui, Marco Bertini, Ernest Valveny Llobet, Dimosthenis Karatzas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08561
Pdf URL: https://arxiv.org/pdf/2503.08561
Copy Paste: [[2503.08561]] ComicsPAP: understanding comic strips by picking the correct panel(https://arxiv.org/abs/2503.08561)
Keywords: robust
Abstract: Large multimodal models (LMMs) have made impressive strides in image captioning, VQA, and video comprehension, yet they still struggle with the intricate temporal and spatial cues found in comics. To address this gap, we introduce ComicsPAP, a large-scale benchmark designed for comic strip understanding. Comprising over 100k samples and organized into 5 subtasks under a Pick-a-Panel framework, ComicsPAP demands models to identify the missing panel in a sequence. Our evaluations, conducted under both multi-image and single-image protocols, reveal that current state-of-the-art LMMs perform near chance on these tasks, underscoring significant limitations in capturing sequential and contextual dependencies. To close the gap, we adapted LMMs for comic strip understanding, obtaining better results on ComicsPAP than 10x bigger models, demonstrating that ComicsPAP offers a robust resource to drive future research in multimodal comic comprehension.

Title: DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process

Authors: Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08569
Pdf URL: https://arxiv.org/pdf/2503.08569
Copy Paste: [[2503.08569]] DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process(https://arxiv.org/abs/2503.08569)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21\% and 80.20\% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset and demo have be released in this http URL.

Title: Modular Customization of Diffusion Models via Blockwise-Parameterized Low-Rank Adaptation

Authors: Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08575
Pdf URL: https://arxiv.org/pdf/2503.08575
Copy Paste: [[2503.08575]] Modular Customization of Diffusion Models via Blockwise-Parameterized Low-Rank Adaptation(https://arxiv.org/abs/2503.08575)
Keywords: diffusion
Abstract: Recent diffusion model customization has shown impressive results in incorporating subject or style concepts with a handful of images. However, the modular composition of multiple concepts into a customized model, aimed to efficiently merge decentralized-trained concepts without influencing their identities, remains unresolved. Modular customization is essential for applications like concept stylization and multi-concept customization using concepts trained by different users. Existing post-training methods are only confined to a fixed set of concepts, and any different combinations require a new round of retraining. In contrast, instant merging methods often cause identity loss and interference of individual merged concepts and are usually limited to a small number of concepts. To address these issues, we propose BlockLoRA, an instant merging method designed to efficiently combine multiple concepts while accurately preserving individual concepts' identity. With a careful analysis of the underlying reason for interference, we develop the Randomized Output Erasure technique to minimize the interference of different customized models. Additionally, Blockwise LoRA Parameterization is proposed to reduce the identity loss during instant model merging. Extensive experiments validate the effectiveness of BlockLoRA, which can instantly merge 15 concepts of people, subjects, scenes, and styles with high fidelity.

Title: RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding

Authors: Xichen Tan, Yunfan Ye, Yuanjing Luo, Qian Wan, Fang Liu, Zhiping Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08576
Pdf URL: https://arxiv.org/pdf/2503.08576
Copy Paste: [[2503.08576]] RAG-Adapter: A Plug-and-Play RAG-enhanced Framework for Long Video Understanding(https://arxiv.org/abs/2503.08576)
Keywords: large language model
Abstract: Multi-modal Large Language Models (MLLMs) capable of video understanding are advancing rapidly. To effectively assess their video comprehension capabilities, long video understanding benchmarks, such as Video-MME and MLVU, are proposed. However, these benchmarks directly use uniform frame sampling for testing, which results in significant information loss and affects the accuracy of the evaluations in reflecting the true abilities of MLLMs. To address this, we propose RAG-Adapter, a plug-and-play framework that reduces information loss during testing by sampling frames most relevant to the given question. Additionally, we introduce a Grouped-supervised Contrastive Learning (GCL) method to further enhance sampling effectiveness of RAG-Adapter through fine-tuning on our constructed MMAT dataset. Finally, we test numerous baseline MLLMs on various video understanding benchmarks, finding that RAG-Adapter sampling consistently outperforms uniform sampling (e.g., Accuracy of GPT-4o increases by 9.3 percent on Video-MME), providing a more accurate testing method for long video benchmarks.

Title: MsaMIL-Net: An End-to-End Multi-Scale Aware Multiple Instance Learning Network for Efficient Whole Slide Image Classification

Authors: Jiangping Wen, Jinyu Wen, Emei Fang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08581
Pdf URL: https://arxiv.org/pdf/2503.08581
Copy Paste: [[2503.08581]] MsaMIL-Net: An End-to-End Multi-Scale Aware Multiple Instance Learning Network for Efficient Whole Slide Image Classification(https://arxiv.org/abs/2503.08581)
Keywords: extraction
Abstract: Bag-based Multiple Instance Learning (MIL) approaches have emerged as the mainstream methodology for Whole Slide Image (WSI) classification. However, most existing methods adopt a segmented training strategy, which first extracts features using a pre-trained feature extractor and then aggregates these features through MIL. This segmented training approach leads to insufficient collaborative optimization between the feature extraction network and the MIL network, preventing end-to-end joint optimization and thereby limiting the overall performance of the model. Additionally, conventional methods typically extract features from all patches of fixed size, ignoring the multi-scale observation characteristics of pathologists. This not only results in significant computational resource waste when tumor regions represent a minimal proportion (as in the Camelyon16 dataset) but may also lead the model to suboptimal solutions. To address these limitations, this paper proposes an end-to-end multi-scale WSI classification framework that integrates multi-scale feature extraction with multiple instance learning. Specifically, our approach includes: (1) a semantic feature filtering module to reduce interference from non-lesion areas; (2) a multi-scale feature extraction module to capture pathological information at different levels; and (3) a multi-scale fusion MIL module for global modeling and feature integration. Through an end-to-end training strategy, we simultaneously optimize both the feature extractor and MIL network, ensuring maximum compatibility between them. Experiments were conducted on three cross-center datasets (DigestPath2019, BCNB, and UBC-OCEAN). Results demonstrate that our proposed method outperforms existing state-of-the-art approaches in terms of both accuracy (ACC) and AUC metrics.

Title: HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Authors: Shehreen Azad, Vibhav Vineet, Yogesh Singh Rawat
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08585
Pdf URL: https://arxiv.org/pdf/2503.08585
Copy Paste: [[2503.08585]] HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding(https://arxiv.org/abs/2503.08585)
Keywords: robust, transformer, large language model
Abstract: Despite advancements in multimodal large language models (MLLMs), current approaches struggle in medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling, while avoiding LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over longer period of time. Each stream is supported by dedicated memory banks which enables our proposed Hierachical Querying transformer (HierarQ) to effectively capture short and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis.

Title: BiasEdit: Debiasing Stereotyped Language Models via Model Editing

Authors: Xin Xu, Wei Xu, Ningyu Zhang, Julian McAuley
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08588
Pdf URL: https://arxiv.org/pdf/2503.08588
Copy Paste: [[2503.08588]] BiasEdit: Debiasing Stereotyped Language Models via Model Editing(https://arxiv.org/abs/2503.08588)
Keywords: robust
Abstract: Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting often fail to efficiently eliminate bias or directly alter the models' biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method to remove stereotypical bias from language models through lightweight networks that act as editors to generate parameter updates. BiasEdit employs a debiasing loss guiding editor networks to conduct local edits on partial parameters of a language model for debiasing while preserving the language modeling abilities during editing through a retention loss. Experiments on StereoSet and Crows-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangental debiasing baselines and little to no impact on the language models' general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore bias editing impacts on different components of language models.

Title: 3D Point Cloud Generation via Autoregressive Up-sampling

Authors: Ziqiao Meng, Qichao Wang, Zhipeng Zhou, Irwin King, Peilin Zhao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08594
Pdf URL: https://arxiv.org/pdf/2503.08594
Copy Paste: [[2503.08594]] 3D Point Cloud Generation via Autoregressive Up-sampling(https://arxiv.org/abs/2503.08594)
Keywords: diffusion, transformer, generative
Abstract: We introduce a pioneering autoregressive generative model for 3D point cloud generation. Inspired by visual autoregressive modeling (VAR), we conceptualize point cloud generation as an autoregressive up-sampling process. This leads to our novel model, PointARU, which progressively refines 3D point clouds from coarse to fine scales. PointARU follows a two-stage training paradigm: first, it learns multi-scale discrete representations of point clouds, and then it trains an autoregressive transformer for next-scale prediction. To address the inherent unordered and irregular structure of point clouds, we incorporate specialized point-based up-sampling network modules in both stages and integrate 3D absolute positional encoding based on the decoded point cloud at each scale during the second stage. Our model surpasses state-of-the-art (SoTA) diffusion-based approaches in both generation quality and parameter efficiency across diverse experimental settings, marking a new milestone for autoregressive methods in 3D point cloud generation. Furthermore, PointARU demonstrates exceptional performance in completing partial 3D shapes and up-sampling sparse point clouds, outperforming existing generative models in these tasks.

Title: NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

Authors: Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.08600
Pdf URL: https://arxiv.org/pdf/2503.08600
Copy Paste: [[2503.08600]] NSF-SciFy: Mining the NSF Awards Database for Scientific Claims(https://arxiv.org/abs/2503.08600)
Keywords: robust, extraction, large language model
Abstract: We present NSF-SciFy, a large-scale dataset for scientific claim extraction derived from the National Science Foundation (NSF) awards database, comprising over 400K grant abstracts spanning five decades. While previous datasets relied on published literature, we leverage grant abstracts which offer a unique advantage: they capture claims at an earlier stage in the research lifecycle before publication takes effect. We also introduce a new task to distinguish between existing scientific claims and aspirational research intentions in this http URL zero-shot prompting with frontier large language models, we jointly extract 114K scientific claims and 145K investigation proposals from 16K grant abstracts in the materials science domain to create a focused subset called NSF-SciFy-MatSci. We use this dataset to evaluate 3 three key tasks: (1) technical to non-technical abstract generation, where models achieve high BERTScore (0.85+ F1); (2) scientific claim extraction, where fine-tuned models outperform base models by 100% relative improvement; and (3) investigation proposal extraction, showing 90%+ improvement with fine-tuning. We introduce novel LLM-based evaluation metrics for robust assessment of claim/proposal extraction quality. As the largest scientific claim dataset to date -- with an estimated 2.8 million claims across all STEM disciplines funded by the NSF -- NSF-SciFy enables new opportunities for claim verification and meta-scientific research. We publicly release all datasets, trained models, and evaluation code to facilitate further research.

Title: LiSu: A Dataset and Method for LiDAR Surface Normal Estimation

Authors: Dušan Malić, Christian Fruhwirth-Reisinger, Samuel Schulter, Horst Possegger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08601
Pdf URL: https://arxiv.org/pdf/2503.08601
Copy Paste: [[2503.08601]] LiSu: A Dataset and Method for LiDAR Surface Normal Estimation(https://arxiv.org/abs/2503.08601)
Keywords: robust
Abstract: While surface normals are widely used to analyse 3D scene geometry, surface normal estimation from LiDAR point clouds remains severely underexplored. This is caused by the lack of large-scale annotated datasets on the one hand, and lack of methods that can robustly handle the sparse and often noisy LiDAR data in a reasonable time on the other hand. We address these limitations using a traffic simulation engine and present LiSu, the first large-scale, synthetic LiDAR point cloud dataset with ground truth surface normal annotations, eliminating the need for tedious manual labeling. Additionally, we propose a novel method that exploits the spatiotemporal characteristics of autonomous driving data to enhance surface normal estimation accuracy. By incorporating two regularization terms, we enforce spatial consistency among neighboring points and temporal smoothness across consecutive LiDAR frames. These regularizers are particularly effective in self-training settings, where they mitigate the impact of noisy pseudo-labels, enabling robust real-world deployment. We demonstrate the effectiveness of our method on LiSu, achieving state-of-the-art performance in LiDAR surface normal estimation. Moreover, we showcase its full potential in addressing the challenging task of synthetic-to-real domain adaptation, leading to improved neural surface reconstruction on real-world data.

Title: CellStyle: Improved Zero-Shot Cell Segmentation via Style Transfer

Authors: Rüveyda Yilmaz, Zhu Chen, Yuli Wu, Johannes Stegmaier
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08603
Pdf URL: https://arxiv.org/pdf/2503.08603
Copy Paste: [[2503.08603]] CellStyle: Improved Zero-Shot Cell Segmentation via Style Transfer(https://arxiv.org/abs/2503.08603)
Keywords: segmentation
Abstract: Cell microscopy data are abundant; however, corresponding segmentation annotations remain scarce. Moreover, variations in cell types, imaging devices, and staining techniques introduce significant domain gaps between datasets. As a result, even large, pretrained segmentation models trained on diverse datasets (source datasets) struggle to generalize to unseen datasets (target datasets). To overcome this generalization problem, we propose CellStyle, which improves the segmentation quality of such models without requiring labels for the target dataset, thereby enabling zero-shot adaptation. CellStyle transfers the attributes of an unannotated target dataset, such as texture, color, and noise, to the annotated source dataset. This transfer is performed while preserving the cell shapes of the source images, ensuring that the existing source annotations can still be used while maintaining the visual characteristics of the target dataset. The styled synthetic images with the existing annotations enable the finetuning of a generalist segmentation model for application to the unannotated target data. We demonstrate that CellStyle significantly improves zero-shot cell segmentation performance across diverse datasets by finetuning multiple segmentation models on the style-transferred data. The code will be made publicly available.

Title: Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling

Authors: Subin Kim, Seoung Wug Oh, Jui-Hsien Wang, Joon-Young Lee, Jinwoo Shin
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08605
Pdf URL: https://arxiv.org/pdf/2503.08605
Copy Paste: [[2503.08605]] Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling(https://arxiv.org/abs/2503.08605)
Keywords: diffusion
Abstract: While recent advancements in text-to-video diffusion models enable high-quality short video generation from a single prompt, generating real-world long videos in a single pass remains challenging due to limited data and high computational costs. To address this, several works propose tuning-free approaches, i.e., extending existing models for long video generation, specifically using multiple prompts to allow for dynamic and controlled content changes. However, these methods primarily focus on ensuring smooth transitions between adjacent frames, often leading to content drift and a gradual loss of semantic coherence over longer sequences. To tackle such an issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse and optimization-based sampling, which ensure seamless local transitions and enforce global coherence, respectively. However, directly alternating between these samplings misaligns denoising trajectories, disrupting prompt guidance and introducing unintended content changes as they operate independently. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence, outperforming previous approaches both quantitatively and qualitatively.

Title: LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization

Authors: Xianfeng Wu, Yajing Bai, Haoze Zheng, Harold Haodong Chen, Yexin Liu, Zihao Wang, Xuran Ma, Wen-Jie Shu, Xianzu Wu, Harry Yang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08619
Pdf URL: https://arxiv.org/pdf/2503.08619
Copy Paste: [[2503.08619]] LightGen: Efficient Image Generation through Knowledge Distillation and Direct Preference Optimization(https://arxiv.org/abs/2503.08619)
Keywords: large language model
Abstract: Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures. These requirements severely limit accessibility for researchers and practitioners who lack substantial computational resources. In this paper, we introduce \model, an efficient training paradigm for image generation models that uses knowledge distillation (KD) and Direct Preference Optimization (DPO). Drawing inspiration from the success of data KD techniques widely adopted in Multi-Modal Large Language Models (MLLMs), LightGen distills knowledge from state-of-the-art (SOTA) text-to-image models into a compact Masked Autoregressive (MAR) architecture with only $0.7B$ parameters. Using a compact synthetic dataset of just $2M$ high-quality images generated from varied captions, we demonstrate that data diversity significantly outweighs data volume in determining model performance. This strategy dramatically reduces computational demands and reduces pre-training time from potentially thousands of GPU-days to merely 88 GPU-days. Furthermore, to address the inherent shortcomings of synthetic data, particularly poor high-frequency details and spatial inaccuracies, we integrate the DPO technique that refines image fidelity and positional accuracy. Comprehensive experiments confirm that LightGen achieves image generation quality comparable to SOTA models while significantly reducing computational resources and expanding accessibility for resource-constrained environments. Code is available at this https URL

Title: SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Authors: Muzhi Zhu, Yuzhuo Tian, Hao Chen, Chunluan Zhou, Qingpei Guo, Yang Liu, Ming Yang, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08625
Pdf URL: https://arxiv.org/pdf/2503.08625
Copy Paste: [[2503.08625]] SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories(https://arxiv.org/abs/2503.08625)
Keywords: robust, segmentation
Abstract: While MLLMs have demonstrated adequate image understanding capabilities, they still struggle with pixel-level comprehension, limiting their practical applications. Current evaluation tasks like VQA and visual grounding remain too coarse to assess fine-grained pixel comprehension accurately. Though segmentation is foundational for pixel-level understanding, existing methods often require MLLMs to generate implicit tokens, decoded through external pixel decoders. This approach disrupts the MLLM's text output space, potentially compromising language capabilities and reducing flexibility and extensibility, while failing to reflect the model's intrinsic pixel-level understanding. Thus, we introduce the Human-Like Mask Annotation Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. Modeling segmentation as a multi-step Markov Decision Process, HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. Through this setup, we develop SegAgent, a model fine-tuned on human-like annotation trajectories, which achieves performance comparable to state-of-the-art (SOTA) methods and supports additional tasks like mask refinement and annotation filtering. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task that facilitates exploration of MLLMs' visual reasoning abilities. Our adaptations of policy improvement method StaR and PRM-guided tree search further enhance model robustness in complex segmentation tasks, laying a foundation for future advancements in fine-grained visual perception and multi-step decision-making for MLLMs.

Title: Secret-Key Generation from Private Identifiers under Channel Uncertainty

Authors: Vamoua Yachongka, Rémi A. Chou
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2503.08632
Pdf URL: https://arxiv.org/pdf/2503.08632
Copy Paste: [[2503.08632]] Secret-Key Generation from Private Identifiers under Channel Uncertainty(https://arxiv.org/abs/2503.08632)
Keywords: secure, privacy
Abstract: This study investigates secret-key generation for device authentication using physical identifiers, such as responses from physical unclonable functions (PUFs). The system includes two legitimate terminals (encoder and decoder) and an eavesdropper (Eve), each with access to different measurements of the identifier. From the device identifier, the encoder generates a secret key, which is securely stored in a private database, along with helper data that is saved in a public database accessible by the decoder for key reconstruction. Eve, who also has access to the public database, may use both her own measurements and the helper data to attempt to estimate the secret key and identifier. Our setup focuses on authentication scenarios where channel statistics are uncertain, with the involved parties employing multiple antennas to enhance signal reception. Our contributions include deriving inner and outer bounds on the optimal trade-off among secret-key, storage, and privacy-leakage rates for general discrete sources, and showing that these bounds are tight for Gaussian sources.

Title: How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?

Authors: Gal Alon, Yehuda Dar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08633
Pdf URL: https://arxiv.org/pdf/2503.08633
Copy Paste: [[2503.08633]] How Does Overparameterization Affect Machine Unlearning of Deep Neural Networks?(https://arxiv.org/abs/2503.08633)
Keywords: privacy
Abstract: Machine unlearning is the task of updating a trained model to forget specific training data without retraining from scratch. In this paper, we investigate how unlearning of deep neural networks (DNNs) is affected by the model parameterization level, which corresponds here to the DNN width. We define validation-based tuning for several unlearning methods from the recent literature, and show how these methods perform differently depending on (i) the DNN parameterization level, (ii) the unlearning goal (unlearned data privacy or bias removal), (iii) whether the unlearning method explicitly uses the unlearned examples. Our results show that unlearning excels on overparameterized models, in terms of balancing between generalization and achieving the unlearning goal; although for bias removal this requires the unlearning method to use the unlearned examples. We further elucidate our error-based analysis by measuring how much the unlearning changes the classification decision regions in the proximity of the unlearned examples, and avoids changing them elsewhere. By this we show that the unlearning success for overparameterized models stems from the ability to delicately change the model functionality in small regions in the input space while keeping much of the model functionality unchanged.

Title: Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning

Authors: Hubert Baniecki, Przemyslaw Biecek
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.08636
Pdf URL: https://arxiv.org/pdf/2503.08636
Copy Paste: [[2503.08636]] Birds look like cars: Adversarial analysis of intrinsically interpretable deep learning(https://arxiv.org/abs/2503.08636)
Keywords: security, attack, robust, interpretability
Abstract: A common belief is that intrinsically interpretable deep learning models ensure a correct, intuitive understanding of their behavior and offer greater robustness against accidental errors or intentional manipulation. However, these beliefs have not been comprehensively verified, and growing evidence casts doubt on them. In this paper, we highlight the risks related to overreliance and susceptibility to adversarial manipulation of these so-called "intrinsically (aka inherently) interpretable" models by design. We introduce two strategies for adversarial analysis with prototype manipulation and backdoor attacks against prototype-based networks, and discuss how concept bottleneck models defend against these attacks. Fooling the model's reasoning by exploiting its use of latent prototypes manifests the inherent uninterpretability of deep neural networks, leading to a false sense of security reinforced by a visual confirmation bias. The reported limitations of prototype-based networks put their trustworthiness and applicability into question, motivating further work on the robustness and alignment of (deep) interpretable models.

Title: Coefficient-to-Basis Network: A Fine-Tunable Operator Learning Framework for Inverse Problems with Adaptive Discretizations and Theoretical Guarantees

Authors: Zecheng Zhang, Hao Liu, Wenjing Liao, Guang Lin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08642
Pdf URL: https://arxiv.org/pdf/2503.08642
Copy Paste: [[2503.08642]] Coefficient-to-Basis Network: A Fine-Tunable Operator Learning Framework for Inverse Problems with Adaptive Discretizations and Theoretical Guarantees(https://arxiv.org/abs/2503.08642)
Keywords: robust
Abstract: We propose a Coefficient-to-Basis Network (C2BNet), a novel framework for solving inverse problems within the operator learning paradigm. C2BNet efficiently adapts to different discretizations through fine-tuning, using a pre-trained model to significantly reduce computational cost while maintaining high accuracy. Unlike traditional approaches that require retraining from scratch for new discretizations, our method enables seamless adaptation without sacrificing predictive performance. Furthermore, we establish theoretical approximation and generalization error bounds for C2BNet by exploiting low-dimensional structures in the underlying datasets. Our analysis demonstrates that C2BNet adapts to low-dimensional structures without relying on explicit encoding mechanisms, highlighting its robustness and efficiency. To validate our theoretical findings, we conducted extensive numerical experiments that showcase the superior performance of C2BNet on several inverse problems. The results confirm that C2BNet effectively balances computational efficiency and accuracy, making it a promising tool to solve inverse problems in scientific computing and engineering applications.

Title: MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input

Authors: Zhenchen Wan, Yanwu xu, Dongting Hu, Weilun Cheng, Tianxi Chen, Zhaoqing Wang, Feng Liu, Tongliang Liu, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08650
Pdf URL: https://arxiv.org/pdf/2503.08650
Copy Paste: [[2503.08650]] MF-VITON: High-Fidelity Mask-Free Virtual Try-On with Minimal Input(https://arxiv.org/abs/2503.08650)
Keywords: diffusion
Abstract: Recent advancements in Virtual Try-On (VITON) have significantly improved image realism and garment detail preservation, driven by powerful text-to-image (T2I) diffusion models. However, existing methods often rely on user-provided masks, introducing complexity and performance degradation due to imperfect inputs, as shown in Fig.1(a). To address this, we propose a Mask-Free VITON (MF-VITON) framework that achieves realistic VITON using only a single person image and a target garment, eliminating the requirement for auxiliary masks. Our approach introduces a novel two-stage pipeline: (1) We leverage existing Mask-based VITON models to synthesize a high-quality dataset. This dataset contains diverse, realistic pairs of person images and corresponding garments, augmented with varied backgrounds to mimic real-world scenarios. (2) The pre-trained Mask-based model is fine-tuned on the generated dataset, enabling garment transfer without mask dependencies. This stage simplifies the input requirements while preserving garment texture and shape fidelity. Our framework achieves state-of-the-art (SOTA) performance regarding garment transfer accuracy and visual realism. Notably, the proposed Mask-Free model significantly outperforms existing Mask-based approaches, setting a new benchmark and demonstrating a substantial lead over previous approaches. For more details, visit our project page: this https URL.

Title: Extra Clients at No Extra Cost: Overcome Data Heterogeneity in Federated Learning with Filter Decomposition

Authors: Wei Chen, Qiang Qiu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.08652
Pdf URL: https://arxiv.org/pdf/2503.08652
Copy Paste: [[2503.08652]] Extra Clients at No Extra Cost: Overcome Data Heterogeneity in Federated Learning with Filter Decomposition(https://arxiv.org/abs/2503.08652)
Keywords: federate
Abstract: Data heterogeneity is one of the major challenges in federated learning (FL), which results in substantial client variance and slow convergence. In this study, we propose a novel solution: decomposing a convolutional filter in FL into a linear combination of filter subspace elements, i.e., filter atoms. This simple technique transforms global filter aggregation in FL into aggregating filter atoms and their atom coefficients. The key advantage here involves mathematically generating numerous cross-terms by expanding the product of two weighted sums from filter atom and atom coefficient. These cross-terms effectively emulate many additional latent clients, significantly reducing model variance, which is validated by our theoretical analysis and empirical observation. Furthermore, our method permits different training schemes for filter atoms and atom coefficients for highly adaptive model personalization and communication efficiency. Empirical results on benchmark datasets demonstrate that our filter decomposition technique substantially improves the accuracy of FL methods, confirming its efficacy in addressing data heterogeneity.

Title: Exploring the Word Sense Disambiguation Capabilities of Large Language Models

Authors: Pierpaolo Basile, Lucia Siciliani, Elio Musacchio, Giovanni Semeraro
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08662
Pdf URL: https://arxiv.org/pdf/2503.08662
Copy Paste: [[2503.08662]] Exploring the Word Sense Disambiguation Capabilities of Large Language Models(https://arxiv.org/abs/2503.08662)
Keywords: large language model
Abstract: Word Sense Disambiguation (WSD) is a historical task in computational linguistics that has received much attention over the years. However, with the advent of Large Language Models (LLMs), interest in this task (in its classical definition) has decreased. In this study, we evaluate the performance of various LLMs on the WSD task. We extend a previous benchmark (XL-WSD) to re-design two subtasks suitable for LLM: 1) given a word in a sentence, the LLM must generate the correct definition; 2) given a word in a sentence and a set of predefined meanings, the LLM must select the correct one. The extended benchmark is built using the XL-WSD and BabelNet. The results indicate that LLMs perform well in zero-shot learning but cannot surpass current state-of-the-art methods. However, a fine-tuned model with a medium number of parameters outperforms all other models, including the state-of-the-art.

Title: MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

Authors: Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, Chen Change Loy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.08664
Pdf URL: https://arxiv.org/pdf/2503.08664
Copy Paste: [[2503.08664]] MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention(https://arxiv.org/abs/2503.08664)
Keywords: diffusion
Abstract: Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling multiview attention to higher resolutions. In this paper, we explore human multiview diffusion models at the megapixel level and introduce a solution called mesh attention to enable training at 1024x1024 resolution. Using a clothed human mesh as a central coarse geometric representation, the proposed mesh attention leverages rasterization and projection to establish direct cross-view coordinate correspondences. This approach significantly reduces the complexity of multiview attention while maintaining cross-view consistency. Building on this foundation, we devise a mesh attention block and combine it with keypoint conditioning to create our human-specific multiview diffusion model, MEAT. In addition, we present valuable insights into applying multiview human motion videos for diffusion training, addressing the longstanding issue of data scarcity. Extensive experiments show that MEAT effectively generates dense, consistent multiview human images at the megapixel level, outperforming existing multiview diffusion methods.

Title: REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder

Authors: Yitian Zhang, Long Mai, Aniruddha Mahapatra, David Bourgin, Yicong Hong, Jonah Casebeer, Feng Liu, Yun Fu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08665
Pdf URL: https://arxiv.org/pdf/2503.08665
Copy Paste: [[2503.08665]] REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder(https://arxiv.org/abs/2503.08665)
Keywords: robust, diffusion, transformer, generative
Abstract: We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions. This relaxed criterion enables substantial improvements in compression ratios without compromising the quality of downstream generative models. Specifically, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space. Therein, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding. Our experiments demonstrate that our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases. To demonstrate the efficacy of our approach, we report results from our video embedders achieving a temporal compression ratio of up to 32x (8x higher than leading video embedders) and validate the robustness of this ultra-compact latent space for text-to-video generation, providing a significant efficiency boost in latent diffusion model training and inference.

Title: Language-Depth Navigated Thermal and Visible Image Fusion

Authors: Jinchang Zhang, Zijun Li, Guoyu Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08676
Pdf URL: https://arxiv.org/pdf/2503.08676
Copy Paste: [[2503.08676]] Language-Depth Navigated Thermal and Visible Image Fusion(https://arxiv.org/abs/2503.08676)
Keywords: diffusion
Abstract: Depth-guided multimodal fusion combines depth information from visible and infrared images, significantly enhancing the performance of 3D reconstruction and robotics applications. Existing thermal-visible image fusion mainly focuses on detection tasks, ignoring other critical information such as depth. By addressing the limitations of single modalities in low-light and complex environments, the depth information from fused images not only generates more accurate point cloud data, improving the completeness and precision of 3D reconstruction, but also provides comprehensive scene understanding for robot navigation, localization, and environmental perception. This supports precise recognition and efficient operations in applications such as autonomous driving and rescue missions. We introduce a text-guided and depth-driven infrared and visible image fusion network. The model consists of an image fusion branch for extracting multi-channel complementary information through a diffusion model, equipped with a text-guided module, and two auxiliary depth estimation branches. The fusion branch uses CLIP to extract semantic information and parameters from depth-enriched image descriptions to guide the diffusion model in extracting multi-channel features and generating fused images. These fused images are then input into the depth estimation branches to calculate depth-driven loss, optimizing the image fusion network. This framework aims to integrate vision-language and depth to directly generate color-fused images from multimodal inputs.

Title: OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Authors: Yongsheng Yu, Ziyun Zeng, Haitian Zheng, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08677
Pdf URL: https://arxiv.org/pdf/2503.08677
Copy Paste: [[2503.08677]] OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting(https://arxiv.org/abs/2503.08677)
Keywords: robust, diffusion, generative
Abstract: Diffusion-based generative models have revolutionized object-oriented image editing, yet their deployment in realistic object removal and insertion remains hampered by challenges such as the intricate interplay of physical effects and insufficient paired training data. In this work, we introduce OmniPaint, a unified framework that re-conceptualizes object removal and insertion as interdependent processes rather than isolated tasks. Leveraging a pre-trained diffusion prior along with a progressive training pipeline comprising initial paired sample optimization and subsequent large-scale unpaired refinement via CycleFlow, OmniPaint achieves precise foreground elimination and seamless object insertion while faithfully preserving scene geometry and intrinsic properties. Furthermore, our novel CFD metric offers a robust, reference-free evaluation of context consistency and object hallucination, establishing a new benchmark for high-fidelity image editing. Project page: this https URL

Title: Self-Taught Self-Correction for Small Language Models

Authors: Viktor Moskvoretskii, Chris Biemann, Irina Nikishina
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.08681
Pdf URL: https://arxiv.org/pdf/2503.08681
Copy Paste: [[2503.08681]] Self-Taught Self-Correction for Small Language Models(https://arxiv.org/abs/2503.08681)
Keywords: large language model
Abstract: Although large language models (LLMs) have achieved remarkable performance across various tasks, they remain prone to errors. A key challenge is enabling them to self-correct. While prior research has relied on external tools or large proprietary models, this work explores self-correction in small language models (SLMs) through iterative fine-tuning using solely self-generated data. We introduce the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task demonstrate that STaSC effectively learns self-correction, leading to significant performance improvements. Our analysis further provides insights into the mechanisms of self-correction and the impact of different design choices on learning dynamics and overall performance. To support future research, we release our user-friendly codebase and lightweight models.

Title: "Principal Components" Enable A New Language of Images

Authors: Xin Wen, Bingchen Zhao, Ismail Elezi, Jiankang Deng, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08685
Pdf URL: https://arxiv.org/pdf/2503.08685
Copy Paste: [[2503.08685]] "Principal Components" Enable A New Language of Images(https://arxiv.org/abs/2503.08685)
Keywords: interpretability, diffusion
Abstract: We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. While existing visual tokenizers primarily optimize for reconstruction fidelity, they often neglect the structural properties of the latent space -- a critical factor for both interpretability and downstream tasks. Our method generates a 1D causal token sequence for images, where each successive token contributes non-overlapping information with mathematically guaranteed decreasing explained variance, analogous to principal component analysis. This structural constraint ensures the tokenizer extracts the most salient visual features first, with each subsequent token adding diminishing yet complementary information. Additionally, we identified and resolved a semantic-spectrum coupling effect that causes the unwanted entanglement of high-level semantic content and low-level spectral details in the tokens by leveraging a diffusion decoder. Experiments demonstrate that our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system. Moreover, auto-regressive models trained on our token sequences achieve performance comparable to current state-of-the-art methods while requiring fewer tokens for training and inference.

Title: OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Authors: Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.08686
Pdf URL: https://arxiv.org/pdf/2503.08686
Copy Paste: [[2503.08686]] OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models(https://arxiv.org/abs/2503.08686)
Keywords: transformer
Abstract: Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at this https URL