2025-03-27

Title: Robust Object Detection of Underwater Robot based on Domain Generalization

Authors: Pinhao Song
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.19929
Pdf URL: https://arxiv.org/pdf/2503.19929
Copy Paste: [[2503.19929]] Robust Object Detection of Underwater Robot based on Domain Generalization(https://arxiv.org/abs/2503.19929)
Keywords: robust
Abstract: Object detection aims to obtain the location and the category of specific objects in a given image, which includes two tasks: classification and localization. In recent years, researchers have tended to apply object detection to underwater robots equipped with vision systems to complete tasks including seafood fishing, fish farming, biodiversity monitoring and so on. However, the diversity and complexity of underwater environments bring new challenges to object detection. First, aquatic organisms tend to live together, which leads to severe occlusion. Second, aquatic organisms are good at hiding themselves and often have a similar color to the background. Third, varying water quality and changeable, extreme lighting conditions lead to distorted, low-contrast, blue or green images from the underwater camera, resulting in domain shift, and deep models are generally vulnerable to domain shift. Fourth, the movement of the underwater robot blurs the captured images and makes the water muddy, which results in low visibility. This paper investigates the problems brought by the underwater environment mentioned above, and aims to design a high-performance and robust underwater object detector.

Title: VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs

Authors: Kelaiti Xiao, Liang Yang, Paerhati Tulajiang, Hongfei Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19936
Pdf URL: https://arxiv.org/pdf/2503.19936
Copy Paste: [[2503.19936]] VisualQuest: A Diverse Image Dataset for Evaluating Visual Recognition in LLMs(https://arxiv.org/abs/2503.19936)
Keywords: robust, large language model
Abstract: This paper introduces VisualQuest, a novel image dataset designed to assess the ability of large language models (LLMs) to interpret non-traditional, stylized imagery. Unlike conventional photographic benchmarks, VisualQuest challenges models with images that incorporate abstract, symbolic, and metaphorical elements, requiring the integration of domain-specific knowledge and advanced reasoning. The dataset was meticulously curated through multiple stages of filtering, annotation, and standardization to ensure high quality and diversity. Our evaluations using several state-of-the-art multimodal LLMs reveal significant performance variations that underscore the importance of both factual background knowledge and inferential capabilities in visual recognition tasks. VisualQuest thus provides a robust and comprehensive benchmark for advancing research in multimodal reasoning and model architecture design.

Title: Continual Learning With Quasi-Newton Methods

Authors: Steven Vander Eeckt, Hugo Van hamme
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2503.19939
Pdf URL: https://arxiv.org/pdf/2503.19939
Copy Paste: [[2503.19939]] Continual Learning With Quasi-Newton Methods(https://arxiv.org/abs/2503.19939)
Keywords: robust
Abstract: Catastrophic forgetting remains a major challenge when neural networks learn tasks sequentially. Elastic Weight Consolidation (EWC) attempts to address this problem by introducing a Bayesian-inspired regularization loss to preserve knowledge of previously learned tasks. However, EWC relies on a Laplace approximation where the Hessian is simplified to the diagonal of the Fisher information matrix, assuming uncorrelated model parameters. This overly simplistic assumption often leads to poor Hessian estimates, limiting its effectiveness. To overcome this limitation, we introduce Continual Learning with Sampled Quasi-Newton (CSQN), which leverages Quasi-Newton methods to compute more accurate Hessian approximations. CSQN captures parameter interactions beyond the diagonal without requiring architecture-specific modifications, making it applicable across diverse tasks and architectures. Experimental results across four benchmarks demonstrate that CSQN consistently outperforms EWC and other state-of-the-art baselines, including rehearsal-based methods. CSQN reduces EWC's forgetting by 50 percent and improves its performance by 8 percent on average. Notably, CSQN achieves superior results on three out of four benchmarks, including the most challenging scenarios, highlighting its potential as a robust solution for continual learning.
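
Note: the diagonal-Fisher penalty that this abstract describes as EWC's simplification can be sketched as below. This is a minimal illustration of the EWC baseline only, not of CSQN; the function names, the `strength` parameter, and the toy gradients are assumptions made for the example.

```python
def diagonal_fisher(per_sample_grads):
    """Estimate the Fisher diagonal as the mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def ewc_penalty(params, anchor_params, fisher_diag, strength=1.0):
    """Quadratic penalty: (strength / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    Using only the diagonal F_i treats parameters as uncorrelated, which is
    the assumption the abstract criticizes.
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


# Toy example (hypothetical numbers): two parameters, two gradient samples.
grads = [[1.0, 0.0], [3.0, 2.0]]
fisher = diagonal_fisher(grads)
penalty = ewc_penalty([1.0, 1.0], [0.0, 0.0], fisher, strength=2.0)
```

In training, a penalty of this form would be added to the loss of the new task so that parameters with large Fisher values stay close to their previous-task values.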

Title: Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders

Authors: Paul Koch, Jörg Krüger, Ankit Chowdhury, Oliver Heimann
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19947
Pdf URL: https://arxiv.org/pdf/2503.19947
Copy Paste: [[2503.19947]] Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders(https://arxiv.org/abs/2503.19947)
Keywords: extraction, segmentation
Abstract: Generalized metric depth understanding is critical for precise vision-guided robotics, which current state-of-the-art (SOTA) vision-encoders do not support. To address this, we propose Vanishing Depth, a self-supervised training approach that extends pretrained RGB encoders to incorporate and align metric depth into their feature embeddings. Based on our novel positional depth encoding, we enable stable depth density and depth distribution invariant feature extraction. We achieve performance improvements and SOTA results across a spectrum of relevant RGBD downstream tasks - without the necessity of finetuning the encoder. Most notably, we achieve 56.05 mIoU on SUN-RGBD segmentation, 88.3 RMSE on Void's depth completion, and 83.8 Top 1 accuracy on NYUv2 scene classification. In 6D-object pose estimation, we outperform our predecessors of DinoV2, EVA-02, and Omnivore and achieve SOTA results for non-finetuned encoders in several related RGBD downstream tasks.

Title: Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards

Authors: Alexander Gambashidze, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19948
Pdf URL: https://arxiv.org/pdf/2503.19948
Copy Paste: [[2503.19948]] Test-Time Reasoning Through Visual Human Preferences with VLMs and Soft Rewards(https://arxiv.org/abs/2503.19948)
Keywords: explainability
Abstract: Can Visual Language Models (VLMs) effectively capture human visual preferences? This work addresses this question by training VLMs to think about preferences at test time, employing reinforcement learning methods inspired by DeepSeek R1 and OpenAI O1. Using datasets such as ImageReward and Human Preference Score v2 (HPSv2), our models achieve accuracies of 64.9% on the ImageReward test set (trained on the official ImageReward split) and 65.4% on HPSv2 (trained on approximately 25% of its data). These results match traditional encoder-based models while providing transparent reasoning and enhanced generalization. This approach makes it possible to use not only the rich world knowledge of VLMs, but also their potential to think, yielding interpretable outcomes that help decision-making processes. By demonstrating that current VLMs can reason about human visual preferences, we introduce efficient soft-reward strategies for image ranking, outperforming simplistic selection or scoring methods. This reasoning capability enables VLMs to rank arbitrary images, regardless of aspect ratio or complexity, thereby potentially amplifying the effectiveness of visual Preference Optimization. By reducing the need for extensive markup while improving reward generalization and explainability, our findings can be a strong milestone that will enhance text-to-vision models even further.

Title: LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

Authors: Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.19950
Pdf URL: https://arxiv.org/pdf/2503.19950
Copy Paste: [[2503.19950]] LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation(https://arxiv.org/abs/2503.19950)
Keywords: transformer, large language model
Abstract: We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. The implementation is available at this https URL.
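
Note: for readers unfamiliar with 2-bit quantization, the sketch below shows a generic uniform 2-bit (four-level) quantizer for one vector of cache values. It does not implement LogQuant's log-based filtering, which the abstract does not specify in detail; the function names and the min/max scaling scheme are assumptions made for illustration.

```python
def quantize_2bit(values):
    """Map floats to integer codes in {0, 1, 2, 3} using min/max scaling."""
    lo, hi = min(values), max(values)
    # Three steps separate four levels; fall back to 1.0 for constant input.
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def dequantize_2bit(codes, lo, scale):
    """Reconstruct approximate floats from 2-bit codes."""
    return [lo + c * scale for c in codes]


# Toy example (hypothetical values standing in for one KV-cache row).
row = [-1.0, 0.1, 0.9, 2.0]
codes, lo, scale = quantize_2bit(row)
restored = dequantize_2bit(codes, lo, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each code fits in two bits, so four codes pack into one byte; the reconstruction error per value is at most half the step size `scale`.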

Title: ACVUBench: Audio-Centric Video Understanding Benchmark

Authors: Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.19951
Pdf URL: https://arxiv.org/pdf/2503.19951
Copy Paste: [[2503.19951]] ACVUBench: Audio-Centric Video Understanding Benchmark(https://arxiv.org/abs/2503.19951)
Keywords: large language model
Abstract: Audio often serves as an auxiliary modality in video understanding tasks of audio-visual large language models (LLMs), merely assisting in the comprehension of visual information. However, a thorough understanding of videos significantly depends on auditory information, as audio offers critical context, emotional cues, and semantic meaning that visual data alone often lacks. This paper proposes an audio-centric video understanding benchmark (ACVUBench) to evaluate the video comprehension capabilities of multimodal LLMs with a particular focus on auditory information. Specifically, ACVUBench incorporates 2,662 videos spanning 18 different domains with rich auditory information, together with over 13k high-quality human annotated or validated question-answer pairs. Moreover, ACVUBench introduces a suite of carefully designed audio-centric tasks, holistically testing the understanding of both audio content and audio-visual interactions in videos. A thorough evaluation across a diverse range of open-source and proprietary multimodal LLMs is performed, followed by the analyses of deficiencies in audio-visual LLMs. Demos are available at this https URL.

Title: SLIP: Spoof-Aware One-Class Face Anti-Spoofing with Language Image Pretraining

Authors: Pei-Kai Huang, Jun-Xiong Chong, Cheng-Hsuan Chiang, Tzu-Hsien Chen, Tyng-Luh Liu, Chiou-Ting Hsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.19982
Pdf URL: https://arxiv.org/pdf/2503.19982
Copy Paste: [[2503.19982]] SLIP: Spoof-Aware One-Class Face Anti-Spoofing with Language Image Pretraining(https://arxiv.org/abs/2503.19982)
Keywords: security, attack
Abstract: Face anti-spoofing (FAS) plays a pivotal role in ensuring the security and reliability of face recognition systems. With advancements in vision-language pretrained (VLP) models, recent two-class FAS techniques have leveraged the advantages of using VLP guidance, while this potential remains unexplored in one-class FAS methods. The one-class FAS focuses on learning intrinsic liveness features solely from live training images to differentiate between live and spoof faces. However, the lack of spoof training data can lead one-class FAS models to inadvertently incorporate domain information irrelevant to the live/spoof distinction (e.g., facial content), causing performance degradation when tested with a new application domain. To address this issue, we propose a novel framework called Spoof-aware one-class face anti-spoofing with Language Image Pretraining (SLIP). Given that live faces should ideally not be obscured by any spoof-attack-related objects (e.g., paper, or masks) and are assumed to yield zero spoof cue maps, we first propose an effective language-guided spoof cue map estimation to enhance one-class FAS models by simulating whether the underlying faces are covered by attack-related objects and generating corresponding nonzero spoof cue maps. Next, we introduce a novel prompt-driven liveness feature disentanglement to alleviate live/spoof-irrelative domain variations by disentangling live/spoof-relevant and domain-dependent information. Finally, we design an effective augmentation strategy by fusing latent features from live images and spoof prompts to generate spoof-like image features and thus diversify latent spoof features to facilitate the learning of one-class FAS. Our extensive experiments and ablation studies support that SLIP consistently outperforms previous one-class FAS methods.

Title: ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback

Authors: Bohan Zhai, Canwen Xu, Yuxiong He, Zhewei Yao
Subjects: cs.LG, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2503.19988
Pdf URL: https://arxiv.org/pdf/2503.19988
Copy Paste: [[2503.19988]] ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback(https://arxiv.org/abs/2503.19988)
Keywords: large language model
Abstract: Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: ExCoT improves execution accuracy on BIRD dev set from 57.37% to 68.51% and on Spider test set from 78.81% to 86.59% for LLaMA-3 70B, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both BIRD and Spider datasets, notably achieving 68.53% on the BIRD test set.
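
Note: the idea of using execution accuracy as the only feedback signal can be sketched as follows. Candidate queries are executed, and those whose results match the gold query's results are paired against those that do not, giving (chosen, rejected) pairs of the kind DPO consumes. This is a minimal sketch under stated assumptions, not the ExCoT pipeline: the helper names, the toy schema, and the pairing rule are hypothetical.

```python
import sqlite3
from itertools import product


def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if execution fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def build_preference_pairs(conn, gold_sql, candidates):
    """Pair every execution-correct candidate with every incorrect one."""
    gold = run_query(conn, gold_sql)
    results = {c: run_query(conn, c) for c in candidates}
    correct = [c for c in candidates if results[c] == gold]
    wrong = [c for c in candidates if results[c] != gold]
    return list(product(correct, wrong))


# Toy database and candidates (all hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fish (name TEXT, weight REAL)")
conn.executemany("INSERT INTO fish VALUES (?, ?)", [("cod", 2.0), ("eel", 1.0)])

gold_sql = "SELECT name FROM fish WHERE weight > 1.5"
candidates = [
    "SELECT name FROM fish WHERE weight >= 2",  # same rows as gold
    "SELECT name FROM fish",                    # executes, wrong rows
    "SELECT nme FROM fish",                     # fails to execute
]
pairs = build_preference_pairs(conn, gold_sql, candidates)
```

No reward model or human label is involved: the database itself decides which candidate is preferred.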

Title: The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs

Authors: Jonathan Sauder, Viktor Domazetoski, Guilhem Banc-Prandi, Gabriela Perna, Anders Meibom, Devis Tuia
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20000
Pdf URL: https://arxiv.org/pdf/2503.20000
Copy Paste: [[2503.20000]] The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs(https://arxiv.org/abs/2503.20000)
Keywords: segmentation
Abstract: Coral reefs are declining worldwide due to climate change and local stressors. To inform effective conservation or restoration, monitoring at the highest possible spatial and temporal resolution is necessary. Conventional coral reef surveying methods are limited in scalability due to their reliance on expert labor time, motivating the use of computer vision tools to automate the identification and abundance estimation of live corals from images. However, the design and evaluation of such tools has been impeded by the lack of large high quality datasets. We release the Coralscapes dataset, the first general-purpose dense semantic segmentation dataset for coral reefs, covering 2075 images, 39 benthic classes, and 174k segmentation masks annotated by experts. Coralscapes has a similar scope and the same structure as the widely used Cityscapes dataset for urban scene segmentation, allowing benchmarking of semantic segmentation models in a new challenging domain which requires expert knowledge to annotate. We benchmark a wide range of semantic segmentation models, and find that transfer learning from Coralscapes to existing smaller datasets consistently leads to state-of-the-art performance. Coralscapes will catalyze research on efficient, scalable, and standardized coral reef surveying methods based on computer vision, and holds the potential to streamline the development of underwater ecological robotics.

Title: Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception

Authors: Luke Chen, Junyao Wang, Trier Mortlock, Pramod Khargonekar, Mohammad Abdullah Al Faruque
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20011
Pdf URL: https://arxiv.org/pdf/2503.20011
Copy Paste: [[2503.20011]] Hyperdimensional Uncertainty Quantification for Multimodal Uncertainty Fusion in Autonomous Vehicles Perception(https://arxiv.org/abs/2503.20011)
Keywords: segmentation
Abstract: Uncertainty Quantification (UQ) is crucial for ensuring the reliability of machine learning models deployed in real-world autonomous systems. However, existing approaches typically quantify task-level output prediction uncertainty without considering epistemic uncertainty at the multimodal feature fusion level, leading to sub-optimal outcomes. Additionally, popular uncertainty quantification methods, e.g., Bayesian approximations, remain challenging to deploy in practice due to high computational costs in training and inference. In this paper, we propose HyperDUM, a novel deterministic uncertainty method (DUM) that efficiently quantifies feature-level epistemic uncertainty by leveraging hyperdimensional computing. Our method captures the channel and spatial uncertainties through channel-wise and patch-wise projection and bundling techniques, respectively. Multimodal sensor features are then adaptively weighted to mitigate uncertainty propagation and improve feature fusion. Our evaluations show that HyperDUM on average outperforms the state-of-the-art (SOTA) algorithms by up to 2.01%/1.27% in 3D Object Detection and up to 1.29% improvement over baselines in semantic segmentation tasks under various types of uncertainties. Notably, HyperDUM requires 2.36x fewer Floating Point Operations and up to 38.30x fewer parameters than SOTA methods, providing an efficient solution for real-world autonomous systems.

Title: Experience Replay Addresses Loss of Plasticity in Continual Learning

Authors: Jiuqi Wang, Rohan Chandra, Shangtong Zhang
Subjects: cs.LG, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2503.20018
Pdf URL: https://arxiv.org/pdf/2503.20018
Copy Paste: [[2503.20018]] Experience Replay Addresses Loss of Plasticity in Continual Learning(https://arxiv.org/abs/2503.20018)
Keywords: transformer
Abstract: Loss of plasticity is one of the main challenges in continual learning with deep neural networks, where neural networks trained via backpropagation gradually lose their ability to adapt to new tasks and perform significantly worse than their freshly initialized counterparts. The main contribution of this paper is to propose a new hypothesis that experience replay addresses the loss of plasticity in continual learning. Here, experience replay is a form of memory. We provide supporting evidence for this hypothesis. In particular, we demonstrate in multiple different tasks, including regression, classification, and policy evaluation, that by simply adding an experience replay and processing the data in the experience replay with Transformers, the loss of plasticity disappears. Notably, we do not alter any standard components of deep learning. For example, we do not change backpropagation. We do not modify the activation functions. And we do not use any regularization. We conjecture that experience replay and Transformers can address the loss of plasticity because of the in-context learning phenomenon.

Title: Deep Learning Approaches for Blood Disease Diagnosis Across Hematopoietic Lineages

Authors: Gabriel Bo, Justin Gu, Christopher Sun
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2503.20049
Pdf URL: https://arxiv.org/pdf/2503.20049
Copy Paste: [[2503.20049]] Deep Learning Approaches for Blood Disease Diagnosis Across Hematopoietic Lineages(https://arxiv.org/abs/2503.20049)
Keywords: robust, transformer
Abstract: We present a foundation modeling framework that leverages deep learning to uncover latent genetic signatures across the hematopoietic hierarchy. Our approach trains a fully connected autoencoder on multipotent progenitor cells, reducing over 20,000 gene features to a 256-dimensional latent space that captures predictive information for both progenitor and downstream differentiated cells such as monocytes and lymphocytes. We validate the quality of these embeddings by training feed-forward, transformer, and graph convolutional architectures for blood disease diagnosis tasks. We also explore zero-shot prediction using a progenitor disease state classification model to classify downstream cell conditions. Our models achieve greater than 95% accuracy for multi-class classification, and in the zero-shot setting, we achieve greater than 0.7 F1-score on the binary classification task. Future work should improve embeddings further to increase robustness on lymphocyte classification specifically.

Title: Poor Alignment and Steerability of Large Language Models: Evidence from College Admission Essays

Authors: Jinsook Lee, AJ Alvero, Thorsten Joachims, René Kizilcec
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20062
Pdf URL: https://arxiv.org/pdf/2503.20062
Copy Paste: [[2503.20062]] Poor Alignment and Steerability of Large Language Models: Evidence from College Admission Essays(https://arxiv.org/abs/2503.20062)
Keywords: large language model
Abstract: People are increasingly using technologies equipped with large language models (LLM) to write texts for formal communication, which raises two important questions at the intersection of technology and society: Who do LLMs write like (model alignment); and can LLMs be prompted to change who they write like (model steerability). We investigate these questions in the high-stakes context of undergraduate admissions at a selective university by comparing lexical and sentence variation between essays written by 30,000 applicants to two types of LLM-generated essays: one prompted with only the essay question used by the human applicants; and another with additional demographic information about each applicant. We consistently find that both types of LLM-generated essays are linguistically distinct from human-authored essays, regardless of the specific model and analytical approach. Further, prompting a specific sociodemographic identity is remarkably ineffective in aligning the model with the linguistic patterns observed in human writing from this identity group. This holds along the key dimensions of sex, race, first-generation status, and geographic location. The demographically prompted and unprompted synthetic texts were also more similar to each other than to the human text, meaning that prompting did not alleviate homogenization. These issues of model alignment and steerability in current LLMs raise concerns about the use of LLMs in high-stakes contexts.

Title: iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2,959 Crop and Weed Species

Authors: Naitik Jain, Amogh Joshi, Mason Earles
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20068
Pdf URL: https://arxiv.org/pdf/2503.20068
Copy Paste: [[2503.20068]] iNatAg: Multi-Class Classification Models Enabled by a Large-Scale Benchmark Dataset with 4.7M Images of 2,959 Crop and Weed Species(https://arxiv.org/abs/2503.20068)
Keywords: robust, transformer
Abstract: Accurate identification of crop and weed species is critical for precision agriculture and sustainable farming. However, it remains a challenging task due to a variety of factors: a high degree of visual similarity among species, environmental variability, and a continued lack of large, agriculture-specific image data. We introduce iNatAg, a large-scale image dataset which contains over 4.7 million images of 2,959 distinct crop and weed species, with precise annotations along the taxonomic hierarchy from binary crop/weed labels to specific species labels. Curated from the broader iNaturalist database, iNatAg contains data from every continent and accurately reflects the variability of natural image captures and environments. Enabled by this data, we train benchmark models built upon the Swin Transformer architecture and evaluate the impact of various modifications such as the incorporation of geospatial data and LoRA finetuning. Our best models achieve state-of-the-art performance across all taxonomic classification tasks, achieving 92.38% on crop and weed classification. Furthermore, the scale of our dataset enables us to explore misclassifications and unlock new analytic possibilities for plant species. By combining large-scale species coverage, multi-task labels, and geographic diversity, iNatAg provides a new foundation for building robust, geolocation-aware agricultural classification systems. We release the iNatAg dataset publicly through AgML (this https URL), enabling direct access and integration into agricultural machine learning workflows.

Title: Cross-Tokenizer Distillation via Approximate Likelihood Matching

Authors: Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20083
Pdf URL: https://arxiv.org/pdf/2503.20083
Copy Paste: [[2503.20083]] Cross-Tokenizer Distillation via Approximate Likelihood Matching(https://arxiv.org/abs/2503.20083)
Keywords: robust, large language model
Abstract: Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods predominantly require the same tokenizer between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable cross-tokenizer distillation without a next-token prediction loss as the main objective, instead purely maximizing the student predictions' similarity to the teacher's predictions (known as pure distillation), while also being robust to large mismatches between the teacher and the student tokenizer function and vocabulary. Empirically, our method enables substantially improved performance as tested on two use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers. We transfer (subword-level) Llama and Gemma models to byte-level tokenization more effectively than prior methods transfer to a similar subword tokenizer under a comparable training budget. Transferring different base models to the same tokenizer also enables ensembling them (e.g., via averaging their predicted probabilities), which boosts performance. Second, we use our cross-tokenizer distillation method to distil a large maths-specialized LLM into a smaller model, achieving competitive maths problem-solving performance. Overall, our results make substantial strides toward better adaptability and enhanced interaction between different LLMs.
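
Note: the ensembling step mentioned in this abstract (averaging predicted probabilities once models share a tokenizer) can be sketched as below. The dictionary representation of a next-token distribution, the function name, and the toy numbers are assumptions for the example; the abstract does not give further details.

```python
def average_probabilities(distributions):
    """Average next-token distributions that are defined over the same vocabulary."""
    vocab = set(distributions[0])
    if any(set(d) != vocab for d in distributions):
        raise ValueError("all models must share the same vocabulary")
    n = len(distributions)
    return {tok: sum(d[tok] for d in distributions) / n for tok in distributions[0]}


# Toy next-token distributions from two hypothetical models with a shared tokenizer.
model_a = {"a": 0.75, "b": 0.25}
model_b = {"a": 0.25, "b": 0.75}
ensemble = average_probabilities([model_a, model_b])
```

The vocabulary check is the point of the example: averaging is only well defined once every model predicts over the same token set, which is what transferring the models to a common tokenizer provides.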

Title: Can Multi-modal (reasoning) LLMs work as deepfake detectors?

Authors: Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang (Dennis) Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20084
Pdf URL: https://arxiv.org/pdf/2503.20084
Copy Paste: [[2503.20084]] Can Multi-modal (reasoning) LLMs work as deepfake detectors?(https://arxiv.org/abs/2503.20084)
Keywords: robust, interpretability, generative, large language model
Abstract: Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection, such as OpenAI O1/4o, Gemini thinking Flash 2, Deepseek Janus, Grok 3, llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 sonnet. We benchmark 12 of the latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process. Our findings indicate that the best multi-modal LLMs achieve competitive performance with promising zero-shot generalization ability, even surpassing traditional deepfake detection pipelines on out-of-distribution datasets, while the rest of the LLM families perform extremely disappointingly, with some worse than random guessing. Furthermore, we found that newer model versions and reasoning capabilities do not contribute to performance in such niche tasks as deepfake detection, while model size does help in some cases. This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.

Title: Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success

Authors: Sophie Hao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20088
Pdf URL: https://arxiv.org/pdf/2503.20088
Copy Paste: [[2503.20088]] Generative Linguistics, Large Language Models, and the Social Nature of Scientific Success(https://arxiv.org/abs/2503.20088)
Keywords: generative, large language model
Abstract: Chesi's (forthcoming) target paper depicts a generative linguistics in crisis, foreboded by Piantadosi's (2023) declaration that "modern language models refute Chomsky's approach to language." In order to survive, Chesi warns, generativists must hold themselves to higher standards of formal and empirical rigor. This response argues that the crisis described by Chesi and Piantadosi actually has little to do with rigor, but is rather a reflection of generativists' limited social ambitions. Chesi ties the fate of generative linguistics to its intellectual merits, but the current success of language model research is social in nature as much as it is intellectual. In order to thrive, then, generativists must do more than heed Chesi's call for rigor; they must also expand their ambitions by giving outsiders a stake in their future success.

Title: Fundamental Limits of Perfect Concept Erasure

Authors: Somnath Basu Roy Chowdhury, Avinava Dubey, Ahmad Beirami, Rahul Kidambi, Nicholas Monath, Amr Ahmed, Snigdha Chaturvedi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20098
Pdf URL: https://arxiv.org/pdf/2503.20098
Copy Paste: [[2503.20098]] Fundamental Limits of Perfect Concept Erasure(https://arxiv.org/abs/2503.20098)
Keywords: robust, fair
Abstract: Concept erasure is the task of erasing information about a concept (e.g., gender or race) from a representation set while retaining the maximum possible utility -- information from original representations. Concept erasure is useful in several applications, such as removing sensitive concepts to achieve fairness and interpreting the impact of specific concepts on a model's performance. Previous concept erasure techniques have prioritized robustly erasing concepts over retaining the utility of the resultant representations. However, there seems to be an inherent tradeoff between erasure and retaining utility, making it unclear how to achieve perfect concept erasure while maintaining high utility. In this paper, we offer a fresh perspective toward solving this problem by quantifying the fundamental limits of concept erasure through an information-theoretic lens. Using these results, we investigate constraints on the data distribution and the erasure functions required to achieve the limits of perfect concept erasure. Empirically, we show that the derived erasure functions achieve the optimal theoretical bounds. Additionally, we show that our approach outperforms existing methods on a range of synthetic and real-world datasets using GPT-4 representations.

Title: Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion

Authors: Chang Chen, Hany Hamed, Doojin Baek, Taegu Kang, Yoshua Bengio, Sungjin Ahn
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20102
Pdf URL: https://arxiv.org/pdf/2503.20102
Copy Paste: [[2503.20102]] Extendable Long-Horizon Planning via Hierarchical Multiscale Diffusion(https://arxiv.org/abs/2503.20102)
Keywords: diffusion
Abstract: This paper tackles a novel problem, extendable long-horizon planning, which enables agents to plan trajectories longer than those in training data without compounding errors. To tackle this, we propose the Hierarchical Multiscale Diffuser (HM-Diffuser) and Progressive Trajectory Extension (PTE), an augmentation method that iteratively generates longer trajectories by stitching shorter ones. HM-Diffuser trains on these extended trajectories using a hierarchical structure, efficiently handling tasks across multiple temporal scales. Additionally, we introduce Adaptive Plan Pondering and the Recursive HM-Diffuser, which consolidate hierarchical layers into a single model to process temporal scales recursively. Experimental results demonstrate the effectiveness of our approach, advancing diffusion-based planners for scalable long-horizon planning.
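
Note: the idea of building longer trajectories by stitching shorter ones can be illustrated with a toy sketch. The abstract does not say how PTE selects or joins segments, so the rule below (join two trajectories only when one ends where the other starts, within a tolerance) is a simplifying assumption rather than the authors' method, and the names `stitch` and `extend_once` are hypothetical.

```python
def stitch(a, b, tol=1e-6):
    """Join trajectory b onto a if b starts where a ends (within tol).

    Trajectories are lists of state tuples. Returns None when they do not meet.
    Assumption: exact-match joining stands in for whatever PTE really does.
    """
    if not a or not b:
        return None
    gap = max(abs(x - y) for x, y in zip(a[-1], b[0]))
    if gap > tol:
        return None
    # Drop the duplicated junction state so it appears only once.
    return a + b[1:]


def extend_once(trajectories, tol=1e-6):
    """One extension pass: return every valid pairwise stitch."""
    extended = []
    for i, a in enumerate(trajectories):
        for j, b in enumerate(trajectories):
            if i != j:
                joined = stitch(a, b, tol)
                if joined is not None:
                    extended.append(joined)
    return extended


short_a = [(0.0,), (1.0,)]
short_b = [(1.0,), (2.0,)]
longer = extend_once([short_a, short_b])
```

Calling `extend_once` again on the enlarged pool would grow the trajectories further, which is the iterative flavour the abstract describes.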

Title: Bigger But Not Better: Small Neural Language Models Outperform Large Language Models in Detection of Thought Disorder

Authors: Changye Li, Weizhe Xu, Serguei Pakhomov, Ellen Bradley, Dror Ben-Zeev, Trevor Cohen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20103
Pdf URL: https://arxiv.org/pdf/2503.20103
Copy Paste: [[2503.20103]] Bigger But Not Better: Small Neural Language Models Outperform Large Language Models in Detection of Thought Disorder(https://arxiv.org/abs/2503.20103)
Keywords: privacy, large language model
Abstract: Disorganized thinking is a key diagnostic indicator of schizophrenia-spectrum disorders. Recently, clinical estimates of the severity of disorganized thinking have been shown to correlate with measures of how difficult speech transcripts would be for large language models (LLMs) to predict. However, LLMs' deployment challenges -- including privacy concerns, computational and financial costs, and lack of transparency of training data -- limit their clinical utility. We investigate whether smaller neural language models can serve as effective alternatives for detecting positive formal thought disorder, using the same sliding-window-based perplexity measurements that proved effective with larger models. Surprisingly, our results show that smaller models are more sensitive to linguistic differences associated with formal thought disorder than their larger counterparts. Detection capability declines beyond a certain model size and context length, challenging the common assumption of "bigger is better" for LLM-based applications. Our findings generalize across audio diaries and clinical interview speech samples from individuals with psychotic symptoms, suggesting a promising direction for developing efficient, cost-effective, and privacy-preserving screening tools that can be deployed in both clinical and naturalistic settings.
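
Note: a sliding-window perplexity measurement can be sketched as follows. The abstract does not give the window size, stride, or model, so those are free parameters here, and the function name `sliding_window_perplexity` is hypothetical. The sketch assumes per-token natural-log probabilities have already been obtained from some language model; the paper's exact computation may differ.

```python
import math


def sliding_window_perplexity(token_logprobs, window, stride=1):
    """Return one perplexity value per window of per-token log-probabilities.

    token_logprobs: natural-log probabilities a language model assigned to
    each token of a transcript (obtaining them is outside this sketch).
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    scores = []
    last_start = max(len(token_logprobs) - window, 0)
    for start in range(0, last_start + 1, stride):
        chunk = token_logprobs[start:start + window]
        if not chunk:
            break
        # Perplexity is exp of the mean negative log-likelihood in the window.
        mean_nll = -sum(chunk) / len(chunk)
        scores.append(math.exp(mean_nll))
    return scores


# Four tokens, each predicted with probability 0.5.
example_scores = sliding_window_perplexity([math.log(0.5)] * 4, window=2)
```

Higher values mark stretches of speech the model found harder to predict, which is the signal the abstract relates to disorganized thinking.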

Title: "Is There Anything Else?": Examining Administrator Influence on Linguistic Features from the Cookie Theft Picture Description Cognitive Test

Authors: Changye Li, Zhecheng Sheng, Trevor Cohen, Serguei Pakhomov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20104
Pdf URL: https://arxiv.org/pdf/2503.20104
Copy Paste: [[2503.20104]] "Is There Anything Else?": Examining Administrator Influence on Linguistic Features from the Cookie Theft Picture Description Cognitive Test(https://arxiv.org/abs/2503.20104)
Keywords: generative
Abstract: Alzheimer's Disease (AD) dementia is a progressive neurodegenerative disease that negatively impacts patients' cognitive ability. Previous studies have demonstrated that changes in naturalistic language samples can be useful for early screening of AD dementia. However, the nature of language deficits often requires test administrators to use various speech elicitation techniques during spontaneous language assessments to obtain enough propositional utterances from dementia patients. This could lead to an "observer's effect" on the downstream analysis that has not been fully investigated. Our study seeks to quantify the influence of test administrators on linguistic features in dementia assessment using two English corpora of "Cookie Theft" picture descriptions, collected at different locations and with test administrators who show different levels of involvement. Our results show that the level of test administrator involvement significantly impacts observed linguistic features in patient speech. These results suggest that many of the significant linguistic features in the downstream classification task may be partially attributable to differences in the test administration practices rather than solely to participants' cognitive status. The variations in test administrator behavior can lead to systematic biases in linguistic data, potentially confounding research outcomes and clinical assessments. Our study suggests that there is a need for a more standardized test administration protocol in the development of responsible clinical speech analytics frameworks.

Title: From Interpretation to Correction: A Decentralized Optimization Framework for Exact Convergence in Federated Learning

Authors: Bicheng Ying, Zhe Li, Haibo Yang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.20117
Pdf URL: https://arxiv.org/pdf/2503.20117
Copy Paste: [[2503.20117]] From Interpretation to Correction: A Decentralized Optimization Framework for Exact Convergence in Federated Learning(https://arxiv.org/abs/2503.20117)
Keywords: federate
Abstract: This work introduces a novel decentralized framework to interpret federated learning (FL) and, consequently, correct the biases introduced by arbitrary client participation and data heterogeneity, which are two typical traits in practical FL. Specifically, we first reformulate the core processes of FedAvg - client participation, local updating, and model aggregation - as stochastic matrix multiplications. This reformulation allows us to interpret FedAvg as a decentralized algorithm. Leveraging the decentralized optimization framework, we are able to provide a concise analysis to quantify the impact of arbitrary client participation and data heterogeneity on FedAvg's convergence point. This insight motivates the development of Federated Optimization with Exact Convergence via Push-pull Strategy (FOCUS), a novel algorithm inspired by the decentralized algorithm that eliminates these biases and achieves exact convergence without requiring the bounded heterogeneity assumption. Furthermore, we theoretically prove that FOCUS exhibits linear convergence (exponential decay) for both strongly convex and non-convex functions satisfying the Polyak-Lojasiewicz condition, regardless of the arbitrary nature of client participation.
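
Note: the view of FedAvg steps as stochastic matrix multiplications can be illustrated with a small sketch. The abstract does not give the matrices used in the paper, so the construction below is an assumption: models are stacked as rows (row 0 is the server, rows 1..n are clients), aggregation and broadcast are each written as a row-stochastic matrix, and the local update step is omitted. All function names are hypothetical.

```python
def identity(size):
    return [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]


def matmul(a, b):
    """Plain matrix product for lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def aggregation_matrix(n_clients, active):
    """Server row becomes the mean of the active clients' rows."""
    m = identity(n_clients + 1)
    if active:
        m[0] = [0.0] * (n_clients + 1)
        for c in active:
            m[0][c + 1] = 1.0 / len(active)
    return m


def broadcast_matrix(n_clients, active):
    """Each active client's row is overwritten with the server row."""
    m = identity(n_clients + 1)
    for c in active:
        m[c + 1] = [0.0] * (n_clients + 1)
        m[c + 1][0] = 1.0
    return m


# One scalar "model" per node: server, then three clients.
models = [[0.0], [1.0], [5.0], [3.0]]
active = [0, 2]  # clients 0 and 2 participate this round
A = aggregation_matrix(3, active)
B = broadcast_matrix(3, active)
after_round = matmul(B, matmul(A, models))
```

Because every row of `A` and `B` sums to one, a round is a product of stochastic matrices applied to the stacked models, and which clients participate changes only which matrices appear in the product.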

Title: Unlocking the Value of Decentralized Data: A Federated Dual Learning Approach for Model Aggregation

Authors: Junyi Zhu, Ruicong Yao, Taha Ceritli, Savas Ozkan, Matthew B. Blaschko, Eunchung Noh, Jeongwon Min, Cho Jung Min, Mete Ozay
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20138
Pdf URL: https://arxiv.org/pdf/2503.20138
Copy Paste: [[2503.20138]] Unlocking the Value of Decentralized Data: A Federated Dual Learning Approach for Model Aggregation(https://arxiv.org/abs/2503.20138)
Keywords: federate
Abstract: Artificial Intelligence (AI) technologies have revolutionized numerous fields, yet their applications often rely on costly and time-consuming data collection processes. Federated Learning (FL) offers a promising alternative by enabling AI models to be trained on decentralized data where data is scattered across clients (distributed nodes). However, existing FL approaches struggle to match the performance of centralized training due to challenges such as heterogeneous data distribution and communication delays, limiting their potential for breakthroughs. We observe that many real-world use cases involve hybrid data regimes, in which a server (center node) has access to some data while a large amount of data is distributed across associated clients. To improve the utilization of decentralized data under this regime, address the data heterogeneity issue, and facilitate asynchronous communication between the server and clients, we propose a dual learning approach that leverages centralized data at the server to guide the merging of model updates from clients. Our method accommodates scenarios where server data is out-of-domain relative to decentralized client data, making it applicable to a wide range of use cases. We provide theoretical analysis demonstrating the faster convergence of our method compared to existing methods. Furthermore, experimental results across various scenarios show that our approach significantly outperforms existing technologies, highlighting its potential to unlock the value of large amounts of decentralized data.

Title: AIGC-assisted Federated Learning for Edge Intelligence: Architecture Design, Research Challenges and Future Directions

Authors: Xianke Qiang, Zheng Chang, Ying-Chang Liang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.20166
Pdf URL: https://arxiv.org/pdf/2503.20166
Copy Paste: [[2503.20166]] AIGC-assisted Federated Learning for Edge Intelligence: Architecture Design, Research Challenges and Future Directions(https://arxiv.org/abs/2503.20166)
Keywords: security, privacy, federate, diffusion, generative
Abstract: Federated learning (FL) can fully leverage large-scale terminal data while ensuring privacy and security, and is considered a distributed alternative to centralized machine learning. However, the issue of data heterogeneity poses limitations on FL's performance. To address this challenge, artificial intelligence-generated content (AIGC), an innovative data synthesis technique, emerges as one potential solution. In this article, we first provide an overview of the system architecture, performance metrics, and challenges associated with AIGC-assisted FL system design. We then propose the Generative federated learning (GenFL) architecture and present its workflow, including the design of aggregation and weight policy. Finally, using the CIFAR10 and CIFAR100 datasets, we employ diffusion models to generate datasets and improve FL performance. Experiments conducted under various non-independent and identically distributed (non-IID) data distributions demonstrate the effectiveness of GenFL in overcoming the bottlenecks in FL caused by data heterogeneity. Open research directions in the research of AIGC-assisted FL are also discussed.

Title: Guiding Human-Object Interactions with Rich Geometry and Relations

Authors: Mengqing Xue, Yifei Liu, Ling Guo, Shaoli Huang, Changxing Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20172
Pdf URL: https://arxiv.org/pdf/2503.20172
Copy Paste: [[2503.20172]] Guiding Human-Object Interactions with Rich Geometry and Relations(https://arxiv.org/abs/2503.20172)
Keywords: robust, diffusion
Abstract: Human-object interaction (HOI) synthesis is crucial for creating immersive and realistic experiences for applications such as virtual reality. Existing methods often rely on simplified object representations, such as the object's centroid or the nearest point to a human, to achieve physically plausible motions. However, these approaches may overlook geometric complexity, resulting in suboptimal interaction fidelity. To address this limitation, we introduce ROG, a novel diffusion-based framework that models the spatiotemporal relationships inherent in HOIs with rich geometric detail. For efficient object representation, we select boundary-focused and fine-detail key points from the object mesh, ensuring a comprehensive depiction of the object's geometry. This representation is used to construct an interactive distance field (IDF), capturing the robust HOI dynamics. Furthermore, we develop a diffusion-based relation model that integrates spatial and temporal attention mechanisms, enabling a better understanding of intricate HOI relationships. This relation model refines the generated motion's IDF, guiding the motion generation process to produce relation-aware and semantically aligned movements. Experimental evaluations demonstrate that ROG significantly outperforms state-of-the-art methods in the realism and semantic accuracy of synthesized HOIs.

Title: Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration

Authors: Shihao Zhou, Dayu Li, Jinshan Pan, Juncheng Zhou, Jinglei Shi, Jufeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20174
Pdf URL: https://arxiv.org/pdf/2503.20174
Copy Paste: [[2503.20174]] Devil is in the Uniformity: Exploring Diverse Learners within Transformer for Image Restoration(https://arxiv.org/abs/2503.20174)
Keywords: transformer
Abstract: Transformer-based approaches have gained significant attention in image restoration, where the core component, i.e., Multi-Head Attention (MHA), plays a crucial role in capturing diverse features and recovering high-quality results. In MHA, heads perform attention calculation independently from uniform split subspaces, and a redundancy issue is triggered to hinder the model from achieving satisfactory outputs. In this paper, we propose to improve MHA by exploring diverse learners and introducing various interactions between heads, which results in a Hierarchical multI-head atteNtion driven Transformer model, termed HINT, for image restoration. HINT contains two modules, i.e., the Hierarchical Multi-Head Attention (HMHA) and the Query-Key Cache Updating (QKCU) module, to address the redundancy problem that is rooted in vanilla MHA. Specifically, HMHA extracts diverse contextual features by employing heads to learn from subspaces of varying sizes and containing different information. Moreover, QKCU, comprising intra- and inter-layer schemes, further reduces the redundancy problem by facilitating enhanced interactions between attention heads within and across layers. Extensive experiments are conducted on 12 benchmarks across 5 image restoration tasks, including low-light enhancement, dehazing, desnowing, denoising, and deraining, to demonstrate the superiority of HINT. The source code is available in the supplementary materials.

Title: Offline Reinforcement Learning with Discrete Diffusion Skills

Authors: RuiXi Qiao, Jie Cheng, Xingyuan Dai, Yonglin Tian, Yisheng Lv
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20176
Pdf URL: https://arxiv.org/pdf/2503.20176
Copy Paste: [[2503.20176]] Offline Reinforcement Learning with Discrete Diffusion Skills(https://arxiv.org/abs/2503.20176)
Keywords: interpretability, diffusion, transformer
Abstract: Skills have been introduced to offline reinforcement learning (RL) as temporal abstractions to tackle complex, long-horizon tasks, promoting consistent behavior and enabling meaningful exploration. While skills in offline RL are predominantly modeled within a continuous latent space, the potential of discrete skill spaces remains largely underexplored. In this paper, we propose a compact discrete skill space for offline RL tasks supported by a state-of-the-art transformer-based encoder and a diffusion-based decoder. Coupled with a high-level policy trained via offline RL techniques, our method establishes a hierarchical RL framework where the trained diffusion decoder plays a pivotal role. Empirical evaluations show that the proposed algorithm, Discrete Diffusion Skill (DDS), is a powerful offline RL method. DDS performs competitively on Locomotion and Kitchen tasks and excels on long-horizon tasks, achieving at least a 12 percent improvement on AntMaze-v2 benchmarks compared to existing offline RL approaches. Furthermore, DDS offers improved interpretability, training stability, and online exploration compared to previous skill-based methods.

Title: Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs

Authors: Huanhuan Ma, Haisong Gong, Xiaoyuan Yi, Xing Xie, Dongkuan Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20182
Pdf URL: https://arxiv.org/pdf/2503.20182
Copy Paste: [[2503.20182]] Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs(https://arxiv.org/abs/2503.20182)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have led to their increasing integration into human life. With the transition from mere tools to human-like assistants, understanding their psychological aspects, such as emotional tendencies and personalities, becomes essential for ensuring their trustworthiness. However, current psychological evaluations of LLMs, often based on human psychological assessments like the BFI, face significant limitations. The results from these approaches often lack reliability and have limited validity when predicting LLM behavior in real-world scenarios. In this work, we introduce a novel evaluation instrument specifically designed for LLMs, called Core Sentiment Inventory (CSI). CSI is a bilingual tool, covering both English and Chinese, that implicitly evaluates models' sentiment tendencies, providing an insightful psychological portrait of LLMs across three dimensions: optimism, pessimism, and neutrality. Through extensive experiments, we demonstrate that: 1) CSI effectively captures nuanced emotional patterns, revealing significant variation in LLMs across languages and contexts; 2) Compared to current approaches, CSI significantly improves reliability, yielding more consistent results; and 3) The correlation between CSI scores and the sentiment of LLMs' real-world outputs exceeds 0.85, demonstrating its strong validity in predicting LLM behavior. We make CSI publicly available via: this https URL.

Title: Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector

Authors: Xiao Guo, Xiufeng Song, Yue Zhang, Xiaohong Liu, Xiaoming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20188
Pdf URL: https://arxiv.org/pdf/2503.20188
Copy Paste: [[2503.20188]] Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector(https://arxiv.org/abs/2503.20188)
Keywords: interpretability, explainability, large language model
Abstract: Deepfake detection is a long-established research topic vital for mitigating the spread of malicious misinformation. Unlike prior methods that provide either binary classification results or textual explanations separately, we introduce a novel method capable of generating both simultaneously. Our method harnesses the multi-modal learning capability of the pre-trained CLIP and the unprecedented interpretability of large language models (LLMs) to enhance both the generalization and explainability of deepfake detection. Specifically, we introduce a multi-modal face forgery detector (M2F2-Det) that employs tailored face forgery prompt learning, incorporating the pre-trained CLIP to improve generalization to unseen forgeries. Also, M2F2-Det incorporates an LLM to provide detailed textual explanations of its detection decisions, enhancing interpretability by bridging the gap between natural language and subtle cues of facial forgeries. Empirically, we evaluate M2F2-Det on both detection and explanation generation tasks, where it achieves state-of-the-art performance, demonstrating its effectiveness in identifying and explaining diverse forgeries.

Title: Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology

Authors: Yuxuan Chen, Jiawen Li, Jiali Hu, Xitong Ling, Tian Guan, Anjia Han, Yonghong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20190
Pdf URL: https://arxiv.org/pdf/2503.20190
Copy Paste: [[2503.20190]] Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology(https://arxiv.org/abs/2503.20190)
Keywords: large language model
Abstract: With the rapid advancement of pathology foundation models (FMs), the representation learning of whole slide images (WSIs) attracts increasing attention. Existing studies develop high-quality patch feature extractors and employ carefully designed aggregation schemes to derive slide-level representations. However, mainstream weakly supervised slide representation learning methods, primarily based on multiple instance learning (MIL), are tailored to specific downstream tasks, which limits their generalizability. To address this issue, some studies explore unsupervised slide representation learning. However, these approaches focus solely on the visual modality of patches, neglecting the rich semantic information embedded in textual data. In this work, we propose ProAlign, a cross-modal unsupervised slide representation learning framework. Specifically, we leverage a large language model (LLM) to generate descriptive text for the prototype types present in a WSI, introducing patch-text contrast to construct initial prototype embeddings. Furthermore, we propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings applicable to a wide range of downstream tasks. Extensive experiments on four public datasets show that ProAlign outperforms existing unsupervised frameworks and achieves performance comparable to some weakly supervised models.
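
Note: one way to read "parameter-free attention aggregation" is similarity-weighted pooling with no learned weights. The abstract does not specify ProAlign's aggregation rule, so everything below is an assumption made for illustration: cosine similarity between patches and prototypes, a softmax over patches with a fixed temperature, and concatenation of the pooled vectors. The function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(values, temperature=1.0):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_slide(patches, prototypes, temperature=0.1):
    """Pool patch embeddings once per prototype and concatenate the results.

    patches and prototypes are lists of equal-length vectors. Nothing here is
    trained: the attention weights come only from patch-prototype similarity.
    """
    slide = []
    for proto in prototypes:
        weights = softmax([cosine(p, proto) for p in patches], temperature)
        pooled = [sum(w * p[d] for w, p in zip(weights, patches))
                  for d in range(len(proto))]
        slide.extend(pooled)
    return slide


patches = [[1.0, 0.0], [0.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
slide_embedding = aggregate_slide(patches, prototypes)
```

The resulting vector has length (number of prototypes) x (embedding dimension), and each block is dominated by the patches most similar to its prototype.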

Title: GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization

Authors: Zhouhong Gu, Xingzhou Chen, Xiaoran Shi, Tao Wang, Suhang Zheng, Tianyu Li, Hongwei Feng, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20194
Pdf URL: https://arxiv.org/pdf/2503.20194
Copy Paste: [[2503.20194]] GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization(https://arxiv.org/abs/2503.20194)
Keywords: robust, generative, large language model
Abstract: Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO's superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO's unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs. Code is available at this https URL.

Title: Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models

Authors: Alex Jinpeng Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20198
Pdf URL: https://arxiv.org/pdf/2503.20198
Copy Paste: [[2503.20198]] Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models(https://arxiv.org/abs/2503.20198)
Keywords: robust, diffusion, generative
Abstract: Recent advancements in autoregressive and diffusion models have led to strong performance in image generation with short scene text words. However, generating coherent, long-form text in images, such as paragraphs in slides or documents, remains a major challenge for current generative models. We present the first work specifically focused on long text image generation, addressing a critical gap in existing text-to-image systems that typically handle only brief phrases or single sentences. Through comprehensive analysis of state-of-the-art autoregressive generation models, we identify the image tokenizer as a critical bottleneck in text generation quality. To address this, we introduce a novel text-focused, binary tokenizer optimized for capturing detailed scene text features. Leveraging our tokenizer, we develop \ModelName, a multimodal autoregressive model that excels in generating high-quality long-text images with unprecedented fidelity. Our model offers robust controllability, enabling customization of text properties such as font style, size, color, and alignment. Extensive experiments demonstrate that \ModelName significantly outperforms SD3.5 Large and GPT4o with DALL-E 3 in generating long text accurately, consistently, and flexibly. Beyond its technical achievements, \ModelName opens up exciting opportunities for innovative applications like interleaved document and PowerPoint generation, establishing a new frontier in long-text image generation.

Title: Assessing SAM for Tree Crown Instance Segmentation from Drone Imagery

Authors: Mélisande Teng, Arthur Ouaknine, Etienne Laliberté, Yoshua Bengio, David Rolnick, Hugo Larochelle
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20199
Pdf URL: https://arxiv.org/pdf/2503.20199
Copy Paste: [[2503.20199]] Assessing SAM for Tree Crown Instance Segmentation from Drone Imagery(https://arxiv.org/abs/2503.20199)
Keywords: segmentation
Abstract: The potential of tree planting as a natural climate solution is often undermined by inadequate monitoring of tree planting projects. Current monitoring methods involve measuring trees by hand for each species, requiring extensive cost, time, and labour. Advances in drone remote sensing and computer vision offer great potential for mapping and characterizing trees from aerial imagery, and large pre-trained vision models, such as the Segment Anything Model (SAM), may be a particularly compelling choice given limited labeled data. In this work, we compare SAM methods for the task of automatic tree crown instance segmentation in high resolution drone imagery of young tree plantations. We explore the potential of SAM for this task, and find that methods using SAM out-of-the-box do not outperform a custom Mask R-CNN, even with well-designed prompts, but that there is potential for methods which tune SAM further. We also show that predictions can be improved by adding Digital Surface Model (DSM) information as an input.

Title: SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain

Authors: Nan Gao, Yihua Bao, Dongdong Weng, Jiayi Zhao, Jia Li, Yan Zhou, Pengfei Wan, Di Zhang
Subjects: cs.CL, cs.AI, cs.HC, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20202
Pdf URL: https://arxiv.org/pdf/2503.20202
Copy Paste: [[2503.20202]] SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain(https://arxiv.org/abs/2503.20202)
Keywords: large language model
Abstract: Co-speech gesture generation enhances human-computer interaction realism through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to parse speech content and generate reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we constructed a comprehensive co-speech gesture ethogram and developed an LLM-based intent chain reasoning mechanism that systematically parses and decomposes gesture semantics into structured inference steps following ethogram criteria, effectively guiding LLMs to generate context-aware gesture labels. Subsequently, we constructed an intent chain-annotated text-to-gesture label dataset and trained a lightweight gesture label generation model, which then guides the generation of credible and semantically coherent co-speech gestures. Experimental results demonstrate that SARGes achieves highly semantically-aligned gesture labeling (50.2% accuracy) with efficient single-pass inference (0.4 seconds). The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.

Title: BEAR: A Video Dataset For Fine-grained Behaviors Recognition Oriented with Action and Environment Factors

Authors: Chengyang Hu, Yuduo Chen, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20209
Pdf URL: https://arxiv.org/pdf/2503.20209
Copy Paste: [[2503.20209]] BEAR: A Video Dataset For Fine-grained Behaviors Recognition Oriented with Action and Environment Factors(https://arxiv.org/abs/2503.20209)
Keywords: fair
Abstract: Behavior recognition is an important task in video representation learning. An essential aspect pertains to effective feature learning conducive to behavior recognition. Recently, researchers have started to study fine-grained behavior recognition, which provides similar behaviors and encourages the model to attend to more details of behaviors, with effective features for distinction. However, previous fine-grained behavior datasets limited themselves to controlling partial information to be similar, leading to an unfair and incomplete evaluation of existing works. In this work, we develop a new video fine-grained behavior dataset, named BEAR, which provides fine-grained (i.e., similar) behaviors that uniquely focus on two primary factors defining behavior: Environment and Action. It includes two fine-grained behavior protocols, Fine-grained Behavior with Similar Environments and Fine-grained Behavior with Similar Actions, as well as multiple sub-protocols as different scenarios. Furthermore, with this new dataset, we conduct multiple experiments with different behavior recognition models. Our research primarily explores the impact of input modality, a critical element in studying the environmental and action-based aspects of behavior recognition. Our experimental results yield intriguing insights that have substantial implications for further research endeavors.

Title: Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors

Authors: Weilong Yan, Ming Li, Haipeng Li, Shuwei Shao, Robby T. Tan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20211
Pdf URL: https://arxiv.org/pdf/2503.20211
Copy Paste: [[2503.20211]] Synthetic-to-Real Self-supervised Robust Depth Estimation via Learning with Motion and Structure Priors(https://arxiv.org/abs/2503.20211)
Keywords: robust
Abstract: Self-supervised depth estimation from monocular cameras in diverse outdoor conditions, such as daytime, rain, and nighttime, is challenging due to the difficulty of learning universal representations and the severe lack of labeled real-world adverse data. Previous methods either rely on synthetic inputs and pseudo-depth labels or directly apply daytime strategies to adverse conditions, resulting in suboptimal results. In this paper, we present the first synthetic-to-real robust depth estimation framework, incorporating motion and structure priors to capture real-world knowledge effectively. In the synthetic adaptation, we transfer motion-structure knowledge inside cost volumes for a more robust representation, using a frozen daytime model to train a depth estimator in synthetic adverse conditions. In the innovative real adaptation, which aims to close synthetic-real gaps, models trained earlier identify the weather-insensitive regions with a designed consistency-reweighting strategy to emphasize valid pseudo-labels. We introduce a new regularization by gathering explicit depth distributions to constrain the model when facing real-world data. Experiments show that our method outperforms the state-of-the-art across diverse conditions in multi-frame and single-frame evaluations. We achieve improvements of 7.5% and 4.3% in AbsRel and RMSE on average for nuScenes and RobotCar datasets (daytime, nighttime, rain). In zero-shot evaluation of DrivingStereo (rain, fog), our method generalizes better than the previous ones.
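
Note: AbsRel and RMSE are standard depth-estimation error metrics, and a minimal sketch of their usual definitions is given below. The abstract does not describe the evaluation protocol (depth caps, scaling, valid-pixel masks), so the only masking assumed here is skipping pixels whose ground-truth depth is not positive. The function name `depth_errors` is hypothetical.

```python
import math


def depth_errors(pred, gt):
    """Return (AbsRel, RMSE) over pixels with positive ground-truth depth.

    AbsRel = mean(|pred - gt| / gt)
    RMSE   = sqrt(mean((pred - gt) ** 2))
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / len(pairs)
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))
    return abs_rel, rmse


# The third pixel has no ground truth (0.0) and is ignored.
abs_rel, rmse = depth_errors([2.0, 4.0, 9.0], [1.0, 4.0, 0.0])
```

Both are errors, so lower is better, and the improvements quoted in the abstract are reductions in these values.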

Title: Qwen2.5-Omni Technical Report

Authors: Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, Junyang Lin
Subjects: cs.CL, cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.20215
Pdf URL: https://arxiv.org/pdf/2503.20215
Copy Paste: [[2503.20215]] Qwen2.5-Omni Technical Report(https://arxiv.org/abs/2503.20215)
Keywords: robust, large language model
Abstract: In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

Title: Video Motion Graphs

Authors: Haiyang Liu, Zhan Xu, Fa-Ting Hong, Hsin-Ping Huang, Yi Zhou, Yang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20218
Pdf URL: https://arxiv.org/pdf/2503.20218
Copy Paste: [[2503.20218]] Video Motion Graphs(https://arxiv.org/abs/2503.20218)
Keywords: robust, diffusion, generative
Abstract: We present Video Motion Graphs, a system designed to generate realistic human motion videos. Using a reference video and conditional signals such as music or motion tags, the system synthesizes new videos by first retrieving video clips with gestures matching the conditions and then generating interpolation frames to seamlessly connect clip boundaries. The core of our approach is HMInterp, a robust Video Frame Interpolation (VFI) model that enables seamless interpolation of discontinuous frames, even for complex motion scenarios like dancing. HMInterp (i) employs a dual-branch interpolation approach, combining a Motion Diffusion Model for human skeleton motion interpolation with a diffusion-based video frame interpolation model for final frame generation, and (ii) adopts condition-progressive training to effectively leverage strong and weak identity conditions, such as images and pose. These designs ensure both high video texture quality and accurate motion trajectory. Results show that our Video Motion Graphs outperforms existing generative- and retrieval-based methods for multi-modal conditioned human motion video generation. Project page can be found at this https URL

Title: DINeMo: Learning Neural Mesh Models with no 3D Annotations

Authors: Weijie Guo, Guofeng Zhang, Wufei Ma, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20220
Pdf URL: https://arxiv.org/pdf/2503.20220
Copy Paste: [[2503.20220]] DINeMo: Learning Neural Mesh Models with no 3D Annotations(https://arxiv.org/abs/2503.20220)
Keywords: robust
Abstract: Category-level 3D/6D pose estimation is a crucial step towards comprehensive 3D scene understanding, which would enable a broad range of applications in robotics and embodied AI. Recent works explored neural mesh models that approach a range of 2D and 3D tasks from an analysis-by-synthesis perspective. Despite the largely enhanced robustness to partial occlusion and domain shifts, these methods depended heavily on 3D annotations for part-contrastive learning, which confines them to a narrow set of categories and hinders efficient scaling. In this work, we present DINeMo, a novel neural mesh model that is trained with no 3D annotations by leveraging pseudo-correspondence obtained from large visual foundation models. We adopt a bidirectional pseudo-correspondence generation method, which produces pseudo-correspondence using both local appearance features and global context information. Experimental results on car datasets demonstrate that our DINeMo outperforms previous zero- and few-shot 3D pose estimation by a wide margin, narrowing the gap with fully-supervised methods by 67.3%. Our DINeMo also scales effectively and efficiently when incorporating more unlabeled images during training, which demonstrates its advantages over supervised learning methods that rely on 3D annotations. Our project page is available at this https URL.

Title: Advancements in Natural Language Processing: Exploring Transformer-Based Architectures for Text Understanding

Authors: Tianhao Wu, Yu Wang, Ngoc Quach
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20227
Pdf URL: https://arxiv.org/pdf/2503.20227
Copy Paste: [[2503.20227]] Advancements in Natural Language Processing: Exploring Transformer-Based Architectures for Text Understanding(https://arxiv.org/abs/2503.20227)
Keywords: transformer
Abstract: Natural Language Processing (NLP) has witnessed a transformative leap with the advent of transformer-based architectures, which have significantly enhanced the ability of machines to understand and generate human-like text. This paper explores the advancements in transformer models, such as BERT and GPT, focusing on their superior performance in text understanding tasks compared to traditional methods like recurrent neural networks (RNNs). By analyzing statistical properties through visual representations (including probability density functions of text length distributions and feature space classifications), the study highlights the models' proficiency in handling long-range dependencies, adapting to conditional shifts, and extracting features for classification, even with overlapping classes. Drawing on recent 2024 research, including enhancements in multi-hop knowledge graph reasoning and context-aware chat interactions, the paper outlines a methodology involving data preparation, model selection, pretraining, fine-tuning, and evaluation. The results demonstrate state-of-the-art performance on benchmarks like GLUE and SQuAD, with F1 scores exceeding 90%, though challenges such as high computational costs persist. This work underscores the pivotal role of transformers in modern NLP and suggests future directions, including efficiency optimization and multimodal integration, to further advance language-based AI systems.

Title: TeleLoRA: Teleporting Model-Specific Alignment Across LLMs

Authors: Xiao Lin, Manoj Acharya, Anirban Roy, Susmit Jha
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.20228
Pdf URL: https://arxiv.org/pdf/2503.20228
Copy Paste: [[2503.20228]] TeleLoRA: Teleporting Model-Specific Alignment Across LLMs(https://arxiv.org/abs/2503.20228)
Keywords: attack, large language model
Abstract: Mitigating Trojans in Large Language Models (LLMs) is one of many tasks where alignment data is LLM specific, as different LLMs have different Trojan triggers and trigger behaviors to be removed. In this paper, we introduce TeleLoRA (Teleporting Low-Rank Adaptation), a novel framework that synergizes model-specific alignment data across multiple LLMs to enable zero-shot Trojan mitigation on unseen LLMs without alignment data. TeleLoRA learns a unified generator of LoRA adapter weights by leveraging local activation information across multiple LLMs. This generator is designed to be permutation symmetric to generalize across models with different architectures and sizes. We optimize the model design for memory efficiency, making it feasible to learn with large-scale LLMs with minimal computational resources. Experiments on LLM Trojan mitigation benchmarks demonstrate that TeleLoRA effectively reduces attack success rates while preserving the benign performance of the models.

Title: TraNCE: Transformative Non-linear Concept Explainer for CNNs

Authors: Ugochukwu Ejike Akpudo, Yongsheng Gao, Jun Zhou, Andrew Lewis
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20230
Pdf URL: https://arxiv.org/pdf/2503.20230
Copy Paste: [[2503.20230]] TraNCE: Transformative Non-linear Concept Explainer for CNNs(https://arxiv.org/abs/2503.20230)
Keywords: explainability
Abstract: Convolutional neural networks (CNNs) have succeeded remarkably in various computer vision tasks. However, they are not intrinsically explainable. While the feature-level understanding of CNNs reveals where the models looked, concept-based explainability methods provide insights into what the models saw. However, their assumption of linear reconstructability of image activations fails to capture the intricate relationships within these activations. Their Fidelity-only approach to evaluating global explanations also presents a new concern. For the first time, we address these limitations with the novel Transformative Nonlinear Concept Explainer (TraNCE) for CNNs. Unlike linear reconstruction assumptions made by existing methods, TraNCE captures the intricate relationships within the activations. This study presents three original contributions to the CNN explainability literature: (i) An automatic concept discovery mechanism based on variational autoencoders (VAEs). This transformative concept discovery process enhances the identification of meaningful concepts from image activations. (ii) A visualization module that leverages the Bessel function to create a smooth transition between prototypical image pixels, revealing not only what the CNN saw but also what the CNN avoided, thereby mitigating the challenges of concept duplication as documented in previous works. (iii) A new metric, the Faith score, integrates both Coherence and Fidelity for a comprehensive evaluation of explainer faithfulness and consistency.

Title: Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection

Authors: Ahyun Seo, Minsu Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20235
Pdf URL: https://arxiv.org/pdf/2503.20235
Copy Paste: [[2503.20235]] Leveraging 3D Geometric Priors in 2D Rotation Symmetry Detection(https://arxiv.org/abs/2503.20235)
Keywords: robust, segmentation
Abstract: Symmetry plays a vital role in understanding structural patterns, aiding object recognition and scene interpretation. This paper focuses on rotation symmetry, where objects remain unchanged when rotated around a central axis, requiring detection of rotation centers and supporting vertices. Traditional methods relied on hand-crafted feature matching, while recent segmentation models based on convolutional neural networks detect rotation centers but struggle with 3D geometric consistency due to viewpoint distortions. To overcome this, we propose a model that directly predicts rotation centers and vertices in 3D space and projects the results back to 2D while preserving structural integrity. By incorporating a vertex reconstruction stage enforcing 3D geometric priors -- such as equal side lengths and interior angles -- our model enhances robustness and accuracy. Experiments on the DENDI dataset show superior performance in rotation axis detection and validate the impact of 3D priors through ablation studies.

Title: Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Authors: Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20240
Pdf URL: https://arxiv.org/pdf/2503.20240
Copy Paste: [[2503.20240]] Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models(https://arxiv.org/abs/2503.20240)
Keywords: diffusion
Abstract: Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious reason for degrading the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.
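
Note: the replacement described in the abstract can be illustrated with the standard classifier-free guidance combination, eps = eps_uncond + w * (eps_cond - eps_uncond). The sketch below is only a toy under stated assumptions: the function name, the variable names, the guidance scale, and the two-element "noise" lists are hypothetical stand-ins for real network outputs, and this is not the authors' implementation.

```python
def cfg_noise(eps_cond, eps_uncond, guidance_scale):
    """Standard CFG combination: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Hypothetical noise predictions at one denoising step.
eps_cond_finetuned = [0.5, -0.2]    # conditional prediction, fine-tuned model
eps_uncond_finetuned = [0.4, 0.1]   # unconditional prediction, fine-tuned model
eps_uncond_base = [0.1, 0.0]        # unconditional prediction, base model

# Usual practice: both terms come from the fine-tuned network.
standard = cfg_noise(eps_cond_finetuned, eps_uncond_finetuned, 3.0)

# Replacement described in the abstract: only the unconditional term is
# swapped for the base model's prediction.
replaced = cfg_noise(eps_cond_finetuned, eps_uncond_base, 3.0)

print(standard)
print(replaced)
```

Only the unconditional term differs between the two calls, so the change needs no retraining: it costs one extra forward pass through the base model per sampling step.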

Title: Software Vulnerability Analysis Across Programming Language and Program Representation Landscapes: A Survey

Authors: Zhuoyun Qian, Fangtian Zhong, Qin Hu, Yili Jiang, Jiaqi Huang, Mengfei Ren, Jiguo Yu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.20244
Pdf URL: https://arxiv.org/pdf/2503.20244
Copy Paste: [[2503.20244]] Software Vulnerability Analysis Across Programming Language and Program Representation Landscapes: A Survey(https://arxiv.org/abs/2503.20244)
Keywords: security, attack
Abstract: Modern software systems are developed in diverse programming languages and often harbor critical vulnerabilities that attackers can exploit to compromise security. These vulnerabilities have been actively targeted in real-world attacks, causing substantial harm to users and cyberinfrastructure. Since many of these flaws originate from the code itself, a variety of techniques have been proposed to detect and mitigate them prior to software deployment. However, a comprehensive comparative study that spans different programming languages, program representations, bug types, and analysis techniques is still lacking. As a result, the relationships among programming languages, abstraction levels, vulnerability types, and detection approaches remain fragmented, and the limitations and research gaps across the landscape are not clearly understood. This article aims to bridge that gap by systematically examining widely used programming languages, levels of program representation, categories of vulnerabilities, and mainstream detection techniques. The survey provides a detailed understanding of current practices in vulnerability discovery, highlighting their strengths, limitations, and distinguishing characteristics. Furthermore, it identifies persistent challenges and outlines promising directions for future research in the field of software security.

Title: LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions

Authors: Yejin Kwon, Daeun Moon, Youngje Oh, Hyunsoo Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20252
Pdf URL: https://arxiv.org/pdf/2503.20252
Copy Paste: [[2503.20252]] LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions(https://arxiv.org/abs/2503.20252)
Keywords: explainability
Abstract: Anomaly Detection (AD) focuses on detecting samples that differ from the standard pattern, making it a vital tool in process control. Logical anomalies may appear visually normal yet violate predefined constraints on object presence, arrangement, or quantity, so detecting them depends on reasoning and explainability. We introduce LogicQA, a framework that enhances AD by providing industrial operators with explanations for logical anomalies. LogicQA compiles automatically generated questions into a checklist and collects responses to identify violations of logical constraints. LogicQA is training-free, annotation-free, and operates in a few-shot setting. We achieve state-of-the-art (SOTA) Logical AD performance on the public benchmark MVTec LOCO AD, with an AUROC of 87.6 percent and an F1-max of 87.0 percent, along with explanations of the anomalies. Also, our approach has shown outstanding performance on semiconductor SEM corporate data, further validating its effectiveness in industrial applications.

Title: How Secure is Forgetting? Linking Machine Unlearning to Machine Learning Attacks

Authors: Muhammed Shafi K. P., Serena Nicolazzo, Antonino Nocera, Vinod P
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.20257
Pdf URL: https://arxiv.org/pdf/2503.20257
Copy Paste: [[2503.20257]] How Secure is Forgetting? Linking Machine Unlearning to Machine Learning Attacks(https://arxiv.org/abs/2503.20257)
Keywords: secure, security, privacy, attack, membership infer
Abstract: As Machine Learning (ML) evolves, the complexity and sophistication of security threats against this paradigm continue to grow as well, threatening data privacy and model integrity. In response, Machine Unlearning (MU) is a recent technology that aims to remove the influence of specific data from a trained model, enabling compliance with privacy regulations and user requests. This can be done for privacy compliance (e.g., GDPR's right to be forgotten) or model refinement. However, the intersection between classical threats in ML and MU remains largely unexplored. In this Systematization of Knowledge (SoK), we provide a structured analysis of security threats in ML and their implications for MU. We analyze four major attack classes, namely Backdoor Attacks, Membership Inference Attacks (MIA), Adversarial Attacks, and Inversion Attacks; we investigate their impact on MU and propose a novel classification based on how they are usually used in this context. Finally, we identify open challenges, including ethical considerations, and explore promising future research directions, paving the way for future research in secure and privacy-preserving Machine Unlearning.

Title: Revisit Time Series Classification Benchmark: The Impact of Temporal Information for Classification

Authors: Yunrui Zhang, Gustavo Batista, Salil S. Kanhere
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20264
Pdf URL: https://arxiv.org/pdf/2503.20264
Copy Paste: [[2503.20264]] Revisit Time Series Classification Benchmark: The Impact of Temporal Information for Classification(https://arxiv.org/abs/2503.20264)
Keywords: robust, fair
Abstract: Time series classification is usually regarded as a distinct task from tabular data classification due to the importance of temporal information. However, in this paper, by performing permutation tests that disrupt temporal information on the UCR time series classification archive, the most widely used benchmark for time series classification, we identify a significant proportion of datasets where temporal information has little to no impact on classification. Many of these datasets are tabular in nature or rely mainly on tabular features, leading to potentially biased evaluations of time series classifiers focused on temporal information. To address this, we propose UCR Augmented, a benchmark based on the UCR time series classification archive designed to evaluate classifiers' ability to extract and utilize temporal information. Testing classifiers from seven categories on this benchmark revealed notable shifts in performance rankings. Some previously overlooked approaches perform well, while others see their performance decline significantly when temporal information is crucial. UCR Augmented provides a more robust framework for assessing time series classifiers, ensuring fairer evaluations. Our code is available at this https URL.
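
Note: the abstract does not spell out the permutation protocol, so the following is only a minimal sketch of the general idea. It assumes equal-length series and applies one shared random permutation of time indices to every series. The function names and the two toy features are hypothetical; the second feature simply shows that shuffling leaves the values intact while destroying their order.

```python
import random

def permute_time_steps(series_list, seed=0):
    """Apply one shared random permutation of time indices to every series."""
    length = len(series_list[0])  # assumption: all series have the same length
    perm = list(range(length))
    random.Random(seed).shuffle(perm)
    return [[s[i] for i in perm] for s in series_list]

def mean_value(series):
    """Order-insensitive ("tabular") feature."""
    return sum(series) / len(series)

def mean_abs_step(series):
    """Order-sensitive feature: mean absolute change between neighbouring steps."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)

ramp = [float(t) for t in range(20)]           # smooth upward trend
shuffled = permute_time_steps([ramp], seed=0)[0]

print(mean_value(ramp), mean_value(shuffled))        # same set of values
print(mean_abs_step(ramp), mean_abs_step(shuffled))  # temporal structure changes
```

A classifier whose accuracy barely moves after such a shuffle is presumably relying on order-insensitive features, which is the situation the abstract describes for many UCR datasets.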

Title: EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation

Authors: Ziran Zhang, Xiaohui Li, Yihao Liu, Yujin Wang, Yueting Chen, Tianfan Xue, Shi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20268
Pdf URL: https://arxiv.org/pdf/2503.20268
Copy Paste: [[2503.20268]] EGVD: Event-Guided Video Diffusion Model for Physically Realistic Large-Motion Frame Interpolation(https://arxiv.org/abs/2503.20268)
Keywords: diffusion
Abstract: Video frame interpolation (VFI) in scenarios with large motion remains challenging due to motion ambiguity between frames. While event cameras can capture high temporal resolution motion information, existing event-based VFI methods struggle with limited training data and complex motion patterns. In this paper, we introduce Event-Guided Video Diffusion Model (EGVD), a novel framework that leverages the powerful priors of pre-trained stable video diffusion models alongside the precise temporal information from event cameras. Our approach features a Multi-modal Motion Condition Generator (MMCG) that effectively integrates RGB frames and event signals to guide the diffusion process, producing physically realistic intermediate frames. We employ a selective fine-tuning strategy that preserves spatial modeling capabilities while efficiently incorporating event-guided temporal information. We incorporate input-output normalization techniques inspired by recent advances in diffusion modeling to enhance training stability across varying noise levels. To improve generalization, we construct a comprehensive dataset combining both real and simulated event data across diverse scenarios. Extensive experiments on both real and simulated datasets demonstrate that EGVD significantly outperforms existing methods in handling large motion and challenging lighting conditions, achieving substantial improvements in perceptual quality metrics (27.4% better LPIPS on Prophesee and 24.1% on BSRGB) while maintaining competitive fidelity measures. Code and datasets available at: this https URL.

Title: ViLBench: A Suite for Vision-Language Process Reward Modeling

Authors: Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2503.20271
Pdf URL: https://arxiv.org/pdf/2503.20271
Copy Paste: [[2503.20271]] ViLBench: A Suite for Vision-Language Process Reward Modeling(https://arxiv.org/abs/2503.20271)
Keywords: large language model
Abstract: Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains less explored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, which reveals that neither ORM nor PRM consistently outperforms across all tasks, and superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models -- by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model is able to achieve an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at this https URL with our code, model, and data.

Title: sudo rm -rf agentic_security

Authors: Sejin Lee, Jian Kim, Haon Park, Ashkan Yousefpour, Sangyoon Yu, Min Song
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.20279
Pdf URL: https://arxiv.org/pdf/2503.20279
Copy Paste: [[2503.20279]] sudo rm -rf agentic_security(https://arxiv.org/abs/2503.20279)
Keywords: secure, security, attack, robust, large language model
Abstract: Large Language Models (LLMs) are increasingly deployed as computer-use agents, autonomously performing tasks within real desktop or web environments. While this evolution greatly expands practical use cases for humans, it also creates serious security exposures. We present SUDO (Screen-based Universal Detox2Tox Offense), a novel attack framework that systematically bypasses refusal-trained safeguards in commercial computer-use agents, such as Claude Computer Use. The core mechanism, Detox2Tox, transforms harmful requests (that agents initially reject) into seemingly benign requests via detoxification, secures detailed instructions from advanced vision language models (VLMs), and then reintroduces malicious content via toxification just before execution. Unlike conventional jailbreaks, SUDO iteratively refines its attacks based on built-in refusal feedback, making it increasingly effective against robust policy filters. In extensive tests spanning 50 real-world tasks and multiple state-of-the-art VLMs, SUDO achieves a stark attack success rate of 24% (with no refinement), and up to 41% (by its iterative refinement) in Claude Computer Use. By revealing these vulnerabilities and demonstrating the ease with which they can be exploited in real-world computing environments, this paper highlights an immediate need for robust, context-aware safeguards. WARNING: This paper includes harmful or offensive model outputs.

Title: Are We There Yet? Unraveling the State-of-the-Art Graph Network Intrusion Detection Systems

Authors: Chenglong Wang, Pujia Zheng, Jiaping Gui, Cunqing Hua, Wajih Ul Hassan
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20281
Pdf URL: https://arxiv.org/pdf/2503.20281
Copy Paste: [[2503.20281]] Are We There Yet? Unraveling the State-of-the-Art Graph Network Intrusion Detection Systems(https://arxiv.org/abs/2503.20281)
Keywords: security, attack, robust
Abstract: Network Intrusion Detection Systems (NIDS) are vital for ensuring enterprise security. Recently, Graph-based NIDS (GIDS) have attracted considerable attention because of their capability to effectively capture the complex relationships within the graph structures of data communications. Despite their promise, the reproducibility and replicability of these GIDS remain largely unexplored, posing challenges for developing reliable and robust detection systems. This study bridges this gap by designing a systematic approach to evaluate state-of-the-art GIDS, which includes critically assessing, extending, and clarifying the findings of these systems. We further assess the robustness of GIDS under adversarial attacks. Evaluations were conducted on three public datasets as well as a newly collected large-scale enterprise dataset. Our findings reveal significant performance discrepancies, highlighting challenges related to dataset scale, model inputs, and implementation settings. We demonstrate difficulties in reproducing and replicating results, particularly concerning false positive rates and robustness against adversarial attacks. This work provides valuable insights and recommendations for future research, emphasizing the importance of rigorous reproduction and replication studies in developing robust and generalizable GIDS solutions.

Title: Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation

Authors: Hongye Cao, Fan Feng, Jing Huo, Shangdong Yang, Meng Fang, Tianpei Yang, Yang Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20285
Pdf URL: https://arxiv.org/pdf/2503.20285
Copy Paste: [[2503.20285]] Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation(https://arxiv.org/abs/2503.20285)
Keywords: robust
Abstract: Model-based offline Reinforcement Learning (RL) constructs environment models from offline datasets to perform conservative policy optimization. Existing approaches focus on learning state transitions through ensemble models, performing rollouts with conservative estimation to mitigate extrapolation errors. However, the static data makes it challenging to develop a robust policy, and offline agents cannot access the environment to gather new data. To address these challenges, we introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed horizon rollout by employing adversarial data augmentation to execute alternating sampling with ensemble models to enrich training data. Specifically, this adversarial process dynamically selects ensemble models against policy for biased sampling, mitigating the optimistic estimation of fixed models, thus robustly expanding the training data for policy optimization. Moreover, a differential factor is integrated into the adversarial process for regularization, ensuring error minimization in extrapolations. This data-augmented optimization adapts to diverse offline tasks without rollout horizon tuning, showing remarkable applicability. Extensive experiments on D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.

Title: RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process

Authors: Kaifan Sun, Bingchen Yang, Peter Wonka, Jun Xiao, Haiyong Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20289
Pdf URL: https://arxiv.org/pdf/2503.20289
Copy Paste: [[2503.20289]] RelTriple: Learning Plausible Indoor Layouts by Integrating Relationship Triples into the Diffusion Process(https://arxiv.org/abs/2503.20289)
Keywords: diffusion, generative
Abstract: The generation of indoor furniture layouts has significant applications in augmented reality, smart homes, and architectural design. Successful furniture arrangement requires proper physical relationships (e.g., collision avoidance) and spacing relationships between furniture and their functional zones to be respected. However, manually defined relationships are almost always incomplete and can produce unrealistic layouts. This work instead extracts spacing relationships automatically based on a hierarchical analysis and adopts the Delaunay Triangulation to produce important triple relationships. Compared to pairwise relationship modeling, triple relationships account for interactions and space utilization among multiple objects. To this end, we introduce RelTriple, a novel approach that enhances furniture distribution by learning spacing relationships between objects and regions. We formulate triple relationships as object-to-object (O2O) losses and object-to-region (O2R) losses and integrate them directly into the training process of generative diffusion. Our approach consistently improves over existing state-of-the-art methods in visual results evaluation metrics on unconditional layout generation, floorplan-conditioned layout generation, and scene rearrangement, achieving at least 12% on the introduced spatial relationship metric and superior spatial coherence and practical usability.

Title: Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement

Authors: Xinghao Wang, Changtao Miao, Dianmo Sheng, Tao Gong, Qi Chu, Bin Liu, Nenghai Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20294
Pdf URL: https://arxiv.org/pdf/2503.20294
Copy Paste: [[2503.20294]] Context-Aware Weakly Supervised Image Manipulation Localization with SAM Refinement(https://arxiv.org/abs/2503.20294)
Keywords: transformer
Abstract: Malicious image manipulation poses societal risks, increasing the importance of effective image manipulation detection methods. Recent approaches in image manipulation detection have largely been driven by fully supervised approaches, which require labor-intensive pixel-level annotations. Thus, it is essential to explore weakly supervised image manipulation localization methods that only require image-level binary labels for training. However, existing weakly supervised image manipulation methods overlook the importance of edge information for accurate localization, leading to suboptimal localization performance. To address this, we propose a Context-Aware Boundary Localization (CABL) module to aggregate boundary features and learn context-inconsistency for localizing manipulated areas. Furthermore, by leveraging Class Activation Mapping (CAM) and Segment Anything Model (SAM), we introduce the CAM-Guided SAM Refinement (CGSR) module to generate more accurate manipulation localization maps. By integrating two modules, we present a novel weakly supervised framework based on a dual-branch Transformer-CNN architecture. Our method achieves outstanding localization performance across multiple datasets.

Title: Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model

Authors: Yuhan Wang, Suzhi Bi, Ying-Jun Angela Zhang, Xiaojun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20297
Pdf URL: https://arxiv.org/pdf/2503.20297
Copy Paste: [[2503.20297]] Traversing Distortion-Perception Tradeoff using a Single Score-Based Generative Model(https://arxiv.org/abs/2503.20297)
Keywords: diffusion, generative
Abstract: The distortion-perception (DP) tradeoff reveals a fundamental conflict between distortion metrics (e.g., MSE and PSNR) and perceptual quality. Recent research has increasingly concentrated on evaluating denoising algorithms within the DP framework. However, existing algorithms either prioritize perceptual quality by sacrificing acceptable distortion, or focus on minimizing MSE for faithful restoration. When the goal shifts or noisy measurements vary, adapting to different points on the DP plane needs retraining or even re-designing the model. Inspired by recent advances in solving inverse problems using score-based generative models, we explore the potential of flexibly and optimally traversing DP tradeoffs using a single pre-trained score-based model. Specifically, we introduce a variance-scaled reverse diffusion process and theoretically characterize the marginal distribution. We then prove that the proposed sample process is an optimal solution to the DP tradeoff for conditional Gaussian distribution. Experimental results on two-dimensional and image datasets illustrate that a single score network can effectively and flexibly traverse the DP tradeoff for general denoising problems.

Title: Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability

Authors: Jianyang Zhang, Qianli Luo, Guowu Yang, Wenjing Yang, Weide Liu, Guosheng Lin, Fengmao Lv
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20301
Pdf URL: https://arxiv.org/pdf/2503.20301
Copy Paste: [[2503.20301]] Attribute-formed Class-specific Concept Space: Endowing Language Bottleneck Model with Better Interpretability and Scalability(https://arxiv.org/abs/2503.20301)
Keywords: interpretability
Abstract: Language Bottleneck Models (LBMs) are proposed to achieve interpretable image recognition by classifying images based on textual concept bottlenecks. However, current LBMs simply list all concepts together as the bottleneck layer, leading to the spurious cue inference problem and an inability to generalize to unseen classes. To address these limitations, we propose the Attribute-formed Language Bottleneck Model (ALBM). ALBM organizes concepts in the attribute-formed class-specific space, where concepts are descriptions of specific attributes for specific classes. In this way, ALBM can avoid the spurious cue inference problem by classifying solely based on the essential concepts of each class. In addition, the cross-class unified attribute set also ensures that the concept spaces of different classes have strong correlations; as a result, the learned concept classifier can be easily generalized to unseen classes. Moreover, to further improve interpretability, we propose Visual Attribute Prompt Learning (VAPL) to extract visual features on fine-grained attributes. Furthermore, to avoid labor-intensive concept annotation, we propose the Description, Summary, and Supplement (DSS) strategy to automatically generate high-quality concept sets with a complete and precise attribute. Extensive experiments on 9 widely used few-shot benchmarks demonstrate the interpretability, transferability, and performance of our approach. The code and collected concept sets are available at this https URL.

Title: A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications

Authors: Sunayana Sitaram, Adrian de Wynter, Isobel McCrum, Qilong Gu, Si-Qing Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20302
Pdf URL: https://arxiv.org/pdf/2503.20302
Copy Paste: [[2503.20302]] A Multilingual, Culture-First Approach to Addressing Misgendering in LLM Applications(https://arxiv.org/abs/2503.20302)
Keywords: large language model
Abstract: Misgendering is the act of referring to someone by a gender that does not match their chosen identity. It marginalizes and undermines a person's sense of self, causing significant harm. English-based approaches offer clear-cut ways to avoid misgendering, such as the use of the pronoun "they". However, other languages pose unique challenges due to both grammatical and cultural constructs. In this work we develop methodologies to assess and mitigate misgendering across 42 languages and dialects using a participatory-design approach to design effective and appropriate guardrails across all languages. We test these guardrails in a standard large language model-based application (meeting transcript summarization), where both the data generation and the annotation steps followed a human-in-the-loop approach. We find that the proposed guardrails are very effective in reducing misgendering rates across all languages in the summaries generated, and without incurring loss of quality. Our human-in-the-loop approach demonstrates a method to feasibly scale inclusive and responsible AI-based solutions across multiple languages and cultures.

Title: Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

Authors: Zitian Wang, Yue Liao, Kang Rong, Fengyun Rao, Yibo Yang, Si Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20309
Pdf URL: https://arxiv.org/pdf/2503.20309
Copy Paste: [[2503.20309]] Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs(https://arxiv.org/abs/2503.20309)
Keywords: large language model
Abstract: Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements to hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA's effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.

Title: Enabling Heterogeneous Adversarial Transferability via Feature Permutation Attacks

Authors: Tao Wu, Tie Luo
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20310
Pdf URL: https://arxiv.org/pdf/2503.20310
Copy Paste: [[2503.20310]] Enabling Heterogeneous Adversarial Transferability via Feature Permutation Attacks(https://arxiv.org/abs/2503.20310)
Keywords: attack, robust, transformer
Abstract: Adversarial attacks in black-box settings are highly practical, with transfer-based attacks being the most effective at generating adversarial examples (AEs) that transfer from surrogate models to unseen target models. However, their performance significantly degrades when transferring across heterogeneous architectures -- such as CNNs, MLPs, and Vision Transformers (ViTs) -- due to fundamental architectural differences. To address this, we propose Feature Permutation Attack (FPA), a zero-FLOP, parameter-free method that enhances adversarial transferability across diverse architectures. FPA introduces a novel feature permutation (FP) operation, which rearranges pixel values in selected feature maps to simulate long-range dependencies, effectively making CNNs behave more like ViTs and MLPs. This enhances feature diversity and improves transferability both across heterogeneous architectures and within homogeneous CNNs. Extensive evaluations on 14 state-of-the-art architectures show that FPA achieves maximum absolute gains in attack success rates of 7.68% on CNNs, 14.57% on ViTs, and 14.48% on MLPs, outperforming existing black-box attacks. Additionally, FPA is highly generalizable and can seamlessly integrate with other transfer-based attacks to further boost their performance. Our findings establish FPA as a robust, efficient, and computationally lightweight strategy for enhancing adversarial transferability across heterogeneous architectures.
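
The feature permutation idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how feature maps are selected or which permutation pattern is used, so the `ratio` parameter, the uniform random shuffle, and all function names here are assumptions made for illustration. The sketch only shows why such an operation needs no parameters and no arithmetic: values are moved, never recomputed.

```python
import random


def permute_feature_map(fmap, rng):
    """Return a copy of one 2D feature map with its values shuffled across positions."""
    width = len(fmap[0])
    flat = [v for row in fmap for v in row]
    rng.shuffle(flat)  # pure rearrangement: no multiplications or additions
    return [flat[r * width:(r + 1) * width] for r in range(len(fmap))]


def feature_permutation(features, ratio, seed=0):
    """Shuffle a random subset of channels.

    `features` is a list of 2D feature maps (one per channel).
    `ratio` is a hypothetical fraction of channels to permute.
    Returns the new feature list and the set of permuted channel indices.
    """
    rng = random.Random(seed)
    n_selected = int(len(features) * ratio)
    selected = set(rng.sample(range(len(features)), n_selected))
    out = [
        permute_feature_map(fmap, rng) if c in selected else [row[:] for row in fmap]
        for c, fmap in enumerate(features)
    ]
    return out, selected


if __name__ == "__main__":
    # Four channels of size 2x3 with distinct values per channel.
    feats = [[[c * 10 + r * 3 + k for k in range(3)] for r in range(2)] for c in range(4)]
    permuted, chosen = feature_permutation(feats, ratio=0.5)
    print("permuted channels:", sorted(chosen))
```

Each permuted channel keeps exactly the same multiset of values, so distant positions exchange information without any learned weights; untouched channels pass through unchanged.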

Title: Wan: Open and Advanced Large-Scale Video Generative Models

Authors: WanTeam: Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing Wang, Tianyi Gui, Tingyu Weng, Tong Shen, Wei Lin, Wei Wang, Wei Wang, Wenmeng Zhou, Wente Wang, Wenting Shen, Wenyuan Yu, Xianzhong Shi, Xiaoming Huang, Xin Xu, Yan Kou, Yangyu Lv, Yifei Li, Yijing Liu, Yiming Wang, Yingya Zhang, Yitong Huang, Yong Li, You Wu, Yu Liu, Yulin Pan, Yun Zheng, Yuntao Hong, Yupeng Shi, Yutong Feng, Zeyinzi Jiang, Zhen Han, Zhi-Fan Wu, Ziyu Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20314
Pdf URL: https://arxiv.org/pdf/2503.20314
Copy Paste: [[2503.20314]] Wan: Open and Advanced Large-Scale Video Generative Models(https://arxiv.org/abs/2503.20314)
Keywords: diffusion, transformer, generative
Abstract: This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features: Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms the existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority. Comprehensiveness: Wan offers two capable models, i.e., 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks. Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB VRAM, making it compatible with a wide range of consumer-grade GPUs. Openness: We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at this https URL.

Title: SpikeDerain: Unveiling Clear Videos from Rainy Sequences Using Color Spike Streams

Authors: Hanwen Liang, Xian Zhong, Wenxuan Liu, Yajing Zheng, Wenxin Huang, Zhaofei Yu, Tiejun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20315
Pdf URL: https://arxiv.org/pdf/2503.20315
Copy Paste: [[2503.20315]] SpikeDerain: Unveiling Clear Videos from Rainy Sequences Using Color Spike Streams(https://arxiv.org/abs/2503.20315)
Keywords: robust
Abstract: Restoring clear frames from rainy videos presents a significant challenge due to the rapid motion of rain streaks. Traditional frame-based visual sensors, which capture scene content synchronously, struggle to capture the fast-moving details of rain accurately. In recent years, neuromorphic sensors have introduced a new paradigm for dynamic scene perception, offering microsecond temporal resolution and high dynamic range. However, existing multimodal methods that fuse event streams with RGB images face difficulties in handling the complex spatiotemporal interference of raindrops in real scenes, primarily due to hardware synchronization errors and computational redundancy. In this paper, we propose a Color Spike Stream Deraining Network (SpikeDerain), capable of reconstructing spike streams of dynamic scenes and accurately removing rain streaks. To address the challenges of data scarcity in real continuous rainfall scenes, we design a physically interpretable rain streak synthesis model that generates parameterized continuous rain patterns based on arbitrary background images. Experimental results demonstrate that the network, trained with this synthetic data, remains highly robust even under extreme rainfall conditions. These findings highlight the effectiveness and robustness of our method across varying rainfall levels and datasets, setting new standards for video deraining tasks. The code will be released soon.

Title: Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models

Authors: Shih-Wen Ke, Guan-Yu Lai, Guo-Lin Fang, Hsi-Yuan Kao
Subjects: cs.CL, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2503.20320
Pdf URL: https://arxiv.org/pdf/2503.20320
Copy Paste: [[2503.20320]] Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models(https://arxiv.org/abs/2503.20320)
Keywords: security, attack, large language model
Abstract: Large language models (LLMs) are designed to align with human values in their responses. This study exploits LLMs with an iterative prompting technique where each prompt is systematically modified and refined across multiple iterations to enhance its effectiveness in jailbreaking attacks progressively. This technique involves analyzing the response patterns of LLMs, including GPT-3.5, GPT-4, LLaMa2, Vicuna, and ChatGLM, allowing us to adjust and optimize prompts to evade the LLMs' ethical and security constraints. Persuasion strategies enhance prompt effectiveness while maintaining consistency with malicious intent. Our results show that the attack success rates (ASR) increase as the attacking prompts become more refined, with the highest ASR of 90% for GPT-4 and ChatGLM and the lowest ASR of 68% for LLaMa2. Our technique outperforms baseline techniques (PAIR and PAP) in ASR and shows comparable performance with GCG and ArtPrompt.

Title: Recovering Dynamic 3D Sketches from Videos

Authors: Jaeah Lee, Changwoon Choi, Young Min Kim, Jaesik Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20321
Pdf URL: https://arxiv.org/pdf/2503.20321
Copy Paste: [[2503.20321]] Recovering Dynamic 3D Sketches from Videos(https://arxiv.org/abs/2503.20321)
Keywords: robust
Abstract: Understanding 3D motion from videos presents inherent challenges due to the diverse types of movement, ranging from rigid and deformable objects to articulated structures. To overcome this, we propose Liv3Stroke, a novel approach for abstracting objects in motion with deformable 3D strokes. The detailed movements of an object may be represented by unstructured motion vectors or a set of motion primitives using a pre-defined articulation from a template model. Just as a free-hand sketch can intuitively visualize scenes or intentions with a sparse set of lines, we utilize a set of parametric 3D curves to capture a set of spatially smooth motion elements for general objects with unknown structures. We first extract noisy, 3D point cloud motion guidance from video frames using semantic features, and our approach deforms a set of curves to abstract essential motion features as a set of explicit 3D representations. Such abstraction enables an understanding of prominent components of motions while maintaining robustness to environmental factors. Our approach allows direct analysis of 3D object movements from video, tackling the uncertainty that typically occurs when translating real-world motion into recorded footage. The project page is accessible via: this https URL

Title: Dynamic Pyramid Network for Efficient Multimodal Large Language Model

Authors: Hao Ai, Kunyi Wang, Zezhou Wang, Hao Lu, Jin Tian, Yaxin Luo, Peng Xing, Jen-Yuan Huang, Huaxia Li, Gen luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20322
Pdf URL: https://arxiv.org/pdf/2503.20322
Copy Paste: [[2503.20322]] Dynamic Pyramid Network for Efficient Multimodal Large Language Model(https://arxiv.org/abs/2503.20322)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive performance in various vision-language (VL) tasks, but their expensive computations still limit the real-world application. To address this issue, recent efforts aim to compress the visual features to save the computational costs of MLLMs. However, direct visual compression methods, e.g., efficient projectors, inevitably destroy the visual semantics in MLLMs, especially in difficult samples. To overcome this shortcoming, we propose a novel dynamic pyramid network (DPN) for efficient MLLMs. Specifically, DPN formulates the MLLM as a hierarchical structure where visual features are gradually compressed with increasing depth. In this case, even with a high compression ratio, fine-grained visual information can still be perceived in shallow layers. To maximize the benefit of DPN, we further propose an innovative Dynamic Pooling Experts (DPE) module that can dynamically choose the optimal visual compression rate according to input features. With this design, harder samples will be assigned larger computations, thus preserving the model performance. To validate our approach, we conduct extensive experiments on two popular MLLMs and ten benchmarks. Experimental results show that DPN can save up to 56% average FLOPs on LLaVA while further achieving +0.74% performance gains. Besides, the generalization ability of DPN is also validated on the existing high-resolution MLLM called LLaVA-HR. Our source code is anonymously released at this https URL.
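
The pyramid-style compression described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the DPN architecture: the abstract does not specify the pooling operator, the stage layout, or how DPE picks a rate, so average pooling over consecutive tokens, fixed per-stage rates, and every name below are hypothetical.

```python
def avg_pool_tokens(tokens, rate):
    """Average every `rate` consecutive token vectors into one.

    `tokens` is a list of equal-length float lists. A shorter tail group
    is averaged over the tokens it actually contains.
    """
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def pyramid_token_counts(n_tokens, rates):
    """Token count entering each stage when stage i compresses by rates[i]."""
    counts = [n_tokens]
    for rate in rates:
        counts.append(-(-counts[-1] // rate))  # ceiling division
    return counts


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(avg_pool_tokens(tokens, rate=2))
    # Illustrative token budget with two hypothetical 2x compression stages.
    print(pyramid_token_counts(576, rates=[2, 2]))
```

Shallow stages still see every token, while deeper stages work on progressively fewer, which is the intuition behind keeping fine-grained information available early and saving computation later.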

Title: Progressive Focused Transformer for Single Image Super-Resolution

Authors: Wei Long, Xingyu Zhou, Leheng Zhang, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20337
Pdf URL: https://arxiv.org/pdf/2503.20337
Copy Paste: [[2503.20337]] Progressive Focused Transformer for Single Image Super-Resolution(https://arxiv.org/abs/2503.20337)
Keywords: transformer
Abstract: Transformer-based methods have achieved remarkable results in image super-resolution tasks because they can capture non-local dependencies in low-quality input images. However, this feature-intensive modeling approach is computationally expensive because it calculates the similarities between numerous features that are irrelevant to the query features when obtaining attention weights. These unnecessary similarity calculations not only degrade the reconstruction performance but also introduce significant computational overhead. How to accurately identify the features that are important to the current query features and avoid similarity calculations between irrelevant features remains an urgent problem. To address this issue, we propose a novel and effective Progressive Focused Transformer (PFT) that links all isolated attention maps in the network through Progressive Focused Attention (PFA) to focus attention on the most important tokens. PFA not only enables the network to capture more critical similar features, but also significantly reduces the computational cost of the overall network by filtering out irrelevant features before calculating similarities. Extensive experiments demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance on various single image super-resolution benchmarks.

Title: Wasserstein Distributionally Robust Bayesian Optimization with Continuous Context

Authors: Francesco Micheli, Efe C. Balta, Anastasios Tsiamis, John Lygeros
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20341
Pdf URL: https://arxiv.org/pdf/2503.20341
Copy Paste: [[2503.20341]] Wasserstein Distributionally Robust Bayesian Optimization with Continuous Context(https://arxiv.org/abs/2503.20341)
Keywords: robust
Abstract: We address the challenge of sequential data-driven decision-making under context distributional uncertainty. This problem arises in numerous real-world scenarios where the learner optimizes black-box objective functions in the presence of uncontrollable contextual variables. We consider the setting where the context distribution is uncertain but known to lie within an ambiguity set defined as a ball in the Wasserstein distance. We propose a novel algorithm for Wasserstein Distributionally Robust Bayesian Optimization that can handle continuous context distributions while maintaining computational tractability. Our theoretical analysis combines recent results in self-normalized concentration in Hilbert spaces and finite-sample bounds for distributionally robust optimization to establish sublinear regret bounds that match state-of-the-art results. Through extensive comparisons with existing approaches on both synthetic and real-world problems, we demonstrate the simplicity, effectiveness, and practical applicability of our proposed method.

Title: Consistency Trajectory Matching for One-Step Generative Super-Resolution

Authors: Weiyi You, Mingyang Zhang, Leheng Zhang, Kexuan Shi, Xingyu Zhou, Shuhang Gu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20349
Pdf URL: https://arxiv.org/pdf/2503.20349
Copy Paste: [[2503.20349]] Consistency Trajectory Matching for One-Step Generative Super-Resolution(https://arxiv.org/abs/2503.20349)
Keywords: diffusion, generative
Abstract: Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of a pre-trained diffusion model. To further enhance the performance and better leverage the ground-truth during the training process, we aim to align the distribution of SR results more closely with that of the natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.

Title: CNN+Transformer Based Anomaly Traffic Detection in UAV Networks for Emergency Rescue

Authors: Yulu Han, Ziye Jia, Sijie He, Yu Zhang, Qihui Wu
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2503.20355
Pdf URL: https://arxiv.org/pdf/2503.20355
Copy Paste: [[2503.20355]] CNN+Transformer Based Anomaly Traffic Detection in UAV Networks for Emergency Rescue(https://arxiv.org/abs/2503.20355)
Keywords: security, attack, transformer
Abstract: The unmanned aerial vehicle (UAV) network has gained significant attention in recent years due to its various applications. However, traffic security has become a key public safety issue in emergency rescue systems due to the increasing vulnerability of UAVs to cyber attacks in highly heterogeneous environments. Hence, in this paper, we propose a novel anomaly traffic detection architecture for UAV networks based on the software-defined networking (SDN) framework and blockchain technology. Specifically, SDN separates the control and data planes to enhance network manageability and security. Meanwhile, the blockchain provides decentralized identity authentication and data security records. Besides, a complete security architecture requires an effective mechanism to detect abnormal traffic in time-series data. Thus, an integrated algorithm combining convolutional neural networks (CNNs) and Transformer (CNN+Transformer) for anomaly traffic detection is developed, which is called CTranATD. Finally, the simulation results show that the proposed CTranATD algorithm is effective and outperforms the individual CNN, Transformer, and LSTM algorithms for detecting anomaly traffic.

Title: UnReference: analysis of the effect of spoofing on RTK reference stations for connected rovers

Authors: Marco Spanghero, Panos Papadimitratos
Subjects: cs.CR, eess.SP
Abstract URL: https://arxiv.org/abs/2503.20364
Pdf URL: https://arxiv.org/pdf/2503.20364
Copy Paste: [[2503.20364]] UnReference: analysis of the effect of spoofing on RTK reference stations for connected rovers(https://arxiv.org/abs/2503.20364)
Keywords: attack, robust
Abstract: Global Navigation Satellite Systems (GNSS) provide standalone precise navigation for a wide gamut of applications. Nevertheless, applications or systems such as unmanned vehicles (aerial or ground vehicles and surface vessels) generally require a much higher level of accuracy than those provided by standalone receivers. The most effective and economical way of achieving centimeter-level accuracy is to rely on corrections provided by fixed reference station receivers to improve the satellite ranging measurements. Differential GNSS (DGNSS) and Real Time Kinematics (RTK) provide centimeter-level accuracy by distributing online correction streams to connected nearby mobile receivers typically termed rovers. However, due to their static nature, reference stations are prime targets for GNSS attacks, both simplistic jamming and advanced spoofing, with different levels of adversarial control and complexity. Jamming the reference station would deny corrections and thus accuracy to the rovers. Spoofing the reference station would force it to distribute misleading corrections. As a result, all connected rovers using those corrections will be equally influenced by the adversary independently of their actual trajectory. We evaluate a battery of tests generated with an RF simulator to test the robustness of a common DGNSS/RTK processing library and receivers. We test both jamming and synchronized spoofing to demonstrate that adversarial action on the rover using reference spoofing is both effective and convenient from an adversarial perspective. Additionally, we discuss possible strategies based on existing countermeasures (self-validation of the PNT solution and monitoring of own clock drift) that the rover and the reference station can adopt to avoid using or distributing bogus corrections.
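
Why a spoofed reference station affects every connected rover equally can be seen in a toy differential-correction model. The sketch below is a deliberate simplification and not the processing library evaluated in the paper: it uses a single satellite, code-range corrections only (no carrier phase), and a common error shared by all receivers. All names and numbers are hypothetical.

```python
def reference_correction(known_range_m, measured_range_m):
    """Correction a reference station broadcasts: surveyed range minus measured range."""
    return known_range_m - measured_range_m


def apply_correction(rover_measured_m, correction_m):
    """Range a rover uses after applying the broadcast correction."""
    return rover_measured_m + correction_m


def rover_errors(spoof_bias_m, common_error_m=4.0):
    """Residual range error at each rover for a given spoofing bias at the reference.

    `spoof_bias_m` is the extra range (metres) an attacker induces in the
    reference station's measurement; 0.0 means no attack.
    """
    ref_true = 20_000_000.0                        # surveyed reference-to-satellite range
    rover_true = [20_000_150.0, 20_003_000.0]      # two rovers at different positions

    ref_measured = ref_true + common_error_m + spoof_bias_m
    correction = reference_correction(ref_true, ref_measured)

    errors = []
    for true_range in rover_true:
        measured = true_range + common_error_m      # rovers themselves are not spoofed
        errors.append(apply_correction(measured, correction) - true_range)
    return errors


if __name__ == "__main__":
    print("no attack:     ", rover_errors(0.0))
    print("30 m spoofing: ", rover_errors(30.0))
```

In this model the honest correction cancels the shared error exactly, while a bias injected at the reference reappears with opposite sign at every rover, regardless of where the rover is.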

Title: RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task

Authors: Chunshan Li, Rong Wang, Xiaofei Yang, Dianhui Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20382
Pdf URL: https://arxiv.org/pdf/2503.20382
Copy Paste: [[2503.20382]] RSRWKV: A Linear-Complexity 2D Attention Mechanism for Efficient Remote Sensing Vision Task(https://arxiv.org/abs/2503.20382)
Keywords: extraction, transformer, segmentation
Abstract: High-resolution remote sensing analysis faces challenges in global context modeling due to scene complexity and scale diversity. While CNNs excel at local feature extraction via parameter sharing, their fixed receptive fields fundamentally restrict long-range dependency modeling. Vision Transformers (ViTs) effectively capture global semantic relationships through self-attention mechanisms but suffer from quadratic computational complexity relative to image resolution, creating critical efficiency bottlenecks for high-resolution imagery. The RWKV model's linear-complexity sequence modeling achieves breakthroughs in NLP but exhibits anisotropic limitations in vision tasks due to its 1D scanning mechanism. To address these challenges, we propose RSRWKV, featuring a novel 2D-WKV scanning mechanism that bridges sequential processing and 2D spatial reasoning while maintaining linear complexity. This enables isotropic context aggregation across multiple directions. The MVC-Shift module enhances multi-scale receptive field coverage, while the ECA module strengthens cross-channel feature interaction and semantic saliency modeling. Experimental results demonstrate RSRWKV's superior performance over CNN and Transformer baselines in classification, detection, and segmentation tasks on NWPU RESISC45, VHR-10.v2, and GLH-Water datasets, offering a scalable solution for high-resolution remote sensing analysis.
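
The efficiency gap between quadratic self-attention and a linear-complexity scan can be made concrete with simple counting. The sketch below is a back-of-the-envelope illustration only; the patch size of 16 and the function names are assumptions, and it counts token interactions rather than real FLOPs for any specific model.

```python
def num_tokens(height, width, patch=16):
    """Number of patch tokens for an image, assuming non-overlapping square patches."""
    return (height // patch) * (width // patch)


def quadratic_cost(n_tokens):
    """Pairwise token interactions in full self-attention: N^2."""
    return n_tokens * n_tokens


def linear_cost(n_tokens):
    """Token visits in a linear-complexity scan: N."""
    return n_tokens


if __name__ == "__main__":
    for side in (512, 1024, 2048):
        n = num_tokens(side, side)
        print(f"{side}x{side}: tokens={n}, quadratic={quadratic_cost(n)}, linear={linear_cost(n)}")
```

Doubling the image side multiplies the token count by 4, so the linear cost grows 4x while the quadratic cost grows 16x, which is the bottleneck the abstract refers to for high-resolution imagery.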

Title: FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies

Authors: Tianqi He, Xiaohan Huang, Yi Du, Qingqing Long, Ziyue Qiao, Min Wu, Yanjie Fu, Yuanchun Zhou, Meng Xiao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20394
Pdf URL: https://arxiv.org/pdf/2503.20394
Copy Paste: [[2503.20394]] FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies(https://arxiv.org/abs/2503.20394)
Keywords: generative
Abstract: Feature Transformation is crucial for classic machine learning that aims to generate feature combinations to enhance the performance of downstream tasks from a data-centric perspective. Current methodologies, such as manual expert-driven processes, iterative-feedback techniques, and exploration-generative tactics, have shown promise in automating such data engineering workflow by minimizing human involvement. However, three challenges remain in those frameworks: (1) It predominantly depends on downstream task performance metrics, as assessment is time-consuming, especially for large datasets. (2) The diversity of feature combinations will hardly be guaranteed after random exploration ends. (3) Rare significant transformations lead to sparse valuable feedback that hinders the learning processes or leads to less effective results. In response to these challenges, we introduce FastFT, an innovative framework that leverages a trio of advanced strategies. We first decouple the feature transformation evaluation from the outcomes of the generated datasets via the performance predictor. To address the issue of reward sparsity, we developed a method to evaluate the novelty of generated transformation sequences. Incorporating this novelty into the reward function accelerates the model's exploration of effective transformations, thereby improving the search productivity. Additionally, we combine novelty and performance to create a prioritized memory buffer, ensuring that essential experiences are effectively revisited during exploration. Our extensive experimental evaluations validate the performance, efficiency, and traceability of our proposed framework, showcasing its superiority in handling complex feature transformation tasks.

Title: Active Data Sampling and Generation for Bias Remediation

Authors: Antonio Maratea, Rita Perna
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20414
Pdf URL: https://arxiv.org/pdf/2503.20414
Copy Paste: [[2503.20414]] Active Data Sampling and Generation for Bias Remediation(https://arxiv.org/abs/2503.20414)
Keywords: fair
Abstract: Adequate sampling space coverage is the keystone to effectively train trustworthy Machine Learning models. Unfortunately, real data do carry several inherent risks due to the many potential biases they exhibit when gathered without a proper random sampling over the reference population, and most of the time this is way too expensive or time consuming to be a viable option. Depending on how training data have been gathered, unmitigated biases can lead to harmful or discriminatory consequences that ultimately hinder large scale applicability of pre-trained models and undermine their truthfulness or fairness expectations. In this paper, a mixed active sampling and data generation strategy -- called samplation -- is proposed as a means to compensate, during fine-tuning of a pre-trained classifier, for the unfair classifications it produces, assuming that the training data come from a non-probabilistic sampling schema. Given a pre-trained classifier, first a fairness metric is evaluated on a test set, then new reservoirs of labeled data are generated and finally a number of reversely-biased artificial samples are generated for the fine-tuning of the model. Using as case study Deep Models for visual semantic role labeling, the proposed method has been able to fully cure a simulated gender bias starting from a 90/10 imbalance, with only a small percentage of new data and with a minor effect on accuracy.

Title: CFunModel: A "Funny" Language Model Capable of Chinese Humor Generation and Processing

Authors: Zhenghan Yu, Xinyu Hu, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20417
Pdf URL: https://arxiv.org/pdf/2503.20417
Copy Paste: [[2503.20417]] CFunModel: A "Funny" Language Model Capable of Chinese Humor Generation and Processing(https://arxiv.org/abs/2503.20417)
Keywords: large language model
Abstract: Humor plays a significant role in daily language communication. With the rapid development of large language models (LLMs), natural language processing has made significant strides in understanding and generating various genres of texts. However, most LLMs exhibit poor performance in generating and processing Chinese humor. In this study, we introduce a comprehensive Chinese humor-related dataset, the Chinese Fun Set (CFunSet). This dataset aggregates existing Chinese humor datasets and includes over 20,000 jokes collected from Tieba-JokeBar, a Chinese online platform known for joke sharing. The resulting corpus comprises more than 160,000 entries. Leveraging CFunSet, we developed the Chinese Fun Model (CFunModel), the first large language model designed to handle various Chinese humor-related tasks including Crosstalk Response Selection, Humor Recognition, Joke Generation, etc. Experimental results demonstrate that CFunModel outperforms popular large language models in these tasks. Our CFunSet is available at this https URL and CFunModel is available at this https URL. A demonstration video of our work is available at this https URL.

Title: ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On

Authors: Ji Woo Hong, Tri Ton, Trung X. Pham, Gwanhyeong Koo, Sunjae Yoon, Chang D. Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20418
Pdf URL: https://arxiv.org/pdf/2503.20418
Copy Paste: [[2503.20418]] ITA-MDT: Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On(https://arxiv.org/abs/2503.20418)
Keywords: diffusion, transformer
Abstract: This paper introduces ITA-MDT, the Image-Timestep-Adaptive Masked Diffusion Transformer Framework for Image-Based Virtual Try-On (IVTON), designed to overcome the limitations of previous approaches by leveraging the Masked Diffusion Transformer (MDT) for improved handling of both global garment context and fine-grained details. The IVTON task involves seamlessly superimposing a garment from one image onto a person in another, creating a realistic depiction of the person wearing the specified garment. Unlike conventional diffusion-based virtual try-on models that depend on large pre-trained U-Net architectures, ITA-MDT leverages a lightweight, scalable transformer-based denoising diffusion model with a mask latent modeling scheme, achieving competitive results while reducing computational overhead. A key component of ITA-MDT is the Image-Timestep Adaptive Feature Aggregator (ITAFA), a dynamic feature aggregator that combines all of the features from the image encoder into a unified feature of the same size, guided by diffusion timestep and garment image complexity. This enables adaptive weighting of features, allowing the model to emphasize either global information or fine-grained details based on the requirements of the denoising stage. Additionally, the Salient Region Extractor (SRE) module is presented to identify complex regions of the garment and provide high-resolution local information to the denoising model as an additional condition alongside the global information of the full garment image. This targeted conditioning strategy enhances the preservation of fine details in highly salient garment regions, optimizing computational resources by avoiding unnecessary processing of the entire garment image. Comparative evaluations confirm that ITA-MDT improves efficiency while maintaining strong performance, reaching state-of-the-art results in several metrics.

Title: Cherry Yield Forecast: Harvest Prediction for Individual Sweet Cherry Trees

Authors: Andreas Gilson, Peter Pietrzyk, Chiara Paglia, Annika Killer, Fabian Keil, Lukas Meyer, Dominikus Kittemann, Patrick Noack, Oliver Scholz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20419
Pdf URL: https://arxiv.org/pdf/2503.20419
Copy Paste: [[2503.20419]] Cherry Yield Forecast: Harvest Prediction for Individual Sweet Cherry Trees(https://arxiv.org/abs/2503.20419)
Keywords: extraction
Abstract: This paper is part of a publication series from the For5G project that has the goal of creating digital twins of sweet cherry trees. At the beginning, a brief overview of the previous work in this project is provided. Afterwards the focus shifts to a crucial problem in the fruit farming domain: the difficulty of making reliable yield predictions early in the season. Following three Satin sweet cherry trees throughout the year 2023 enabled the collection of accurate ground truth data about the development of cherries from dormancy until harvest. The methodology used to collect this data is presented, along with its valuation and visualization. The predictive power of counting objects at all relevant vegetative stages of the fruit development cycle in cherry trees with regard to yield predictions is investigated. It is found that all investigated fruit states are suitable for yield predictions based on linear regression. Conceptually, there is a trade-off between earliness and external events with the potential to invalidate the prediction. Considering this, two optimal timepoints are suggested: the opening cluster stage before the start of flowering and the early fruit stage right after the second fruit drop. However, both timepoints are challenging to handle with automated procedures based on image data. Counting developing cherries based on images is exceptionally difficult due to the small fruit size and their tendency to be occluded by leaves. It was not possible to obtain satisfying results relying on a state-of-the-art fruit-counting method. Counting the elements within a bursting bud is also challenging, even when using high resolution cameras. It is concluded that accurate yield prediction for sweet cherry trees is possible when objects are manually counted and that automated feature extraction with similar accuracy remains an open problem.

Title: TempTest: Local Normalization Distortion and the Detection of Machine-generated Text

Authors: Tom Kempton, Stuart Burrell, Connor Cheverall
Subjects: cs.CL, cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2503.20421
Pdf URL: https://arxiv.org/pdf/2503.20421
Copy Paste: [[2503.20421]] TempTest: Local Normalization Distortion and the Detection of Machine-generated Text(https://arxiv.org/abs/2503.20421)
Keywords: attack
Abstract: Existing methods for the zero-shot detection of machine-generated text are dominated by three statistical quantities: log-likelihood, log-rank, and entropy. As language models mimic the distribution of human text ever closer, this will limit our ability to build effective detection algorithms. To combat this, we introduce a method for detecting machine-generated text that is entirely agnostic of the generating language model. This is achieved by targeting a defect in the way that decoding strategies, such as temperature or top-k sampling, normalize conditional probability measures. This method can be rigorously theoretically justified, is easily explainable, and is conceptually distinct from existing methods for detecting machine-generated text. We evaluate our detector in the white and black box settings across various language models, datasets, and passage lengths. We also study the effect of paraphrasing attacks on our detector and the extent to which it is biased against non-native speakers. In each of these settings, the performance of our test is at least comparable to that of other state-of-the-art text detectors, and in some cases, we strongly outperform these baselines.
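As background for the normalization step this abstract refers to, here is a minimal sketch of how temperature and top-k sampling renormalize a conditional next-token distribution. It is not the TempTest detector itself: the function names and the toy probabilities are illustrative assumptions, and only the standard textbook definitions of the two decoding strategies are used. The point it shows is that the normalizing constant is recomputed at every position and depends on the shape of the distribution there.

```python
def temperature_scale(probs, temperature):
    """Raise each probability to the power 1/T, then renormalize to sum to 1."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)  # local normalizing constant for this position
    return [x / total for x in powered]


def top_k_renormalize(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def local_normalizer(probs, temperature):
    """The constant that temperature scaling divides by at one position."""
    return sum(p ** (1.0 / temperature) for p in probs)


# Toy next-token distributions at two positions (made-up numbers).
peaked = [0.90, 0.05, 0.03, 0.02]
flat = [0.25, 0.25, 0.25, 0.25]

for name, dist in (("peaked", peaked), ("flat", flat)):
    z = local_normalizer(dist, 0.7)
    scaled = temperature_scale(dist, 0.7)
    print(name, "normalizer:", round(z, 4), "scaled:", [round(p, 4) for p in scaled])
```

Running the loop prints a different normalizer for the peaked and the flat distribution, which is the position-dependent quantity the abstract calls local normalization.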

Title: Evaluating Facial Expression Recognition Datasets for Deep Learning: A Benchmark Study with Novel Similarity Metrics

Authors: F. Xavier Gaya-Morey, Cristina Manresa-Yee, Célia Martinie, Jose M. Buades-Rubio
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20428
Pdf URL: https://arxiv.org/pdf/2503.20428
Copy Paste: [[2503.20428]] Evaluating Facial Expression Recognition Datasets for Deep Learning: A Benchmark Study with Novel Similarity Metrics(https://arxiv.org/abs/2503.20428)
Keywords: robust, fair
Abstract: This study investigates the key characteristics and suitability of widely used Facial Expression Recognition (FER) datasets for training deep learning models. In the field of affective computing, FER is essential for interpreting human emotions, yet the performance of FER systems is highly contingent on the quality and diversity of the underlying datasets. To address this issue, we compiled and analyzed 24 FER datasets, including those targeting specific age groups such as children, adults, and the elderly, and processed them through a comprehensive normalization pipeline. In addition, we enriched the datasets with automatic annotations for age and gender, enabling a more nuanced evaluation of their demographic properties. To further assess dataset efficacy, we introduce three novel metrics, Local, Global, and Paired Similarity, which quantitatively measure dataset difficulty, generalization capability, and cross-dataset transferability. Benchmark experiments using state-of-the-art neural networks reveal that large-scale, automatically collected datasets (e.g., AffectNet, FER2013) tend to generalize better, despite issues with labeling noise and demographic biases, whereas controlled datasets offer higher annotation quality but limited variability. Our findings provide actionable recommendations for dataset selection and design, advancing the development of more robust, fair, and effective FER systems.

Title: Latent Beam Diffusion Models for Decoding Image Sequences

Authors: Guilherme Fernandes, Vasco Ramos, Regev Cohen, Idan Szpektor, João Magalhães
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20429
Pdf URL: https://arxiv.org/pdf/2503.20429
Copy Paste: [[2503.20429]] Latent Beam Diffusion Models for Decoding Image Sequences(https://arxiv.org/abs/2503.20429)
Keywords: diffusion
Abstract: While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency in image sequences. Existing methods generate each image independently, leading to disjointed narratives - a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent frames. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. Unlike prior approaches that use fixed latent priors, our method dynamically searches for an optimal sequence of latent representations, ensuring coherent visual transitions. To address beam search's quadratic complexity, we integrate a cross-attention mechanism that efficiently scores search paths and enables pruning, prioritizing alignment with both textual prompts and visual context. Human evaluations confirm that our approach outperforms baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment. By bridging advances in search optimization and latent space refinement, this work sets a new standard for structured image sequence generation.

Title: Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition

Authors: Muxin Pu, Mei Kuan Lim, Chun Yong Chong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20436
Pdf URL: https://arxiv.org/pdf/2503.20436
Copy Paste: [[2503.20436]] Siformer: Feature-isolated Transformer for Efficient Skeleton-based Sign Language Recognition(https://arxiv.org/abs/2503.20436)
Keywords: robust, transformer
Abstract: Sign language recognition (SLR) refers to interpreting sign language glosses from given videos automatically. This research area presents a complex challenge in computer vision because of the rapid and intricate movements inherent in sign languages, which encompass hand gestures, body postures, and even facial expressions. Recently, skeleton-based action recognition has attracted increasing attention due to its ability to handle variations in subjects and backgrounds independently. However, current skeleton-based SLR methods exhibit three limitations: 1) they often neglect the importance of realistic hand poses, where most studies train SLR models on non-realistic skeletal representations; 2) they tend to assume complete data availability in both training and inference phases, and capture intricate relationships among different body parts collectively; 3) these methods treat all sign glosses uniformly, failing to account for differences in complexity levels regarding skeletal representations. To enhance the realism of hand skeletal representations, we present a kinematic hand pose rectification method for enforcing constraints. To mitigate the impact of missing data, we propose a feature-isolated mechanism to focus on capturing local spatial-temporal context. This method captures the context concurrently and independently from individual features, thus enhancing the robustness of the SLR model. Additionally, to adapt to varying complexity levels of sign glosses, we develop an input-adaptive inference approach to optimise computational efficiency and accuracy. Experimental results demonstrate the effectiveness of our approach, as evidenced by achieving a new state-of-the-art (SOTA) performance on WLASL100 and LSA64. For WLASL100, we achieve a top-1 accuracy of 86.50%, marking a relative improvement of 2.39% over the previous SOTA. For LSA64, we achieve a top-1 accuracy of 99.84%.

Title: Lipschitz Constant Meets Condition Number: Learning Robust and Compact Deep Neural Networks

Authors: Yangqi Feng, Shing-Ho J. Lin, Baoyuan Gao, Xian Wei
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.20454
Pdf URL: https://arxiv.org/pdf/2503.20454
Copy Paste: [[2503.20454]] Lipschitz Constant Meets Condition Number: Learning Robust and Compact Deep Neural Networks(https://arxiv.org/abs/2503.20454)
Keywords: attack, robust
Abstract: Recent research has revealed that high compression of Deep Neural Networks (DNNs), e.g., massive pruning of the weight matrix of a DNN, leads to a severe drop in accuracy and susceptibility to adversarial attacks. Integration of network pruning into an adversarial training framework has been proposed to promote adversarial robustness. It has been observed that a highly pruned weight matrix tends to be ill-conditioned, i.e., increasing the condition number of the weight matrix. This phenomenon aggravates the vulnerability of a DNN to input noise. Although a highly pruned weight matrix is considered to be able to lower the upper bound of the local Lipschitz constant to tolerate large distortion, the ill-conditionedness of such a weight matrix results in a non-robust DNN model. To overcome this challenge, this work develops novel joint constraints to adjust the weight distribution of networks, namely, the Transformed Sparse Constraint joint with Condition Number Constraint (TSCNC), which copes with smoothing distribution and differentiable constraint functions to reduce condition number and thus avoid the ill-conditionedness of weight matrices. Furthermore, our theoretical analyses unveil the relevance between the condition number and the local Lipschitz constant of the weight matrix, namely, the sharply increasing condition number becomes the dominant factor that restricts the robustness of over-sparsified models. Extensive experiments are conducted on several public datasets, and the results show that the proposed constraints significantly improve the robustness of a DNN with high pruning rates.

Title: From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

Authors: Yucheng Suo, Fan Ma, Linchao Zhu, Tianyi Wang, Fengyun Rao, Yi Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20472
Pdf URL: https://arxiv.org/pdf/2503.20472
Copy Paste: [[2503.20472]] From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment(https://arxiv.org/abs/2503.20472)
Keywords: robust, large language model
Abstract: Multi-modal large language models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging as the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address the challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward by linearly combining three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the confidence-aligned score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, and results on seven datasets show that our method improves the performance of three MLLMs.
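The selection step described in this abstract, a linear combination of three scores over sampled answers, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the field names, the per-option averaging of confidence and reasoning scores, and the default weights are all hypothetical, and the scores themselves are taken as given inputs rather than computed from a model.

```python
from collections import Counter


def select_answer(samples, weights=(1.0, 1.0, 1.0)):
    """Pick the option with the highest linearly combined score.

    samples: list of dicts with keys 'answer', 'confidence', 'reasoning'
             (hypothetical field names; scores assumed to lie in [0, 1]).
    weights: (w_freq, w_conf, w_reason), assumed mixing coefficients.
    """
    w_freq, w_conf, w_reason = weights
    counts = Counter(s["answer"] for s in samples)
    totals = {}
    for option, count in counts.items():
        group = [s for s in samples if s["answer"] == option]
        freq = count / len(samples)  # how often this option was sampled
        conf = sum(s["confidence"] for s in group) / len(group)  # assumed: mean over the group
        reason = sum(s["reasoning"] for s in group) / len(group)  # assumed: mean over the group
        totals[option] = w_freq * freq + w_conf * conf + w_reason * reason
    best = max(totals, key=totals.get)
    return best, totals


# Four sampled answers to one multiple-choice question (made-up scores).
samples = [
    {"answer": "A", "confidence": 0.6, "reasoning": 0.5},
    {"answer": "A", "confidence": 0.6, "reasoning": 0.5},
    {"answer": "A", "confidence": 0.6, "reasoning": 0.5},
    {"answer": "B", "confidence": 0.9, "reasoning": 0.4},
]

best, totals = select_answer(samples)
print(best, totals)
```

With equal weights the frequently sampled option wins in this toy input, while putting all the weight on confidence would favor the single high-confidence sample, which is why the choice of weights matters.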

Title: Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability

Authors: Yingdong Shi, Changming Li, Yifan Wang, Yongxiang Zhao, Anqi Pang, Sibei Yang, Jingyi Yu, Kan Ren
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20483
Pdf URL: https://arxiv.org/pdf/2503.20483
Copy Paste: [[2503.20483]] Dissecting and Mitigating Diffusion Bias via Mechanistic Interpretability(https://arxiv.org/abs/2503.20483)
Keywords: interpretability, diffusion
Abstract: Diffusion models have demonstrated impressive capabilities in synthesizing diverse content. However, despite their high-quality outputs, these models often perpetuate social biases, including those related to gender and race. These biases can potentially contribute to harmful real-world consequences, reinforcing stereotypes and exacerbating inequalities in various social contexts. While existing research on diffusion bias mitigation has predominantly focused on guiding content generation, it often neglects the intrinsic mechanisms within diffusion models that causally drive biased outputs. In this paper, we investigate the internal processes of diffusion models, identifying specific decision-making mechanisms, termed bias features, embedded within the model architecture. By directly manipulating these features, our method precisely isolates and adjusts the elements responsible for bias generation, permitting granular control over the bias levels in the generated content. Through experiments on both unconditional and conditional diffusion models across various social bias attributes, we demonstrate our method's efficacy in managing generation distribution while preserving image quality. We also dissect the discovered model mechanism, revealing different intrinsic features controlling fine-grained aspects of generation, boosting further research on mechanistic interpretability of diffusion models.

Title: Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation

Authors: Qi Si, Bo Wang, Zhao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20484
Pdf URL: https://arxiv.org/pdf/2503.20484
Copy Paste: [[2503.20484]] Contrastive Learning Guided Latent Diffusion Model for Image-to-Image Translation(https://arxiv.org/abs/2503.20484)
Keywords: diffusion
Abstract: The diffusion model has demonstrated superior performance in synthesizing diverse and high-quality images for text-guided image translation. However, there remains room for improvement in both the formulation of text prompts and the preservation of reference image content. First, variations in target text prompts can significantly influence the quality of the generated images, and it is often challenging for users to craft an optimal prompt that fully captures the content of the input image. Second, while existing models can introduce desired modifications to specific regions of the reference image, they frequently induce unintended alterations in areas that should remain unchanged. To address these challenges, we propose pix2pix-zeroCon, a zero-shot diffusion-based method that eliminates the need for additional training by leveraging patch-wise contrastive loss. Specifically, we automatically determine the editing direction in the text embedding space based on the reference image and target prompts. Furthermore, to ensure precise content and structural preservation in the edited image, we introduce cross-attention guiding loss and patch-wise contrastive loss between the generated and original image embeddings within a pre-trained diffusion model. Notably, our approach requires no additional training and operates directly on a pre-trained text-to-image diffusion model. Extensive experiments demonstrate that our method surpasses existing models in image-to-image translation, achieving enhanced fidelity and controllability.

Title: VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

Authors: Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20491
Pdf URL: https://arxiv.org/pdf/2503.20491
Copy Paste: [[2503.20491]] VPO: Aligning Text-to-Video Generation Models with Prompt Optimization(https://arxiv.org/abs/2503.20491)
Keywords: large language model
Abstract: Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at this https URL.

Title: Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models

Authors: Fanhu Zeng, Zhen Cheng, Fei Zhu, Xu-Yao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20492
Pdf URL: https://arxiv.org/pdf/2503.20492
Copy Paste: [[2503.20492]] Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models(https://arxiv.org/abs/2503.20492)
Keywords: security
Abstract: Reliable prediction by classifiers is crucial for their deployment in high security and dynamically changing situations. However, modern neural networks often exhibit overconfidence for misclassified predictions, highlighting the need for confidence estimation to detect errors. Despite the achievements obtained by existing methods on small-scale datasets, they all require training from scratch and there are no efficient and effective misclassification detection (MisD) methods, hindering practical application towards large-scale and ever-changing datasets. In this paper, we pave the way to exploit vision language model (VLM) leveraging text information to establish an efficient and general-purpose misclassification detection framework. By harnessing the power of VLM, we construct FSMisD, a Few-Shot prompt learning framework for MisD to refrain from training from scratch and therefore improve tuning efficiency. To enhance misclassification detection ability, we use adaptive pseudo sample generation and a novel negative loss to mitigate the issue of overconfidence by pushing category prompts away from pseudo features. We conduct comprehensive experiments with prompt learning methods and validate the generalization ability across various datasets with domain shift. Significant and consistent improvement demonstrates the effectiveness, efficiency and generalizability of our approach.

Title: Enhancing Depression Detection via Question-wise Modality Fusion

Authors: Aishik Mandal, Dana Atzil-Slonim, Thamar Solorio, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20496
Pdf URL: https://arxiv.org/pdf/2503.20496
Copy Paste: [[2503.20496]] Enhancing Depression Detection via Question-wise Modality Fusion(https://arxiv.org/abs/2503.20496)
Keywords: interpretability
Abstract: Depression is a highly prevalent and disabling condition that incurs substantial personal and societal costs. Current depression diagnosis involves determining the depression severity of a person through self-reported questionnaires or interviews conducted by clinicians. This often leads to delayed treatment and involves substantial human resources. Thus, several works try to automate the process using multimodal data. However, they usually overlook the following: i) The variable contribution of each modality for each question in the questionnaire and ii) Using ordinal classification for the task. This results in sub-optimal fusion and training methods. In this work, we propose a novel Question-wise Modality Fusion (QuestMF) framework trained with a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues. The performance of our framework is comparable to the current state-of-the-art models on the E-DAIC dataset and enhances interpretability by predicting scores for each question. This will help clinicians identify an individual's symptoms, allowing them to customise their interventions accordingly. We also make the code for the QuestMF framework publicly available.

Title: MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

Authors: Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Jiayi Ji, Jie Lou, Debing Zhang, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20502
Pdf URL: https://arxiv.org/pdf/2503.20502
Copy Paste: [[2503.20502]] MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning(https://arxiv.org/abs/2503.20502)
Keywords: large language model
Abstract: Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly. Yet, a significant gap persists in understanding the attributes of high-quality instruction tuning data and frameworks for its automated selection. To address this, we introduce MLLM-Selector, an automated approach that identifies valuable data for VIT by weighing necessity and diversity. Our process starts by randomly sampling a subset from the VIT data pool to fine-tune a pretrained model, thus creating a seed model with an initial ability to follow instructions. Then, leveraging the seed model, we calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector, our methodology that fuses necessity scoring with strategic sampling for superior data refinement. Empirical results indicate that within identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 in some benchmarks with less than 1% of the data and consistently exceeds performance across all validated benchmarks when using less than 50%.

Title: Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering

Authors: Zehui Liao, Shishuai Hu, Ke Zou, Huazhu Fu, Liangli Zhen, Yong Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20504
Pdf URL: https://arxiv.org/pdf/2503.20504
Copy Paste: [[2503.20504]] Vision-Amplified Semantic Entropy for Hallucination Detection in Medical Visual Question Answering(https://arxiv.org/abs/2503.20504)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have demonstrated significant potential in medical Visual Question Answering (VQA). Yet, they remain prone to hallucinations (incorrect responses that contradict input images), posing substantial risks in clinical decision-making. Detecting these hallucinations is essential for establishing trust in MLLMs among clinicians and patients, thereby enabling their real-world adoption. Current hallucination detection methods, especially semantic entropy (SE), have demonstrated promising hallucination detection capacity for LLMs. However, adapting SE to medical MLLMs by incorporating visual perturbations presents a dilemma. Weak perturbations preserve image content and ensure clinical validity, but may be overlooked by medical MLLMs, which tend to over-rely on language priors. In contrast, strong perturbations can distort essential diagnostic features, compromising clinical interpretation. To address this issue, we propose Vision-Amplified Semantic Entropy (VASE), which incorporates weak image transformations and amplifies the impact of visual input, to improve hallucination detection in medical VQA. We first estimate the semantic predictive distribution under weak visual transformations to preserve clinical validity, and then amplify visual influence by contrasting this distribution with that derived from a distorted image. The entropy of the resulting distribution is estimated as VASE. Experiments on two medical open-ended VQA datasets demonstrate that VASE consistently outperforms existing hallucination detection methods.

Title: Explainable ICD Coding via Entity Linking

Authors: Leonor Barreiros, Isabel Coutinho, Gonçalo M. Correia, Bruno Martins
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20508
Pdf URL: https://arxiv.org/pdf/2503.20508
Copy Paste: [[2503.20508]] Explainable ICD Coding via Entity Linking(https://arxiv.org/abs/2503.20508)
Keywords: large language model
Abstract: Clinical coding is a critical task in healthcare, although traditional methods for automating clinical coding may not provide sufficient explicit evidence for coders in production environments. This evidence is crucial, as medical coders have to make sure there exists at least one explicit passage in the input health record that justifies the attribution of a code. We therefore propose to reframe the task as an entity linking problem, in which each document is annotated with its set of codes and respective textual evidence, enabling better human-machine collaboration. By leveraging parameter-efficient fine-tuning of Large Language Models (LLMs), together with constrained decoding, we introduce three approaches to solve this problem that prove effective at disambiguating clinical mentions and that perform well in few-shot scenarios.

Title: Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications

Authors: Mahya Nikouei, Bita Baroutian, Shahabedin Nabavi, Fateme Taraghi, Atefe Aghaei, Ayoob Sajedi, Mohsen Ebrahimi Moghaddam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20516
Pdf URL: https://arxiv.org/pdf/2503.20516
Copy Paste: [[2503.20516]] Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications(https://arxiv.org/abs/2503.20516)
Keywords: robust, extraction, transformer
Abstract: Small object detection (SOD) is a critical yet challenging task in computer vision, with applications spanning surveillance, autonomous systems, medical imaging, and remote sensing. Unlike larger objects, small objects contain limited spatial and contextual information, making accurate detection difficult. Challenges such as low resolution, occlusion, background interference, and class imbalance further complicate the problem. This survey provides a comprehensive review of recent advancements in SOD using deep learning, focusing on articles published in Q1 journals during 2024-2025. We analyzed challenges, state-of-the-art techniques, datasets, evaluation metrics, and real-world applications. Recent advancements in deep learning have introduced innovative solutions, including multi-scale feature extraction, Super-Resolution (SR) techniques, attention mechanisms, and transformer-based architectures. Additionally, improvements in data augmentation, synthetic data generation, and transfer learning have addressed data scarcity and domain adaptation issues. Furthermore, emerging trends such as lightweight neural networks, knowledge distillation (KD), and self-supervised learning offer promising directions for improving detection efficiency, particularly in resource-constrained environments like Unmanned Aerial Vehicles (UAV)-based surveillance and edge computing. We also review widely used datasets, along with standard evaluation metrics such as mean Average Precision (mAP) and size-specific AP scores. The survey highlights real-world applications, including traffic monitoring, maritime surveillance, industrial defect detection, and precision agriculture. Finally, we discuss open research challenges and future directions, emphasizing the need for robust domain adaptation techniques, better feature fusion strategies, and real-time performance optimization.

Title: MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation

Authors: Jinnan Chen, Lingting Zhu, Zeyu Hu, Shengju Qian, Yugang Chen, Xin Wang, Gim Hee Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20519
Pdf URL: https://arxiv.org/pdf/2503.20519
Copy Paste: [[2503.20519]] MAR-3D: Progressive Masked Auto-regressor for High-Resolution 3D Generation(https://arxiv.org/abs/2503.20519)
Keywords: diffusion, transformer, generative
Abstract: Recent advances in auto-regressive transformers have revolutionized generative modeling across different domains, from language processing to visual generation, demonstrating remarkable capabilities. However, applying these advances to 3D generation presents three key challenges: the unordered nature of 3D data conflicts with the sequential next-token prediction paradigm, conventional vector quantization approaches incur substantial compression loss when applied to 3D meshes, and efficient scaling strategies for higher-resolution latent prediction are lacking. To address these challenges, we introduce MAR-3D, which integrates a pyramid variational autoencoder with a cascaded masked auto-regressive transformer (Cascaded MAR) for progressive latent upscaling in the continuous space. Our architecture employs random masking during training and auto-regressive denoising in random order during inference, naturally accommodating the unordered property of 3D latent tokens. Additionally, we propose a cascaded training strategy with condition augmentation that enables efficient up-scaling of the latent token resolution with fast convergence. Extensive experiments demonstrate that MAR-3D not only achieves superior performance and generalization capabilities compared to existing methods but also exhibits enhanced scaling capabilities compared to joint distribution modeling approaches (e.g., diffusion transformers).

Title: GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

Authors: Lloyd Russell, Anthony Hu, Lorenzo Bertoni, George Fedoseev, Jamie Shotton, Elahe Arani, Gianluca Corrado
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.20523
Pdf URL: https://arxiv.org/pdf/2503.20523
Copy Paste: [[2503.20523]] GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving(https://arxiv.org/abs/2503.20523)
Keywords: diffusion, generative
Abstract: Generative models offer a scalable and flexible paradigm for simulating complex environments, yet current approaches fall short in addressing the domain-specific requirements of autonomous driving - such as multi-agent interactions, fine-grained control, and multi-camera consistency. We introduce GAIA-2, Generative AI for Autonomy, a latent diffusion world model that unifies these capabilities within a single generative framework. GAIA-2 supports controllable video generation conditioned on a rich set of structured inputs: ego-vehicle dynamics, agent configurations, environmental factors, and road semantics. It generates high-resolution, spatiotemporally consistent multi-camera videos across geographically diverse driving environments (UK, US, Germany). The model integrates both structured conditioning and external latent embeddings (e.g., from a proprietary driving model) to facilitate flexible and semantically grounded scene synthesis. Through this integration, GAIA-2 enables scalable simulation of both common and rare driving scenarios, advancing the use of generative world models as a core tool in the development of autonomous systems. Videos are available at this https URL.

Title: StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs

Authors: Zhicheng Guo, Sijie Cheng, Yuchen Niu, Hao Wang, Sicheng Zhou, Wenbing Huang, Yang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20527
Pdf URL: https://arxiv.org/pdf/2503.20527
Copy Paste: [[2503.20527]] StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs(https://arxiv.org/abs/2503.20527)
Keywords: large language model
Abstract: The rapid advancement of large language models (LLMs) has spurred significant interest in tool learning, where LLMs are augmented with external tools to tackle complex tasks. However, existing tool environments face challenges in balancing stability, scalability, and realness, particularly for benchmarking purposes. To address this problem, we propose MirrorAPI, a novel framework that trains specialized LLMs to accurately simulate real API responses, effectively acting as "mirrors" to tool environments. Using a comprehensive dataset of request-response pairs from 7,000+ APIs, we employ supervised fine-tuning and chain-of-thought reasoning to enhance simulation fidelity. MirrorAPI achieves superior accuracy and stability compared to state-of-the-art methods, as demonstrated by its performance on the newly constructed MirrorAPI-Bench and its integration into StableToolBench.

Title: TD-BFR: Truncated Diffusion Model for Efficient Blind Face Restoration

Authors: Ziying Zhang, Xiang Gao, Zhixin Wang, Qiang Hu, Xiaoyun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20537
Pdf URL: https://arxiv.org/pdf/2503.20537
Copy Paste: [[2503.20537]] TD-BFR: Truncated Diffusion Model for Efficient Blind Face Restoration(https://arxiv.org/abs/2503.20537)
Keywords: robust, diffusion, generative
Abstract: Diffusion-based methodologies have shown significant potential in blind face restoration (BFR), leveraging their robust generative capabilities. However, they are often criticized for two significant problems: 1) slow training and inference speed, and 2) inadequate recovery of fine-grained facial details. To address these problems, we propose a novel Truncated Diffusion model for efficient Blind Face Restoration (TD-BFR), a three-stage paradigm tailored for the progressive resolution of degraded images. Specifically, TD-BFR utilizes an innovative truncated sampling method, starting from low-quality (LQ) images at low resolution to enhance sampling speed, and then introduces an adaptive degradation removal module to handle unknown degradations and connect the generation processes across different resolutions. Additionally, we further adapt the priors of pre-trained diffusion models to recover rich facial details. Our method efficiently restores high-quality images in a coarse-to-fine manner and experimental results demonstrate that TD-BFR is, on average, 4.75$\times$ faster than current state-of-the-art diffusion-based BFR methods while maintaining competitive quality.

Title: A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts

Authors: Ryumei Nakada, Wenlong Ji, Tianxi Cai, James Zou, Linjun Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20561
Pdf URL: https://arxiv.org/pdf/2503.20561
Copy Paste: [[2503.20561]] A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts(https://arxiv.org/abs/2503.20561)
Keywords: robust, transformer, large language model
Abstract: Prompt engineering has emerged as a powerful technique for guiding large language models (LLMs) toward desired responses, significantly enhancing their performance across diverse tasks. Beyond their role as static predictors, LLMs increasingly function as intelligent agents, capable of reasoning, decision-making, and adapting dynamically to complex environments. However, the theoretical underpinnings of prompt engineering remain largely unexplored. In this paper, we introduce a formal framework demonstrating that transformer models, when provided with carefully designed prompts, can act as a configurable computational system by emulating a "virtual" neural network during inference. Specifically, input prompts effectively translate into the corresponding network configuration, enabling LLMs to adjust their internal computations dynamically. Building on this construction, we establish an approximation theory for $\beta$-times differentiable functions, proving that transformers can approximate such functions with arbitrary precision when guided by appropriately structured prompts. Moreover, our framework provides theoretical justification for several empirically successful prompt engineering techniques, including the use of longer, structured prompts, filtering irrelevant information, enhancing prompt token diversity, and leveraging multi-agent interactions. By framing LLMs as adaptable agents rather than static models, our findings underscore their potential for autonomous reasoning and problem-solving, paving the way for more robust and theoretically grounded advancements in prompt engineering and AI agent design.

Title: Low-resource Information Extraction with the European Clinical Case Corpus

Authors: Soumitra Ghosh, Begona Altuna, Saeed Farzi, Pietro Ferrazzi, Alberto Lavelli, Giulia Mezzanotte, Manuela Speranza, Bernardo Magnini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20568
Pdf URL: https://arxiv.org/pdf/2503.20568
Copy Paste: [[2503.20568]] Low-resource Information Extraction with the European Clinical Case Corpus(https://arxiv.org/abs/2503.20568)
Keywords: extraction, large language model
Abstract: We present E3C-3.0, a multilingual dataset in the medical domain, comprising clinical cases annotated with diseases and test-result relations. The dataset includes both native texts in five languages (English, French, Italian, Spanish and Basque) and texts translated and projected from the English source into five target languages (Greek, Italian, Polish, Slovak, and Slovenian). A semi-automatic approach has been implemented, including automatic annotation projection based on Large Language Models (LLMs) and human revision. We present several experiments showing that current state-of-the-art LLMs can benefit from being fine-tuned on the E3C-3.0 dataset. We also show that transfer learning in different languages is very effective, mitigating the scarcity of data. Finally, we compare performance both on native data and on projected data. We release the data at this https URL.

Title: Feature Statistics with Uncertainty Help Adversarial Robustness

Authors: Ran Wang, Xinlei Zhou, Rihao Li, Meng Hu, Wenhui Wu, Yuheng Jia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20583
Pdf URL: https://arxiv.org/pdf/2503.20583
Copy Paste: [[2503.20583]] Feature Statistics with Uncertainty Help Adversarial Robustness(https://arxiv.org/abs/2503.20583)
Keywords: security, attack, robust
Abstract: Despite the remarkable success of deep neural networks (DNNs), the security threat of adversarial attacks poses a significant challenge to the reliability of DNNs. By introducing randomness into different parts of DNNs, stochastic methods can enable the model to learn some uncertainty, thereby improving model robustness efficiently. In this paper, we theoretically discover a universal phenomenon that adversarial attacks will shift the distributions of feature statistics. Motivated by this theoretical finding, we propose a robustness enhancement module called Feature Statistics with Uncertainty (FSU). It resamples channel-wise feature means and standard deviations of examples from multivariate Gaussian distributions, which helps to reconstruct the attacked examples and calibrate the shifted distributions. The calibration recovers some domain characteristics of the data for classification, thereby mitigating the influence of perturbations and weakening the ability of attacks to deceive models. The proposed FSU module has universal applicability in training, attacking, predicting and fine-tuning, demonstrating impressive robustness enhancement ability at trivial additional time cost. For example, against powerful optimization-based CW attacks, by incorporating FSU into attacking and predicting phases, it endows many collapsed state-of-the-art models with 50%-80% robust accuracy on CIFAR10, CIFAR100 and SVHN.

Title: Diffusion Counterfactuals for Image Regressors

Authors: Trung Duc Ha, Sidney Bender
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20595
Pdf URL: https://arxiv.org/pdf/2503.20595
Copy Paste: [[2503.20595]] Diffusion Counterfactuals for Image Regressors(https://arxiv.org/abs/2503.20595)
Keywords: diffusion, generative
Abstract: Counterfactual explanations have been successfully applied to create human interpretable explanations for various black-box models. They are handy for tasks in the image domain, where the quality of the explanations benefits from recent advances in generative models. Although counterfactual explanations have been widely applied to classification models, their application to regression tasks remains underexplored. We present two methods to create counterfactual explanations for image regression tasks using diffusion-based generative models to address challenges in sparsity and quality: 1) one based on a Denoising Diffusion Probabilistic Model that operates directly in pixel-space and 2) another based on a Diffusion Autoencoder operating in latent space. Both produce realistic, semantic, and smooth counterfactuals on CelebA-HQ and a synthetic data set, providing easily interpretable insights into the decision-making process of the regression model and revealing spurious correlations. We find that for regression counterfactuals, changes in features depend on the region of the predicted value. Large semantic changes are needed for significant changes in predicted values, making it harder to find sparse counterfactuals than with classifiers. Moreover, pixel space counterfactuals are more sparse while latent space counterfactuals are of higher quality and allow bigger semantic changes.

Title: IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting

Authors: Hao Fu, Hanbin Zhao, Jiahua Dong, Chao Zhang, Hui Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20612
Pdf URL: https://arxiv.org/pdf/2503.20612
Copy Paste: [[2503.20612]] IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting(https://arxiv.org/abs/2503.20612)
Keywords: transformer
Abstract: Recent pre-trained vision-language models (PT-VLMs) often face a Multi-Domain Class-Incremental Learning (MCIL) scenario in practice, where several classes and domains of multi-modal tasks arrive incrementally. Without access to previously learned tasks and unseen tasks, memory-constrained MCIL suffers from forward and backward forgetting. To alleviate the above challenges, parameter-efficient fine-tuning techniques (PEFT), such as prompt tuning, are employed to adapt the PT-VLM to the diverse incrementally learned tasks. To achieve effective new task adaptation, existing methods only consider the effect of PEFT strategy selection, but neglect the influence of PEFT parameter setting (e.g., prompting). In this paper, we tackle the challenge of optimizing prompt designs for diverse tasks in MCIL and propose an Instance-Aware Prompting (IAP) framework. Specifically, our Instance-Aware Gated Prompting (IA-GP) module enhances adaptation to new tasks while mitigating forgetting by dynamically assigning prompts across transformer layers at the instance level. Our Instance-Aware Class-Distribution-Driven Prompting (IA-CDDP) improves the task adaptation process by determining an accurate task-label-related confidence score for each instance. Experimental evaluations across 11 datasets, using three performance metrics, demonstrate the effectiveness of our proposed method. Code can be found at this https URL.

Title: State-Aware Perturbation Optimization for Robust Deep Reinforcement Learning

Authors: Zongyuan Zhang, Tianyang Duan, Zheng Lin, Dong Huang, Zihan Fang, Zekai Sun, Ling Xiong, Hongbin Liang, Heming Cui, Yong Cui
Subjects: cs.LG, cs.AI, cs.NI, eess.SY
Abstract URL: https://arxiv.org/abs/2503.20613
Pdf URL: https://arxiv.org/pdf/2503.20613
Copy Paste: [[2503.20613]] State-Aware Perturbation Optimization for Robust Deep Reinforcement Learning(https://arxiv.org/abs/2503.20613)
Keywords: attack, robust, steal
Abstract: Recently, deep reinforcement learning (DRL) has emerged as a promising approach for robotic control. However, the deployment of DRL in real-world robots is hindered by its sensitivity to environmental perturbations. While existing white-box adversarial attacks rely on local gradient information and apply uniform perturbations across all states to evaluate DRL robustness, they fail to account for temporal dynamics and state-specific vulnerabilities. To combat the above challenge, we first conduct a theoretical analysis of white-box attacks in DRL by establishing the adversarial victim-dynamics Markov decision process (AVD-MDP), to derive the necessary and sufficient conditions for a successful attack. Based on this, we propose a selective state-aware reinforcement adversarial attack method, named STAR, to optimize perturbation stealthiness and state visitation dispersion. STAR first employs a soft mask-based state-targeting mechanism to minimize redundant perturbations, enhancing stealthiness and attack effectiveness. Then, it incorporates an information-theoretic optimization objective to maximize mutual information between perturbations, environmental states, and victim actions, ensuring a dispersed state-visitation distribution that steers the victim agent into vulnerable states for maximum return reduction. Extensive experiments demonstrate that STAR outperforms state-of-the-art benchmarks.

Title: ProFed: a Benchmark for Proximity-based non-IID Federated Learning

Authors: Davide Domini, Gianluca Aguzzi, Mirko Viroli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20618
Pdf URL: https://arxiv.org/pdf/2503.20618
Copy Paste: [[2503.20618]] ProFed: a Benchmark for Proximity-based non-IID Federated Learning(https://arxiv.org/abs/2503.20618)
Keywords: federate
Abstract: In recent years, federated learning (FL) has gained significant attention within the machine learning community. Although various FL algorithms have been proposed in the literature, their performance often degrades when data across clients is non-independently and identically distributed (non-IID). This skewness in data distribution often emerges from geographic patterns, with notable examples including regional linguistic variations in text data or localized traffic patterns in urban environments. Such scenarios result in IID data within specific regions but non-IID data across regions. However, existing FL algorithms are typically evaluated by randomly splitting non-IID data across devices, disregarding their spatial distribution. To address this gap, we introduce ProFed, a benchmark that simulates data splits with varying degrees of skewness across different regions. We incorporate several skewness methods from the literature and apply them to well-known datasets, including MNIST, FashionMNIST, CIFAR-10, and CIFAR-100. Our goal is to provide researchers with a standardized framework to evaluate FL algorithms more effectively and consistently against established baselines.

Title: Collaborative Storytelling and LLM: A Linguistic Analysis of Automatically-Generated Role-Playing Game Sessions

Authors: Alessandro Maisto
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20623
Pdf URL: https://arxiv.org/pdf/2503.20623
Copy Paste: [[2503.20623]] Collaborative Storytelling and LLM: A Linguistic Analysis of Automatically-Generated Role-Playing Game Sessions(https://arxiv.org/abs/2503.20623)
Keywords: large language model
Abstract: Role-playing games (RPG) are games in which players interact with one another to create narratives. The role of players in the RPG is largely based on the interaction between players and their characters. This emerging form of shared narrative, primarily oral, is receiving increasing attention. In particular, many authors investigated the use of an LLM as an actor in the game. In this paper, we aim to discover to what extent the language of Large Language Models (LLMs) exhibits oral or written features when asked to generate an RPG session without human interference. We will conduct a linguistic analysis of the lexical and syntactic features of the generated texts and compare the results with analyses of conversations, transcripts of human RPG sessions, and books. We found that LLMs exhibit a pattern that is distinct from all other text categories, including oral conversations, human RPG sessions and books. Our analysis has shown how training influences the way LLMs express themselves and provides important indications of the narrative capabilities of these tools.

Title: $β$-GNN: A Robust Ensemble Approach Against Graph Structure Perturbation

Authors: Haci Ismail Aslan, Philipp Wiesner, Ping Xiong, Odej Kao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20630
Pdf URL: https://arxiv.org/pdf/2503.20630
Copy Paste: [[2503.20630]] $β$-GNN: A Robust Ensemble Approach Against Graph Structure Perturbation(https://arxiv.org/abs/2503.20630)
Keywords: security, attack, robust
Abstract: Graph Neural Networks (GNNs) are playing an increasingly important role in the efficient operation and security of computing systems, with applications in workload scheduling, anomaly detection, and resource management. However, their vulnerability to network perturbations poses a significant challenge. We propose $\beta$-GNN, a model enhancing GNN robustness without sacrificing clean data performance. $\beta$-GNN uses a weighted ensemble, combining any GNN with a multi-layer perceptron. A learned dynamic weight, $\beta$, modulates the GNN's contribution. This $\beta$ not only weights GNN influence but also indicates data perturbation levels, enabling proactive mitigation. Experimental results on diverse datasets show $\beta$-GNN's superior adversarial accuracy and attack severity quantification. Crucially, $\beta$-GNN avoids perturbation assumptions, preserving clean data structure and performance.

Title: PVLens: Enhancing Pharmacovigilance Through Automated Label Extraction

Authors: Jeffery L Painter, Gregory E Powell, Andrew Bate
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20639
Pdf URL: https://arxiv.org/pdf/2503.20639
Copy Paste: [[2503.20639]] PVLens: Enhancing Pharmacovigilance Through Automated Label Extraction(https://arxiv.org/abs/2503.20639)
Keywords: extraction
Abstract: Reliable drug safety reference databases are essential for pharmacovigilance, yet existing resources like SIDER are outdated and static. We introduce PVLens, an automated system that extracts labeled safety information from FDA Structured Product Labels (SPLs) and maps terms to MedDRA. PVLens integrates automation with expert oversight through a web-based review tool. In validation against 97 drug labels, PVLens achieved an F1 score of 0.882, with high recall (0.983) and moderate precision (0.799). By offering a scalable, more accurate and continuously updated alternative to SIDER, PVLens enhances real-time pharmacovigilance with improved accuracy and contemporaneous insights.

Title: Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging

Authors: Han Wu, Yuxuan Yao, Shuqi Liu, Zehua Liu, Xiaojin Fu, Xiongwei Han, Xing Li, Hui-Ling Zhen, Tao Zhong, Mingxuan Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20641
Pdf URL: https://arxiv.org/pdf/2503.20641
Copy Paste: [[2503.20641]] Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging(https://arxiv.org/abs/2503.20641)
Keywords: robust, large language model
Abstract: The transition from System 1 to System 2 reasoning in large language models (LLMs) has marked significant advancements in handling complex tasks through deliberate, iterative thinking. However, this progress often comes at the cost of efficiency, as models tend to overthink, generating redundant reasoning steps without proportional improvements in output quality. Long-to-Short (L2S) reasoning has emerged as a promising solution to this challenge, aiming to balance reasoning depth with practical efficiency. While existing approaches, such as supervised fine-tuning (SFT), reinforcement learning (RL), and prompt engineering, have shown potential, they are either computationally expensive or unstable. Model merging, on the other hand, offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models. In this work, we present a comprehensive empirical study on model merging for L2S reasoning, exploring diverse methodologies, including task-vector-based, SVD-based, and activation-informed merging. Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance. We also identify a strong correlation between model scale and merging efficacy with extensive evaluations on 1.5B/7B/14B/32B models. Furthermore, we investigate the merged model's ability to self-critique and self-correct, as well as its adaptive response length based on task complexity. Our findings highlight model merging as a highly efficient and effective paradigm for L2S reasoning, offering a practical solution to the overthinking problem while maintaining the robustness of System 2 reasoning. This work can be found on Github this https URL.

Title: MMGen: Unified Multi-modal Image Generation and Understanding in One Go

Authors: Jiepeng Wang, Zhaoqing Wang, Hao Pan, Yuan Liu, Dongdong Yu, Changhu Wang, Wenping Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20644
Pdf URL: https://arxiv.org/pdf/2503.20644
Copy Paste: [[2503.20644]] MMGen: Unified Multi-modal Image Generation and Understanding in One Go(https://arxiv.org/abs/2503.20644)
Keywords: diffusion, transformer, generative, segmentation
Abstract: A unified diffusion framework for multi-modal generation and understanding has the transformative potential to achieve seamless and controllable image diffusion and other cross-modal tasks. In this paper, we introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. This includes: (1) multi-modal category-conditioned generation, where multi-modal outputs are generated simultaneously through a single inference process, given category information; (2) multi-modal visual understanding, which accurately predicts depth, surface normals, and segmentation maps from RGB images; and (3) multi-modal conditioned generation, which produces corresponding RGB images based on specific modality conditions and other aligned modalities. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy to unify various tasks. Extensive experiments and applications demonstrate the effectiveness and superiority of MMGen across diverse tasks and conditions, highlighting its potential for applications that require simultaneous generation and understanding.

Title: Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification

Authors: Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20652
Pdf URL: https://arxiv.org/pdf/2503.20652
Copy Paste: [[2503.20652]] Imitating Radiological Scrolling: A Global-Local Attention Model for 3D Chest CT Volumes Multi-Label Anomaly Classification(https://arxiv.org/abs/2503.20652)
Keywords: transformer, segmentation
Abstract: The rapid increase in the number of Computed Tomography (CT) scan examinations has created an urgent need for automated tools, such as organ segmentation, anomaly classification, and report generation, to assist radiologists with their growing workload. Multi-label classification of Three-Dimensional (3D) CT scans is a challenging task due to the volumetric nature of the data and the variety of anomalies to be detected. Existing deep learning methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies effectively, while Vision Transformers require extensive pre-training, posing challenges for practical use. Additionally, these existing methods do not explicitly model the radiologist's navigational behavior while scrolling through CT scan slices, which requires both global context understanding and local detail awareness. In this study, we present CT-Scroll, a novel global-local attention model specifically designed to emulate the scrolling behavior of radiologists during the analysis of 3D CT scans. Our approach is evaluated on two public datasets, demonstrating its efficacy through comprehensive experiments and an ablation study that highlights the contribution of each model component.

Title: DR-PETS: Learning-Based Control With Planning in Adversarial Environments

Authors: Hozefa Jesawada, Antonio Acernese, Giovanni Russo, Carmen Del Vecchio
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2503.20660
Pdf URL: https://arxiv.org/pdf/2503.20660
Copy Paste: [[2503.20660]] DR-PETS: Learning-Based Control With Planning in Adversarial Environments(https://arxiv.org/abs/2503.20660)
Keywords: robust
Abstract: Ensuring robustness against epistemic, possibly adversarial, perturbations is essential for reliable real-world decision-making. While the Probabilistic Ensembles with Trajectory Sampling (PETS) algorithm inherently handles uncertainty via ensemble-based probabilistic models, it lacks guarantees against structured adversarial or worst-case uncertainty distributions. To address this, we propose DR-PETS, a distributionally robust extension of PETS that certifies robustness against adversarial perturbations. We formalize uncertainty via a p-Wasserstein ambiguity set, enabling worst-case-aware planning through a min-max optimization framework. While PETS passively accounts for stochasticity, DR-PETS actively optimizes robustness via a tractable convex approximation integrated into the PETS planning loop. Experiments on pendulum stabilization and cart-pole balancing show that DR-PETS certifies robustness against adversarial parameter perturbations, achieving consistent performance in worst-case scenarios where PETS deteriorates.

Title: ARMO: Autoregressive Rigging for Multi-Category Objects

Authors: Mingze Sun, Shiwei Mao, Keyi Chen, Yurun Chen, Shunlin Lu, Jingbo Wang, Junting Dong, Ruqi Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20663
Pdf URL: https://arxiv.org/pdf/2503.20663
Copy Paste: [[2503.20663]] ARMO: Autoregressive Rigging for Multi-Category Objects(https://arxiv.org/abs/2503.20663)
Keywords: diffusion, generative
Abstract: Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potentially dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. Through extensive experiments on the OmniRig dataset, our approach achieves state-of-the-art performance in skeleton prediction, demonstrating improved generalization across diverse object categories. The code and dataset will be made public for academic use upon acceptance.

Title: Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy

Authors: Yinan Sun, Xiongkuo Min, Zicheng Zhang, Yixuan Gao, Yuqin Cao, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20673
Pdf URL: https://arxiv.org/pdf/2503.20673
Copy Paste: [[2503.20673]] Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy(https://arxiv.org/abs/2503.20673)
Keywords: large language model
Abstract: The rapid development of multimodal large language models has resulted in remarkable advancements in visual perception and understanding, consolidating several tasks into a single visual question-answering framework. However, these models are prone to hallucinations, which limit their reliability as artificial intelligence systems. While this issue is extensively researched in natural language processing and image captioning, there remains a lack of investigation of hallucinations in Low-level Visual Perception and Understanding (HLPU), especially in the context of image quality assessment tasks. We consider that these hallucinations arise from an absence of clear self-awareness within the models. To address this issue, we first introduce the HLPU instruction database, the first instruction database specifically focused on hallucinations in low-level vision tasks. This database contains approximately 200K question-answer pairs and comprises four subsets, each covering different types of instructions. Subsequently, we propose the Self-Awareness Failure Elimination (SAFEQA) model, which utilizes image features, salient region features and quality features to improve the perception and comprehension abilities of the model in low-level vision tasks. Furthermore, we propose the Enhancing Self-Awareness Preference Optimization (ESA-PO) framework to increase the model's awareness of knowledge boundaries, thereby mitigating the incidence of hallucination. Finally, we conduct comprehensive experiments on low-level vision tasks, with the results demonstrating that our proposed method significantly enhances self-awareness of the model in these tasks and reduces hallucinations. Notably, our proposed method improves both accuracy and self-awareness of the proposed model and outperforms closed-source models in terms of various evaluation metrics.

Title: Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound

Authors: Yuhao Huang, Ao Chang, Haoran Dou, Xing Tao, Xinrui Zhou, Yan Cao, Ruobing Huang, Alejandro F Frangi, Lingyun Bao, Xin Yang, Dong Ni
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20685
Pdf URL: https://arxiv.org/pdf/2503.20685
Copy Paste: [[2503.20685]] Flip Learning: Weakly Supervised Erase to Segment Nodules in Breast Ultrasound(https://arxiv.org/abs/2503.20685)
Keywords: segmentation
Abstract: Accurate segmentation of nodules in both 2D breast ultrasound (BUS) and 3D automated breast ultrasound (ABUS) is crucial for clinical diagnosis and treatment planning. Therefore, developing an automated system for nodule segmentation can enhance user independence and expedite clinical analysis. Unlike fully-supervised learning, weakly-supervised segmentation (WSS) can streamline the laborious and intricate annotation process. However, current WSS methods face challenges in achieving precise nodule segmentation, as many of them depend on inaccurate activation maps or inefficient pseudo-mask generation algorithms. In this study, we introduce a novel multi-agent reinforcement learning-based WSS framework called Flip Learning, which relies solely on 2D/3D boxes for accurate segmentation. Specifically, multiple agents are employed to erase the target from the box to facilitate classification tag flipping, with the erased region serving as the predicted segmentation mask. The key contributions of this research are as follows: (1) Adoption of a superpixel/supervoxel-based approach to encode the standardized environment, capturing boundary priors and expediting the learning process. (2) Introduction of three meticulously designed rewards, comprising a classification score reward and two intensity distribution rewards, to steer the agents' erasing process precisely, thereby avoiding both under- and over-segmentation. (3) Implementation of a progressive curriculum learning strategy to enable agents to interact with the environment in a progressively challenging manner, thereby enhancing learning efficiency. Extensively validated on the large in-house BUS and ABUS datasets, our Flip Learning method outperforms state-of-the-art WSS methods and foundation models, and achieves performance comparable to fully-supervised learning algorithms.

Title: From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models

Authors: Nikita Neveditsin, Pawan Lingras, Vijay Mago
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.20715
Pdf URL: https://arxiv.org/pdf/2503.20715
Copy Paste: [[2503.20715]] From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models(https://arxiv.org/abs/2503.20715)
Keywords: extraction, generative, large language model
Abstract: This study examines the performance of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA), with a focus on implicit aspect extraction in a novel domain. Using a synthetic sports feedback dataset, we evaluate open-weight LLMs' ability to extract aspect-polarity pairs and propose a metric to facilitate the evaluation of aspect extraction with generative models. Our findings highlight both the potential and limitations of LLMs in the ABSA task.

Title: Learning Straight Flows by Learning Curved Interpolants

Authors: Shiv Shankar, Tomas Geffner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20719
Pdf URL: https://arxiv.org/pdf/2503.20719
Copy Paste: [[2503.20719]] Learning Straight Flows by Learning Curved Interpolants(https://arxiv.org/abs/2503.20719)
Keywords: generative
Abstract: Flow matching models typically use linear interpolants to define the forward/noise addition process. This, together with the independent coupling between noise and target distributions, yields a vector field which is often non-straight. Such curved fields lead to a slow inference/generation process. In this work, we propose to learn flexible (potentially curved) interpolants in order to learn straight vector fields to enable faster generation. We formulate this via a multi-level optimization problem and propose an efficient approximate procedure to solve it. Our framework provides an end-to-end and simulation-free optimization procedure, which can be leveraged to learn straight line generative trajectories.
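
As background for the linear interpolants mentioned above, the following Python sketch shows the standard construction: a point on the straight line between a noise sample and a data sample, together with the constant velocity that a flow matching model regresses onto. Treating `x0` as the noise end and `x1` as the data end is a convention assumed here. The learned curved interpolants proposed in the paper are not shown.

```python
def linear_interpolant(x0, x1, t):
    """Point at time t on the straight line from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Time derivative of the linear interpolant; constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


# Toy 2-D example (values are illustrative only).
noise = [0.0, 0.0]
data = [2.0, 4.0]
x_t = linear_interpolant(noise, data, 0.25)
v = target_velocity(noise, data)
```

Each individual pair is joined by a straight line, but because noise and data are paired independently, the learned field averages over many crossing lines. This is why the resulting vector field is generally not straight, as the abstract notes.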

Title: A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)

Authors: A. Candito (1), A. Dragan (1,2), R. Holbrey (1), A. Ribeiro (2), R. Donners (3), C. Messiou (1,2), N. Tunariu (1,2), D.-M. Koh (1,2), M. D. Blackledge (1), (1)The Institute of Cancer Research, London, United Kingdom (2)The Royal Marsden NHS Foundation Trust, London, United Kingdom (3)University Hospital Basel, Basel, Switzerland
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.20722
Pdf URL: https://arxiv.org/pdf/2503.20722
Copy Paste: [[2503.20722]] A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)(https://arxiv.org/abs/2503.20722)
Keywords: diffusion, segmentation
Abstract: Background: Apparent Diffusion Coefficient (ADC) values and Total Diffusion Volume (TDV) from Whole-body diffusion-weighted MRI (WB-DWI) are recognized cancer imaging biomarkers. However, manual disease delineation for ADC and TDV measurements is unfeasible in clinical practice, demanding automation. As a first step, we propose an algorithm to generate fast and reproducible probability maps of the skeleton, adjacent internal organs (liver, spleen, urinary bladder, and kidneys), and spinal canal. Methods: We developed an automated deep-learning pipeline based on a 3D patch-based Residual U-Net architecture that localizes and delineates these anatomical structures on WB-DWI. The algorithm was trained using "soft-labels" (non-binary segmentations) derived from a computationally intensive atlas-based approach. For training and validation, we employed a multi-center WB-DWI dataset comprising 532 scans from patients with Advanced Prostate Cancer (APC) or Multiple Myeloma (MM), with testing on 45 patients. Results: Our weakly-supervised deep learning model achieved an average dice score/precision/recall of 0.66/0.6/0.73 for skeletal delineations, 0.8/0.79/0.81 for internal organs, and 0.85/0.79/0.94 for spinal canal, with surface distances consistently below 3 mm. Relative median ADC and log-transformed volume differences between automated and manual expert-defined full-body delineations were below 10% and 4%, respectively. The computational time for generating probability maps was 12x faster than the atlas-based registration algorithm (25 s vs. 5 min). An experienced radiologist rated the model's accuracy "good" or "excellent" on test datasets. Conclusion: Our model offers fast and reproducible probability maps for localizing and delineating body regions on WB-DWI, enabling ADC and TDV quantification, potentially supporting clinicians in disease staging and treatment response assessment.
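
For readers unfamiliar with the overlap metrics reported above, this Python sketch computes Dice, precision, and recall for a pair of binary masks. It assumes masks are flattened to equal-length 0/1 sequences. How the paper thresholds its probability maps and averages over patients is not stated in the abstract and is not modeled here. The function name `overlap_metrics` is hypothetical.

```python
def overlap_metrics(pred, truth):
    """Return (dice, precision, recall) for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)       # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)   # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Two empty masks are treated as a perfect match.
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return dice, precision, recall


# Toy masks (illustrative only): 2 true positives, 1 false positive, 1 false negative.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [1, 1, 0, 1, 0, 0]
dice, precision, recall = overlap_metrics(pred_mask, true_mask)
```

For a single mask pair, Dice equals the harmonic mean of precision and recall. Averages over many cases need not satisfy that identity exactly.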

Title: Dynamic Motion Blending for Versatile Motion Editing

Authors: Nan Jiang, Hongjie Li, Ziye Yuan, Zimo He, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20724
Pdf URL: https://arxiv.org/pdf/2503.20724
Copy Paste: [[2503.20724]] Dynamic Motion Blending for Versatile Motion Editing(https://arxiv.org/abs/2503.20724)
Keywords: diffusion, large language model
Abstract: Text-guided motion editing enables high-level semantic control and iterative modifications beyond traditional keyframe animation. Existing methods rely on limited pre-collected training triplets, which severely hinders their versatility in diverse editing scenarios. We introduce MotionCutMix, an online data augmentation technique that dynamically generates training triplets by blending body part motions based on input text. While MotionCutMix effectively expands the training distribution, the compositional nature introduces increased randomness and potential body part incoordination. To model such a rich distribution, we present MotionReFit, an auto-regressive diffusion model with a motion coordinator. The auto-regressive architecture facilitates learning by decomposing long sequences, while the motion coordinator mitigates the artifacts of motion composition. Our method handles both spatial and temporal motion edits directly from high-level human instructions, without relying on additional specifications or Large Language Models. Through extensive experiments, we show that MotionReFit achieves state-of-the-art performance in text-guided motion editing.

Title: RecTable: Fast Modeling Tabular Data with Rectified Flow

Authors: Masane Fuchi, Tomohiro Takagi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.20731
Pdf URL: https://arxiv.org/pdf/2503.20731
Copy Paste: [[2503.20731]] RecTable: Fast Modeling Tabular Data with Rectified Flow(https://arxiv.org/abs/2503.20731)
Keywords: diffusion
Abstract: Score-based or diffusion models generate high-quality tabular data, surpassing GAN-based and VAE-based models. However, these methods require substantial training time. In this paper, we introduce RecTable, which uses rectified flow modeling, as applied in text-to-image and text-to-video generation. RecTable features a simple architecture consisting of a few stacked gated linear unit blocks. Additionally, our training strategies are also simple, incorporating a mixed-type noise distribution and a logit-normal timestep distribution. Our experiments demonstrate that RecTable achieves competitive performance compared to several state-of-the-art diffusion and score-based models while reducing the required training time. Our code is available at this https URL.
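
The logit-normal timestep distribution mentioned above can be sketched in a few lines of Python: draw a Gaussian sample and pass it through the logistic sigmoid, so that timesteps land in (0, 1) and concentrate around the middle of the interval. The defaults of mean 0 and standard deviation 1 are assumptions for illustration. The abstract does not state RecTable's settings.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) whose logit is Gaussian with the given mean and std."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
timesteps = [sample_logit_normal(rng) for _ in range(1000)]
```

Shifting `mean` moves the mass toward one end of the interval, and increasing `std` spreads it toward both ends.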

Title: SChanger: Change Detection from a Semantic Change and Spatial Consistency Perspective

Authors: Ziyu Zhou, Keyan Hu, Yutian Fang, Xiaoping Rui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20734
Pdf URL: https://arxiv.org/pdf/2503.20734
Copy Paste: [[2503.20734]] SChanger: Change Detection from a Semantic Change and Spatial Consistency Perspective(https://arxiv.org/abs/2503.20734)
Keywords: extraction
Abstract: Change detection is a key task in Earth observation applications. Recently, deep learning methods have demonstrated strong performance and widespread application. However, change detection faces data scarcity due to the labor-intensive process of accurately aligning remote sensing images of the same area, which limits the performance of deep learning algorithms. To address the data scarcity issue, we develop a fine-tuning strategy called the Semantic Change Network (SCN). We initially pre-train the model on single-temporal supervised tasks to acquire prior knowledge of instance feature extraction. The model then employs a shared-weight Siamese architecture and extended Temporal Fusion Module (TFM) to preserve this prior knowledge and is fine-tuned on change detection tasks. The learned semantics for identifying all instances is changed to focus on identifying only the changes. Meanwhile, we observe that the locations of changes between the two images are spatially identical, a concept we refer to as spatial consistency. We introduce this inductive bias through an attention map that is generated by large-kernel convolutions and applied to the features from both time points. This enhances the modeling of multi-scale changes and helps capture underlying relationships in change detection semantics. We develop a binary change detection model utilizing these two strategies. The model is validated against state-of-the-art methods on six datasets, surpassing all benchmark methods and achieving F1 scores of 92.87%, 86.43%, 68.95%, 97.62%, 84.58%, and 93.20% on the LEVIR-CD, LEVIR-CD+, S2Looking, CDD, SYSU-CD, and WHU-CD datasets, respectively.

Title: High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching

Authors: Guoqiang Zhang, Kenta Niwa, J.P. Lewis, Cedric Mesnage, W. Bastiaan Kleijn
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20744
Pdf URL: https://arxiv.org/pdf/2503.20744
Copy Paste: [[2503.20744]] High Quality Diffusion Distillation on a Single GPU with Relative and Absolute Position Matching(https://arxiv.org/abs/2503.20744)
Keywords: diffusion
Abstract: We introduce relative and absolute position matching (RAPM), a diffusion distillation method resulting in high quality generation that can be trained efficiently on a single GPU. Recent diffusion distillation research has achieved excellent results for high-resolution text-to-image generation with methods such as phased consistency models (PCM) and improved distribution matching distillation (DMD2). However, these methods generally require many GPUs (e.g., 8-64) and significant batch sizes (e.g., 128-2048) during training, resulting in memory and compute requirements that are beyond the resources of some researchers. RAPM provides effective single-GPU diffusion distillation training with a batch size of 1. The new method attempts to mimic the sampling trajectories of the teacher model by matching the relative and absolute positions. The design of relative positions is inspired by PCM. Two discriminators are introduced accordingly in RAPM, one for matching relative positions and the other for absolute positions. Experimental results on StableDiffusion (SD) V1.5 and SDXL indicate that RAPM with 4 timesteps produces FID scores comparable to those of the best method with 1 timestep under very limited computational resources.

Title: MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams

Authors: Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton van den Hengel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20745
Pdf URL: https://arxiv.org/pdf/2503.20745
Copy Paste: [[2503.20745]] MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams(https://arxiv.org/abs/2503.20745)
Keywords: large language model
Abstract: Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Unlike natural images, their inherently symbolic and abstract nature poses significant challenges for Multimodal Large Language Models (MLLMs). However, current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition. To address this gap, we introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. MATHGLANCE comprises 1.2K images and 1.6K carefully curated questions spanning four perception tasks: shape classification, object counting, relationship identification, and object grounding, covering diverse domains including plane geometry, solid geometry, and graphical representations. Our evaluation of MLLMs reveals that their ability to understand diagrams is notably limited, particularly in fine-grained grounding tasks. In response, we construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs explicitly annotated with geometric primitives and precise spatial relationships. Training an MLLM on GeoPeP leads to significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning. Our benchmark and dataset establish critical standards for evaluating and advancing multimodal mathematical understanding, providing valuable resources and insights to foster future MLLM research.

Title: UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

Authors: Chen Tang, Xinzhu Ma, Encheng Su, Xiufeng Song, Xiaohong Liu, Wei-Hong Li, Lei Bai, Wanli Ouyang, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20748
Pdf URL: https://arxiv.org/pdf/2503.20748
Copy Paste: [[2503.20748]] UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines(https://arxiv.org/abs/2503.20748)
Keywords: transformer
Abstract: Traditional spatiotemporal models generally rely on task-specific architectures, which limit their generalizability and scalability across diverse tasks due to domain-specific design requirements. In this paper, we introduce UniSTD, a unified Transformer-based framework for spatiotemporal modeling, which is inspired by advances in recent foundation models with the two-stage pretraining-then-adaptation paradigm. Specifically, our work demonstrates that task-agnostic pretraining on 2D vision and vision-text datasets can build a generalizable model foundation for spatiotemporal learning, followed by specialized joint training on spatiotemporal datasets to enhance task-specific adaptability. To improve the learning capabilities across domains, our framework employs a rank-adaptive mixture-of-expert adaptation by using fractional interpolation to relax the discrete variables so that they can be optimized in the continuous space. Additionally, we introduce a temporal module to incorporate temporal dynamics explicitly. We evaluate our approach on a large-scale dataset covering 10 tasks across 4 disciplines, demonstrating that a unified spatiotemporal model can achieve scalable, cross-task learning and support up to 10 tasks simultaneously within one model while reducing training costs in multi-domain applications. Code will be available at this https URL.

Title: Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework

Authors: Soham Sane
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20750
Pdf URL: https://arxiv.org/pdf/2503.20750
Copy Paste: [[2503.20750]] Optimal Scaling Laws for Efficiency Gains in a Theoretical Transformer-Augmented Sectional MoE Framework(https://arxiv.org/abs/2503.20750)
Keywords: transformer
Abstract: This paper introduces a theoretical framework for a Transformer-augmented, sectional Mixture-of-Experts (MoE) architecture that aims to enhance computational efficiency while preserving model scalability. Unlike conventional MoE models, which route entire token embeddings to selected experts, our approach portions the embedding dimension itself -- assigning segments of each token's representation to dedicated experts. To combat losses in token representation, we utilize a pre-expert transformer layer to recompute attention across tokens and reduce the sequence length dimensionality. We extend our theory by deriving optimal scaling laws that describe a non-linear relationship between the number of experts and factors such as model dimensionality, sequence length, and system overhead. These formulations yield closed-form and numerically-solvable expressions for identifying the optimal expert count under given architectural and hardware constraints. As a result, our framework not only provides theoretical bounds for computing efficiency with varying frameworks but also guides practical design choices for scaling large models effectively. While empirical validation is pending, we present a comprehensive experimental road map to evaluate the framework's efficiency, scalability, and practicality in future work.
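
The idea of portioning the embedding dimension across experts can be illustrated with a toy Python sketch. It assumes an equal-width, contiguous split and stand-in experts written as plain functions. The names `split_into_sections` and `sectional_forward` are hypothetical. The pre-expert transformer layer and the scaling-law analysis described in the abstract are omitted.

```python
def split_into_sections(embedding, num_experts):
    """Cut one token embedding into equal-width contiguous segments."""
    if len(embedding) % num_experts != 0:
        raise ValueError("embedding size must be divisible by num_experts")
    width = len(embedding) // num_experts
    return [embedding[i * width:(i + 1) * width] for i in range(num_experts)]


def sectional_forward(embedding, experts):
    """Send each segment to its own expert and concatenate the outputs."""
    sections = split_into_sections(embedding, len(experts))
    output = []
    for expert, section in zip(experts, sections):
        output.extend(expert(section))
    return output


# Stand-in experts (illustrative only): one doubles its segment, one adds 1.
experts = [lambda s: [2.0 * x for x in s], lambda s: [x + 1.0 for x in s]]
token = [1.0, 2.0, 3.0, 4.0]
output = sectional_forward(token, experts)
```

In this arrangement every expert sees every token but only a slice of its representation, which contrasts with conventional routing where a selected expert sees the full embedding of a subset of tokens.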

Title: Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning

Authors: Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20752
Pdf URL: https://arxiv.org/pdf/2503.20752
Copy Paste: [[2503.20752]] Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning(https://arxiv.org/abs/2503.20752)
Keywords: robust
Abstract: Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods improve VLM reasoning via Chain-of-Thought (CoT) supervised fine-tuning, using meticulously annotated training data to enhance visual reasoning capabilities. However, this training paradigm may lead to overfitting and cognitive rigidity, restricting the model's ability to transfer visual reasoning skills across domains and limiting its real-world applicability. To address these limitations, we propose Reason-RFT, a novel reinforcement fine-tuning framework that significantly enhances generalization capabilities in visual reasoning tasks. Reason-RFT introduces a two-phase training framework for visual reasoning: (1) Supervised Fine-Tuning (SFT) with curated Chain-of-Thought (CoT) data activates the reasoning potential of Vision-Language Models (VLMs), followed by (2) Group Relative Policy Optimization (GRPO)-based reinforcement learning that generates multiple reasoning-response pairs, significantly enhancing generalization in visual reasoning tasks. To evaluate Reason-RFT's visual reasoning capabilities, we reconstructed a comprehensive dataset spanning visual counting, structure perception, and spatial this http URL results demonstrate Reason-RFT's three key advantages: (1) Performance Enhancement: achieving state-of-the-art results across multiple tasks, outperforming most mainstream open-source and proprietary models; (2) Generalization Superiority: consistently maintaining robust performance across diverse tasks and domains, outperforming alternative training paradigms; (3) Data Efficiency: excelling in few-shot learning scenarios while surpassing full-dataset SFT baselines.
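
The group-relative part of GRPO referred to above can be sketched as follows: several responses are sampled for the same prompt, and each response's reward is standardized against the others in its group. This Python sketch assumes scalar rewards and uses the population standard deviation plus a small `eps` for numerical safety. Those two choices, the function name, and the toy rewards are assumptions for illustration, not details taken from Reason-RFT.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy rewards for four sampled responses to one prompt (illustrative only).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group's average receive positive advantages and the rest receive negative ones, so no separate learned value function is needed to form the baseline.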

Title: MindfulLIME: A Stable Solution for Explanations of Machine Learning Models with Enhanced Localization Precision -- A Medical Image Case Study

Authors: Shakiba Rahimiaghdam, Hande Alemdar
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.20758
Pdf URL: https://arxiv.org/pdf/2503.20758
Copy Paste: [[2503.20758]] MindfulLIME: A Stable Solution for Explanations of Machine Learning Models with Enhanced Localization Precision -- A Medical Image Case Study(https://arxiv.org/abs/2503.20758)
Keywords: interpretability, segmentation
Abstract: Ensuring transparency in machine learning decisions is critically important, especially in sensitive sectors such as healthcare, finance, and justice. Despite this, some popular explainable algorithms, such as Local Interpretable Model-agnostic Explanations (LIME), often produce unstable explanations due to the random generation of perturbed samples. Random perturbation introduces small changes or noise to modified instances of the original data, leading to inconsistent explanations. Even slight variations in the generated samples significantly affect the explanations provided by such models, undermining trust and hindering the adoption of interpretable models. To address this challenge, we propose MindfulLIME, a novel algorithm that intelligently generates purposive samples using a graph-based pruning algorithm and uncertainty sampling. MindfulLIME substantially improves the consistency of visual explanations compared to random sampling approaches. Our experimental evaluation, conducted on a widely recognized chest X-ray dataset, confirms MindfulLIME's stability with a 100% success rate in delivering reliable explanations under identical conditions. Additionally, MindfulLIME improves the localization precision of visual explanations by reducing the distance between the generated explanations and the actual local annotations compared to LIME. We also performed comprehensive experiments considering various segmentation algorithms and sample numbers, focusing on stability, quality, and efficiency. The results demonstrate the outstanding performance of MindfulLIME across different segmentation settings, generating fewer high-quality samples within a reasonable processing time. By addressing the stability limitations of LIME in image data, MindfulLIME enhances the trustworthiness and interpretability of machine learning models in specific medical imaging applications, a critical domain.

Title: Reliable algorithm selection for machine learning-guided design

Authors: Clara Fannjiang, Ji Won Park
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2503.20767
Pdf URL: https://arxiv.org/pdf/2503.20767
Copy Paste: [[2503.20767]] Reliable algorithm selection for machine learning-guided design(https://arxiv.org/abs/2503.20767)
Keywords: generative
Abstract: Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task -- for example, to design novel proteins with high binding affinity to a therapeutic target -- one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion -- for example, that at least ten percent of designs' labels exceed a threshold. It does so by combining designs' predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference. The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.
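Illustrative note (not from the paper): the abstract's idea of combining designs' predictions with held-out labeled data can be sketched as a prediction-powered point estimate of the success fraction. This is a minimal sketch under stated assumptions, not the authors' method: the function name `ppi_success_fraction`, its parameters, and the toy numbers are hypothetical, the density ratios are assumed known, and no confidence bound or selection guarantee is computed.

```python
from typing import Sequence


def ppi_success_fraction(
    design_preds: Sequence[float],
    labeled_preds: Sequence[float],
    labeled_labels: Sequence[float],
    density_ratios: Sequence[float],
    threshold: float,
) -> float:
    """Hypothetical helper: estimate the fraction of design labels above `threshold`.

    Assumption: density_ratios[i] is the (known) ratio of the design density
    to the labeled-data density at labeled point i.
    """
    # Fraction of designs whose *predicted* property exceeds the threshold.
    predicted_rate = sum(p > threshold for p in design_preds) / len(design_preds)

    # Rectifier: importance-weighted average prediction error on held-out
    # labeled data, measured on the same "exceeds threshold" indicator.
    rectifier = sum(
        w * ((y > threshold) - (p > threshold))
        for w, y, p in zip(density_ratios, labeled_labels, labeled_preds)
    ) / len(labeled_labels)

    return predicted_rate + rectifier


# Toy usage with made-up numbers: the predictor is over-optimistic on one
# held-out point, so the estimate is corrected downward from 0.5.
estimate = ppi_success_fraction(
    design_preds=[0.9, 0.8, 0.2, 0.1],
    labeled_preds=[0.7, 0.3],
    labeled_labels=[0.4, 0.2],
    density_ratios=[0.5, 1.0],
    threshold=0.5,
)
meets_criterion = estimate >= 0.10  # e.g. "at least ten percent" criterion
```

A full procedure would replace the point estimate with a high-probability bound before comparing it to the success criterion.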

Title: An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy

Authors: Haotian Yang, Zhuoran Wang, Benson Chou, Sophie Xu, Hao Wang, Jingxian Wang, Qizhen Zhang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2503.20768
Pdf URL: https://arxiv.org/pdf/2503.20768
Copy Paste: [[2503.20768]] An Empirical Study of the Impact of Federated Learning on Machine Learning Model Accuracy(https://arxiv.org/abs/2503.20768)
Keywords: federate
Abstract: Federated Learning (FL) enables distributed ML model training on private user data at the global scale. Despite the potential of FL demonstrated in many domains, an in-depth view of its impact on model accuracy remains unclear. In this paper, we investigate, systematically, how this learning paradigm can affect the accuracy of state-of-the-art ML models for a variety of ML tasks. We present an empirical study that involves various data types: text, image, audio, and video, and FL configuration knobs: data distribution, FL scale, client sampling, and local and global computations. Our experiments are conducted in a unified FL framework to achieve high fidelity, with substantial human efforts and resource investments. Based on the results, we perform a quantitative analysis of the impact of FL, and highlight challenging scenarios where applying FL degrades the accuracy of the model drastically and identify cases where the impact is negligible. The detailed and extensive findings can benefit practical deployments and future development of FL.

Title: Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data

Authors: Masoumeh Sharafi, Emma Ollivier, Muhammad Osama Zeeshan, Soufiane Belharbi, Marco Pedersoli, Alessandro Lameiras Koerich, Simon Bacon, Eric Granger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20771
Pdf URL: https://arxiv.org/pdf/2503.20771
Copy Paste: [[2503.20771]] Disentangled Source-Free Personalization for Facial Expression Recognition with Neutral Target Data(https://arxiv.org/abs/2503.20771)
Keywords: privacy
Abstract: Facial Expression Recognition (FER) from videos is a crucial task in various application areas, such as human-computer interaction and health monitoring (e.g., pain, depression, fatigue, and stress). Beyond the challenges of recognizing subtle emotional or health states, the effectiveness of deep FER models is often hindered by the considerable variability of expressions among subjects. Source-free domain adaptation (SFDA) methods are employed to adapt a pre-trained source model using only unlabeled target domain data, thereby avoiding data privacy and storage issues. Typically, SFDA methods adapt to a target domain dataset corresponding to an entire population and assume it includes data from all recognition classes. However, collecting such comprehensive target data can be difficult or even impossible for FER in healthcare applications. In many real-world scenarios, it may be feasible to collect a short neutral control video (displaying only neutral expressions) for target subjects before deployment. These videos can be used to adapt a model to better handle the variability of expressions among subjects. This paper introduces the Disentangled Source-Free Domain Adaptation (DSFDA) method to address the SFDA challenge posed by missing target expression data. DSFDA leverages data from a neutral target control video for end-to-end generation and adaptation of target data with missing non-neutral data. Our method learns to disentangle features related to expressions and identity while generating the missing non-neutral target data, thereby enhancing model accuracy. Additionally, our self-supervision strategy improves model adaptation by reconstructing target images that maintain the same identity and source expression.

Title: Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Authors: Shijie Zhou, Hui Ren, Yijia Weng, Shuwang Zhang, Zhen Wang, Dejia Xu, Zhiwen Fan, Suya You, Zhangyang Wang, Leonidas Guibas, Achuta Kadambi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20776
Pdf URL: https://arxiv.org/pdf/2503.20776
Copy Paste: [[2503.20776]] Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields(https://arxiv.org/abs/2503.20776)
Keywords: segmentation
Abstract: Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality from 2D vision foundation models into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.

Title: BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation

Authors: Yulu Pan, Ce Zhang, Gedas Bertasius
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20781
Pdf URL: https://arxiv.org/pdf/2503.20781
Copy Paste: [[2503.20781]] BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation(https://arxiv.org/abs/2503.20781)
Keywords: fair
Abstract: We present BASKET, a large-scale basketball video dataset for fine-grained skill estimation. BASKET contains 4,477 hours of video capturing 32,232 basketball players from all over the world. Compared to prior skill estimation datasets, our dataset includes a massive number of skilled participants with unprecedented diversity in terms of gender, age, skill level, geographical location, etc. BASKET includes 20 fine-grained basketball skills, challenging modern video recognition models to capture the intricate nuances of player skill through in-depth video analysis. Given a long highlight video (8-10 minutes) of a particular player, the model needs to predict the skill level (e.g., excellent, good, average, fair, poor) for each of the 20 basketball skills. Our empirical analysis reveals that the current state-of-the-art video models struggle with this task, significantly lagging behind the human baseline. We believe that BASKET could be a useful resource for developing new video models with advanced long-range, fine-grained recognition capabilities. In addition, we hope that our dataset will be useful for domain-specific applications such as fair basketball scouting, personalized player development, and many others. Dataset and code are available at this https URL.

Title: Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising

Authors: Yan-Bo Lin, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, Xiaofei Wang, Gedas Bertasius, Lijuan Wang
Subjects: cs.CV, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.20782
Pdf URL: https://arxiv.org/pdf/2503.20782
Copy Paste: [[2503.20782]] Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising(https://arxiv.org/abs/2503.20782)
Keywords: robust
Abstract: In this paper, we introduce zero-shot audio-video editing, a novel task that requires transforming original audio-visual content to align with a specified textual prompt without additional model training. To evaluate this task, we curate a benchmark dataset, AvED-Bench, designed explicitly for zero-shot audio-video editing. AvED-Bench includes 110 videos, each with a 10-second duration, spanning 11 categories from VGGSound. It offers diverse prompts and scenarios that require precise alignment between auditory and visual elements, enabling robust evaluation. We identify limitations in existing zero-shot audio and video editing methods, particularly in synchronization and coherence between modalities, which often result in inconsistent outcomes. To address these challenges, we propose AvED, a zero-shot cross-modal delta denoising framework that leverages audio-video interactions to achieve synchronized and coherent edits. AvED demonstrates superior results on both AvED-Bench and the recent OAVE dataset, validating its generalization capabilities. Results are available at this https URL.

Title: FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks

Authors: Jinwei Li, Huan-ang Gao, Wenyi Li, Haohan Chi, Chenyu Liu, Chenxi Du, Yiqian Liu, Mingju Gao, Guiyu Zhang, Zongzheng Zhang, Li Yi, Yao Yao, Jingwei Zhao, Hongyang Li, Yikai Wang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20784
Pdf URL: https://arxiv.org/pdf/2503.20784
Copy Paste: [[2503.20784]] FB-4D: Spatial-Temporal Coherent Dynamic 3D Content Generation with Feature Banks(https://arxiv.org/abs/2503.20784)
Keywords: robust, diffusion
Abstract: With the rapid advancements in diffusion models and 3D generation techniques, dynamic 3D content generation has become a crucial research area. However, achieving high-fidelity 4D (dynamic 3D) generation with strong spatial-temporal consistency remains a challenging task. Inspired by recent findings that pretrained diffusion features capture rich correspondences, we propose FB-4D, a novel 4D generation framework that integrates a Feature Bank mechanism to enhance both spatial and temporal consistency in generated frames. In FB-4D, we store features extracted from previous frames and fuse them into the process of generating subsequent frames, ensuring consistent characteristics across both time and multiple views. To ensure a compact representation, the Feature Bank is updated by a proposed dynamic merging mechanism. Leveraging this Feature Bank, we demonstrate for the first time that generating additional reference sequences through multiple autoregressive iterations can continuously improve generation performance. Experimental results show that FB-4D significantly outperforms existing methods in terms of rendering quality, spatial-temporal consistency, and robustness. It surpasses all multi-view generation tuning-free approaches by a large margin and achieves performance on par with training-based methods.

Title: Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Authors: Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.20785
Pdf URL: https://arxiv.org/pdf/2503.20785
Copy Paste: [[2503.20785]] Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency(https://arxiv.org/abs/2503.20785)
Keywords: diffusion
Abstract: We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.

Title: Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark

Authors: Sondos Mahmoud Bsharat, Mukul Ranjan, Aidar Myrzakhan, Jiacheng Liu, Bowei Guo, Shengkun Tang, Zhuang Liu, Yuanzhi Li, Zhiqiang Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.20786
Pdf URL: https://arxiv.org/pdf/2503.20786
Copy Paste: [[2503.20786]] Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark(https://arxiv.org/abs/2503.20786)
Keywords: privacy, large language model
Abstract: Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models' ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. The Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our code and data are available at: this https URL.