2025-03-20

Title: Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition

Authors: Seyed Muhammad Hossein Mousavi
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14513
Pdf URL: https://arxiv.org/pdf/2503.14513
Copy Paste: [[2503.14513]] Synthetic Data Generation of Body Motion Data by Neural Gas Network for Emotion Recognition(https://arxiv.org/abs/2503.14513)
Keywords: robust, generative
Abstract: In the domain of emotion recognition using body motion, the primary challenge lies in the scarcity of diverse and generalizable datasets. Automatic emotion recognition uses machine learning and artificial intelligence techniques to recognize a person's emotional state from various data types, such as text, images, sound, and body motion. Body motion poses unique challenges as many factors, such as age, gender, ethnicity, personality, and illness, affect its appearance, leading to a lack of diverse and robust datasets specifically for emotion recognition. To address this, employing Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks (GANs) and Variational Auto Encoders (VAEs), offers potential solutions, though these methods are often complex. This research introduces a novel application of the Neural Gas Network (NGN) algorithm for synthesizing body motion data and optimizing diversity and generation speed. By learning skeletal structure topology, the NGN fits the neurons or gas particles on body joints. Generated gas particles, which form the skeletal structure later on, will be used to synthesize the new body posture. By attaching body postures over frames, the final synthetic body motion appears. We compared our generated dataset against others generated by GANs, VAEs, and another benchmark algorithm, using benchmark metrics such as Fréchet Inception Distance (FID), Diversity, and a few more. Furthermore, we continued evaluation using classification metrics such as accuracy, precision, recall, and a few others. Joint-related features or kinematic parameters were extracted, and the system assessed model performance against unseen data. Our findings demonstrate that the NGN algorithm produces more realistic and emotionally distinct body motion data and does so with more synthesizing speed than existing methods.

Title: Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control

Authors: Hejia Chen, Haoxian Zhang, Shoulong Zhang, Xiaoqiang Liu, Sisi Zhuang, Yuan Zhang, Pengfei Wan, Di Zhang, Shuai Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14517
Pdf URL: https://arxiv.org/pdf/2503.14517
Copy Paste: [[2503.14517]] Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control(https://arxiv.org/abs/2503.14517)
Keywords: diffusion, transformer
Abstract: Speech-driven 3D talking face method should offer both accurate lip synchronization and controllable expressions. Previous methods solely adopt discrete emotion labels to globally control expressions throughout sequences while limiting flexible fine-grained facial control within the spatiotemporal domain. We propose a diffusion-transformer-based 3D talking face generation model, Cafe-Talk, which simultaneously incorporates coarse- and fine-grained multimodal control conditions. Nevertheless, the entanglement of multiple conditions challenges achieving satisfying performance. To disentangle speech audio and fine-grained conditions, we employ a two-stage training pipeline. Specifically, Cafe-Talk is initially trained using only speech audio and coarse-grained conditions. Then, a proposed fine-grained control adapter gradually adds fine-grained instructions represented by action units (AUs), preventing unfavorable speech-lip synchronization. To disentangle coarse- and fine-grained conditions, we design a swap-label training mechanism, which enables the dominance of the fine-grained conditions. We also devise a mask-based CFG technique to regulate the occurrence and intensity of fine-grained control. In addition, a text-based detector is introduced with text-AU alignment to enable natural language user input and further support multimodal control. Extensive experimental results prove that Cafe-Talk achieves state-of-the-art lip synchronization and expressiveness performance and receives wide acceptance in fine-grained control in user studies. Project page: this https URL

Title: ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis

Authors: Yu Fang, Yue Yang, Xinghao Zhu, Kaiyuan Zheng, Gedas Bertasius, Daniel Szafir, Mingyu Ding
Subjects: cs.CV, cs.GR, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14526
Pdf URL: https://arxiv.org/pdf/2503.14526
Copy Paste: [[2503.14526]] ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis(https://arxiv.org/abs/2503.14526)
Keywords: robust
Abstract: Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at: this https URL

Title: SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders

Authors: Qing Li, Jiahui Geng, Derui Zhu, Fengyu Cai, Chenyang Lyu, Fakhri Karray
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14530
Pdf URL: https://arxiv.org/pdf/2503.14530
Copy Paste: [[2503.14530]] SAUCE: Selective Concept Unlearning in Vision-Language Models with Sparse Autoencoders(https://arxiv.org/abs/2503.14530)
Keywords: attack, robust, large language model
Abstract: Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.

Title: PANDORA: Diffusion Policy Learning for Dexterous Robotic Piano Playing

Authors: Yanjia Huang, Renjie Li, Zhengzhong Tu
Subjects: cs.LG, cs.RO, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.14545
Pdf URL: https://arxiv.org/pdf/2503.14545
Copy Paste: [[2503.14545]] PANDORA: Diffusion Policy Learning for Dexterous Robotic Piano Playing(https://arxiv.org/abs/2503.14545)
Keywords: diffusion, large language model
Abstract: We present PANDORA, a novel diffusion-based policy learning framework designed specifically for dexterous robotic piano performance. Our approach employs a conditional U-Net architecture enhanced with FiLM-based global conditioning, which iteratively denoises noisy action sequences into smooth, high-dimensional trajectories. To achieve precise key execution coupled with expressive musical performance, we design a composite reward function that integrates task-specific accuracy, audio fidelity, and high-level semantic feedback from a large language model (LLM) oracle. The LLM oracle assesses musical expressiveness and stylistic nuances, enabling dynamic, hand-specific reward adjustments. Further augmented by a residual inverse-kinematics refinement policy, PANDORA achieves state-of-the-art performance in the ROBOPIANIST environment, significantly outperforming baselines in both precision and expressiveness. Ablation studies validate the critical contributions of diffusion-based denoising and LLM-driven semantic feedback in enhancing robotic musicianship. Videos available at: this https URL

Title: Sampling Decisions

Authors: Michael Chertkov, Sungsoo Ahn, Hamidreza Behjoo
Subjects: cs.LG, cond-mat.stat-mech, cs.AI, eess.SY, stat.ML
Abstract URL: https://arxiv.org/abs/2503.14549
Pdf URL: https://arxiv.org/pdf/2503.14549
Copy Paste: [[2503.14549]] Sampling Decisions(https://arxiv.org/abs/2503.14549)
Keywords: diffusion, generative
Abstract: In this manuscript we introduce a novel Decision Flow (DF) framework for sampling from a target distribution while incorporating additional guidance from a prior sampler. DF can be viewed as an AI driven algorithmic reincarnation of the Markov Decision Process (MDP) approach in Stochastic Optimal Control. It extends the continuous space, continuous time path Integral Diffusion sampling technique to discrete time and space, while also generalizing the Generative Flow Network framework. In its most basic form, an explicit, Neural Network (NN) free formulation, DF leverages the linear solvability of the the underlying MDP to adjust the transition probabilities of the prior sampler. The resulting Markov Process is expressed as a convolution of the reverse time Green's function of the prior sampling with the target distribution. We illustrate the DF framework through an example of sampling from the Ising model, discuss potential NN based extensions, and outline how DF can enhance guided sampling across various applications.

Title: Fire and Smoke Datasets in 20 Years: An In-depth Review

Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Fatemeh Afghah, Connor Peter McGrath, Danish Bhatkar, Mithilesh Anil Biradar, Abolfazl Razi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14552
Pdf URL: https://arxiv.org/pdf/2503.14552
Copy Paste: [[2503.14552]] Fire and Smoke Datasets in 20 Years: An In-depth Review(https://arxiv.org/abs/2503.14552)
Keywords: segmentation
Abstract: Fire and smoke phenomena pose a significant threat to the natural environment, ecosystems, and global economy, as well as human lives and wildlife. In this particular circumstance, there is a demand for more sophisticated and advanced technologies to implement an effective strategy for early detection, real-time monitoring, and minimizing the overall impacts of fires on ecological balance and public safety. Recently, the rapid advancement of Artificial Intelligence (AI) and Computer Vision (CV) frameworks has substantially revolutionized the momentum for developing efficient fire management systems. However, these systems extensively rely on the availability of adequate and high-quality fire and smoke data to create proficient Machine Learning (ML) methods for various tasks, such as detection and monitoring. Although fire and smoke datasets play a critical role in training, evaluating, and testing advanced Deep Learning (DL) models, a comprehensive review of the existing datasets is still unexplored. For this purpose, we provide an in-depth review to systematically analyze and evaluate fire and smoke datasets collected over the past 20 years. We investigate the characteristics of each dataset, including type, size, format, collection methods, and geographical diversities. We also review and highlight the unique features of each dataset, such as imaging modalities (RGB, thermal, infrared) and their applicability for different fire management tasks (classification, segmentation, detection). Furthermore, we summarize the strengths and weaknesses of each dataset and discuss their potential for advancing research and technology in fire management. Ultimately, we conduct extensive experimental analyses across different datasets using several state-of-the-art algorithms, such as ResNet-50, DeepLab-V3, and YoloV8.

Title: Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions

Authors: Kasra Borazjani, Payam Abdisarabshali, Naji Khosravan, Seyyedali Hosseinalipour
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14553
Pdf URL: https://arxiv.org/pdf/2503.14553
Copy Paste: [[2503.14553]] Redefining non-IID Data in Federated Learning for Computer Vision Tasks: Migrating from Labels to Embeddings for Task-Specific Data Distributions(https://arxiv.org/abs/2503.14553)
Keywords: federate
Abstract: Federated Learning (FL) represents a paradigm shift in distributed machine learning (ML), enabling clients to train models collaboratively while keeping their raw data private. This paradigm shift from traditional centralized ML introduces challenges due to the non-iid (non-independent and identically distributed) nature of data across clients, significantly impacting FL's performance. Existing literature, predominantly model data heterogeneity by imposing label distribution skew across clients. In this paper, we show that label distribution skew fails to fully capture the real-world data heterogeneity among clients in computer vision tasks beyond classification. Subsequently, we demonstrate that current approaches overestimate FL's performance by relying on label/class distribution skew, exposing an overlooked gap in the literature. By utilizing pre-trained deep neural networks to extract task-specific data embeddings, we define task-specific data heterogeneity through the lens of each vision task and introduce a new level of data heterogeneity called embedding-based data heterogeneity. Our methodology involves clustering data points based on embeddings and distributing them among clients using the Dirichlet distribution. Through extensive experiments, we evaluate the performance of different FL methods under our revamped notion of data heterogeneity, introducing new benchmark performance measures to the literature. We further unveil a series of open research directions that can be pursued.

Title: SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization

Authors: Yi Du, Zhipeng Zhao, Shaoshu Su, Sharath Golluri, Haoze Zheng, Runmao Yao, Chen Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14558
Pdf URL: https://arxiv.org/pdf/2503.14558
Copy Paste: [[2503.14558]] SuperPC: A Single Diffusion Model for Point Cloud Completion, Upsampling, Denoising, and Colorization(https://arxiv.org/abs/2503.14558)
Keywords: diffusion
Abstract: Point cloud (PC) processing tasks-such as completion, upsampling, denoising, and colorization-are crucial in applications like autonomous driving and 3D reconstruction. Despite substantial advancements, prior approaches often address each of these tasks independently, with separate models focused on individual issues. However, this isolated approach fails to account for the fact that defects like incompleteness, low resolution, noise, and lack of color frequently coexist, with each defect influencing and correlating with the others. Simply applying these models sequentially can lead to error accumulation from each model, along with increased computational costs. To address these challenges, we introduce SuperPC, the first unified diffusion model capable of concurrently handling all four tasks. Our approach employs a three-level-conditioned diffusion framework, enhanced by a novel spatial-mix-fusion strategy, to leverage the correlations among these four defects for simultaneous, efficient processing. We show that SuperPC outperforms the state-of-the-art specialized models as well as their combination on all four individual tasks.

Title: SpecReX: Explainable AI for Raman Spectroscopy

Authors: Nathan Blake, David A. Kelly, Akchunya Chanchal, Sarah Kapllani-Mucaj, Geraint Thomas, Hana Chockler
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14567
Pdf URL: https://arxiv.org/pdf/2503.14567
Copy Paste: [[2503.14567]] SpecReX: Explainable AI for Raman Spectroscopy(https://arxiv.org/abs/2503.14567)
Keywords: explainability
Abstract: Raman spectroscopy is becoming more common for medical diagnostics with deep learning models being increasingly used to leverage its full potential. However, the opaque nature of such models and the sensitivity of medical diagnosis together with regulatory requirements necessitate the need for explainable AI tools. We introduce SpecReX, specifically adapted to explaining Raman spectra. SpecReX uses the theory of actual causality to rank causal responsibility in a spectrum, quantified by iteratively refining mutated versions of the spectrum and testing if it retains the original classification. The explanations provided by SpecReX take the form of a responsibility map, highlighting spectral regions most responsible for the model to make a correct classification. To assess the validity of SpecReX, we create increasingly complex simulated spectra, in which a "ground truth" signal is seeded, to train a classifier. We then obtain SpecReX explanations and compare the results with another explainability tool. By using simulated spectra we establish that SpecReX localizes to the known differences between classes, under a number of conditions. This provides a foundation on which we can find the spectral features which differentiate disease classes. This is an important first step in proving the validity of SpecReX.

Title: Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance

Authors: Liya Guo, Zun Wang, Chang Liu, Junzhe Li, Pipi Hu, Yi Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14569
Pdf URL: https://arxiv.org/pdf/2503.14569
Copy Paste: [[2503.14569]] Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance(https://arxiv.org/abs/2503.14569)
Keywords: diffusion, generative
Abstract: The ensemble average of physical properties of molecules is closely related to the distribution of molecular conformations, and sampling such distributions is a fundamental challenge in physics and chemistry. Traditional methods like molecular dynamics (MD) simulations and Markov chain Monte Carlo (MCMC) sampling are commonly used but can be time-consuming and costly. Recently, diffusion models have emerged as efficient alternatives by learning the distribution of training data. Obtaining an unbiased target distribution is still an expensive task, primarily because it requires satisfying ergodicity. To tackle these challenges, we propose Potential Score Matching (PSM), an approach that utilizes the potential energy gradient to guide generative models. PSM does not require exact energy functions and can debias sample distributions even when trained on limited and biased data. Our method outperforms existing state-of-the-art (SOTA) models on the Lennard-Jones (LJ) potential, a commonly used toy model. Furthermore, we extend the evaluation of PSM to high-dimensional problems using the MD17 and MD22 datasets. The results demonstrate that molecular distributions generated by PSM more closely approximate the Boltzmann distribution compared to traditional diffusion models.

Title: Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation

Authors: Justus Westerhoff, Golzar Atefi, Mario Koddenbrock, Alexei Figueroa, Alexander Löser, Erik Rodner, Felix A. Gers
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14572
Pdf URL: https://arxiv.org/pdf/2503.14572
Copy Paste: [[2503.14572]] Robust Weight Imprinting: Insights from Neural Collapse and Proxy-Based Aggregation(https://arxiv.org/abs/2503.14572)
Keywords: robust
Abstract: The capacity of a foundation model allows for adaptation to new downstream tasks. Weight imprinting is a universal and efficient method to fulfill this purpose. It has been reinvented several times, but it has not been systematically studied. In this paper, we propose a framework for imprinting, identifying three main components: generation, normalization, and aggregation. This allows us to conduct an in-depth analysis of imprinting and a comparison of the existing work. We reveal the benefits of representing novel data with multiple proxies in the generation step and show the importance of proper normalization. We determine those proxies through clustering and propose a novel variant of imprinting that outperforms previous work. We motivate this by the neural collapse phenomenon -- an important connection that we can draw for the first time. Our results show an increase of up to 4% in challenging scenarios with complex data distributions for new classes.

Title: Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM

Authors: Yazeed Alnumay, Alexandre Barbet, Anna Bialas, William Darling, Shaan Desai, Joan Devassy, Kyle Duffy, Stephanie Howe, Olivia Lasche, Justin Lee, Anirudh Shrinivason, Jennifer Tracey
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14603
Pdf URL: https://arxiv.org/pdf/2503.14603
Copy Paste: [[2503.14603]] Command R7B Arabic: A Small, Enterprise Focused, Multilingual, and Culturally Aware Arabic LLM(https://arxiv.org/abs/2503.14603)
Keywords: large language model
Abstract: Building high-quality large language models (LLMs) for enterprise Arabic applications remains challenging due to the limited availability of digitized Arabic data. In this work, we present a data synthesis and refinement strategy to help address this problem, namely, by leveraging synthetic data generation and human-in-the-loop annotation to expand our Arabic training corpus. We further present our iterative post training recipe that is essential to achieving state-of-the-art performance in aligning the model with human preferences, a critical aspect to enterprise use cases. The culmination of this effort is the release of a small, 7B, open-weight model that outperforms similarly sized peers in head-to-head comparisons and on Arabic-focused benchmarks covering cultural knowledge, instruction following, RAG, and contextual faithfulness.

Title: Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

Authors: Sara Sarto, Marcella Cornia, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14604
Pdf URL: https://arxiv.org/pdf/2503.14604
Copy Paste: [[2503.14604]] Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives(https://arxiv.org/abs/2503.14604)
Keywords: robust, large language model
Abstract: The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.

Title: Transparent Attested DNS for Confidential Computing Services

Authors: Antoine Delignat-Lavaud, Cédric Fournet, Kapil Vaswani, Manuel Costa, Sylvan Clebsch, Christoph M. Wintersteiger
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2503.14611
Pdf URL: https://arxiv.org/pdf/2503.14611
Copy Paste: [[2503.14611]] Transparent Attested DNS for Confidential Computing Services(https://arxiv.org/abs/2503.14611)
Keywords: secure, security, protect, attack
Abstract: Confidential services running in hardware-protected Trusted Execution Environments (TEEs) can provide higher security assurance, but this requires custom clients and protocols to distribute, update, and verify their attestation evidence. Compared with classic Internet security, built upon universal abstractions such as domain names, origins, and certificates, this puts a significant burden on service users and providers. In particular, Web browsers and other legacy clients do not get the same security guaranties as custom clients. We present a new approach for users to establish trust in confidential services. We propose attested DNS (aDNS): a name service that securely binds the attested implementation of confidential services to their domain names. ADNS enforces policies for all names in its zone of authority: any TEE that runs a service must present hardware attestation that complies with the domain-specific policy before registering keys and obtaining certificates for any name in this domain. ADNS provides protocols for zone delegation, TEE registration, and certificate issuance. ADNS builds on standards such as DNSSEC, DANE, ACME and Certificate Transparency. ADNS provides DNS transparency by keeping all records, policies, and attestations in a public append-only log, thereby enabling auditing and preventing targeted attacks. We implement aDNS as a confidential service using a fault-tolerant network of TEEs. We evaluate it using sample confidential services that illustrate various TEE platforms. On the client side, we provide a generic browser extension that queries and verifies attestation records before opening TLS connections, with negligible performance overhead, and we show that, with aDNS, even legacy Web clients benefit from confidential computing as long as some enlightened clients verify attestations to deter or blame malicious actors.

Title: Unique Hard Attention: A Tale of Two Sides

Authors: Selim Jerad, Anej Svete, Jiaoda Li, Ryan Cotterell
Subjects: cs.LG, cs.CC, cs.CL, cs.FL
Abstract URL: https://arxiv.org/abs/2503.14615
Pdf URL: https://arxiv.org/pdf/2503.14615
Copy Paste: [[2503.14615]] Unique Hard Attention: A Tale of Two Sides(https://arxiv.org/abs/2503.14615)
Keywords: transformer
Abstract: Understanding the expressive power of transformers has recently attracted attention, as it offers insights into their abilities and limitations. Many studies analyze unique hard attention transformers, where attention selects a single position that maximizes the attention scores. When multiple positions achieve the maximum score, either the rightmost or the leftmost of those is chosen. In this paper, we highlight the importance of this seeming triviality. Recently, finite-precision transformers with both leftmost- and rightmost-hard attention were shown to be equivalent to Linear Temporal Logic (LTL). We show that this no longer holds with only leftmost-hard attention -- in that case, they correspond to a \emph{strictly weaker} fragment of LTL. Furthermore, we show that models with leftmost-hard attention are equivalent to \emph{soft} attention, suggesting they may better approximate real-world transformers than right-attention models. These findings refine the landscape of transformer expressivity and underscore the role of attention directionality.

Title: Anomaly-Flow: A Multi-domain Federated Generative Adversarial Network for Distributed Denial-of-Service Detection

Authors: Leonardo Henrique de Melo, Gustavo de Carvalho Bertoli, Michele Nogueira, Aldri Luiz dos Santos, Lourenço Alves Pereira Junior
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14618
Pdf URL: https://arxiv.org/pdf/2503.14618
Copy Paste: [[2503.14618]] Anomaly-Flow: A Multi-domain Federated Generative Adversarial Network for Distributed Denial-of-Service Detection(https://arxiv.org/abs/2503.14618)
Keywords: security, privacy, defense, attack, federate, generative
Abstract: Distributed denial-of-service (DDoS) attacks remain a critical threat to Internet services, causing costly disruptions. While machine learning (ML) has shown promise in DDoS detection, current solutions struggle with multi-domain environments where attacks must be detected across heterogeneous networks and organizational boundaries. This limitation severely impacts the practical deployment of ML-based defenses in real-world settings. This paper introduces Anomaly-Flow, a novel framework that addresses this critical gap by combining Federated Learning (FL) with Generative Adversarial Networks (GANs) for privacy-preserving, multi-domain DDoS detection. Our proposal enables collaborative learning across diverse network domains while preserving data privacy through synthetic flow generation. Through extensive evaluation across three distinct network datasets, Anomaly-Flow achieves an average F1-score of $0.747$, outperforming baseline models. Importantly, our framework enables organizations to share attack detection capabilities without exposing sensitive network data, making it particularly valuable for critical infrastructure and privacy-sensitive sectors. Beyond immediate technical contributions, this work provides insights into the challenges and opportunities in multi-domain DDoS detection, establishing a foundation for future research in collaborative network defense systems. Our findings have important implications for academic research and industry practitioners working to deploy practical ML-based security solutions.

Title: Retrieval-Augmented Simulacra: Generative Agents for Up-to-date and Knowledge-Adaptive Simulations

Authors: Hikaru Shimadzu, Takehito Utsuro, Daisuke Kitayama
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14620
Pdf URL: https://arxiv.org/pdf/2503.14620
Copy Paste: [[2503.14620]] Retrieval-Augmented Simulacra: Generative Agents for Up-to-date and Knowledge-Adaptive Simulations(https://arxiv.org/abs/2503.14620)
Keywords: generative
Abstract: In the 2023 edition of the White Paper on Information and Communications, it is estimated that the population of social networking services in Japan will exceed 100 million by 2022, and the influence of social networking services in Japan is growing significantly. In addition, marketing using SNS and research on the propagation of emotions and information on SNS are being actively conducted, creating the need for a system for predicting trends in SNS interactions. We have already created a system that simulates the behavior of various communities on SNS by building a virtual SNS environment in which agents post and reply to each other in a chat community created by agents using a LLMs. In this paper, we evaluate the impact of the search extension generation mechanism used to create posts and replies in a virtual SNS environment using a simulation system on the ability to generate posts and replies. As a result of the evaluation, we confirmed that the proposed search extension generation mechanism, which mimics human search behavior, generates the most natural exchange.

Title: Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer

Authors: Yi Liao, Yongsheng Gao, Weichuan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14640
Pdf URL: https://arxiv.org/pdf/2503.14640
Copy Paste: [[2503.14640]] Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer(https://arxiv.org/abs/2503.14640)
Keywords: transformer
Abstract: Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods can not display the attention flow hidden inside the inner structure of ViT models, which explains how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through ViT networks. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module can also obtain the channel importance coefficients by decomposing the classification score for supervised ViT models. Because of the lack of classification score in self-supervised ViT models, we propose dimension-wise importance weights to compute the channel importance coefficients. Such spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wisely accumulating each attention map. The contribution of this work focuses on visualizing the evolution dynamic of the decision-making attention for any intermediate block inside a ViT model by proposing a novel decomposition module and dimension-wise importance weights. The quantitative and qualitative analysis consistently validate the effectiveness and superior capacity of the proposed DAAM for not only interpreting ViT models with the fully-connected layers as the classifier but also self-supervised ViT models. The code is available at this https URL.

Title: A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising

Authors: Jonas Dornbusch, Emanuel Pfarr, Florin-Alexandru Vasluianu, Frank Werner, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14654
Pdf URL: https://arxiv.org/pdf/2503.14654
Copy Paste: [[2503.14654]] A Simple Combination of Diffusion Models for Better Quality Trade-Offs in Image Denoising(https://arxiv.org/abs/2503.14654)
Keywords: diffusion, generative
Abstract: Diffusion models have garnered considerable interest in computer vision, owing both to their capacity to synthesize photorealistic images and to their proven effectiveness in image reconstruction tasks. However, existing approaches fail to efficiently balance the high visual quality of diffusion models with the low distortion achieved by previous image reconstruction methods. Specifically, for the fundamental task of additive Gaussian noise removal, we first illustrate an intuitive method for leveraging pretrained diffusion models. Further, we introduce our proposed Linear Combination Diffusion Denoiser (LCDD), which unifies two complementary inference procedures - one that leverages the model's generative potential and another that ensures faithful signal recovery. By exploiting the inherent structure of the denoising samples, LCDD achieves state-of-the-art performance and offers controlled, well-behaved trade-offs through a simple scalar hyperparameter adjustment.

Title: Sepsyn-OLCP: An Online Learning-based Framework for Early Sepsis Prediction with Uncertainty Quantification using Conformal Prediction

Authors: Anni Zhou, Beyah Raheem, Rishikesan Kamaleswaran, Yao Xie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14663
Pdf URL: https://arxiv.org/pdf/2503.14663
Copy Paste: [[2503.14663]] Sepsyn-OLCP: An Online Learning-based Framework for Early Sepsis Prediction with Uncertainty Quantification using Conformal Prediction(https://arxiv.org/abs/2503.14663)
Keywords: robust
Abstract: Sepsis is a life-threatening syndrome with high morbidity and mortality in hospitals. Early prediction of sepsis plays a crucial role in facilitating early interventions for septic patients. However, early sepsis prediction systems with uncertainty quantification and adaptive learning are scarce. This paper proposes Sepsyn-OLCP, a novel online learning algorithm for early sepsis prediction by integrating conformal prediction for uncertainty quantification and Bayesian bandits for adaptive decision-making. By combining the robustness of Bayesian models with the statistical uncertainty guarantees of conformal prediction methodologies, this algorithm delivers accurate and trustworthy predictions, addressing the critical need for reliable and adaptive systems in high-stakes healthcare applications such as early sepsis prediction. We evaluate the performance of Sepsyn-OLCP in terms of regret in stochastic bandit setting, the area under the receiver operating characteristic curve (AUROC), and F-measure. Our results show that Sepsyn-OLCP outperforms existing individual models, increasing AUROC of a neural network from 0.64 to 0.73 without retraining and high computational costs. And the model selection policy converges to the optimal strategy in the long run. We propose a novel reinforcement learning-based framework integrated with conformal prediction techniques to provide uncertainty quantification for early sepsis prediction. The proposed methodology delivers accurate and trustworthy predictions, addressing a critical need in high-stakes healthcare applications like early sepsis prediction.

Title: These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models

Authors: Parker Ewen, Hao Chen, Seth Isaacson, Joey Wilson, Katherine A. Skinner, Ram Vasudevan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14665
Pdf URL: https://arxiv.org/pdf/2503.14665
Copy Paste: [[2503.14665]] These Magic Moments: Differentiable Uncertainty Quantification of Radiance Field Models(https://arxiv.org/abs/2503.14665)
Keywords: robust
Abstract: This paper introduces a novel approach to uncertainty quantification for radiance fields by leveraging higher-order moments of the rendering equation. Uncertainty quantification is crucial for downstream tasks including view planning and scene understanding, where safety and robustness are paramount. However, the high dimensionality and complexity of radiance fields pose significant challenges for uncertainty quantification, limiting the use of these uncertainty quantification methods in high-speed decision-making. We demonstrate that the probabilistic nature of the rendering process enables efficient and differentiable computation of higher-order moments for radiance field outputs, including color, depth, and semantic predictions. Our method outperforms existing radiance field uncertainty estimation techniques while offering a more direct, computationally efficient, and differentiable formulation without the need for this http URL uncertainty quantification, we also illustrate the utility of our approach in downstream applications such as next-best-view (NBV) selection and active ray sampling for neural radiance field training. Extensive experiments on synthetic and real-world scenes confirm the efficacy of our approach, which achieves state-of-the-art performance while maintaining simplicity.

Title: Generating Medically-Informed Explanations for Depression Detection using LLMs

Authors: Xiangyong Chen, Xiaochuan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14671
Pdf URL: https://arxiv.org/pdf/2503.14671
Copy Paste: [[2503.14671]] Generating Medically-Informed Explanations for Depression Detection using LLMs(https://arxiv.org/abs/2503.14671)
Keywords: interpretability, explainability, large language model
Abstract: Early detection of depression from social media data offers a valuable opportunity for timely intervention. However, this task poses significant challenges, requiring both professional medical knowledge and the development of accurate and explainable models. In this paper, we propose LLM-MTD (Large Language Model for Multi-Task Depression Detection), a novel approach that leverages a pre-trained large language model to simultaneously classify social media posts for depression and generate textual explanations grounded in medical diagnostic criteria. We train our model using a multi-task learning framework with a combined loss function that optimizes both classification accuracy and explanation quality. We evaluate LLM-MTD on the benchmark Reddit Self-Reported Depression Dataset (RSDD) and compare its performance against several competitive baseline methods, including traditional machine learning and fine-tuned BERT. Our experimental results demonstrate that LLM-MTD achieves state-of-the-art performance in depression detection, showing significant improvements in AUPRC and other key metrics. Furthermore, human evaluation of the generated explanations reveals their relevance, completeness, and medical accuracy, highlighting the enhanced interpretability of our approach. This work contributes a novel methodology for depression detection that combines the power of large language models with the crucial aspect of explainability.

Title: DPImageBench: A Unified Benchmark for Differentially Private Image Synthesis

Authors: Chen Gong, Kecen Li, Zinan Lin, Tianhao Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14681
Pdf URL: https://arxiv.org/pdf/2503.14681
Copy Paste: [[2503.14681]] DPImageBench: A Unified Benchmark for Differentially Private Image Synthesis(https://arxiv.org/abs/2503.14681)
Keywords: privacy, protect
Abstract: Differentially private (DP) image synthesis aims to generate artificial images that retain the properties of sensitive images while protecting the privacy of individual images within the dataset. Despite recent advancements, we find that inconsistent--and sometimes flawed--evaluation protocols have been applied across studies. This not only impedes the understanding of current methods but also hinders future advancements. To address the issue, this paper introduces DPImageBench for DP image synthesis, with thoughtful design across several dimensions: (1) Methods. We study eleven prominent methods and systematically characterize each based on model architecture, pretraining strategy, and privacy mechanism. (2) Evaluation. We include nine datasets and seven fidelity and utility metrics to thoroughly assess them. Notably, we find that a common practice of selecting downstream classifiers based on the highest accuracy on the sensitive test set not only violates DP but also overestimates the utility scores. DPImageBench corrects for these mistakes. (3) Platform. Despite the methods and evaluation protocols, DPImageBench provides a standardized interface that accommodates current and future implementations within a unified framework. With DPImageBench, we have several noteworthy findings. For example, contrary to the common wisdom that pretraining on public image datasets is usually beneficial, we find that the distributional similarity between pretraining and sensitive images significantly impacts the performance of the synthetic images and does not always yield improvements. In addition, adding noise to low-dimensional features, such as the high-level characteristics of sensitive images, is less affected by the privacy budget compared to adding noise to high-dimensional features, like weight gradients. The former methods perform better than the latter under a low privacy budget.

Title: HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Authors: Rui Yang, Lin Song, Yicheng Xiao, Runhui Huang, Yixiao Ge, Ying Shan, Hengshuang Zhao
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14694
Pdf URL: https://arxiv.org/pdf/2503.14694
Copy Paste: [[2503.14694]] HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding(https://arxiv.org/abs/2503.14694)
Keywords: transformer, large language model
Abstract: Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.

Title: SplatVoxel: History-Aware Novel View Streaming without Temporal Training

Authors: Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, Tiancheng Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14698
Pdf URL: https://arxiv.org/pdf/2503.14698
Copy Paste: [[2503.14698]] SplatVoxel: History-Aware Novel View Streaming without Temporal Training(https://arxiv.org/abs/2503.14698)
Keywords: transformer
Abstract: We study the problem of novel view streaming from sparse-view videos, which aims to generate a continuous sequence of high-quality, temporally consistent novel views as new input frames arrive. However, existing novel view synthesis methods struggle with temporal coherence and visual fidelity, leading to flickering and inconsistency. To address these challenges, we introduce history-awareness, leveraging previous frames to reconstruct the scene and improve quality and stability. We propose a hybrid splat-voxel feed-forward scene reconstruction approach that combines Gaussian Splatting to propagate information over time, with a hierarchical voxel grid for temporal fusion. Gaussian primitives are efficiently warped over time using a motion graph that extends 2D tracking models to 3D motion, while a sparse voxel transformer integrates new temporal observations in an error-aware manner. Crucially, our method does not require training on multi-view video datasets, which are currently limited in size and diversity, and can be directly applied to sparse-view video streams in a history-aware manner at inference time. Our approach achieves state-of-the-art performance in both static and streaming scene reconstruction, effectively reducing temporal artifacts and visual artifacts while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU. Project Page: this https URL

Title: ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints

Authors: Vihaan Misra, Peter Schaldenbrand, Jean Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14720
Pdf URL: https://arxiv.org/pdf/2503.14720
Copy Paste: [[2503.14720]] ShapeShift: Towards Text-to-Shape Arrangement Synthesis with Content-Aware Geometric Constraints(https://arxiv.org/abs/2503.14720)
Keywords: diffusion
Abstract: While diffusion-based models excel at generating photorealistic images from text, a more nuanced challenge emerges when constrained to using only a fixed set of rigid shapes, akin to solving tangram puzzles or arranging real-world objects to match semantic descriptions. We formalize this problem as shape-based image generation, a new text-guided image-to-image translation task that requires rearranging the input set of rigid shapes into non-overlapping configurations and visually communicating the target concept. Unlike pixel-manipulation approaches, our method, ShapeShift, explicitly parameterizes each shape within a differentiable vector graphics pipeline, iteratively optimizing placement and orientation through score distillation sampling from pretrained diffusion models. To preserve arrangement clarity, we introduce a content-aware collision resolution mechanism that applies minimal semantically coherent adjustments when overlaps occur, ensuring smooth convergence toward physically valid configurations. By bridging diffusion-based semantic guidance with explicit geometric constraints, our approach yields interpretable compositions where spatial relationships clearly embody the textual prompt. Extensive experiments demonstrate compelling results across diverse scenarios, with quantitative and qualitative advantages over alternative techniques.

Title: Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence

Authors: Sophia Hager, David Mueller, Kevin Duh, Nicholas Andrews
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14749
Pdf URL: https://arxiv.org/pdf/2503.14749
Copy Paste: [[2503.14749]] Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence(https://arxiv.org/abs/2503.14749)
Keywords: large language model
Abstract: As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We demonstrate our method yields verbalized confidences that correlate with observed error rates with a small fine-tuned language model as well as with larger instruction-tuned models, and find that our semantic uncertainty correlates well with lexical uncertainty on short answers.

Title: LipShiFT: A Certifiably Robust Shift-based Vision Transformer

Authors: Rohan Menon, Nicola Franco, Stephan Günnemann
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14751
Pdf URL: https://arxiv.org/pdf/2503.14751
Copy Paste: [[2503.14751]] LipShiFT: A Certifiably Robust Shift-based Vision Transformer(https://arxiv.org/abs/2503.14751)
Keywords: robust, transformer
Abstract: Deriving tight Lipschitz bounds for transformer-based architectures presents a significant challenge. The large input sizes and high-dimensional attention modules typically prove to be crucial bottlenecks during the training process and leads to sub-optimal results. Our research highlights practical constraints of these methods in vision tasks. We find that Lipschitz-based margin training acts as a strong regularizer while restricting weights in successive layers of the model. Focusing on a Lipschitz continuous variant of the ShiftViT model, we address significant training challenges for transformer-based architectures under norm-constrained input setting. We provide an upper bound estimate for the Lipschitz constants of this model using the $l_2$ norm on common image classification datasets. Ultimately, we demonstrate that our method scales to larger models and advances the state-of-the-art in certified robustness for transformer-based architectures.

Title: Revisiting Image Fusion for Multi-Illuminant White-Balance Correction

Authors: David Serrano-Lozano, Aditya Arora, Luis Herranz, Konstantinos G. Derpanis, Michael S. Brown, Javier Vazquez-Corral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14774
Pdf URL: https://arxiv.org/pdf/2503.14774
Copy Paste: [[2503.14774]] Revisiting Image Fusion for Multi-Illuminant White-Balance Correction(https://arxiv.org/abs/2503.14774)
Keywords: transformer
Abstract: White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100\% improvement over existing techniques on our new multi-illuminant image fusion dataset.

Title: RAT: Boosting Misclassification Detection Ability without Extra Data

Authors: Ge Yan, Tsui-Wei Weng
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14783
Pdf URL: https://arxiv.org/pdf/2503.14783
Copy Paste: [[2503.14783]] RAT: Boosting Misclassification Detection Ability without Extra Data(https://arxiv.org/abs/2503.14783)
Keywords: robust
Abstract: As deep neural networks(DNN) become increasingly prevalent, particularly in high-stakes areas such as autonomous driving and healthcare, the ability to detect incorrect predictions of models and intervene accordingly becomes crucial for safety. In this work, we investigate the detection of misclassified inputs for image classification models from the lens of adversarial perturbation: we propose to use robust radius (a.k.a. input-space margin) as a confidence metric and design two efficient estimation algorithms, RR-BS and RR-Fast, for misclassification detection. Furthermore, we design a training method called Radius Aware Training (RAT) to boost models' ability to identify mistakes. Extensive experiments show our method could achieve up to 29.3% reduction on AURC and 21.62% reduction in FPR@95TPR, compared with previous methods.

Title: SEEK: Self-adaptive Explainable Kernel For Nonstationary Gaussian Processes

Authors: Nima Negarandeh, Carlos Mora, Ramin Bostanabad
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14785
Pdf URL: https://arxiv.org/pdf/2503.14785
Copy Paste: [[2503.14785]] SEEK: Self-adaptive Explainable Kernel For Nonstationary Gaussian Processes(https://arxiv.org/abs/2503.14785)
Keywords: robust, interpretability
Abstract: Gaussian processes (GPs) are powerful probabilistic models that define flexible priors over functions, offering strong interpretability and uncertainty quantification. However, GP models often rely on simple, stationary kernels which can lead to suboptimal predictions and miscalibrated uncertainty estimates, especially in nonstationary real-world applications. In this paper, we introduce SEEK, a novel class of learnable kernels to model complex, nonstationary functions via GPs. Inspired by artificial neurons, SEEK is derived from first principles to ensure symmetry and positive semi-definiteness, key properties of valid kernels. The proposed method achieves flexible and adaptive nonstationarity by learning a mapping from a set of base kernels. Compared to existing techniques, our approach is more interpretable and much less prone to overfitting. We conduct comprehensive sensitivity analyses and comparative studies to demonstrate that our approach is not robust to only many of its design choices, but also outperforms existing stationary/nonstationary kernels in both mean prediction accuracy and uncertainty quantification.

Title: MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

Authors: Chejian Xu, Jiawei Zhang, Zhaorun Chen, Chulin Xie, Mintong Kang, Yujin Potter, Zhun Wang, Zhuowen Yuan, Alexander Xiong, Zidi Xiong, Chenhui Zhang, Lingzhi Yuan, Yi Zeng, Peiyang Xu, Chengquan Guo, Andy Zhou, Jeffrey Ziwei Tan, Xuandong Zhao, Francesco Pinto, Zhen Xiang, Yu Gai, Zinan Lin, Dan Hendrycks, Bo Li, Dawn Song
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.14827
Pdf URL: https://arxiv.org/pdf/2503.14827
Copy Paste: [[2503.14827]] MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models(https://arxiv.org/abs/2503.14827)
Keywords: privacy, robust, fair
Abstract: Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at this https URL.

Title: Decompositional Neural Scene Reconstruction with Generative Diffusion Prior

Authors: Junfeng Ni, Yu Liu, Ruijie Lu, Zirui Zhou, Song-Chun Zhu, Yixin Chen, Siyuan Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14830
Pdf URL: https://arxiv.org/pdf/2503.14830
Copy Paste: [[2503.14830]] Decompositional Neural Scene Reconstruction with Generative Diffusion Prior(https://arxiv.org/abs/2503.14830)
Keywords: diffusion, generative
Abstract: Decompositional reconstruction of 3D scenes, with complete shapes and detailed texture of all objects within, is intriguing for downstream applications but remains challenging, particularly with sparse views as input. Recent approaches incorporate semantic or geometric regularization to address this issue, but they suffer significant degradation in underconstrained areas and fail to recover occluded regions. We argue that the key to solving this problem lies in supplementing missing information for these areas. To this end, we propose DP-Recon, which employs diffusion priors in the form of Score Distillation Sampling (SDS) to optimize the neural representation of each individual object under novel views. This provides additional information for the underconstrained areas, but directly incorporating diffusion prior raises potential conflicts between the reconstruction and generative guidance. Therefore, we further introduce a visibility-guided approach to dynamically adjust the per-pixel SDS loss weights. Together these components enhance both geometry and appearance recovery while remaining faithful to input images. Extensive experiments across Replica and ScanNet++ demonstrate that our method significantly outperforms SOTA methods. Notably, it achieves better object reconstruction under 10 views than the baselines under 100 views. Our method enables seamless text-based editing for geometry and appearance through SDS optimization and produces decomposed object meshes with detailed UV maps that support photorealistic Visual effects (VFX) editing. The project page is available at this https URL.

Title: On the Robustness Tradeoff in Fine-Tuning

Authors: Kunyang Li, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Blaine Hoak, Yohan Beugin, Eric Pauley, Patrick McDaniel
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14836
Pdf URL: https://arxiv.org/pdf/2503.14836
Copy Paste: [[2503.14836]] On the Robustness Tradeoff in Fine-Tuning(https://arxiv.org/abs/2503.14836)
Keywords: robust
Abstract: Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks--over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks--57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively. Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.

Title: SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments

Authors: Yinqi Chen, Meiying Zhang, Qi Hao, Guang Zhou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.14837
Pdf URL: https://arxiv.org/pdf/2503.14837
Copy Paste: [[2503.14837]] SemanticFlow: A Self-Supervised Framework for Joint Scene Flow Prediction and Instance Segmentation in Dynamic Environments(https://arxiv.org/abs/2503.14837)
Keywords: robust, segmentation
Abstract: Accurate perception of dynamic traffic scenes is crucial for high-level autonomous driving systems, requiring robust object motion estimation and instance segmentation. However, traditional methods often treat them as separate tasks, leading to suboptimal performance, spatio-temporal inconsistencies, and inefficiency in complex scenarios due to the absence of information sharing. This paper proposes a multi-task SemanticFlow framework to simultaneously predict scene flow and instance segmentation of full-resolution point clouds. The novelty of this work is threefold: 1) developing a coarse-to-fine prediction based multi-task scheme, where an initial coarse segmentation of static backgrounds and dynamic objects is used to provide contextual information for refining motion and semantic information through a shared feature processing module; 2) developing a set of loss functions to enhance the performance of scene flow estimation and instance segmentation, while can help ensure spatial and temporal consistency of both static and dynamic objects within traffic scenes; 3) developing a self-supervised learning scheme, which utilizes coarse segmentation to detect rigid objects and compute their transformation matrices between sequential frames, enabling the generation of self-supervised labels. The proposed framework is validated on the Argoverse and Waymo datasets, demonstrating superior performance in instance segmentation accuracy, scene flow estimation, and computational efficiency, establishing a new benchmark for self-supervised methods in dynamic scene understanding.

Title: LogLLaMA: Transformer-based log anomaly detection with LLaMA

Authors: Zhuoyi Yang, Ian G. Harris
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14849
Pdf URL: https://arxiv.org/pdf/2503.14849
Copy Paste: [[2503.14849]] LogLLaMA: Transformer-based log anomaly detection with LLaMA(https://arxiv.org/abs/2503.14849)
Keywords: transformer, generative, large language model
Abstract: Log anomaly detection refers to the task that distinguishes the anomalous log messages from normal log messages. Transformer-based large language models (LLMs) are becoming popular for log anomaly detection because of their superb ability to understand complex and long language patterns. In this paper, we propose LogLLaMA, a novel framework that leverages LLaMA2. LogLLaMA is first finetuned on normal log messages from three large-scale datasets to learn their patterns. After finetuning, the model is capable of generating successive log messages given previous log messages. Our generative model is further trained to identify anomalous log messages using reinforcement learning (RL). The experimental results show that LogLLaMA outperforms the state-of-the-art approaches for anomaly detection on BGL, Thunderbird, and HDFS datasets.

Title: Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection

Authors: Peipeng Yu, Jianwei Fei, Hui Gao, Xuan Feng, Zhihua Xia, Chip Hong Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14853
Pdf URL: https://arxiv.org/pdf/2503.14853
Copy Paste: [[2503.14853]] Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection(https://arxiv.org/abs/2503.14853)
Keywords: explainability, large language model
Abstract: Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misaligned of their knowledge and forensics patterns. To this end, we present a novel paradigm that unlocks VLMs' potential capabilities through three components: (1) A knowledge-guided forgery adaptation module that aligns VLM's semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) A multi-modal prompt tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) An iterative refinement strategy enabling multi-turn dialog for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a Large Language Model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD is used to calculate correlations between image features and pristine/deepfake class embeddings, enabling forgery classification and localization. The outputs from these components are used to construct forgery prompt embeddings. Finally, we feed these prompt embeddings into the LLM to generate textual detection responses to assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance, while also supporting multi-turn dialogue capabilities.

Title: Global Renewables Watch: A Temporal Dataset of Solar and Wind Energy Derived from Satellite Imagery

Authors: Caleb Robinson, Anthony Ortiz, Allen Kim, Rahul Dodhia, Andrew Zolli, Shivaprakash K Nagaraju, James Oakleaf, Joe Kiesecker, Juan M. Lavista Ferres
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14860
Pdf URL: https://arxiv.org/pdf/2503.14860
Copy Paste: [[2503.14860]] Global Renewables Watch: A Temporal Dataset of Solar and Wind Energy Derived from Satellite Imagery(https://arxiv.org/abs/2503.14860)
Keywords: segmentation
Abstract: We present a comprehensive global temporal dataset of commercial solar photovoltaic (PV) farms and onshore wind turbines, derived from high-resolution satellite imagery analyzed quarterly from the fourth quarter of 2017 to the second quarter of 2024. We create this dataset by training deep learning-based segmentation models to identify these renewable energy installations from satellite imagery, then deploy them on over 13 trillion pixels covering the world. For each detected feature, we estimate the construction date and the preceding land use type. This dataset offers crucial insights into progress toward sustainable development goals and serves as a valuable resource for policymakers, researchers, and stakeholders aiming to assess and promote effective strategies for renewable energy deployment. Our final spatial dataset includes 375,197 individual wind turbines and 86,410 solar PV installations. We aggregate our predictions to the country level -- estimating total power capacity based on construction date, solar PV area, and number of windmills -- and find an $r^2$ value of $0.96$ and $0.93$ for solar PV and onshore wind respectively compared to IRENA's most recent 2023 country-level capacity estimates.

Title: Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark

Authors: Ying Liu, Yijing Hua, Haojiang Chai, Yanbo Wang, TengQi Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14862
Pdf URL: https://arxiv.org/pdf/2503.14862
Copy Paste: [[2503.14862]] Fine-Grained Open-Vocabulary Object Detection with Fined-Grained Prompts: Task, Dataset and Benchmark(https://arxiv.org/abs/2503.14862)
Keywords: fair
Abstract: Open-vocabulary detectors are proposed to locate and recognize objects in novel classes. However, variations in vision-aware language vocabulary data used for open-vocabulary learning can lead to unfair and unreliable evaluations. Recent evaluation methods have attempted to address this issue by incorporating object properties or adding locations and characteristics to the captions. Nevertheless, since these properties and locations depend on the specific details of the images instead of classes, detectors can not make accurate predictions without precise descriptions provided through human annotation. This paper introduces 3F-OVD, a novel task that extends supervised fine-grained object detection to the open-vocabulary setting. Our task is intuitive and challenging, requiring a deep understanding of Fine-grained captions and careful attention to Fine-grained details in images in order to accurately detect Fine-grained objects. Additionally, due to the scarcity of qualified fine-grained object detection datasets, we have created a new dataset, NEU-171K, tailored for both supervised and open-vocabulary settings. We benchmark state-of-the-art object detectors on our dataset for both settings. Furthermore, we propose a simple yet effective post-processing technique.

Title: Temporal-Consistent Video Restoration with Pre-trained Diffusion Models

Authors: Hengkang Wang, Yang Liu, Huidong Liu, Chien-Chih Wang, Yanhui Guo, Hongdong Li, Bryan Wang, Ju Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14863
Pdf URL: https://arxiv.org/pdf/2503.14863
Copy Paste: [[2503.14863]] Temporal-Consistent Video Restoration with Pre-trained Diffusion Models(https://arxiv.org/abs/2503.14863)
Keywords: diffusion
Abstract: Video restoration (VR) aims to recover high-quality videos from degraded ones. Although recent zero-shot VR methods using pre-trained diffusion models (DMs) show good promise, they suffer from approximation errors during reverse diffusion and insufficient temporal consistency. Moreover, dealing with 3D video data, VR is inherently computationally intensive. In this paper, we advocate viewing the reverse process in DMs as a function and present a novel Maximum a Posterior (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors. We also introduce strategies to promote bilevel temporal consistency: semantic consistency by leveraging clustering structures in the seed space, and pixel-level consistency by progressive warping with optical flow refinements. Extensive experiments on multiple virtual reality tasks demonstrate superior visual quality and temporal consistency achieved by our method compared to the state-of-the-art.

Title: Efficient Personalization of Quantized Diffusion Model without Backpropagation

Authors: Hoigi Seo, Wongi Jeong, Kyungryeol Lee, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14868
Pdf URL: https://arxiv.org/pdf/2503.14868
Copy Paste: [[2503.14868]] Efficient Personalization of Quantized Diffusion Model without Backpropagation(https://arxiv.org/abs/2503.14868)
Keywords: diffusion
Abstract: Diffusion models have shown remarkable performance in image synthesis, but they demand extensive computational and memory resources for training, fine-tuning and inference. Although advanced quantization techniques have successfully minimized memory usage for inference, training and fine-tuning these quantized models still require large memory possibly due to dequantization for accurate computation of gradients and/or backpropagation for gradient-based algorithms. However, memory-efficient fine-tuning is particularly desirable for applications such as personalization that often must be run on edge devices like mobile phones with private data. In this work, we address this challenge by quantizing a diffusion model with personalization via Textual Inversion and by leveraging a zeroth-order optimization on personalization tokens without dequantization so that it does not require gradient and activation storage for backpropagation that consumes considerable memory. Since a gradient estimation using zeroth-order optimization is quite noisy for a single or a few images in personalization, we propose to denoise the estimated gradient by projecting it onto a subspace that is constructed with the past history of the tokens, dubbed Subspace Gradient. In addition, we investigated the influence of text embedding in image generation, leading to our proposed time steps sampling, dubbed Partial Uniform Timestep Sampling for sampling with effective diffusion timesteps. Our method achieves comparable performance to prior methods in image and text alignment scores for personalizing Stable Diffusion with only forward passes while reducing training memory demand up to $8.2\times$.

Title: Robust Support Vector Machines for Imbalanced and Noisy Data via Benders Decomposition

Authors: Seyed Mojtaba Mohasel, Hamidreza Koosha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14873
Pdf URL: https://arxiv.org/pdf/2503.14873
Copy Paste: [[2503.14873]] Robust Support Vector Machines for Imbalanced and Noisy Data via Benders Decomposition(https://arxiv.org/abs/2503.14873)
Keywords: robust
Abstract: This study introduces a novel formulation to enhance Support Vector Machines (SVMs) in handling class imbalance and noise. Unlike the conventional Soft Margin SVM, which penalizes the magnitude of constraint violations, the proposed model quantifies the number of violations and aims to minimize their frequency. To achieve this, a binary variable is incorporated into the objective function of the primal SVM formulation, replacing the traditional slack variable. Furthermore, each misclassified sample is assigned a priority and an associated constraint. The resulting formulation is a mixed-integer programming model, efficiently solved using Benders decomposition. The proposed model's performance was benchmarked against existing models, including Soft Margin SVM, weighted SVM, and NuSVC. Two primary hypotheses were examined: 1) The proposed model improves the F1-score for the minority class in imbalanced classification tasks. 2) The proposed model enhances classification accuracy in noisy datasets. These hypotheses were evaluated using a Wilcoxon test across multiple publicly available datasets from the OpenML repository. The results supported both hypotheses ($ p < 0.05 $). In addition, the proposed model exhibited several interesting properties, such as improved robustness to noise, a decision boundary shift favoring the minority class, a reduced number of support vectors, and decreased prediction time. The open-source Python implementation of the proposed SVM model is available.

Title: Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers

Authors: Bo Chen, Xiaoyu Li, Yekun Ke, Yingyu Liang, Zhenmei Shi, Zhao Song
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2503.14881
Pdf URL: https://arxiv.org/pdf/2503.14881
Copy Paste: [[2503.14881]] Exploring the Limits of KV Cache Compression in Visual Autoregressive Transformers(https://arxiv.org/abs/2503.14881)
Keywords: transformer
Abstract: A fundamental challenge in Visual Autoregressive models is the substantial memory overhead required during inference to store previously generated representations. Despite various attempts to mitigate this issue through compression techniques, prior works have not explicitly formalized the problem of KV-cache compression in this context. In this work, we take the first step in formally defining the KV-cache compression problem for Visual Autoregressive transformers. We then establish a fundamental negative result, proving that any mechanism for sequential visual token generation under attention-based architectures must use at least $\Omega(n^2 d)$ memory, when $d = \Omega(\log n)$, where $n$ is the number of tokens generated and $d$ is the embedding dimensionality. This result demonstrates that achieving truly sub-quadratic memory usage is impossible without additional structural constraints. Our proof is constructed via a reduction from a computational lower bound problem, leveraging randomized embedding techniques inspired by dimensionality reduction principles. Finally, we discuss how sparsity priors on visual representations can influence memory efficiency, presenting both impossibility results and potential directions for mitigating memory overhead.

Title: MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer

Authors: Honglin Lin, Zhuoshi Pan, Yu Li, Qizhi Pei, Xin Gao, Mengzhang Cai, Conghui He, Lijun Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14891
Pdf URL: https://arxiv.org/pdf/2503.14891
Copy Paste: [[2503.14891]] MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer(https://arxiv.org/abs/2503.14891)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks, leveraging Chain-of-Thought (CoT) data as a vital component in guiding answer generation. Current paradigms typically generate CoT and answers directly for a given problem, diverging from human problem-solving strategies to some extent. Humans often solve problems by recalling analogous cases and leveraging their solutions to reason about the current task. Inspired by this cognitive process, we propose \textbf{MetaLadder}, a novel framework that explicitly prompts LLMs to recall and reflect on meta-problems, those structurally or semantically analogous problems, alongside their CoT solutions before addressing the target problem. Additionally, we introduce a problem-restating mechanism to enhance the model's comprehension of the target problem by regenerating the original question, which further improves reasoning accuracy. Therefore, the model can achieve reasoning transfer from analogical problems, mimicking human-like "learning from examples" and generalization abilities. Extensive experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs' problem-solving accuracy, largely outperforming standard CoT-based methods (\textbf{10.3\%} accuracy gain) and other methods. Our code and data has been released at this https URL.

Title: Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations

Authors: Shuo Li, Jiajun Sun, Guodong Zheng, Xiaoran Fan, Yujiong Shen, Yi Lu, Zhiheng Xi, Yuming Yang, Wenming Tan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.14895
Pdf URL: https://arxiv.org/pdf/2503.14895
Copy Paste: [[2503.14895]] Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations(https://arxiv.org/abs/2503.14895)
Keywords: large language model
Abstract: Recently, multimodal large language models (MLLMs) have demonstrated remarkable performance in visual-language tasks. However, the authenticity of the responses generated by MLLMs is often compromised by object hallucinations. We identify that a key cause of these hallucinations is the model's over-susceptibility to specific image frequency features in detecting objects. In this paper, we introduce Multi-Frequency Perturbations (MFP), a simple, cost-effective, and pluggable method that leverages both low-frequency and high-frequency features of images to perturb visual feature representations and explicitly suppress redundant frequency-domain features during inference, thereby mitigating hallucinations. Experimental results demonstrate that our method significantly mitigates object hallucinations across various model architectures. Furthermore, as a training-time method, MFP can be combined with inference-time methods to achieve state-of-the-art performance on the CHAIR benchmark.

Title: Deep Contrastive Unlearning for Language Models

Authors: Estrid He, Tabinda Sarwar, Ibrahim Khalil, Xun Yi, Ke Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14900
Pdf URL: https://arxiv.org/pdf/2503.14900
Copy Paste: [[2503.14900]] Deep Contrastive Unlearning for Language Models(https://arxiv.org/abs/2503.14900)
Keywords: privacy, protect, large language model
Abstract: The past a few years have witnessed the great success of large language models, demonstrating powerful capabilities in comprehending textual data and generating human-like languages. Large language models achieve success by being trained on vast amounts of textual data, including online sources with copyrighted content and user-generated knowledge. However, this comes at a cost: the potential risk of exposing users' privacy and violating copyright protections. Thus, to safeguard individuals' "right to be forgotten", there has been increasing interests in machine unlearning -- the process of removing information carried by particular training samples from a model while not deteriorating its predictive quality. This is a challenging task due to the black-box nature of language models. Most existing studies focus on mitigating the impact of those forgot samples upon a model's outputs, and do not explicitly consider the geometric distributions of samples in the latent space of a model. To address this issue, we propose a machine unlearning framework, named Deep Contrastive Unlearning for fine-Tuning (DeepCUT) language models. Our proposed model achieves machine unlearning by directly optimizing the latent space of a model. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeepCUT with consistent and significant improvement over baseline methods.

Title: Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

Authors: Siwei Wen, Junyan Ye, Peilin Feng, Hengrui Kang, Zichen Wen, Yize Chen, Jiang Wu, Wenjun Wu, Conghui He, Weijia Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14905
Pdf URL: https://arxiv.org/pdf/2503.14905
Copy Paste: [[2503.14905]] Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation(https://arxiv.org/abs/2503.14905)
Keywords: robust, interpretability
Abstract: With the rapid advancement of Artificial Intelligence Generated Content (AIGC) technologies, synthetic images have become increasingly prevalent in everyday life, posing new challenges for authenticity assessment and detection. Despite the effectiveness of existing methods in evaluating image authenticity and locating forgeries, these approaches often lack human interpretability and do not fully address the growing complexity of synthetic data. To tackle these challenges, we introduce FakeVLM, a specialized large multimodal model designed for both general synthetic image and DeepFake detection tasks. FakeVLM not only excels in distinguishing real from fake images but also provides clear, natural language explanations for image artifacts, enhancing interpretability. Additionally, we present FakeClue, a comprehensive dataset containing over 100,000 images across seven categories, annotated with fine-grained artifact clues in natural language. FakeVLM demonstrates performance comparable to expert models while eliminating the need for additional classifiers, making it a robust solution for synthetic data detection. Extensive evaluations across multiple datasets confirm the superiority of FakeVLM in both authenticity classification and artifact explanation tasks, setting a new benchmark for synthetic image detection. The dataset and code will be released in: this https URL.

Title: Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift

Authors: Jingyi Liao, Xun Xu, Yongyi Su, Rong-Cheng Tu, Yifan Liu, Dacheng Tao, Xulei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14910
Pdf URL: https://arxiv.org/pdf/2503.14910
Copy Paste: [[2503.14910]] Robust Distribution Alignment for Industrial Anomaly Detection under Distribution Shift(https://arxiv.org/abs/2503.14910)
Keywords: robust
Abstract: Anomaly detection plays a crucial role in quality control for industrial applications. However, ensuring robustness under unseen domain shifts such as lighting variations or sensor drift remains a significant challenge. Existing methods attempt to address domain shifts by training generalizable models but often rely on prior knowledge of target distributions and can hardly generalise to backbones designed for other data modalities. To overcome these limitations, we build upon memory-bank-based anomaly detection methods, optimizing a robust Sinkhorn distance on limited target training data to enhance generalization to unseen target domains. We evaluate the effectiveness on both 2D and 3D anomaly detection benchmarks with simulated distribution shifts. Our proposed method demonstrates superior results compared with state-of-the-art anomaly detection and domain adaptation methods.

Title: Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes

Authors: Gahye Lee, Hyejeong Yoon, Jungeon Kim, Seungyong Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14912
Pdf URL: https://arxiv.org/pdf/2503.14912
Copy Paste: [[2503.14912]] Deep Polycuboid Fitting for Compact 3D Representation of Indoor Scenes(https://arxiv.org/abs/2503.14912)
Keywords: transformer
Abstract: This paper presents a novel framework for compactly representing a 3D indoor scene using a set of polycuboids through a deep learning-based fitting method. Indoor scenes mainly consist of man-made objects, such as furniture, which often exhibit rectilinear geometry. This property allows indoor scenes to be represented using combinations of polycuboids, providing a compact representation that benefits downstream applications like furniture rearrangement. Our framework takes a noisy point cloud as input and first detects six types of cuboid faces using a transformer network. Then, a graph neural network is used to validate the spatial relationships of the detected faces to form potential polycuboids. Finally, each polycuboid instance is reconstructed by forming a set of boxes based on the aggregated face labels. To train our networks, we introduce a synthetic dataset encompassing a diverse range of cuboid and polycuboid shapes that reflect the characteristics of indoor scenes. Our framework generalizes well to real-world indoor scene datasets, including Replica, ScanNet, and scenes captured with an iPhone. The versatility of our method is demonstrated through practical applications, such as virtual room tours and scene editing.

Title: MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models

Authors: Jiazheng Li, Lu Yu, Qing Cui, Zhiqiang Zhang, Jun Zhou, Yanfang Ye, Chuxu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14917
Pdf URL: https://arxiv.org/pdf/2503.14917
Copy Paste: [[2503.14917]] MASS: Mathematical Data Selection via Skill Graphs for Pretraining Large Language Models(https://arxiv.org/abs/2503.14917)
Keywords: large language model
Abstract: High-quality data plays a critical role in the pretraining and fine-tuning of large language models (LLMs), even determining their performance ceiling to some degree. Consequently, numerous data selection methods have been proposed to identify subsets of data that can effectively and efficiently enhance model performance. However, most of these methods focus on general data selection and tend to overlook the specific nuances of domain-related data. In this paper, we introduce MASS, a \textbf{MA}thematical data \textbf{S}election framework using the \textbf{S}kill graph for pretraining LLMs in the mathematical reasoning domain. By taking into account the unique characteristics of mathematics and reasoning, we construct a skill graph that captures the mathematical skills and their interrelations from a reference dataset. This skill graph guides us in assigning quality scores to the target dataset, enabling us to select the top-ranked subset which is further used to pretrain LLMs. Experimental results demonstrate the efficiency and effectiveness of MASS across different model sizes (1B and 7B) and pretraining datasets (web data and synthetic data). Specifically, in terms of efficiency, models trained on subsets selected by MASS can achieve similar performance to models trained on the original datasets, with a significant reduction in the number of trained tokens - ranging from 50\% to 70\% fewer tokens. In terms of effectiveness, when trained on the same amount of tokens, models trained on the data selected by MASS outperform those trained on the original datasets by 3.3\% to 5.9\%. These results underscore the potential of MASS to improve both the efficiency and effectiveness of pretraining LLMs.

Title: GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation

Authors: Junyu Shi, Lijiang Liu, Yong Sun, Zhiyuan Zhang, Jinni Zhou, Qiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14919
Pdf URL: https://arxiv.org/pdf/2503.14919
Copy Paste: [[2503.14919]] GenM$^3$: Generative Pretrained Multi-path Motion Model for Text Conditional Human Motion Generation(https://arxiv.org/abs/2503.14919)
Keywords: transformer, generative, large language model
Abstract: Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose Generative Pretrained Multi-path Motion Model (GenM$^3$), a comprehensive framework designed to learn unified motion representations. GenM$^3$ comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment by the text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment it with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM$^3$ achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing state-of-the-art methods by a large margin. It also demonstrates strong zero-shot generalization on IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.

Title: A Semantic and Clean-label Backdoor Attack against Graph Convolutional Networks

Authors: Jiazhu Dai, Haoyu Sun
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.14922
Pdf URL: https://arxiv.org/pdf/2503.14922
Copy Paste: [[2503.14922]] A Semantic and Clean-label Backdoor Attack against Graph Convolutional Networks(https://arxiv.org/abs/2503.14922)
Keywords: security, attack
Abstract: Graph Convolutional Networks (GCNs) have shown excellent performance in graph-structured tasks such as node classification and graph classification. However, recent research has shown that GCNs are vulnerable to a new type of threat called the backdoor attack, where the adversary can inject a hidden backdoor into the GCNs so that the backdoored model performs well on benign samples, whereas its prediction will be maliciously changed to the attacker-specified target label if the hidden backdoor is activated by the attacker-defined trigger. Clean-label backdoor attack and semantic backdoor attack are two new backdoor attacks to Deep Neural Networks (DNNs), they are more imperceptible and have posed new and serious threats. The semantic and clean-label backdoor attack is not fully explored in GCNs. In this paper, we propose a semantic and clean-label backdoor attack against GCNs under the context of graph classification to reveal the existence of this security vulnerability in GCNs. Specifically, SCLBA conducts an importance analysis on graph samples to select one type of node as semantic trigger, which is then inserted into the graph samples to create poisoning samples without changing the labels of the poisoning samples to the attacker-specified target label. We evaluate SCLBA on multiple datasets and the results show that SCLBA can achieve attack success rates close to 99% with poisoning rates of less than 3%, and with almost no impact on the performance of model on benign samples.

Title: pFedFair: Towards Optimal Group Fairness-Accuracy Trade-off in Heterogeneous Federated Learning

Authors: Haoyu Lei, Shizhan Gong, Qi Dou, Farzan Farnia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.14925
Pdf URL: https://arxiv.org/pdf/2503.14925
Copy Paste: [[2503.14925]] pFedFair: Towards Optimal Group Fairness-Accuracy Trade-off in Heterogeneous Federated Learning(https://arxiv.org/abs/2503.14925)
Keywords: federate, fair
Abstract: Federated learning (FL) algorithms commonly aim to maximize clients' accuracy by training a model on their collective data. However, in several FL applications, the model's decisions should meet a group fairness constraint to be independent of sensitive attributes such as gender or race. While such group fairness constraints can be incorporated into the objective function of the FL optimization problem, in this work, we show that such an approach would lead to suboptimal classification accuracy in an FL setting with heterogeneous client distributions. To achieve an optimal accuracy-group fairness trade-off, we propose the Personalized Federated Learning for Client-Level Group Fairness (pFedFair) framework, where clients locally impose their fairness constraints over the distributed training process. Leveraging the image embedding models, we extend the application of pFedFair to computer vision settings, where we numerically show that pFedFair achieves an optimal group fairness-accuracy trade-off in heterogeneous FL settings. We present the results of several numerical experiments on benchmark and synthetic datasets, which highlight the suboptimality of non-personalized FL algorithms and the improvements made by the pFedFair method.

Title: Covering Cracks in Content Moderation: Delexicalized Distant Supervision for Illicit Drug Jargon Detection

Authors: Minkyoo Song, Eugene Jang, Jaehan Kim, Seungwon Shin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14926
Pdf URL: https://arxiv.org/pdf/2503.14926
Copy Paste: [[2503.14926]] Covering Cracks in Content Moderation: Delexicalized Distant Supervision for Illicit Drug Jargon Detection(https://arxiv.org/abs/2503.14926)
Keywords: robust
Abstract: In light of rising drug-related concerns and the increasing role of social media, sales and discussions of illicit drugs have become commonplace online. Social media platforms hosting user-generated content must therefore perform content moderation, which is a difficult task due to the vast amount of jargon used in drug discussions. Previous works on drug jargon detection were limited to extracting a list of terms, but these approaches have fundamental problems in practical application. First, they are trivially evaded using word substitutions. Second, they cannot distinguish whether euphemistic terms such as "pot" or "crack" are being used as drugs or in their benign meanings. We argue that drug content moderation should be done using contexts rather than relying on a banlist. However, manually annotated datasets for training such a task are not only expensive but also prone to becoming obsolete. We present JEDIS, a framework for detecting illicit drug jargon terms by analyzing their contexts. JEDIS utilizes a novel approach that combines distant supervision and delexicalization, which allows JEDIS to be trained without human-labeled data while being robust to new terms and euphemisms. Experiments on two manually annotated datasets show JEDIS significantly outperforms state-of-the-art word-based baselines in terms of F1-score and detection coverage in drug jargon detection. We also conduct qualitative analysis that demonstrates JEDIS is robust against pitfalls faced by existing approaches.

Title: Shushing! Let's Imagine an Authentic Speech from the Silent Video

Authors: Jiaxin Ye, Hongming Shan
Subjects: cs.CV, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2503.14928
Pdf URL: https://arxiv.org/pdf/2503.14928
Copy Paste: [[2503.14928]] Shushing! Let's Imagine an Authentic Speech from the Silent Video(https://arxiv.org/abs/2503.14928)
Keywords: diffusion, transformer
Abstract: Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals, offering significant potential for applications such as dubbing in filmmaking and assisting individuals with aphonia. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues, prompting us to propose Consistent Video-to-Speech (CV2S) as an extended task to enhance cross-modal consistency. To tackle emerging challenges, we introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input, operating within a discrete space. Specifically, we propose a discrete lip aligner that predicts discrete speech tokens from lip videos to capture semantic information, while an error detector identifies misaligned tokens, which are subsequently refined through masked language modeling with BERT. To further enhance the expressiveness of the generated speech, we develop a style diffusion transformer equipped with a face-style adapter that adaptively customizes identity and prosody dynamics across both the channel and temporal dimensions while ensuring synchronization with lip-aware semantic features. Extensive experiments demonstrate that ImaginTalk can generate high-fidelity speech with more accurate semantic details and greater expressiveness in timbre and emotion compared to state-of-the-art baselines. Demos are shown at our project page: this https URL.

Title: Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices

Authors: Ziyao Wang, Yexiao He, Zheyu Shen, Yu Li, Guoheng Sun, Myungjin Lee, Ang Li
Subjects: cs.CR, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2503.14932
Pdf URL: https://arxiv.org/pdf/2503.14932
Copy Paste: [[2503.14932]] Prada: Black-Box LLM Adaptation with Private Data on Resource-Constrained Devices(https://arxiv.org/abs/2503.14932)
Keywords: privacy, large language model
Abstract: In recent years, Large Language Models (LLMs) have demonstrated remarkable abilities in various natural language processing tasks. However, adapting these models to specialized domains using private datasets stored on resource-constrained edge devices, such as smartphones and personal computers, remains challenging due to significant privacy concerns and limited computational resources. Existing model adaptation methods either compromise data privacy by requiring data transmission or jeopardize model privacy by exposing proprietary LLM parameters. To address these challenges, we propose Prada, a novel privacy-preserving and efficient black-box LLM adaptation system using private on-device datasets. Prada employs a lightweight proxy model fine-tuned with Low-Rank Adaptation (LoRA) locally on user devices. During inference, Prada leverages the logits offset, i.e., difference in outputs between the base and adapted proxy models, to iteratively refine outputs from a remote black-box LLM. This offset-based adaptation approach preserves both data privacy and model privacy, as there is no need to share sensitive data or proprietary model parameters. Furthermore, we incorporate speculative decoding to further speed up the inference process of Prada, making the system practically deployable on bandwidth-constrained edge devices, enabling a more practical deployment of Prada. Extensive experiments on various downstream tasks demonstrate that Prada achieves performance comparable to centralized fine-tuning methods while significantly reducing computational overhead by up to 60% and communication costs by up to 80%.

Title: FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Authors: Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14935
Pdf URL: https://arxiv.org/pdf/2503.14935
Copy Paste: [[2503.14935]] FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding(https://arxiv.org/abs/2503.14935)
Keywords: interpretability, large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free and a GPT-assisted caption assessment method, where the former can enhance benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. The results of finetuning Qwen2.5-VL on FAVOR-Train yield consistent improvements on motion-related tasks of TVBench, MotionBench and our FAVOR-Bench. Comprehensive assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools to the community for developing more powerful video understanding models. Project page: \href{this https URL}{this https URL}.

Title: VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Authors: Tengjin Weng, Jingyi Wang, Wenhao Jiang, Zhong Ming
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14939
Pdf URL: https://arxiv.org/pdf/2503.14939
Copy Paste: [[2503.14939]] VisNumBench: Evaluating Number Sense of Multimodal Large Language Models(https://arxiv.org/abs/2503.14939)
Keywords: large language model
Abstract: Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities. All benchmark resources, including code and datasets, will be publicly available at this https URL.

Title: UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation

Authors: Qihui Zhang, Munan Ning, Zheyuan Liu, Yanbo Wang, Jiayi Ye, Yue Huang, Shuo Yang, Xiao Chen, Yibing Song, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14941
Pdf URL: https://arxiv.org/pdf/2503.14941
Copy Paste: [[2503.14941]] UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation(https://arxiv.org/abs/2503.14941)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on conducting objective evaluations of these models. Existing evaluation methods face limitations due to the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluations. Although automated MLLM-as-judge approaches attempt to reduce the human workload through automatic evaluations, they often introduce biases. To address these problems, we propose an Unsupervised Peer review MLLM Evaluation framework. It utilizes only image data, allowing models to automatically generate questions and conduct peer review assessments of answers from other models, effectively alleviating the reliance on human workload. Additionally, we introduce the vision-language scoring system to mitigate the bias issues, which focuses on three aspects: (i) response correctness; (ii) visual understanding and reasoning; and (iii) image-text correlation. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMstar dataset and 0.814 on the ScienceQA dataset, indicating that our framework closely aligns with human-designed benchmarks and inherent human preferences.

Title: MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Authors: Zihan Cao, Yu Zhong, Ziqi Wang, Liang-Jian Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14944
Pdf URL: https://arxiv.org/pdf/2503.14944
Copy Paste: [[2503.14944]] MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance(https://arxiv.org/abs/2503.14944)
Keywords: diffusion, transformer
Abstract: Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (\textit{e.g.}, noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities. To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants. Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines. Codes are available at this https URL.

Title: Generating Multimodal Driving Scenes via Next-Scene Prediction

Authors: Yanhao Wu, Haoyang Zhang, Tianwei Lin, Lichao Huang, Shujie Luo, Rui Wu, Congpei Qiu, Wei Ke, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14945
Pdf URL: https://arxiv.org/pdf/2503.14945
Copy Paste: [[2503.14945]] Generating Multimodal Driving Scenes via Next-Scene Prediction(https://arxiv.org/abs/2503.14945)
Keywords: generative
Abstract: Generative models in Autonomous Driving (AD) enable diverse scene creation, yet existing methods fall short by only capturing a limited range of modalities, restricting the capability of generating controllable scenes for comprehensive evaluation of AD systems. In this paper, we introduce a multimodal generation framework that incorporates four major data modalities, including a novel addition of map modality. With tokenized modalities, our scene sequence generation framework autoregressively predicts each scene while managing computational demands through a two-stage approach. The Temporal AutoRegressive (TAR) component captures inter-frame dynamics for each modality while the Ordered AutoRegressive (OAR) component aligns modalities within each scene by sequentially predicting tokens in a fixed order. To maintain coherence between map and ego-action modalities, we introduce the Action-aware Map Alignment (AMA) module, which applies a transformation based on the ego-action to maintain coherence between these modalities. Our framework effectively generates complex, realistic driving scenes over extended sequences, ensuring multimodal consistency and offering fine-grained control over scene elements.

Title: ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents

Authors: Hao Liang, Zhipeng Dong, Yi Yang, Mengyin Fu
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2503.14948
Pdf URL: https://arxiv.org/pdf/2503.14948
Copy Paste: [[2503.14948]] ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents(https://arxiv.org/abs/2503.14948)
Keywords: large language model
Abstract: Collaborative perception has garnered significant attention for its ability to enhance the perception capabilities of individual vehicles through the exchange of information with surrounding vehicle-agents. However, existing collaborative perception systems are limited by inefficiencies in user interaction and the challenge of multi-camera photorealistic visualization. To address these challenges, this paper introduces ChatStitch, the first collaborative perception system capable of unveiling obscured blind spot information through natural language commands integrated with external digital assets. To adeptly handle complex or abstract commands, ChatStitch employs a multi-agent collaborative framework based on Large Language Models. For achieving the most intuitive perception for humans, ChatStitch proposes SV-UDIS, the first surround-view unsupervised deep image stitching method under the non-global-overlapping condition. We conducted extensive experiments on the UDIS-D, MCOV-SLAM open datasets, and our real-world dataset. Specifically, our SV-UDIS method achieves state-of-the-art performance on the UDIS-D dataset for 3, 4, and 5 image stitching tasks, with PSNR improvements of 9%, 17%, and 21%, and SSIM improvements of 8%, 18%, and 26%, respectively.

Title: USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network

Authors: Joseph Emmanuel DL Dayo, Prospero C. Naval Jr
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.14950
Pdf URL: https://arxiv.org/pdf/2503.14950
Copy Paste: [[2503.14950]] USAM-Net: A U-Net-based Network for Improved Stereo Correspondence and Scene Depth Estimation using Features from a Pre-trained Image Segmentation network(https://arxiv.org/abs/2503.14950)
Keywords: segmentation
Abstract: The increasing demand for high-accuracy depth estimation in autonomous driving and augmented reality applications necessitates advanced neural architectures capable of effectively leveraging multiple data modalities. In this context, we introduce the Unified Segmentation Attention Mechanism Network (USAM-Net), a novel convolutional neural network that integrates stereo image inputs with semantic segmentation maps and attention to enhance depth estimation performance. USAM-Net employs a dual-pathway architecture, which combines a pre-trained segmentation model (SAM) and a depth estimation model. The segmentation pathway preprocesses the stereo images to generate semantic masks, which are then concatenated with the stereo images as inputs to the depth estimation pathway. This integration allows the model to focus on important features such as object boundaries and surface textures which are crucial for accurate depth perception. Empirical evaluation on the DrivingStereo dataset demonstrates that USAM-Net achieves superior performance metrics, including a Global Difference (GD) of 3.61\% and an End-Point Error (EPE) of 0.88, outperforming traditional models such as CFNet, SegStereo, and iResNet. These results underscore the effectiveness of integrating segmentation information into stereo depth estimation tasks, highlighting the potential of USAM-Net in applications demanding high-precision depth data.

Title: Depth-Aware Range Image-Based Model for Point Cloud Segmentation

Authors: Bike Chen, Antti Tikanmäki, Juha Röning
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14955
Pdf URL: https://arxiv.org/pdf/2503.14955
Copy Paste: [[2503.14955]] Depth-Aware Range Image-Based Model for Point Cloud Segmentation(https://arxiv.org/abs/2503.14955)
Keywords: segmentation
Abstract: Point cloud segmentation (PCS) aims to separate points into different and meaningful groups. The task plays an important role in robotics because PCS enables robots to understand their physical environments directly. To process sparse and large-scale outdoor point clouds in real time, range image-based models are commonly adopted. However, in a range image, the lack of explicit depth information inevitably causes some separate objects in 3D space to touch each other, bringing difficulty for the range image-based models in correctly segmenting the objects. Moreover, previous PCS models are usually derived from the existing color image-based models and unable to make full use of the implicit but ordered depth information inherent in the range image, thereby achieving inferior performance. In this paper, we propose Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block in each architecture stage. Extensive experiments conducted on SemanticKITTI, nuScenes, and SemanticPOSS demonstrate that DAM brings a significant improvement for Fast FMVNet V3 with negligible computational cost.

Title: Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning

Authors: Zixuan Zheng, Yilei Shi, Chunlei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14958
Pdf URL: https://arxiv.org/pdf/2503.14958
Copy Paste: [[2503.14958]] Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning(https://arxiv.org/abs/2503.14958)
Keywords: segmentation
Abstract: Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low data regime. Code is available at this https URL.

Title: Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models

Authors: Tingxiu Chen, Yilei Shi, Zixuan Zheng, Bingcong Yan, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.14966
Pdf URL: https://arxiv.org/pdf/2503.14966
Copy Paste: [[2503.14966]] Ultrasound Image-to-Video Synthesis via Latent Dynamic Diffusion Models(https://arxiv.org/abs/2503.14966)
Keywords: diffusion
Abstract: Ultrasound video classification enables automated diagnosis and has emerged as an important research area. However, publicly available ultrasound video datasets remain scarce, hindering progress in developing effective video classification models. We propose addressing this shortage by synthesizing plausible ultrasound videos from readily available, abundant ultrasound images. To this end, we introduce a latent dynamic diffusion model (LDDM) to efficiently translate static images to dynamic sequences with realistic video characteristics. We demonstrate strong quantitative results and visually appealing synthesized videos on the BUSV benchmark. Notably, training video classification models on combinations of real and LDDM-synthesized videos substantially improves performance over using real data alone, indicating our method successfully emulates dynamics critical for discrimination. Our image-to-video approach provides an effective data augmentation solution to advance ultrasound video analysis. Code is available at this https URL.

Title: Language-based Image Colorization: A Benchmark and Beyond

Authors: Yifan Li, Shuai Yang, Jiaying Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14974
Pdf URL: https://arxiv.org/pdf/2503.14974
Copy Paste: [[2503.14974]] Language-based Image Colorization: A Benchmark and Beyond(https://arxiv.org/abs/2503.14974)
Keywords: diffusion
Abstract: Image colorization aims to bring colors back to grayscale images. Automatic image colorization methods, which requires no additional guidance, struggle to generate high-quality images due to color ambiguity, and provides limited user controllability. Thanks to the emergency of cross-modality datasets and models, language-based colorization methods are proposed to fully utilize the efficiency and flexibly of text descriptions to guide colorization. In view of the lack of a comprehensive review of language-based colorization literature, we conduct a thorough analysis and benchmarking. We first briefly summarize existing automatic colorization methods. Then, we focus on language-based methods and point out their core challenge on cross-modal alignment. We further divide these methods into two categories: one attempts to train a cross-modality network from scratch, while the other utilizes the pre-trained cross-modality model to establish the textual-visual correspondence. Based on the analyzed limitations of existing language-based methods, we propose a simple yet effective method based on distilled diffusion model. Extensive experiments demonstrate that our simple baseline can produces better results than previous complex methods with 14 times speed up. To the best of our knowledge, this is the first comprehensive review and benchmark on language-based image colorization field, providing meaningful insights for the community. The code is available at this https URL.

Title: Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening

Authors: Zihan Cao, Yu Zhong, Liang-Jian Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14975
Pdf URL: https://arxiv.org/pdf/2503.14975
Copy Paste: [[2503.14975]] Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening(https://arxiv.org/abs/2503.14975)
Keywords: robust, diffusion
Abstract: Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step. Codes are available at this https URL.

Title: One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks

Authors: Yaxiong Chen, Junjian Hu, Chunlei Li, Zixuan Zheng, Jingliang Hu, Yilei Shi, Shengwu Xiong, Xiao Xiang Zhu, Lichao Mou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14979
Pdf URL: https://arxiv.org/pdf/2503.14979
Copy Paste: [[2503.14979]] One-Shot Medical Video Object Segmentation via Temporal Contrastive Memory Networks(https://arxiv.org/abs/2503.14979)
Keywords: segmentation
Abstract: Video object segmentation is crucial for the efficient analysis of complex medical video data, yet it faces significant challenges in data availability and annotation. We introduce the task of one-shot medical video object segmentation, which requires separating foreground and background pixels throughout a video given only the mask annotation of the first frame. To address this problem, we propose a temporal contrastive memory network comprising image and mask encoders to learn feature representations, a temporal contrastive memory bank that aligns embeddings from adjacent frames while pushing apart distant ones to explicitly model inter-frame relationships and stores these features, and a decoder that fuses encoded image features and memory readouts for segmentation. We also collect a diverse, multi-source medical video dataset spanning various modalities and anatomies to benchmark this task. Extensive experiments demonstrate state-of-the-art performance in segmenting both seen and unseen structures from a single exemplar, showing ability to generalize from scarce labels. This highlights the potential to alleviate annotation burdens for medical video analysis. Code is available at this https URL.

Title: Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation

Authors: Zanting Ye, Xiaolong Niu, Xuanbin Wu, Wenxiang Yi, Yuan Chang, Lijun Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.14983
Pdf URL: https://arxiv.org/pdf/2503.14983
Copy Paste: [[2503.14983]] Semi-KAN: KAN Provides an Effective Representation for Semi-Supervised Learning in Medical Image Segmentation(https://arxiv.org/abs/2503.14983)
Keywords: robust, segmentation
Abstract: Deep learning-based medical image segmentation has shown remarkable success; however, it typically requires extensive pixel-level annotations, which are both expensive and time-intensive. Semi-supervised medical image segmentation (SSMIS) offers a viable alternative, driven by advancements in CNNs and ViTs. However, these networks often rely on single fixed activation functions and linear modeling patterns, limiting their ability to effectively learn robust representations. Given the limited availability of labeled date, achieving robust representation learning becomes crucial. Inspired by Kolmogorov-Arnold Networks (KANs), we propose Semi-KAN, which leverages the untapped potential of KANs to enhance backbone architectures for representation learning in SSMIS. Our findings indicate that: (1) compared to networks with fixed activation functions, KANs exhibit superior representation learning capabilities with fewer parameters, and (2) KANs excel in high-semantic feature spaces. Building on these insights, we integrate KANs into tokenized intermediate representations, applying them selectively at the encoder's bottleneck and the decoder's top layers within a U-Net pipeline to extract high-level semantic features. Although learnable activation functions improve feature expansion, they introduce significant computational overhead with only marginal performance gains. To mitigate this, we reduce the feature dimensions and employ horizontal scaling to capture multiple pattern representations. Furthermore, we design a multi-branch U-Net architecture with uncertainty estimation to effectively learn diverse pattern representations. Extensive experiments on four public datasets demonstrate that Semi-KAN surpasses baseline networks, utilizing fewer KAN layers and lower computational cost, thereby underscoring the potential of KANs as a promising approach for SSMIS.

Title: Inspecting the Representation Manifold of Differentially-Private Text

Authors: Stefan Arnold
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14991
Pdf URL: https://arxiv.org/pdf/2503.14991
Copy Paste: [[2503.14991]] Inspecting the Representation Manifold of Differentially-Private Text(https://arxiv.org/abs/2503.14991)
Keywords: privacy
Abstract: Differential Privacy (DP) for text has recently taken the form of text paraphrasing using language models and temperature sampling to better balance privacy and utility. However, the geometric distortion of DP regarding the structure and complexity in the representation space remains unexplored. By estimating the intrinsic dimension of paraphrased text across varying privacy budgets, we find that word-level methods severely raise the representation manifold, while sentence-level methods produce paraphrases whose manifolds are topologically more consistent with human-written paraphrases. Among sentence-level methods, masked paraphrasing, compared to causal paraphrasing, demonstrates superior preservation of structural complexity, suggesting that autoregressive generation propagates distortions from unnatural word choices that cascade and inflate the representation space.

Title: Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Authors: Francesco Maria Molfese, Luca Moroni, Luca Gioffrè, Alessandro Scirè, Simone Conia, Roberto Navigli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.14996
Pdf URL: https://arxiv.org/pdf/2503.14996
Copy Paste: [[2503.14996]] Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering(https://arxiv.org/abs/2503.14996)
Keywords: extraction, large language model
Abstract: One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model's answer is thought to be simple to extract and is directly compared to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.

Title: LLM Alignment for the Arabs: A Homogenous Culture or Diverse Ones?

Authors: Amr Keleg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15003
Pdf URL: https://arxiv.org/pdf/2503.15003
Copy Paste: [[2503.15003]] LLM Alignment for the Arabs: A Homogenous Culture or Diverse Ones?(https://arxiv.org/abs/2503.15003)
Keywords: large language model
Abstract: Large language models (LLMs) have the potential of being useful tools that can automate tasks and assist humans. However, these models are more fluent in English and more aligned with Western cultures, norms, and values. Arabic-specific LLMs are being developed to better capture the nuances of the Arabic language, as well as the views of the Arabs. Yet, Arabs are sometimes assumed to share the same culture. In this position paper, I discuss the limitations of this assumption and provide preliminary thoughts for how to build systems that can better represent the cultural diversity within the Arab world. The invalidity of the cultural homogeneity assumption might seem obvious, yet, it is widely adopted in developing multilingual and Arabic-specific LLMs. I hope that this paper will encourage the NLP community to be considerate of the cultural diversity within various communities speaking the same language.

Title: Semantic Segmentation of Transparent and Opaque Drinking Glasses with the Help of Zero-shot Learning

Authors: Annalena Blänsdorf, Tristan Wirth, Arne Rak, Thomas Pöllabauer, Volker Knauthe, Arjan Kuijper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15004
Pdf URL: https://arxiv.org/pdf/2503.15004
Copy Paste: [[2503.15004]] Semantic Segmentation of Transparent and Opaque Drinking Glasses with the Help of Zero-shot Learning(https://arxiv.org/abs/2503.15004)
Keywords: segmentation
Abstract: Segmenting transparent structures in images is challenging since they are difficult to distinguish from the background. Common examples are drinking glasses, which are a ubiquitous part of our lives and appear in many different shapes and sizes. In this work we propose TransCaGNet, a modified version of the zero-shot model CaGNet. We exchange the segmentation backbone with the architecture of Trans4Trans to be capable of segmenting transparent objects. Since some glasses are rarely captured, we use zeroshot learning to be able to create semantic segmentations of glass categories not given during training. We propose a novel synthetic dataset covering a diverse set of different environmental conditions. Additionally we capture a real-world evaluation dataset since most applications take place in the real world. Comparing our model with Zeg-Clip we are able to show that TransCaGNet produces better mean IoU and accuracy values while ZegClip outperforms it mostly for unseen classes. To improve the segmentation results, we combine the semantic segmentation of the models with the segmentation results of SAM 2. Our evaluation emphasizes that distinguishing between different classes is challenging for the models due to similarity, points of view, or coverings. Taking this behavior into account, we assign glasses multiple possible categories. The modification leads to an improvement up to 13.68% for the mean IoU and up to 17.88% for the mean accuracy values on the synthetic dataset. Using our difficult synthetic dataset for training, the models produce even better results on the real-world dataset. The mean IoU is improved up to 5.55% and the mean accuracy up to 5.72% on the real-world dataset.

Title: OFL: Opportunistic Federated Learning for Resource-Heterogeneous and Privacy-Aware Devices

Authors: Yunlong Mao, Mingyang Niu, Ziqin Dang, Chengxi Li, Hanning Xia, Yuejuan Zhu, Haoyu Bian, Yuan Zhang, Jingyu Hua, Sheng Zhong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.15015
Pdf URL: https://arxiv.org/pdf/2503.15015
Copy Paste: [[2503.15015]] OFL: Opportunistic Federated Learning for Resource-Heterogeneous and Privacy-Aware Devices(https://arxiv.org/abs/2503.15015)
Keywords: secure, security, privacy, attack, federate
Abstract: Efficient and secure federated learning (FL) is a critical challenge for resource-limited devices, especially mobile devices. Existing secure FL solutions commonly incur significant overhead, leading to a contradiction between efficiency and security. As a result, these two concerns are typically addressed separately. This paper proposes Opportunistic Federated Learning (OFL), a novel FL framework designed explicitly for resource-heterogenous and privacy-aware FL devices, solving efficiency and security problems jointly. OFL optimizes resource utilization and adaptability across diverse devices by adopting a novel hierarchical and asynchronous aggregation strategy. OFL provides strong security by introducing a differentially private and opportunistic model updating mechanism for intra-cluster model aggregation and an advanced threshold homomorphic encryption scheme for inter-cluster aggregation. Moreover, OFL secures global model aggregation by implementing poisoning attack detection using frequency analysis while keeping models encrypted. We have implemented OFL in a real-world testbed and evaluated OFL comprehensively. The evaluation results demonstrate that OFL achieves satisfying model performance and improves efficiency and security, outperforming existing solutions.

Title: Manifold Learning for Hyperspectral Images

Authors: Fethi Harkat (EDP, DT), Tiphaine Deuberet (DT), Guillaume Gey (DT), Valérie Perrier (EDP), Kévin Polisano (SVH)
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15016
Pdf URL: https://arxiv.org/pdf/2503.15016
Copy Paste: [[2503.15016]] Manifold Learning for Hyperspectral Images(https://arxiv.org/abs/2503.15016)
Keywords: robust, extraction
Abstract: Traditional feature extraction and projection techniques, such as Principal Component Analysis, struggle to adequately represent X-Ray Transmission (XRT) Multi-Energy (ME) images, limiting the performance of neural networks in decision-making processes. To address this issue, we propose a method that approximates the dataset topology by constructing adjacency graphs using the Uniform Manifold Approximation and Projection. This approach captures nonlinear correlations within the data, significantly improving the performance of machine learning algorithms, particularly in processing Hyperspectral Images (HSI) from X-ray transmission spectroscopy. This technique not only preserves the global structure of the data but also enhances feature separability, leading to more accurate and robust classification results.

Title: Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training

Authors: Yunwei Lan, Zhigao Cui, Chang Liu, Jialun Peng, Nian Wang, Xin Luo, Dong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15017
Pdf URL: https://arxiv.org/pdf/2503.15017
Copy Paste: [[2503.15017]] Exploiting Diffusion Prior for Real-World Image Dehazing with Unpaired Training(https://arxiv.org/abs/2503.15017)
Keywords: diffusion, generative
Abstract: Unpaired training has been verified as one of the most effective paradigms for real scene dehazing by learning from unpaired real-world hazy and clear images. Although numerous studies have been proposed, current methods demonstrate limited generalization for various real scenes due to limited feature representation and insufficient use of real-world prior. Inspired by the strong generative capabilities of diffusion models in producing both hazy and clear images, we exploit diffusion prior for real-world image dehazing, and propose an unpaired framework named Diff-Dehazer. Specifically, we leverage diffusion prior as bijective mapping learners within the CycleGAN, a classic unpaired learning framework. Considering that physical priors contain pivotal statistics information of real-world data, we further excavate real-world knowledge by integrating physical priors into our framework. Furthermore, we introduce a new perspective for adequately leveraging the representation ability of diffusion models by removing degradation in image and text modalities, so as to improve the dehazing effect. Extensive experiments on multiple real-world datasets demonstrate the superior performance of our method. Our code this https URL.

Title: Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Authors: Shengqiong Wu, Hao Fei, Jingkang Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Tat-seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15019
Pdf URL: https://arxiv.org/pdf/2503.15019
Copy Paste: [[2503.15019]] Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene(https://arxiv.org/abs/2503.15019)
Keywords: large language model
Abstract: The latest emerged 4D Panoptic Scene Graph (4D-PSG) provides an advanced-ever representation for comprehensively modeling the dynamic 4D visual real world. Unfortunately, current pioneering 4D-PSG research can primarily suffer from data scarcity issues severely, as well as the resulting out-of-vocabulary problems; also, the pipeline nature of the benchmark generation method can lead to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to infer accurate and comprehensive object and relation labels iteratively. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy effectively transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG. Extensive experiments on the benchmark data demonstrate that we strikingly outperform baseline models by a large margin, highlighting the effectiveness of our method.

Title: Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script

Authors: Chaouki Boufenar, Mehdi Ayoub Rabiai, Boualem Nadjib Zahaf, Khelil Rafik Ouaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15023
Pdf URL: https://arxiv.org/pdf/2503.15023
Copy Paste: [[2503.15023]] Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script(https://arxiv.org/abs/2503.15023)
Keywords: robust, transformer
Abstract: Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluated custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduced an ensemble model that leverages confidence-based fusion to integrate their strengths. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.

Title: Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Authors: Jin Wang, Chenghui Lv, Xian Li, Shichao Dong, Huadong Li, kelu Yao, Chao Li, Wenqi Shao, Ping Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15024
Pdf URL: https://arxiv.org/pdf/2503.15024
Copy Paste: [[2503.15024]] Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models(https://arxiv.org/abs/2503.15024)
Keywords: security, robust
Abstract: Recently, the rapid development of AIGC has significantly boosted the diversities of fake media spread in the Internet, posing unprecedented threats to social security, politics, law, and etc. To detect the ever-increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed to exploit Large Vision Language Models (LVLMs) to design robust forgery detectors due to their impressive performance on a wide range of multimodal tasks. However, it still lacks a comprehensive benchmark designed to comprehensively assess LVLMs' discerning capabilities on forgery media. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite to assess LVLMs across massive forgery detection tasks, requiring comprehensive recognition, location and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types and forgery models. We conduct thorough evaluations on 22 open-sourced LVLMs and 3 proprietary models GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at this https URL.

Title: Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence

Authors: Satyajeet Sahoo, Jhareswar Maiti, Virendra Kumar Tewari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15036
Pdf URL: https://arxiv.org/pdf/2503.15036
Copy Paste: [[2503.15036]] Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence(https://arxiv.org/abs/2503.15036)
Keywords: interpretability, generative
Abstract: An important aspect of text mining involves information retrieval in form of discovery of semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) elegantly model topics as probability distributions and are useful in identifying latent topics from large document corpora with minimal supervision, they suffer from difficulty in topic interpretability and reduced performance in shorter texts. Here we propose a novel Multivariate Gaussian Topic modelling (MGD) approach. In this approach topics are presented as Multivariate Gaussian Distributions and documents as Gaussian Mixture Models. Using EM algorithm, the various constituent Multivariate Gaussian Distributions and their corresponding parameters are identified. Analysis of the parameters helps identify the keywords having the highest variance and mean contributions to the topic, and from these key-words topic annotations are carried out. This approach is first applied on a synthetic dataset to demonstrate the interpretability benefits vis-à-vis LDA. A real-world application of this topic model is demonstrated in analysis of risks and hazards at a petrochemical plant by applying the model on safety incident reports to identify the major latent hazards plaguing the plant. This model achieves a higher mean topic coherence of 0.436 vis-à-vis 0.294 for LDA.

Title: SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection

Authors: Haoyi Li, Angela Yifei Yuan, Soyeon Caren Han, Christopher Leckie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15044
Pdf URL: https://arxiv.org/pdf/2503.15044
Copy Paste: [[2503.15044]] SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection(https://arxiv.org/abs/2503.15044)
Keywords: large language model
Abstract: The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of systematically generated, high-quality datasets for training. To address this issue, we propose five novel data augmentation frameworks for synthetic user dialogue generation through a structured prompting approach, reducing the costs associated with traditional data collection methods. Our proposed method yields 14 new dialogue datasets, which we benchmark against seven MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by our proposed augmentation framework. Furthermore, considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. We also benchmark online detection performance with limited chat history on our frameworks. Our open-source datasets can be downloaded from this https URL.

Title: ELTEX: A Framework for Domain-Driven Synthetic Data Generation

Authors: Arina Razmyslovich, Kseniia Murasheva, Sofia Sedlova, Julien Capitaine, Eugene Dmitriev
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15055
Pdf URL: https://arxiv.org/pdf/2503.15055
Copy Paste: [[2503.15055]] ELTEX: A Framework for Domain-Driven Synthetic Data Generation(https://arxiv.org/abs/2503.15055)
Keywords: security, attack, extraction, large language model
Abstract: We present ELTEX (Efficient LLM Token Extraction), a domain-driven framework for generating high-quality synthetic training data in specialized domains. While Large Language Models (LLMs) have shown impressive general capabilities, their performance in specialized domains like cybersecurity remains limited by the scarcity of domain-specific training data. ELTEX addresses this challenge by systematically integrating explicit domain indicator extraction with dynamic prompting to preserve critical domain knowledge throughout the generation process. We demonstrate ELTEX's effectiveness in the context of blockchain-related cyberattack detection, where we fine-tune Gemma-2B using various combinations of real and ELTEX-generated data. Our results show that the ELTEX-enhanced model achieves performance competitive with GPT-4 across both standard classification metrics and uncertainty calibration, while requiring significantly fewer computational resources. We release a curated synthetic dataset of social media texts for cyberattack detection in blockchain. Our work demonstrates that domain-driven synthetic data generation can effectively bridge the performance gap between resource-efficient models and larger architectures in specialized domains.

Title: Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation

Authors: Suhyeon Lee, Kwanyoung Kim, Jong Chul Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15056
Pdf URL: https://arxiv.org/pdf/2503.15056
Copy Paste: [[2503.15056]] Single-Step Bidirectional Unpaired Image Translation Using Implicit Bridge Consistency Distillation(https://arxiv.org/abs/2503.15056)
Keywords: diffusion
Abstract: Unpaired image-to-image translation has seen significant progress since the introduction of CycleGAN. However, methods based on diffusion models or Schrödinger bridges have yet to be widely adopted in real-world applications due to their iterative sampling nature. To address this challenge, we propose a novel framework, Implicit Bridge Consistency Distillation (IBCD), which enables single-step bidirectional unpaired translation without using adversarial loss. IBCD extends consistency distillation by using a diffusion implicit bridge model that connects PF-ODE trajectories between distributions. Additionally, we introduce two key improvements: 1) distribution matching for consistency distillation and 2) adaptive weighting method based on distillation difficulty. Experimental results demonstrate that IBCD achieves state-of-the-art performance on benchmark datasets in a single generation step. Project page available at this https URL

Title: Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

Authors: Imanol G. Estepa, Jesús M. Rodríguez-de-Vera, Ignacio Sarasúa, Bhalaji Nagarajan, Petia Radeva
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15060
Pdf URL: https://arxiv.org/pdf/2503.15060
Copy Paste: [[2503.15060]] Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis(https://arxiv.org/abs/2503.15060)
Keywords: generative
Abstract: While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.

Title: A Comprehensive Quantification of Inconsistencies in Memory Dumps

Authors: Andrea Oliveri, Davide Balzarotti
Subjects: cs.CR, cs.OS
Abstract URL: https://arxiv.org/abs/2503.15065
Pdf URL: https://arxiv.org/pdf/2503.15065
Copy Paste: [[2503.15065]] A Comprehensive Quantification of Inconsistencies in Memory Dumps(https://arxiv.org/abs/2503.15065)
Keywords: attack, steal
Abstract: Memory forensics is a powerful technique commonly adopted to investigate compromised machines and to detect stealthy computer attacks that do not store data on non-volatile storage. To employ this technique effectively, the analyst has to first acquire a faithful copy of the system's volatile memory after the incident. However, almost all memory acquisition tools capture the content of physical memory without stopping the system's activity and by following the ascending order of the physical pages, which can lead to inconsistencies and errors in the dump. In this paper we developed a system to track all write operations performed by the OS kernel during a memory acquisition process. This allows us to quantify, for the first time, the exact number and type of inconsistencies observed in memory dumps. We examine the runtime activity of three different operating systems and the way the manage physical memory. Then, focusing on Linux, we quantify how different acquisition modes, file systems, and hardware targets influence the frequency of kernel writes during the dump. We also analyze the impact of inconsistencies on the reconstruction of page tables and major kernel data structures used by Volatility to extract forensic artifacts. Our results show that inconsistencies are very common and that their presence can undermine the reliability and validity of memory forensics analysis.

Title: An Investigation of Beam Density on LiDAR Object Detection Performance

Authors: Christoph Griesbacher, Christian Fruhwirth-Reisinger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15087
Pdf URL: https://arxiv.org/pdf/2503.15087
Copy Paste: [[2503.15087]] An Investigation of Beam Density on LiDAR Object Detection Performance(https://arxiv.org/abs/2503.15087)
Keywords: robust
Abstract: Accurate 3D object detection is a critical component of autonomous driving, enabling vehicles to perceive their surroundings with precision and make informed decisions. LiDAR sensors, widely used for their ability to provide detailed 3D measurements, are key to achieving this capability. However, variations between training and inference data can cause significant performance drops when object detection models are employed in different sensor settings. One critical factor is beam density, as inference on sparse, cost-effective LiDAR sensors is often preferred in real-world applications. Despite previous work addressing the beam-density-induced domain gap, substantial knowledge gaps remain, particularly concerning dense 128-beam sensors in cross-domain scenarios. To gain better understanding of the impact of beam density on domain gaps, we conduct a comprehensive investigation that includes an evaluation of different object detection architectures. Our architecture evaluation reveals that combining voxel- and point-based approaches yields superior cross-domain performance by leveraging the strengths of both representations. Building on these findings, we analyze beam-density-induced domain gaps and argue that these domain gaps must be evaluated in conjunction with other domain shifts. Contrary to conventional beliefs, our experiments reveal that detectors benefit from training on denser data and exhibit robustness to beam density variations during inference.

Title: Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings

Authors: Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, Dacheng Tao
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.15092
Pdf URL: https://arxiv.org/pdf/2503.15092
Copy Paste: [[2503.15092]] Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings(https://arxiv.org/abs/2503.15092)
Keywords: large language model
Abstract: This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on evaluating the safety risks associated with their generated content. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models. Our code is available at this https URL.

Title: Diffusion-Based Forecasting for Uncertainty-Aware Model Predictive Control

Authors: Stelios Zarifis, Ioannis Kordonis, Petros Maragos
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2503.15095
Pdf URL: https://arxiv.org/pdf/2503.15095
Copy Paste: [[2503.15095]] Diffusion-Based Forecasting for Uncertainty-Aware Model Predictive Control(https://arxiv.org/abs/2503.15095)
Keywords: diffusion
Abstract: We propose Diffusion-Informed Model Predictive Control (D-I MPC), a generic framework for uncertainty-aware prediction and decision-making in partially observable stochastic systems by integrating diffusion-based time series forecasting models in Model Predictive Control algorithms. In our approach, a diffusion-based time series forecasting model is used to probabilistically estimate the evolution of the system's stochastic components. These forecasts are then incorporated into MPC algorithms to estimate future trajectories and optimize action selection under the uncertainty of the future. We evaluate the framework on the task of energy arbitrage, where a Battery Energy Storage System participates in the day-ahead electricity market of the New York state. Experimental results indicate that our model-based approach with a diffusion-based forecaster significantly outperforms both implementations with classical forecasting methods and model-free reinforcement learning baselines.

Title: Distilling 3D distinctive local descriptors for 6D pose estimation

Authors: Amir Hamza, Andrea Caraffa, Davide Boscaini, Fabio Poiesi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15106
Pdf URL: https://arxiv.org/pdf/2503.15106
Copy Paste: [[2503.15106]] Distilling 3D distinctive local descriptors for 6D pose estimation(https://arxiv.org/abs/2503.15106)
Keywords: robust
Abstract: Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. \textit{Can we retain GeDi's effectiveness while significantly improving its efficiency?} In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project Website: this https URL

Title: VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

Authors: Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2503.15108
Pdf URL: https://arxiv.org/pdf/2503.15108
Copy Paste: [[2503.15108]] VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making(https://arxiv.org/abs/2503.15108)
Keywords: explainability, large language model
Abstract: While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this paper, we introduce VIPER, a novel framework for multimodal instruction-based planning that integrates VLM-based perception with LLM-based reasoning. Our approach uses a modular pipeline where a frozen VLM generates textual descriptions of image observations, which are then processed by an LLM policy to predict actions based on the task goal. We fine-tune the reasoning module using behavioral cloning and reinforcement learning, improving our agent's decision-making capabilities. Experiments on the ALFWorld benchmark show that VIPER significantly outperforms state-of-the-art visual instruction-based planners while narrowing the gap with purely text-based oracles. By leveraging text as an intermediate representation, VIPER also enhances explainability, paving the way for a fine-grained analysis of perception and reasoning components.

Title: FedLWS: Federated Learning with Adaptive Layer-wise Weight Shrinking

Authors: Changlong Shi, Jinmeng Li, He Zhao, Dan dan Guo, Yi Chang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15111
Pdf URL: https://arxiv.org/pdf/2503.15111
Copy Paste: [[2503.15111]] FedLWS: Federated Learning with Adaptive Layer-wise Weight Shrinking(https://arxiv.org/abs/2503.15111)
Keywords: privacy, federate
Abstract: In Federated Learning (FL), weighted aggregation of local models is conducted to generate a new global model, and the aggregation weights are typically normalized to 1. A recent study identifies the global weight shrinking effect in FL, indicating an enhancement in the global model's generalization when the sum of weights (i.e., the shrinking factor) is smaller than 1, where how to learn the shrinking factor becomes crucial. However, principled approaches to this solution have not been carefully studied from the adequate consideration of privacy concerns and layer-wise distinctions. To this end, we propose a novel model aggregation strategy, Federated Learning with Adaptive Layer-wise Weight Shrinking (FedLWS), which adaptively designs the shrinking factor in a layer-wise manner and avoids optimizing the shrinking factors on a proxy dataset. We initially explored the factors affecting the shrinking factor during the training process. Then we calculate the layer-wise shrinking factors by considering the distinctions among each layer of the global model. FedLWS can be easily incorporated with various existing methods due to its flexibility. Extensive experiments under diverse scenarios demonstrate the superiority of our method over several state-of-the-art approaches, providing a promising tool for enhancing the global model in FL.

Title: DeCaFlow: A Deconfounding Causal Generative Model

Authors: Alejandro Almodóvar, Adrián Javaloy, Juan Parras, Santiago Zazo, Isabel Valera
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15114
Pdf URL: https://arxiv.org/pdf/2503.15114
Copy Paste: [[2503.15114]] DeCaFlow: A Deconfounding Causal Generative Model(https://arxiv.org/abs/2503.15114)
Keywords: generative
Abstract: Causal generative models (CGMs) have recently emerged as capable approaches to simulate the causal mechanisms generating our observations, enabling causal inference. Unfortunately, existing approaches either are overly restrictive, assuming the absence of hidden confounders, or lack generality, being tailored to a particular query and graph. In this work, we introduce DeCaFlow, a CGM that accounts for hidden confounders in a single amortized training process using only observational data and the causal graph. Importantly, DeCaFlow can provably identify all causal queries with a valid adjustment set or sufficiently informative proxy variables. Remarkably, for the first time to our knowledge, we show that a confounded counterfactual query is identifiable, and thus solvable by DeCaFlow, as long as its interventional counterpart is as well. Our empirical results on diverse settings (including the Ecoli70 dataset, with 3 independent hidden confounders, tens of observed variables and hundreds of causal queries) show that DeCaFlow outperforms existing approaches, while demonstrating its out-of-the-box flexibility.

Title: Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification

Authors: Shichen Li, Zhongqing Wang, Zheyu Zhao, Yue Zhang, Peifeng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15117
Pdf URL: https://arxiv.org/pdf/2503.15117
Copy Paste: [[2503.15117]] Exploring Model Editing for LLM-based Aspect-Based Sentiment Classification(https://arxiv.org/abs/2503.15117)
Keywords: large language model
Abstract: Model editing aims at selectively updating a small subset of a neural model's parameters with an interpretable strategy to achieve desired modifications. It can significantly reduce computational costs to adapt to large language models (LLMs). Given its ability to precisely target critical components within LLMs, model editing shows great potential for efficient fine-tuning applications. In this work, we investigate model editing to serve an efficient method for adapting LLMs to solve aspect-based sentiment classification. Through causal interventions, we trace and determine which neuron hidden states are essential for the prediction of the model. By performing interventions and restorations on each component of an LLM, we identify the importance of these components for aspect-based sentiment classification. Our findings reveal that a distinct set of mid-layer representations is essential for detecting the sentiment polarity of given aspect words. Leveraging these insights, we develop a model editing approach that focuses exclusively on these critical parts of the LLM, leading to a more efficient method for adapting LLMs. Our in-domain and out-of-domain experiments demonstrate that this approach achieves competitive results compared to the currently strongest methods with significantly fewer trainable parameters, highlighting a more efficient and interpretable fine-tuning strategy.

Title: Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation

Authors: Haoyu Ji, Bowen Chen, Weihong Ren, Wenze Huang, Zhihao Yang, Zhiyong Wang, Honghai Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15126
Pdf URL: https://arxiv.org/pdf/2503.15126
Copy Paste: [[2503.15126]] Text-Derived Relational Graph-Enhanced Network for Skeleton-Based Action Segmentation(https://arxiv.org/abs/2503.15126)
Keywords: large language model, segmentation
Abstract: Skeleton-based Temporal Action Segmentation (STAS) aims to segment and recognize various actions from long, untrimmed sequences of human skeletal movements. Current STAS methods typically employ spatio-temporal modeling to establish dependencies among joints as well as frames, and utilize one-hot encoding with cross-entropy loss for frame-wise classification supervision. However, these methods overlook the intrinsic correlations among joints and actions within skeletal features, leading to a limited understanding of human movements. To address this, we propose a Text-Derived Relational Graph-Enhanced Network (TRG-Net) that leverages prior graphs generated by Large Language Models (LLM) to enhance both modeling and supervision. For modeling, the Dynamic Spatio-Temporal Fusion Modeling (DSFM) method incorporates Text-Derived Joint Graphs (TJG) with channel- and frame-level dynamic adaptation to effectively model spatial relations, while integrating spatio-temporal core features during temporal modeling. For supervision, the Absolute-Relative Inter-Class Supervision (ARIS) method employs contrastive learning between action features and text embeddings to regularize the absolute class distributions, and utilizes Text-Derived Action Graphs (TAG) to capture the relative inter-class relationships among action features. Additionally, we propose a Spatial-Aware Enhancement Processing (SAEP) method, which incorporates random joint occlusion and axial rotation to enhance spatial generalization. Performance evaluations on four public datasets demonstrate that TRG-Net achieves state-of-the-art results.

Title: Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors

Authors: Dominik Macko, Robert Moro, Ivan Srba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15128
Pdf URL: https://arxiv.org/pdf/2503.15128
Copy Paste: [[2503.15128]] Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors(https://arxiv.org/abs/2503.15128)
Keywords: robust
Abstract: Since the proliferation of LLMs, there have been concerns about their misuse for harmful content creation and spreading. Recent studies justify such fears, providing evidence of LLM vulnerabilities and high potential of their misuse. Humans are no longer able to distinguish between high-quality machine-generated and authentic human-written texts. Therefore, it is crucial to develop automated means to accurately detect machine-generated content. It would enable to identify such content in online information space, thus providing an additional information about its credibility. This work addresses the problem by proposing a robust fine-tuning process of LLMs for the detection task, making the detectors more robust against obfuscation and more generalizable to out-of-distribution data.

Title: EmoGRACE: Aspect-based emotion analysis for social media data

Authors: Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15133
Pdf URL: https://arxiv.org/pdf/2503.15133
Copy Paste: [[2503.15133]] EmoGRACE: Aspect-based emotion analysis for social media data(https://arxiv.org/abs/2503.15133)
Keywords: extraction
Abstract: While sentiment analysis has advanced from sentence to aspect-level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks and the increased complexity of emotion classes in contrast to binary sentiments. This paper addresses these gaps, by generating a first ABEA training dataset, consisting of 2,621 English Tweets, and fine-tuning a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC). The dataset annotation process was based on the hierarchical emotion theory by Shaver et al. [1] and made use of group annotation and majority voting strategies to facilitate label consistency. The resulting dataset contained aspect-level emotion labels for Anger, Sadness, Happiness, Fear, and a None class. Using the new ABEA training dataset, the state-of-the-art ABSA model GRACE by Luo et al. [2] was fine-tuned for ABEA. The results reflected a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. The limiting factors for model performance were broadly identified as the small training dataset size coupled with the increased task complexity, causing model overfitting and limited abilities to generalize well on new data.

Title: Machine learning surrogate models of many-body dispersion interactions in polymer melts

Authors: Zhaoxiang Shen, Raúl I. Sosa, Jakub Lengiewicz, Alexandre Tkatchenko, Stéphane P.A. Bordas
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2503.15149
Pdf URL: https://arxiv.org/pdf/2503.15149
Copy Paste: [[2503.15149]] Machine learning surrogate models of many-body dispersion interactions in polymer melts(https://arxiv.org/abs/2503.15149)
Keywords: robust
Abstract: Accurate prediction of many-body dispersion (MBD) interactions is essential for understanding the van der Waals forces that govern the behavior of many complex molecular systems. However, the high computational cost of MBD calculations limits their direct application in large-scale simulations. In this work, we introduce a machine learning surrogate model specifically designed to predict MBD forces in polymer melts, a system that demands accurate MBD description and offers structural advantages for machine learning approaches. Our model is based on a trimmed SchNet architecture that selectively retains the most relevant atomic connections and incorporates trainable radial basis functions for geometric encoding. We validate our surrogate model on datasets from polyethylene, polypropylene, and polyvinyl chloride melts, demonstrating high predictive accuracy and robust generalization across diverse polymer systems. In addition, the model captures key physical features, such as the characteristic decay behavior of MBD interactions, providing valuable insights for optimizing cutoff strategies. Characterized by high computational efficiency, our surrogate model enables practical incorporation of MBD effects into large-scale molecular simulations.

Title: Preference Construction: A Bayesian Interactive Preference Elicitation Framework Based on Monte Carlo Tree Search

Authors: Yan Wang, Jiapeng Liu, Milosz Kadziński, Xiuwu Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15150
Pdf URL: https://arxiv.org/pdf/2503.15150
Copy Paste: [[2503.15150]] Preference Construction: A Bayesian Interactive Preference Elicitation Framework Based on Monte Carlo Tree Search(https://arxiv.org/abs/2503.15150)
Keywords: robust
Abstract: We present a novel preference learning framework to capture participant preferences efficiently within limited interaction rounds. It involves three main contributions. First, we develop a variational Bayesian approach to infer the participant's preference model by estimating posterior distributions and managing uncertainty from limited information. Second, we propose an adaptive questioning policy that maximizes cumulative uncertainty reduction, formulating questioning as a finite Markov decision process and using Monte Carlo Tree Search to prioritize promising question trajectories. By considering long-term effects and leveraging the efficiency of the Bayesian approach, the policy avoids shortsightedness. Third, we apply the framework to Multiple Criteria Decision Aiding, with pairwise comparison as the preference information and an additive value function as the preference model. We integrate the reparameterization trick to address high-variance issues, enhancing robustness and efficiency. Computational studies on real-world and synthetic datasets demonstrate the framework's practical usability, outperforming baselines in capturing preferences and achieving superior uncertainty reduction within limited interactions.

Title: ARC: Anchored Representation Clouds for High-Resolution INR Classification

Authors: Joost Luijmes, Alexander Gielisse, Roman Knyazhitskiy, Jan van Gemert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15156
Pdf URL: https://arxiv.org/pdf/2503.15156
Copy Paste: [[2503.15156]] ARC: Anchored Representation Clouds for High-Resolution INR Classification(https://arxiv.org/abs/2503.15156)
Keywords: robust
Abstract: Implicit neural representations (INRs) encode signals in neural network weights as a memory-efficient representation, decoupling sampling resolution from the associated resource costs. Current INR image classification methods are demonstrated on low-resolution data and are sensitive to image-space transformations. We attribute these issues to the global, fully-connected MLP neural network architecture encoding of current INRs, which lack mechanisms for local representation: MLPs are sensitive to absolute image location and struggle with high-frequency details. We propose ARC: Anchored Representation Clouds, a novel INR architecture that explicitly anchors latent vectors locally in image-space. By introducing spatial structure to the latent vectors, ARC captures local image data which in our testing leads to state-of-the-art implicit image classification of both low- and high-resolution images and increased robustness against image-space translation. Code can be found at this https URL.

Title: UltraFlwr -- An Efficient Federated Medical and Surgical Object Detection Framework

Authors: Yang Li, Soumya Snigdha Kundu, Maxence Boels, Toktam Mahmoodi, Sebastien Ourselin, Tom Vercauteren, Prokar Dasgupta, Jonathan Shapey, Alejandro Granados
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15161
Pdf URL: https://arxiv.org/pdf/2503.15161
Copy Paste: [[2503.15161]] UltraFlwr -- An Efficient Federated Medical and Surgical Object Detection Framework(https://arxiv.org/abs/2503.15161)
Keywords: federate
Abstract: Object detection shows promise for medical and surgical applications such as cell counting and tool tracking. However, its faces multiple real-world edge deployment challenges including limited high-quality annotated data, data sharing restrictions, and computational constraints. In this work, we introduce UltraFlwr, a framework for federated medical and surgical object detection. By leveraging Federated Learning (FL), UltraFlwr enables decentralized model training across multiple sites without sharing raw data. To further enhance UltraFlwr's efficiency, we propose YOLO-PA, a set of novel Partial Aggregation (PA) strategies specifically designed for YOLO models in FL. YOLO-PA significantly reduces communication overhead by up to 83% per round while maintaining performance comparable to Full Aggregation (FA) strategies. Our extensive experiments on BCCD and m2cai16-tool-locations datasets demonstrate that YOLO-PA not only provides better client models compared to client-wise centralized training and FA strategies, but also facilitates efficient training and deployment across resource-constrained edge devices. Further, we also establish one of the first benchmarks in federated medical and surgical object detection. This paper advances the feasibility of training and deploying detection models on the edge, making federated object detection more practical for time-critical and resource-constrained medical and surgical applications. UltraFlwr is publicly available at this https URL.

Title: Global Group Fairness in Federated Learning via Function Tracking

Authors: Yves Rychener, Daniel Kuhn, Yifan Hu
Subjects: cs.LG, math.OC, stat.ME
Abstract URL: https://arxiv.org/abs/2503.15163
Pdf URL: https://arxiv.org/pdf/2503.15163
Copy Paste: [[2503.15163]] Global Group Fairness in Federated Learning via Function Tracking(https://arxiv.org/abs/2503.15163)
Keywords: privacy, federate, fair
Abstract: We investigate group fairness regularizers in federated learning, aiming to train a globally fair model in a distributed setting. Ensuring global fairness in distributed training presents unique challenges, as fairness regularizers typically involve probability metrics between distributions across all clients and are not naturally separable by client. To address this, we introduce a function-tracking scheme for the global fairness regularizer based on a Maximum Mean Discrepancy (MMD), which incurs a small communication overhead. This scheme seamlessly integrates into most federated learning algorithms while preserving rigorous convergence guarantees, as demonstrated in the context of FedAvg. Additionally, when enforcing differential privacy, the kernel-based MMD regularization enables straightforward analysis through a change of kernel, leveraging an intuitive interpretation of kernel convolution. Numerical experiments confirm our theoretical insights.

Title: Comparing Llama3 and DeepSeekR1 on Biomedical Text Classification Tasks

Authors: Yuting Guo, Abeed Sarker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15169
Pdf URL: https://arxiv.org/pdf/2503.15169
Copy Paste: [[2503.15169]] Comparing Llama3 and DeepSeekR1 on Biomedical Text Classification Tasks(https://arxiv.org/abs/2503.15169)
Keywords: large language model
Abstract: This study compares the performance of two open-source large language models (LLMs)-Llama3-70B and DeepSeekR1-distill-Llama3-70B-on six biomedical text classification tasks. Four tasks involve data from social media, while two tasks focus on clinical notes from electronic health records, and all experiments were performed in zero-shot settings. Performance metrics, including precision, recall, and F1 scores, were measured for each task, along with their 95% confidence intervals. Results demonstrated that DeepSeekR1-distill-Llama3-70B generally performs better in terms of precision on most tasks, with mixed results on recall. While the zero-shot LLMs demonstrated high F1 scores for some tasks, they grossly underperformed on others, for data from both sources. The findings suggest that model selection should be guided by the specific requirements of the health-related text classification tasks, particularly when considering the precision-recall trade-offs, and that, in the presence of annotated data, supervised classification approaches may be more reliable than zero-shot LLMs.

Title: Benchmarking Large Language Models for Handwritten Text Recognition

Authors: Giorgia Crosilla, Lukas Klic, Giovanni Colavizza
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15195
Pdf URL: https://arxiv.org/pdf/2503.15195
Copy Paste: [[2503.15195]] Benchmarking Large Language Models for Handwritten Text Recognition(https://arxiv.org/abs/2503.15195)
Keywords: large language model
Abstract: Traditional machine learning models for Handwritten Text Recognition (HTR) rely on supervised training, requiring extensive manual annotations, and often produce errors due to the separation between layout and text processing. In contrast, Multimodal Large Language Models (MLLMs) offer a general approach to recognizing diverse handwriting styles without the need for model-specific training. The study benchmarks various proprietary and open-source LLMs against Transkribus models, evaluating their performance on both modern and historical datasets written in English, French, German, and Italian. In addition, emphasis is placed on testing the models' ability to autonomously correct previously generated outputs. Findings indicate that proprietary models, especially Claude 3.5 Sonnet, outperform open-source alternatives in zero-shot settings. MLLMs achieve excellent results in recognizing modern handwriting and exhibit a preference for the English language due to their pre-training dataset composition. Comparisons with Transkribus show no consistent advantage for either approach. Moreover, LLMs demonstrate limited ability to autonomously correct errors in zero-shot transcriptions.

Title: Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization

Authors: Feifei Li, Mi Zhang, Yiming Sun, Min Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15197
Pdf URL: https://arxiv.org/pdf/2503.15197
Copy Paste: [[2503.15197]] Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization(https://arxiv.org/abs/2503.15197)
Keywords: diffusion
Abstract: Text-to-image diffusion models have achieved state-of-the-art results in synthesis tasks; however, there is a growing concern about their potential misuse in creating harmful content. To mitigate these risks, post-hoc model intervention techniques, such as concept unlearning and safety guidance, have been developed. However, fine-tuning model weights or adapting the hidden states of the diffusion model operates in an uninterpretable way, making it unclear which part of the intermediate variables is responsible for unsafe generation. These interventions severely affect the sampling trajectory when erasing harmful concepts from complex, multi-concept prompts, thus hindering their practical use in real-world settings. In this work, we propose the safe generation framework Detect-and-Guide (DAG), leveraging the internal knowledge of diffusion models to perform self-diagnosis and fine-grained self-regulation during the sampling process. DAG first detects harmful concepts from noisy latents using refined cross-attention maps of optimized tokens, then applies safety guidance with adaptive strength and editing regions to negate unsafe generation. The optimization only requires a small annotated dataset and can provide precise detection maps with generalizability and concept specificity. Moreover, DAG does not require fine-tuning of diffusion models, and therefore introduces no loss to their generation diversity. Experiments on erasing sexual content show that DAG achieves state-of-the-art safe generation performance, balancing harmfulness mitigation and text-following performance on multi-concept real-world prompts.

Title: DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Authors: Jiazhe Guo, Yikang Ding, Xiwu Chen, Shuo Chen, Bohan Li, Yingshuang Zou, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Zhiheng Li, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15208
Pdf URL: https://arxiv.org/pdf/2503.15208
Copy Paste: [[2503.15208]] DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation(https://arxiv.org/abs/2503.15208)
Keywords: diffusion, generative
Abstract: Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both accurate reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations.

Title: Kolmogorov-Arnold Network for Transistor Compact Modeling

Authors: Rodion Novkin, Hussam Amrouch
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15209
Pdf URL: https://arxiv.org/pdf/2503.15209
Copy Paste: [[2503.15209]] Kolmogorov-Arnold Network for Transistor Compact Modeling(https://arxiv.org/abs/2503.15209)
Keywords: robust, interpretability
Abstract: Neural network (NN)-based transistor compact modeling has recently emerged as a transformative solution for accelerating device modeling and SPICE circuit simulations. However, conventional NN architectures, despite their widespread adoption in state-of-the-art methods, primarily function as black-box problem solvers. This lack of interpretability significantly limits their capacity to extract and convey meaningful insights into learned data patterns, posing a major barrier to their broader adoption in critical modeling tasks. This work introduces, for the first time, Kolmogorov-Arnold network (KAN) for the transistor - a groundbreaking NN architecture that seamlessly integrates interpretability with high precision in physics-based function modeling. We systematically evaluate the performance of KAN and Fourier KAN for FinFET compact modeling, benchmarking them against the golden industry-standard compact model and the widely used MLP architecture. Our results reveal that KAN and FKAN consistently achieve superior prediction accuracy for critical figures of merit, including gate current, drain charge, and source charge. Furthermore, we demonstrate and improve the unique ability of KAN to derive symbolic formulas from learned data patterns - a capability that not only enhances interpretability but also facilitates in-depth transistor analysis and optimization. This work highlights the transformative potential of KAN in bridging the gap between interpretability and precision in NN-driven transistor compact modeling. By providing a robust and transparent approach to transistor modeling, KAN represents a pivotal advancement for the semiconductor industry as it navigates the challenges of advanced technology scaling.

Title: CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification

Authors: Wenlong Yu, Qilong Wang, Chuang Liu, Dong Li, Qinghua Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15234
Pdf URL: https://arxiv.org/pdf/2503.15234
Copy Paste: [[2503.15234]] CoE: Chain-of-Explanation via Automatic Visual Concept Circuit Description and Polysemanticity Quantification(https://arxiv.org/abs/2503.15234)
Keywords: interpretability, explainability
Abstract: Explainability is a critical factor influencing the wide deployment of deep vision models (DVMs). Concept-based post-hoc explanation methods can provide both global and local insights into model decisions. However, current methods in this field face challenges in that they are inflexible to automatically construct accurate and sufficient linguistic explanations for global concepts and local circuits. Particularly, the intrinsic polysemanticity in semantic Visual Concepts (VCs) impedes the interpretability of concepts and DVMs, which is underestimated severely. In this paper, we propose a Chain-of-Explanation (CoE) approach to address these issues. Specifically, CoE automates the decoding and description of VCs to construct global concept explanation datasets. Further, to alleviate the effect of polysemanticity on model explainability, we design a concept polysemanticity disentanglement and filtering mechanism to distinguish the most contextually relevant concept atoms. Besides, a Concept Polysemanticity Entropy (CPE), as a measure of model interpretability, is formulated to quantify the degree of concept uncertainty. The modeling of deterministic concepts is upgraded to uncertain concept atom distributions. Finally, CoE automatically enables linguistic local explanations of the decision-making process of DVMs by tracing the concept circuit. GPT-4o and human-based experiments demonstrate the effectiveness of CPE and the superiority of CoE, achieving an average absolute improvement of 36% in terms of explainability scores.

Title: Exploring Large Language Models for Word Games:Who is the Spy?

Authors: Chentian Wei, Jiewei Chen, Jinzhu Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15235
Pdf URL: https://arxiv.org/pdf/2503.15235
Copy Paste: [[2503.15235]] Exploring Large Language Models for Word Games:Who is the Spy?(https://arxiv.org/abs/2503.15235)
Keywords: large language model
Abstract: Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule-based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training-free framework. "Shei Shi Wo Di" or "Who is the Spy" in English, is a classic word game. Using this game as an example, we introduce a Chain-of-Thought (CoT)-based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework's performance based on game success rates and the accuracy of the LLM agents' analytical results. Experimental results affirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at this https URL.

Title: Your Signal, Their Data: An Empirical Privacy Analysis of Wireless-scanning SDKs in Android

Authors: Aniketh Girish, Joel Reardon, Juan Tapiador, Srdjan Matic, Narseo Vallina-Rodriguez
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2503.15238
Pdf URL: https://arxiv.org/pdf/2503.15238
Copy Paste: [[2503.15238]] Your Signal, Their Data: An Empirical Privacy Analysis of Wireless-scanning SDKs in Android(https://arxiv.org/abs/2503.15238)
Keywords: privacy
Abstract: Mobile apps frequently use Bluetooth Low Energy (BLE) and WiFi scanning permissions to discover nearby devices like peripherals and connect to WiFi Access Points (APs). However, wireless interfaces also serve as a covert proxy for geolocation data, enabling continuous user tracking and profiling. This includes technologies like BLE beacons, which are BLE devices broadcasting unique identifiers to determine devices' indoor physical locations; such beacons are easily found in shopping centres. Despite the widespread use of wireless scanning APIs and their potential for privacy abuse, the interplay between commercial mobile SDKs with wireless sensing and beaconing technologies remains largely unexplored. In this work, we conduct the first systematic analysis of 52 wireless-scanning SDKs, revealing their data collection practices and privacy risks. We develop a comprehensive analysis pipeline that enables us to detect beacon scanning capabilities, inject wireless events to trigger app behaviors, and monitor runtime execution on instrumented devices. Our findings show that 86% of apps integrating these SDKs collect at least one sensitive data type, including device and user identifiers such as AAID, email, along with GPS coordinates, WiFi and Bluetooth scan results. We uncover widespread SDK-to-SDK data sharing and evidence of ID bridging, where persistent and resettable identifiers are shared and synchronized within SDKs embedded in applications to potentially construct detailed mobility profiles, compromising user anonymity and enabling long-term tracking. We provide evidence of key actors engaging in these practices and conclude by proposing mitigation strategies such as stronger SDK sandboxing, stricter enforcement of platform policies, and improved transparency mechanisms to limit unauthorized tracking.

Title: BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?

Authors: Pierre Chambon, Baptiste Roziere, Benoit Sagot, Gabriel Synnaeve
Subjects: cs.CL, cs.AI, cs.CC
Abstract URL: https://arxiv.org/abs/2503.15242
Pdf URL: https://arxiv.org/pdf/2503.15242
Copy Paste: [[2503.15242]] BigO(Bench) -- Can LLMs Generate Code with Controlled Time and Space Complexity?(https://arxiv.org/abs/2503.15242)
Keywords: generative
Abstract: We introduce BigO(Bench), a novel coding benchmark designed to evaluate the capabilities of generative language models in understanding and generating code with specified time and space complexities. This benchmark addresses the gap in current evaluations that often overlook the ability of models to comprehend and produce code constrained by computational complexity. BigO(Bench) includes tooling to infer the algorithmic complexity of any Python function from profiling measurements, including human- or LLM-generated solutions. BigO(Bench) also includes of set of 3,105 coding problems and 1,190,250 solutions from Code Contests annotated with inferred (synthetic) time and space complexity labels from the complexity framework, as well as corresponding runtime and memory footprint values for a large set of input sizes. We present results from evaluating multiple state-of-the-art language models on this benchmark, highlighting their strengths and weaknesses in handling complexity requirements. In particular, token-space reasoning models are unrivaled in code generation but not in complexity understanding, hinting that they may not generalize well to tasks for which no reward was given at training time.

Title: ImputeGAP: A Comprehensive Library for Time Series Imputation

Authors: Quentin Nater, Mourad Khayati, Jacques Pasquier
Subjects: cs.LG, cs.DB
Abstract URL: https://arxiv.org/abs/2503.15250
Pdf URL: https://arxiv.org/pdf/2503.15250
Copy Paste: [[2503.15250]] ImputeGAP: A Comprehensive Library for Time Series Imputation(https://arxiv.org/abs/2503.15250)
Keywords: explainability
Abstract: With the prevalence of sensor failures, imputation--the process of estimating missing values--has emerged as the cornerstone of time series data preparation. While numerous imputation algorithms have been developed to address these data gaps, existing libraries provide limited support. Furthermore, they often lack the ability to simulate realistic patterns of time series missing data and fail to account for the impact of imputation on subsequent downstream analysis. This paper introduces ImputeGAP, a comprehensive library for time series imputation that supports a diverse range of imputation methods and modular missing data simulation catering to datasets with varying characteristics. The library includes extensive customization options, such as automated hyperparameter tuning, benchmarking, explainability, downstream evaluation, and compatibility with popular time series frameworks.

Title: DEPT: Deep Extreme Point Tracing for Ultrasound Image Segmentation

Authors: Lei Shi, Xi Fang, Naiyu Wang, Junxing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15260
Pdf URL: https://arxiv.org/pdf/2503.15260
Copy Paste: [[2503.15260]] DEPT: Deep Extreme Point Tracing for Ultrasound Image Segmentation(https://arxiv.org/abs/2503.15260)
Keywords: segmentation
Abstract: Automatic medical image segmentation plays a crucial role in computer aided diagnosis. However, fully supervised learning approaches often require extensive and labor-intensive annotation efforts. To address this challenge, weakly supervised learning methods, particularly those using extreme points as supervisory signals, have the potential to offer an effective solution. In this paper, we introduce Deep Extreme Point Tracing (DEPT) integrated with Feature-Guided Extreme Point Masking (FGEPM) algorithm for ultrasound image segmentation. Notably, our method generates pseudo labels by identifying the lowest-cost path that connects all extreme points on the feature map-based cost matrix. Additionally, an iterative training strategy is proposed to refine pseudo labels progressively, enabling continuous network improvement. Experimental results on two public datasets demonstrate the effectiveness of our proposed method. The performance of our method approaches that of the fully supervised method and outperforms several existing weakly supervised methods.

Title: LEGION: Learning to Ground and Explain for Synthetic Image Detection

Authors: Hengrui Kang, Siwei Wen, Zichen Wen, Junyan Ye, Weijia Li, Peilin Feng, Baichuan Zhou, Bin Wang, Dahua Lin, Linfeng Zhang, Conghui He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15264
Pdf URL: https://arxiv.org/pdf/2503.15264
Copy Paste: [[2503.15264]] LEGION: Learning to Ground and Explain for Synthetic Image Detection(https://arxiv.org/abs/2503.15264)
Keywords: interpretability, generative, large language model, segmentation
Abstract: The rapid advancements in generative technology have emerged as a double-edged sword. While offering powerful tools that enhance convenience, they also pose significant social concerns. As defenders, current synthetic image detection methods often lack artifact-level textual interpretability and are overly focused on image manipulation detection, and current datasets usually suffer from outdated generators and a lack of fine-grained annotations. In this paper, we introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. Furthermore, we propose LEGION (LEarning to Ground and explain for Synthetic Image detectiON), a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation. Building upon this capability, we further explore LEGION as a controller, integrating it into image refinement pipelines to guide the generation of higher-quality and more realistic images. Extensive experiments show that LEGION outperforms existing methods across multiple benchmarks, particularly surpassing the second-best traditional expert on SynthScars by 3.31% in mIoU and 7.75% in F1 score. Moreover, the refined images generated under its guidance exhibit stronger alignment with human preferences. The code, model, and dataset will be released.

Title: Learning to quantify graph nodes

Authors: Alessio Micheli, Alejandro Moreo, Marco Podda, Fabrizio Sebastiani, William Simoni, Domenico Tortorella
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15267
Pdf URL: https://arxiv.org/pdf/2503.15267
Copy Paste: [[2503.15267]] Learning to quantify graph nodes(https://arxiv.org/abs/2503.15267)
Keywords: robust
Abstract: Network Quantification is the problem of estimating the class proportions in unlabeled subsets of graph nodes. When prior probability shift is at play, this task cannot be effectively addressed by first classifying the nodes and then counting the class predictions. In addition, unlike non-relational quantification on i.i.d. datapoints, Network Quantification demands enhanced flexibility to capture a broad range of connectivity patterns, resilience to the challenge of heterophily, and efficiency to scale to larger networks. To meet these stringent requirements we introduce XNQ, a novel method that synergizes the flexibility and efficiency of the unsupervised node embeddings computed by randomized recursive Graph Neural Networks, with an Expectation-Maximization algorithm that provides a robust quantification-aware adjustment to the output probabilities of a calibrated node classifier. We validate the design choices underpinning our method through comprehensive ablation experiments. In an extensive evaluation, we find that our approach consistently and significantly improves on the best Network Quantification methods to date, thereby setting the new state of the art for this challenging task. Simultaneously, it provides a training speed-up of up to 10x-100x over other graph learning based methods.

Title: MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

Authors: David Wan, Justin Chih-Yao Chen, Elias Stengel-Eskin, Mohit Bansal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15272
Pdf URL: https://arxiv.org/pdf/2503.15272
Copy Paste: [[2503.15272]] MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration(https://arxiv.org/abs/2503.15272)
Keywords: large language model
Abstract: Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.

Title: TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models

Authors: Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu, Tzu-Ling Lin, Hong-Han Shuai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15283
Pdf URL: https://arxiv.org/pdf/2503.15283
Copy Paste: [[2503.15283]] TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning in Text-to-Image Models(https://arxiv.org/abs/2503.15283)
Keywords: robust
Abstract: Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often partially utilize image inputs, focusing on specific elements like objects or styles, or they experience a decline in generation quality with complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without the need for additional training. Our method capitalizes on the MM-DiT architecture, in which we point out that textual tokens can implicitly learn visual information from vision tokens. We enhance this interaction by extracting a condensed visual representation from reference images, facilitating selective information sharing through Reference Contextual Masking -- this technique confines the usage of contextual tokens to instruction-relevant visual information. Additionally, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent references for each vision token. Addressing the gap in TI2I evaluation, we also introduce the FG-TI2I Bench, a comprehensive benchmark tailored for TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness in handling complex image-generation tasks.

Title: EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds

Authors: Yuanchao Yue, Hui Yuan, Qinglong Miao, Xiaolong Mao, Raouf Hamzaoui, Peter Eisert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15284
Pdf URL: https://arxiv.org/pdf/2503.15284
Copy Paste: [[2503.15284]] EdgeRegNet: Edge Feature-based Multimodal Registration Network between Images and LiDAR Point Clouds(https://arxiv.org/abs/2503.15284)
Keywords: robust
Abstract: Cross-modal data registration has long been a critical task in computer vision, with extensive applications in autonomous driving and robotics. Accurate and robust registration methods are essential for aligning data from different modalities, forming the foundation for multimodal sensor data fusion and enhancing perception systems' accuracy and reliability. The registration task between 2D images captured by cameras and 3D point clouds captured by Light Detection and Ranging (LiDAR) sensors is usually treated as a visual pose estimation problem. High-dimensional feature similarities from different modalities are leveraged to identify pixel-point correspondences, followed by pose estimation techniques using least squares methods. However, existing approaches often resort to downsampling the original point cloud and image data due to computational constraints, inevitably leading to a loss in precision. Additionally, high-dimensional features extracted using different feature extractors from various modalities require specific techniques to mitigate cross-modal differences for effective matching. To address these challenges, we propose a method that uses edge information from the original point clouds and images for cross-modal registration. We retain crucial information from the original data by extracting edge points and pixels, enhancing registration accuracy while maintaining computational efficiency. The use of edge points and edge pixels allows us to introduce an attention-based feature exchange block to eliminate cross-modal disparities. Furthermore, we incorporate an optimal matching layer to improve correspondence identification. We validate the accuracy of our method on the KITTI and nuScenes datasets, demonstrating its state-of-the-art performance.

Title: PAPI-Reg: Patch-to-Pixel Solution for Efficient Cross-Modal Registration between LiDAR Point Cloud and Camera Image

Authors: Yuanchao Yue, Zhengxin Li, Wei Zhang, Hui Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15285
Pdf URL: https://arxiv.org/pdf/2503.15285
Copy Paste: [[2503.15285]] PAPI-Reg: Patch-to-Pixel Solution for Efficient Cross-Modal Registration between LiDAR Point Cloud and Camera Image(https://arxiv.org/abs/2503.15285)
Keywords: extraction
Abstract: The primary requirement for cross-modal data fusion is the precise alignment of data from different sensors. However, the calibration between LiDAR point clouds and camera images is typically time-consuming and needs external calibration board or specific environmental features. Cross-modal registration effectively solves this problem by aligning the data directly without requiring external calibration. However, due to the domain gap between the point cloud and the image, existing methods rarely achieve satisfactory registration accuracy while maintaining real-time performance. To address this issue, we propose a framework that projects point clouds into several 2D representations for matching with camera images, which not only leverages the geometric characteristic of LiDAR point clouds more effectively but also bridge the domain gap between the point cloud and image. Moreover, to tackle the challenges of cross modal differences and the limited overlap between LiDAR point clouds and images in the image matching task, we introduce a multi-scale feature extraction network to effectively extract features from both camera images and the projection maps of LiDAR point cloud. Additionally, we propose a patch-to-pixel matching network to provide more effective supervision and achieve higher accuracy. We validate the performance of our model through experiments on the KITTI and nuScenes datasets. Our network achieves real-time performance and extremely high registration accuracy. On the KITTI dataset, our model achieves a registration accuracy rate of over 99\%.

Title: TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification

Authors: Junnan Zhu, Min Xiao, Yining Wang, Feifei Zhai, Yu Zhou, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15289
Pdf URL: https://arxiv.org/pdf/2503.15289
Copy Paste: [[2503.15289]] TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification(https://arxiv.org/abs/2503.15289)
Keywords: robust
Abstract: LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains such as healthcare, law, and news, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation.

Title: Test-Time Backdoor Detection for Object Detection Models

Authors: Hangtao Zhang, Yichen Wang, Shihui Yan, Chenyu Zhu, Ziqi Zhou, Linshan Hou, Shengshan Hu, Minghui Li, Yanjun Zhang, Leo Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15293
Pdf URL: https://arxiv.org/pdf/2503.15293
Copy Paste: [[2503.15293]] Test-Time Backdoor Detection for Object Detection Models(https://arxiv.org/abs/2503.15293)
Keywords: defense, attack
Abstract: Object detection models are vulnerable to backdoor attacks, where attackers poison a small subset of training samples by embedding a predefined trigger to manipulate prediction. Detecting poisoned samples (i.e., those containing triggers) at test time can prevent backdoor activation. However, unlike image classification tasks, the unique characteristics of object detection -- particularly its output of numerous objects -- pose fresh challenges for backdoor detection. The complex attack effects (e.g., "ghost" object emergence or "vanishing" object) further render current defenses fundamentally inadequate. To this end, we design TRAnsformation Consistency Evaluation (TRACE), a brand-new method for detecting poisoned samples at test time in object detection. Our journey begins with two intriguing observations: (1) poisoned samples exhibit significantly more consistent detection results than clean ones across varied backgrounds. (2) clean samples show higher detection consistency when introduced to different focal information. Based on these phenomena, TRACE applies foreground and background transformations to each test sample, then assesses transformation consistency by calculating the variance in objects confidences. TRACE achieves black-box, universal backdoor detection, with extensive experiments showing a 30% improvement in AUROC over state-of-the-art defenses and resistance to adaptive attacks.

Title: DCA: Dividing and Conquering Amnesia in Incremental Object Detection

Authors: Aoting Zhang, Dongbao Yang, Chang Liu, Xiaopeng Hong, Miao Shang, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15295
Pdf URL: https://arxiv.org/pdf/2503.15295
Copy Paste: [[2503.15295]] DCA: Dividing and Conquering Amnesia in Incremental Object Detection(https://arxiv.org/abs/2503.15295)
Keywords: transformer
Abstract: Incremental object detection (IOD) aims to cultivate an object detector that can continuously localize and recognize novel classes while preserving its performance on previous classes. Existing methods achieve certain success by improving knowledge distillation and exemplar replay for transformer-based detection frameworks, but the intrinsic forgetting mechanisms remain underexplored. In this paper, we dive into the cause of forgetting and discover forgetting imbalance between localization and recognition in transformer-based IOD, which means that localization is less-forgetting and can generalize to future classes, whereas catastrophic forgetting occurs primarily on recognition. Based on these insights, we propose a Divide-and-Conquer Amnesia (DCA) strategy, which redesigns the transformer-based IOD into a localization-then-recognition process. DCA can well maintain and transfer the localization ability, leaving decoupled fragile recognition to be specially conquered. To reduce feature drift in recognition, we leverage semantic knowledge encoded in pre-trained language models to anchor class representations within a unified feature space across incremental tasks. This involves designing a duplex classifier fusion and embedding class semantic features into the recognition decoding process in the form of queries. Extensive experiments validate that our approach achieves state-of-the-art performance, especially for long-term incremental scenarios. For example, under the four-step setting on MS-COCO, our DCA strategy significantly improves the final AP by 6.9%.

Title: Inside-Out: Hidden Factual Knowledge in LLMs

Authors: Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpector, Jonathan Herzig, Roi Reichart
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15299
Pdf URL: https://arxiv.org/pdf/2503.15299
Copy Paste: [[2503.15299]] Inside-Out: Hidden Factual Knowledge in LLMs(https://arxiv.org/abs/2503.15299)
Keywords: large language model
Abstract: This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. While a few studies hint at this possibility, none has clearly defined or demonstrated this phenomenon. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher. This gives rise to external and internal knowledge, depending on the information used to score individual answer candidates: either the model's observable token-level probabilities or its intermediate computations. Hidden knowledge arises when internal knowledge exceeds external knowledge. We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup. Our results indicate that: (1) LLMs consistently encode more factual knowledge internally than what they express externally, with an average gap of 40%. (2) Surprisingly, some knowledge is so deeply hidden that a model can internally know an answer perfectly, yet fail to generate it even once, despite large-scale repeated sampling of 1,000 answers. This reveals fundamental limitations in the generation capabilities of LLMs, which (3) puts a practical constraint on scaling test-time compute via repeated answer sampling in closed-book QA: significant performance improvements remain inaccessible because some answers are practically never sampled, yet if they were, we would be guaranteed to rank them first.

Title: SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes

Authors: Weixiao Gao, Liangliang Nan, Hugo Ledoux
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15300
Pdf URL: https://arxiv.org/pdf/2503.15300
Copy Paste: [[2503.15300]] SUM Parts: Benchmarking Part-Level Semantic Segmentation of Urban Meshes(https://arxiv.org/abs/2503.15300)
Keywords: segmentation
Abstract: Semantic segmentation in urban scene analysis has mainly focused on images or point clouds, while textured meshes - offering richer spatial representation - remain underexplored. This paper introduces SUM Parts, the first large-scale dataset for urban textured meshes with part-level semantic labels, covering about 2.5 km2 with 21 classes. The dataset was created using our own annotation tool, which supports both face- and texture-based annotations with efficient interactive selection. We also provide a comprehensive evaluation of 3D semantic segmentation and interactive annotation methods on this dataset. Our project page is available at this https URL.

Title: TruthLens:A Training-Free Paradigm for DeepFake Detection

Authors: Ritabrata Chakraborty, Rajatsubhra Chakraborty, Ali Khaleghi Rahimian, Thomas MacDougall
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15342
Pdf URL: https://arxiv.org/pdf/2503.15342
Copy Paste: [[2503.15342]] TruthLens:A Training-Free Paradigm for DeepFake Detection(https://arxiv.org/abs/2503.15342)
Keywords: interpretability, explainability, large language model
Abstract: The proliferation of synthetic images generated by advanced AI models poses significant challenges in identifying and understanding manipulated visual content. Current fake image detection methods predominantly rely on binary classification models that focus on accuracy while often neglecting interpretability, leaving users without clear insights into why an image is deemed real or fake. To bridge this gap, we introduce TruthLens, a novel training-free framework that reimagines deepfake detection as a visual question-answering (VQA) task. TruthLens utilizes state-of-the-art large vision-language models (LVLMs) to observe and describe visual artifacts and combines this with the reasoning capabilities of large language models (LLMs) like GPT-4 to analyze and aggregate evidence into informed decisions. By adopting a multimodal approach, TruthLens seamlessly integrates visual and semantic reasoning to not only classify images as real or fake but also provide interpretable explanations for its decisions. This transparency enhances trust and provides valuable insights into the artifacts that signal synthetic content. Extensive evaluations demonstrate that TruthLens outperforms conventional methods, achieving high accuracy on challenging datasets while maintaining a strong emphasis on explainability. By reframing deepfake detection as a reasoning-driven process, TruthLens establishes a new paradigm in combating synthetic media, combining cutting-edge performance with interpretability to address the growing threats of visual disinformation.

Title: SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models

Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15351
Pdf URL: https://arxiv.org/pdf/2503.15351
Copy Paste: [[2503.15351]] SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models(https://arxiv.org/abs/2503.15351)
Keywords: large language model
Abstract: In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive and domain-adaptive method for intent clustering without fine-tuning. Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets. Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user's goals.

Title: SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation

Authors: Thomas Pickard, Aline Villavicencio, Maggie Mi, Wei He, Dylan Phelps, Carolina Scarton, Marco Idiart
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.15358
Pdf URL: https://arxiv.org/pdf/2503.15358
Copy Paste: [[2503.15358]] SemEval-2025 Task 1: AdMIRe -- Advancing Multimodal Idiomaticity Representation(https://arxiv.org/abs/2503.15358)
Keywords: robust, large language model
Abstract: Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in Large Language Models (LLMs), idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and tasks for SemEval-2025 Task 1: AdMiRe (Advancing Multimodal Idiomaticity Representation), which challenges the community to assess and improve models' ability to interpret idiomatic expressions in multimodal contexts and in multiple languages. Participants competed in two subtasks: ranking images based on their alignment with idiomatic or literal meanings, and predicting the next image in a sequence. The most effective methods achieved human-level performance by leveraging pretrained LLMs and vision-language models in mixture-of-experts settings, with multiple queries used to smooth over the weaknesses in these models' representations of idiomaticity.

Title: FedBEns: One-Shot Federated Learning based on Bayesian Ensemble

Authors: Jacopo Talpini, Marco Savi, Giovanni Neglia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15367
Pdf URL: https://arxiv.org/pdf/2503.15367
Copy Paste: [[2503.15367]] FedBEns: One-Shot Federated Learning based on Bayesian Ensemble(https://arxiv.org/abs/2503.15367)
Keywords: federate
Abstract: One-Shot Federated Learning (FL) is a recent paradigm that enables multiple clients to cooperatively learn a global model in a single round of communication with a central server. In this paper, we analyze the One-Shot FL problem through the lens of Bayesian inference and propose FedBEns, an algorithm that leverages the inherent multimodality of local loss functions to find better global models. Our algorithm leverages a mixture of Laplace approximations for the clients' local posteriors, which the server then aggregates to infer the global model. We conduct extensive experiments on various datasets, demonstrating that the proposed method outperforms competing baselines that typically rely on unimodal approximations of the local losses.

Title: EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models

Authors: Yinan Liang, Ziwei Wang, Xiuwei Xu, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15369
Pdf URL: https://arxiv.org/pdf/2503.15369
Copy Paste: [[2503.15369]] EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models(https://arxiv.org/abs/2503.15369)
Keywords: large language model
Abstract: While multimodal large language models demonstrate strong performance in complex reasoning tasks, they pose significant challenges related to model complexity during deployment, especially for resource-limited devices. In this paper, we propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Conventional methods rely on the training data of the original model to select the proper pruning ratio for different network components. However, these methods are impractical for large vision-language models due to the unaffordable search costs caused by web-scale training corpus. In contrast, our approach only leverages a small number of samples to search for the desired pruning policy by maximizing its generalization ability on unknown training data while maintaining the model accuracy, which enables the achievement of an optimal trade-off between accuracy and efficiency for large visual language models. Specifically, we formulate the generalization gap of the pruning strategy using the structural risk minimization principle. Based on both task performance and generalization capability, we iteratively search for the optimal pruning policy within a given search space and optimize the vision projector to evolve the search space with higher upper bound of performance. We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering. Using only 64 samples for pruning policy search, EfficientLLaVA achieves an accuracy of 83.05% on ScienceQA, along with a $\times$ 1.8 speedup compared to the dense LLaVA-v1.5-7B model.

Title: Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data

Authors: Anatole Callies (Inato), Quentin Bodinier (Inato), Philippe Ravaud (Inato, Université Paris Cité and Université Sorbonne Paris Nord, INSERM, INRAE, Paris, France, Centre d'epidémiologie clinique, AP-HP, Hôpital Hôtel Dieu, Paris, France), Kourosh Davarpanah (Inato)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15374
Pdf URL: https://arxiv.org/pdf/2503.15374
Copy Paste: [[2503.15374]] Real-world validation of a multimodal LLM-powered pipeline for High-Accuracy Clinical Trial Patient Matching leveraging EHR data(https://arxiv.org/abs/2503.15374)
Keywords: robust
Abstract: Background: Patient recruitment in clinical trials is hindered by complex eligibility criteria and labor-intensive chart reviews. Prior research using text-only models have struggled to address this problem in a reliable and scalable way due to (1) limited reasoning capabilities, (2) information loss from converting visual records to text, and (3) lack of a generic EHR integration to extract patient data. Methods: We introduce a broadly applicable, integration-free, LLM-powered pipeline that automates patient-trial matching using unprocessed documents extracted from EHRs. Our approach leverages (1) the new reasoning-LLM paradigm, enabling the assessment of even the most complex criteria, (2) visual capabilities of latest LLMs to interpret medical records without lossy image-to-text conversions, and (3) multimodal embeddings for efficient medical record search. The pipeline was validated on the n2c2 2018 cohort selection dataset (288 diabetic patients) and a real-world dataset composed of 485 patients from 30 different sites matched against 36 diverse trials. Results: On the n2c2 dataset, our method achieved a new state-of-the-art criterion-level accuracy of 93\%. In real-world trials, the pipeline yielded an accuracy of 87\%, undermined by the difficulty to replicate human decision-making when medical records lack sufficient information. Nevertheless, users were able to review overall eligibility in under 9 minutes per patient on average, representing an 80\% improvement over traditional manual chart reviews. Conclusion: This pipeline demonstrates robust performance in clinical trial patient matching without requiring custom integration with site systems or trial-specific tailoring, thereby enabling scalable deployment across sites seeking to leverage AI for patient matching.

Title: Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement

Authors: Yuchen Ren, Zhengyu Zhao, Chenhao Lin, Bo Yang, Lu Zhou, Zhe Liu, Chao Shen
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.15404
Pdf URL: https://arxiv.org/pdf/2503.15404
Copy Paste: [[2503.15404]] Improving Adversarial Transferability on Vision Transformers via Forward Propagation Refinement(https://arxiv.org/abs/2503.15404)
Keywords: defense, robust, transformer
Abstract: Vision Transformers (ViTs) have been widely applied in various computer vision and vision-language tasks. To gain insights into their robustness in practical scenarios, transferable adversarial examples on ViTs have been extensively studied. A typical approach to improving adversarial transferability is by refining the surrogate model. However, existing work on ViTs has restricted their surrogate refinement to backward propagation. In this work, we instead focus on Forward Propagation Refinement (FPR) and specifically refine two key modules of ViTs: attention maps and token embeddings. For attention maps, we propose Attention Map Diversification (AMD), which diversifies certain attention maps and also implicitly imposes beneficial gradient vanishing during backward propagation. For token embeddings, we propose Momentum Token Embedding (MTE), which accumulates historical token embeddings to stabilize the forward updates in both the Attention and MLP blocks. We conduct extensive experiments with adversarial examples transferred from ViTs to various CNNs and ViTs, demonstrating that our FPR outperforms the current best (backward) surrogate refinement by up to 7.0\% on average. We also validate its superiority against popular defenses and its compatibility with other transfer methods. Codes and appendix are available at this https URL.

Title: Visual Persona: Foundation Model for Full-Body Human Customization

Authors: Jisu Nam, Soowon Son, Zhan Xu, Jing Shi, Difan Liu, Feng Liu, Aashish Misraa, Seungryong Kim, Yang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15406
Pdf URL: https://arxiv.org/pdf/2503.15406
Copy Paste: [[2503.15406]] Visual Persona: Foundation Model for Full-Body Human Customization(https://arxiv.org/abs/2503.15406)
Keywords: diffusion, transformer
Abstract: We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which augments the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.

Title: Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis

Authors: Fereshteh Forghani, Jason J. Yu, Tristan Aumentado-Armstrong, Konstantinos G. Derpanis, Marcus A. Brubaker
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15412
Pdf URL: https://arxiv.org/pdf/2503.15412
Copy Paste: [[2503.15412]] Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis(https://arxiv.org/abs/2503.15412)
Keywords: generative
Abstract: Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. The scales of camera positions in this monocular setting are ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be minimally synthesized given a single image and are, thus, unconstrained, necessitating the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when sampled from a single image by isolating its effect on the resulting models and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves generated image quality of the resulting GNVS model.

Title: LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Authors: Amirhossein Kazerouni, Soroush Mehraban, Michael Brudno, Babak Taati
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.15420
Pdf URL: https://arxiv.org/pdf/2503.15420
Copy Paste: [[2503.15420]] LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding(https://arxiv.org/abs/2503.15420)
Keywords: generative
Abstract: Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.

Title: Visual Position Prompt for MLLM based Visual Grounding

Authors: Wei Tang, Yanpeng Sun, Qinying Gu, Zechao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15426
Pdf URL: https://arxiv.org/pdf/2503.15426
Copy Paste: [[2503.15426]] Visual Position Prompt for MLLM based Visual Grounding(https://arxiv.org/abs/2503.15426)
Keywords: extraction, large language model
Abstract: Although Multimodal Large Language Models (MLLMs) excel at various image-related tasks, they encounter challenges in precisely aligning coordinates with spatial information within images, particularly in position-aware tasks such as visual grounding. This limitation arises from two key factors. First, MLLMs lack explicit spatial references, making it difficult to associate textual descriptions with precise image locations. Second, their feature extraction processes prioritize global context over fine-grained spatial details, leading to weak localization capability. To address this issue, we introduce VPP-LLaVA, an MLLM equipped with Visual Position Prompt (VPP) to improve its grounding capability. VPP-LLaVA integrates two complementary mechanisms. The global VPP overlays learnable, axis-like embeddings onto the input image to provide structured spatial cues. The local VPP focuses on fine-grained localization by incorporating position-aware queries, which suggests probable object locations. We also introduce a VPP-SFT dataset with 0.6M samples, consolidating high-quality visual grounding data into a compact format for efficient model training. Training on this dataset with VPP enhances the model's performance, achieving state-of-the-art results on standard grounding benchmarks despite using fewer training samples compared to other MLLMs like MiniGPT-v2, which rely on much larger datasets ($\sim$21M samples). The code and VPP-SFT dataset will be available at this https URL upon acceptance.

Title: V2X-DG: Domain Generalization for Vehicle-to-Everything Cooperative Perception

Authors: Baolu Li, Zongzhe Xu, Jinlong Li, Xinyu Liu, Jianwu Fang, Xiaopeng Li, Hongkai Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15435
Pdf URL: https://arxiv.org/pdf/2503.15435
Copy Paste: [[2503.15435]] V2X-DG: Domain Generalization for Vehicle-to-Everything Cooperative Perception(https://arxiv.org/abs/2503.15435)
Keywords: robust
Abstract: LiDAR-based Vehicle-to-Everything (V2X) cooperative perception has demonstrated its impact on the safety and effectiveness of autonomous driving. Since current cooperative perception algorithms are trained and tested on the same dataset, the generalization ability of cooperative perception systems remains underexplored. This paper is the first work to study the Domain Generalization problem of LiDAR-based V2X cooperative perception (V2X-DG) for 3D detection based on four widely-used open source datasets: OPV2V, V2XSet, V2V4Real and DAIR-V2X. Our research seeks to sustain high performance not only within the source domain but also across other unseen domains, achieved solely through training on source domain. To this end, we propose Cooperative Mixup Augmentation based Generalization (CMAG) to improve the model generalization capability by simulating the unseen cooperation, which is designed compactly for the domain gaps in cooperative perception. Furthermore, we propose a constraint for the regularization of the robust generalized feature representation learning: Cooperation Feature Consistency (CFC), which aligns the intermediately fused features of the generalized cooperation by CMAG and the early fused features of the original cooperation in source domain. Extensive experiments demonstrate that our approach achieves significant performance gains when generalizing to other unseen datasets while it also maintains strong performance on the source dataset.

Title: MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Authors: Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15451
Pdf URL: https://arxiv.org/pdf/2503.15451
Copy Paste: [[2503.15451]] MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space(https://arxiv.org/abs/2503.15451)
Keywords: diffusion
Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: this https URL

Title: Evaluating Bias in Retrieval-Augmented Medical Question-Answering Systems

Authors: Yuelyu Ji, Hang Zhang, Yanshan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2503.15454
Pdf URL: https://arxiv.org/pdf/2503.15454
Copy Paste: [[2503.15454]] Evaluating Bias in Retrieval-Augmented Medical Question-Answering Systems(https://arxiv.org/abs/2503.15454)
Keywords: fair
Abstract: Medical QA systems powered by Retrieval-Augmented Generation (RAG) models support clinical decision-making but may introduce biases related to race, gender, and social determinants of health. We systematically evaluate biases in RAG-based LLM by examining demographic-sensitive queries and measuring retrieval discrepancies. Using datasets like MMLU and MedMCQA, we analyze retrieval overlap and correctness disparities. Our findings reveal substantial demographic disparities within RAG pipelines, emphasizing the critical need for retrieval methods that explicitly account for fairness to ensure equitable clinical decision-making.

Title: Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator

Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15457
Pdf URL: https://arxiv.org/pdf/2503.15457
Copy Paste: [[2503.15457]] Di$\mathtt{[M]}$O: Distilling Masked Diffusion Models into One-step Generator(https://arxiv.org/abs/2503.15457)
Keywords: diffusion, generative
Abstract: Masked Diffusion Models (MDMs) have emerged as a powerful generative modeling technique. Despite their remarkable results, they typically suffer from slow inference with several steps. In this paper, we propose Di$\mathtt{[M]}$O, a novel approach that distills masked diffusion models into a one-step generator. Di$\mathtt{[M]}$O addresses two key challenges: (1) the intractability of using intermediate-step information for one-step generation, which we solve through token-level distribution matching that optimizes model output logits by an 'on-policy framework' with the help of an auxiliary model; and (2) the lack of entropy in the initial distribution, which we address through a token initialization strategy that injects randomness while maintaining similarity to teacher training distribution. We show Di$\mathtt{[M]}$O's effectiveness on both class-conditional and text-conditional image generation, impressively achieving performance competitive to multi-step teacher outputs while drastically reducing inference time. To our knowledge, we are the first to successfully achieve one-step distillation of masked diffusion models and the first to apply discrete distillation to text-to-image generation, opening new paths for efficient generative modeling.

Title: From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment

Authors: Jia-Nan Li, Jian Guan, Songhao Wu, Wei Wu, Rui Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15463
Pdf URL: https://arxiv.org/pdf/2503.15463
Copy Paste: [[2503.15463]] From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment(https://arxiv.org/abs/2503.15463)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have traditionally been aligned through one-size-fits-all approaches that assume uniform human preferences, fundamentally overlooking the diversity in user values and needs. This paper introduces a comprehensive framework for scalable personalized alignment of LLMs. We establish a systematic preference space characterizing psychological and behavioral dimensions, alongside diverse persona representations for robust preference inference in real-world scenarios. Building upon this foundation, we introduce \textsc{AlignX}, a large-scale dataset of over 1.3 million personalized preference examples, and develop two complementary alignment approaches: \textit{in-context alignment} directly conditioning on persona representations and \textit{preference-bridged alignment} modeling intermediate preference distributions. Extensive experiments demonstrate substantial improvements over existing methods, with an average 17.06\% accuracy gain across four benchmarks while exhibiting a strong adaptation capability to novel preferences, robustness to limited user data, and precise preference controllability. These results validate our framework's effectiveness, advancing toward truly user-adaptive AI systems.

Title: FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers

Authors: Ruichen Chen, Keith G. Mills, Di Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15465
Pdf URL: https://arxiv.org/pdf/2503.15465
Copy Paste: [[2503.15465]] FP4DiT: Towards Effective Floating Point Quantization for Diffusion Transformers(https://arxiv.org/abs/2503.15465)
Keywords: robust, diffusion, transformer
Abstract: Diffusion Models (DM) have revolutionized the text-to-image visual generation process. However, the large computational cost and model footprint of DMs hinders practical deployment, especially on edge devices. Post-training quantization (PTQ) is a lightweight method to alleviate these burdens without the need for training or fine-tuning. While recent DM PTQ methods achieve W4A8 on integer-based PTQ, two key limitations remain: First, while most existing DM PTQ methods evaluate on classical DMs like Stable Diffusion XL, 1.5 or earlier, which use convolutional U-Nets, newer Diffusion Transformer (DiT) models like the PixArt series, Hunyuan and others adopt fundamentally different transformer backbones to achieve superior image synthesis. Second, integer (INT) quantization is prevailing in DM PTQ but doesn't align well with the network weight and activation distribution, while Floating-Point Quantization (FPQ) is still under-investigated, yet it holds the potential to better align the weight and activation distributions in low-bit settings for DiT. In response, we introduce FP4DiT, a PTQ method that leverages FPQ to achieve W4A6 quantization. Specifically, we extend and generalize the Adaptive Rounding PTQ technique to adequately calibrate weight quantization for FPQ and demonstrate that DiT activations depend on input patch data, necessitating robust online activation quantization techniques. Experimental results demonstrate that FP4DiT outperforms integer-based PTQ at W4A6 and W4A8 precision and generates convincing visual content on PixArt-$\alpha$, PixArt-$\Sigma$ and Hunyuan in terms of several T2I metrics such as HPSv2 and CLIP.

Title: Dynamic Bi-Elman Attention Networks (DBEAN): Dual-Directional Context-Aware Representation Learning for Enhanced Text Classification

Authors: ZhengLin Lai, MengYao Liao, Dong Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2503.15469
Pdf URL: https://arxiv.org/pdf/2503.15469
Copy Paste: [[2503.15469]] Dynamic Bi-Elman Attention Networks (DBEAN): Dual-Directional Context-Aware Representation Learning for Enhanced Text Classification(https://arxiv.org/abs/2503.15469)
Keywords: extraction, interpretability, transformer
Abstract: Text classification, a fundamental task in natural language processing (NLP), aims to categorize textual data into predefined labels. Traditional methods struggled with complex linguistic structures and semantic dependencies. The advent of deep learning, particularly recurrent neural networks (RNNs) and Transformer-based models, has significantly advanced the field by enabling nuanced feature extraction and context-aware predictions. Despite improvements, existing models exhibit limitations in balancing interpretability, computational efficiency, and long-range contextual understanding. This paper proposes the Dynamic Bidirectional Elman with Attention Network (DBEAN), which integrates bidirectional temporal modelling with self-attention mechanisms. DBEAN dynamically assigns weights to critical segments of input, improving contextual representation while maintaining computational efficiency.

Title: Cube: A Roblox View of 3D Intelligence

Authors: Foundation AI Team Roblox: Kiran Bhat, Nishchaie Khanna, Karun Channa, Tinghui Zhou, Yiheng Zhu, Xiaoxia Sun, Charles Shang, Anirudh Sudarshan, Maurice Chu, Daiqing Li, Kangle Deng, Jean-Philippe Fauconnier, Tijmen Verhulsdonck, Maneesh Agrawala, Kayvon Fatahalian, Alexander Weiss, Christian Reiser, Ravi Kiran Chirravuri, Ravali Kandur, Alejandro Pelaez, Akash Garg, Michael Palleschi, Jessica Wang, Skylar Litz, Leon Liu, Anying Li, David Harmon, Derek Liu, Liangjun Feng, Denis Goupil, Lukas Kuczynski, Jihyun Yoon, Naveen Marri, Peiye Zhuang, Yinan Zhang, Brian Yin, Haomiao Jiang, Marcel van Workum, Thomas Lane, Bryce Erickson, Salil Pathare, Kyle Price, Anupam Singh, David Baszucki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.15475
Pdf URL: https://arxiv.org/pdf/2503.15475
Copy Paste: [[2503.15475]] Cube: A Roblox View of 3D Intelligence(https://arxiv.org/abs/2503.15475)
Keywords: large language model
Abstract: Foundation models trained on vast amounts of data have demonstrated remarkable reasoning and generation capabilities in the domains of text, images, audio and video. Our goal at Roblox is to build such a foundation model for 3D intelligence, a model that can support developers in producing all aspects of a Roblox experience, from generating 3D objects and scenes to rigging characters for animation to producing programmatic scripts describing object behaviors. We discuss three key design requirements for such a 3D foundation model and then present our first step towards building such a model. We expect that 3D geometric shapes will be a core data type and describe our solution for 3D shape tokenizer. We show how our tokenization scheme can be used in applications for text-to-shape generation, shape-to-text generation and text-to-scene generation. We demonstrate how these applications can collaborate with existing large language models (LLMs) to perform scene analysis and reasoning. We conclude with a discussion outlining our path to building a fully unified foundation model for 3D intelligence.

Title: SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Authors: Yifei Zhou, Song Jiang, Yuandong Tian, Jason Weston, Sergey Levine, Sainbayar Sukhbaatar, Xian Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.15478
Pdf URL: https://arxiv.org/pdf/2503.15478
Copy Paste: [[2503.15478]] SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks(https://arxiv.org/abs/2503.15478)
Keywords: large language model
Abstract: Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks. However, existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs and it remains unclear how to develop such algorithms. To study this, we first introduce a new benchmark, ColBench, where an LLM agent interacts with a human collaborator over multiple turns to solve realistic tasks in backend programming and frontend design. Building on this benchmark, we propose a novel RL algorithm, SWEET-RL (RL with Step-WisE Evaluation from Training-time information), that uses a carefully designed optimization objective to train a critic model with access to additional training-time information. The critic provides step-level rewards for improving the policy model. Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms, enabling Llama-3.1-8B to match or exceed the performance of GPT4-o in realistic collaborative content creation.

Title: Value Profiles for Encoding Human Variation

Authors: Taylor Sorensen, Pushkar Mishra, Roma Patel, Michael Henry Tessler, Michiel Bakker, Georgina Evans, Iason Gabriel, Noah Goodman, Verena Rieser
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15484
Pdf URL: https://arxiv.org/pdf/2503.15484
Copy Paste: [[2503.15484]] Value Profiles for Encoding Human Variation(https://arxiv.org/abs/2503.15484)
Keywords: interpretability
Abstract: Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.

Title: TULIP: Towards Unified Language-Image Pretraining

Authors: Zineng Tang, Long Lian, Seun Eisape, XuDong Wang, Roei Herzig, Adam Yala, Alane Suhr, Trevor Darrell, David M. Chan
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.15485
Pdf URL: https://arxiv.org/pdf/2503.15485
Copy Paste: [[2503.15485]] TULIP: Towards Unified Language-Image Pretraining(https://arxiv.org/abs/2503.15485)
Keywords: generative
Abstract: Despite the recent success of image-text contrastive models like CLIP and SigLIP, these models often struggle with vision-centric tasks that demand high-fidelity image understanding, such as counting, depth estimation, and fine-grained object recognition. These models, by performing language alignment, tend to prioritize high-level semantics over visual understanding, weakening their image understanding. On the other hand, vision-focused models are great at processing visual information but struggle to understand language, limiting their flexibility for language-driven tasks. In this work, we introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models. Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features while preserving global semantic alignment. Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across multiple benchmarks, establishing a new SOTA zero-shot performance on ImageNet-1K, delivering up to a $2\times$ enhancement over SigLIP on RxRx1 in linear probing for few-shot classification, and improving vision-language models, achieving over $3\times$ higher scores than SigLIP on MMVP. Our code/checkpoints are available at this https URL