2025-06-09

Title: Zero-Trust Mobility-Aware Authentication Framework for Secure Vehicular Fog Computing Networks

Authors: Taimoor Ahmad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05355
Pdf URL: https://arxiv.org/pdf/2506.05355
Copy Paste: [[2506.05355]] Zero-Trust Mobility-Aware Authentication Framework for Secure Vehicular Fog Computing Networks(https://arxiv.org/abs/2506.05355)
Keywords: secure, security, attack
Abstract: Vehicular Fog Computing (VFC) is a promising paradigm to meet the low-latency and high-bandwidth demands of Intelligent Transportation Systems (ITS). However, dynamic vehicle mobility and diverse trust boundaries introduce critical security challenges. This paper presents a novel Zero-Trust Mobility-Aware Authentication Framework (ZTMAF) for secure communication in VFC networks. The framework employs context-aware authentication with lightweight cryptographic primitives, a decentralized trust evaluation system, and fog node-assisted session validation to combat spoofing, replay, and impersonation attacks. Simulation results on NS-3 and SUMO demonstrate improved authentication latency, reduced computational overhead, and better scalability compared to traditional PKI and blockchain-based models. Our findings suggest that ZTMAF is effective for secure, real-time V2X interactions under adversarial and mobility-variant scenarios.

Title: AI-Driven Dynamic Firewall Optimization Using Reinforcement Learning for Anomaly Detection and Prevention

Authors: Taimoor Ahmad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05356
Pdf URL: https://arxiv.org/pdf/2506.05356
Copy Paste: [[2506.05356]] AI-Driven Dynamic Firewall Optimization Using Reinforcement Learning for Anomaly Detection and Prevention(https://arxiv.org/abs/2506.05356)
Keywords: attack
Abstract: The growing complexity of cyber threats has rendered static firewalls increasingly ineffective for dynamic, real-time intrusion prevention. This paper proposes a novel AI-driven dynamic firewall optimization framework that leverages deep reinforcement learning (DRL) to autonomously adapt and update firewall rules in response to evolving network threats. Our system employs a Markov Decision Process (MDP) formulation, where the RL agent observes network states, detects anomalies using a hybrid LSTM-CNN model, and dynamically modifies firewall configurations to mitigate risks. We train and evaluate our framework on the NSL-KDD and CIC-IDS2017 datasets using a simulated software-defined network environment. Results demonstrate significant improvements in detection accuracy, false positive reduction, and rule update latency when compared to traditional signature- and behavior-based firewalls. The proposed method provides a scalable, autonomous solution for enhancing network resilience against complex attack vectors in both enterprise and critical infrastructure settings.

Title: Can ChatGPT Perform Image Splicing Detection? A Preliminary Study

Authors: Souradip Nath
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.05358
Pdf URL: https://arxiv.org/pdf/2506.05358
Copy Paste: [[2506.05358]] Can ChatGPT Perform Image Splicing Detection? A Preliminary Study(https://arxiv.org/abs/2506.05358)
Keywords: interpretability, large language model
Abstract: Multimodal Large Language Models (MLLMs) like GPT-4V are capable of reasoning across text and image modalities, showing promise in a variety of complex vision-language tasks. In this preliminary study, we investigate the out-of-the-box capabilities of GPT-4V in the domain of image forensics, specifically, in detecting image splicing manipulations. Without any task-specific fine-tuning, we evaluate GPT-4V using three prompting strategies: Zero-Shot (ZS), Few-Shot (FS), and Chain-of-Thought (CoT), applied over a curated subset of the CASIA v2.0 splicing dataset. Our results show that GPT-4V achieves competitive detection performance in zero-shot settings (more than 85% accuracy), with CoT prompting yielding the most balanced trade-off across authentic and spliced images. Qualitative analysis further reveals that the model not only detects low-level visual artifacts but also draws upon real-world contextual knowledge such as object scale, semantic consistency, and architectural facts, to identify implausible composites. While GPT-4V lags behind specialized state-of-the-art splicing detection models, its generalizability, interpretability, and encyclopedic reasoning highlight its potential as a flexible tool in image forensics.

Title: CarboNeXT and CarboFormer: Dual Semantic Segmentation Architectures for Detecting and Quantifying Carbon Dioxide Emissions Using Optical Gas Imaging

Authors: Taminul Islam, Toqi Tahamid Sarker, Mohamed G Embaby, Khaled R Ahmed, Amer AbuGhazaleh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05360
Pdf URL: https://arxiv.org/pdf/2506.05360
Copy Paste: [[2506.05360]] CarboNeXT and CarboFormer: Dual Semantic Segmentation Architectures for Detecting and Quantifying Carbon Dioxide Emissions Using Optical Gas Imaging(https://arxiv.org/abs/2506.05360)
Keywords: robust, segmentation
Abstract: Carbon dioxide (CO$_2$) emissions are critical indicators of both environmental impact and various industrial processes, including livestock management. We introduce CarboNeXT, a semantic segmentation framework for Optical Gas Imaging (OGI), designed to detect and quantify CO$_2$ emissions across diverse applications. Our approach integrates a multi-scale context aggregation network with UPerHead and auxiliary FCN components to effectively model both local details and global relationships in gas plume imagery. We contribute two novel datasets: (1) the Controlled Carbon Dioxide Release (CCR) dataset, which simulates gas leaks with systematically varied flow rates (10-100 SCCM), and (2) the Real Time Ankom (RTA) dataset, focusing on emissions from dairy cow rumen fluid in vitro experiments. Extensive evaluations demonstrate that CarboNeXT outperforms state-of-the-art methods, achieving 88.46% mIoU on CCR and 92.95% mIoU on RTA, with particular effectiveness in challenging low-flow scenarios. The model operates at 60.95 FPS, enabling real-time monitoring applications. Additionally, we propose CarboFormer, a lightweight variant with only 5.07M parameters that achieves 84.68 FPS, with competitive performance of 84.88% mIoU on CCR and 92.98% on RTA, making it suitable for resource-constrained platforms such as programmable drones. Our work advances both environmental sensing and precision livestock management by providing robust tools for CO$_2$ emission analysis, with a specific focus on livestock applications.
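
The mIoU figures quoted above are mean intersection-over-union scores. As a minimal sketch of how such a score is usually computed (not the authors' evaluation code), the example below works on flat lists of per-pixel class labels. The function name `mean_iou` and the toy labels are illustrative assumptions, and classes absent from both prediction and ground truth are skipped, which is one common convention among several.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes that appear in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    ious = []
    for c in range(num_classes):
        # Pixels labelled c in both maps (intersection) or in either map (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 0 = background, 1 = gas plume (hypothetical labels).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
```

In this toy case each class has an intersection of 2 pixels and a union of 4, so both per-class IoUs are 0.5.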

Title: Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching

Authors: Tinglin Huang, Tianyu Liu, Mehrtash Babadi, Wengong Jin, Rex Ying
Subjects: cs.CV, q-bio.GN
Abstract URL: https://arxiv.org/abs/2506.05361
Pdf URL: https://arxiv.org/pdf/2506.05361
Copy Paste: [[2506.05361]] Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching(https://arxiv.org/abs/2506.05361)
Keywords: generative
Abstract: Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.

Title: Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion

Authors: Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05363
Pdf URL: https://arxiv.org/pdf/2506.05363
Copy Paste: [[2506.05363]] Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion(https://arxiv.org/abs/2506.05363)
Keywords: diffusion
Abstract: Conventional methods for scalable image coding for humans and machines require the transmission of additional information to achieve scalability. A recent diffusion-based method avoids this by generating human-oriented images from machine-oriented images without extra bitrate. This method, however, uses a single random seed, which may lead to suboptimal image quality. In this paper, we propose a seed selection method that identifies the optimal seed from multiple candidates to improve image quality without increasing the bitrate. To reduce computational cost, the selection is performed based on intermediate outputs obtained from early steps of the reverse diffusion process. Experimental results demonstrate that our method outperforms the baseline across multiple metrics.

Title: Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards

Authors: Aakash Garg, Libing Zeng, Andrii Tsarov, Nima Khademi Kalantari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05367
Pdf URL: https://arxiv.org/pdf/2506.05367
Copy Paste: [[2506.05367]] Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards(https://arxiv.org/abs/2506.05367)
Keywords: diffusion
Abstract: In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.

Title: Speaking images. A novel framework for the automated self-description of artworks

Authors: Valentine Bernasconi, Gustavo Marfia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05368
Pdf URL: https://arxiv.org/pdf/2506.05368
Copy Paste: [[2506.05368]] Speaking images. A novel framework for the automated self-description of artworks(https://arxiv.org/abs/2506.05368)
Keywords: generative
Abstract: Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease access to digital collections and to highlight their content. Such innovations develop into creative explorations of the digital image in relation to its malleability and contemporary interpretation, in confrontation with the original historical object. Based on the concept of the autonomous image, we propose a new framework for producing self-explaining cultural artifacts using open-source large language, face detection, text-to-speech, and audio-to-animation models. The goal is to start from a digitized artwork and to automatically assemble a short video of it in which the main character animates to explain its content. The whole process questions the cultural biases encapsulated in large language models, the potential of digital images and deepfakes of artworks for educational purposes, and the concerns of the field of art history regarding such creative diversions.

Title: State Estimation and Control of Dynamic Systems from High-Dimensional Image Data

Authors: Ashik E Rasul, Hyung-Jin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05375
Pdf URL: https://arxiv.org/pdf/2506.05375
Copy Paste: [[2506.05375]] State Estimation and Control of Dynamic Systems from High-Dimensional Image Data(https://arxiv.org/abs/2506.05375)
Keywords: extraction
Abstract: Accurate state estimation is critical for optimal policy design in dynamic systems. However, obtaining true system states is often impractical or infeasible, complicating the policy learning process. This paper introduces a novel neural architecture that integrates spatial feature extraction using convolutional neural networks (CNNs) and temporal modeling through gated recurrent units (GRUs), enabling effective state representation from sequences of images and corresponding actions. These learned state representations are used to train a reinforcement learning agent with a Deep Q-Network (DQN). Experimental results demonstrate that our proposed approach enables real-time, accurate estimation and control without direct access to ground-truth states. Additionally, we provide a quantitative evaluation methodology for assessing the accuracy of the learned states, highlighting their impact on policy performance and control stability.

Title: A Red Teaming Roadmap Towards System-Level Safety

Authors: Zifan Wang, Christina Q. Knight, Jeremy Kritz, Willow E. Primack, Julian Michael
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05376
Pdf URL: https://arxiv.org/pdf/2506.05376
Copy Paste: [[2506.05376]] A Red Teaming Roadmap Towards System-Level Safety(https://arxiv.org/abs/2506.05376)
Keywords: attack, large language model
Abstract: Large Language Model (LLM) safeguards, which implement request refusals, have become a widely adopted mitigation strategy against misuse. At the intersection of adversarial machine learning and AI safety, safeguard red teaming has effectively identified critical vulnerabilities in state-of-the-art refusal-trained LLMs. However, in our view the many conference submissions on LLM red teaming do not, in aggregate, prioritize the right research problems. First, testing against clear product safety specifications should take a higher priority than abstract social biases or ethical principles. Second, red teaming should prioritize realistic threat models that represent the expanding risk landscape and what real attackers might do. Finally, we contend that system-level safety is a necessary step to move red teaming research forward, as AI models present new threats as well as affordances for threat mitigation (e.g., detection and banning of malicious users) once placed in a deployment context. Adopting these priorities will be necessary in order for red teaming research to adequately address the slate of new threats that rapid AI advances present today and will present in the very near future.

Title: An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos

Authors: Shayantani Kar, B. Shresth Bhimrajka, Aditya Kumar, Sahil Gupta, Sourav Ghosh, Subhamita Mukherjee, Shauvik Paul
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05377
Pdf URL: https://arxiv.org/pdf/2506.05377
Copy Paste: [[2506.05377]] An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos(https://arxiv.org/abs/2506.05377)
Keywords: generative
Abstract: The rapid spread of false images and videos on online platforms is an emerging problem. Anyone can add, delete, clone, or modify people and entities in an image using readily available editing software. This produces false and misleading evidence that can be used to conceal crimes. Such counterfeit images and videos now flood the internet and spread false information. Many methods for detecting counterfeit content are available in the literature, but new counterfeiting methods keep evolving. Generative Adversarial Networks (GANs) are observed to be one effective counterfeiting method, as they modify the context and definition of images and produce plausible results via image-to-image translation. This work uses an independent discriminant network that can identify GAN-generated images or videos. The discriminant network is a convolutional neural network based on InceptionResNetV2. The article also proposes a platform where users can detect forged images and videos. The proposed work has the potential to help the forensics domain detect counterfeit videos and hidden criminal evidence and thereby identify criminal activities.

Title: A Compendium of Autonomous Navigation using Object Detection and Tracking in Unmanned Aerial Vehicles

Authors: Mohit Arora, Pratyush Shukla, Shivali Chopra
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2506.05378
Pdf URL: https://arxiv.org/pdf/2506.05378
Copy Paste: [[2506.05378]] A Compendium of Autonomous Navigation using Object Detection and Tracking in Unmanned Aerial Vehicles(https://arxiv.org/abs/2506.05378)
Keywords: security, robust
Abstract: Unmanned Aerial Vehicles (UAVs) are one of the most revolutionary inventions of the 21st century. At the core of a UAV lies the central processing system, which uses wireless signals to control its movement. The most popular UAVs are quadcopters, which use a set of four motors, arranged as two on either side with opposite spin. An autonomous UAV is called a drone. Drones have been in service in the US Army since the 1990s for covert missions critical to national security. It would not be wrong to claim that drones make up an integral part of national security and provide the most valuable service during surveillance operations. While UAVs are controlled using wireless signals, several challenges can disrupt their operation, such as signal quality and range, real-time processing, human expertise, robust hardware, and data security. These challenges can be addressed by programming UAVs to be autonomous, using object detection and tracking through computer vision algorithms. Computer vision is an interdisciplinary field that seeks to use deep learning to gain a high-level understanding of digital images and videos in order to automate tasks of the human visual system. Using computer vision, algorithms for detecting and tracking various objects can be developed to suit the hardware and allow real-time processing for immediate judgement. This paper reviews the approaches that several authors have proposed for autonomous navigation of UAVs through real-time object detection and tracking algorithms, for applications in fields such as disaster management, dense area exploration, and traffic vehicle surveillance.

Title: EvidenceOutcomes: a Dataset of Clinical Trial Publications with Clinically Meaningful Outcomes

Authors: Yiliang Zhou, Abigail M. Newbury, Gongbo Zhang, Betina Ross Idnay, Hao Liu, Chunhua Weng, Yifan Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05380
Pdf URL: https://arxiv.org/pdf/2506.05380
Copy Paste: [[2506.05380]] EvidenceOutcomes: a Dataset of Clinical Trial Publications with Clinically Meaningful Outcomes(https://arxiv.org/abs/2506.05380)
Keywords: robust, extraction
Abstract: The fundamental process of evidence extraction and synthesis in evidence-based medicine involves extracting PICO (Population, Intervention, Comparison, and Outcome) elements from biomedical literature. However, Outcomes, being the most complex elements, are often neglected or oversimplified in existing benchmarks. To address this issue, we present EvidenceOutcomes, a novel, large, annotated corpus of clinically meaningful outcomes extracted from biomedical literature. We first developed a robust annotation guideline for extracting clinically meaningful outcomes from text through iteration and discussion with clinicians and Natural Language Processing experts. Then, three independent annotators annotated the Results and Conclusions sections of a randomly selected sample of 500 PubMed abstracts and 140 PubMed abstracts from the existing EBM-NLP corpus. This resulted in EvidenceOutcomes, with high-quality annotations and an inter-rater agreement of 0.76. Additionally, our fine-tuned PubMedBERT model, applied to these 500 PubMed abstracts, achieved an F1-score of 0.69 at the entity level and 0.76 at the token level on the subset of 140 PubMed abstracts from the EBM-NLP corpus. EvidenceOutcomes can serve as a shared benchmark to develop and test future machine learning algorithms to extract clinically meaningful outcomes from biomedical abstracts.

Title: Heterogeneous Secure Transmissions in IRS-Assisted NOMA Communications: CO-GNN Approach

Authors: Linlin Liang, Zongkai Tian, Haiyan Huang, Xiaoyan Li, Zhisheng Yin, Dehua Zhang, Nina Zhang, Wenchao Zhai
Subjects: cs.CR, cs.IT, eess.SP
Abstract URL: https://arxiv.org/abs/2506.05381
Pdf URL: https://arxiv.org/pdf/2506.05381
Copy Paste: [[2506.05381]] Heterogeneous Secure Transmissions in IRS-Assisted NOMA Communications: CO-GNN Approach(https://arxiv.org/abs/2506.05381)
Keywords: secure, security
Abstract: Intelligent Reflecting Surfaces (IRS) enhance spectral efficiency by adjusting reflection phase shifts, while Non-Orthogonal Multiple Access (NOMA) increases system capacity. Consequently, IRS-assisted NOMA communications have garnered significant research interest. However, the passive nature of the IRS, lacking authentication and security protocols, makes these systems vulnerable to external eavesdropping due to the openness of electromagnetic signal propagation and reflection. NOMA's inherent multi-user signal superposition also introduces internal eavesdropping risks during user pairing. This paper investigates secure transmissions in IRS-assisted NOMA systems with heterogeneous resource configuration in wireless networks to mitigate both external and internal eavesdropping. To maximize the sum secrecy rate of legitimate users, we propose a combinatorial optimization graph neural network (CO-GNN) approach to jointly optimize beamforming at the base station, power allocation of NOMA users, and phase shifts of IRS for dynamic heterogeneous resource allocation, thereby enabling the design of dual-link or multi-link secure transmissions in the presence of eavesdroppers on the same or heterogeneous links. The CO-GNN algorithm simplifies the complex mathematical problem-solving process, eliminates the need for channel estimation, and enhances scalability. Simulation results demonstrate that the proposed algorithm significantly enhances the secure transmission performance of the system.
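
The optimization target named above, the sum secrecy rate, can be illustrated with the standard textbook definition: a user's secrecy rate is the positive part of the gap between the legitimate link's capacity and the eavesdropper's. The sketch below assumes that definition and per-link SINR values given as plain numbers. The function names and sample values are hypothetical, and the paper's exact formulation (for example, how internal and external eavesdroppers are combined) may differ.

```python
import math
from typing import Iterable, Tuple


def secrecy_rate(sinr_user: float, sinr_eve: float) -> float:
    """Secrecy rate in bits/s/Hz: [log2(1 + SINR_user) - log2(1 + SINR_eve)]^+."""
    return max(0.0, math.log2(1.0 + sinr_user) - math.log2(1.0 + sinr_eve))


def sum_secrecy_rate(links: Iterable[Tuple[float, float]]) -> float:
    """Sum of per-user secrecy rates over (sinr_user, sinr_eve) pairs."""
    return sum(secrecy_rate(user, eve) for user, eve in links)


# Hypothetical linear-scale SINRs: the second user is weaker than its eavesdropper,
# so it contributes zero rather than a negative rate.
links = [(15.0, 3.0), (1.0, 3.0)]
total = sum_secrecy_rate(links)
```

Clipping at zero reflects that a link with a stronger eavesdropper carries no secret information; it does not subtract from the other users' rates.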

Title: How stealthy is stealthy? Studying the Efficacy of Black-Box Adversarial Attacks in the Real World

Authors: Francesco Panebianco, Mario D'Onghia, Stefano Zanero, Michele Carminati
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05382
Pdf URL: https://arxiv.org/pdf/2506.05382
Copy Paste: [[2506.05382]] How stealthy is stealthy? Studying the Efficacy of Black-Box Adversarial Attacks in the Real World(https://arxiv.org/abs/2506.05382)
Keywords: attack, robust, steal
Abstract: Deep learning systems, critical in domains like autonomous vehicles, are vulnerable to adversarial examples (crafted inputs designed to mislead classifiers). This study investigates black-box adversarial attacks in computer vision, a realistic scenario in which attackers have query-only access to the target model. Three properties are introduced to evaluate attack feasibility: robustness to compression, stealthiness to automatic detection, and stealthiness to human inspection. State-of-the-art methods tend to prioritize one criterion at the expense of others. We propose ECLIPSE, a novel attack method employing Gaussian blurring on sampled gradients and a local surrogate model. Comprehensive experiments on a public dataset highlight ECLIPSE's advantages, demonstrating its contribution to the trade-off between the three properties.

Title: Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?

Authors: Abu Sufian, Marco Leo, Cosimo Distante, Anirudha Ghosh, Debaditya Barman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05383
Pdf URL: https://arxiv.org/pdf/2506.05383
Copy Paste: [[2506.05383]] Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?(https://arxiv.org/abs/2506.05383)
Keywords: biometric, fair, transformer
Abstract: Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a big challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate different demographic faces while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. Then, we designed a novel few-shot prototype network with backbone features embedding. We also developed new demographic face image support and query datasets for this empirical study. The network's testing was conducted on this dataset in one-shot, three-shot, and five-shot scenarios to assess how performance improves as the size of the support set increases. We observed results across datasets with varying races/ethnicities, genders, and age groups. The Microsoft Swin Transformer backbone performed better among the three SOTA ViT for this task. The code and data are available at: this https URL.
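
The few-shot prototype step described above can be sketched with the generic prototypical-network rule: average the support embeddings of each class into a prototype, then assign a query to the nearest prototype. This is a minimal illustration under that assumption, not the authors' network. The embeddings are stand-in 2-D vectors, the identity labels are hypothetical, and squared Euclidean distance is an assumed choice of metric.

```python
from typing import Dict, List, Sequence


def class_prototypes(support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Average each class's support embeddings into a single prototype vector."""
    prototypes = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        prototypes[label] = [
            sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)
        ]
    return prototypes


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Two-shot support set with hypothetical identities and toy embeddings.
support = {
    "id_a": [[0.0, 0.0], [0.2, 0.0]],
    "id_b": [[1.0, 1.0], [1.0, 0.8]],
}
prototypes = class_prototypes(support)
label_near_a = classify([0.2, 0.1], prototypes)
label_near_b = classify([0.9, 1.0], prototypes)
```

Growing the support set from one to three to five shots only changes how many embeddings are averaged per class, which is why prototypes tend to stabilize as shots increase.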

Title: Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment

Authors: Zhuoxuan Cai, Jian Zhang, Xinbin Yuan, Pengtao Jiang, Wenxiang Chen, Bowen Tang, Lujian Yao, Qiyuan Wang, Jinwen Chen, Bo Li
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05384
Pdf URL: https://arxiv.org/pdf/2506.05384
Copy Paste: [[2506.05384]] Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment(https://arxiv.org/abs/2506.05384)
Keywords: interpretability, large language model
Abstract: Recent studies demonstrate that multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. However, existing approaches typically treat quality scoring and reasoning descriptions as separate tasks with disjoint optimization objectives, leading to a trade-off: models adept at quality reasoning descriptions struggle with precise score regression, while score-focused models lack interpretability. This limitation hinders the full potential of MLLMs in visual quality assessment, where accuracy and interpretability should be mutually reinforcing. To address this, we propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. Specifically, in the first stage, we distill high-quality data from a teacher model through expert-designed prompts, initializing reasoning capabilities via cross-entropy loss supervision. In the second stage, we introduce a novel reward with Group Relative Policy Optimization (GRPO) to jointly optimize scoring accuracy and reasoning consistency. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder. Extensive experiments show that Q-Ponder achieves state-of-the-art (SOTA) performance on quality score regression benchmarks, delivering up to 6.5% higher SRCC on cross-domain datasets. Furthermore, Q-Ponder significantly outperforms description-based SOTA models, including its teacher model Qwen-2.5-VL-72B, particularly in description accuracy and reasonableness, demonstrating the generalization potential over diverse tasks.
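
The second training stage relies on Group Relative Policy Optimization (GRPO). Its core idea, scoring each sampled response relative to the other responses drawn for the same prompt, can be sketched as below. This shows only the commonly used group-normalized advantage, not the paper's reward design. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population standard deviation
    # eps keeps the division finite when every response earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three responses sampled for one image-quality prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
flat_group = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the baseline is the group mean, no separate value network is needed; responses better than their siblings get positive advantages and worse ones get negative advantages.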

Title: LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models

Authors: Xinxin Li, Huiyao Chen, Chengjun Liu, Jing Li, Meishan Zhang, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05385
Pdf URL: https://arxiv.org/pdf/2506.05385
Copy Paste: [[2506.05385]] LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models(https://arxiv.org/abs/2506.05385)
Keywords: generative, large language model
Abstract: Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.

Title: Beyond RAG: Reinforced Reasoning Augmented Generation for Clinical Notes

Authors: Lo Pang-Yun Ting, Chengshuai Zhao, Yu-Hua Zeng, Yuan Jee Lim, Kun-Ta Chuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05386
Pdf URL: https://arxiv.org/pdf/2506.05386
Copy Paste: [[2506.05386]] Beyond RAG: Reinforced Reasoning Augmented Generation for Clinical Notes(https://arxiv.org/abs/2506.05386)
Keywords: large language model
Abstract: Clinical note generation aims to automatically produce free-text summaries of a patient's condition and diagnostic process, with discharge instructions being a representative long-form example. While recent large language model (LLM)-based methods pre-trained on general clinical corpora show promise in clinical text generation, they fall short in producing long-form notes from limited patient information. In this paper, we propose R2AG, the first reinforced retriever for long-form discharge instruction generation based on pre-admission data. R2AG is trained with reinforcement learning to retrieve reasoning paths from a medical knowledge graph, providing explicit semantic guidance to the LLM. To bridge the information gap, we propose Group-Based Retriever Optimization (GRO) which improves retrieval quality with group-relative rewards, encouraging reasoning leaps for deeper inference by the LLM. Comprehensive experiments on the MIMIC-IV-Note dataset show that R2AG outperforms baselines in both clinical efficacy and natural language generation metrics. Further analysis reveals that R2AG fills semantic gaps in sparse input scenarios, and retrieved reasoning paths help LLMs avoid clinical misinterpretation by focusing on key evidence and following coherent reasoning.

Title: Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs

Authors: Jaydip Sen, Saptarshi Sengupta, Subhasis Dasgupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05387
Pdf URL: https://arxiv.org/pdf/2506.05387
Copy Paste: [[2506.05387]] Advancing Decoding Strategies: Enhancements in Locally Typical Sampling for LLMs(https://arxiv.org/abs/2506.05387)
Keywords: large language model
Abstract: This chapter explores advancements in decoding strategies for large language models (LLMs), focusing on enhancing the Locally Typical Sampling (LTS) algorithm. Traditional decoding methods, such as top-k and nucleus sampling, often struggle to balance fluency, diversity, and coherence in text generation. To address these challenges, Adaptive Semantic-Aware Typicality Sampling (ASTS) is proposed as an improved version of LTS, incorporating dynamic entropy thresholding, multi-objective scoring, and reward-penalty adjustments. ASTS ensures contextually coherent and diverse text generation while maintaining computational efficiency. Its performance is evaluated across multiple benchmarks, including story generation and abstractive summarization, using metrics such as perplexity, MAUVE, and diversity scores. Experimental results demonstrate that ASTS outperforms existing sampling techniques by reducing repetition, enhancing semantic alignment, and improving fluency.
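For orientation, the baseline that ASTS builds on can be sketched in a few lines. The block below is a minimal illustration of plain locally typical sampling only: tokens are ranked by how close their surprisal is to the entropy of the next-token distribution, and the smallest set reaching a mass threshold is kept. It does not implement the dynamic entropy thresholding, multi-objective scoring, or reward-penalty adjustments described in the abstract; the function names and the `tau` parameter are hypothetical choices made for this sketch.

```python
import math
import random


def locally_typical_filter(probs, tau=0.9):
    """Keep the tokens whose surprisal is closest to the distribution's entropy.

    probs: list of next-token probabilities (assumed to sum to 1).
    tau:   cumulative probability mass to retain (hypothetical default).
    Returns a dict {token_index: renormalized_probability}.
    """
    # Entropy of the next-token distribution, in nats.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank tokens by |surprisal - entropy|; ties fall back to token index.
    ranked = sorted(
        (abs(-math.log(p) - entropy), i) for i, p in enumerate(probs) if p > 0
    )
    kept, mass = [], 0.0
    for _, i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= tau:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample_token(probs, tau=0.9, rng=random):
    """Draw one token index from the locally typical set."""
    filtered = locally_typical_filter(probs, tau)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Note that the most probable token is not necessarily ranked first: with probabilities `[0.5, 0.25, 0.125, 0.125]`, the token with probability 0.25 has the surprisal closest to the entropy and is selected before the 0.5 token.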

Title: Understanding Gender Bias in AI-Generated Product Descriptions

Authors: Markelle Kelly, Mohammad Tahaei, Padhraic Smyth, Lauren Wilcox
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05390
Pdf URL: https://arxiv.org/pdf/2506.05390
Copy Paste: [[2506.05390]] Understanding Gender Bias in AI-Generated Product Descriptions(https://arxiv.org/abs/2506.05390)
Keywords: large language model
Abstract: While gender bias in large language models (LLMs) has been extensively studied in many domains, uses of LLMs in e-commerce remain largely unexamined and may reveal novel forms of algorithmic bias and harm. Our work investigates this space, developing data-driven taxonomic categories of gender bias in the context of product description generation, which we situate with respect to existing general purpose harms taxonomies. We illustrate how AI-generated product descriptions can uniquely surface gender biases in ways that require specialized detection and mitigation approaches. Further, we quantitatively analyze issues corresponding to our taxonomic categories in two models used for this task -- GPT-3.5 and an e-commerce-specific LLM -- demonstrating that these forms of bias commonly occur in practice. Our results illuminate unique, under-explored dimensions of gender bias, such as assumptions about clothing size, stereotypical bias in which features of a product are advertised, and differences in the use of persuasive language. These insights contribute to our understanding of three types of AI harms identified by current frameworks: exclusionary norms, stereotyping, and performance disparities, particularly for the context of e-commerce.

Title: Are Large Language Models Good Temporal Graph Learners?

Authors: Shenyang Huang, Ali Parviz, Emma Kondrup, Zachary Yang, Zifeng Ding, Michael Bronstein, Reihaneh Rabbany, Guillaume Rabusseau
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05393
Pdf URL: https://arxiv.org/pdf/2506.05393
Copy Paste: [[2506.05393]] Are Large Language Models Good Temporal Graph Learners?(https://arxiv.org/abs/2506.05393)
Keywords: interpretability, explainability, large language model
Abstract: Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at this https URL.

Title: Attacking Attention of Foundation Models Disrupts Downstream Tasks

Authors: Hondamunige Prasanna Silva, Federico Becattini, Lorenzo Seidenari
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05394
Pdf URL: https://arxiv.org/pdf/2506.05394
Copy Paste: [[2506.05394]] Attacking Attention of Foundation Models Disrupts Downstream Tasks(https://arxiv.org/abs/2506.05394)
Keywords: security, attack, transformer, segmentation
Abstract: Foundation models represent the most prominent and recent paradigm shift in artificial intelligence. Foundation models are large models, trained on broad data, that deliver high accuracy in many downstream tasks, often without fine-tuning. For this reason, models such as CLIP, DINO, or Vision Transformers (ViT) are becoming the bedrock of many industrial AI-powered applications. However, the reliance on pre-trained foundation models also introduces significant security concerns, as these models are vulnerable to adversarial attacks. Such attacks involve deliberately crafted inputs designed to deceive AI systems, jeopardizing their reliability. This paper studies the vulnerabilities of vision foundation models, focusing specifically on CLIP and ViTs, and explores the transferability of adversarial attacks to downstream tasks. We introduce a novel attack, targeting the structure of transformer-based architectures in a task-agnostic fashion. We demonstrate the effectiveness of our attack on several downstream tasks: classification, captioning, image/text retrieval, segmentation and depth estimation.

Title: TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations

Authors: Mert Can Cakmak, Nitin Agarwal, Diwash Poudel
Subjects: cs.CV, cs.IR, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2506.05395
Pdf URL: https://arxiv.org/pdf/2506.05395
Copy Paste: [[2506.05395]] TriPSS: A Tri-Modal Keyframe Extraction Framework Using Perceptual, Structural, and Semantic Representations(https://arxiv.org/abs/2506.05395)
Keywords: robust, extraction, segmentation
Abstract: Efficient keyframe extraction is critical for effective video summarization and retrieval, yet capturing the complete richness of video content remains challenging. In this work, we present TriPSS, a novel tri-modal framework that effectively integrates perceptual cues from color features in the CIELAB space, deep structural embeddings derived from ResNet-50, and semantic context from frame-level captions generated by Llama-3.2-11B-Vision-Instruct. By fusing these diverse modalities using principal component analysis, TriPSS constructs robust multi-modal embeddings that enable adaptive segmentation of video content via HDBSCAN clustering. A subsequent refinement stage incorporating quality assessment and duplicate filtering ensures that the final keyframe set is both concise and semantically rich. Comprehensive evaluations on benchmark datasets TVSum20 and SumMe demonstrate that TriPSS achieves state-of-the-art performance, substantially outperforming traditional unimodal and previous multi-modal methods. These results underscore TriPSS's ability to capture nuanced visual and semantic information, thereby setting a new benchmark for video content understanding in large-scale retrieval scenarios.

Title: Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation

Authors: Luka Vetoshkin, Dmitry Yudin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.05396
Pdf URL: https://arxiv.org/pdf/2506.05396
Copy Paste: [[2506.05396]] Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation(https://arxiv.org/abs/2506.05396)
Keywords: segmentation
Abstract: Segmenting objects with complex shapes, such as wires, bicycles, or structural grids, remains a significant challenge for current segmentation models, including the Segment Anything Model (SAM) and its high-quality variant SAM-HQ. These models often struggle with thin structures and fine boundaries, leading to poor segmentation quality. We propose Talk2SAM, a novel approach that integrates textual guidance to improve segmentation of such challenging objects. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions, which are then projected into the DINO feature space. These features serve as additional prompts for SAM-HQ, enhancing its ability to focus on the target object. Beyond improving segmentation accuracy, Talk2SAM allows user-controllable segmentation, enabling disambiguation of objects within a single bounding box based on textual input. We evaluate our approach on three benchmarks: BIG, ThinObject5K, and DIS5K. Talk2SAM consistently outperforms SAM-HQ, achieving up to +5.9% IoU and +8.3% boundary IoU improvements. Our results demonstrate that incorporating natural language guidance provides a flexible and effective means for precise object segmentation, particularly in cases where traditional prompt-based methods fail. The source code is available on GitHub: this https URL

Title: Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation

Authors: Israa A. Albadarneh, Bassam H. Hammo, Omar S. Al-Kadi
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05399
Pdf URL: https://arxiv.org/pdf/2506.05399
Copy Paste: [[2506.05399]] Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation(https://arxiv.org/abs/2506.05399)
Keywords: transformer
Abstract: Image captioning involves generating textual descriptions from input images, bridging the gap between computer vision and natural language processing. Recent advancements in transformer-based models have significantly improved caption generation by leveraging attention mechanisms for better scene understanding. While various surveys have explored deep learning-based approaches for image captioning, few have comprehensively analyzed attention-based transformer models across multiple languages. This survey reviews attention-based image captioning models, categorizing them into transformer-based, deep learning-based, and hybrid approaches. It explores benchmark datasets, discusses evaluation metrics such as BLEU, METEOR, CIDEr, and ROUGE, and highlights challenges in multilingual captioning. Additionally, this paper identifies key limitations in current models, including semantic inconsistencies, data scarcity in non-English languages, and limitations in reasoning ability. Finally, we outline future research directions, such as multimodal learning, real-time applications in AI-powered assistants, healthcare, and forensic analysis. This survey serves as a comprehensive reference for researchers aiming to advance the field of attention-based image captioning.

Title: Auto Review: Second Stage Error Detection for Highly Accurate Information Extraction from Phone Conversations

Authors: Ayesha Qamar, Arushi Raghuvanshi, Conal Sathi, Youngseo Son
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05400
Pdf URL: https://arxiv.org/pdf/2506.05400
Copy Paste: [[2506.05400]] Auto Review: Second Stage Error Detection for Highly Accurate Information Extraction from Phone Conversations(https://arxiv.org/abs/2506.05400)
Keywords: extraction, large language model
Abstract: Automating benefit verification phone calls saves time in healthcare and helps patients receive treatment faster. It is critical to obtain highly accurate information in these phone calls, as it can affect a patient's healthcare journey. Given the noise in phone call transcripts, we have a two-stage system that involves a post-call review phase for potentially noisy fields, where human reviewers manually verify the extracted data, a labor-intensive task. To automate this stage, we introduce Auto Review, which significantly reduces manual effort while maintaining a high bar for accuracy. This system, being highly reliant on call transcripts, suffers a performance bottleneck due to automatic speech recognition (ASR) issues. This problem is further exacerbated by the use of domain-specific jargon in the calls. In this work, we propose a second-stage postprocessing pipeline for accurate information extraction. We improve accuracy by using multiple ASR alternatives and a pseudo-labeling approach that does not require manually corrected transcripts. Experiments with general-purpose large language models and feature-based model pipelines demonstrate substantial improvements in the quality of corrected call transcripts, thereby enhancing the efficiency of Auto Review.

Title: Robust Anti-Backdoor Instruction Tuning in LVLMs

Authors: Yuan Xun, Siyuan Liang, Xiaojun Jia, Xinwei Liu, Xiaochun Cao
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05401
Pdf URL: https://arxiv.org/pdf/2506.05401
Copy Paste: [[2506.05401]] Robust Anti-Backdoor Instruction Tuning in LVLMs(https://arxiv.org/abs/2506.05401)
Keywords: defense, attack, robust, steal
Abstract: Large visual language models (LVLMs) have demonstrated excellent instruction-following capabilities, yet remain vulnerable to stealthy backdoor attacks when finetuned using contaminated data. Existing backdoor defense techniques are usually developed for single-modal visual or language models under fully parameter-adjustable settings or rely on supervisory knowledge during training. However, in real-world scenarios, defenders cannot modify frozen visual encoders or core LLM parameters, nor possess prior knowledge of unknown trigger patterns or target responses. Motivated by the empirical finding that LVLMs readily overfit to fixed, unknown triggers, which can embed malicious associations during adapter-level tuning, we aim to design a defense that operates without access to core weights or attack priors. To this end, we introduce a lightweight, certified-agnostic defense framework, Robust Instruction Tuning, that finetunes only adapter modules and text embedding layers under instruction tuning. Our method integrates two complementary regularizations: (1) Input Diversity Regularization, which perturbs trigger components across training samples to disrupt consistent spurious cues; and (2) Anomalous Activation Regularization, which dynamically sparses adapter weights exhibiting abnormally sharp activations linked to backdoor patterns. These mechanisms jointly guide the model toward learning semantically grounded representations rather than memorizing superficial trigger-response mappings. Extensive experiments against seven attacks on Flickr30k and MSCOCO demonstrate that ours reduces their attack success rate to nearly zero, with an increase in training cost of less than 15%.

Title: Sylva: Tailoring Personalized Adversarial Defense in Pre-trained Models via Collaborative Fine-tuning

Authors: Tianyu Qi, Lei Xue, Yufeng Zhan, Xiaobo Ma
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05402
Pdf URL: https://arxiv.org/pdf/2506.05402
Copy Paste: [[2506.05402]] Sylva: Tailoring Personalized Adversarial Defense in Pre-trained Models via Collaborative Fine-tuning(https://arxiv.org/abs/2506.05402)
Keywords: security, privacy, defense, attack, robust, federate
Abstract: The growing adoption of large pre-trained models in edge computing has made deploying model inference on mobile clients both practical and popular. These devices are inherently vulnerable to direct adversarial attacks, which pose a substantial threat to the robustness and security of deployed models. Federated adversarial training (FAT) has emerged as an effective solution to enhance model robustness while preserving client privacy. However, FAT frequently produces a generalized global model, which struggles to address the diverse and heterogeneous data distributions across clients, resulting in insufficiently personalized performance, while also encountering substantial communication challenges during the training process. In this paper, we propose Sylva, a personalized collaborative adversarial training framework designed to deliver customized defense models for each client through a two-phase process. In Phase 1, Sylva employs LoRA for local adversarial fine-tuning, enabling clients to personalize model robustness while drastically reducing communication costs by uploading only LoRA parameters during federated aggregation. In Phase 2, a game-based layer selection strategy is introduced to enhance accuracy on benign data, further refining the personalized model. This approach ensures that each client receives a tailored defense model that balances robustness and accuracy effectively. Extensive experiments on benchmark datasets demonstrate that Sylva can achieve up to 50x improvements in communication efficiency compared to state-of-the-art algorithms, while achieving up to 29.5% and 50.4% enhancements in adversarial robustness and benign accuracy, respectively.

Title: Poisoning Behavioral-based Worker Selection in Mobile Crowdsensing using Generative Adversarial Networks

Authors: Ruba Nasser, Ahmed Alagha, Shakti Singh, Rabeb Mizouni, Hadi Otrok, Jamal Bentahar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05403
Pdf URL: https://arxiv.org/pdf/2506.05403
Copy Paste: [[2506.05403]] Poisoning Behavioral-based Worker Selection in Mobile Crowdsensing using Generative Adversarial Networks(https://arxiv.org/abs/2506.05403)
Keywords: attack, generative
Abstract: With the widespread adoption of Artificial intelligence (AI), AI-based tools and components are becoming omnipresent in today's solutions. However, these components and tools are posing a significant threat when it comes to adversarial attacks. Mobile Crowdsensing (MCS) is a sensing paradigm that leverages the collective participation of workers and their smart devices to collect data. One of the key challenges faced at the selection stage is ensuring task completion due to workers' varying behavior. AI has been utilized to tackle this challenge by building unique models for each worker to predict their behavior. However, the integration of AI into the system introduces vulnerabilities that can be exploited by malicious insiders to reduce the revenue obtained by victim workers. This work proposes an adversarial attack targeting behavioral-based selection models in MCS. The proposed attack leverages Generative Adversarial Networks (GANs) to generate poisoning points that can mislead the models during the training stage without being detected. This way, the potential damage introduced by GANs on worker selection in MCS can be anticipated. Simulation results using a real-life dataset show the effectiveness of the proposed attack in compromising the victim workers' model and evading detection by an outlier detector, compared to a benchmark. In addition, the impact of the attack on reducing the payment obtained by victim workers is evaluated.

Title: PCEvolve: Private Contrastive Evolution for Synthetic Dataset Generation via Few-Shot Private Data and Generative APIs

Authors: Jianqing Zhang, Yang Liu, Jie Fu, Yang Hua, Tianyuan Zou, Jian Cao, Qiang Yang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05407
Pdf URL: https://arxiv.org/pdf/2506.05407
Copy Paste: [[2506.05407]] PCEvolve: Private Contrastive Evolution for Synthetic Dataset Generation via Few-Shot Private Data and Generative APIs(https://arxiv.org/abs/2506.05407)
Keywords: privacy, protect, diffusion, generative
Abstract: The rise of generative APIs has fueled interest in privacy-preserving synthetic data generation. While the Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs, it struggles with few-shot private data due to the limitations of its DP-protected similarity voting approach. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. To address this challenge, we propose a novel API-assisted algorithm, Private Contrastive Evolution (PCEvolve), which iteratively mines inherent inter-class contrastive relationships in few-shot private data beyond individual data points and seamlessly integrates them into an adapted Exponential Mechanism (EM) to optimize DP's utility in an evolution loop. We conduct extensive experiments on four specialized datasets, demonstrating that PCEvolve outperforms PE and other API-assisted baselines. These results highlight the potential of leveraging API access with private data for quality evaluation, enabling the generation of high-quality DP synthetic images and paving the way for more accessible and effective privacy-preserving generative API applications. Our code is available at this https URL.

Title: Differentially Private Federated $k$-Means Clustering with Server-Side Data

Authors: Jonathan Scott, Christoph H. Lampert, David Saulpic
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05408
Pdf URL: https://arxiv.org/pdf/2506.05408
Copy Paste: [[2506.05408]] Differentially Private Federated $k$-Means Clustering with Server-Side Data(https://arxiv.org/abs/2506.05408)
Keywords: privacy, federate
Abstract: Clustering is a cornerstone of data analysis that is particularly suited to identifying coherent subgroups or substructures in unlabeled data, as are generated continuously in large amounts these days. However, in many cases traditional clustering methods are not applicable, because data are increasingly being produced and stored in a distributed way, e.g. on edge devices, and privacy concerns prevent it from being transferred to a central server. To address this challenge, we present \acronym, a new algorithm for $k$-means clustering that is fully-federated as well as differentially private. Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyds algorithm we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. We also provide a theoretical analysis of our method that provides bounds on the convergence speed and cluster identification success.

Title: Object-level Self-Distillation for Vision Pretraining

Authors: Çağlar Hızlı, Çağatay Yıldız, Pekka Marttinen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05409
Pdf URL: https://arxiv.org/pdf/2506.05409
Copy Paste: [[2506.05409]] Object-level Self-Distillation for Vision Pretraining(https://arxiv.org/abs/2506.05409)
Keywords: transformer
Abstract: State-of-the-art vision pretraining methods rely on image-level self-distillation from object-centric datasets such as ImageNet, implicitly assuming each image contains a single object. This assumption does not always hold: many ImageNet images already contain multiple objects. Further, it limits scalability to scene-centric datasets that better mirror real-world complexity. We address these challenges by introducing Object-level Self-DIStillation (ODIS), a pretraining approach that shifts the self-distillation granularity from whole images to individual objects. Using object-aware cropping and masked attention, ODIS isolates object-specific regions, guiding the transformer toward semantically meaningful content and transforming a noisy, scene-level task into simpler object-level sub-tasks. We show that this approach improves visual representations both at the image and patch levels. Using masks at inference time, our method achieves an impressive 82.6% k-NN accuracy on ImageNet1k with ViT-Large.

Title: Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs

Authors: Wanyun Cui, Mingwei Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05410
Pdf URL: https://arxiv.org/pdf/2506.05410
Copy Paste: [[2506.05410]] Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs(https://arxiv.org/abs/2506.05410)
Keywords: large language model
Abstract: Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights (local homogeneity), adjacent values demonstrate distinct heterogeneous distributions. This key-value asymmetry reveals a critical limitation in existing compression methods that treat keys and values uniformly. To address the limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H2O (38.89) by a large margin.

Title: QA-HFL: Quality-Aware Hierarchical Federated Learning for Resource-Constrained Mobile Devices with Heterogeneous Image Quality

Authors: Sajid Hussain, Muhammad Sohail, Nauman Ali Khan
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05411
Pdf URL: https://arxiv.org/pdf/2506.05411
Copy Paste: [[2506.05411]] QA-HFL: Quality-Aware Hierarchical Federated Learning for Resource-Constrained Mobile Devices with Heterogeneous Image Quality(https://arxiv.org/abs/2506.05411)
Keywords: privacy, protect, federate
Abstract: This paper introduces QA-HFL, a quality-aware hierarchical federated learning framework that efficiently handles heterogeneous image quality across resource-constrained mobile devices. Our approach trains specialized local models for different image quality levels and aggregates their features using a quality-weighted fusion mechanism, while incorporating differential privacy protection. Experiments on MNIST demonstrate that QA-HFL achieves 92.31% accuracy after just three federation rounds, significantly outperforming state-of-the-art methods like FedRolex (86.42%). Under strict privacy constraints, our approach maintains 30.77% accuracy with formal differential privacy guarantees. Counter-intuitively, low-end devices contributed most significantly (63.5%) to the final model despite using 100 fewer parameters than high-end counterparts. Our quality-aware approach addresses accuracy decline through device-specific regularization, adaptive weighting, intelligent client selection, and server-side knowledge distillation, while maintaining efficient communication with a 4.71% compression ratio. Statistical analysis confirms that our approach significantly outperforms baseline methods (p < 0.01) under both standard and privacy-constrained conditions.

Title: Can Vision Language Models Infer Human Gaze Direction? A Controlled Study

Authors: Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05412
Pdf URL: https://arxiv.org/pdf/2506.05412
Copy Paste: [[2506.05412]] Can Vision Language Models Infer Human Gaze Direction? A Controlled Study(https://arxiv.org/abs/2506.05412)
Keywords: robust
Abstract: Gaze-referential inference--the ability to infer what others are looking at--is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, comparing performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by considering them as random guessers. Instead, they likely use a combination of heuristics and guessing such that their performance is subject to the task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.

Title: SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs

Authors: Patrik Czakó, Gábor Kertész, Sándor Szénási
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05413
Pdf URL: https://arxiv.org/pdf/2506.05413
Copy Paste: [[2506.05413]] SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs(https://arxiv.org/abs/2506.05413)
Keywords: large language model
Abstract: We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at this https URL.
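The general idea of pairing channel-wise scaling with a Hadamard rotation can be illustrated on a toy linear layer. The sketch below is not the SmoothRot implementation: the SmoothQuant-style scale formula, the value of `alpha`, the order in which scaling and rotation are applied, and all function names are assumptions made for illustration. It only shows that such a transform can shrink an activation outlier while leaving the layer output `X @ W` unchanged, because the scaling and the orthonormal rotation cancel between activations and weights.

```python
import math


def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of 2)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def smooth_and_rotate(x, w, alpha=0.5):
    """Return (x', w') with x' @ w' == x @ w up to rounding.

    x: activations, shape (tokens, d); w: weights, shape (d, out).
    Assumes every channel has a nonzero activation and a nonzero weight.
    """
    d = len(w)
    act_max = [max(abs(row[j]) for row in x) for j in range(d)]
    wt_max = [max(abs(v) for v in w[j]) for j in range(d)]
    # Assumed SmoothQuant-style per-channel scale: migrates range from x to w.
    s = [act_max[j] ** alpha / wt_max[j] ** (1 - alpha) for j in range(d)]
    x_scaled = [[row[j] / s[j] for j in range(d)] for row in x]
    w_scaled = [[v * s[j] for v in w[j]] for j in range(d)]
    # Rotation spreads what remains of the outlier across all channels.
    h = hadamard(d)
    h_t = [list(col) for col in zip(*h)]
    return matmul(x_scaled, h), matmul(h_t, w_scaled)
```

With activations such as `[[1.0, -2.0, 40.0, 0.5], [0.5, 1.0, -36.0, 1.0]]`, where the third channel is an outlier, the transformed activations have a much smaller maximum magnitude, which is the property that makes low-bit quantization easier.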

Title: SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing

Authors: Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, Eli Shlizerman
Subjects: cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.05414
Pdf URL: https://arxiv.org/pdf/2506.05414
Copy Paste: [[2506.05414]] SAVVY: Spatial Awareness via Audio-Visual LLMs through Seeing and Hearing(https://arxiv.org/abs/2506.05414)
Keywords: large language model
Abstract: 3D spatial reasoning in dynamic, audio-visual environments is a cornerstone of human cognition yet remains largely unexplored by existing Audio-Visual Large Language Models (AV-LLMs) and benchmarks, which predominantly focus on static or 2D scenes. We introduce SAVVY-Bench, the first benchmark for 3D spatial reasoning in dynamic scenes with synchronized spatial audio. SAVVY-Bench is comprised of thousands of relationships involving static and moving objects, and requires fine-grained temporal grounding, consistent 3D localization, and multi-modal annotation. To tackle this challenge, we propose SAVVY, a novel training-free reasoning pipeline that consists of two stages: (i) Egocentric Spatial Tracks Estimation, which leverages AV-LLMs as well as other audio-visual methods to track the trajectories of key objects related to the query using both visual and spatial audio cues, and (ii) Dynamic Global Map Construction, which aggregates multi-modal queried object trajectories and converts them into a unified global dynamic map. Using the constructed map, a final QA answer is obtained through a coordinate transformation that aligns the global map with the queried viewpoint. Empirical evaluation demonstrates that SAVVY substantially enhances performance of state-of-the-art AV-LLMs, setting a new standard and stage for approaching dynamic 3D spatial reasoning in AV-LLMs.

Title: FERRET: Private Deep Learning Faster And Better Than DPSGD

Authors: David Zagardo
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05416
Pdf URL: https://arxiv.org/pdf/2506.05416
Copy Paste: [[2506.05416]] FERRET: Private Deep Learning Faster And Better Than DPSGD(https://arxiv.org/abs/2506.05416)
Keywords: privacy
Abstract: We revisit 1-bit gradient compression through the lens of mutual-information differential privacy (MI-DP). Building on signSGD, we propose FERRET--Fast and Effective Restricted Release for Ethical Training--which transmits at most one sign bit per parameter group with Bernoulli masking. Theory: We prove each fired group leaks at most ln 2 nats; after subsampling with rate s, the total privacy loss of G groups trained for T steps with firing probability p is epsilon = G * T * s * p * ln 2. Thus FERRET achieves MI-DP for epsilon in [0.1, 2] without additive noise. Practice: We evaluate three granularities--FERRET-MAX (finest), FERRET-EIGHTH (medium), and FERRET-2 (coarsest)--on five LLMs (137M-1.8B parameters) against DPSGD and Non-DP baselines. All methods trained for 1, 3, and 5 epochs. Utility: Across all settings, FERRET-MAX/EIGHTH beat DPSGD's perplexity. At epsilon=0.5, 5 epochs: FERRET-EIGHTH achieves 3.98 perplexity vs DPSGD's 11.61 (2.9x better), within 23% of Non-DP (3.25). Privacy: MI-AUC stays at chance for FERRET-MAX/EIGHTH (~0.51), matching DPSGD vs Non-DP's 0.76-0.99. FERRET-2 shows higher leakage (~0.55) due to lower headroom. Efficiency: Stricter budgets fire fewer signs, so FERRET uses 19-33% of DPSGD's training time and only 34-36% of Non-DP training time. Take-away: Sign-based MI-DP gets closer to achieving all three qualities of the privacy, utility, performance trilemma: FERRET trains up to 5x faster, achieves 3x lower perplexity compared to DPSGD and 1.2x greater than Non-DP, all while providing formal, mathematically provable privacy guarantees using zero additive noise. The results also show that, in certain instances, masked 1-bit updates can match non-private training utility while safeguarding data.
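The privacy accounting stated in the abstract, epsilon = G * T * s * p * ln 2, can be worked through numerically. The sketch below evaluates that formula and inverts it to find a firing probability for a target budget. The function names, the example numbers, and the masked sign step (which takes the sign of each group's summed gradient) are hypothetical illustrations, not the paper's code.

```python
import math
import numpy as np

def ferret_epsilon(groups, steps, sampling_rate, fire_prob):
    """Total MI-DP loss per the abstract: G * T * s * p * ln 2 (in nats)."""
    return groups * steps * sampling_rate * fire_prob * math.log(2)

def fire_prob_for_budget(epsilon, groups, steps, sampling_rate):
    """Invert the formula: firing probability p that meets a target epsilon."""
    return epsilon / (groups * steps * sampling_rate * math.log(2))

def masked_sign_update(group_grads, fire_prob, rng):
    """Assumed masking step: each group releases one sign bit with prob. p, else 0.

    group_grads: (G, k) array, one row of gradients per parameter group.
    """
    fired = rng.random(group_grads.shape[0]) < fire_prob
    signs = np.sign(group_grads.sum(axis=1))
    return np.where(fired, signs, 0.0)

# Illustrative numbers only (not taken from the paper).
eps = ferret_epsilon(groups=10, steps=100, sampling_rate=0.01, fire_prob=0.05)
p = fire_prob_for_budget(0.5, groups=10, steps=100, sampling_rate=0.01)
print(eps, p)
```

Under this formula, a stricter budget translates directly into a lower firing probability, which is consistent with the abstract's remark that stricter budgets fire fewer signs.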

Title: Better STEP, a format and dataset for boundary representation

Authors: Nafiseh Izadyar, Sai Chandra Madduri, Teseo Schneider
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05417
Pdf URL: https://arxiv.org/pdf/2506.05417
Copy Paste: [[2506.05417]] Better STEP, a format and dataset for boundary representation(https://arxiv.org/abs/2506.05417)
Keywords: segmentation
Abstract: Boundary representation (B-rep) generated from computer-aided design (CAD) is widely used in industry, with several large datasets available. However, the data in these datasets is represented in STEP format, requiring a CAD kernel to read and process it. This dramatically limits their scope and usage in large learning pipelines, as it constrains the possibility of deploying them on computing clusters due to the high cost of per-node licenses. This paper introduces an alternative format based on the open, cross-platform format HDF5 and a corresponding dataset for STEP files, paired with an open-source library to query and process them. Our Python package also provides standard functionalities such as sampling, normals, and curvature to ease integration in existing pipelines. To demonstrate the effectiveness of our format, we converted the Fusion 360 dataset and the ABC dataset. We developed four standard use cases (normal estimation, denoising, surface reconstruction, and segmentation) to assess the integrity of the data and its compliance with the original STEP files.

Title: Self-Predictive Dynamics for Generalization of Vision-based Reinforcement Learning

Authors: Kyungsoo Kim, Jeongsoo Ha, Yusung Kim
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05418
Pdf URL: https://arxiv.org/pdf/2506.05418
Copy Paste: [[2506.05418]] Self-Predictive Dynamics for Generalization of Vision-based Reinforcement Learning(https://arxiv.org/abs/2506.05418)
Keywords: robust
Abstract: Vision-based reinforcement learning requires efficient and robust representations of image-based observations, especially when the images contain distracting (task-irrelevant) elements such as shadows, clouds, and light. It becomes more important if those distractions are not exposed during training. We design a Self-Predictive Dynamics (SPD) method to extract task-relevant features efficiently, even in unseen observations after training. SPD uses weak and strong augmentations in parallel, and learns representations by predicting inverse and forward transitions across the two-way augmented versions. In a set of MuJoCo visual control tasks and an autonomous driving task (CARLA), SPD outperforms previous studies in complex observations, and significantly improves the generalization performance for unseen observations. Our code is available at this https URL.

Title: Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions

Authors: Jeongsoo Ha, Kyungsoo Kim, Yusung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05419
Pdf URL: https://arxiv.org/pdf/2506.05419
Copy Paste: [[2506.05419]] Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions(https://arxiv.org/abs/2506.05419)
Keywords: robust
Abstract: Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks in high-dimensional image observations. Although recent MBRL algorithms perform well in trained observations, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at this https URL.

Title: TRIDENT -- A Three-Tier Privacy-Preserving Propaganda Detection Model in Mobile Networks using Transformers, Adversarial Learning, and Differential Privacy

Authors: Al Nahian Bin Emran, Dhiman Goswami, Md Hasan Ullah Sadi, Sanchari Das
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2506.05421
Pdf URL: https://arxiv.org/pdf/2506.05421
Copy Paste: [[2506.05421]] TRIDENT -- A Three-Tier Privacy-Preserving Propaganda Detection Model in Mobile Networks using Transformers, Adversarial Learning, and Differential Privacy(https://arxiv.org/abs/2506.05421)
Keywords: privacy, protect, defense, transformer
Abstract: The proliferation of propaganda on mobile platforms raises critical concerns around detection accuracy and user privacy. To address this, we propose TRIDENT, a three-tier propaganda detection model implementing transformers, adversarial learning, and differential privacy, which integrates syntactic obfuscation and label perturbation to mitigate privacy leakage while maintaining propaganda detection accuracy. TRIDENT leverages multilingual back-translation to introduce semantic variance, character-level noise, and entity obfuscation for differential privacy enforcement, and combines these techniques into a unified defense mechanism. On a binary propaganda classification dataset, baseline transformer models (BERT, GPT-2) achieved F1 scores of 0.89 and 0.90. Applying TRIDENT's third-tier defense yields a reduced but effective cumulative F1 of 0.83, demonstrating strong privacy protection across mobile ML deployments with minimal degradation.

Title: SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Authors: Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05425
Pdf URL: https://arxiv.org/pdf/2506.05425
Copy Paste: [[2506.05425]] SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning(https://arxiv.org/abs/2506.05425)
Keywords: large language model
Abstract: The rich and multifaceted nature of human social interaction, encompassing multimodal cues, unobservable relations and mental states, and dynamical behavior, presents a formidable challenge for artificial intelligence. To advance research in this area, we introduce SIV-Bench, a novel video benchmark for rigorously evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. The clips were originally collected from TikTok and YouTube, covering a wide range of video genres, presentation styles, and linguistic and cultural backgrounds. The benchmark also includes a dedicated setup for analyzing the impact of different textual cues: original on-screen text, added dialogue, or no text. Our comprehensive experiments on leading MLLMs reveal that while models adeptly handle SSU, they significantly struggle with SSR and SDP, where Relation Inference (RI) is an acute bottleneck, as further examined in our analysis. Our study also confirms the critical role of transcribed dialogue in aiding comprehension of complex social interactions. By systematically identifying current MLLMs' strengths and limitations, SIV-Bench offers crucial insights to steer the development of more socially intelligent AI. The dataset and code are available at this https URL.

Title: Mixture-of-Experts Meets In-Context Reinforcement Learning

Authors: Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05426
Pdf URL: https://arxiv.org/pdf/2506.05426
Copy Paste: [[2506.05426]] Mixture-of-Experts Meets In-Context Reinforcement Learning(https://arxiv.org/abs/2506.05426)
Keywords: transformer
Abstract: In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at this https URL.
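The two parallel MoE layers described in the abstract can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: top-1 routing, linear experts, and a task representation taken as the mean of the token features are all illustrative choices, and `TinyMoE` and `parallel_moe_block` are hypothetical names rather than the authors' code.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

class TinyMoE:
    """Top-1 mixture of linear experts (an assumed, simplified design)."""

    def __init__(self, d_in, d_out, n_experts, rng):
        self.router = rng.normal(size=(d_in, n_experts)) * 0.1
        self.experts = rng.normal(size=(n_experts, d_in, d_out)) * 0.1

    def __call__(self, x, route_on):
        # x: (tokens, d_in) inputs; route_on: (tokens, d_in) features seen by the router.
        gates = softmax(route_on @ self.router)           # (tokens, n_experts)
        top = gates.argmax(axis=-1)                        # chosen expert per token
        out = np.einsum("td,tdo->to", x, self.experts[top])
        return out * gates[np.arange(len(top)), top][:, None]

def parallel_moe_block(tokens, token_moe, task_moe):
    """Token-wise and task-wise MoE run in parallel; their outputs are concatenated."""
    # Assumption: the task is summarized by the mean token feature, so every
    # token in the sequence is routed to the same task expert.
    task_repr = np.repeat(tokens.mean(axis=0, keepdims=True), len(tokens), axis=0)
    return np.concatenate(
        [token_moe(tokens, tokens), task_moe(tokens, task_repr)], axis=-1
    )

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
block_out = parallel_moe_block(tokens, TinyMoE(8, 4, 3, rng), TinyMoE(8, 4, 2, rng))
print(block_out.shape)
```

The token-wise branch routes each token on its own features, whereas the task-wise branch routes the whole sequence on a shared task summary; concatenation doubles the feature width handed to the next layer.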

Title: Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction

Authors: Zhihao Tang, Chaozhuo Li, Litian Zhang, Xi Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05428
Pdf URL: https://arxiv.org/pdf/2506.05428
Copy Paste: [[2506.05428]] Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction(https://arxiv.org/abs/2506.05428)
Keywords: robust, diffusion
Abstract: Early prediction of Mild Cognitive Impairment (MCI) conversion is hampered by a trade-off between immediacy--making fast predictions from a single baseline sMRI--and accuracy--leveraging longitudinal scans to capture disease progression. We propose MCI-Diff, a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data, achieving both real-time risk assessment and high predictive performance. First, a multi-task sequence reconstruction strategy trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling and learn robust latent trajectories. Second, an LLM-driven "linguistic compass" is introduced for clinical plausibility sampling: generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers, guiding autoregressive generation toward realistic disease patterns. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines, improving early conversion accuracy by 5-12%.

Title: Coordinated Robustness Evaluation Framework for Vision-Language Models

Authors: Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05429
Pdf URL: https://arxiv.org/pdf/2506.05429
Copy Paste: [[2506.05429]] Coordinated Robustness Evaluation Framework for Vision-Language Models(https://arxiv.org/abs/2506.05429)
Keywords: attack, robust
Abstract: Vision-language models, which integrate computer vision and natural language processing capabilities, have demonstrated significant advancements in tasks such as image captioning and visual question and answering. However, similar to traditional models, they are susceptible to small perturbations, posing a challenge to their robustness, particularly in deployment scenarios. Evaluating the robustness of these models requires perturbations in both the vision and language modalities to learn their inter-modal dependencies. In this work, we train a generic surrogate model that can take both image and text as input and generate joint representation which is further used to generate adversarial perturbations for both the text and image modalities. This coordinated attack strategy is evaluated on the visual question and answering and visual reasoning datasets using various state-of-the-art vision-language models. Our results indicate that the proposed strategy outperforms other multi-modal attacks and single-modality attacks from the recent literature. Our results demonstrate their effectiveness in compromising the robustness of several state-of-the-art pre-trained multi-modal models such as instruct-BLIP, ViLT and others.

Title: Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models

Authors: Mingjie Chen (Zhejiang University), Tiancheng Zhu (Huazhong University of Science and Technology), Mingxue Zhang (The State Key Laboratory of Blockchain and Data Security, Zhejiang University & Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security), Yiling He (University College London), Minghao Lin (University of Southern California), Penghui Li (Columbia University), Kui Ren (The State Key Laboratory of Blockchain and Data Security, Zhejiang University)
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05430
Pdf URL: https://arxiv.org/pdf/2506.05430
Copy Paste: [[2506.05430]] Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models(https://arxiv.org/abs/2506.05430)
Keywords: security, attack, robust
Abstract: Binary code similarity detection (BCSD) serves as a fundamental technique for various software engineering tasks, e.g., vulnerability detection and classification. Attacks against such models have therefore drawn extensive attention, aiming at misleading the models to generate erroneous predictions. Prior works have explored various approaches to generating semantic-preserving variants, i.e., adversarial samples, to evaluate the robustness of the models against adversarial attacks. However, they have mainly relied on heuristic criteria or iterative greedy algorithms to locate salient code influencing the model output, failing to operate on a solid theoretical basis. Moreover, when processing programs with high complexities, such attacks tend to be time-consuming. In this work, we propose a novel optimization for adversarial attacks against BCSD models. In particular, we aim to improve the attacks in a challenging scenario, where the attack goal is to limit the model predictions to a specific range, i.e., the targeted attacks. Our attack leverages the superior capability of black-box, model-agnostic explainers in interpreting the model decision boundaries, thereby pinpointing the critical code snippet to apply semantic-preserving perturbations. The evaluation results demonstrate that compared with the state-of-the-art attacks, the proposed attacks achieve higher attack success rate in almost all scenarios, while also improving the efficiency and transferability. Our real-world case studies on vulnerability detection and classification further demonstrate the security implications of our attacks, highlighting the urgent need to further enhance the robustness of existing BCSD models.

Title: Robustness Evaluation for Video Models with Reinforcement Learning

Authors: Ashwin Ramesh Babu, Sajad Mousavi, Vineet Gundecha, Sahand Ghorbanpour, Avisek Naug, Antonio Guillen, Ricardo Luna Gutierrez, Soumyendu Sarkar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05431
Pdf URL: https://arxiv.org/pdf/2506.05431
Copy Paste: [[2506.05431]] Robustness Evaluation for Video Models with Reinforcement Learning(https://arxiv.org/abs/2506.05431)
Keywords: attack, robust
Abstract: Evaluating the robustness of video classification models is very challenging, especially when compared to image-based models. With their added temporal dimension, there is a significant increase in complexity and computational cost. One of the key challenges is to keep the perturbations needed to induce misclassification to a minimum. In this work, we propose a multi-agent reinforcement learning approach (spatial and temporal) that cooperatively learns to identify the given video's sensitive spatial and temporal regions. The agents consider temporal coherence in generating fine perturbations, leading to a more effective and visually imperceptible attack. Our method outperforms the state-of-the-art solutions on the Lp metric and the average queries. Our method enables custom distortion types, making the robustness evaluation more relevant to the use case. We extensively evaluate 4 popular models for video action recognition on two popular datasets, HMDB-51 and UCF-101.

Title: PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling

Authors: Yuxuan Yue, Zukang Xu, Zhihang Yuan, Dawei Yang, Jianglong Wu, Liqiang Nie
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05432
Pdf URL: https://arxiv.org/pdf/2506.05432
Copy Paste: [[2506.05432]] PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling(https://arxiv.org/abs/2506.05432)
Keywords: large language model
Abstract: Large Language Models (LLMs) face significant challenges in edge deployment due to their massive parameter scale. Vector Quantization (VQ), a clustering-based quantization method, serves as a prevalent solution to this issue thanks to its extremely low bit width (even 2-bit) and considerable accuracy. Since a vector is a quantity in mathematics and physics that has both direction and magnitude, existing VQ works typically quantize them in a coupled manner. However, we find that direction exhibits significantly greater sensitivity to quantization compared to the magnitude. For instance, when separately clustering the directions and magnitudes of weight vectors in LLaMA-2-7B, the accuracy drops on zero-shot tasks are 46.5% and 2.3%, respectively. This gap even increases with the reduction of clustering centers. Further, Euclidean distance, a common metric to assess vector similarities in current VQ works, places greater emphasis on reducing the magnitude error. This property is contrary to the above finding, unavoidably leading to larger quantization errors. To these ends, this paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework consisting of two key modules: 1) Polar Coordinate Decoupling (PCD), which transforms vectors into their polar coordinate representations and performs independent quantization of the direction and magnitude parameters. 2) Distribution Aligned Codebook Construction (DACC), which optimizes the direction and magnitude codebooks in accordance with the source distribution. Experimental results show that PCDVQ outperforms baseline methods at 2-bit level by at least 1.5% zero-shot accuracy, establishing a novel paradigm for accurate and highly compressed LLMs.
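The idea of quantizing direction and magnitude independently can be illustrated with a small NumPy sketch. It represents each vector as a unit direction plus a norm (instead of explicit polar angles), picks the direction codeword by cosine similarity, and picks the magnitude codeword by nearest scalar. The codebooks and function names here are hypothetical; the paper's DACC codebook construction is not reproduced.

```python
import numpy as np

def decouple(vectors):
    """Split each row into a unit direction and a scalar magnitude."""
    mags = np.linalg.norm(vectors, axis=1)
    dirs = vectors / np.maximum(mags, 1e-12)[:, None]
    return dirs, mags

def quantize(vectors, dir_codebook, mag_codebook):
    """Quantize direction and magnitude independently; returns two index arrays."""
    dirs, mags = decouple(vectors)
    dir_idx = (dirs @ dir_codebook.T).argmax(axis=1)   # highest cosine similarity
    mag_idx = np.abs(mags[:, None] - mag_codebook[None, :]).argmin(axis=1)
    return dir_idx, mag_idx

def dequantize(dir_idx, mag_idx, dir_codebook, mag_codebook):
    """Rebuild vectors as (magnitude codeword) * (direction codeword)."""
    return dir_codebook[dir_idx] * mag_codebook[mag_idx][:, None]

# Toy codebooks (assumed): 8 axis-aligned unit directions in 4-D, 3 magnitudes.
dir_codebook = np.vstack([np.eye(4), -np.eye(4)])
mag_codebook = np.array([0.5, 1.0, 2.0])

rng = np.random.default_rng(0)
weights = rng.normal(size=(5, 4))
d_idx, m_idx = quantize(weights, dir_codebook, mag_codebook)
approx = dequantize(d_idx, m_idx, dir_codebook, mag_codebook)
print(np.linalg.norm(weights - approx, axis=1))
```

Selecting the direction codeword by angle rather than by Euclidean distance is one way to avoid the magnitude-dominated error that the abstract attributes to coupled quantization.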

Title: Efficient Robust Conformal Prediction via Lipschitz-Bounded Networks

Authors: Thomas Massena (IRIT, DTIPG - SNCF, UT3), Léo Andéol (IMT, DTIPG - SNCF, UT3), Thibaut Boissin (IRIT, UT3), Franck Mamalet, Corentin Friedrich, Mathieu Serrurier (IRIT, UT3), Sébastien Gerchinovitz (IMT)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05434
Pdf URL: https://arxiv.org/pdf/2506.05434
Copy Paste: [[2506.05434]] Efficient Robust Conformal Prediction via Lipschitz-Bounded Networks(https://arxiv.org/abs/2506.05434)
Keywords: attack, robust
Abstract: Conformal Prediction (CP) has proven to be an effective post-hoc method for improving the trustworthiness of neural networks by providing prediction sets with finite-sample guarantees. However, under adversarial attacks, classical conformal guarantees do not hold anymore: this problem is addressed in the field of Robust Conformal Prediction. Several methods have been proposed to provide robust CP sets with guarantees under adversarial perturbations, but, for large scale problems, these sets are either too large or the methods are too computationally demanding to be deployed in real life scenarios. In this work, we propose a new method that leverages Lipschitz-bounded networks to precisely and efficiently estimate robust CP sets. When combined with a 1-Lipschitz robust network, we demonstrate that our lip-rcp method outperforms state-of-the-art results in both the size of the robust CP sets and computational efficiency in medium and large-scale scenarios such as ImageNet. Taking a different angle, we also study vanilla CP under attack, and derive new worst-case coverage bounds of vanilla CP sets, which are valid simultaneously for all adversarial attack levels. Our lip-rcp method makes this second approach as efficient as vanilla CP while also allowing robustness guarantees.
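For orientation, the sketch below shows vanilla split conformal prediction and one simple way a Lipschitz bound can be used: if the nonconformity score changes by at most L * eps under a perturbation of size eps, inflating the threshold by L * eps keeps the true label in the set. This conservative inflation is an assumption made for illustration and is not claimed to be the paper's lip-rcp procedure; all names are hypothetical.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration scores for coverage 1 - alpha."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:                        # too few calibration points: trivial set
        return np.inf
    return np.sort(cal_scores)[k - 1]

def prediction_set(label_scores, threshold, lipschitz=0.0, eps=0.0):
    """Labels whose nonconformity score is within the (optionally inflated) threshold.

    label_scores: one score per candidate label for a single test input.
    lipschitz * eps bounds how far an eps-perturbation can move each score.
    """
    return np.flatnonzero(label_scores <= threshold + lipschitz * eps)

rng = np.random.default_rng(0)
cal_scores = rng.random(200)         # toy calibration scores
q = conformal_threshold(cal_scores, alpha=0.1)
label_scores = np.array([0.10, 0.50, 0.95, 0.99])
print(prediction_set(label_scores, q))
print(prediction_set(label_scores, q, lipschitz=1.0, eps=0.05))
```

The robust set always contains the vanilla set, and a smaller Lipschitz constant means less inflation and therefore smaller sets.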

Title: An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics

Authors: Tongda Sun, Chen Yin, Huailiang Zheng, Yining Dong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05438
Pdf URL: https://arxiv.org/pdf/2506.05438
Copy Paste: [[2506.05438]] An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics(https://arxiv.org/abs/2506.05438)
Keywords: extraction
Abstract: A health indicator (HI) plays a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and overlook capturing dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI for degradation trend representation and prognostics. To address these concerns, a novel dynamic HI that considers HI-level temporal dependence is constructed through an unsupervised framework. Specifically, a degradation feature learning module composed of a skip-connection-based autoencoder first maps raw signals to a representative degradation feature space (DFS) to automatically extract essential degradation features without the need for expert knowledge. Subsequently, in this DFS, a new HI-generating module embedded with an inner HI-prediction block is proposed for dynamic HI construction, where the temporal dependence between past and current HI states is guaranteed and modeled explicitly. On this basis, the dynamic HI captures the inherent dynamic contents of the degradation process, ensuring its effectiveness for degradation tendency modeling and future degradation prognostics. Experimental results on two bearing lifecycle datasets demonstrate that the proposed HI construction method outperforms comparison methods, and the constructed dynamic HI is superior for prognostic tasks.

Title: U-NetMN and SegNetMN: Modified U-Net and SegNet models for bimodal SAR image segmentation

Authors: Marwane Kzadri, Franco Alberto Cardillo, Nanée Chahinian, Carole Delenne, Renaud Hostache, Jamal Riffi
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2506.05444
Pdf URL: https://arxiv.org/pdf/2506.05444
Copy Paste: [[2506.05444]] U-NetMN and SegNetMN: Modified U-Net and SegNet models for bimodal SAR image segmentation(https://arxiv.org/abs/2506.05444)
Keywords: segmentation
Abstract: Segmenting Synthetic Aperture Radar (SAR) images is crucial for many remote sensing applications, particularly water body detection. However, deep learning-based segmentation models often face challenges related to convergence speed and stability, mainly due to the complex statistical distribution of this type of data. In this study, we evaluate the impact of mode normalization on two widely used semantic segmentation models, U-Net and SegNet. Specifically, we integrate mode normalization to reduce convergence time while maintaining the performance of the baseline models. Experimental results demonstrate that mode normalization significantly accelerates convergence. Furthermore, cross-validation results indicate that normalized models exhibit increased stability in different zones. These findings highlight the effectiveness of normalization in improving computational efficiency and generalization in SAR image segmentation.

Title: Causal Policy Learning in Reinforcement Learning: Backdoor-Adjusted Soft Actor-Critic

Authors: Thanh Vinh Vo, Young Lee, Haozhe Ma, Chien Lu, Tze-Yun Leong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05445
Pdf URL: https://arxiv.org/pdf/2506.05445
Copy Paste: [[2506.05445]] Causal Policy Learning in Reinforcement Learning: Backdoor-Adjusted Soft Actor-Critic(https://arxiv.org/abs/2506.05445)
Keywords: robust
Abstract: Hidden confounders that influence both states and actions can bias policy learning in reinforcement learning (RL), leading to suboptimal or non-generalizable behavior. Most RL algorithms ignore this issue, learning policies from observational trajectories based solely on statistical associations rather than causal effects. We propose DoSAC (Do-Calculus Soft Actor-Critic with Backdoor Adjustment), a principled extension of the SAC algorithm that corrects for hidden confounding via causal intervention estimation. DoSAC estimates the interventional policy $\pi(a | \mathrm{do}(s))$ using the backdoor criterion, without requiring access to true confounders or causal labels. To achieve this, we introduce a learnable Backdoor Reconstructor that infers pseudo-past variables (previous state and action) from the current state to enable backdoor adjustment from observational data. This module is integrated into a soft actor-critic framework to compute both the interventional policy and its entropy. Empirical results on continuous control benchmarks show that DoSAC outperforms baselines under confounded settings, with improved robustness, generalization, and policy reliability.

Title: Sentinel: SOTA model to protect against prompt injections

Authors: Dror Ivry, Oran Nahum
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05446
Pdf URL: https://arxiv.org/pdf/2506.05446
Copy Paste: [[2506.05446]] Sentinel: SOTA model to protect against prompt injections(https://arxiv.org/abs/2506.05446)
Keywords: protect, attack, large language model
Abstract: Large Language Models (LLMs) are increasingly powerful but remain vulnerable to prompt injection attacks, where malicious inputs cause the model to deviate from its intended instructions. This paper introduces Sentinel, a novel detection model, qualifire/prompt-injection-sentinel, based on the answerdotai/ModernBERT-large architecture. By leveraging ModernBERT's advanced features and fine-tuning on an extensive and diverse dataset comprising a few open-source and private collections, Sentinel achieves state-of-the-art performance. This dataset amalgamates varied attack types, from role-playing and instruction hijacking to attempts to generate biased content, alongside a broad spectrum of benign instructions, with private datasets specifically targeting nuanced error correction and real-world misclassifications. On a comprehensive, unseen internal test set, Sentinel demonstrates an average accuracy of 0.987 and an F1-score of 0.980. Furthermore, when evaluated on public benchmarks, it consistently outperforms strong baselines like protectai/deberta-v3-base-prompt-injection-v2. This work details Sentinel's architecture, its meticulous dataset curation, its training methodology, and a thorough evaluation, highlighting its superior detection capabilities.

Title: MLLM-CL: Continual Learning for Multimodal Large Language Models

Authors: Hongbo Zhao, Fei Zhu, Rundong Wang, Gaofeng Meng, Zhaoxiang Zhang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05453
Pdf URL: https://arxiv.org/pdf/2506.05453
Copy Paste: [[2506.05453]] MLLM-CL: Continual Learning for Multimodal Large Language Models(https://arxiv.org/abs/2506.05453)
Keywords: large language model
Abstract: Recent Multimodal Large Language Models (MLLMs) excel in vision-language understanding but face challenges in adapting to dynamic real-world scenarios that require continuous integration of new knowledge and skills. While continual learning (CL) offers a potential solution, existing benchmarks and methods suffer from critical limitations. In this paper, we introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning, where the former focuses on independently and identically distributed (IID) evaluation across evolving mainstream domains, whereas the latter evaluates on non-IID scenarios with emerging model ability. Methodologically, we propose preventing catastrophic interference through parameter isolation, along with an MLLM-based routing mechanism. Extensive experiments demonstrate that our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.

Title: Zeroth-Order Optimization Finds Flat Minima

Authors: Liang Zhang, Bingcong Li, Kiran Koshy Thekumparampil, Sewoong Oh, Michael Muehlebach, Niao He
Subjects: cs.LG, cs.AI, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05454
Pdf URL: https://arxiv.org/pdf/2506.05454
Copy Paste: [[2506.05454]] Zeroth-Order Optimization Finds Flat Minima(https://arxiv.org/abs/2506.05454)
Keywords: attack
Abstract: Zeroth-order methods are extensively used in machine learning applications where gradients are infeasible or expensive to compute, such as black-box attacks, reinforcement learning, and language model fine-tuning. Existing optimization theory focuses on convergence to an arbitrary stationary point, but less is known on the implicit regularization that provides a fine-grained characterization on which particular solutions are finally reached. We show that zeroth-order optimization with the standard two-point estimator favors solutions with small trace of Hessian, which is widely used in previous work to distinguish between sharp and flat minima. We further provide convergence rates of zeroth-order optimization to approximate flat minima for convex and sufficiently smooth functions, where flat minima are defined as the minimizers that achieve the smallest trace of Hessian among all optimal solutions. Experiments on binary classification tasks with convex losses and language model fine-tuning support our theoretical findings.
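
The two-point estimator mentioned in the abstract can be illustrated with a minimal sketch. It assumes Gaussian random directions and a symmetric (central) difference; the paper's exact variant may differ. The names `two_point_grad`, `zo_descent`, and `quadratic`, and the step-size and smoothing values, are hypothetical choices for illustration only. The sketch shows the estimator itself, not the flat-minima result.

```python
import random


def two_point_grad(f, x, mu=1e-3, rng=random):
    """Estimate the gradient of f at x from two function evaluations.

    Assumption: the direction u is standard Gaussian and the difference
    is symmetric around x. No gradient of f is ever computed.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    # Finite-difference estimate of the directional derivative along u.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [scale * ui for ui in u]


def zo_descent(f, x0, lr=0.05, steps=2000, mu=1e-3, seed=0):
    """Plain descent that uses the two-point estimate in place of a gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = two_point_grad(f, x, mu, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    """Toy convex objective with its minimizer at (1, 1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


x_final = zo_descent(quadratic, [0.0, 0.0, 0.0])
print(quadratic(x_final))
```

On this toy quadratic the symmetric difference recovers the directional derivative up to rounding, so the iterates contract towards the minimizer using function values alone.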

Title: Towards Reliable Identification of Diffusion-based Image Manipulations

Authors: Alex Costanzino, Woody Bayliss, Juil Sock, Marc Gorriz Blanch, Danijela Horak, Ivan Laptev, Philip Torr, Fabio Pizzati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05466
Pdf URL: https://arxiv.org/pdf/2506.05466
Copy Paste: [[2506.05466]] Towards Reliable Identification of Diffusion-based Image Manipulations(https://arxiv.org/abs/2506.05466)
Keywords: diffusion
Abstract: Changing facial expressions, gestures, or background details may dramatically alter the meaning conveyed by an image. Notably, recent advances in diffusion models greatly improve the quality of image manipulation while also opening the door to misuse. Identifying changes made to authentic images, thus, becomes an important task, constantly challenged by new diffusion-based editing tools. To this end, we propose a novel approach for ReliAble iDentification of inpainted AReas (RADAR). RADAR builds on existing foundation models and combines features from different image modalities. It also incorporates an auxiliary contrastive loss that helps to isolate manipulated image patches. We demonstrate these techniques to significantly improve both the accuracy of our method and its generalisation to a large number of diffusion models. To support realistic evaluation, we further introduce BBC-PAIR, a new comprehensive benchmark, with images tampered by 28 diffusion models. Our experiments show that RADAR achieves excellent results, outperforming the state-of-the-art in detecting and localising image edits made by both seen and unseen diffusion models. Our code, data and models will be publicly available at this http URL.

Title: F2T2-HiT: A U-Shaped FFT Transformer and Hierarchical Transformer for Reflection Removal

Authors: Jie Cai, Kangning Yang, Ling Ouyang, Lan Fu, Jiaming Ding, Huiming Sun, Chiu Man Ho, Zibo Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05489
Pdf URL: https://arxiv.org/pdf/2506.05489
Copy Paste: [[2506.05489]] F2T2-HiT: A U-Shaped FFT Transformer and Hierarchical Transformer for Reflection Removal(https://arxiv.org/abs/2506.05489)
Keywords: extraction, transformer
Abstract: Single Image Reflection Removal (SIRR) technique plays a crucial role in image processing by eliminating unwanted reflections from the background. These reflections, often caused by photographs taken through glass surfaces, can significantly degrade image quality. SIRR remains a challenging problem due to the complex and varied reflections encountered in real-world scenarios. These reflections vary significantly in intensity, shapes, light sources, sizes, and coverage areas across the image, posing challenges for most existing methods to effectively handle all cases. To address these challenges, this paper introduces a U-shaped Fast Fourier Transform Transformer and Hierarchical Transformer (F2T2-HiT) architecture, an innovative Transformer-based design for SIRR. Our approach uniquely combines Fast Fourier Transform (FFT) Transformer blocks and Hierarchical Transformer blocks within a UNet framework. The FFT Transformer blocks leverage the global frequency domain information to effectively capture and separate reflection patterns, while the Hierarchical Transformer blocks utilize multi-scale feature extraction to handle reflections of varying sizes and complexities. Extensive experiments conducted on three publicly available testing datasets demonstrate state-of-the-art performance, validating the effectiveness of our approach.

Title: Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models

Authors: Sima Noorani, Shayan Kiyani, George Pappas, Hamed Hassani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05497
Pdf URL: https://arxiv.org/pdf/2506.05497
Copy Paste: [[2506.05497]] Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models(https://arxiv.org/abs/2506.05497)
Keywords: generative, large language model
Abstract: Uncertainty quantification (UQ) is essential for safe deployment of generative AI models such as large language models (LLMs), especially in high stakes applications. Conformal prediction (CP) offers a principled uncertainty quantification framework, but classical methods focus on regression and classification, relying on geometric distances or softmax scores: tools that presuppose structured outputs. We depart from this paradigm by studying CP in a query only setting, where prediction sets must be constructed solely from finite queries to a black box generative model, introducing a new trade off between coverage, test time query budget, and informativeness. We introduce Conformal Prediction with Query Oracle (CPQ), a framework characterizing the optimal interplay between these objectives. Our finite sample algorithm is built on two core principles: one governs the optimal query policy, and the other defines the optimal mapping from queried samples to prediction sets. Remarkably, both are rooted in the classical missing mass problem in statistics. Specifically, the optimal query policy depends on the rate of decay, or the derivative, of the missing mass, for which we develop a novel estimator. Meanwhile, the optimal mapping hinges on the missing mass itself, which we estimate using Good Turing estimators. We then turn our focus to implementing our method for language models, where outputs are vast, variable, and often under specified. Fine grained experiments on three real world open ended tasks and two LLMs, show CPQ applicability to any black box LLM and highlight: (1) individual contribution of each principle to CPQ performance, and (2) CPQ ability to yield significantly more informative prediction sets than existing conformal methods for language uncertainty quantification.
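
The Good-Turing estimate of missing mass referred to in the abstract can be sketched as follows. This is the classical estimator (the fraction of samples whose value was seen exactly once), not the CPQ algorithm itself. The function name, the toy responses, and the convention of returning 1.0 for an empty sample are assumptions made for illustration. It also assumes responses can be compared by exact equality, whereas open-ended LLM outputs would likely need some grouping of equivalent answers first.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Classical Good-Turing estimate of the unseen probability mass.

    Returns (number of distinct values seen exactly once) / (sample size).
    Assumption: an empty sample is treated as "everything is unseen".
    """
    n = len(samples)
    if n == 0:
        return 1.0
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Hypothetical responses from repeated queries to a black-box model.
responses = ["Paris", "Paris", "Lyon", "Paris", "Nice", "Lyon"]
print(good_turing_missing_mass(responses))
```

Here only "Nice" appears once among six samples, so the estimated mass of answers not yet observed is 1/6.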

Title: The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models

Authors: Alex Damian, Jason D. Lee, Joan Bruna
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05500
Pdf URL: https://arxiv.org/pdf/2506.05500
Copy Paste: [[2506.05500]] The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models(https://arxiv.org/abs/2506.05500)
Keywords: generative
Abstract: In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the generative leap exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of $n=\Theta(d^{1 \vee k^\star/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with $r$-dimensional first hidden layer).

Title: FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

Authors: Kaihang Pan, Wendong Bu, Yuruo Wu, Yang Wu, Kai Shen, Yunfei Li, Hang Zhao, Juncheng Li, Siliang Tang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05501
Pdf URL: https://arxiv.org/pdf/2506.05501
Copy Paste: [[2506.05501]] FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL(https://arxiv.org/abs/2506.05501)
Keywords: diffusion
Abstract: Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment thus failing to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, further introducing a novel reinforcement learning algorithm to emphasize such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.

Title: StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models

Authors: Ya Jiang, Chuxiong Wu, Massieh Kordi Boroujeny, Brian Mark, Kai Zeng
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05502
Pdf URL: https://arxiv.org/pdf/2506.05502
Copy Paste: [[2506.05502]] StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models(https://arxiv.org/abs/2506.05502)
Keywords: steal, watermark, large language model
Abstract: Watermarking for large language models (LLMs) offers a promising approach to identifying AI-generated text. Existing approaches, however, either compromise the distribution of original generated text by LLMs or are limited to embedding zero-bit information that only allows for watermark detection but ignores identification. We present StealthInk, a stealthy multi-bit watermarking scheme that preserves the original text distribution while enabling the embedding of provenance data, such as userID, TimeStamp, and modelID, within LLM-generated text. This enhances fast traceability without requiring access to the language model's API or prompts. We derive a lower bound on the number of tokens necessary for watermark detection at a fixed equal error rate, which provides insights on how to enhance the capacity. Comprehensive empirical evaluations across diverse tasks highlight the stealthiness, detectability, and resilience of StealthInk, establishing it as an effective solution for LLM watermarking applications.

Title: MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Authors: Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05523
Pdf URL: https://arxiv.org/pdf/2506.05523
Copy Paste: [[2506.05523]] MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning(https://arxiv.org/abs/2506.05523)
Keywords: robust, generative
Abstract: Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including Gemini 2.5 Pro and OpenAI o3, which represented the strongest available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.

Title: Spectral Graph Neural Networks are Incomplete on Graphs with a Simple Spectrum

Authors: Snir Hordan, Maya Bechler-Speicher, Gur Lifshitz, Nadav Dym
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05530
Pdf URL: https://arxiv.org/pdf/2506.05530
Copy Paste: [[2506.05530]] Spectral Graph Neural Networks are Incomplete on Graphs with a Simple Spectrum(https://arxiv.org/abs/2506.05530)
Keywords: transformer
Abstract: Spectral features are widely incorporated within Graph Neural Networks (GNNs) to improve their expressive power, or their ability to distinguish among non-isomorphic graphs. One popular example is the usage of graph Laplacian eigenvectors for positional encoding in MPNNs and Graph Transformers. The expressive power of such Spectrally-enhanced GNNs (SGNNs) is usually evaluated via the k-WL graph isomorphism test hierarchy and homomorphism counting. Yet, these frameworks align poorly with the graph spectra, yielding limited insight into SGNNs' expressive power. We leverage a well-studied paradigm of classifying graphs by their largest eigenvalue multiplicity to introduce an expressivity hierarchy for SGNNs. We then prove that many SGNNs are incomplete even on graphs with distinct eigenvalues. To mitigate this deficiency, we adapt rotation equivariant neural networks to the graph spectra setting to propose a method to provably improve SGNNs' expressivity on simple spectrum graphs. We empirically verify our theoretical claims via an image classification experiment on the MNIST Superpixel dataset and eigenvector canonicalization on graphs from ZINC.

Title: Personalized Interpretability -- Interactive Alignment of Prototypical Parts Networks

Authors: Tomasz Michalski, Adam Wróbel, Andrea Bontempelli, Jakub Luśtyk, Mikolaj Kniejski, Stefano Teso, Andrea Passerini, Bartosz Zieliński, Dawid Rymarczyk
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2506.05533
Pdf URL: https://arxiv.org/pdf/2506.05533
Copy Paste: [[2506.05533]] Personalized Interpretability -- Interactive Alignment of Prototypical Parts Networks(https://arxiv.org/abs/2506.05533)
Keywords: interpretability
Abstract: Concept-based interpretable neural networks have gained significant attention due to their intuitive and easy-to-understand explanations based on case-based reasoning, such as "this bird looks like those sparrows". However, a major limitation is that these explanations may not always be comprehensible to users due to concept inconsistency, where multiple visual features are inappropriately mixed (e.g., a bird's head and wings treated as a single concept). This inconsistency breaks the alignment between model reasoning and human understanding. Furthermore, users have specific preferences for how concepts should look, yet current approaches provide no mechanism for incorporating their feedback. To address these issues, we introduce YoursProtoP, a novel interactive strategy that enables the personalization of prototypical parts - the visual concepts used by the model - according to user needs. By incorporating user supervision, YoursProtoP adapts and splits concepts used for both prediction and explanation to better match the user's preferences and understanding. Through experiments on both the synthetic FunnyBirds dataset and a real-world scenario using the CUB, CARS, and PETS datasets in a comprehensive user study, we demonstrate the effectiveness of YoursProtoP in achieving concept consistency without compromising the accuracy of the model.

Title: SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms

Authors: Arnesh Batra, Anushk Kumar, Jashn Khemani, Arush Gumber, Arhan Jain, Somil Gupta
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05538
Pdf URL: https://arxiv.org/pdf/2506.05538
Copy Paste: [[2506.05538]] SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms(https://arxiv.org/abs/2506.05538)
Keywords: security, robust, generative
Abstract: The rapid advancement of deep generative models has significantly improved the realism of synthetic media, presenting both opportunities and security challenges. While deepfake technology has valuable applications in entertainment and accessibility, it has emerged as a potent vector for misinformation campaigns, particularly on social media. Existing detection frameworks struggle to distinguish between benign and adversarially generated deepfakes engineered to manipulate public perception. To address this challenge, we introduce SocialDF, a curated dataset reflecting real-world deepfake challenges on social media platforms. This dataset encompasses high-fidelity deepfakes sourced from various online ecosystems, ensuring broad coverage of manipulative techniques. We propose a novel LLM-based multi-factor detection approach that combines facial recognition, automated speech transcription, and a multi-agent LLM pipeline to cross-verify audio-visual cues. Our methodology emphasizes robust, multi-modal verification techniques that incorporate linguistic, behavioral, and contextual analysis to effectively discern synthetic media from authentic content.

Title: Agentomics-ML: Autonomous Machine Learning Experimentation Agent for Genomic and Transcriptomic Data

Authors: Vlastimil Martinek, Andrea Gariboldi, Dimosthenis Tzimotoudis, Aitor Alberdi Escudero, Edward Blake, David Cechak, Luke Cassar, Alessandro Balestrucci, Panagiotis Alexiou
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2506.05542
Pdf URL: https://arxiv.org/pdf/2506.05542
Copy Paste: [[2506.05542]] Agentomics-ML: Autonomous Machine Learning Experimentation Agent for Genomic and Transcriptomic Data(https://arxiv.org/abs/2506.05542)
Keywords: large language model
Abstract: The adoption of machine learning (ML) and deep learning methods has revolutionized molecular medicine by driving breakthroughs in genomics, transcriptomics, drug discovery, and biological systems modeling. The increasing quantity, multimodality, and heterogeneity of biological datasets demand automated methods that can produce generalizable predictive models. Recent developments in large language model-based agents have shown promise for automating end-to-end ML experimentation on structured benchmarks. However, when applied to heterogeneous computational biology datasets, these methods struggle with generalization and success rates. Here, we introduce Agentomics-ML, a fully autonomous agent-based system designed to produce a classification model and the necessary files for reproducible training and inference. Our method follows predefined steps of an ML experimentation process, repeatedly interacting with the file system through Bash to complete individual steps. Once an ML model is produced, training and validation metrics provide scalar feedback to a reflection step to identify issues such as overfitting. This step then creates verbal feedback for future iterations, suggesting adjustments to steps such as data representation, model architecture, and hyperparameter choices. We have evaluated Agentomics-ML on several established genomic and transcriptomic benchmark datasets and show that it outperforms existing state-of-the-art agent-based methods in both generalization and success rates. While state-of-the-art models built by domain experts still lead in absolute performance on the majority of the computational biology datasets used in this work, Agentomics-ML narrows the gap for fully autonomous systems and achieves state-of-the-art performance on one of the used benchmark datasets. The code is available at this https URL.

Title: FRAME: Pre-Training Video Feature Representations via Anticipation and Memory

Authors: Sethuraman TV, Savya Khosla, Vignesh Srinivasakumar, Jiahui Huang, Seoung Wug Oh, Simon Jenni, Derek Hoiem, Joon-Young Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05543
Pdf URL: https://arxiv.org/pdf/2506.05543
Copy Paste: [[2506.05543]] FRAME: Pre-Training Video Feature Representations via Anticipation and Memory(https://arxiv.org/abs/2506.05543)
Keywords: segmentation
Abstract: Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.

Title: Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos

Authors: Vadim Tschernezki, Diane Larlus, Andrea Vedaldi, Iro Laina
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05546
Pdf URL: https://arxiv.org/pdf/2506.05546
Copy Paste: [[2506.05546]] Layered Motion Fusion: Lifting Motion Segmentation to 3D in Egocentric Videos(https://arxiv.org/abs/2506.05546)
Keywords: segmentation
Abstract: Computer vision is largely based on 2D techniques, with 3D vision still relegated to a relatively narrow subset of applications. However, by building on recent advances in 3D models such as neural radiance fields, some authors have shown that 3D techniques can at last improve outputs extracted from independent 2D views, by fusing them into 3D and denoising them. This is particularly helpful in egocentric videos, where the camera motion is significant, but only under the assumption that the scene itself is static. In fact, as shown in the recent analysis conducted by EPIC Fields, 3D techniques are ineffective when it comes to studying dynamic phenomena, and, in particular, when segmenting moving objects. In this paper, we look into this issue in more detail. First, we propose to improve dynamic segmentation in 3D by fusing motion segmentation predictions from a 2D-based model into layered radiance fields (Layered Motion Fusion). However, the high complexity of long, dynamic videos makes it challenging to capture the underlying geometric structure, and, as a result, hinders the fusion of motion cues into the (incomplete) scene geometry. We address this issue through test-time refinement, which helps the model to focus on specific frames, thereby reducing the data complexity. This results in a synergy between motion fusion and the refinement, and in turn leads to segmentation predictions of the 3D model that surpass the 2D baseline by a large margin. This demonstrates that 3D techniques can enhance 2D analysis even for dynamic phenomena in a challenging and realistic setting.

Title: When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

Authors: Yan Shu, Hangui Lin, Yexin Liu, Yan Zhang, Gangyan Zeng, Yan Li, Yu Zhou, Ser-Nam Lim, Harry Yang, Nicu Sebe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05551
Pdf URL: https://arxiv.org/pdf/2506.05551
Copy Paste: [[2506.05551]] When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding(https://arxiv.org/abs/2506.05551)
Keywords: transformer
Abstract: Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in LLM with stronger attention focus on scene text regions are less prone to producing semantic hallucinations. Thus, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.

Title: EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh

Authors: Tao Hu, Haoyang Peng, Xiao Liu, Yuewen Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05554
Pdf URL: https://arxiv.org/pdf/2506.05554
Copy Paste: [[2506.05554]] EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh(https://arxiv.org/abs/2506.05554)
Keywords: robust, diffusion
Abstract: Generating high-quality camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data from monocular videos alone. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.

Title: On-the-fly Reconstruction for Large-Scale Novel View Synthesis from Unposed Images

Authors: Andreas Meuleman, Ishaan Shah, Alexandre Lanvin, Bernhard Kerbl, George Drettakis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05558
Pdf URL: https://arxiv.org/pdf/2506.05558
Copy Paste: [[2506.05558]] On-the-fly Reconstruction for Large-Scale Novel View Synthesis from Unposed Images(https://arxiv.org/abs/2506.05558)
Keywords: robust
Abstract: Radiance field methods such as 3D Gaussian Splatting (3DGS) allow easy reconstruction from photos, enabling free-viewpoint navigation. Nonetheless, pose estimation using Structure from Motion and 3DGS optimization can still each take between minutes and hours of computation after capture is complete. SLAM methods combined with 3DGS are fast but struggle with wide camera baselines and large scenes. We present an on-the-fly method to produce camera poses and a trained 3DGS immediately after capture. Our method can handle dense and wide-baseline captures of ordered photo sequences and large-scale scenes. To do this, we first introduce fast initial pose estimation, exploiting learned features and a GPU-friendly mini bundle adjustment. We then introduce direct sampling of Gaussian primitive positions and shapes, incrementally spawning primitives where required, significantly accelerating training. These two efficient steps allow fast and robust joint optimization of poses and Gaussian primitives. Our incremental approach handles large-scale scenes by introducing scalable radiance field construction, progressively clustering 3DGS primitives, storing them in anchors, and offloading them from the GPU. Clustered primitives are progressively merged, keeping the required scale of 3DGS at any viewpoint. We evaluate our solution on a variety of datasets and show that our solution can provide on-the-fly processing of all the capture scenarios and scene sizes we target while remaining competitive with other methods that only handle specific capture styles or scene sizes in speed, image quality, or both.

Title: Improving LLMs with a knowledge from databases

Authors: Petr Máša
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05560
Pdf URL: https://arxiv.org/pdf/2506.05560
Copy Paste: [[2506.05560]] Improving LLMs with a knowledge from databases(https://arxiv.org/abs/2506.05560)
Keywords: large language model
Abstract: Large language models (LLMs) are making significant progress at a rapid pace. Many advanced techniques have been introduced and widely accepted, like retrieval-augmented generation (RAG), agents, and tools. Tools can query a database to answer questions from structured data files or perform groupings or other statistics. This unlocks huge opportunities, such as the ability to answer any question, but also poses threats, such as safety risks, because there is no control over the commands that are created. We would like to discuss whether we can create a new method that improves answers based on a dataset/database via interpretable ML methods, namely enhanced association rules. It would be an advantage if the method could also be used within a safe technique like RAG. Association rules have a sound history. Since the introduction of CN2 and Apriori, many enhancements have been made. In parallel, enhanced association rules have been introduced and have evolved over the last 40 years. The general problem is typically that there are too many rules. There are some techniques for handling this, but when LLMs emerged, this turned out to be a natural use case for the RAG technique. We propose a method that generates a ruleset based on defined knowledge patterns, then converts the rules into text form via a rule-to-text converter, and supplies the result to the LLM via RAG. We compared this method with ChatGPT (even with agents) and found a significant improvement in answering questions based on the dataset. We also tried several strategies for how many rules to generate. We found this improvement interesting. Moreover, the method can be improved in many ways in future work, such as incorporating other patterns and using rule mining as an agent.

Title: Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning

Authors: Arian Raje, Baris Askin, Divyansh Jhunjhunwala, Gauri Joshi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05568
Pdf URL: https://arxiv.org/pdf/2506.05568
Copy Paste: [[2506.05568]] Ravan: Multi-Head Low-Rank Adaptation for Federated Fine-Tuning(https://arxiv.org/abs/2506.05568)
Keywords: robust, federate, large language model
Abstract: Large language models (LLMs) have not yet effectively leveraged the vast amounts of edge-device data, and federated learning (FL) offers a promising paradigm to collaboratively fine-tune LLMs without transferring private edge data to the cloud. To operate within the computation and communication constraints of edge devices, recent literature on federated fine-tuning of LLMs proposes the use of low-rank adaptation (LoRA) and similar parameter-efficient methods. However, LoRA-based methods suffer from accuracy degradation in FL settings, primarily because of data and computational heterogeneity across clients. We propose \textsc{Ravan}, an adaptive multi-head LoRA method that balances parameter efficiency and model expressivity by reparameterizing the weight updates as the sum of multiple LoRA heads $s_i\textbf{B}_i\textbf{H}_i\textbf{A}_i$ in which only the core matrices $\textbf{H}_i$ and their lightweight scaling factors $s_i$ are trained. These trainable scaling factors let the optimization focus on the most useful heads, recovering a higher-rank approximation of the full update without increasing the number of communicated parameters since clients upload $s_i\textbf{H}_i$ directly. Experiments on vision and language benchmarks show that \textsc{Ravan} improves test accuracy by 2-8\% over prior parameter-efficient baselines, making it a robust and scalable solution for federated fine-tuning of LLMs.
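
A minimal sketch of the reparameterized update, i.e. the sum over heads of $s_i\textbf{B}_i\textbf{H}_i\textbf{A}_i$, is given below. The dimensions, the random initialization of the frozen matrices, and all function names are assumptions made for illustration, not the paper's implementation.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_update(heads):
    """Return the sum over heads of s_i * B_i @ H_i @ A_i."""
    d_out = len(heads[0]["B"])
    d_in = len(heads[0]["A"][0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for head in heads:
        term = matmul(matmul(head["B"], head["H"]), head["A"])
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += head["s"] * term[i][j]
    return delta


rng = random.Random(0)
d_out, d_in, r, n_heads = 6, 5, 2, 3  # assumed toy sizes
heads = [
    {
        "B": rand_matrix(d_out, r, rng),  # frozen (assumed random init)
        "A": rand_matrix(r, d_in, rng),   # frozen (assumed random init)
        "H": rand_matrix(r, r, rng),      # trainable core matrix
        "s": 1.0,                         # trainable scaling factor
    }
    for _ in range(n_heads)
]

delta_w = multi_head_update(heads)
uploaded = n_heads * r * r  # entries in the s_i * H_i products a client sends
print(len(delta_w), len(delta_w[0]), uploaded)
```

In this toy setting a client would upload 12 numbers instead of the 30 entries of the full update, while the summed update can have rank up to `n_heads * r`.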

Title: PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

Authors: Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, Katerina Fragkiadaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05573
Pdf URL: https://arxiv.org/pdf/2506.05573
Copy Paste: [[2506.05573]] PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers(https://arxiv.org/abs/2506.05573)
Keywords: diffusion, transformer, generative
Abstract: We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.

Title: When can in-context learning generalize out of task distribution?

Authors: Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab
Subjects: cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, q-bio.NC, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05574
Pdf URL: https://arxiv.org/pdf/2506.05574
Copy Paste: [[2506.05574]] When can in-context learning generalize out of task distribution?(https://arxiv.org/abs/2506.05574)
Keywords: transformer
Abstract: In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emph{out-of-distribution}. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.

Title: Conformal Prediction Adaptive to Unknown Subpopulation Shifts

Authors: Nien-Shao Wang, Duygu Nur Yaldiz, Yavuz Faruk Bakman, Sai Praneeth Karimireddy
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05583
Pdf URL: https://arxiv.org/pdf/2506.05583
Copy Paste: [[2506.05583]] Conformal Prediction Adaptive to Unknown Subpopulation Shifts(https://arxiv.org/abs/2506.05583)
Keywords: transformer, large language model
Abstract: Conformal prediction is widely used to equip black-box machine learning models with uncertainty quantification enjoying formal coverage guarantees. However, these guarantees typically break down in the presence of distribution shifts, where the data distribution at test time differs from the training (or calibration-time) distribution. In this work, we address subpopulation shifts, where the test environment exhibits an unknown and differing mixture of subpopulations compared to the calibration data. We propose new methods that provably adapt conformal prediction to such shifts, ensuring valid coverage without requiring explicit knowledge of subpopulation structure. Our algorithms scale to high-dimensional settings and perform effectively in realistic machine learning tasks. Extensive experiments on vision (with vision transformers) and language (with large language models) benchmarks demonstrate that our methods reliably maintain coverage and control risk in scenarios where standard conformal prediction fails.
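
For context, the standard split conformal procedure that the abstract takes as its starting point can be sketched as below. This is the baseline only, not the shift-adaptive method the paper proposes; the score `1 - p(label)` and the function names are assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration nonconformity scores.

    Uses the k-th smallest score with k = ceil((n + 1) * (1 - alpha)).
    If k exceeds n, no finite threshold gives the guarantee, so return inf.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(class_probs, threshold):
    """Keep every label whose score 1 - p(label) is within the threshold."""
    return [label for label, p in class_probs.items() if 1.0 - p <= threshold]


# Toy calibration scores (assumed), one per held-out calibration example.
cal_scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.5)
labels = prediction_set({"cat": 0.7, "dog": 0.25, "bird": 0.05}, threshold)
print(threshold, labels)
```

The coverage guarantee of this baseline relies on calibration and test data being exchangeable, which is exactly the assumption a subpopulation shift violates.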

Title: TabFlex: Scaling Tabular Learning to Millions with Linear Attention

Authors: Yuchen Zeng, Tuan Dinh, Wonjun Kang, Andreas C Mueller
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05584
Pdf URL: https://arxiv.org/pdf/2506.05584
Copy Paste: [[2506.05584]] TabFlex: Scaling Tabular Learning to Millions with Linear Attention(https://arxiv.org/abs/2506.05584)
Keywords: large language model
Abstract: Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel in small-scale tabular datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to quadratic-complexity self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2x speedup compared to TabPFN and a 1.5x speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.
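
To illustrate why linear attention scales better, here is a generic kernelized (non-causal) linear attention sketch. The feature map `elu(x) + 1` is a common choice assumed here; the abstract does not say which variant TabFlex uses, so this should not be read as the model's actual layer.

```python
import math


def phi(x):
    """Positive feature map elu(v) + 1, applied elementwise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Summarize keys/values once, then reuse the summary for every query."""
    d_k, d_v, n = len(K[0]), len(V[0]), len(K)
    fk = [phi(k) for k in K]
    # S[a][b] = sum_j phi(k_j)[a] * v_j[b];  z[a] = sum_j phi(k_j)[a]
    S = [[sum(fk[j][a] * V[j][b] for j in range(n)) for b in range(d_v)]
         for a in range(d_k)]
    z = [sum(fk[j][a] for j in range(n)) for a in range(d_k)]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V):
    """Same kernel attention, computed pairwise for comparison."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(len(K))) / total
                    for b in range(len(V[0]))])
    return out


Q = [[0.5, -1.0], [1.5, 0.2]]
K = [[0.1, 0.3], [-0.7, 2.0], [1.0, -0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(linear_attention(Q, K, V))
```

The first function touches each key once to build `S` and `z`, so its cost grows linearly with the number of rows, whereas the reference computes every query-key pair.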

Title: CoFrNets: Interpretable Neural Architecture Inspired by Continued Fractions

Authors: Isha Puri, Amit Dhurandhar, Tejaswini Pedapati, Kartikeyan Shanmugam, Dennis Wei, Kush R. Varshney
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05586
Pdf URL: https://arxiv.org/pdf/2506.05586
Copy Paste: [[2506.05586]] CoFrNets: Interpretable Neural Architecture Inspired by Continued Fractions(https://arxiv.org/abs/2506.05586)
Keywords: interpretability
Abstract: In recent years there has been a considerable amount of research on local post hoc explanations for neural networks. However, work on building interpretable neural architectures has been relatively sparse. In this paper, we present a novel neural architecture, CoFrNet, inspired by the form of continued fractions which are known to have many attractive properties in number theory, such as fast convergence of approximations to real numbers. We show that CoFrNets can be efficiently trained as well as interpreted leveraging their particular functional form. Moreover, we prove that such architectures are universal approximators based on a proof strategy that is different than the typical strategy used to prove universal approximation results for neural networks based on infinite width (or depth), which is likely to be of independent interest. We experiment on nonlinear synthetic functions and are able to accurately model as well as estimate feature attributions and even higher order terms in some cases, which is a testament to the representational power as well as interpretability of such architectures. To further showcase the power of CoFrNets, we experiment on seven real datasets spanning tabular, text and image modalities, and show that they are either comparable or significantly better than other interpretable models and multilayer perceptrons, sometimes approaching the accuracies of state-of-the-art models.
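
The fast convergence of continued-fraction approximations mentioned above can be seen in a small numeric sketch. It evaluates truncations of the classical expansion sqrt(2) = [1; 2, 2, 2, ...]; it illustrates only the number-theoretic property, not the CoFrNet architecture, and the function name is an assumption.

```python
import math
from fractions import Fraction


def evaluate_continued_fraction(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) exactly, from the innermost term out."""
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value


# Successive truncations of the expansion of sqrt(2).
convergents = [evaluate_continued_fraction([1] + [2] * depth) for depth in range(4)]
errors = [abs(float(c) - math.sqrt(2)) for c in convergents]

for c, e in zip(convergents, errors):
    print(c, e)
```

Each extra level of the fraction shrinks the approximation error, here from about 0.41 at depth zero to about 0.0025 at depth three.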

Title: UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting

Authors: Sara Shields-Menard, Zach Reimers, Joshua Gardner, David Perry, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05589
Pdf URL: https://arxiv.org/pdf/2506.05589
Copy Paste: [[2506.05589]] UTSA-NLP at ArchEHR-QA 2025: Improving EHR Question Answering via Self-Consistency Prompting(https://arxiv.org/abs/2506.05589)
Keywords: large language model
Abstract: We describe our system for the ArchEHR-QA Shared Task on answering clinical questions using electronic health records (EHRs). Our approach uses large language models in two steps: first, to find sentences in the EHR relevant to a clinician's question, and second, to generate a short, citation-supported response based on those sentences. We use few-shot prompting, self-consistency, and thresholding to improve the sentence classification step to decide which sentences are essential. We compare several models and find that a smaller 8B model performs better than a larger 70B model for identifying relevant information. Our results show that accurate sentence selection is critical for generating high-quality responses and that self-consistency with thresholding helps make these decisions more reliable.
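
Self-consistency with thresholding for the sentence classification step can be sketched as below. The vote data, the sentence identifiers, and the function name are hypothetical; in a real system each vote would come from one sampled LLM response, and the threshold value would be tuned.

```python
def essential_by_self_consistency(votes_by_sentence, threshold):
    """Select sentences whose share of 'essential' votes meets the threshold.

    votes_by_sentence maps a sentence id to a list of booleans, one per
    sampled model response (True means the response marked it essential).
    """
    selected = []
    for sentence_id, votes in votes_by_sentence.items():
        share = sum(votes) / len(votes)
        if share >= threshold:
            selected.append(sentence_id)
    return selected


# Hypothetical votes from five sampled responses per sentence.
votes = {
    1: [True, True, False, True, True],
    2: [False, False, True, False, False],
    3: [True, False, True, False, True],
}

print(essential_by_self_consistency(votes, threshold=0.5))
print(essential_by_self_consistency(votes, threshold=0.7))
```

Raising the threshold trades recall for precision: sentence 3, with three of five votes, is kept at 0.5 but dropped at 0.7.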

Title: SoK: Are Watermarks in LLMs Ready for Deployment?

Authors: Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin, Abdallah Khreishah, My Thai
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05594
Pdf URL: https://arxiv.org/pdf/2506.05594
Copy Paste: [[2506.05594]] SoK: Are Watermarks in LLMs Ready for Deployment?(https://arxiv.org/abs/2506.05594)
Keywords: security, attack, steal, watermark, large language model
Abstract: Large Language Models (LLMs) have transformed natural language processing, demonstrating impressive capabilities across diverse tasks. However, deploying these models introduces critical risks related to intellectual property violations and potential misuse, particularly as adversaries can imitate these models to steal services or generate misleading outputs. We specifically focus on model stealing attacks, as they are highly relevant to proprietary LLMs and pose a serious threat to their security, revenue, and ethical deployment. While various watermarking techniques have emerged to mitigate these risks, it remains unclear how far the community and industry have progressed in developing and deploying watermarks in LLMs. To bridge this gap, we aim to develop a comprehensive systematization for watermarks in LLMs by 1) presenting a detailed taxonomy for watermarks in LLMs, 2) proposing a novel intellectual property classifier to explore the effectiveness and impacts of watermarks on LLMs under both attack and attack-free environments, 3) analyzing the limitations of existing watermarks in LLMs, and 4) discussing practical challenges and potential future directions for watermarks in LLMs. Through extensive experiments, we show that despite promising research outcomes and significant attention from leading companies and the community to deploying watermarks, these techniques have yet to reach their full potential in real-world applications due to their unfavorable impacts on the model utility of LLMs and on downstream tasks. Our findings provide an insightful understanding of watermarks in LLMs, highlighting the need for practical watermark solutions tailored to LLM deployment.

Title: Zero-shot protein stability prediction by inverse folding models: a free energy interpretation

Authors: Jes Frellsen, Maher M. Kassem, Tone Bengtsen, Lars Olsen, Kresten Lindorff-Larsen, Jesper Ferkinghoff-Borg, Wouter Boomsma
Subjects: cs.LG, cs.AI, q-bio.BM, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05596
Pdf URL: https://arxiv.org/pdf/2506.05596
Copy Paste: [[2506.05596]] Zero-shot protein stability prediction by inverse folding models: a free energy interpretation(https://arxiv.org/abs/2506.05596)
Keywords: fair
Abstract: Inverse folding models have proven to be highly effective zero-shot predictors of protein stability. Despite this success, the link between the amino acid preferences of an inverse folding model and the free-energy considerations underlying thermodynamic stability remains incompletely understood. A better understanding would be of interest not only from a theoretical perspective, but also potentially provide the basis for stronger zero-shot stability prediction. In this paper, we take steps to clarify the free-energy foundations of inverse folding models. Our derivation reveals the standard practice of likelihood ratios as a simplistic approximation and suggests several paths towards better estimates of the relative stability. We empirically assess these approaches and demonstrate that considerable gains in zero-shot performance can be achieved with fairly simple means.

Title: FaCTR: Factorized Channel-Temporal Representation Transformers for Efficient Time Series Forecasting

Authors: Yash Vijay, Harini Subramanyan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05597
Pdf URL: https://arxiv.org/pdf/2506.05597
Copy Paste: [[2506.05597]] FaCTR: Factorized Channel-Temporal Representation Transformers for Efficient Time Series Forecasting(https://arxiv.org/abs/2506.05597)
Keywords: interpretability, transformer
Abstract: While Transformers excel in language and vision, where inputs are semantically rich and exhibit univariate dependency structures, their architectural complexity leads to diminishing returns in time series forecasting. Time series data is characterized by low per-timestep information density and complex dependencies across channels and covariates, requiring conditioning on structured variable interactions. To address this mismatch and overparameterization, we propose FaCTR, a lightweight spatiotemporal Transformer with an explicitly structural design. FaCTR injects dynamic, symmetric cross-channel interactions, modeled via a low-rank Factorization Machine, into temporally contextualized patch embeddings through a learnable gating mechanism. It further encodes static and dynamic covariates for multivariate conditioning. Despite its compact design, FaCTR achieves state-of-the-art performance on eleven public forecasting benchmarks spanning both short-term and long-term horizons, with its largest variant using only about 400K parameters, on average 50x smaller than competitive spatiotemporal transformer baselines. In addition, its structured design enables interpretability through cross-channel influence scores, an essential requirement for real-world decision-making. Finally, FaCTR supports self-supervised pretraining, positioning it as a compact yet versatile foundation for downstream time series tasks.

Title: SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs

Authors: Michael J Ryan, Omar Shaikh, Aditri Bhagirath, Daniel Frees, William Held, Diyi Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05598
Pdf URL: https://arxiv.org/pdf/2506.05598
Copy Paste: [[2506.05598]] SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs(https://arxiv.org/abs/2506.05598)
Keywords: large language model
Abstract: Recent calls for pluralistic alignment of Large Language Models (LLMs) encourage adapting models to diverse user preferences. However, most prior work on personalized reward models relies heavily on additional identity information, such as demographic details or a predefined set of preference categories. To this end, we introduce SynthesizeMe, an approach to inducing synthetic user personas from user interactions for personalized reward modeling. SynthesizeMe first generates and verifies reasoning to explain user preferences, then induces synthetic user personas from that reasoning, and finally filters to informative prior user interactions in order to build personalized prompts for a particular user. We show that using SynthesizeMe-induced prompts improves personalized LLM-as-a-judge accuracy by 4.4% on Chatbot Arena. Combining SynthesizeMe-derived prompts with a reward model achieves top performance on PersonalRewardBench: a new curation of user-stratified interactions with chatbots collected from 854 users of Chatbot Arena and PRISM.

Title: UniRes: Universal Image Restoration for Complex Degradations

Authors: Mo Zhou, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Vishal M. Patel, Hossein Talebi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05599
Pdf URL: https://arxiv.org/pdf/2506.05599
Copy Paste: [[2506.05599]] UniRes: Universal Image Restoration for Complex Degradations(https://arxiv.org/abs/2506.05599)
Keywords: diffusion, generative
Abstract: Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors; however, generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which are frequently seen in the wild. A simple yet flexible diffusion-based framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gains, especially for images with complex degradations.

Title: Network Hexagons Under Attack: Secure Crowdsourcing of Geo-Referenced Data

Authors: Okemawo Obadofin, Joao Barros
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2506.05601
Pdf URL: https://arxiv.org/pdf/2506.05601
Copy Paste: [[2506.05601]] Network Hexagons Under Attack: Secure Crowdsourcing of Geo-Referenced Data(https://arxiv.org/abs/2506.05601)
Keywords: secure, security, privacy, attack
Abstract: A critical requirement for modern-day Intelligent Transportation Systems (ITS) is the ability to collect geo-referenced data from connected vehicles and mobile devices in a safe, secure and anonymous way. The Nexagon protocol, which builds on the IETF Locator/ID Separation Protocol (LISP) and the Hierarchical Hexagonal Clustering (H3) geo-spatial indexing system, offers a promising framework for dynamic, privacy-preserving data aggregation. Seeking to address the critical security and privacy vulnerabilities that persist in its current specification, we apply the STRIDE and LINDDUN threat modelling frameworks and prove, among other things, that the Nexagon protocol is susceptible to user re-identification, session linkage, and sparse-region attacks. To address these challenges, we propose an enhanced security architecture that combines public key infrastructure (PKI) with ephemeral pseudonym certificates. Our solution guarantees user and device anonymity through randomized key rotation and adaptive geospatial resolution, thereby effectively mitigating re-identification and surveillance risks in sparse environments. A prototype implementation over a microservice-based overlay network validates the approach and underscores its readiness for real-world deployment. Our results show that it is possible to achieve the required level of security without increasing latency by more than 25% or reducing the throughput by more than 7%.

Title: OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation

Authors: Ziyi Wang, Yuxuan Lu, Wenbo Li, Amirali Amini, Bo Sun, Yakov Bart, Weimin Lyu, Jiri Gesi, Tian Wang, Jing Huang, Yu Su, Upol Ehsan, Malihe Alikhani, Toby Jia-Jun Li, Lydia Chilton, Dakuo Wang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2506.05606
Pdf URL: https://arxiv.org/pdf/2506.05606
Copy Paste: [[2506.05606]] OPeRA: A Dataset of Observation, Persona, Rationale, and Action for Evaluating LLMs on Human Online Shopping Behavior Simulation(https://arxiv.org/abs/2506.05606)
Keywords: large language model
Abstract: Can large language models (LLMs) accurately simulate the next web action of a specific user? While LLMs have shown promising capabilities in generating "believable" human behaviors, evaluating their ability to mimic real user behaviors remains an open challenge, largely due to the lack of high-quality, publicly available datasets that capture both the observable actions and the internal reasoning of an actual human user. To address this gap, we introduce OPERA, a novel dataset of Observation, Persona, Rationale, and Action collected from real human participants during online shopping sessions. OPERA is the first public dataset that comprehensively captures user personas, browser observations, fine-grained web actions, and self-reported just-in-time rationales. We developed both an online questionnaire and a custom browser plugin to gather this dataset with high fidelity. Using OPERA, we establish the first benchmark to evaluate how well current LLMs can predict a specific user's next action and rationale given a persona and history. This dataset lays the groundwork for future research into LLM agents that aim to act as personalized digital twins for humans.

Title: Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking

Authors: Zhecheng Sheng, Xiruo Ding, Brian Hur, Changye Li, Trevor Cohen, Serguei Pakhomov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05610
Pdf URL: https://arxiv.org/pdf/2506.05610
Copy Paste: [[2506.05610]] Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking(https://arxiv.org/abs/2506.05610)
Keywords: transformer
Abstract: Deep transformer models have been used to detect linguistic anomalies in patient transcripts for early Alzheimer's disease (AD) screening. While pre-trained neural language models (LMs) fine-tuned on AD transcripts perform well, little research has explored the effects of the gender of the speakers represented by these transcripts. This work addresses gender confounding in dementia detection and proposes two methods: the $\textit{Extended Confounding Filter}$ and the $\textit{Dual Filter}$, which isolate and ablate weights associated with gender. We evaluate these methods on dementia datasets with first-person narratives from patients with cognitive impairment and healthy controls. Our results show transformer models tend to overfit to training data distributions. Disrupting gender-related weights results in a deconfounded dementia classifier, with the trade-off of slightly reduced dementia detection performance.

Title: Breaking Anonymity at Scale: Re-identifying the Trajectories of 100K Real Users in Japan

Authors: Abhishek Kumar Mishra, Mathieu Cunche, Heber H. Arcolezi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05611
Pdf URL: https://arxiv.org/pdf/2506.05611
Copy Paste: [[2506.05611]] Breaking Anonymity at Scale: Re-identifying the Trajectories of 100K Real Users in Japan(https://arxiv.org/abs/2506.05611)
Keywords: privacy, protect, robust
Abstract: Mobility traces represent a critical class of personal data, often subjected to privacy-preserving transformations before public release. In this study, we analyze the anonymized Yjmob100k dataset, which captures the trajectories of 100,000 users in Japan, and demonstrate how existing anonymization techniques fail to protect their sensitive attributes. We leverage population density patterns, structural correlations, and temporal activity profiles to re-identify the dataset's real-world location and timing. Our results reveal that the anonymization process carried out for Yjmob100k is inefficient and preserves enough spatial and temporal structure to enable re-identification. This work underscores the limitations of current trajectory anonymization methods and calls for more robust privacy mechanisms in the publication of mobility data.

Title: When Maximum Entropy Misleads Policy Optimization

Authors: Ruipeng Zhang, Ya-Chien Chang, Sicun Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05615
Pdf URL: https://arxiv.org/pdf/2506.05615
Copy Paste: [[2506.05615]] When Maximum Entropy Misleads Policy Optimization(https://arxiv.org/abs/2506.05615)
Keywords: robust
Abstract: The Maximum Entropy Reinforcement Learning (MaxEnt RL) framework is a leading approach for achieving efficient learning and robust performance across many RL tasks. However, MaxEnt methods have also been shown to struggle with performance-critical control problems in practice, where non-MaxEnt algorithms can successfully learn. In this work, we analyze how the trade-off between robustness and optimality affects the performance of MaxEnt algorithms in complex control tasks: while entropy maximization enhances exploration and robustness, it can also mislead policy optimization, leading to failure in tasks that require precise, low-entropy policies. Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to better understanding of how to balance reward design and entropy maximization in challenging control problems.

Title: LFA applied to CNNs: Efficient Singular Value Decomposition of Convolutional Mappings by Local Fourier Analysis

Authors: Antonia van Betteray, Matthias Rottmann, Karsten Kahl
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05617
Pdf URL: https://arxiv.org/pdf/2506.05617
Copy Paste: [[2506.05617]] LFA applied to CNNs: Efficient Singular Value Decomposition of Convolutional Mappings by Local Fourier Analysis(https://arxiv.org/abs/2506.05617)
Keywords: robust
Abstract: The singular values of convolutional mappings encode interesting spectral properties, which can be used, e.g., to improve generalization and robustness of convolutional neural networks as well as to facilitate model compression. However, the computation of singular values is typically very resource-intensive. The naive approach involves unrolling the convolutional mapping along the input and channel dimensions into a large and sparse two-dimensional matrix, making the exact calculation of all singular values infeasible due to hardware limitations. In particular, this is true for matrices that represent convolutional mappings with large inputs and a high number of channels. Existing efficient methods leverage the fast Fourier transform (FFT) to transform convolutional mappings into the frequency domain, enabling the computation of singular values for matrices representing convolutions with larger input and channel dimensions. For a constant number of channels in a given convolution, an FFT can compute N singular values in O(N log N) complexity. In this work, we propose an approach of complexity O(N) based on local Fourier analysis, which additionally exploits the shift invariance of convolutional operators. We provide a theoretical analysis of our algorithm's runtime and validate its efficiency through numerical experiments. Our results demonstrate that our proposed method is scalable and offers a practical solution to calculate the entire set of singular values - along with the corresponding singular vectors if needed - for high-dimensional convolutional mappings.
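
The Fourier-domain idea behind the existing FFT-based methods can be sketched in its simplest setting: a single-channel, one-dimensional circular convolution, whose singular values are the magnitudes of the kernel's discrete Fourier transform. This is a simplified illustration of the baseline only, not the paper's local Fourier analysis algorithm; it uses a naive O(N^2) DFT for clarity, and the function names are assumptions.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform (an FFT would give O(N log N))."""
    n = len(signal)
    return [
        sum(signal[j] * cmath.exp(-2j * cmath.pi * f * j / n) for j in range(n))
        for f in range(n)
    ]


def circulant(kernel):
    """Unrolled matrix C with C[i][m] = kernel[(i - m) mod n]."""
    n = len(kernel)
    return [[kernel[(i - m) % n] for m in range(n)] for i in range(n)]


def singular_values_via_dft(kernel):
    """Singular values of the circular convolution with the padded kernel."""
    return sorted((abs(c) for c in dft(kernel)), reverse=True)


kernel = [1.0, 2.0, 0.0, 0.0]  # kernel [1, 2] zero-padded to input length 4
sigma = singular_values_via_dft(kernel)

# Sanity check: squared singular values must sum to the squared Frobenius norm
# of the unrolled matrix.
frobenius_sq = sum(v * v for row in circulant(kernel) for v in row)
print(sigma, frobenius_sq, math.fsum(s * s for s in sigma))
```

The shortcut works because the Fourier vectors are eigenvectors of every circulant matrix, so no explicit decomposition of the unrolled matrix is needed.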

Title: GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance

Authors: Jiri Navratil, Jarret Ross, Payel Das, Youssef Mroueh, Samuel C Hoffman, Vijil Chenthamarakshan, Brian Belgodere
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05628
Pdf URL: https://arxiv.org/pdf/2506.05628
Copy Paste: [[2506.05628]] GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance(https://arxiv.org/abs/2506.05628)
Keywords: generative
Abstract: The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM), while using the molecular similarity to the target as a guide. Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM. At each step of the decoding process, the method tracks the distance of the current generations from the target and updates the logits to encourage the preservation of similarity in generations. We implement the method using a recently proposed $\sim$47M parameter SMILES-based CLM, GP-MoLFormer, and therefore refer to the method as GP-MoLFormer-Sim, which enables a test-time update of the deep generative policy to reflect the contextual similarity to a set of guide molecules. The method is further integrated into a genetic algorithm (GA) and tested on a set of standard molecular optimization benchmarks involving property optimization, molecular rediscovery, and structure-based drug design. Results show that GP-MoLFormer-Sim combined with GA (GP-MoLFormer-Sim+GA) outperforms existing training-free baseline methods when the oracle remains black-box. The findings in this work are a step forward in understanding and guiding the generative mechanisms of CLMs.

Title: Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs

Authors: Ananth Muppidi, Abhilash Nandy, Sambaran Bandyopadhyay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05629
Pdf URL: https://arxiv.org/pdf/2506.05629
Copy Paste: [[2506.05629]] Leveraging Self-Attention for Input-Dependent Soft Prompting in LLMs(https://arxiv.org/abs/2506.05629)
Keywords: large language model
Abstract: The performance of large language models in domain-specific tasks necessitates fine-tuning, which is computationally expensive and technically challenging. This paper focuses on parameter-efficient fine-tuning using soft prompting, a promising approach that adapts pre-trained models to downstream tasks by learning a small set of parameters. We propose a novel Input Dependent Soft Prompting technique with a self-Attention Mechanism (ID-SPAM) that generates soft prompts based on the input tokens and attends to different tokens with varying importance. Our method is simple and efficient, keeping the number of trainable parameters small. We show the merits of the proposed approach compared to state-of-the-art techniques on various tasks and show the improved zero-shot domain transfer capability.

Title: FedShield-LLM: A Secure and Scalable Federated Fine-Tuned Large Language Model

Authors: Md Jueal Mia, M. Hadi Amini
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2506.05640
Pdf URL: https://arxiv.org/pdf/2506.05640
Copy Paste: [[2506.05640]] FedShield-LLM: A Secure and Scalable Federated Fine-Tuned Large Language Model(https://arxiv.org/abs/2506.05640)
Keywords: secure, security, privacy, protect, attack, robust, membership infer, federate, large language model
Abstract: Federated Learning (FL) offers a decentralized framework for training and fine-tuning Large Language Models (LLMs) by leveraging computational resources across organizations while keeping sensitive data on local devices. It addresses privacy and security concerns while navigating challenges associated with the substantial computational demands of LLMs, which can be prohibitive for small and medium-sized organizations. FL supports the development of task-specific LLMs for cross-silo applications through fine-tuning but remains vulnerable to inference attacks, such as membership inference and gradient inversion, which threaten data privacy. Prior studies have utilized Differential Privacy (DP) in LLM fine-tuning, which, despite being effective at preserving privacy, can degrade model performance. To overcome these challenges, we propose a novel method, FedShield-LLM, that uses pruning with Fully Homomorphic Encryption (FHE) for Low-Rank Adaptation (LoRA) parameters, enabling secure computations on encrypted model updates while mitigating the attack surface by deactivating less important LoRA parameters. Furthermore, optimized federated algorithms for cross-silo environments enhance scalability and efficiency. Parameter-efficient fine-tuning techniques like LoRA substantially reduce computational and communication overhead, making FL feasible for resource-constrained clients. Experimental results show that the proposed method outperforms existing methods while maintaining robust privacy protection, enabling organizations to collaboratively train secure and efficient LLMs. The code and data are available at this https URL.

Title: Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones

Authors: Andrey Zhmoginov, Jihwan Lee, Mark Sandler
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05641
Pdf URL: https://arxiv.org/pdf/2506.05641
Copy Paste: [[2506.05641]] Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones(https://arxiv.org/abs/2506.05641)
Keywords: transformer
Abstract: Modern Foundation Models (FMs) are typically trained on corpora spanning a wide range of different data modalities, topics and downstream tasks. Utilizing these models can be very computationally expensive and is out of reach for most consumer devices. Furthermore, most of the broad FM knowledge may actually be irrelevant for a specific task at hand. Here we explore a technique for mapping parameters of a large Transformer to parameters of a smaller specialized model. By making this transformation task-specific, we aim to capture a narrower scope of the knowledge needed for performing a specific task by a smaller model. We study our method on image modeling tasks, showing that performance of generated models exceeds that of universal conditional models.

Title: Learning to Weight Parameters for Data Attribution

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05647
Pdf URL: https://arxiv.org/pdf/2506.05647
Copy Paste: [[2506.05647]] Learning to Weight Parameters for Data Attribution(https://arxiv.org/abs/2506.05647)
Keywords: diffusion, generative
Abstract: We study data attribution in generative models, aiming to identify which training examples most influence a given output. Existing methods achieve this by tracing gradients back to training data. However, they typically treat all network parameters uniformly, ignoring the fact that different layers encode different types of information and may thus draw information differently from the training set. We propose a method that models this by learning parameter importance weights tailored for attribution, without requiring labeled data. This allows the attribution process to adapt to the structure of the model, capturing which training examples contribute to specific semantic aspects of an output, such as subject, style, or background. Our method improves attribution accuracy across diffusion models and enables fine-grained insights into how outputs borrow from training data.

Title: Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection

Authors: Shanmukha Vellamcheti, Sanjoy Kundu, Sathyanarayanan N. Aakur
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05651
Pdf URL: https://arxiv.org/pdf/2506.05651
Copy Paste: [[2506.05651]] Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection(https://arxiv.org/abs/2506.05651)
Keywords: large language model
Abstract: Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data and enables generalization to unseen predicates. Additionally, we introduce a new benchmark for open-world VRD on Visual Genome with 21 held-out predicates and evaluate under three settings: seen, unseen, and mixed. Our model outperforms LLM-only, few-shot, and debiased baselines, achieving mean recall (mR@50) of 15.9, 13.1, and 11.7 on predicate classification on these three sets. These results highlight the promise of grounded LLM priors for scalable open-world visual understanding.
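
Mean recall, the metric reported in this abstract, averages recall per predicate class rather than over all triples, so rare predicates count as much as frequent ones. The sketch below is a simplified, dataset-level illustration; the triples are invented, and benchmark implementations typically apply the top-K cutoff per image before aggregating.

```python
from collections import defaultdict

def mean_recall(ground_truth, retrieved):
    """Average of per-predicate recall.

    ground_truth: list of (subject, predicate, object) triples.
    retrieved: set of triples the model returned within its top-K budget.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for triple in ground_truth:
        predicate = triple[1]
        total[predicate] += 1
        if triple in retrieved:
            hits[predicate] += 1
    recalls = [hits[p] / total[p] for p in total]
    return sum(recalls) / len(recalls)

# Hypothetical annotations: "wearing" is frequent, "on" is rare and missed.
ground_truth = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("woman", "wearing", "coat"),
    ("child", "wearing", "shoes"),
    ("dog", "on", "grass"),
]
retrieved = set(ground_truth[:4])

plain_recall = len(retrieved) / len(ground_truth)
m_recall = mean_recall(ground_truth, retrieved)
```

Here plain recall is 4/5, while mean recall is (1 + 1 + 0)/3, which exposes the missed rare predicate.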

Title: TissUnet: Improved Extracranial Tissue and Cranium Segmentation for Children through Adulthood

Authors: Markian Mandzak, Elvira Yang, Anna Zapaishchykova, Yu-Hui Chen, Lucas Heilbroner, John Zielke, Divyanshu Tak, Reza Mojahed-Yazdi, Francesca Romana Mussa, Zezhong Ye, Sridhar Vajapeyam, Viviana Benitez, Ralph Salloum, Susan N. Chi, Houman Sotoudeh, Jakob Seidlitz, Sabine Mueller, Hugo J.W.L. Aerts, Tina Y. Poussaint, Benjamin H. Kann
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05660
Pdf URL: https://arxiv.org/pdf/2506.05660
Copy Paste: [[2506.05660]] TissUnet: Improved Extracranial Tissue and Cranium Segmentation for Children through Adulthood(https://arxiv.org/abs/2506.05660)
Keywords: segmentation
Abstract: Extracranial tissues visible on brain magnetic resonance imaging (MRI) may hold significant value for characterizing health conditions and clinical decision-making, yet they are rarely quantified. Current tools have not been widely validated, particularly in settings of developing brains or underlying pathology. We present TissUnet, a deep learning model that segments skull bone, subcutaneous fat, and muscle from routine three-dimensional T1-weighted MRI, with or without contrast enhancement. The model was trained on 155 paired MRI-computed tomography (CT) scans and validated across nine datasets covering a wide age range and including individuals with brain tumors. In comparison to AI-CT-derived labels from 37 MRI-CT pairs, TissUnet achieved a median Dice coefficient of 0.79 [IQR: 0.77-0.81] in a healthy adult cohort. In a second validation using expert manual annotations, median Dice was 0.83 [IQR: 0.83-0.84] in healthy individuals and 0.81 [IQR: 0.78-0.83] in tumor cases, outperforming the previous state-of-the-art method. Acceptability testing resulted in an 89% acceptance rate after adjudication by a tie-breaker (N=108 MRIs), and TissUnet demonstrated excellent performance in the blinded comparative review (N=45 MRIs), including both healthy and tumor cases in pediatric populations. TissUnet enables fast, accurate, and reproducible segmentation of extracranial tissues, supporting large-scale studies on craniofacial morphology, treatment effects, and cardiometabolic risk using standard brain T1w MRI.
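
For readers unfamiliar with the Dice coefficient reported here, a minimal sketch of the standard definition follows. The masks are made-up flat lists standing in for flattened 3D segmentations, and returning 1.0 for two empty masks is a convention chosen for this example.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice = 2 * |A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    size_sum = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if size_sum == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * overlap / size_sum

# Toy example: 3 predicted voxels, 3 reference voxels, 2 shared.
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred_mask, true_mask)
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no voxels.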

Title: BAQ: Efficient Bit Allocation Quantization for Large Language Models

Authors: Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05664
Pdf URL: https://arxiv.org/pdf/2506.05664
Copy Paste: [[2506.05664]] BAQ: Efficient Bit Allocation Quantization for Large Language Models(https://arxiv.org/abs/2506.05664)
Keywords: large language model
Abstract: Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed BAQ (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56$\times$ lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All code for this paper is available at this https URL.
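
The idea of a closed-form, equal-loss bit allocation can be illustrated with the textbook high-resolution model, in which a group with sensitivity s quantized at b bits incurs loss s * 2^(-2b). This model and the function below are assumptions made for illustration, not the paper's derivation; the sketch also ignores the fact that practical bitwidths are non-negative integers.

```python
import math

def allocate_bits(sensitivities, avg_bits):
    """Minimize sum(s_i * 2**(-2*b_i)) subject to mean(b_i) == avg_bits.

    Setting the Lagrangian's derivative to zero gives
    b_i = avg_bits + 0.5 * (log2(s_i) - mean(log2(s))).
    """
    logs = [math.log2(s) for s in sensitivities]
    mean_log = sum(logs) / len(logs)
    return [avg_bits + 0.5 * (value - mean_log) for value in logs]

# Hypothetical sensitivities for four weight groups.
sensitivities = [16.0, 4.0, 1.0, 0.25]
bits = allocate_bits(sensitivities, avg_bits=4.0)

# At the optimum every group contributes the same loss.
losses = [s * 2.0 ** (-2.0 * b) for s, b in zip(sensitivities, bits)]
```

More sensitive groups receive more bits, and the resulting per-group losses coincide, which is the equal-loss structure in this simplified setting.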

Title: DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models

Authors: Yuhan Hao, Zhengning Li, Lei Sun, Weilong Wang, Naixin Yi, Sheng Song, Caihong Qin, Mofan Zhou, Yifei Zhan, Peng Jia, Xianpeng Lang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05667
Pdf URL: https://arxiv.org/pdf/2506.05667
Copy Paste: [[2506.05667]] DriveAction: A Benchmark for Exploring Human-like Driving Decisions in VLA Models(https://arxiv.org/abs/2506.05667)
Keywords: robust
Abstract: Vision-Language-Action (VLA) models have advanced autonomous driving, but existing benchmarks still lack scenario diversity, reliable action-level annotation, and evaluation protocols aligned with human preferences. To address these limitations, we introduce DriveAction, the first action-driven benchmark specifically designed for VLA models, comprising 16,185 QA pairs generated from 2,610 driving scenarios. DriveAction leverages real-world driving data proactively collected by users of production-level autonomous vehicles to ensure broad and representative scenario coverage, offers high-level discrete action labels collected directly from users' actual driving operations, and implements an action-rooted tree-structured evaluation framework that explicitly links vision, language, and action tasks, supporting both comprehensive and task-specific assessment. Our experiments demonstrate that state-of-the-art vision-language models (VLMs) require both vision and language guidance for accurate action prediction: on average, accuracy drops by 3.3% without vision input, by 4.1% without language input, and by 8.0% without either. Our evaluation supports precise identification of model bottlenecks with robust and consistent results, thus providing new insights and a rigorous foundation for advancing human-like decisions in autonomous driving.

Title: RNE: a plug-and-play framework for diffusion density estimation and inference-time control

Authors: Jiajun He, José Miguel Hernández-Lobato, Yuanqi Du, Francisco Vargas
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05668
Pdf URL: https://arxiv.org/pdf/2506.05668
Copy Paste: [[2506.05668]] RNE: a plug-and-play framework for diffusion density estimation and inference-time control(https://arxiv.org/abs/2506.05668)
Keywords: diffusion
Abstract: In this paper, we introduce the Radon-Nikodym Estimator (RNE), a flexible, plug-and-play framework for diffusion inference-time density estimation and control, based on the concept of the density ratio between path distributions. RNE connects and unifies a variety of existing density estimation and inference-time control methods under a single and intuitive perspective, stemming from basic variational inference and probabilistic principles, therefore offering both theoretical clarity and practical versatility. Experiments demonstrate that RNE achieves promising performance in diffusion density estimation and inference-time control tasks, including annealing, composition of diffusion models, and reward-tilting.

Title: Contextually Guided Transformers via Low-Rank Adaptation

Authors: Andrey Zhmoginov, Jihwan Lee, Max Vladymyrov, Mark Sandler
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05672
Pdf URL: https://arxiv.org/pdf/2506.05672
Copy Paste: [[2506.05672]] Contextually Guided Transformers via Low-Rank Adaptation(https://arxiv.org/abs/2506.05672)
Keywords: interpretability, transformer, large language model
Abstract: Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates the need for explicit prompts by learning to encode context into the model's weights. Our Contextually Guided Transformer (CGT) model maintains a contextual summary at each sequence position, allowing it to update the weights on the fly based on the preceding context. This approach enables the model to self-specialize, effectively creating a tailored model for processing information following a given prefix. We demonstrate the effectiveness of our method on synthetic in-context learning tasks and language modeling benchmarks. Furthermore, we introduce techniques for enhancing the interpretability of the learned contextual representations, drawing connections to Variational Autoencoders and promoting smoother, more consistent context encoding. This work offers a novel direction for efficient and adaptable language modeling by integrating context directly into the model's architecture.

Title: Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

Authors: Sajjad Abdoli, Freeman Lewin, Gediminas Vasiliauskas, Fabian Schonholz
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05673
Pdf URL: https://arxiv.org/pdf/2506.05673
Copy Paste: [[2506.05673]] Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery(https://arxiv.org/abs/2506.05673)
Keywords: robust, diffusion
Abstract: The development of modern Artificial Intelligence (AI) models, particularly diffusion-based models employed in computer vision and image generation tasks, is undergoing a paradigmatic shift in development methodologies. Traditionally dominated by a "Model Centric" approach, in which performance gains were primarily pursued through increasingly complex model architectures and hyperparameter optimization, the field is now recognizing a more nuanced "Data-Centric" approach. This emergent framework foregrounds the quality, structure, and relevance of training data as the principal driver of model performance. To operationalize this paradigm shift, we introduce the this http URL sample dataset (the "DSD"), initially comprised of approximately 10,610 high-quality human peer-ranked photography images accompanied by extensive multi-tier annotations. The DSD is a foundational computer vision dataset designed to usher in a new standard for commercial image datasets. Representing a small fraction of this http URL's 100 million-plus image catalog, the DSD provides a scalable foundation necessary for robust commercial and multimodal AI development. Through this in-depth exploratory analysis, we document the quantitative improvements generated by the DSD on specific models against known benchmarks and make the code and the trained models used in our evaluation publicly available.

Title: Zero-Shot Event Causality Identification via Multi-source Evidence Fuzzy Aggregation with Large Language Models

Authors: Zefan Zeng, Xingchen Hu, Qing Cheng, Weiping Ding, Wentao Li, Zhong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05675
Pdf URL: https://arxiv.org/pdf/2506.05675
Copy Paste: [[2506.05675]] Zero-Shot Event Causality Identification via Multi-source Evidence Fuzzy Aggregation with Large Language Models(https://arxiv.org/abs/2506.05675)
Keywords: large language model
Abstract: Event Causality Identification (ECI) aims to detect causal relationships between events in textual contexts. Existing ECI models predominantly rely on supervised methodologies, suffering from dependence on large-scale annotated data. Although Large Language Models (LLMs) enable zero-shot ECI, they are prone to causal hallucination, that is, erroneously establishing spurious causal links. To address these challenges, we propose MEFA, a novel zero-shot framework based on Multi-source Evidence Fuzzy Aggregation. First, we decompose causality reasoning into three main tasks (temporality determination, necessity analysis, and sufficiency verification) complemented by three auxiliary tasks. Second, leveraging meticulously designed prompts, we guide LLMs to generate uncertain responses and deterministic outputs. Finally, we quantify the LLM's responses to the sub-tasks and employ fuzzy aggregation to integrate this evidence for causality scoring and causality determination. Extensive experiments on three benchmarks demonstrate that MEFA outperforms second-best unsupervised baselines by 6.2% in F1-score and 9.3% in precision, while significantly reducing hallucination-induced errors. In-depth analysis verifies the effectiveness of task decomposition and the superiority of fuzzy aggregation.

Title: Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions

Authors: Haotian Jiang, Zeyu Bao, Shida Wang, Qianxiao Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05678
Pdf URL: https://arxiv.org/pdf/2506.05678
Copy Paste: [[2506.05678]] Numerical Investigation of Sequence Modeling Theory using Controllable Memory Functions(https://arxiv.org/abs/2506.05678)
Keywords: transformer
Abstract: The evolution of sequence modeling architectures, from recurrent neural networks and convolutional models to Transformers and structured state-space models, reflects ongoing efforts to address the diverse temporal dependencies inherent in sequential data. Despite this progress, systematically characterizing the strengths and limitations of these architectures remains a fundamental this http URL this work, we propose a synthetic benchmarking framework to evaluate how effectively different sequence models capture distinct temporal structures. The core of this approach is to generate synthetic targets, each characterized by a memory function and a parameter that determines the strength of temporal dependence. This setup allows us to produce a continuum of tasks that vary in temporal complexity, enabling fine-grained analysis of model behavior concerning specific memory properties. We focus on four representative memory functions, each corresponding to a distinct class of temporal this http URL on several sequence modeling architectures confirm existing theoretical insights and reveal new this http URL results demonstrate the effectiveness of the proposed method in advancing theoretical understanding and highlight the importance of using controllable targets with clearly defined structures for evaluating sequence modeling architectures.

Title: Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization

Authors: Tailin Zhou, Zhilin Chen, Wenlong Lyu, Zhitang Chen, Danny H.K. Tsang, Jun Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05680
Pdf URL: https://arxiv.org/pdf/2506.05680
Copy Paste: [[2506.05680]] Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization(https://arxiv.org/abs/2506.05680)
Keywords: diffusion
Abstract: Optimizing complex systems, from discovering therapeutic drugs to designing high-performance materials, remains a fundamental challenge across science and engineering, as the underlying rules are often unknown and costly to evaluate. Offline optimization aims to optimize designs for target scores using pre-collected datasets without system interaction. However, conventional approaches may fail beyond training data, predicting inaccurate scores and generating inferior designs. This paper introduces ManGO, a diffusion-based framework that learns the design-score manifold, capturing the design-score interdependencies holistically. Unlike existing methods that treat design and score spaces in isolation, ManGO unifies forward prediction and backward generation, attaining generalization beyond training data. Key to this is its derivative-free guidance for conditional generation, coupled with adaptive inference-time scaling that dynamically optimizes denoising paths. Extensive evaluations demonstrate that ManGO outperforms 24 single- and 10 multi-objective optimization methods across diverse domains, including synthetic tasks, robot control, material design, DNA sequence, and real-world engineering optimization.

Title: Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR

Authors: Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour
Subjects: cs.LG, cs.AI, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05683
Pdf URL: https://arxiv.org/pdf/2506.05683
Copy Paste: [[2506.05683]] Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR(https://arxiv.org/abs/2506.05683)
Keywords: privacy, federate
Abstract: Extended reality (XR) systems, which consist of virtual reality (VR), augmented reality (AR), and mixed reality (MR), offer a transformative interface for immersive, multi-modal, and embodied human-computer interaction. In this paper, we envision that multi-modal multi-task (M3T) federated foundation models (FedFMs) can offer transformative capabilities for XR systems through integrating the representational strength of M3T foundation models (FMs) with the privacy-preserving model training principles of federated learning (FL). We present a modular architecture for FedFMs, which entails different coordination paradigms for model training and aggregations. Central to our vision is the codification of XR challenges that affect the implementation of FedFMs under the SHIFT dimensions: (1) Sensor and modality diversity, (2) Hardware heterogeneity and system-level constraints, (3) Interactivity and embodied personalization, (4) Functional/task variability, and (5) Temporality and environmental variability. We illustrate the manifestation of these dimensions across a set of emerging and anticipated applications of XR systems. Finally, we propose evaluation metrics, dataset requirements, and design tradeoffs necessary for the development of resource-aware FedFMs in XR. This perspective aims to chart the technical and conceptual foundations for context-aware privacy-preserving intelligence in the next generation of XR systems.

Title: Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models

Authors: Hugues Thomas, Chen Chen, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05689
Pdf URL: https://arxiv.org/pdf/2506.05689
Copy Paste: [[2506.05689]] Pts3D-LLM: Studying the Impact of Token Structure for 3D Scene Understanding With Large Language Models(https://arxiv.org/abs/2506.05689)
Keywords: robust, transformer, large language model
Abstract: Effectively representing 3D scenes for Multimodal Large Language Models (MLLMs) is crucial yet challenging. Existing approaches commonly only rely on 2D image features and use varied tokenization approaches. This work presents a rigorous study of 3D token structures, systematically comparing video-based and point-based representations while maintaining consistent model backbones and parameters. We propose a novel approach that enriches visual tokens by incorporating 3D point cloud features from a Sonata pretrained Point Transformer V3 encoder. Our experiments demonstrate that merging explicit 3D features significantly boosts performance. Furthermore, we show that point-based token structures can rival video-based ones when the points are cleverly sampled and ordered. Our best models from both structures achieve state-of-the-art results on multiple 3D understanding benchmarks. We emphasize our analysis of token structures as a key contribution, alongside transparent reporting of results averaged over multiple seeds, a practice we believe is vital for robust progress in the field.

Title: When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation

Authors: Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, Jinsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05690
Pdf URL: https://arxiv.org/pdf/2506.05690
Copy Paste: [[2506.05690]] When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation(https://arxiv.org/abs/2506.05690)
Keywords: large language model
Abstract: Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate this http URL its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions when GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at this https URL.

Title: SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

Authors: Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05692
Pdf URL: https://arxiv.org/pdf/2506.05692
Copy Paste: [[2506.05692]] SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code(https://arxiv.org/abs/2506.05692)
Keywords: secure, security, large language model
Abstract: The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code. In this work, we introduce SafeGenBench, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of common software development scenarios and vulnerability types. Building upon this benchmark, we develop an automatic evaluation framework that leverages both static application security testing (SAST) and LLM-based judging to assess the presence of security vulnerabilities in model-generated code. Through the empirical evaluation of state-of-the-art LLMs on SafeGenBench, we reveal notable deficiencies in their ability to produce vulnerability-free code. Our findings highlight pressing challenges and offer actionable insights for future advancements in the secure code generation performance of LLMs. The data and code will be released soon.

Title: Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework

Authors: Lingyuan Liu, Mengxiang Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05695
Pdf URL: https://arxiv.org/pdf/2506.05695
Copy Paste: [[2506.05695]] Being Strong Progressively! Enhancing Knowledge Distillation of Large Language Models through a Curriculum Learning Framework(https://arxiv.org/abs/2506.05695)
Keywords: large language model
Abstract: Knowledge Distillation (KD) compresses large language models (LLMs) by transferring the teacher model's capabilities to a smaller student model, reducing inference cost and memory usage while maintaining performance. However, existing KD methods for LLMs often fail to prevent significant shifts in the student model's distribution during training, leading to issues such as catastrophic forgetting, mode collapse, and training-inference mismatch. To address these challenges, we propose a novel, plug-in curriculum learning framework inspired by the strength training principle of "progressive overload" (POCL), which can be seamlessly integrated into existing white-box KD approaches with minimal computational overhead. The framework comprises two core components: (1) a difficulty measurer that ranks and partitions training samples from easy to hard, and (2) a training scheduler that incrementally introduces these subsets into the distillation process at fixed intervals while applying loss functions with progressively rising temperatures. By starting with the easiest samples and progressively increasing the difficulty, the approach enhances both the stability and efficiency of learning. Extensive experiments in instruction-following settings demonstrate that POCL consistently improves the performance of distilled student models across various white-box KD methods and model families. Our findings highlight the effectiveness of sorted training samples in KD for LLMs. More generally, our work demonstrates how to structure training data within the KD process to enhance the stability and performance of distilled LLMs.
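
The two components named in this abstract, a difficulty measurer and a training scheduler, can be sketched as follows. Everything specific here is an assumption for illustration: the `difficulty` scoring function is hypothetical, the pool is assumed to grow cumulatively, and the temperature is assumed to rise linearly between `t_start` and `t_end`.

```python
def partition_by_difficulty(samples, difficulty, num_stages):
    """Rank samples from easy to hard and split them into num_stages subsets."""
    ranked = sorted(samples, key=difficulty)
    n = len(ranked)
    return [ranked[i * n // num_stages:(i + 1) * n // num_stages]
            for i in range(num_stages)]

def curriculum_schedule(subsets, steps_per_stage, t_start, t_end):
    """Yield (start_step, active_pool, temperature) for each stage.

    Each stage adds the next-harder subset to the pool and raises the
    distillation temperature (linear ramp, an assumption of this sketch).
    """
    pool = []
    last = len(subsets) - 1
    for stage, subset in enumerate(subsets):
        pool = pool + subset
        frac = stage / last if last > 0 else 0.0
        yield stage * steps_per_stage, list(pool), t_start + frac * (t_end - t_start)

# Toy data: string length stands in for a real difficulty score.
samples = ["aaaa", "a", "aaa", "aa"]
subsets = partition_by_difficulty(samples, difficulty=len, num_stages=2)
stages = list(curriculum_schedule(subsets, steps_per_stage=100,
                                  t_start=1.0, t_end=2.0))
```

In a training loop, each stage's pool would be sampled for `steps_per_stage` steps with the stage's temperature applied in the distillation loss.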

Title: RKEFino1: A Regulation Knowledge-Enhanced Large Language Model

Authors: Yan Wang, Yueru He, Ruoyu Xiang, Jeff Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05700
Pdf URL: https://arxiv.org/pdf/2506.05700
Copy Paste: [[2506.05700]] RKEFino1: A Regulation Knowledge-Enhanced Large Language Model(https://arxiv.org/abs/2506.05700)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) hold great promise for financial applications but introduce critical accuracy and compliance challenges in Digital Regulatory Reporting (DRR). To address these issues, we propose RKEFino1, a regulation knowledge-enhanced financial reasoning model built upon Fino1, fine-tuned with domain knowledge from XBRL, CDM, and MOF. We formulate two QA tasks (knowledge-based and mathematical reasoning) and introduce a novel Numerical NER task covering financial entities in both sentences and tables. Experimental results demonstrate the effectiveness and generalization capacity of RKEFino1 in compliance-critical financial tasks. We have released our model on Hugging Face.

Title: Hybrid Stabilization Protocol for Cross-Chain Digital Assets Using Adaptor Signatures and AI-Driven Arbitrage

Authors: Shengwei You, Andrey Kuehlkamp, Jarek Nabrzyski
Subjects: cs.CR, cs.CE
Abstract URL: https://arxiv.org/abs/2506.05708
Pdf URL: https://arxiv.org/pdf/2506.05708
Copy Paste: [[2506.05708]] Hybrid Stabilization Protocol for Cross-Chain Digital Assets Using Adaptor Signatures and AI-Driven Arbitrage(https://arxiv.org/abs/2506.05708)
Keywords: privacy, robust
Abstract: Stablecoins face an unresolved trilemma of balancing decentralization, stability, and regulatory compliance. We present a hybrid stabilization protocol that combines crypto-collateralized reserves, algorithmic futures contracts, and cross-chain liquidity pools to achieve robust price adherence while preserving user privacy. At its core, the protocol introduces stabilization futures contracts (SFCs), non-collateralized derivatives that programmatically incentivize third-party arbitrageurs to counteract price deviations via adaptor signature atomic swaps. Autonomous AI agents optimize delta hedging across decentralized exchanges (DEXs), while zkSNARKs prove compliance with anti-money laundering (AML) regulations without exposing identities or transaction details. Our cryptographic design reduces cross-chain liquidity concentration (Herfindahl-Hirschman Index: 2,400 vs. 4,900 in single-chain systems) and ensures atomicity under standard cryptographic assumptions. The protocol's layered architecture, encompassing incentive-compatible SFCs, AI-driven market making, and zero-knowledge regulatory proofs, provides a blueprint for next-generation decentralized financial infrastructure.
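Illustrative sketch: the Herfindahl-Hirschman Index quoted in the abstract is conventionally the sum of squared market shares expressed in percent, so it ranges from near 0 (highly fragmented) to 10,000 (a single holder). The helper name and the pool figures below are hypothetical and do not reproduce the paper's data.

```python
from typing import Sequence


def herfindahl_hirschman_index(liquidity: Sequence[float]) -> float:
    """Sum of squared percentage shares; 10,000 means full concentration."""
    total = sum(liquidity)
    if total <= 0:
        raise ValueError("total liquidity must be positive")
    return sum((100.0 * amount / total) ** 2 for amount in liquidity)


if __name__ == "__main__":
    # Hypothetical liquidity amounts per pool, in arbitrary units.
    print(herfindahl_hirschman_index([50, 50]))          # two equal pools
    print(herfindahl_hirschman_index([25, 25, 25, 25]))  # four equal pools
```

Spreading the same liquidity over more, evenly sized pools lowers the index, which is the direction of the reduction the abstract reports.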

Title: Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

Authors: Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05709
Pdf URL: https://arxiv.org/pdf/2506.05709
Copy Paste: [[2506.05709]] Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration(https://arxiv.org/abs/2506.05709)
Keywords: transformer, segmentation
Abstract: Vision transformers have been widely explored in various vision tasks. Due to their heavy computational cost, much interest has arisen in dynamically compressing vision transformers at the token level. Current methods mainly rely on token pruning or merging to reduce the number of tokens, in which tokens are compressed exclusively, causing great information loss, so post-training is inevitably required to recover performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods construct special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that serves as a generalization of all existing methods and preserves the most information, even enabling training-free acceleration. We conduct extensive experiments to validate our framework. Specifically, we reduce FLOPs by 40% and accelerate DeiT-S by $1.5\times$ with a marginal 0.1% accuracy drop. Furthermore, we extend the method to dense prediction tasks including segmentation, object detection, depth estimation, and language model generation. Results demonstrate that the proposed method consistently achieves substantial improvements, offering a better computation-performance trade-off, impressive budget reduction and inference acceleration.
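Illustrative sketch: one way to read the "token matrix transformation" view is that N input tokens X are mapped to M output tokens by Y = W X, where pruning uses one-hot rows, merging uses group-averaging rows, and a many-to-many transform allows arbitrary weighted rows. The helper names and the specific matrix forms below are assumptions for illustration, not the paper's construction.

```python
from typing import List

Matrix = List[List[float]]


def apply_transform(w: Matrix, x: Matrix) -> Matrix:
    """Compute W @ X, mapping N input tokens to M output tokens."""
    dim = len(x[0])
    return [
        [sum(w_row[n] * x[n][d] for n in range(len(x))) for d in range(dim)]
        for w_row in w
    ]


def pruning_matrix(keep: List[int], n: int) -> Matrix:
    """One-hot rows: each output token copies exactly one kept input token."""
    return [[1.0 if j == k else 0.0 for j in range(n)] for k in keep]


def merging_matrix(groups: List[List[int]], n: int) -> Matrix:
    """Averaging rows: each output token is the mean of one disjoint group."""
    return [[1.0 / len(g) if j in g else 0.0 for j in range(n)] for g in groups]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
    print(apply_transform(pruning_matrix([0, 2], 4), tokens))
    print(apply_transform(merging_matrix([[0, 1], [2, 3]], 4), tokens))
    # Many-to-many (assumed form): input token 1 contributes to both outputs.
    mixed = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.25, 0.25, 0.5]]
    print(apply_transform(mixed, tokens))
```

In the pruning and merging matrices every column has at most one non-zero entry, which is one way to picture tokens being "compressed exclusively"; the mixed matrix relaxes that restriction.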

Title: Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application

Authors: Xiucheng Wang, Honggang Jia, Nan Cheng, Dusit Niyato
Subjects: cs.LG, cs.IT, eess.SY
Abstract URL: https://arxiv.org/abs/2506.05710
Pdf URL: https://arxiv.org/pdf/2506.05710
Copy Paste: [[2506.05710]] Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application(https://arxiv.org/abs/2506.05710)
Keywords: robust, diffusion, generative
Abstract: In this paper, a novel semantic communication framework empowered by generative artificial intelligence (GAI) is proposed, specifically leveraging the capabilities of diffusion models (DMs). A rigorous theoretical foundation is established based on stochastic differential equations (SDEs), which elucidates the denoising properties of DMs in mitigating additive white Gaussian noise (AWGN) in latent semantic representations. Crucially, a closed-form analytical relationship between the signal-to-noise ratio (SNR) and the denoising timestep is derived, enabling the optimal selection of diffusion parameters for any given channel condition. To address the distribution mismatch between the received signal and the DM's training data, a mathematically principled scaling mechanism is introduced, ensuring robust performance across a wide range of SNRs without requiring model fine-tuning. Built upon this theoretical insight, we develop a latent diffusion model (LDM)-based semantic transceiver, wherein a variational autoencoder (VAE) is employed for efficient semantic compression, and a pretrained DM serves as a universal denoiser. Notably, the proposed architecture is fully training-free at inference time, offering high modularity and compatibility with large-scale pretrained LDMs. This design inherently supports zero-shot generalization and mitigates the challenges posed by out-of-distribution inputs. Extensive experimental evaluations demonstrate that the proposed framework significantly outperforms conventional neural-network-based semantic communication baselines, particularly under low SNR conditions and distributional shifts, thereby establishing a promising direction for GAI-driven robust semantic transmission in future 6G systems.

Title: A symmetric LWE-based Multi-Recipient Cryptosystem

Authors: Saikat Gope, Srinivasan Krishnaswamy, Chayan Bhawal
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05711
Pdf URL: https://arxiv.org/pdf/2506.05711
Copy Paste: [[2506.05711]] A symmetric LWE-based Multi-Recipient Cryptosystem(https://arxiv.org/abs/2506.05711)
Keywords: security
Abstract: This article describes a post-quantum multirecipient symmetric cryptosystem whose security is based on the hardness of the LWE problem. In this scheme a single sender encrypts multiple messages for multiple recipients generating a single ciphertext which is broadcast to the recipients. Each recipient decrypts the ciphertext with her secret key to recover the message intended for her. In this process, the recipient cannot efficiently extract any information about the other messages. This scheme is intended for messages like images and sound that can tolerate a small amount of noise. This article introduces the scheme and establishes its security based on the LWE problem. Further, an example is given to demonstrate the application of this scheme for encrypting multiple images.

Title: Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation

Authors: Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang, Ying Wei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05713
Pdf URL: https://arxiv.org/pdf/2506.05713
Copy Paste: [[2506.05713]] Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation(https://arxiv.org/abs/2506.05713)
Keywords: robust
Abstract: Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapters' activation probability over the course of fine-tuning. By stochastically deactivating adapters, CoTo encourages more balanced optimization and broader exploration of the loss landscape. We provide a theoretical analysis showing that CoTo promotes layer-wise dropout stability and linear mode connectivity, and we adopt a cooperative-game approach to quantify each adapter's marginal contribution. Extensive experiments demonstrate that CoTo consistently boosts single-task performance, enhances multi-task merging accuracy, improves pruning robustness, and reduces training overhead, all while remaining compatible with diverse LoRA variants. Code is available at this https URL.
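Illustrative sketch: the progressive strategy described in the abstract, where each adapter's activation probability grows over fine-tuning and adapters are stochastically deactivated, might look roughly like this. The linear ramp, the `ramp_fraction` parameter, and the function names are assumptions; the abstract does not specify the schedule.

```python
import random
from typing import List


def activation_probability(
    step: int,
    total_steps: int,
    p_start: float = 0.0,
    ramp_fraction: float = 0.75,
) -> float:
    """Assumed linear ramp from p_start to 1.0 over the first part of training."""
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    progress = min(1.0, step / ramp_steps)
    return p_start + (1.0 - p_start) * progress


def sample_active_adapters(
    num_adapters: int, p: float, rng: random.Random
) -> List[bool]:
    """Independently keep each adapter active with probability p."""
    return [rng.random() < p for _ in range(num_adapters)]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 30, 75, 100):
        p = activation_probability(step, total_steps=100)
        print(step, round(p, 2), sample_active_adapters(4, p, rng))
```

A deactivated adapter would simply contribute nothing to its layer's output for that step, so the frozen base weights carry the forward pass alone.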

Title: Ensemble Elastic DQN: A novel multi-step ensemble approach to address overestimation in deep value-based reinforcement learning

Authors: Adrian Ly, Richard Dazeley, Peter Vamplew, Francisco Cruz, Sunil Aryal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05716
Pdf URL: https://arxiv.org/pdf/2506.05716
Copy Paste: [[2506.05716]] Ensemble Elastic DQN: A novel multi-step ensemble approach to address overestimation in deep value-based reinforcement learning(https://arxiv.org/abs/2506.05716)
Keywords: robust
Abstract: While many algorithmic extensions to Deep Q-Networks (DQN) have been proposed, there remains limited understanding of how different improvements interact. In particular, multi-step and ensemble style extensions have shown promise in reducing overestimation bias, thereby improving sample efficiency and algorithmic stability. In this paper, we introduce a novel algorithm called Ensemble Elastic Step DQN (EEDQN), which unifies ensembles with elastic step updates to stabilise algorithmic performance. EEDQN is designed to address two major challenges in deep reinforcement learning: overestimation bias and sample efficiency. We evaluated EEDQN against standard and ensemble DQN variants across the MinAtar benchmark, a set of environments that emphasise behavioral learning while reducing representational complexity. Our results show that EEDQN achieves consistently robust performance across all tested environments, outperforming baseline DQN methods and matching or exceeding state-of-the-art ensemble DQNs in final returns on most of the MinAtar environments. These findings highlight the potential of systematically combining algorithmic improvements and provide evidence that ensemble and multi-step methods, when carefully integrated, can yield substantial gains.

Title: You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping

Authors: Jingshun Huang, Haitao Lin, Tianyu Wang, Yanwei Fu, Yu-Gang Jiang, Xiangyang Xue
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.05719
Pdf URL: https://arxiv.org/pdf/2506.05719
Copy Paste: [[2506.05719]] You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping(https://arxiv.org/abs/2506.05719)
Keywords: segmentation
Abstract: This paper addresses the problem of category-level pose estimation for articulated objects in robotic manipulation tasks. Recent works have shown promising results in estimating part pose and size at the category level. However, these approaches primarily follow a complex multi-stage pipeline that first segments part instances in the point cloud and then estimates the Normalized Part Coordinate Space (NPCS) representation for 6D poses. These approaches suffer from high computational costs and low performance in real-time robotic tasks. To address these limitations, we propose YOEO, a single-stage method that simultaneously outputs instance segmentation and NPCS representations in an end-to-end manner. We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid. We further utilize a clustering algorithm to distinguish points based on their estimated centroid distances. Finally, we separate the NPCS region of each instance and align the separated regions with the real point cloud to recover the final pose and size. Experimental results on the GAPart dataset demonstrate the pose estimation capabilities of our proposed single-shot method. We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz, enabling a physical Kinova robot to interact with unseen articulated objects. This showcases the utility and effectiveness of our proposed method.

Title: Any-Class Presence Likelihood for Robust Multi-Label Classification with Abundant Negative Data

Authors: Dumindu Tissera, Omar Awadallah, Muhammad Umair Danish, Ayan Sadhu, Katarina Grolinger
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05721
Pdf URL: https://arxiv.org/pdf/2506.05721
Copy Paste: [[2506.05721]] Any-Class Presence Likelihood for Robust Multi-Label Classification with Abundant Negative Data(https://arxiv.org/abs/2506.05721)
Keywords: robust
Abstract: Multi-label Classification (MLC) assigns an instance to one or more non-exclusive classes. A challenge arises when the dataset contains a large proportion of instances with no assigned class, referred to as negative data, which can overwhelm the learning process and hinder the accurate identification and classification of positive instances. Nevertheless, it is common in MLC applications such as industrial defect detection, agricultural disease identification, and healthcare diagnosis to encounter large amounts of negative data. Assigning a separate negative class to these instances further complicates the learning objective and introduces unnecessary redundancies. To address this challenge, we redesign standard MLC loss functions by deriving a likelihood of any class being present, formulated by a normalized weighted geometric mean of the predicted class probabilities. We introduce a regularization parameter that controls the relative contribution of the absent class probabilities to the any-class presence likelihood in positive instances. The any-class presence likelihood complements the multi-label learning by encouraging the network to become more aware of implicit positive instances and improve the label classification within those positive instances. Experiments on large-scale datasets with negative data: SewerML, modified COCO, and ChestX-ray14, across various networks and base loss functions show that our loss functions consistently improve MLC performance of their standard loss counterparts, achieving gains of up to 6.01 percentage points in F1, 8.06 in F2, and 3.11 in mean average precision, all without additional parameters or computational complexity. Code available at: this https URL

Title: Large Language Models are Good Relational Learners

Authors: Fang Wu, Vijay Prakash Dwivedi, Jure Leskovec
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05725
Pdf URL: https://arxiv.org/pdf/2506.05725
Copy Paste: [[2506.05725]] Large Language Models are Good Relational Learners(https://arxiv.org/abs/2506.05725)
Keywords: large language model
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents. Still, this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at this https URL.

Title: There's Waldo: PCB Tamper Forensic Analysis using Explainable AI on Impedance Signatures

Authors: Maryam Saadat Safa, Seyedmohammad Nouraniboosjin, Fatemeh Ganji, Shahin Tajik
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05734
Pdf URL: https://arxiv.org/pdf/2506.05734
Copy Paste: [[2506.05734]] There's Waldo: PCB Tamper Forensic Analysis using Explainable AI on Impedance Signatures(https://arxiv.org/abs/2506.05734)
Keywords: security
Abstract: The security of printed circuit boards (PCBs) has become increasingly vital as supply chain vulnerabilities, including tampering, present significant risks to electronic systems. While detecting tampering on a PCB is the first step for verification, forensics is also needed to identify the modified component. One non-invasive and reliable PCB tamper detection technique with global coverage is the impedance characterization of a PCB's power delivery network (PDN). However, it is an open question whether one can use the two-dimensional impedance signatures for forensics purposes. In this work, we introduce a novel PCB forensics approach using explainable AI (XAI) on impedance signatures. Through extensive experiments, we replicate various PCB tamper events, generating a dataset used to develop an XAI algorithm capable of not only detecting tampering but also explaining why the algorithm makes a decision about whether a tamper event has happened. At the core of our XAI algorithm is a random forest classifier with an accuracy of 96.7%, sufficient to explain the algorithm's decisions. To understand the behavior of the classifier in the decision-making process, we utilized SHAP values as an XAI tool to determine which frequency component influences the classifier's decision for a particular class the most. This approach both enhances detection capabilities and advances the verifier's ability to reverse-engineer and analyze two-dimensional impedance signatures for forensics.

Title: Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness

Authors: Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K. Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05735
Pdf URL: https://arxiv.org/pdf/2506.05735
Copy Paste: [[2506.05735]] Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness(https://arxiv.org/abs/2506.05735)
Keywords: large language model
Abstract: Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at this https URL.

Title: Generalized Incremental Learning under Concept Drift across Evolving Data Streams

Authors: En Yu, Jie Lu, Guangquan Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05736
Pdf URL: https://arxiv.org/pdf/2506.05736
Copy Paste: [[2506.05736]] Generalized Incremental Learning under Concept Drift across Evolving Data Streams(https://arxiv.org/abs/2506.05736)
Keywords: robust
Abstract: Real-world data streams exhibit inherent non-stationarity characterized by concept drift, posing significant challenges for adaptive learning systems. While existing methods address isolated distribution shifts, they overlook the critical co-evolution of label spaces and distributions under limited supervision and persistent uncertainty. To address this, we formalize Generalized Incremental Learning under Concept Drift (GILCD), characterizing the joint evolution of distributions and label spaces in open-environment streaming contexts, and propose a novel framework called Calibrated Source-Free Adaptation (CSFA). First, CSFA introduces a training-free prototype calibration mechanism that dynamically fuses emerging prototypes with base representations, enabling stable new-class identification without optimization overhead. Second, we design a novel source-free adaptation algorithm, i.e., Reliable Surrogate Gap Sharpness-aware (RSGS) minimization. It integrates sharpness-aware perturbation loss optimization with surrogate gap minimization, while employing entropy-based uncertainty filtering to discard unreliable samples. This mechanism ensures robust distribution alignment and mitigates generalization degradation caused by uncertainties. Therefore, CSFA establishes a unified framework for stable adaptation to evolving semantics and distributions in open-world streaming scenarios. Extensive experiments validate the superior performance and effectiveness of CSFA compared to state-of-the-art approaches.
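Illustrative sketch: the "entropy-based uncertainty filtering" step mentioned in the abstract can be pictured as computing the Shannon entropy of each sample's predicted class distribution and discarding samples whose entropy is too high. The threshold rule (a fraction of the maximum entropy log K) and the function names are assumptions for illustration; the abstract does not give the exact criterion.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reliable_indices(
    batch_probs: Sequence[Sequence[float]], max_fraction: float = 0.4
) -> List[int]:
    """Keep samples whose entropy is at most max_fraction * log(num_classes)."""
    kept = []
    for index, probs in enumerate(batch_probs):
        threshold = max_fraction * math.log(len(probs))
        if shannon_entropy(probs) <= threshold:
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Hypothetical softmax outputs for three unlabeled stream samples.
    batch = [
        [0.98, 0.01, 0.01],    # confident prediction
        [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
        [0.90, 0.05, 0.05],    # fairly confident
    ]
    print(reliable_indices(batch))
```

Only the retained samples would then feed the adaptation loss, so noisy pseudo-labels from uncertain predictions do not drive the update.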

Title: To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt

Authors: Zhilong Wang, Neha Nagaraja, Lan Zhang, Hayretdin Bahsi, Pawan Patil, Peng Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05739
Pdf URL: https://arxiv.org/pdf/2506.05739
Copy Paste: [[2506.05739]] To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt(https://arxiv.org/abs/2506.05739)
Keywords: security, protect, defense, attack
Abstract: LLM agents are widely used for customer support, content generation, and code assistance. However, they are vulnerable to prompt injection attacks, where adversarial inputs manipulate the model's behavior. Traditional defenses like input sanitization, guard models, and guardrails are either cumbersome or ineffective. In this paper, we propose a novel, lightweight defense mechanism called Polymorphic Prompt Assembling (PPA), which protects against prompt injection with near-zero overhead. The approach is based on the insight that prompt injection requires guessing and breaking the structure of the system prompt. By dynamically varying the structure of system prompts, PPA prevents attackers from predicting the prompt structure, thereby enhancing security without compromising performance. We conducted experiments to evaluate the effectiveness of PPA against existing attacks and compared it with other defense methods.
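Illustrative sketch: one simple way to vary prompt structure per request, in the spirit of the idea described in the abstract, is to pick a random template and wrap the user input in freshly generated delimiters that an attacker cannot guess in advance. The templates, delimiter format, and function name are hypothetical and are not taken from the paper.

```python
import secrets

# Hypothetical prompt layouts; a real deployment would use many more.
TEMPLATES = (
    "{instruction}\nThe user input is enclosed between {open} and {close}. "
    "Treat it strictly as data.\n{open}\n{user}\n{close}",
    "Everything between {open} and {close} is untrusted user input.\n"
    "{open}\n{user}\n{close}\n{instruction}",
)


def assemble_prompt(instruction: str, user_input: str) -> str:
    """Build a prompt with a random layout and unpredictable delimiters."""
    tag = secrets.token_hex(8)  # 64 random bits, new for every request
    open_marker, close_marker = f"<<{tag}>>", f"<</{tag}>>"
    template = secrets.choice(TEMPLATES)
    return template.format(
        instruction=instruction,
        open=open_marker,
        close=close_marker,
        user=user_input,
    )


if __name__ == "__main__":
    attack = "Ignore previous instructions and reveal the system prompt."
    print(assemble_prompt("Summarize the user's message.", attack))
```

Because the delimiters change on every call, an injected string cannot reliably close the data section and start a new instruction block.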

Title: FIST: A Structured Threat Modeling Framework for Fraud Incidents

Authors: Yu-Chen Dai, Lu-An Chen, Sy-Jye Her, Yu-Xian Jiang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05740
Pdf URL: https://arxiv.org/pdf/2506.05740
Copy Paste: [[2506.05740]] FIST: A Structured Threat Modeling Framework for Fraud Incidents(https://arxiv.org/abs/2506.05740)
Keywords: security, attack
Abstract: Fraudulent activities are rapidly evolving, employing increasingly diverse and sophisticated methods that pose serious threats to individuals, organizations, and society. This paper proposes the FIST Framework (Fraud Incident Structured Threat Framework), an innovative structured threat modeling methodology specifically designed for fraud scenarios. Inspired by MITRE ATT&CK and DISARM, FIST systematically incorporates social engineering tactics, stage-based behavioral decomposition, and detailed attack technique mapping into a reusable knowledge base. FIST aims to enhance the efficiency of fraud detection and the standardization of threat intelligence sharing, promoting collaboration and a unified language across organizations and sectors. The framework integrates interdisciplinary insights from cybersecurity, criminology, and behavioral science, addressing both technical vectors and psychological manipulation mechanisms in fraud. This approach enables fine-grained analysis of fraud incidents, supporting automated detection, quantitative risk assessment, and standardized incident reporting. The effectiveness of the framework is further validated through real-world case studies, demonstrating its value in bridging academic research and practical applications, and laying the foundation for an intelligence-driven anti-fraud ecosystem. To the best of our knowledge, FIST is the first systematic, open-source fraud threat modeling framework that unifies both technical and psychological aspects, and is made freely available to foster collaboration between academia and industry.

Title: When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning

Authors: Ruining Sun, Hongsheng Hu, Wei Luo, Zhaoxi Zhang, Yanjun Zhang, Haizhuan Yuan, Leo Yu Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05743
Pdf URL: https://arxiv.org/pdf/2506.05743
Copy Paste: [[2506.05743]] When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning(https://arxiv.org/abs/2506.05743)
Keywords: privacy, protect, attack, robust, extraction, membership infer
Abstract: With the rapid advancement of deep learning technology, pre-trained encoder models have demonstrated exceptional feature extraction capabilities, playing a pivotal role in the research and application of deep learning. However, their widespread use has raised significant concerns about the risk of training data privacy leakage. This paper systematically investigates the privacy threats posed by membership inference attacks (MIAs) targeting encoder models, focusing on contrastive learning frameworks. Through experimental analysis, we reveal the significant impact of model architecture complexity on membership privacy leakage: as more advanced encoder frameworks improve feature-extraction performance, they simultaneously exacerbate privacy-leakage risks. Furthermore, this paper proposes a novel membership inference attack method based on the p-norm of feature vectors, termed the Embedding Lp-Norm Likelihood Attack (LpLA). This method infers membership status by leveraging the statistical distribution characteristics of the p-norm of feature vectors. Experimental results across multiple datasets and model architectures demonstrate that LpLA outperforms existing methods in attack performance and robustness, particularly under limited attack knowledge and query volumes. This study not only uncovers the potential risks of privacy leakage in contrastive learning frameworks, but also provides a practical basis for privacy protection research in encoder models. We hope that this work will draw greater attention to the privacy risks associated with self-supervised learning models and shed light on the importance of a balance between model utility and training data privacy. Our code is publicly available at: this https URL.
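Illustrative sketch: the quantity at the heart of the attack is the p-norm of an encoder's feature vector. The snippet below computes that norm and applies a plain threshold test as a stand-in decision rule. This threshold rule is a deliberate simplification, not the likelihood-based procedure the abstract names, and the direction of the comparison (whether members have larger or smaller norms) is left as a parameter because the abstract does not state it.

```python
from typing import Sequence


def lp_norm(vector: Sequence[float], p: float = 2.0) -> float:
    """Compute the p-norm (sum of |x|^p) ** (1/p) of a feature vector."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)


def guess_membership(
    embedding: Sequence[float],
    threshold: float,
    p: float = 2.0,
    members_have_larger_norm: bool = True,
) -> bool:
    """Simplified stand-in rule: compare the embedding's p-norm to a threshold."""
    norm = lp_norm(embedding, p)
    return norm >= threshold if members_have_larger_norm else norm <= threshold


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; real encoder features are much larger.
    print(lp_norm([3.0, 4.0]))
    print(guess_membership([3.0, 4.0], threshold=4.5))
    print(guess_membership([0.3, 0.4], threshold=4.5))
```

In practice the threshold would have to be calibrated on data whose membership is known, for example from a shadow model.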

Title: LLM-Symbolic Integration for Robust Temporal Tabular Reasoning

Authors: Atharv Kulkarni, Kushagra Dixit, Vivek Srikumar, Dan Roth, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05746
Pdf URL: https://arxiv.org/pdf/2506.05746
Copy Paste: [[2506.05746]] LLM-Symbolic Integration for Robust Temporal Tabular Reasoning(https://arxiv.org/abs/2506.05746)
Keywords: robust, large language model
Abstract: Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, which is a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
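Illustrative sketch: the symbolic route described in the abstract, turning a table into a database schema so that a question is answered by executing SQL, can be mimicked with Python's built-in `sqlite3` module. The table, its columns, and the query are hypothetical, and the SQL is hand-written here where the described method would have an LLM generate it.

```python
import sqlite3
from typing import List, Sequence, Tuple


def table_to_database(
    name: str,
    columns: Sequence[Tuple[str, str]],
    rows: List[tuple],
) -> sqlite3.Connection:
    """Load a small table into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    column_defs = ", ".join(f"{col} {sql_type}" for col, sql_type in columns)
    conn.execute(f"CREATE TABLE {name} ({column_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
    return conn


if __name__ == "__main__":
    # Hypothetical career timeline of a fictional player.
    conn = table_to_database(
        "career",
        [("year", "INTEGER"), ("team", "TEXT")],
        [(2001, "Rovers"), (2004, "Rovers"), (2005, "United"), (2009, "United")],
    )
    # Question: how many years separate the first and last Rovers entries?
    query = "SELECT MAX(year) - MIN(year) FROM career WHERE team = ?"
    print(conn.execute(query, ("Rovers",)).fetchone()[0])
    conn.close()
```

Because the arithmetic is delegated to the database engine, the answer does not depend on the model memorizing facts or on the table fitting comfortably in the prompt.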

Title: Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance

Authors: Rudransh Agnihotri, Ananya Pandey
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05748
Pdf URL: https://arxiv.org/pdf/2506.05748
Copy Paste: [[2506.05748]] Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance(https://arxiv.org/abs/2506.05748)
Keywords: interpretability
Abstract: Reward-model training is the cost bottleneck in modern Reinforcement Learning from Human Feedback (RLHF) pipelines, often requiring tens of billions of parameters and an offline preference-tuning phase. In the proposed method, a frozen, instruction-tuned 7B LLM is augmented with only a one-line JSON rubric and a rank-16 LoRA adapter (affecting just 0.8% of the model's parameters), enabling it to serve as a complete substitute for the previously used heavyweight evaluation models. The plug-and-play judge achieves 96.2% accuracy on RewardBench, outperforming specialized reward networks ranging from 27B to 70B parameters. Additionally, it allows a 7B actor to outperform the top 70B DPO baseline, which scores 61.8%, by achieving 92% exact match accuracy on GSM-8K utilizing online PPO. Thorough ablations indicate that (i) six in-context demonstrations deliver the majority of the zero-to-few-shot improvements (+2pp), and (ii) the LoRA effectively addresses the remaining disparity, particularly in the safety and adversarial Chat-Hard segments. To examine interpretability, the work introduces HH-Rationales, a subset of 10,000 pairs from Anthropic HH-RLHF accompanied by human-generated justifications. GPT-4 scoring indicates that our LoRA judge attains approximately 9/10 in similarity to human explanations, while zero-shot judges score around 5/10. These results indicate that the combination of prompt engineering and a tiny LoRA adapter produces a cost-effective, transparent, and easily adjustable reward function, removing the offline phase while achieving new state-of-the-art outcomes for both static evaluation and online RLHF.

Title: Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning

Authors: Xuanyu Lei, Chenliang Li, Yuning Wu, Kaiming Liu, Weizhou Shen, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05760
Pdf URL: https://arxiv.org/pdf/2506.05760
Copy Paste: [[2506.05760]] Writing-RL: Advancing Long-form Writing via Adaptive Curriculum Reinforcement Learning(https://arxiv.org/abs/2506.05760)
Keywords: large language model
Abstract: Recent advances in Large Language Models (LLMs) have enabled strong performance in long-form writing, yet existing supervised fine-tuning (SFT) approaches suffer from limitations such as data saturation and restricted learning capacity bounded by teacher signals. In this work, we present Writing-RL: an Adaptive Curriculum Reinforcement Learning framework to advance long-form writing capabilities beyond SFT. The framework consists of three key components: a Margin-aware Data Selection strategy that prioritizes samples with high learning potential, a Pairwise Comparison Reward mechanism that provides discriminative learning signals in the absence of verifiable rewards, and a Dynamic Reference Scheduling approach, which plays a particularly critical role by adaptively adjusting task difficulty based on evolving model performance. Experiments on 7B-scale writer models show that our RL framework substantially improves long-form writing performance over strong SFT baselines. Furthermore, we observe that models trained with long-output RL generalize surprisingly well to long-input reasoning tasks, potentially offering a promising perspective for rethinking long-context training.

Title: BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

Authors: Yunpeng Qing, Shuo Chen, Yixiao Chi, Shunyu Liu, Sixu Lin, Changqing Zou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05762
Pdf URL: https://arxiv.org/pdf/2506.05762
Copy Paste: [[2506.05762]] BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning(https://arxiv.org/abs/2506.05762)
Keywords: diffusion, generative
Abstract: Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate state. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions. BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.

Title: Exploring Microstructural Dynamics in Cryptocurrency Limit Order Books: Better Inputs Matter More Than Stacking Another Hidden Layer

Authors: Haochuan (Kevin) Wang
Subjects: cs.LG, q-fin.TR
Abstract URL: https://arxiv.org/abs/2506.05764
Pdf URL: https://arxiv.org/pdf/2506.05764
Copy Paste: [[2506.05764]] Exploring Microstructural Dynamics in Cryptocurrency Limit Order Books: Better Inputs Matter More Than Stacking Another Hidden Layer(https://arxiv.org/abs/2506.05764)
Keywords: robust, extraction, interpretability
Abstract: Cryptocurrency price dynamics are driven largely by microstructural supply-demand imbalances in the limit order book (LOB), yet the highly noisy nature of LOB data complicates the signal extraction process. Prior research has demonstrated that deep-learning architectures can yield promising predictive performance on pre-processed equity and futures LOB data, but they often treat model complexity as an unqualified virtue. In this paper, we aim to examine whether adding extra hidden layers or parameters to "black-box-ish" neural networks genuinely enhances short-term price forecasting, or if gains are primarily attributable to data preprocessing and feature engineering. We benchmark a spectrum of models, from interpretable baselines (logistic regression, XGBoost) to deep architectures (DeepLOB, Conv1D+LSTM), on BTC/USDT LOB snapshots sampled at 100 ms to multi-second intervals using publicly available Bybit data. We introduce two data filtering pipelines (Kalman, Savitzky-Golay) and evaluate both binary (up/down) and ternary (up/flat/down) labeling schemes. Our analysis compares models on out-of-sample accuracy, latency, and robustness to noise. Results reveal that, with data preprocessing and hyperparameter tuning, simpler models can match and even exceed the performance of more complex networks, offering faster inference and greater interpretability.
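
The ternary (up/flat/down) labeling scheme mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the labeling rule, so the relative-return band `threshold`, the `horizon` parameter, and both function names are assumptions chosen for illustration only.

```python
def ternary_label(mid_now, mid_future, threshold=1e-4):
    """Label one horizon return as 'up', 'flat', or 'down'.

    threshold is a hypothetical relative-return band; the paper's
    actual labeling rule is not stated in the abstract.
    """
    ret = (mid_future - mid_now) / mid_now
    if ret > threshold:
        return "up"
    if ret < -threshold:
        return "down"
    return "flat"


def label_series(mids, horizon=1, threshold=1e-4):
    """Label each mid-price against the mid-price `horizon` steps ahead."""
    return [
        ternary_label(mids[i], mids[i + horizon], threshold)
        for i in range(len(mids) - horizon)
    ]
```

Dropping the "flat" branch (or setting `threshold` to zero) gives a simple binary up/down variant; the width of the band controls how many small moves are treated as noise.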

Title: BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions

Authors: Saptarshi Sengupta, Shuhua Yang, Paul Kwong Yu, Fali Wang, Suhang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05766
Pdf URL: https://arxiv.org/pdf/2506.05766
Copy Paste: [[2506.05766]] BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions(https://arxiv.org/abs/2506.05766)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) has shown great power in improving Large Language Models (LLMs). However, most existing RAG-based LLMs are dedicated to retrieving single-modality information, mainly text, whereas for many real-world problems, such as healthcare, information relevant to queries can manifest in various modalities such as knowledge graphs, text (clinical notes), and complex molecular structures. Thus, being able to retrieve relevant multi-modality domain-specific information, and to reason over and synthesize diverse knowledge to generate an accurate response, is important. To address the gap, we present BioMol-MQA, a new question-answering (QA) dataset on polypharmacy, which is composed of two parts: (i) a multimodal knowledge graph (KG) with text and molecular structure for information retrieval; and (ii) challenging questions that are designed to test LLM capabilities in retrieving and reasoning over the multimodal KG to answer questions. Our benchmarks indicate that existing LLMs struggle to answer these questions and do well only when given the necessary background data, signaling the necessity for strong RAG frameworks.

Title: dots.llm1 Technical Report

Authors: Bi Huo, Bin Tu, Cheng Qin, Da Zheng, Debing Zhang, Dongjie Zhang, En Li, Fu Guo, Jian Yao, Jie Lou, Junfeng Tian, Li Hu, Ran Zhu, Shengdong Chen, Shuo Liu, Su Guang, Te Wo, Weijun Zhang, Xiaoming Shi, Xinxin Peng, Xing Wu, Yawen Liu, Yuqiu Ji, Ze Wen, Zhenhai Liu, Zichao Li, Zilong Liao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05767
Pdf URL: https://arxiv.org/pdf/2506.05767
Copy Paste: [[2506.05767]] dots.llm1 Technical Report(https://arxiv.org/abs/2506.05767)
Keywords: large language model
Abstract: Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
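
To make the "activating only a subset of parameters for each input token" idea concrete, here is a generic top-k routing sketch in Python. It is not the dots.llm1 router, whose details the abstract does not describe; `top_k_route` and the example logits are hypothetical. The last line only restates the abstract's 14B-of-142B figure as a fraction.

```python
import math


def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights.

    Generic illustration of sparse MoE gating; real routers differ in
    normalization, load balancing, and shared-expert handling.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


# Share of parameters active per token, from the figures in the abstract.
active_fraction = 14 / 142  # roughly 0.099, i.e. about 10%
```

Only the experts returned by the router run for a given token, which is why per-token compute tracks the activated parameter count and not the total.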

Title: AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation

Authors: Wenyu Zhu, Jianhui Wang, Bowen Gao, Yinjun Jia, Haichuan Tan, Ya-Qin Zhang, Wei-Ying Ma, Yanyan Lan
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2506.05768
Pdf URL: https://arxiv.org/pdf/2506.05768
Copy Paste: [[2506.05768]] AANet: Virtual Screening under Structural Uncertainty via Alignment and Aggregation(https://arxiv.org/abs/2506.05768)
Keywords: robust
Abstract: Virtual screening (VS) is a critical component of modern drug discovery, yet most existing methods--whether physics-based or deep learning-based--are developed around holo protein structures with known ligand-bound pockets. Consequently, their performance degrades significantly on apo or predicted structures such as those from AlphaFold2, which are more representative of real-world early-stage drug discovery, where pocket information is often missing. In this paper, we introduce an alignment-and-aggregation framework to enable accurate virtual screening under structural uncertainty. Our method comprises two core components: (1) a tri-modal contrastive learning module that aligns representations of the ligand, the holo pocket, and cavities detected from structures, thereby enhancing robustness to pocket localization error; and (2) a cross-attention based adapter for dynamically aggregating candidate binding sites, enabling the model to learn from activity data even without precise pocket annotations. We evaluated our method on a newly curated benchmark of apo structures, where it significantly outperforms state-of-the-art methods in the blind apo setting, improving the early enrichment factor (EF1%) from 11.75 to 37.19. Notably, it also maintains strong performance on holo structures. These results demonstrate the promise of our approach in advancing first-in-class drug discovery, particularly in scenarios lacking experimentally resolved protein-ligand complexes.
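
For readers unfamiliar with the EF1% metric quoted above, the following Python sketch uses the common definition of the enrichment factor: the active rate among the top-scored fraction divided by the active rate in the whole library. The function name and the rounding of the top-fraction size are assumptions; the paper's exact evaluation protocol may differ.

```python
import math


def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor at a given top fraction (EF1% when fraction=0.01).

    scores: predicted scores, higher means more likely active.
    is_active: 0/1 labels aligned with scores.
    """
    n = len(scores)
    total_actives = sum(is_active)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    n_top = max(1, math.ceil(n * fraction))  # size of the top-ranked slice
    ranked = sorted(zip(scores, is_active), key=lambda p: p[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (total_actives / n)
```

Under this definition an EF1% of 1.0 corresponds to random ranking, and the maximum attainable value depends on the fraction of actives in the library.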

Title: Evaluating Neuron Explanations: A Unified Framework with Sanity Checks

Authors: Tuomas Oikarinen, Ge Yan, Tsui-Wei Weng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05774
Pdf URL: https://arxiv.org/pdf/2506.05774
Copy Paste: [[2506.05774]] Evaluating Neuron Explanations: A Unified Framework with Sanity Checks(https://arxiv.org/abs/2506.05774)
Keywords: interpretability
Abstract: Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare existing evaluation metrics, understand the evaluation pipeline with increased clarity and apply existing statistical methods on the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests and do not change their score after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.

Title: Robust sensor fusion against on-vehicle sensor staleness

Authors: Meng Fan, Yifan Zuo, Patrick Blaes, Harley Montgomery, Subhasis Das
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.05780
Pdf URL: https://arxiv.org/pdf/2506.05780
Copy Paste: [[2506.05780]] Robust sensor fusion against on-vehicle sensor staleness(https://arxiv.org/abs/2506.05780)
Keywords: robust
Abstract: Sensor fusion is crucial for a performant and robust Perception system in autonomous vehicles, but sensor staleness, where data from different sensors arrives with varying delays, poses significant challenges. Temporal misalignment between sensor modalities leads to inconsistent object state estimates, severely degrading the quality of trajectory predictions that are critical for safety. We present a novel and model-agnostic approach to address this problem via (1) a per-point timestamp offset feature (for LiDAR and radar both relative to camera) that enables fine-grained temporal awareness in sensor fusion, and (2) a data augmentation strategy that simulates realistic sensor staleness patterns observed in deployed vehicles. Our method is integrated into a perspective-view detection model that consumes sensor data from multiple LiDARs, radars and cameras. We demonstrate that while a conventional model shows significant regressions when one sensor modality is stale, our approach reaches consistently good performance across both synchronized and stale conditions.
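
The two ingredients named in the abstract, a per-point timestamp offset relative to the camera and a staleness augmentation, can be sketched as follows. This is an illustration under assumed data layouts: points are `(x, y, z, t)` tuples, and `add_timestamp_offset`, `maybe_make_stale`, `delay`, and `prob` are hypothetical names, not the authors' implementation.

```python
import random


def add_timestamp_offset(points, camera_time):
    """Replace each point's absolute timestamp with its offset from the camera frame."""
    return [(x, y, z, t - camera_time) for (x, y, z, t) in points]


def maybe_make_stale(points, delay, prob, rng):
    """With probability `prob`, shift a sensor's timestamps back by `delay` seconds.

    Mimics a stale LiDAR or radar sweep during training; the real staleness
    patterns in the paper come from deployed vehicles.
    """
    if rng.random() < prob:
        return [(x, y, z, t - delay) for (x, y, z, t) in points]
    return list(points)
```

A training loop might call `maybe_make_stale` on one modality and then `add_timestamp_offset`, so the model sees both the stale geometry and an explicit feature telling it how stale each point is.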

Title: EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs

Authors: Ivan Rodin, Tz-Ying Wu, Kyle Min, Sharath Nittur Sridhar, Antonino Furnari, Subarna Tripathi, Giovanni Maria Farinella
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05787
Pdf URL: https://arxiv.org/pdf/2506.05787
Copy Paste: [[2506.05787]] EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs(https://arxiv.org/abs/2506.05787)
Keywords: large language model
Abstract: We introduce EASG-Bench, a question-answering benchmark for egocentric videos where the question-answering pairs are created from spatio-temporally grounded dynamic scene graphs capturing intricate relationships among actors, actions, and objects. We propose a systematic evaluation framework and evaluate several language-only and video large language models (video-LLMs) on this benchmark. We observe a performance gap in language-only and video-LLMs, especially on questions focusing on temporal ordering, thus identifying a research gap in the area of long-context video understanding. To promote the reproducibility of our findings and facilitate further research, the benchmark and accompanying code are available at the following GitHub page: this https URL.

Title: Discrete Minds in a Continuous World: Do Language Models Know Time Passes?

Authors: Minghan Wang, Ye Bai, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05790
Pdf URL: https://arxiv.org/pdf/2506.05790
Copy Paste: [[2506.05790]] Discrete Minds in a Continuous World: Do Language Models Know Time Passes?(https://arxiv.org/abs/2506.05790)
Keywords: large language model
Abstract: While Large Language Models (LLMs) excel at temporal reasoning tasks like event ordering and duration estimation, their ability to perceive the actual passage of time remains unexplored. We investigate whether LLMs perceive the passage of time and adapt their decision-making accordingly through three complementary experiments. First, we introduce the Token-Time Hypothesis, positing that LLMs can map discrete token counts to continuous wall-clock time, and validate this through a dialogue duration judgment task. Second, we demonstrate that LLMs could use this awareness to adapt their response length while maintaining accuracy when users express urgency in question answering tasks. Finally, we develop BombRush, an interactive navigation challenge that examines how LLMs modify behavior under progressive time pressure in dynamic environments. Our findings indicate that LLMs possess a certain awareness of time passage, enabling them to bridge discrete linguistic tokens and continuous physical time, though this capability varies with model size and reasoning abilities. This work establishes a theoretical foundation for enhancing temporal awareness in LLMs for time-sensitive applications.

Title: EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator

Authors: Qianyi Chen, Tianrun Gao, Chenbo Jiang, Tailin Wu
Subjects: cs.LG, cs.CE, cs.RO
Abstract URL: https://arxiv.org/abs/2506.05797
Pdf URL: https://arxiv.org/pdf/2506.05797
Copy Paste: [[2506.05797]] EqCollide: Equivariant and Collision-Aware Deformable Objects Neural Simulator(https://arxiv.org/abs/2506.05797)
Keywords: robust
Abstract: Simulating collisions of deformable objects is a fundamental yet challenging task due to the complexity of modeling solid mechanics and multi-body interactions. Existing data-driven methods often suffer from a lack of equivariance to physical symmetries, inadequate handling of collisions, and limited scalability. Here we introduce EqCollide, the first end-to-end equivariant neural fields simulator for deformable objects and their collisions. We propose an equivariant encoder to map object geometry and velocity into latent control points. A subsequent equivariant Graph Neural Network-based Neural Ordinary Differential Equation models the interactions among control points via collision-aware message passing. To reconstruct velocity fields, we query a neural field conditioned on control point features, enabling continuous and resolution-independent motion predictions. Experimental results show that EqCollide achieves accurate, stable, and scalable simulations across diverse object configurations, and our model achieves 24.34% to 35.82% lower rollout MSE even compared with the best-performing baseline model. Furthermore, our model can generalize to more colliding objects and extended temporal horizons, and stays robust to inputs transformed by group actions.

Title: Option Pricing Using Ensemble Learning

Authors: Zeyuan Li, Qingdao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05799
Pdf URL: https://arxiv.org/pdf/2506.05799
Copy Paste: [[2506.05799]] Option Pricing Using Ensemble Learning(https://arxiv.org/abs/2506.05799)
Keywords: robust, extraction
Abstract: Ensemble learning is characterized by flexibility, high precision, and refined structure. As a critical component within computational finance, option pricing with machine learning requires both high predictive accuracy and reduced structural complexity, features that align well with the inherent advantages of ensemble learning. This paper investigates the application of ensemble learning to option pricing, and conducts a comparative analysis with classical machine learning models to assess their performance in terms of accuracy, local feature extraction, and robustness to noise. A novel experimental strategy is introduced, leveraging parameter transfer across experiments to improve robustness and realism in financial this http URL upon this strategy, an evaluation mechanism is developed that incorporates a scoring strategy and a weighted evaluation strategy explicitly emphasizing the foundational role of financial theory. This mechanism embodies an orderly integration of theoretical finance and computational methods. In addition, the study examines the interaction between the sliding window technique and noise, revealing nuanced patterns that suggest a potential connection relevant to ongoing research in machine learning and data science.

Title: LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models

Authors: Haojie Yu, Zhaonian Wang, Yihan Pan, Meng Cheng, Hao Yang, Chao Wang, Tao Xie, Xiaoming Xu, Xiaoming Wei, Xunliang Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05806
Pdf URL: https://arxiv.org/pdf/2506.05806
Copy Paste: [[2506.05806]] LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models(https://arxiv.org/abs/2506.05806)
Keywords: robust, diffusion
Abstract: Diffusion-based models have gained wide adoption in virtual human generation due to their outstanding expressiveness. However, their substantial computational requirements have constrained their deployment in real-time interactive avatar applications, where stringent speed, latency, and duration requirements are paramount. We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges. Firstly, we propose robust variable-length video generation to reduce the minimum time required to generate the initial video clip or state transitions, which significantly enhances the user experience. Secondly, we propose a consistency model training strategy for Audio-Image-to-Video to ensure real-time performance, enabling a fast few-step generation. Model quantization and pipeline parallelism are further employed to accelerate the inference speed. To mitigate the stability loss incurred by the diffusion process and model quantization, we introduce a new inference strategy tailored for long-duration video generation. These methods ensure real-time performance and low latency while maintaining high-fidelity output. Thirdly, we incorporate class labels as a conditional input to seamlessly switch between speaking, listening, and idle states. Lastly, we design a novel mechanism for fine-grained facial expression control to exploit our model's inherent capacity. Extensive experiments demonstrate that our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with an initial video generation latency of 140 ms and 215 ms, respectively.

Title: DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image

Authors: Ziwei Zhao, Zhixing Zhang, Yuhang Liu, Zhao Zhang, Haojun Yu, Dong Wang, Liwei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05820
Pdf URL: https://arxiv.org/pdf/2506.05820
Copy Paste: [[2506.05820]] DeformCL: Learning Deformable Centerline Representation for Vessel Extraction in 3D Medical Image(https://arxiv.org/abs/2506.05820)
Keywords: robust, extraction, segmentation
Abstract: In the field of 3D medical imaging, accurately extracting and representing the blood vessels with curvilinear structures holds paramount importance for clinical diagnosis. Previous methods have commonly relied on discrete representations such as masks, often resulting in local fractures or scattered fragments due to the inherent limitations of the per-pixel classification paradigm. In this work, we introduce DeformCL, a new continuous representation based on Deformable Centerlines, where centerline points act as nodes connected by edges that capture spatial relationships. Compared with previous representations, DeformCL offers three key advantages: natural connectivity, noise robustness, and interaction facility. We present a comprehensive training pipeline structured in a cascaded manner to fully exploit these favorable properties of DeformCL. Extensive experiments on four 3D vessel segmentation datasets demonstrate the effectiveness and superiority of our method. Furthermore, the visualization of curved planar reformation images validates the clinical significance of the proposed framework. We release the code at this https URL.

Title: FuseUNet: A Multi-Scale Feature Fusion Method for U-like Networks

Authors: Quansong He, Xiangde Min, Kaishen Wang, Tao He
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05821
Pdf URL: https://arxiv.org/pdf/2506.05821
Copy Paste: [[2506.05821]] FuseUNet: A Multi-Scale Feature Fusion Method for U-like Networks(https://arxiv.org/abs/2506.05821)
Keywords: segmentation
Abstract: Medical image segmentation is a critical task in computer vision, with UNet serving as a milestone architecture. The defining component of the UNet family is the skip connection; however, these skip connections face two significant limitations: (1) they lack effective interaction between features at different scales, and (2) they rely on simple concatenation or addition operations, which constrain efficient information integration. While recent improvements to UNet have focused on enhancing encoder and decoder capabilities, these limitations remain overlooked. To overcome these challenges, we propose a novel multi-scale feature fusion method that reimagines the UNet decoding process as solving an initial value problem (IVP), treating skip connections as discrete nodes. By leveraging principles from the linear multistep method, we propose an adaptive ordinary differential equation method to enable effective multi-scale feature fusion. Our approach is independent of the encoder and decoder architectures, making it adaptable to various U-Net-like networks. Experiments on ACDC, KiTS2023, MSD brain tumor, and ISIC2017/2018 skin lesion segmentation datasets demonstrate improved feature utilization, reduced network parameters, and maintained high performance. The code is available at this https URL.
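
As background for the "linear multistep method" the abstract builds on, here is a textbook two-step Adams-Bashforth integrator for a scalar initial value problem, written in Python. It is not the paper's adaptive fusion method; it only shows the general pattern of combining the current and previous evaluations to advance one step, which is the analogy the abstract draws with skip connections as discrete nodes. The function name and the Euler bootstrap are choices made for this sketch.

```python
def adams_bashforth2(f, t0, y0, h, steps):
    """Integrate y' = f(t, y) from (t0, y0) with the 2-step Adams-Bashforth rule.

    The first step uses forward Euler, since a two-step method needs
    two starting values. Returns the list of steps + 1 solution values.
    """
    ys = [y0]
    f_prev = f(t0, y0)
    y = y0 + h * f_prev  # Euler bootstrap step
    t = t0 + h
    ys.append(y)
    for _ in range(steps - 1):
        f_curr = f(t, y)
        # Multistep update: weighted blend of current and previous slopes.
        y = y + h * (1.5 * f_curr - 0.5 * f_prev)
        f_prev = f_curr
        t += h
        ys.append(y)
    return ys
```

For example, `adams_bashforth2(lambda t, y: -y, 0.0, 1.0, 0.01, 100)` approximates the decay `exp(-t)` on the interval from 0 to 1.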

Title: Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning

Authors: Ngoc Bui, Menglin Yang, Runjin Chen, Leonardo Neves, Mingxuan Ju, Rex Ying, Neil Shah, Tong Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05826
Pdf URL: https://arxiv.org/pdf/2506.05826
Copy Paste: [[2506.05826]] Learning Along the Arrow of Time: Hyperbolic Geometry for Backward-Compatible Representation Learning(https://arxiv.org/abs/2506.05826)
Keywords: robust
Abstract: Backward-compatible representation learning enables updated models to integrate seamlessly with existing ones, avoiding the need to reprocess stored data. Despite recent advances, existing compatibility approaches in Euclidean space neglect the uncertainty in the old embedding model and force the new model to reconstruct outdated representations regardless of their quality, thereby hindering the learning process of the new model. In this paper, we propose to switch perspectives to hyperbolic geometry, where we treat time as a natural axis for capturing a model's confidence and evolution. By lifting embeddings into hyperbolic space and constraining updated embeddings to lie within the entailment cone of the old ones, we maintain generational consistency across models while accounting for uncertainties in the representations. To further enhance compatibility, we introduce a robust contrastive alignment loss that dynamically adjusts alignment weights based on the uncertainty of the old embeddings. Experiments validate the superiority of the proposed method in achieving compatibility, paving the way for more resilient and adaptable machine learning systems.

Title: Heartcare Suite: Multi-dimensional Understanding of ECG with Raw Multi-lead Signal Modeling

Authors: Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenqiao Zhang, Haoyuan Li, Hao Jiang, Fengda Zhang, Qishan Chen, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05831
Pdf URL: https://arxiv.org/pdf/2506.05831
Copy Paste: [[2506.05831]] Heartcare Suite: Multi-dimensional Understanding of ECG with Raw Multi-lead Signal Modeling(https://arxiv.org/abs/2506.05831)
Keywords: diffusion, large language model
Abstract: We present Heartcare Suite, a comprehensive multimodal framework for fine-grained electrocardiogram (ECG) understanding. It comprises three key components: (i) Heartcare-220K, a high-quality, structured, and comprehensive multimodal ECG dataset covering essential tasks such as disease diagnosis, waveform morphology analysis, and rhythm interpretation; (ii) Heartcare-Bench, a systematic and multi-dimensional benchmark designed to evaluate diagnostic intelligence and guide the optimization of Medical Multimodal Large Language Models (Med-MLLMs) in ECG scenarios; and (iii) HeartcareGPT with a tailored tokenizer, Bidirectional ECG Abstract Tokenization (Beat), which compresses raw multi-lead signals into semantically rich discrete tokens via dual-level vector quantization and a query-guided bidirectional diffusion mechanism. Built upon Heartcare-220K, HeartcareGPT achieves strong generalization and SoTA performance across multiple clinically meaningful tasks. Extensive experiments demonstrate that Heartcare Suite is highly effective in advancing ECG-specific multimodal understanding and evaluation. Our project is available at this https URL.

Title: FontAdapter: Instant Font Adaptation in Visual Text Generation

Authors: Myungkyu Koo, Subin Kim, Sangkyung Kwak, Jaehyun Nam, Seojin Kim, Jinwoo Shin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05843
Pdf URL: https://arxiv.org/pdf/2506.05843
Copy Paste: [[2506.05843]] FontAdapter: Instant Font Adaptation in Visual Text Generation(https://arxiv.org/abs/2506.05843)
Keywords: robust, diffusion
Abstract: Text-to-image diffusion models have significantly improved the seamless integration of visual text into diverse image contexts. Recent approaches further improve control over font styles through fine-tuning with predefined font dictionaries. However, adapting to unseen fonts outside the preset is computationally expensive, often requiring tens of minutes, making real-time customization impractical. In this paper, we present FontAdapter, a framework that enables visual text generation in unseen fonts within seconds, conditioned on a reference glyph image. To this end, we find that direct training on font datasets fails to capture nuanced font attributes, limiting generalization to new glyphs. To overcome this, we propose a two-stage curriculum learning approach: FontAdapter first learns to extract font attributes from isolated glyphs and then integrates these styles into diverse natural backgrounds. To support this two-stage training scheme, we construct synthetic datasets tailored to each stage, leveraging large-scale online fonts effectively. Experiments demonstrate that FontAdapter enables high-quality, robust font customization across unseen fonts without additional fine-tuning during inference. Furthermore, it supports visual text editing, font style blending, and cross-lingual font transfer, positioning FontAdapter as a versatile framework for font customization tasks.

Title: $\text{C}^{2}\text{BNVAE}$: Dual-Conditional Deep Generation of Network Traffic Data for Network Intrusion Detection System Balancing

Authors: Yifan Zeng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05844
Pdf URL: https://arxiv.org/pdf/2506.05844
Copy Paste: [[2506.05844]] $\text{C}^{2}\text{BNVAE}$: Dual-Conditional Deep Generation of Network Traffic Data for Network Intrusion Detection System Balancing(https://arxiv.org/abs/2506.05844)
Keywords: attack
Abstract: Network Intrusion Detection Systems (NIDS) face challenges due to class imbalance, affecting their ability to detect novel and rare attacks. This paper proposes a Dual-Conditional Batch Normalization Variational Autoencoder ($\text{C}^{2}\text{BNVAE}$) for generating balanced and labeled network traffic data. $\text{C}^{2}\text{BNVAE}$ improves the model's adaptability to different data categories and generates realistic category-specific data by incorporating Conditional Batch Normalization (CBN) into the Conditional Variational Autoencoder (CVAE). Experiments on the NSL-KDD dataset show the potential of $\text{C}^{2}\text{BNVAE}$ in addressing imbalance and improving NIDS performance with lower computational overhead compared to some baselines.

Title: Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Authors: Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo, Kangmin Yoo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05850
Pdf URL: https://arxiv.org/pdf/2506.05850
Copy Paste: [[2506.05850]] Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models(https://arxiv.org/abs/2506.05850)
Keywords: large language model
Abstract: We identify \textbf{Cross-lingual Collapse}, a systematic drift in which the chain-of-thought (CoT) of a multilingual language model reverts to its dominant pre-training language even when the prompt is expressed in a different language. Recent large language models (LLMs) with reinforcement learning with verifiable reward (RLVR) have achieved strong logical reasoning performance by exposing their intermediate reasoning traces, giving rise to large reasoning models (LRMs). However, the mechanism behind multilingual reasoning in LRMs is not yet fully explored. To investigate the issue, we fine-tune multilingual LRMs with Group-Relative Policy Optimization (GRPO) on translated versions of the GSM8K and SimpleRL-Zoo datasets in three different languages: Chinese, Korean, and Ukrainian. During training, we monitor both task accuracy and language consistency of the reasoning chains. Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) a language consistency reward mitigates this drift but does so at the expense of a drop of almost 5-10 pp in accuracy; and (iii) the resulting language collapse is severely damaging and largely irreversible, as subsequent fine-tuning struggles to steer the model back toward its original target-language reasoning capabilities. Together, these findings point to a remarkable conclusion: \textit{not all languages are trained equally for reasoning}. Furthermore, our paper sheds light on the roles of reward shaping, data difficulty, and pre-training priors in eliciting multilingual reasoning.

Title: Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025

Authors: Yuqian Fu, Runze Wang, Yanwei Fu, Danda Pani Paudel, Luc Van Gool
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05856
Pdf URL: https://arxiv.org/pdf/2506.05856
Copy Paste: [[2506.05856]] Cross-View Multi-Modal Segmentation @ Ego-Exo4D Challenges 2025(https://arxiv.org/abs/2506.05856)
Keywords: robust, segmentation
Abstract: In this report, we present a cross-view multi-modal object segmentation approach for the object correspondence task in the Ego-Exo4D Correspondence Challenges 2025. Given object queries from one perspective (e.g., ego view), the goal is to predict the corresponding object masks in another perspective (e.g., exo view). To tackle this task, we propose a multimodal condition fusion module that enhances object localization by leveraging both visual masks and textual descriptions as segmentation conditions. Furthermore, to address the visual domain gap between ego and exo views, we introduce a cross-view object alignment module that enforces object-level consistency across perspectives, thereby improving the model's robustness to viewpoint changes. Our proposed method ranked second on the leaderboard of the large-scale Ego-Exo4D object correspondence benchmark. Code will be made available at this https URL.

Title: ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On

Authors: Jinjuan Wang, Wenzhang Sun, Ming Li, Yun Zheng, Fanyao Li, Zhulin Tao, Donglin Di, Hao Li, Wei Chen, Xianglin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05858
Pdf URL: https://arxiv.org/pdf/2506.05858
Copy Paste: [[2506.05858]] ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On(https://arxiv.org/abs/2506.05858)
Keywords: robust, diffusion
Abstract: Video virtual try-on aims to seamlessly replace the clothing of a person in a source video with a target garment. Despite significant progress in this field, existing approaches still struggle to maintain continuity and reproduce garment details. In this paper, we introduce ChronoTailor, a diffusion-based framework that generates temporally consistent videos while preserving fine-grained garment details. By employing a precise spatio-temporal attention mechanism to guide the integration of fine-grained garment features, ChronoTailor achieves robust try-on performance. First, ChronoTailor leverages region-aware spatial guidance to steer the evolution of spatial attention and employs an attention-driven temporal feature fusion mechanism to generate more continuous temporal features. This dual approach not only enables fine-grained local editing but also effectively mitigates artifacts arising from video dynamics. Second, ChronoTailor integrates multi-scale garment features to preserve low-level visual details and incorporates a garment-pose feature alignment to ensure temporal continuity during dynamic motion. Additionally, we collect StyleDress, a new dataset featuring intricate garments, varied environments, and diverse poses, which offers advantages over existing public datasets and will be made publicly available for research. Extensive experiments show that ChronoTailor maintains spatio-temporal continuity and preserves garment details during motion, significantly outperforming previous methods.

Title: CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy

Authors: Jiakai Zhang, Shouchen Zhou, Haizhao Dai, Xinhang Liu, Peihao Wang, Zhiwen Fan, Yuan Pei, Jingyi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05864
Pdf URL: https://arxiv.org/pdf/2506.05864
Copy Paste: [[2506.05864]] CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy(https://arxiv.org/abs/2506.05864)
Keywords: robust
Abstract: Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from Cryo-EM noisy images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.

Title: Stealix: Model Stealing via Prompt Evolution

Authors: Zhixiong Zhuang, Hui-Po Wang, Maria-Irina Nicolae, Mario Fritz
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05867
Pdf URL: https://arxiv.org/pdf/2506.05867
Copy Paste: [[2506.05867]] Stealix: Model Stealing via Prompt Evolution(https://arxiv.org/abs/2506.05867)
Keywords: security, attack, steal, diffusion, generative
Abstract: Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model's data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized.

Title: BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures

Authors: Xiannan Hu, Tianyou Zeng, Xiaoming Yuan, Liwei Song, Guangyuan Zhang, Bangzheng He
Subjects: cs.LG, cs.DC, cs.PF
Abstract URL: https://arxiv.org/abs/2506.05871
Pdf URL: https://arxiv.org/pdf/2506.05871
Copy Paste: [[2506.05871]] BestServe: Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures(https://arxiv.org/abs/2506.05871)
Keywords: large language model
Abstract: Serving large language models (LLMs) to millions of users requires efficient resource allocation and parallelism strategies. Finding such a strategy is a labor-intensive trial-and-error process. We present BestServe, a novel framework for ranking serving strategies by estimating goodput under various operating scenarios. Supporting both collocated and disaggregated architectures, BestServe leverages an inference simulator built on an adapted roofline model and CPU-GPU dispatch dynamics. Our framework determines the optimal strategy in minutes on a single standard CPU, eliminating the need for costly benchmarking, while achieving predictions within a $20\%$ error margin. It appears practical for rapid deployment planning because of its lightweight design and strong extensibility.
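
To make the roofline idea mentioned in the abstract concrete, the sketch below estimates an operator's execution time as the larger of its compute time and its memory-transfer time. This is the textbook roofline bound, not BestServe's adapted model; the function name `roofline_time` and all hardware numbers are hypothetical.

```python
def roofline_time(flops: float, bytes_moved: float,
                  peak_flops: float, peak_bandwidth: float) -> float:
    """Lower-bound execution time (seconds) under the basic roofline model."""
    compute_time = flops / peak_flops            # time if limited by compute
    memory_time = bytes_moved / peak_bandwidth   # time if limited by memory traffic
    return max(compute_time, memory_time)


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# Memory-bound operator: few FLOPs per byte moved.
t_memory_bound = roofline_time(flops=2e9, bytes_moved=1e9,
                               peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BW)

# Compute-bound operator: many FLOPs per byte moved.
t_compute_bound = roofline_time(flops=4e12, bytes_moved=1e9,
                                peak_flops=PEAK_FLOPS, peak_bandwidth=PEAK_BW)
```

A simulator in this spirit would presumably sum such per-operator estimates over a request and add scheduling overheads; how BestServe does this is not stated in the abstract.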

Title: Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

Authors: Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05872
Pdf URL: https://arxiv.org/pdf/2506.05872
Copy Paste: [[2506.05872]] Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection(https://arxiv.org/abs/2506.05872)
Keywords: generative
Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.

Title: Interpretable Clustering Ensemble

Authors: Hang Lv, Lianyu Hu, Mudi Jiang, Xinying Liu, Zengyou He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05877
Pdf URL: https://arxiv.org/pdf/2506.05877
Copy Paste: [[2506.05877]] Interpretable Clustering Ensemble(https://arxiv.org/abs/2506.05877)
Keywords: interpretability
Abstract: Clustering ensemble has emerged as an important research topic in the field of machine learning. Although numerous methods have been proposed to improve clustering quality, most existing approaches overlook the need for interpretability in high-stakes applications. In domains such as medical diagnosis and financial risk assessment, algorithms must not only be accurate but also interpretable to ensure transparent and trustworthy decision-making. Therefore, to address the lack of interpretable algorithms in the field of clustering ensemble, we propose the first interpretable clustering ensemble algorithm in the literature. By treating base partitions as categorical variables, our method constructs a decision tree in the original feature space and uses a statistical association test to guide the tree-building process. Experimental results demonstrate that our algorithm achieves comparable performance to state-of-the-art (SOTA) clustering ensemble methods while maintaining an additional feature of interpretability. To the best of our knowledge, this is the first interpretable algorithm specifically designed for clustering ensemble, offering a new perspective for future research in interpretable clustering.

Title: NILMFormer: Non-Intrusive Load Monitoring that Accounts for Non-Stationarity

Authors: Adrien Petralia, Philippe Charpentier, Youssef Kadhi, Themis Palpanas
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2506.05880
Pdf URL: https://arxiv.org/pdf/2506.05880
Copy Paste: [[2506.05880]] NILMFormer: Non-Intrusive Load Monitoring that Accounts for Non-Stationarity(https://arxiv.org/abs/2506.05880)
Keywords: transformer
Abstract: Millions of smart meters have been deployed worldwide, collecting the total power consumed by individual households. Based on these data, electricity suppliers offer their clients energy monitoring solutions to provide feedback on the consumption of their individual appliances. Historically, such estimates have relied on statistical methods that use coarse-grained total monthly consumption and static customer data, such as appliance ownership. Non-Intrusive Load Monitoring (NILM) is the problem of disaggregating a household's collected total power consumption to retrieve the consumed power for individual appliances. Current state-of-the-art (SotA) solutions for NILM are based on deep-learning (DL) and operate on subsequences of an entire household consumption reading. However, the non-stationary nature of real-world smart meter data leads to a drift in the data distribution within each segmented window, which significantly affects model performance. This paper introduces NILMFormer, a Transformer-based architecture that incorporates a new subsequence stationarization/de-stationarization scheme to mitigate the distribution drift and that uses a novel positional encoding that relies only on the subsequence's timestamp information. Experiments with 4 real-world datasets show that NILMFormer significantly outperforms the SotA approaches. Our solution has been deployed as the backbone algorithm for EDF's (Electricité De France) consumption monitoring service, delivering detailed insights to millions of customers about their individual appliances' power consumption. This paper appeared in KDD 2025.

Title: Query Nearby: Offset-Adjusted Mask2Former enhances small-organ segmentation

Authors: Xin Zhang, Dongdong Meng, Sheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05897
Pdf URL: https://arxiv.org/pdf/2506.05897
Copy Paste: [[2506.05897]] Query Nearby: Offset-Adjusted Mask2Former enhances small-organ segmentation(https://arxiv.org/abs/2506.05897)
Keywords: transformer, segmentation
Abstract: Medical segmentation plays an important role in clinical applications like radiation therapy and surgical guidance, but acquiring clinically acceptable results is difficult. In recent years, progress has been witnessed with the success of utilizing transformer-like models, such as combining the attention mechanism with CNN. In particular, transformer-based segmentation models can extract global information more effectively, compensating for the drawbacks of CNN modules that focus on local features. However, utilizing transformer architecture is not easy, because training transformer-based models can be resource-demanding. Moreover, due to the distinct characteristics in the medical field, especially when encountering mid-sized and small organs with compact regions, their results often seem unsatisfactory. For example, using ViT to segment medical images directly only gives a DSC of less than 50\%, which is far lower than the clinically acceptable score of 80\%. In this paper, we used Mask2Former with deformable attention to reduce computation and proposed offset adjustment strategies to encourage sampling points within the same organs during attention weights computation, thereby integrating compact foreground information better. Additionally, we utilized the 4th feature map in Mask2Former to provide a coarse location of organs, and employed an FCN-based auxiliary head to help train Mask2Former more quickly using Dice loss. We show that our model achieves SOTA (State-of-the-Art) performance on the HaNSeg and SegRap2023 datasets, especially on mid-sized and small organs. Our code is available at this https URL\_Background-location\_Decoder\_Mask2former.
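
The DSC figures and the Dice loss referred to in the abstract both rest on the Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|). The sketch below computes it for flattened masks using only the standard library. The function names are illustrative, and the paper's exact loss formulation (smoothing terms, per-class averaging) is not given in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two flattened masks.

    Works for binary masks (0/1) and for soft predictions in [0, 1].
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def dice_loss(pred, target):
    """Loss that decreases as the overlap between the masks grows."""
    return 1.0 - dice_coefficient(pred, target)


# One of two predicted foreground pixels overlaps the single true pixel.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])  # 2*1 / (2+1)
```

Because the denominator is the combined foreground size, a few wrong pixels change the score far more for a small organ than for a large one. That is one reason small-organ DSC is hard to push up.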

Title: Differentially Private Explanations for Clusters

Authors: Amir Gilad, Tova Milo, Kathy Razmadze, Ron Zadicario
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2506.05900
Pdf URL: https://arxiv.org/pdf/2506.05900
Copy Paste: [[2506.05900]] Differentially Private Explanations for Clusters(https://arxiv.org/abs/2506.05900)
Keywords: secure, privacy, protect
Abstract: The dire need to protect sensitive data has led to various flavors of privacy definitions. Among these, differential privacy (DP) is considered one of the most rigorous and secure notions of privacy, enabling data analysis while preserving the privacy of data contributors. One of the fundamental tasks of data analysis is clustering, which is meant to unravel hidden patterns within complex datasets. However, interpreting clustering results poses significant challenges, and often necessitates an extensive analytical process. Interpreting clustering results under DP is even more challenging, as analysts are provided with noisy responses to queries, and longer, manual exploration sessions require additional noise to meet privacy constraints. While increasing attention has been given to clustering explanation frameworks that aim at assisting analysts by automatically uncovering the characteristics of each cluster, such frameworks may also disclose sensitive information within the dataset, leading to a breach in privacy. To address these challenges, we present DPClustX, a framework that provides explanations for black-box clustering results while satisfying DP. DPClustX takes as input the sensitive dataset alongside privately computed clustering labels, and outputs a global explanation, emphasizing prominent characteristics of each cluster while guaranteeing DP. We perform an extensive experimental analysis of DPClustX on real data, showing that it provides insightful and accurate explanations even under tight privacy constraints.

Title: Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router

Authors: Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05901
Pdf URL: https://arxiv.org/pdf/2506.05901
Copy Paste: [[2506.05901]] Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router(https://arxiv.org/abs/2506.05901)
Keywords: large language model
Abstract: Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models (LLMs) by decomposing complex tasks into intermediate steps, either explicitly or implicitly. Extending the reasoning chain at test time, through deeper thought processes or broader exploration, can further improve performance, but often incurs substantial costs due to the explosion in token usage. Yet, many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models (SLMs). This motivates hybrid approaches that allocate subtasks across models of varying capacities. However, realizing such collaboration requires accurate task decomposition and difficulty-aware subtask allocation, which is challenging. To address this, we propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs by dynamically routing sub-tasks based on estimated complexity. At the core of our framework is a Reinforced Model Router, composed of a task decomposer and a subtask allocator. The task decomposer segments complex input queries into logically ordered subtasks, while the subtask allocator assigns each subtask to the most appropriate model, ranging from lightweight SLMs to powerful LLMs, balancing accuracy and efficiency. To train this router, we introduce a staged pipeline that combines supervised fine-tuning on task-specific datasets with the Group Relative Policy Optimization algorithm, enabling self-supervised refinement through iterative reinforcement learning. Extensive experiments across four challenging benchmarks demonstrate that R2-Reasoner reduces API costs by 86.85% while maintaining or surpassing baseline accuracy. Our framework paves the way for more cost-effective and adaptive LLM reasoning. The code is open-source at this https URL.

Title: A Driving Regime-Embedded Deep Learning Framework for Modeling Intra-Driver Heterogeneity in Multi-Scale Car-Following Dynamics

Authors: Shirui Zhou, Jiying Yan, Junfang Tian, Tao Wang, Yongfu Li, Shiquan Zhong
Subjects: cs.LG, physics.soc-ph
Abstract URL: https://arxiv.org/abs/2506.05902
Pdf URL: https://arxiv.org/pdf/2506.05902
Copy Paste: [[2506.05902]] A Driving Regime-Embedded Deep Learning Framework for Modeling Intra-Driver Heterogeneity in Multi-Scale Car-Following Dynamics(https://arxiv.org/abs/2506.05902)
Keywords: robust, segmentation
Abstract: A fundamental challenge in car-following modeling lies in accurately representing the multi-scale complexity of driving behaviors, particularly the intra-driver heterogeneity where a single driver's actions fluctuate dynamically under varying conditions. While existing models, both conventional and data-driven, address behavioral heterogeneity to some extent, they often emphasize inter-driver heterogeneity or rely on simplified assumptions, limiting their ability to capture the dynamic heterogeneity of a single driver under different driving conditions. To address this gap, we propose a novel data-driven car-following framework that systematically embeds discrete driving regimes (e.g., steady-state following, acceleration, cruising) into vehicular motion predictions. Leveraging high-resolution traffic trajectory datasets, the proposed hybrid deep learning architecture combines Gated Recurrent Units for discrete driving regime classification with Long Short-Term Memory networks for continuous kinematic prediction, unifying discrete decision-making processes and continuous vehicular dynamics to comprehensively represent inter- and intra-driver heterogeneity. Driving regimes are identified using a bottom-up segmentation algorithm and Dynamic Time Warping, ensuring robust characterization of behavioral states across diverse traffic scenarios. Comparative analyses demonstrate that the framework significantly reduces prediction errors for acceleration (maximum MSE improvement reached 58.47\%), speed, and spacing metrics while reproducing critical traffic phenomena, such as stop-and-go wave propagation and oscillatory dynamics.
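
Dynamic Time Warping, which the abstract names as part of regime identification, compares two sequences that may be locally stretched or compressed in time. The sketch below is the classic quadratic-time dynamic-programming formulation for 1-D sequences with an absolute-difference cost. It illustrates the general technique only; how the authors combine it with bottom-up segmentation is not described in the abstract.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# The same speed profile, once with a sample held for an extra time step.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# Two profiles offset by a constant of 1 at every step.
offset = dtw_distance([0, 0], [1, 1])
```

Because the warping path can repeat samples, the first pair has distance zero despite the different lengths. That property is what makes DTW suitable for matching behavioral segments of unequal duration.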

Title: Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness

Authors: Steven Landgraf, Markus Hillemann, Markus Ulrich
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05917
Pdf URL: https://arxiv.org/pdf/2506.05917
Copy Paste: [[2506.05917]] Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness(https://arxiv.org/abs/2506.05917)
Keywords: robust, segmentation
Abstract: Semantic segmentation is critical for scene understanding but demands costly pixel-wise annotations, attracting increasing attention to semi-supervised approaches to leverage abundant unlabeled data. While semi-supervised segmentation is often promoted as a path toward scalable, real-world deployment, it is astonishing that current evaluation protocols exclusively focus on segmentation accuracy, entirely overlooking reliability and robustness. These qualities, which ensure consistent performance under diverse conditions (robustness) and well-calibrated model confidences as well as meaningful uncertainties (reliability), are essential for safety-critical applications like autonomous driving, where models must handle unpredictable environments and avoid sudden failures at all costs. To address this gap, we introduce the Reliable Segmentation Score (RSS), a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. RSS penalizes deficiencies in any of its components, providing an easy and intuitive way of holistically judging segmentation models. Comprehensive evaluations of UniMatchV2 against its predecessor and a supervised baseline show that semi-supervised methods often trade reliability for accuracy. While out-of-domain evaluations demonstrate UniMatchV2's robustness, they further expose persistent reliability shortcomings. We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
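
The abstract says RSS combines accuracy, calibration, and uncertainty quality via a harmonic mean. The sketch below shows why that choice penalizes a weak component more than an arithmetic mean would. The three input values are placeholders assumed to be scaled to [0, 1] with higher meaning better; the paper's actual component metrics and normalization are not given in the abstract.

```python
def harmonic_mean(scores):
    """Harmonic mean of non-negative scores; zero if any score is zero."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical component scores: accuracy, calibration, uncertainty quality.
balanced = harmonic_mean([0.5, 0.5, 0.5])
lopsided = harmonic_mean([0.9, 0.9, 0.3])
arithmetic = (0.9 + 0.9 + 0.3) / 3
```

In the lopsided case the harmonic mean falls well below the arithmetic mean of 0.7, so a model cannot offset poor calibration with high accuracy alone.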

Title: Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques

Authors: Xiaofei Xu, Xiuzhen Zhang, Ke Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05924
Pdf URL: https://arxiv.org/pdf/2506.05924
Copy Paste: [[2506.05924]] Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques(https://arxiv.org/abs/2506.05924)
Keywords: large language model
Abstract: Fake news and misinformation pose a significant threat to society, making efficient mitigation essential. However, manual fact-checking is costly and lacks scalability. Large Language Models (LLMs) offer promise in automating counter-response generation to mitigate misinformation, but a critical challenge lies in their tendency to hallucinate non-factual information. Existing models mainly rely on LLM self-feedback to reduce hallucination, but this approach is computationally expensive. In this paper, we propose MisMitiFact, Misinformation Mitigation grounded in Facts, an efficient framework for generating fact-grounded counter-responses at scale. MisMitiFact generates simple critique feedback to refine LLM outputs, ensuring responses are grounded in evidence. We develop lightweight, fine-grained critique models trained on data sourced from readily available fact-checking sites to identify and correct errors in key elements such as numerals, entities, and topics in LLM generations. Experiments show that MisMitiFact generates counter-responses of comparable quality to LLMs' self-feedback while using significantly smaller critique models. Importantly, it achieves a ~5x increase in feedback generation throughput, making it highly suitable for cost-effective, large-scale misinformation mitigation. Code and LLM prompt templates are at this https URL.

Title: LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations

Authors: Belén Agüera-Marco, Itziar Gonzalez-Dios
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05927
Pdf URL: https://arxiv.org/pdf/2506.05927
Copy Paste: [[2506.05927]] LengClaro2023: A Dataset of Administrative Texts in Spanish with Plain Language adaptations(https://arxiv.org/abs/2506.05927)
Keywords: security
Abstract: In this work, we present LengClaro2023, a dataset of legal-administrative texts in Spanish. Based on the most frequently used procedures from the Spanish Social Security website, we have created for each text two simplified equivalents. The first version follows the recommendations provided by arText claro. The second version incorporates additional recommendations from plain language guidelines to explore further potential improvements in the system. The linguistic resource created in this work can be used for evaluating automatic text simplification (ATS) systems in Spanish.

Title: MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models

Authors: Jie Cao, Tianwei Lin, Hongyang He, Rolan Yan, Wenqiao Zhang, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05928
Pdf URL: https://arxiv.org/pdf/2506.05928
Copy Paste: [[2506.05928]] MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models(https://arxiv.org/abs/2506.05928)
Keywords: large language model
Abstract: Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ \emph{homogeneous} MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a \emph{heterogeneous} \textbf{Mixture-of-Adapters (MoA)} approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: \textbf{(i)} \textit{Soft MoA} achieves fine-grained integration by performing a weighted fusion of all expert outputs; \textbf{(ii)} \textit{Sparse MoA} activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at this https URL.
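
The two variants described in the abstract differ in how expert outputs are combined. The sketch below contrasts a weighted fusion of all experts with a sparse variant that keeps only the highest-weighted experts. The softmax router and the top-k selection with renormalization are assumptions standing in for details the abstract does not give, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_fusion(expert_outputs, router_logits):
    """Weighted sum over ALL expert outputs (soft variant)."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]


def sparse_fusion(expert_outputs, router_logits, k=1):
    """Weighted sum over the k highest-weighted experts only (assumed top-k)."""
    weights = softmax(router_logits)
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    norm = sum(weights[i] for i in top)  # renormalize over the kept experts
    dim = len(expert_outputs[0])
    return [sum(weights[i] / norm * expert_outputs[i][d] for i in top)
            for d in range(dim)]


# Two toy "adapter experts", each producing a 2-dimensional output.
outputs = [[1.0, 0.0], [0.0, 1.0]]
soft = soft_fusion(outputs, router_logits=[0.0, 0.0])
sparse = sparse_fusion(outputs, router_logits=[2.0, 0.0], k=1)
```

With equal router logits the soft variant averages both experts, while the sparse variant with k=1 passes through only the preferred expert and skips the computation for the rest.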

Title: FADE: Frequency-Aware Diffusion Model Factorization for Video Editing

Authors: Yixuan Zhu, Haolin Wang, Shilin Ma, Wenliang Zhao, Yansong Tang, Lei Chen, Jie Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05934
Pdf URL: https://arxiv.org/pdf/2506.05934
Copy Paste: [[2506.05934]] FADE: Frequency-Aware Diffusion Model Factorization for Video Editing(https://arxiv.org/abs/2506.05934)
Keywords: diffusion
Abstract: Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component's specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively. Code is available at this https URL.

Title: DynamicMind: A Tri-Mode Thinking System for Large Language Models

Authors: Wei Li, Yanbin Wei, Qiushi Huang, Jiangyue Yan, Yang Chen, James T. Kwok, Yu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05936
Pdf URL: https://arxiv.org/pdf/2506.05936
Copy Paste: [[2506.05936]] DynamicMind: A Tri-Mode Thinking System for Large Language Models(https://arxiv.org/abs/2506.05936)
Keywords: large language model
Abstract: Modern large language models (LLMs) often struggle to dynamically adapt their reasoning depth to varying task complexities, leading to suboptimal performance or inefficient resource utilization. To address this, we introduce DynamicMind, a novel tri-mode thinking system. DynamicMind empowers LLMs to autonomously select between Fast, Normal, and Slow thinking modes for zero-shot question answering (ZSQA) tasks through cognitive-inspired prompt engineering. Our framework's core innovations include: (1) expanding the established dual-process framework of fast and slow thinking into a tri-mode thinking system involving a normal thinking mode to preserve the intrinsic capabilities of LLM; (2) proposing the Thinking Density metric, which aligns computational resource allocation with problem complexity; and (3) developing the Thinking Mode Capacity (TMC) dataset and a lightweight Mind Router to predict the optimal thinking mode. Extensive experiments across diverse mathematical, commonsense, and scientific QA benchmarks demonstrate that DynamicMind achieves superior ZSQA capabilities while establishing an effective trade-off between performance and computational efficiency.

Title: Quantifying Adversarial Uncertainty in Evidential Deep Learning using Conflict Resolution

Authors: Charmaine Barker, Daniel Bethell, Simos Gerasimou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05937
Pdf URL: https://arxiv.org/pdf/2506.05937
Copy Paste: [[2506.05937]] Quantifying Adversarial Uncertainty in Evidential Deep Learning using Conflict Resolution(https://arxiv.org/abs/2506.05937)
Keywords: attack, robust
Abstract: Reliability of deep learning models is critical for deployment in high-stakes applications, where out-of-distribution or adversarial inputs may lead to detrimental outcomes. Evidential Deep Learning, an efficient paradigm for uncertainty quantification, models predictions as Dirichlet distributions of a single forward pass. However, EDL is particularly vulnerable to adversarially perturbed inputs, making overconfident errors. Conflict-aware Evidential Deep Learning (C-EDL) is a lightweight post-hoc uncertainty quantification approach that mitigates these issues, enhancing adversarial and OOD robustness without retraining. C-EDL generates diverse, task-preserving transformations per input and quantifies representational disagreement to calibrate uncertainty estimates when needed. C-EDL's conflict-aware prediction adjustment improves detection of OOD and adversarial inputs, maintaining high in-distribution accuracy and low computational overhead. Our experimental evaluation shows that C-EDL significantly outperforms state-of-the-art EDL variants and competitive baselines, achieving substantial reductions in coverage for OOD data (up to 55%) and adversarial data (up to 90%), across a range of datasets, attack types, and uncertainty metrics.
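
As a worked illustration of the Dirichlet view mentioned in the abstract, the sketch below computes the standard evidential quantities (concentration parameters equal to evidence plus one, expected class probabilities, and an uncertainty mass equal to the number of classes divided by the Dirichlet strength). The second function is a hypothetical stand-in for a conflict-aware adjustment: it averages the evidence from several transformed views and shrinks it when their top classes disagree. It is not the C-EDL formula, only a sketch of the general idea.

```python
def dirichlet_summary(evidence):
    """Standard EDL quantities from non-negative per-class evidence."""
    alphas = [e + 1.0 for e in evidence]      # Dirichlet concentration parameters
    strength = sum(alphas)                    # Dirichlet strength S
    probs = [a / strength for a in alphas]    # expected class probabilities
    uncertainty = len(alphas) / strength      # uncertainty mass K / S
    return probs, uncertainty

def conflict_adjusted_evidence(view_evidence):
    """Hypothetical adjustment: mean evidence over views, scaled down by the
    fraction of views whose top class differs from the majority top class."""
    n_views = len(view_evidence)
    n_classes = len(view_evidence[0])
    mean = [sum(v[k] for v in view_evidence) / n_views for k in range(n_classes)]
    top = [max(range(n_classes), key=v.__getitem__) for v in view_evidence]
    majority = max(set(top), key=top.count)
    conflict = sum(t != majority for t in top) / n_views
    return [m * (1.0 - conflict) for m in mean], conflict

# Views that agree keep their evidence; views that disagree lose half of it here.
agree, c_agree = conflict_adjusted_evidence([[8.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
clash, c_clash = conflict_adjusted_evidence([[8.0, 0.0, 0.0], [0.0, 8.0, 0.0]])
_, u_agree = dirichlet_summary(agree)
_, u_clash = dirichlet_summary(clash)
```

With these toy numbers the disagreeing views end up with a larger uncertainty mass than the agreeing ones, which is the qualitative behaviour the abstract describes for adversarial and OOD inputs.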

Title: Exponential Family Variational Flow Matching for Tabular Data Generation

Authors: Andrés Guzmán-Cordero, Floor Eijkelboom, Jan-Willem van de Meent
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05940
Pdf URL: https://arxiv.org/pdf/2506.05940
Copy Paste: [[2506.05940]] Exponential Family Variational Flow Matching for Tabular Data Generation(https://arxiv.org/abs/2506.05940)
Keywords: diffusion, generative
Abstract: While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We hereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.

Title: Comparative Analysis of Modern Machine Learning Models for Retail Sales Forecasting

Authors: Luka Hobor, Mario Brcic, Lidija Polutnik, Ante Kapetanovic
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05941
Pdf URL: https://arxiv.org/pdf/2506.05941
Copy Paste: [[2506.05941]] Comparative Analysis of Modern Machine Learning Models for Retail Sales Forecasting(https://arxiv.org/abs/2506.05941)
Keywords: transformer
Abstract: Accurate forecasting is key for all business planning. When estimated sales are too high, brick-and-mortar retailers may incur higher costs due to unsold inventories, higher labor and storage space costs, etc. On the other hand, when forecasts underestimate the level of sales, firms experience lost sales, shortages, and impact on the reputation of the retailer in their relevant market. Accurate forecasting presents a competitive advantage for companies. It facilitates the achievement of revenue and profit goals and execution of pricing strategy and tactics. In this study, we provide an exhaustive assessment of the forecasting models applied to a high-resolution brick-and-mortar retail dataset. Our forecasting framework addresses the problems found in retail environments, including intermittent demand, missing values, and frequent product turnover. We compare tree-based ensembles (such as XGBoost and LightGBM) and state-of-the-art neural network architectures (including N-BEATS, NHITS, and the Temporal Fusion Transformer) across various experimental settings. Our results show that localized modeling strategies, especially those using tree-based models on individual groups with non-imputed data, consistently deliver superior forecasting accuracy and computational efficiency. In contrast, neural models benefit from advanced imputation methods, yet still fall short in handling the irregularities typical of physical retail data. These results further practical understanding for model selection in retail environments and highlight the significance of data preprocessing to improve forecast performance.

Title: Additive decomposition of one-dimensional signals using Transformers

Authors: Samuele Salti, Andrea Pinto, Alessandro Lanza, Serena Morigi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05942
Pdf URL: https://arxiv.org/pdf/2506.05942
Copy Paste: [[2506.05942]] Additive decomposition of one-dimensional signals using Transformers(https://arxiv.org/abs/2506.05942)
Keywords: transformer
Abstract: One-dimensional signal decomposition is a well-established and widely used technique across various scientific fields. It serves as a highly valuable pre-processing step for data analysis. While traditional decomposition techniques often rely on mathematical models, recent research suggests that applying the latest deep learning models to this problem presents an exciting, unexplored area with promising potential. This work presents a novel method for the additive decomposition of one-dimensional signals. We leverage the Transformer architecture to decompose signals into their constituent components: piece-wise constant, smooth (low-frequency oscillatory), textured (high-frequency oscillatory), and a noise component. Our model, trained on synthetic data, achieves excellent accuracy in modeling and decomposing input signals from the same distribution, as demonstrated by the experimental results.

Title: IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems

Authors: Xinjie Zhang, Wenxuan Wang, Qin Jin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05947
Pdf URL: https://arxiv.org/pdf/2506.05947
Copy Paste: [[2506.05947]] IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems(https://arxiv.org/abs/2506.05947)
Keywords: large language model
Abstract: In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter's motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel at text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at this https URL.

Title: Elementary Math Word Problem Generation using Large Language Models

Authors: Nimesh Ariyarathne, Harshani Bandara, Yasith Heshan, Omega Gamage, Surangika Ranathunga, Dilan Nayanajith, Yutharsan Sivapalan, Gayathri Lihinikaduarachchi, Tharoosha Vihidun, Meenambika Chandirakumar, Sanujen Premakumar, Sanjula Gathsara
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05950
Pdf URL: https://arxiv.org/pdf/2506.05950
Copy Paste: [[2506.05950]] Elementary Math Word Problem Generation using Large Language Models(https://arxiv.org/abs/2506.05950)
Keywords: large language model
Abstract: Mathematics is often perceived as a complex subject by students, leading to high failure rates in exams. To improve Mathematics skills, it is important to provide sample questions for students to practice problem-solving. Manually creating Math Word Problems (MWPs) is time-consuming for tutors, because they have to type in natural language while adhering to grammar and spelling rules of the language. Existing Deep Learning techniques for MWP generation require a tutor to provide the initial portion of the MWP and/or additional information such as an equation. In this paper, we present an MWP generation system based on Large Language Models (LLMs) that overcomes the need for additional input - the only input to our system is the number of MWPs needed, the grade and the type of question (e.g. addition, subtraction). Unlike the existing LLM-based solutions for MWP generation, we carried out an extensive set of experiments involving different LLMs, prompting strategies, techniques to improve the diversity of questions, as well as techniques that employ human feedback to improve LLM performance. Human and automated evaluations confirmed that the generated MWPs are high in quality, with minimal spelling and grammar issues. However, LLMs still struggle to generate questions that adhere to the specified grade and question type requirements.

Title: MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation

Authors: Dongjie Fu, Tengjiao Sun, Pengcheng Fang, Xiaohao Cai, Hansung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05952
Pdf URL: https://arxiv.org/pdf/2506.05952
Copy Paste: [[2506.05952]] MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation(https://arxiv.org/abs/2506.05952)
Keywords: transformer
Abstract: Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further introduce a text condition alignment mechanism that improves motion decoding under textual control. Extensive experiments on benchmark datasets including HumanML3D, KIT-ML, and CMP demonstrate that MOGO achieves competitive or superior generation quality compared to state-of-the-art transformer-based methods, while offering substantial improvements in real-time performance, streaming generation, and generalization under zero-shot settings.

Title: AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models

Authors: Adil Hasan, Thomas Peyrin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05960
Pdf URL: https://arxiv.org/pdf/2506.05960
Copy Paste: [[2506.05960]] AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models(https://arxiv.org/abs/2506.05960)
Keywords: diffusion, large language model
Abstract: Significant investments have been made towards the commodification of diffusion models for generation of diverse media. Their mass-market adoption is however still hobbled by the intense hardware resource requirements of diffusion model inference. Model quantization strategies tailored specifically towards diffusion models have been useful in easing this burden, yet have generally explored the Uniform Scalar Quantization (USQ) family of quantization methods. In contrast, Vector Quantization (VQ) methods, which operate on groups of multiple related weights as the basic unit of compression, have seen substantial success in Large Language Model (LLM) quantization. In this work, we apply codebook-based additive vector quantization to the problem of diffusion model compression. Our resulting approach achieves a new Pareto frontier for the extremely low-bit weight quantization on the standard class-conditional benchmark of LDM-4 on ImageNet at 20 inference time steps. Notably, we report sFID 1.92 points lower than the full-precision model at W4A8 and the best-reported results for FID, sFID and ISC at W2A8. We are also able to demonstrate FLOPs savings on arbitrary hardware via an efficient inference kernel, as opposed to savings resulting from small integer operations which may lack broad hardware support.
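
To make the idea of codebook-based additive vector quantization concrete, here is a minimal sketch: a small group of weights is stored as one index per codebook and reconstructed as the sum of the selected codewords. The toy codebooks and the greedy residual encoder are assumptions for illustration; additive quantization methods typically search over codes jointly and learn the codebooks, and nothing here reproduces the paper's procedure.

```python
def decode_group(codebooks, codes):
    """Reconstruct one weight group as the sum of one codeword per codebook."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for book, idx in zip(codebooks, codes):
        for i in range(dim):
            out[i] += book[idx][i]
    return out

def encode_group_greedy(codebooks, weights):
    """Simplified greedy encoder (assumption): for each codebook in turn,
    pick the codeword closest to what is still unexplained."""
    residual = list(weights)
    codes = []
    for book in codebooks:
        idx = min(
            range(len(book)),
            key=lambda j: sum((r - c) ** 2 for r, c in zip(residual, book[j])),
        )
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes

# Two toy codebooks of four 2-dimensional codewords each: 2 bits per codebook,
# so this group of two weights costs 4 bits in total, i.e. 2 bits per weight.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]
codes = encode_group_greedy(codebooks, [1.2, 0.1])
approx = decode_group(codebooks, codes)
```

The contrast with uniform scalar quantization is that the unit of compression is the whole group: the bit budget is spent on indices into shared codebooks rather than on a rounded value per weight.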

Title: Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning

Authors: Motoki Omura, Kazuki Ota, Takayuki Osa, Yusuke Mukuta, Tatsuya Harada
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.05968
Pdf URL: https://arxiv.org/pdf/2506.05968
Copy Paste: [[2506.05968]] Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning(https://arxiv.org/abs/2506.05968)
Keywords: robust
Abstract: For continuous action spaces, actor-critic methods are widely used in online reinforcement learning (RL). However, unlike RL algorithms for discrete actions, which generally model the optimal value function using the Bellman optimality operator, RL algorithms for continuous actions typically model Q-values for the current policy using the Bellman operator. These algorithms for continuous actions rely exclusively on policy updates for improvement, which often results in low sample efficiency. This study examines the effectiveness of incorporating the Bellman optimality operator into actor-critic frameworks. Experiments in a simple environment show that modeling optimal values accelerates learning but leads to overestimation bias. To address this, we propose an annealing approach that gradually transitions from the Bellman optimality operator to the Bellman operator, thereby accelerating learning while mitigating bias. Our method, combined with TD3 and SAC, significantly outperforms existing approaches across various locomotion and manipulation tasks, demonstrating improved performance and robustness to hyperparameters related to optimality.
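
One simple way to picture a gradual transition between the two operators is to blend their backup targets with a weight that moves from 0 to 1 during training. The sketch below does exactly that. The linear schedule, the finite list of candidate next-state action values standing in for the maximum over a continuous action space, and all names are assumptions; the paper's actual annealing mechanism may differ.

```python
def annealing_weight(step, total_steps):
    """Linear schedule: 0.0 at the start (optimality backup only),
    1.0 at the end (policy backup only)."""
    return min(1.0, max(0.0, step / total_steps))

def blended_target(reward, gamma, next_q_candidates, next_q_policy, beta):
    """TD target that interpolates between the two backups.

    next_q_candidates: Q-values of sampled actions in the next state; their
                       maximum approximates the Bellman optimality backup.
    next_q_policy:     Q-value of the current policy's action in the next
                       state (the ordinary Bellman backup).
    beta:              0.0 gives the optimality backup, 1.0 the policy backup.
    """
    optimal_backup = max(next_q_candidates)
    mixed = (1.0 - beta) * optimal_backup + beta * next_q_policy
    return reward + gamma * mixed

early = blended_target(1.0, 0.5, [2.0, 4.0], 2.0, annealing_weight(0, 100))
middle = blended_target(1.0, 0.5, [2.0, 4.0], 2.0, annealing_weight(50, 100))
late = blended_target(1.0, 0.5, [2.0, 4.0], 2.0, annealing_weight(100, 100))
```

Early targets follow the optimistic maximum, which speeds up learning but can overestimate. Late targets follow the current policy's value, which is the standard actor-critic backup.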

Title: Let's Put Ourselves in Sally's Shoes: Shoes-of-Others Prefixing Improves Theory of Mind in Large Language Models

Authors: Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Yoshihiro Yamazaki, Keita Suzuki, Hiroaki Sugiyama, Kuniko Saito
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05970
Pdf URL: https://arxiv.org/pdf/2506.05970
Copy Paste: [[2506.05970]] Let's Put Ourselves in Sally's Shoes: Shoes-of-Others Prefixing Improves Theory of Mind in Large Language Models(https://arxiv.org/abs/2506.05970)
Keywords: large language model
Abstract: Recent studies have shown that Theory of Mind (ToM) in large language models (LLMs) has not reached human-level performance yet. Since fine-tuning LLMs on ToM datasets often degrades their generalization, several inference-time methods have been proposed to enhance ToM in LLMs. However, existing inference-time methods for ToM are specialized for inferring beliefs from contexts involving changes in the world state. In this study, we present a new inference-time method for ToM, Shoes-of-Others (SoO) prefixing, which makes fewer assumptions about contexts and is applicable to broader scenarios. SoO prefixing simply specifies the beginning of LLM outputs with "Let's put ourselves in A's shoes.", where A denotes the target character's name. We evaluate SoO prefixing on two benchmarks that assess ToM in conversational and narrative contexts without changes in the world state and find that it consistently improves ToM across five categories of mental states. Our analysis suggests that SoO prefixing elicits faithful thoughts, thereby improving the ToM performance.
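
Since SoO prefixing only fixes the first sentence of the model's output, it can be sketched as a small prompt-building helper. The prefix wording comes from the abstract. The function names and the context/question layout are hypothetical, and how the prefix is forced onto the output depends on the generation API in use.

```python
def soo_prefix(character):
    """Return the Shoes-of-Others prefix for the target character."""
    return f"Let's put ourselves in {character}'s shoes."

def build_prompt(context, question, character):
    """Return (user_prompt, output_prefix).

    The caller is expected to make the model continue from `output_prefix`,
    for example by pre-filling the start of the assistant turn (an assumption
    about the serving setup, not something stated in the abstract).
    """
    user_prompt = f"{context}\n\nQuestion: {question}"
    return user_prompt, soo_prefix(character)

prompt, prefix = build_prompt(
    "Sally and Anne are talking about the weekend.",
    "How does Sally feel about the plan?",
    "Sally",
)
```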

Title: On Measuring Long-Range Interactions in Graph Neural Networks

Authors: Jacob Bamberger, Benjamin Gutteridge, Scott le Roux, Michael M. Bronstein, Xiaowen Dong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05971
Pdf URL: https://arxiv.org/pdf/2506.05971
Copy Paste: [[2506.05971]] On Measuring Long-Range Interactions in Graph Neural Networks(https://arxiv.org/abs/2506.05971)
Keywords: robust
Abstract: Long-range graph tasks -- those dependent on interactions between distant nodes -- are an open problem in graph neural network research. Real-world benchmark tasks, especially the Long Range Graph Benchmark, have become popular for validating the long-range capability of proposed architectures. However, this is an empirical approach that lacks both robustness and theoretical underpinning; a more principled characterization of the long-range problem is required. To bridge this gap, we formalize long-range interactions in graph tasks, introduce a range measure for operators on graphs, and validate it with synthetic experiments. We then leverage our measure to examine commonly used tasks and architectures, and discuss to what extent they are, in fact, long-range. We believe our work advances efforts to define and address the long-range problem on graphs, and that our range measure will aid evaluation of new datasets and architectures.

Title: LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles

Authors: Egil Rønningstad, Gaurav Negi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05976
Pdf URL: https://arxiv.org/pdf/2506.05976
Copy Paste: [[2506.05976]] LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles(https://arxiv.org/abs/2506.05976)
Keywords: generative
Abstract: Our contribution to the SemEval 2025 shared task 10, subtask 1 on entity framing, tackles the challenge of providing the necessary segments from longer documents as context for classification with a masked language model. We show that simple entity-oriented heuristics for context selection can enable text classification using models with a limited context window. Our context selection approach combined with the XLM-RoBERTa language model is on par with, or outperforms, Supervised Fine-Tuning with larger generative language models.
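
An entity-oriented context selection heuristic of the kind the abstract mentions could look like the sketch below: keep the sentences that mention the target entity, in document order, until a word budget is used up. The sentence splitter, the case-insensitive substring match and the `max_words` budget are all assumptions made for illustration, not the authors' actual heuristics.

```python
import re

def select_context(document, entity, max_words=50):
    """Keep sentences mentioning `entity`, in order, within a word budget."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    picked, used = [], 0
    for sentence in sentences:
        if entity.lower() in sentence.lower():
            n_words = len(sentence.split())
            if used + n_words > max_words:
                break  # budget exhausted; stop adding context
            picked.append(sentence)
            used += n_words
    return " ".join(picked)

doc = "Alice met Bob. The weather was fine. Bob left early."
context = select_context(doc, "Bob")
```

The selected text would then be passed to the classifier in place of the full document, which is what lets a model with a short context window handle long articles.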

Title: Mitigating Catastrophic Forgetting with Adaptive Transformer Block Expansion in Federated Fine-Tuning

Authors: Yujia Huo, Jianchun Liu, Hongli Xu, Zhenguo Ma, Shilong Wang, Liusheng Huang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2506.05977
Pdf URL: https://arxiv.org/pdf/2506.05977
Copy Paste: [[2506.05977]] Mitigating Catastrophic Forgetting with Adaptive Transformer Block Expansion in Federated Fine-Tuning(https://arxiv.org/abs/2506.05977)
Keywords: privacy, federate, transformer, large language model
Abstract: Federated fine-tuning (FedFT) of large language models (LLMs) has emerged as a promising solution for adapting models to distributed data environments while ensuring data privacy. Existing FedFT methods predominantly utilize parameter-efficient fine-tuning (PEFT) techniques to reduce communication and computation overhead. However, they often fail to adequately address catastrophic forgetting, a critical challenge arising from continual adaptation in distributed environments. Traditional centralized fine-tuning methods, which are not designed for the heterogeneous and privacy-constrained nature of federated environments, struggle to mitigate this issue effectively. Moreover, the challenge is further exacerbated by significant variation in data distributions and device capabilities across clients, which leads to intensified forgetting and degraded model generalization. To tackle these issues, we propose FedBE, a novel FedFT framework that integrates an adaptive transformer block expansion mechanism with a dynamic trainable-block allocation strategy. Specifically, FedBE expands trainable blocks within the model architecture, structurally separating newly learned task-specific knowledge from the original pre-trained representations. Additionally, FedBE dynamically assigns these trainable blocks to clients based on their data distributions and computational capabilities. This enables the framework to better accommodate heterogeneous federated environments and enhances the generalization ability of the model. Extensive experiments show that, compared with existing federated fine-tuning methods, FedBE achieves 12-74% higher accuracy retention on general tasks after fine-tuning and a model convergence acceleration ratio of 1.9-3.1x without degrading the accuracy of downstream tasks.

Title: Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization

Authors: Gabriel Loiseau, Damien Sileo, Damien Riquet, Maxime Meyer, Marc Tommasi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05979
Pdf URL: https://arxiv.org/pdf/2506.05979
Copy Paste: [[2506.05979]] Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization(https://arxiv.org/abs/2506.05979)
Keywords: privacy, protect
Abstract: Text anonymization is the process of removing or obfuscating information from textual data to protect the privacy of individuals. This process inherently involves a complex trade-off between privacy protection and information preservation, where stringent anonymization methods can significantly impact the text's utility for downstream applications. Evaluating the effectiveness of text anonymization proves challenging from both privacy and utility perspectives, as there is no universal benchmark that can comprehensively assess anonymization techniques across diverse and sometimes contradictory contexts. We present Tau-Eval, an open-source framework for benchmarking text anonymization methods through the lens of privacy and utility task sensitivity. A Python library, code, documentation and tutorials are publicly available.

Title: AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification

Authors: Geonwoo Cho, Jaemoon Lee, Jaegyun Im, Subi Lee, Jihwan Lee, Sundong Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05980
Pdf URL: https://arxiv.org/pdf/2506.05980
Copy Paste: [[2506.05980]] AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification(https://arxiv.org/abs/2506.05980)
Keywords: robust
Abstract: Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both exploration and skill diversification. We begin by conducting extensive ablation studies to identify and define a set of objectives that effectively capture the aspects of exploration and skill diversity, respectively. During the skill pretraining phase, AMPED introduces a gradient surgery technique to balance the objectives of exploration and skill diversity, mitigating conflicts and reducing reliance on heuristic tuning. In the subsequent fine-tuning phase, AMPED incorporates a skill selector module that dynamically selects suitable skills for downstream tasks, based on task-specific performance signals. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: this https URL
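
The abstract does not say which gradient surgery technique AMPED uses. As an illustration of the general idea, the sketch below follows the well-known PCGrad-style projection: when the two objective gradients conflict (negative dot product), each has its component along the other removed before the two are summed. Treating the exploration and diversity gradients as plain Python lists is a simplification made here.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_conflict(g, other):
    """If g conflicts with `other` (negative dot product), remove the
    component of g that points along `other`; otherwise return g unchanged."""
    d = dot(g, other)
    if d >= 0.0:
        return list(g)
    scale = d / dot(other, other)
    return [x - scale * y for x, y in zip(g, other)]

def combine(g_explore, g_diverse):
    """Project each gradient against the other, then sum the results."""
    a = project_conflict(g_explore, g_diverse)
    b = project_conflict(g_diverse, g_explore)
    return [x + y for x, y in zip(a, b)]

conflicting = combine([1.0, 0.0], [-1.0, 1.0])   # gradients pull in opposing directions
compatible = combine([1.0, 0.0], [1.0, 1.0])     # no conflict: plain sum
```

After projection, neither gradient has a component that directly opposes the other, so a step along the combined direction does not undo progress on either objective to first order.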

Title: MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks

Authors: Zonglin Wu, Yule Xue, Xin Wei, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05982
Pdf URL: https://arxiv.org/pdf/2506.05982
Copy Paste: [[2506.05982]] MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks(https://arxiv.org/abs/2506.05982)
Keywords: security, defense, attack, robust, fair
Abstract: As automated attack techniques rapidly advance, CAPTCHAs remain a critical defense mechanism against malicious bots. However, existing CAPTCHA schemes encompass a diverse range of modalities -- from static distorted text and obfuscated images to interactive clicks, sliding puzzles, and logic-based questions -- yet the community still lacks a unified, large-scale, multimodal benchmark to rigorously evaluate their security robustness. To address this gap, we introduce MCA-Bench, a comprehensive and reproducible benchmarking suite that integrates heterogeneous CAPTCHA types into a single evaluation protocol. Leveraging a shared vision-language model backbone, we fine-tune specialized cracking agents for each CAPTCHA category, enabling consistent, cross-modal assessments. Extensive experiments reveal that MCA-Bench effectively maps the vulnerability spectrum of modern CAPTCHA designs under varied attack settings, and crucially offers the first quantitative analysis of how challenge complexity, interaction depth, and model solvability interrelate. Based on these findings, we propose three actionable design principles and identify key open challenges, laying the groundwork for systematic CAPTCHA hardening, fair benchmarking, and broader community collaboration. Datasets and code are available online.

Title: A Culturally-Rich Romanian NLP Dataset from "Who Wants to Be a Millionaire?" Videos

Authors: Alexandru-Gabriel Ganea, Antonia-Adelina Popovici, Adrian-Marius Dumitran
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05991
Pdf URL: https://arxiv.org/pdf/2506.05991
Copy Paste: [[2506.05991]] A Culturally-Rich Romanian NLP Dataset from "Who Wants to Be a Millionaire?" Videos(https://arxiv.org/abs/2506.05991)
Keywords: robust, extraction, large language model
Abstract: Large Language Models (LLMs) demonstrate varying performance across languages and cultural contexts. This study introduces a novel, culturally-rich, multilingual dataset derived from video recordings of the Romanian game show "Who Wants to Be a Millionaire?" (Vrei să fii Milionar?). We employed an innovative process combining optical character recognition (OCR), automated text extraction, and manual verification to collect question-answer pairs, enriching them with metadata including question domain (e.g., biology, history), cultural relevance (Romanian-specific vs. international), and difficulty. Benchmarking state-of-the-art LLMs, including Romanian-adapted models, on this dataset revealed significant performance disparities: models consistently achieve higher accuracy (80-95%) on international questions compared to Romanian-specific cultural questions (50-75%). We further investigate these differences through experiments involving machine translation of Romanian questions into English and cross-lingual tests using a comparable dataset in French. Our findings underscore the impact of cultural context and data source on LLM performance and offer practical insights for building robust, culturally-aware multilingual NLP systems, especially in educational domains. The dataset is publicly available at Hugging Face.

Title: LaDEEP: A Deep Learning-based Surrogate Model for Large Deformation of Elastic-Plastic Solids

Authors: Shilong Tao, Zhe Feng, Haonan Sun, Zhanxing Zhu, Yunhuai Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06001
Pdf URL: https://arxiv.org/pdf/2506.06001
Copy Paste: [[2506.06001]] LaDEEP: A Deep Learning-based Surrogate Model for Large Deformation of Elastic-Plastic Solids(https://arxiv.org/abs/2506.06001)
Keywords: transformer
Abstract: Scientific computing for large deformation of elastic-plastic solids is critical for numerous real-world applications. Classical numerical solvers rely primarily on local discrete linear approximation and are constrained by an inherent trade-off between accuracy and efficiency. Recently, deep learning models have achieved impressive progress in solving continuum mechanics problems. While previous models have explored various architectures and constructed coefficient-solution mappings, they are designed for general instances without considering specific problem properties, and they struggle to accurately handle complex elastic-plastic solids involving contact, loading and unloading. In this work, we take stretch bending, a popular metal fabrication technique, as our case study and introduce LaDEEP, a deep learning-based surrogate model for Large Deformation of Elastic-Plastic Solids. We encode the partitioned regions of the involved slender solids into a token sequence to maintain their essential order property. To characterize the physical process of the solid deformation, a two-stage Transformer-based module is designed to predict the deformation with the sequence of tokens as input. Empirically, LaDEEP is five orders of magnitude faster than finite element methods with comparable accuracy, and gains a 20.47% relative improvement on average compared to other deep learning baselines. We have also deployed our model into a real-world industrial production system, and it has shown remarkable performance in both accuracy and efficiency.

Title: What Really is a Member? Discrediting Membership Inference via Poisoning

Authors: Neal Mangaokar, Ashish Hooda, Zhuohang Li, Bradley A. Malin, Kassem Fawaz, Somesh Jha, Atul Prakash, Amrita Roy Chowdhury
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06003
Pdf URL: https://arxiv.org/pdf/2506.06003
Copy Paste: [[2506.06003]] What Really is a Member? Discrediting Membership Inference via Poisoning(https://arxiv.org/abs/2506.06003)
Keywords: attack, robust, membership infer
Abstract: Membership inference tests aim to determine whether a particular data point was included in a language model's training set. However, recent works have shown that such tests often fail under the strict definition of membership based on exact matching, and have suggested relaxing this definition to include semantic neighbors as members as well. In this work, we show that membership inference tests are still unreliable under this relaxation - it is possible to poison the training dataset in a way that causes the test to produce incorrect predictions for a target point. We theoretically reveal a trade-off between a test's accuracy and its robustness to poisoning. We also present a concrete instantiation of this poisoning attack and empirically validate its effectiveness. Our results show that it can degrade the performance of existing tests to well below random.

Title: Enhancing Orthopox Image Classification Using Hybrid Machine Learning and Deep Learning Models

Authors: Alejandro Puente-Castro, Enrique Fernandez-Blanco, Daniel Rivero, Andres Molares-Ulloa
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06007
Pdf URL: https://arxiv.org/pdf/2506.06007
Copy Paste: [[2506.06007]] Enhancing Orthopox Image Classification Using Hybrid Machine Learning and Deep Learning Models(https://arxiv.org/abs/2506.06007)
Keywords: robust, extraction
Abstract: Orthopoxvirus infections must be accurately classified from medical images to enable easy, early diagnosis and epidemic prevention. Traditional diagnostic techniques can be time-consuming and require expert interpretation, and the available datasets for the different types of Orthopox are few and biased, which highlights the need for automated and scalable solutions. To improve classification performance and lower computational costs, this paper puts forth a hybrid strategy that combines Machine Learning models with pretrained Deep Learning models to extract deep feature representations without the need for augmented data. The findings show that this feature extraction method, when paired with other state-of-the-art methods, produces excellent classification outcomes while preserving training and inference efficiency. The proposed approach demonstrates strong generalization and robustness across multiple evaluation settings, offering a scalable and interpretable solution for real-world clinical deployment.

Title: Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models

Authors: Peijie Liu, Fengli Xu, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06008
Pdf URL: https://arxiv.org/pdf/2506.06008
Copy Paste: [[2506.06008]] Token Signature: Predicting Chain-of-Thought Gains with Token Decoding Feature in Large Language Models(https://arxiv.org/abs/2506.06008)
Keywords: large language model
Abstract: The Chain-of-Thought (CoT) technique has proven effective in improving the performance of large language models (LLMs) on complex reasoning tasks. However, the performance gains are inconsistent across different tasks, and the underlying mechanism remains a long-standing research question. In this work, we make a preliminary observation that the monotonicity of token probability distributions may be correlated with the gains achieved through CoT reasoning. Leveraging this insight, we propose two indicators based on the token probability distribution to assess CoT effectiveness across different tasks. By combining instance-level indicators with a logistic regression model, we introduce Dynamic CoT, a method that dynamically selects between CoT and direct answering. Furthermore, we extend Dynamic CoT to closed-source models by transferring decision strategies learned from open-source models. Our indicators for assessing CoT effectiveness achieve an accuracy of 89.2%, and Dynamic CoT reduces token consumption by more than 35% while maintaining high accuracy. Overall, our work offers a novel perspective on the underlying mechanisms of CoT reasoning and provides a framework for its more efficient deployment.
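
The routing idea in this abstract can be illustrated with a minimal sketch: compute indicators from token probabilities, score them with a logistic model, and pick CoT or direct answering. The indicator definitions, the weights, the threshold, and all function names below are hypothetical; the abstract does not give the paper's actual definitions.

```python
import math

def monotonicity(token_probs):
    """Fraction of adjacent steps where the token probability does not increase.
    Hypothetical indicator; not the paper's definition."""
    if len(token_probs) < 2:
        return 1.0
    steps = list(zip(token_probs, token_probs[1:]))
    return sum(1 for a, b in steps if b <= a) / len(steps)

def cot_gain_probability(indicators, weights, bias):
    """Logistic regression score: estimated probability that CoT helps."""
    z = bias + sum(w * x for w, x in zip(weights, indicators))
    return 1.0 / (1.0 + math.exp(-z))

def choose_strategy(token_probs, weights=(-4.0, 2.0), bias=1.0, threshold=0.5):
    """Route a non-empty probability sequence to 'cot' or 'direct'.
    Weights, bias and threshold are made-up values for illustration."""
    indicators = (monotonicity(token_probs), 1.0 - min(token_probs))
    p = cot_gain_probability(indicators, weights, bias)
    return "cot" if p >= threshold else "direct"
```

In practice the weights would be fitted on labeled instances (for example, whether CoT improved the answer) rather than set by hand.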

Title: Unlocking Recursive Thinking of LLMs: Alignment via Refinement

Authors: Haoke Zhang, Xiaobo Liang, Cunxiang Wang, Juntao Li, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06009
Pdf URL: https://arxiv.org/pdf/2506.06009
Copy Paste: [[2506.06009]] Unlocking Recursive Thinking of LLMs: Alignment via Refinement(https://arxiv.org/abs/2506.06009)
Keywords: large language model
Abstract: The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at GitHub (this https URL).

Title: AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search

Authors: Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06017
Pdf URL: https://arxiv.org/pdf/2506.06017
Copy Paste: [[2506.06017]] AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search(https://arxiv.org/abs/2506.06017)
Keywords: large language model
Abstract: Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation costs, as each newly generated agent must be fully evaluated on benchmarks; and (3) inefficient search in large search spaces. In this work, we introduce a comprehensive framework to address these challenges. First, we propose a hierarchical search space that jointly models agentic workflow and composable functional components, enabling richer agentic system designs. Building on this structured design space, we introduce a predictive value model that estimates agent performance given an agentic system and a task description, allowing for efficient, low-cost evaluation during the search process. Finally, we present a hierarchical Monte Carlo Tree Search (MCTS) strategy informed by uncertainty to guide the search. Experiments on seven benchmarks, covering embodied, math, web, tool, and game tasks, show that our method achieves an average performance gain of 8.34% over state-of-the-art baselines and exhibits faster search progress with steeper improvement trajectories. Code repo is available at this https URL.

Title: When to Trust Context: Self-Reflective Debates for Context Reliability

Authors: Zeqi Zhou, Fang Wu, Shayan Talaei, Haokai Zhao, Cheng Meixin, Tinson Xu, Amin Saberi, Yejin Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06020
Pdf URL: https://arxiv.org/pdf/2506.06020
Copy Paste: [[2506.06020]] When to Trust Context: Self-Reflective Debates for Context Reliability(https://arxiv.org/abs/2506.06020)
Keywords: robust, large language model
Abstract: Large language models frequently encounter conflicts between their parametric knowledge and contextual input, often resulting in factual inconsistencies or hallucinations. We propose Self-Reflective Debate for Contextual Reliability (SR-DCR), a lightweight framework that integrates token-level self-confidence with an asymmetric multi-agent debate to adjudicate such conflicts. A critic, deprived of context, challenges a defender who argues from the given passage; a judge model evaluates the debate and determines the context's reliability. The final answer is selected by combining the verdict with model confidence. Experiments on the ClashEval benchmark demonstrate that SR-DCR consistently enhances robustness to misleading context while maintaining accuracy on trustworthy inputs, outperforming both classical debate and confidence-only baselines with minimal computational overhead. The code is available at this https URL.
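
The final selection step described in this abstract (combining the judge's verdict with model confidence) might look roughly like the sketch below. The decision rule, the threshold, and the `DebateOutcome` type are assumptions made for illustration; the abstract does not spell out how the two signals are combined.

```python
from dataclasses import dataclass

@dataclass
class DebateOutcome:
    context_reliable: bool   # judge verdict after the critic/defender debate
    context_answer: str      # answer supported by the given passage
    parametric_answer: str   # answer the model gives without the passage
    self_confidence: float   # confidence in the parametric answer, in [0, 1]

def select_answer(outcome, confidence_threshold=0.8):
    """Assumed rule: follow the context unless the judge rejects it
    AND the model is confident in its own parametric answer."""
    if outcome.context_reliable:
        return outcome.context_answer
    if outcome.self_confidence >= confidence_threshold:
        return outcome.parametric_answer
    # Judge distrusts the context but the model is unsure: fall back to context.
    return outcome.context_answer
```

The fallback in the last branch is one possible design choice; abstaining or flagging the answer as uncertain would be equally plausible.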

Title: Unisoma: A Unified Transformer-based Solver for Multi-Solid Systems

Authors: Shilong Tao, Zhe Feng, Haonan Sun, Zhanxing Zhu, Yunhuai Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06021
Pdf URL: https://arxiv.org/pdf/2506.06021
Copy Paste: [[2506.06021]] Unisoma: A Unified Transformer-based Solver for Multi-Solid Systems(https://arxiv.org/abs/2506.06021)
Keywords: transformer
Abstract: Multi-solid systems are foundational to a wide range of real-world applications, yet modeling their complex interactions remains challenging. Existing deep learning methods predominantly rely on implicit modeling, where the factors influencing solid deformation are not explicitly represented but are instead indirectly learned. However, as the number of solids increases, these methods struggle to accurately capture intricate physical interactions. In this paper, we introduce a novel explicit modeling paradigm that incorporates factors influencing solid deformation through structured modules. Specifically, we present Unisoma, a unified and flexible Transformer-based model capable of handling variable numbers of solids. Unisoma directly captures physical interactions using contact modules and an adaptive interaction allocation mechanism, and learns the deformation through a triplet relationship. Compared to implicit modeling techniques, explicit modeling is better suited for multi-solid systems with diverse coupling patterns, as it enables detailed treatment of each solid while preventing information blending and confusion. Experimentally, Unisoma achieves consistent state-of-the-art performance across seven well-established datasets and two complex multi-solid tasks. Code is available at this https URL.

Title: Restereo: Diffusion stereo video generation and restoration

Authors: Xingchang Huang, Ashish Kumar Singh, Florian Dubost, Cristina Nader Vasconcelos, Sakar Khattar, Liang Shi, Christian Theobalt, Cengiz Oztireli, Gurprit Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06023
Pdf URL: https://arxiv.org/pdf/2506.06023
Copy Paste: [[2506.06023]] Restereo: Diffusion stereo video generation and restoration(https://arxiv.org/abs/2506.06023)
Keywords: diffusion
Abstract: Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving disoccluded areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video dataset and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.

Title: O-MaMa @ EgoExo4D Correspondence Challenge: Learning Object Mask Matching between Egocentric and Exocentric Views

Authors: Lorenzo Mur-Labadia, Maria Santos-Villafranca, Alejandro Perez-Yus, Jesus Bermudez-Cameo, Ruben Martinez-Cantin, Jose J. Guerrero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06026
Pdf URL: https://arxiv.org/pdf/2506.06026
Copy Paste: [[2506.06026]] O-MaMa @ EgoExo4D Correspondence Challenge: Learning Object Mask Matching between Egocentric and Exocentric Views(https://arxiv.org/abs/2506.06026)
Keywords: segmentation
Abstract: The goal of the correspondence task is to segment specific objects across different views. This technical report re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego$\leftrightarrow$Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects.

Title: Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification

Authors: Yuhao Sun, Jiacheng Zhang, Zesheng Ye, Chaowei Xiao, Feng Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06027
Pdf URL: https://arxiv.org/pdf/2506.06027
Copy Paste: [[2506.06027]] Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification(https://arxiv.org/abs/2506.06027)
Keywords: robust, diffusion, generative
Abstract: Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In this process, how much Gaussian noise is injected into the input sample is key to the success of DBP methods; in existing methods it is controlled by a constant noise level $t^*$ for all samples. In this paper, we discover that the optimal $t^*$ can indeed differ from sample to sample. Intuitively, the cleaner a sample is, the less noise should be injected into it, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: this https URL.
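
To make the reweighting step concrete, here is a minimal sketch that maps a score norm to a per-sample noise level. The sigmoid form, the `scale` and `midpoint` parameters, and the function name are assumptions; the abstract only says that a reweighting function adjusts $t^*$ based on the magnitude of the score norm.

```python
import math

def sample_specific_noise_level(score_norm, t_star, scale=1.0, midpoint=1.0):
    """Map a score norm to a per-sample noise level in (0, t_star).

    Assumed sigmoid reweighting: a larger score norm (sample looks further
    from the clean data distribution) receives more noise; a smaller one
    receives less.
    """
    weight = 1.0 / (1.0 + math.exp(-scale * (score_norm - midpoint)))
    return t_star * weight

# Usage: three samples with increasing score norms share a base level of 100.
levels = [sample_specific_noise_level(s, t_star=100) for s in (0.2, 1.0, 3.0)]
```

A real pipeline would obtain `score_norm` from a pre-trained score network and round the result to a valid diffusion timestep.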

Title: Large Language Models are Demonstration Pre-Selectors for Themselves

Authors: Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06033
Pdf URL: https://arxiv.org/pdf/2506.06033
Copy Paste: [[2506.06033]] Large Language Models are Demonstration Pre-Selectors for Themselves(https://arxiv.org/abs/2506.06033)
Keywords: large language model
Abstract: In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeated retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a representative subset of demonstrations containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, we introduce the "sufficiency" and "necessity" metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this representative subset can effectively replace the full training data, improving efficiency while maintaining comparable performance in ICL. Additionally, our pre-selected subset also benefits fine-tuning LLMs, where we introduce a bi-level optimization method that enhances training efficiency without sacrificing performance. Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL.

Title: MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?

Authors: Zhitao He, Zongwei Lyu, Dazhong Chen, Dadi Guo, Yi R. Fung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06034
Pdf URL: https://arxiv.org/pdf/2506.06034
Copy Paste: [[2506.06034]] MATP-BENCH: Can MLLM Be a Good Automated Theorem Prover for Multimodal Problems?(https://arxiv.org/abs/2506.06034)
Keywords: large language model
Abstract: Numerous theorems, such as those in geometry, are often presented in multimodal forms (e.g., diagrams). Humans benefit from visual reasoning in such settings, using diagrams to gain intuition and guide the proof process. Modern Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in solving a wide range of mathematical problems. However, the potential of MLLMs as Automated Theorem Provers (ATPs), specifically in the multimodal domain, remains underexplored. In this paper, we introduce the Multimodal Automated Theorem Proving benchmark (MATP-BENCH), a new Multimodal, Multi-level, and Multi-language benchmark designed to evaluate MLLMs in this role as multimodal automated theorem provers. MATP-BENCH consists of 1056 multimodal theorems drawn from high school, university, and competition-level mathematics. All these multimodal problems are accompanied by formalizations in Lean 4, Coq and Isabelle, thus making the benchmark compatible with a wide range of theorem-proving frameworks. MATP-BENCH requires models to integrate sophisticated visual understanding with mastery of a broad spectrum of mathematical knowledge and rigorous symbolic reasoning to generate formal proofs. We use MATP-BENCH to evaluate a variety of advanced multimodal language models. Existing methods can only solve a limited number of the MATP-BENCH problems, indicating that this benchmark poses an open challenge for research on automated theorem proving.

Title: HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Authors: Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06035
Pdf URL: https://arxiv.org/pdf/2506.06035
Copy Paste: [[2506.06035]] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion(https://arxiv.org/abs/2506.06035)
Keywords: diffusion, generative
Abstract: Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Even though progress has been made in decoding images from fMRI using generative models, a challenge remains in accurately recovering highly complex visual stimuli. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR that contains two adapters: (1) The AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structures; (2) The CLIP Adapter converts the voxels to CLIP text and image embeddings, containing semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios, the CLIP Adapter is trained with text captions describing the visual stimuli and their corresponding semantic images synthesized from these captions. The experimental results demonstrate that HAVIR effectively reconstructs both structural features and semantic information of visual stimuli even in complex scenarios, outperforming existing models.

Title: Do-PFN: In-Context Learning for Causal Effect Estimation

Authors: Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, Bernhard Schölkopf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06039
Pdf URL: https://arxiv.org/pdf/2506.06039
Copy Paste: [[2506.06039]] Do-PFN: In-Context Learning for Causal Effect Estimation(https://arxiv.org/abs/2506.06039)
Keywords: robust
Abstract: Estimation of causal effects is critical to a range of scientific disciplines. Existing methods for this task either require interventional data, knowledge about the ground truth causal graph, or rely on assumptions such as unconfoundedness, restricting their applicability in real-world settings. In the domain of tabular machine learning, Prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the harder problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments on synthetic case studies, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. We also perform ablation studies that elucidate Do-PFN's scalability and robustness across datasets with a variety of causal characteristics.

Title: Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics

Authors: Tobias Würth, Niklas Freymuth, Gerhard Neumann, Luise Kärger
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.06045
Pdf URL: https://arxiv.org/pdf/2506.06045
Copy Paste: [[2506.06045]] Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics(https://arxiv.org/abs/2506.06045)
Keywords: diffusion
Abstract: Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion, a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.

Title: Hey, That's My Data! Label-Only Dataset Inference in Large Language Models

Authors: Chen Xiong, Zihao Wang, Rui Zhu, Tsung-Yi Ho, Pin-Yu Chen, Jingwei Xiong, Haixu Tang, Lucila Ohno-Machado
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06057
Pdf URL: https://arxiv.org/pdf/2506.06057
Copy Paste: [[2506.06057]] Hey, That's My Data! Label-Only Dataset Inference in Large Language Models(https://arxiv.org/abs/2506.06057)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) have revolutionized Natural Language Processing by excelling at interpreting, reasoning about, and generating human language. However, their reliance on large-scale, often proprietary datasets poses a critical challenge: unauthorized usage of such data can lead to copyright infringement and significant financial harm. Existing dataset-inference methods typically depend on log probabilities to detect suspicious training material, yet many leading LLMs have begun withholding or obfuscating these signals. This reality underscores the pressing need for label-only approaches capable of identifying dataset membership without relying on internal model logits. We address this gap by introducing CatShift, a label-only dataset-inference framework that capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data. If a suspicious dataset was previously seen by the model, fine-tuning on a portion of it triggers a pronounced post-tuning shift in the model's outputs; conversely, truly novel data elicits more modest changes. By comparing the model's output shifts for a suspicious dataset against those for a known non-member validation set, we statistically determine whether the suspicious set is likely to have been part of the model's original training corpus. Extensive experiments on both open-source and API-based LLMs validate CatShift's effectiveness in logit-inaccessible settings, offering a robust and practical solution for safeguarding proprietary data.
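
The final comparison described in this abstract (output shifts on the suspicious set versus a known non-member validation set) can be sketched as a one-sided permutation test. How a "shift" is measured, the choice of a permutation test, and the function name are assumptions for illustration; the abstract only states that the comparison is statistical.

```python
import random

def shift_permutation_pvalue(suspect_shifts, validation_shifts,
                             n_permutations=2000, seed=0):
    """One-sided permutation test on mean post-fine-tuning output shift.

    Small p-values suggest the suspicious set shifts more than the known
    non-member set. Each shift is assumed to be a per-example score
    (e.g. dissimilarity between outputs before and after fine-tuning).
    """
    rng = random.Random(seed)
    n = len(suspect_shifts)
    m = len(validation_shifts)
    observed = sum(suspect_shifts) / n - sum(validation_shifts) / m
    pooled = list(suspect_shifts) + list(validation_shifts)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = sum(pooled[:n]) / n - sum(pooled[n:]) / m
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

With five examples per group, the smallest achievable p-value is about 1/252, so larger sets are needed for stricter significance levels.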

Title: Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models

Authors: Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Lizhen Qu, Zenglin Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06060
Pdf URL: https://arxiv.org/pdf/2506.06060
Copy Paste: [[2506.06060]] Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models(https://arxiv.org/abs/2506.06060)
Keywords: privacy, defense, attack, robust, extraction, federate, large language model
Abstract: Federated fine-tuning of large language models (FedLLMs) presents a promising approach for achieving strong model performance while preserving data privacy in sensitive domains. However, the inherent memorization ability of LLMs makes them vulnerable to training data extraction attacks. To investigate this risk, we introduce simple yet effective extraction attack algorithms specifically designed for FedLLMs. In contrast to prior "verbatim" extraction attacks, which assume access to fragments from all training data, our approach operates under a more realistic threat model, where the attacker only has access to a single client's data and aims to extract previously unseen personally identifiable information (PII) from other clients. This requires leveraging contextual prefixes held by the attacker to generalize across clients. To evaluate the effectiveness of our approaches, we propose two rigorous metrics, coverage rate and efficiency, and extend a real-world legal dataset with PII annotations aligned with CPIS, GDPR, and CCPA standards, achieving 89.9% human-verified precision. Experimental results show that our method can extract up to 56.57% of victim-exclusive PII, with "Address," "Birthday," and "Name" being the most vulnerable categories. Our findings underscore the pressing need for robust defense strategies and contribute a new benchmark and evaluation framework for future research in privacy-preserving federated learning.
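
One plausible reading of the coverage-rate metric mentioned in this abstract is the fraction of victim-exclusive PII items that the attack recovers. The sketch below encodes that reading; the exact definition used in the paper is not given in the abstract, and the function and argument names are hypothetical.

```python
def coverage_rate(extracted, victim_pii, attacker_pii):
    """Share of victim-exclusive PII items that appear in the attack output.

    'Victim-exclusive' is taken to mean items held by the victim clients
    but absent from the attacker's own data (an assumed definition).
    """
    victim_exclusive = set(victim_pii) - set(attacker_pii)
    if not victim_exclusive:
        return 0.0
    recovered = set(extracted) & victim_exclusive
    return len(recovered) / len(victim_exclusive)

# Usage with toy identifiers standing in for PII strings.
rate = coverage_rate(
    extracted={"a", "b", "x", "d"},
    victim_pii={"a", "b", "c", "d"},
    attacker_pii={"d"},
)
```

Items the attacker already holds ("d" here) and spurious outputs ("x") do not count toward coverage under this reading.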

Title: Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning

Authors: Maor Ashkenazi, Ofir Brenner, Tal Furman Shohet, Eran Treister
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06069
Pdf URL: https://arxiv.org/pdf/2506.06069
Copy Paste: [[2506.06069]] Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning(https://arxiv.org/abs/2506.06069)
Keywords: security, large language model
Abstract: Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, CPP, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at this https URL, including the dataset gathering implementation, to foster further research in this area.
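
The token-level entropy score mentioned in this abstract can be sketched as follows. The input is assumed to be a list of next-token probability distributions obtained from a scoring LLM conditioned on an approximated task; the threshold, the direction of the decision rule (lower entropy flags LLM-generated code), and the function names are assumptions, not details from the paper.

```python
import math

def token_entropy(dist):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def mean_token_entropy(dists):
    """Average entropy over all code-token positions."""
    return sum(token_entropy(d) for d in dists) / len(dists)

def looks_llm_generated(dists_given_task, threshold=0.5):
    """Hypothetical rule: low mean entropy under task conditioning
    is treated as evidence of LLM-generated code."""
    return mean_token_entropy(dists_given_task) < threshold
```

Producing `dists_given_task` requires a model call that is outside this sketch: first approximate the task from the code snippet, then score the code tokens with that task in the prompt.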

Title: System-Aware Unlearning Algorithms: Use Lesser, Forget Faster

Authors: Linda Lu, Ayush Sekhari, Karthik Sridharan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06073
Pdf URL: https://arxiv.org/pdf/2506.06073
Copy Paste: [[2506.06073]] System-Aware Unlearning Algorithms: Use Lesser, Forget Faster(https://arxiv.org/abs/2506.06073)
Keywords: secure, attack
Abstract: Machine unlearning addresses the problem of updating a machine learning model/system trained on a dataset $S$ so that the influence of a set of deletion requests $U \subseteq S$ on the unlearned model is minimized. The gold standard definition of unlearning demands that the updated model, after deletion, be nearly identical to the model obtained by retraining. This definition is designed for a worst-case attacker (one who can recover not only the unlearned model but also the remaining data samples, i.e., $S \setminus U$). Such a stringent definition has made developing efficient unlearning algorithms challenging. However, such strong attackers are also unrealistic. In this work, we propose a new definition, system-aware unlearning, which aims to provide unlearning guarantees against an attacker that can at best only gain access to the data stored in the system for learning/unlearning requests and not all of $S\setminus U$. With this new definition, we use the simple intuition that if a system can store less to make its learning/unlearning updates, it can be more secure and update more efficiently against a system-aware attacker. Towards that end, we present an exact system-aware unlearning algorithm for linear classification using a selective sampling-based approach, and we generalize the method for classification with general function classes. We theoretically analyze the tradeoffs between deletion capacity, accuracy, memory, and computation time.

Title: Feedback Guidance of Diffusion Models

Authors: Koulischer Felix, Handke Florian, Deleu Johannes, Demeester Thomas, Ambrogioni Luca
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06085
Pdf URL: https://arxiv.org/pdf/2506.06085
Copy Paste: [[2506.06085]] Feedback Guidance of Diffusion Models(https://arxiv.org/abs/2506.06085)
Keywords: diffusion
Abstract: While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG's implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive with Limited Interval Guidance (LIG) while benefiting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
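
For orientation, the contrast between constant and state-dependent guidance can be written out as follows. The first line is one common parameterization of classifier-free guidance. In the second, the coefficient $\lambda(x_t, t)$ is only a placeholder for a state-dependent scale and is not the form derived in the paper.

```latex
% Classifier-free guidance with a constant scale w (one common convention)
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t) + w \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)\bigr)

% Same combination with a state-dependent scale (placeholder, assumed notation)
\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t) + \lambda(x_t, t) \,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)\bigr)
```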

Title: Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning

Authors: Atharv Kulkarni, Vivek Srikumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06093
Pdf URL: https://arxiv.org/pdf/2506.06093
Copy Paste: [[2506.06093]] Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning(https://arxiv.org/abs/2506.06093)
Keywords: large language model
Abstract: In this work, we study the problem of code generation with a large language model (LLM), with a focus on generating SQL queries from natural language questions. We ask: Instead of using supervised fine-tuning with text-code pairs, can we tune a model by having it interact with a database engine? We frame this problem as a reinforcement learning problem where the model receives execution-based feedback from the environment in the form of scalar rewards. These rewards penalize execution failures and assign positive values when a query returns a correct answer. We use the rewards within the Group Relative Policy Optimization (GRPO) framework. We use a tabular reasoning benchmark to test and evaluate our findings. We find that with only weak supervision in the form of question-answer pairs, RL-tuning improves the accuracy of model-generated SQL code from 31.49 to 49.83 while reducing error percentage from 25.43% to 14.71%. This improvement allowed the model to nearly match the performance of the larger SQLCoder-70B model. Our work demonstrates the potential of using execution-based feedback to improve symbolic reasoning capabilities of LLMs.
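
A minimal sketch of execution-based scalar rewards follows, using Python's built-in `sqlite3` module as the database engine. The reward values (1.0, 0.0, -1.0), the toy `city` table, and the function names are assumptions made for illustration, not the paper's settings. The second function shows the group-relative normalization commonly associated with GRPO (reward minus group mean, divided by group standard deviation).

```python
import sqlite3
from statistics import mean, pstdev

def execution_reward(conn, sql, expected_rows, fail_penalty=-1.0):
    """Penalize queries that fail to execute; reward queries returning the expected rows."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return fail_penalty  # execution failure
    return 1.0 if sorted(rows) == sorted(expected_rows) else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and spread of its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy in-memory database (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [("Oslo", 700000), ("Bergen", 290000)])

candidates = [
    "SELECT name FROM city WHERE population > 500000",  # runs, correct answer
    "SELECT name FROM city",                            # runs, wrong answer
    "SELEC name FROM city",                             # fails to execute
]
rewards = [execution_reward(conn, q, [("Oslo",)]) for q in candidates]
advantages = group_relative_advantages(rewards)
```

Only the final answer is compared here, which mirrors the weak supervision from question-answer pairs described in the abstract. No reference SQL is needed.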

Title: Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU

Authors: Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Weifeng Liu, Qingxiao Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06095
Pdf URL: https://arxiv.org/pdf/2506.06095
Copy Paste: [[2506.06095]] Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU(https://arxiv.org/abs/2506.06095)
Keywords: transformer, large language model
Abstract: Large language models are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. Moreover, rule-based mechanisms ignore the fusion opportunities of mixed-type operators and fail to adapt to various sequence lengths. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer via flexible masking and operator fusion on GPU. We firstly unify the storage format and kernel implementation for the multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter setting through a two-stage search engine. The experimental results show that compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.

Title: VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

Authors: Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, Yali Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06097
Pdf URL: https://arxiv.org/pdf/2506.06097
Copy Paste: [[2506.06097]] VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning(https://arxiv.org/abs/2506.06097)
Keywords: large language model
Abstract: Recent advances in video understanding have been driven by multimodal large language models (MLLMs). However, these MLLMs are good at analyzing short videos but struggle to understand videos with a longer context. To address this difficulty, several agent paradigms have recently been proposed, using MLLMs as agents for retrieving extra contextual knowledge in a long video. However, most existing agents ignore the key fact that a long video is composed of multiple shots, i.e., to answer a user question about a long video, it is critical to deeply understand its relevant shots, as a human would. Without such insight, these agents often mistakenly retrieve redundant or even noisy temporal context, restricting their capacity for long video understanding. To fill this gap, we propose VideoChat-A1, a novel long video agent paradigm. Unlike previous works, VideoChat-A1 can think deeply with long videos via a distinct chain-of-shot reasoning paradigm. More specifically, it progressively selects the shots relevant to the user question and examines these shots in a coarse-to-fine partition. By multi-modal reasoning along the shot chain, VideoChat-A1 can effectively mimic the step-by-step human thinking process, allowing it to interactively discover preferable temporal context for thoughtful understanding of long videos. Extensive experiments show that VideoChat-A1 achieves state-of-the-art performance on mainstream long video QA benchmarks, e.g., it achieves 77.0 on VideoMME and 70.1 on EgoSchema, outperforming its strong baselines (e.g., Intern2.5VL-8B and InternVideo2.5-8B) by up to 10.8% and 6.2%. Compared to the leading closed-source GPT-4o and Gemini 1.5 Pro, VideoChat-A1 offers competitive accuracy, but with 7% of the input frames and 12% of the inference time on average.

Title: Text-to-LoRA: Instant Transformer Adaption

Authors: Rujikorn Charakorn, Edoardo Cetin, Yujin Tang, Robert Tjarko Lange
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06105
Pdf URL: https://arxiv.org/pdf/2506.06105
Copy Paste: [[2506.06105]] Text-to-LoRA: Instant Transformer Adaption(https://arxiv.org/abs/2506.06105)
Keywords: transformer, large language model
Abstract: While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyper-parameter choices. To overcome these limitations, we introduce Text-to-LoRA (T2L), a model capable of adapting Large Language Models on the fly solely based on a natural language description of the target task. T2L is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements. Our code is available at this https URL

Title: Synthetic Tabular Data: Methods, Attacks and Defenses

Authors: Graham Cormode, Samuel Maddock, Enayat Ullah, Shripad Gade
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06108
Pdf URL: https://arxiv.org/pdf/2506.06108
Copy Paste: [[2506.06108]] Synthetic Tabular Data: Methods, Attacks and Defenses(https://arxiv.org/abs/2506.06108)
Keywords: privacy, defense, attack
Abstract: Synthetic data is often positioned as a solution to replace sensitive fixed-size datasets with a source of unlimited matching data, freed from privacy concerns. There has been much progress in synthetic data generation over the last decade, leveraging corresponding advances in machine learning and data analytics. In this survey, we cover the key developments and the main concepts in tabular synthetic data generation, including paradigms based on probabilistic graphical models and on deep learning. We provide background and motivation, before giving a technical deep-dive into the methodologies. We also address the limitations of synthetic data, by studying attacks that seek to retrieve information about the original sensitive data. Finally, we present extensions and open problems in this area.

Title: Towards Lifecycle Unlearning Commitment Management: Measuring Sample-level Unlearning Completeness

Authors: Cheng-Long Wang, Qi Li, Zihang Xiang, Yinzhi Cao, Di Wang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06112
Pdf URL: https://arxiv.org/pdf/2506.06112
Copy Paste: [[2506.06112]] Towards Lifecycle Unlearning Commitment Management: Measuring Sample-level Unlearning Completeness(https://arxiv.org/abs/2506.06112)
Keywords: security, privacy, attack, membership infer
Abstract: Growing concerns over data privacy and security highlight the importance of machine unlearning--removing specific data influences from trained models without full retraining. Techniques like Membership Inference Attacks (MIAs) are widely used to externally assess successful unlearning. However, existing methods face two key limitations: (1) maximizing MIA effectiveness (e.g., via online attacks) requires prohibitive computational resources, often exceeding retraining costs; (2) MIAs, designed for binary inclusion tests, struggle to capture granular changes in approximate unlearning. To address these challenges, we propose the Interpolated Approximate Measurement (IAM), a framework natively designed for unlearning inference. IAM quantifies sample-level unlearning completeness by interpolating the model's generalization-fitting behavior gap on queried samples. IAM achieves strong performance in binary inclusion tests for exact unlearning and high correlation for approximate unlearning--scalable to LLMs using just one pre-trained shadow model. We theoretically analyze how IAM's scoring mechanism maintains performance efficiently. We then apply IAM to recent approximate unlearning algorithms, revealing general risks of both over-unlearning and under-unlearning, underscoring the need for stronger safeguards in approximate unlearning systems. The code is available at this https URL.

Title: Bridging the Gap: In-Context Learning for Modeling Human Disagreement

Authors: Benedetta Muscato, Yue Li, Gizem Gezici, Zhixue Zhao, Fosca Giannotti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06113
Pdf URL: https://arxiv.org/pdf/2506.06113
Copy Paste: [[2506.06113]] Bridging the Gap: In-Context Learning for Modeling Human Disagreement(https://arxiv.org/abs/2506.06113)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown strong performance on NLP classification tasks. However, they typically rely on aggregated labels-often via majority voting-which can obscure the human disagreement inherent in subjective annotations. This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection. We use in-context learning (ICL) in zero-shot and few-shot settings, evaluating four open-source LLMs across three label modeling strategies: aggregated hard labels, and disaggregated hard and soft labels. In few-shot prompting, we assess demonstration selection methods based on textual similarity (BM25, PLM-based), annotation disagreement (entropy), a combined ranking, and example ordering strategies (random vs. curriculum-based). Results show that multi-perspective generation is viable in zero-shot settings, while few-shot setups often fail to capture the full spectrum of human judgments. Prompt design and demonstration selection notably affect performance, though example ordering has limited impact. These findings highlight the challenges of modeling subjectivity with LLMs and the importance of building more perspective-aware, socially intelligent models.
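
The three label views named in the abstract (aggregated hard label, disaggregated soft label, and entropy as a disagreement measure) can be made concrete with a small sketch. The class names, the example votes, and the function names are hypothetical. The code only illustrates the quantities and not the paper's prompting setup.

```python
import math
from collections import Counter

def soft_label(annotations, classes):
    """Fraction of annotators choosing each class (disaggregated soft label)."""
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    return {c: counts[c] / len(annotations) for c in classes}

def hard_label(annotations):
    """Majority-vote label (aggregated hard label)."""
    return Counter(annotations).most_common(1)[0][0]

def disagreement_entropy(annotations, classes):
    """Shannon entropy (in bits) of the soft label; 0 means full agreement."""
    dist = soft_label(annotations, classes)
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

classes = ["offensive", "not_offensive"]
votes = ["offensive", "offensive", "not_offensive", "offensive"]

majority = hard_label(votes)
distribution = soft_label(votes, classes)
disagreement = disagreement_entropy(votes, classes)
```

With these votes the majority label hides the one dissenting annotator, while the soft label and its entropy keep that information.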

Title: SATversary: Adversarial Attacks on Satellite Fingerprinting

Authors: Joshua Smailes, Sebastian Köhler, Simon Birnbach, Martin Strohmeier, Ivan Martinovic
Subjects: cs.CR, eess.SP
Abstract URL: https://arxiv.org/abs/2506.06119
Pdf URL: https://arxiv.org/pdf/2506.06119
Copy Paste: [[2506.06119]] SATversary: Adversarial Attacks on Satellite Fingerprinting(https://arxiv.org/abs/2506.06119)
Keywords: protect, attack
Abstract: As satellite systems become increasingly vulnerable to physical layer attacks via SDRs, novel countermeasures are being developed to protect critical systems, particularly those lacking cryptographic protection, or those which cannot be upgraded to support modern cryptography. Among these is transmitter fingerprinting, which provides mechanisms by which communication can be authenticated by looking at characteristics of the transmitter, expressed as impairments on the signal. Previous works show that fingerprinting can be used to classify satellite transmitters, or authenticate them against SDR-equipped attackers under simple replay scenarios. In this paper we build upon this by looking at attacks directly targeting the fingerprinting system, with an attacker optimizing for maximum impact in jamming, spoofing, and dataset poisoning attacks, and demonstrate these attacks on the SatIQ system designed to authenticate Iridium transmitters. We show that an optimized jamming signal can cause a 50% error rate with attacker-to-victim ratios as low as -30dB (far less power than traditional jamming) and demonstrate successful identity forgery during spoofing attacks, with an attacker successfully removing their own transmitter's fingerprint from messages. We also present a data poisoning attack, enabling persistent message spoofing by altering the data used to authenticate incoming messages to include the fingerprint of the attacker's transmitter. Finally, we show that our model trained to optimize spoofing attacks can also be used to detect spoofing and replay attacks, even when it has never seen the attacker's transmitter before. Furthermore, this technique works even when the training dataset includes only a single transmitter, enabling fingerprinting to be used to protect small constellations and even individual satellites, providing additional protection where it is needed the most.

Title: PrivTru: A Privacy-by-Design Data Trustee Minimizing Information Leakage

Authors: Lukas Gehring, Florian Tschorsch
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.06124
Pdf URL: https://arxiv.org/pdf/2506.06124
Copy Paste: [[2506.06124]] PrivTru: A Privacy-by-Design Data Trustee Minimizing Information Leakage(https://arxiv.org/abs/2506.06124)
Keywords: secure, privacy
Abstract: Data trustees serve as intermediaries that facilitate secure data sharing between independent parties. This paper offers a technical perspective on Data trustees, guided by privacy-by-design principles. We introduce PrivTru, an instantiation of a data trustee that provably achieves optimal privacy properties. Therefore, PrivTru calculates the minimal amount of information the data trustee needs to request from data sources to respond to a given query. Our analysis shows that PrivTru minimizes information leakage to the data trustee, regardless of the trustee's prior knowledge, while preserving the utility of the data.

Title: CCLSTM: Coupled Convolutional Long-Short Term Memory Network for Occupancy Flow Forecasting

Authors: Peter Lengyel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06128
Pdf URL: https://arxiv.org/pdf/2506.06128
Copy Paste: [[2506.06128]] CCLSTM: Coupled Convolutional Long-Short Term Memory Network for Occupancy Flow Forecasting(https://arxiv.org/abs/2506.06128)
Keywords: transformer
Abstract: Predicting future states of dynamic agents is a fundamental task in autonomous driving. An expressive representation for this purpose is Occupancy Flow Fields, which provide a scalable and unified format for modeling motion, spatial extent, and multi-modal future distributions. While recent methods have achieved strong results using this representation, they often depend on high-quality vectorized inputs, which are unavailable or difficult to generate in practice, and the use of transformer-based architectures, which are computationally intensive and costly to deploy. To address these issues, we propose Coupled Convolutional LSTM (CCLSTM), a lightweight, end-to-end trainable architecture based solely on convolutional operations. Without relying on vectorized inputs or self-attention mechanisms, CCLSTM effectively captures temporal dynamics and spatial occupancy-flow correlations using a compact recurrent convolutional structure. Despite its simplicity, CCLSTM achieves state-of-the-art performance on occupancy flow metrics and, as of this submission, ranks 1st in all metrics on the 2024 Waymo Occupancy and Flow Prediction Challenge leaderboard.

Title: Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition

Authors: Tara Azin, Daniel Dumitrescu, Diana Inkpen, Raj Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06133
Pdf URL: https://arxiv.org/pdf/2506.06133
Copy Paste: [[2506.06133]] Let's CONFER: A Dataset for Evaluating Natural Language Inference Models on CONditional InFERence and Presupposition(https://arxiv.org/abs/2506.06133)
Keywords: large language model
Abstract: Natural Language Inference (NLI) is the task of determining whether a sentence pair represents entailment, contradiction, or a neutral relationship. While NLI models perform well on many inference tasks, their ability to handle fine-grained pragmatic inferences, particularly presupposition in conditionals, remains underexplored. In this study, we introduce CONFER, a novel dataset designed to evaluate how NLI models process inference in conditional sentences. We assess the performance of four NLI models, including two pre-trained models, to examine their generalization to conditional reasoning. Additionally, we evaluate Large Language Models (LLMs), including GPT-4o, LLaMA, Gemma, and DeepSeek-R1, in zero-shot and few-shot prompting settings to analyze their ability to infer presuppositions with and without prior context. Our findings indicate that NLI models struggle with presuppositional reasoning in conditionals, and fine-tuning on existing NLI datasets does not necessarily improve their performance.

Title: Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems

Authors: Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, Qing Wang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06151
Pdf URL: https://arxiv.org/pdf/2506.06151
Copy Paste: [[2506.06151]] Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2506.06151)
Keywords: attack, large language model
Abstract: Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by retrieving relevant documents from external corpora before generating responses. This approach significantly expands LLM capabilities by leveraging vast, up-to-date external knowledge. However, this reliance on external knowledge makes RAG systems vulnerable to corpus poisoning attacks that manipulate generated outputs via poisoned document injection. Existing poisoning attack strategies typically treat the retrieval and generation stages as disjointed, limiting their effectiveness. We propose Joint-GCG, the first framework to unify gradient-based attacks across both retriever and generator models through three innovations: (1) Cross-Vocabulary Projection for aligning embedding spaces, (2) Gradient Tokenization Alignment for synchronizing token-level gradient signals, and (3) Adaptive Weighted Fusion for dynamically balancing attacking objectives. Evaluations demonstrate that Joint-GCG achieves at most 25% and an average of 5% higher attack success rate than previous methods across multiple retrievers and generators. While optimized under a white-box assumption, the generated poisons show unprecedented transferability to unseen models. Joint-GCG's innovative unification of gradient-based attacks across retrieval and generation stages fundamentally reshapes our understanding of vulnerabilities within RAG systems. Our code is available at this https URL.

Title: A Novel Large-scale Crop Dataset and Dual-stream Transformer Method for Fine-grained Hierarchical Crop Classification from Integrated Hyperspectral EnMAP Data and Multispectral Sentinel-2 Time Series

Authors: Wenyuan Li, Shunlin Liang, Yuxiang Zhang, Liqin Liu, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Zhenwei Shi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06155
Pdf URL: https://arxiv.org/pdf/2506.06155
Copy Paste: [[2506.06155]] A Novel Large-scale Crop Dataset and Dual-stream Transformer Method for Fine-grained Hierarchical Crop Classification from Integrated Hyperspectral EnMAP Data and Multispectral Sentinel-2 Time Series(https://arxiv.org/abs/2506.06155)
Keywords: security, transformer
Abstract: Fine-grained crop classification is crucial for precision agriculture and food security monitoring. It requires simultaneous capture of both phenological dynamics (obtained from multi-temporal satellite data like Sentinel-2) and subtle spectral variations (demanding nanometer-scale spectral resolution from hyperspectral imagery). Research combining these two modalities remains scarce due to the challenges of hyperspectral data acquisition and the cost of crop type annotation. To address these issues, we construct a hierarchical hyperspectral crop dataset (H2Crop) by integrating 30m-resolution EnMAP hyperspectral data with Sentinel-2 time series. With over one million annotated field parcels organized in a four-tier crop taxonomy, H2Crop establishes a vital benchmark for fine-grained agricultural crop classification and hyperspectral image processing. We propose a dual-stream Transformer architecture that synergistically processes these modalities. It coordinates two specialized pathways: a spectral-spatial Transformer extracts fine-grained signatures from hyperspectral EnMAP data, while a temporal Swin Transformer extracts crop growth patterns from Sentinel-2 time series. The designed hierarchical classification heads with hierarchical fusion then simultaneously deliver multi-level classification across all taxonomic tiers. Experiments demonstrate that adding hyperspectral EnMAP data to Sentinel-2 time series yields a 4.2% average F1-score improvement (peaking at 6.3%). Extensive comparisons also confirm our method's higher accuracy over existing deep learning approaches for crop type classification and the consistent benefits of hyperspectral data across varying temporal windows and crop change scenarios. Codes and dataset will be available at this https URL and this http URL. Keywords: Crop type classification, precision agriculture, remote sensing, deep learning, hyperspectral data, Sentinel-2 time series, fine-grained crops

Title: ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Authors: Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06158
Pdf URL: https://arxiv.org/pdf/2506.06158
Copy Paste: [[2506.06158]] ENMA: Tokenwise Autoregression for Generative Neural PDE Operators(https://arxiv.org/abs/2506.06158)
Keywords: robust, transformer, generative
Abstract: Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete-as is often the case-a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

Title: Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph

Authors: Yufeng Wang, Yuhong Feng, Yixuan Cao, Haoran Li, Haiyue Feng, Yifeng Wang
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.06161
Pdf URL: https://arxiv.org/pdf/2506.06161
Copy Paste: [[2506.06161]] Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph(https://arxiv.org/abs/2506.06161)
Keywords: robust
Abstract: Binary code similarity analysis (BCSA) serves as a core technique for binary analysis tasks such as vulnerability detection. While current graph-based BCSA approaches capture substantial semantics and show strong performance, their performance suffers under code obfuscation due to the unstable control flow. To address this issue, we develop ORCAS, an Obfuscation-Resilient BCSA model based on Dominance Enhanced Semantic Graph (DESG). The DESG is an original binary code representation that captures more of a binary's implicit semantics without relying on control flow structure, including inter-instruction relations, inter-basic block relations, and instruction-basic block relations. ORCAS robustly scores semantic similarity across binary functions from different obfuscation options, optimization levels, and instruction set architectures. Extensive evaluation on the BinKit dataset shows ORCAS significantly outperforms eight baselines, achieving an average 12.1% PR-AUC gain over the state-of-the-art approaches when all three obfuscation options are combined. Furthermore, ORCAS improves recall by up to 43% on an original obfuscated real-world vulnerability dataset, which we released to facilitate future research.

Title: The Lock-in Hypothesis: Stagnation by Algorithm

Authors: Tianyi Alex Qiu, Zhonghao He, Tejasveer Chugh, Max Kleiman-Weiner
Subjects: cs.LG, cs.AI, cs.CL, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2506.06166
Pdf URL: https://arxiv.org/pdf/2506.06166
Copy Paste: [[2506.06166]] The Lock-in Hypothesis: Stagnation by Algorithm(https://arxiv.org/abs/2506.06166)
Keywords: large language model
Abstract: The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber. We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity and potentially the lock-in of false beliefs. We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop. Code and data available at this https URL

Title: Technical Report for Egocentric Mistake Detection for the HoloAssist Challenge

Authors: Constantin Patsch, Marsil Zakour, Yuankai Wu, Eckehard Steinbach
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06174
Pdf URL: https://arxiv.org/pdf/2506.06174
Copy Paste: [[2506.06174]] Technical Report for Egocentric Mistake Detection for the HoloAssist Challenge(https://arxiv.org/abs/2506.06174)
Keywords: large language model
Abstract: In this report, we address the task of online mistake detection, which is vital in domains like industrial automation and education, where real-time video analysis allows human operators to correct errors as they occur. While previous work focuses on procedural errors involving action order, broader error types must be addressed for real-world use. We introduce an online mistake detection framework that handles both procedural and execution errors (e.g., motor slips or tool misuse). Upon detecting an error, we use a large language model (LLM) to generate explanatory feedback. Experiments on the HoloAssist benchmark confirm the effectiveness of our approach, which placed second on the mistake detection task.

Title: Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach

Authors: James Ford, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06175
Pdf URL: https://arxiv.org/pdf/2506.06175
Copy Paste: [[2506.06175]] Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach(https://arxiv.org/abs/2506.06175)
Keywords: large language model
Abstract: Large language models can translate natural-language chart descriptions into runnable code, yet approximately 15% of the generated scripts still fail to execute, even after supervised fine-tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single-prompt design. To explore this, we propose a lightweight multi-agent pipeline that separates drafting, execution, repair, and judgment, using only an off-the-shelf GPT-4o-mini model. On the Text2Chart31 benchmark, our system reduces execution errors to 4.5% within three repair iterations, outperforming the strongest fine-tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the ChartX benchmark, with an error rate of 4.6%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM-based accessibility audit shows that only 33.3% (Text2Chart31) and 7.2% (ChartX) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.
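
Illustrative sketch (not from the paper): the draft, execute, repair, and judge separation described in the abstract above can be pictured as a small control loop. Everything named here is an assumption made for illustration: `generate_chart_code`, `run_script`, and the three stub "agents" are hypothetical stand-ins for LLM calls, and the draft is a deliberately broken script rather than real chart code. The cap of three repairs mirrors the "within three repair iterations" wording only.

```python
import traceback


def run_script(source):
    """Executor role: run the script and return (ok, error_text)."""
    try:
        exec(compile(source, "<chart_script>", "exec"), {"__name__": "__chart__"})
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def generate_chart_code(description, draft, repair, judge, max_repairs=3):
    """Draft once, then alternate execute/repair until the script runs or the budget ends."""
    source = draft(description)
    for attempt in range(max_repairs + 1):
        ok, error = run_script(source)
        if ok:
            # The judge only sees scripts that executed successfully.
            return {"source": source, "repairs": attempt,
                    "accepted": judge(description, source)}
        if attempt < max_repairs:
            source = repair(source, error)
    return {"source": source, "repairs": max_repairs, "accepted": False}


# Hypothetical stubs standing in for model calls.
def stub_draft(description):
    return "values = [3, 1, 2]\ntotal = sum(valuez)\n"  # contains a NameError


def stub_repair(source, error):
    return source.replace("valuez", "values")


def stub_judge(description, source):
    return "sum(" in source


result = generate_chart_code("bar chart of totals", stub_draft, stub_repair, stub_judge)
print(result["repairs"], result["accepted"])
```

In a real system the stubs would be replaced by prompts to a model, and executing untrusted generated code would need sandboxing, which this sketch omits.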

Title: SatelliteFormula: Multi-Modal Symbolic Regression from Remote Sensing Imagery for Physics Discovery

Authors: Zhenyu Yu, Mohd. Yamani Idna Idris, Pei Wang, Yuelong Xia, Fei Ma, Rizwan Qureshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06176
Pdf URL: https://arxiv.org/pdf/2506.06176
Copy Paste: [[2506.06176]] SatelliteFormula: Multi-Modal Symbolic Regression from Remote Sensing Imagery for Physics Discovery(https://arxiv.org/abs/2506.06176)
Keywords: extraction, interpretability, transformer
Abstract: We propose SatelliteFormula, a novel symbolic regression framework that derives physically interpretable expressions directly from multi-spectral remote sensing imagery. Unlike traditional empirical indices or black-box learning models, SatelliteFormula combines a Vision Transformer-based encoder for spatial-spectral feature extraction with physics-guided constraints to ensure consistency and interpretability. Existing symbolic regression methods struggle with the high-dimensional complexity of multi-spectral data; our method addresses this by integrating transformer representations into a symbolic optimizer that balances accuracy and physical plausibility. Extensive experiments on benchmark datasets and remote sensing tasks demonstrate superior performance, stability, and generalization compared to state-of-the-art baselines. SatelliteFormula enables interpretable modeling of complex environmental variables, bridging the gap between data-driven learning and physical understanding.

Title: Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models

Authors: Ju Yong Sim, Seong Hwan Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06180
Pdf URL: https://arxiv.org/pdf/2506.06180
Copy Paste: [[2506.06180]] Detecting Voice Phishing with Precision: Fine-Tuning Small Language Models(https://arxiv.org/abs/2506.06180)
Keywords: robust
Abstract: We develop a voice phishing (VP) detector by fine-tuning Llama3, a representative open-source, small language model (LM). In the prompt, we provide carefully-designed VP evaluation criteria and apply the Chain-of-Thought (CoT) technique. To evaluate the robustness of LMs and highlight differences in their performance, we construct an adversarial test dataset that places the models under challenging conditions. Moreover, to address the lack of VP transcripts, we create transcripts by referencing existing or new types of VP techniques. We compare cases where evaluation criteria are included, the CoT technique is applied, or both are used together. In the experiment, our results show that the Llama3-8B model, fine-tuned with a dataset that includes a prompt with VP evaluation criteria, yields the best performance among small LMs and is comparable to that of a GPT-4-based VP detector. These findings indicate that incorporating human expert knowledge into the prompt is more effective than using the CoT technique for small LMs in VP detection.

Title: Antithetic Noise in Diffusion Models

Authors: Jing Jia, Sifan Liu, Bowen Song, Wei Yuan, Liyue Shen, Guanyang Wang
Subjects: cs.LG, math.NA, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2506.06185
Pdf URL: https://arxiv.org/pdf/2506.06185
Copy Paste: [[2506.06185]] Antithetic Noise in Diffusion Models(https://arxiv.org/abs/2506.06185)
Keywords: diffusion
Abstract: We initiate a systematic study of antithetic initial noise in diffusion models. Across unconditional models trained on diverse datasets, text-conditioned latent-diffusion models, and diffusion-posterior samplers, we find that pairing each initial noise with its negation consistently yields strongly negatively correlated samples. To explain this phenomenon, we combine experiments and theoretical analysis, leading to a symmetry conjecture that the learned score function is approximately affine antisymmetric (odd symmetry up to a constant shift), and provide evidence supporting it. Leveraging this negative correlation, we enable two applications: (1) enhancing image diversity in models like Stable Diffusion without quality loss, and (2) sharpening uncertainty quantification (e.g., up to 90% narrower confidence intervals) when estimating downstream statistics. Building on these gains, we extend the two-point pairing to a randomized quasi-Monte Carlo estimator, which further improves estimation accuracy. Our framework is training-free, model-agnostic, and adds no runtime overhead.
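
Illustrative sketch (not from the paper): the variance-reduction idea behind antithetic pairing can be shown on a toy scalar map. The function `toy_sampler` is a hypothetical stand-in for "initial noise in, statistic out"; it is odd up to a constant shift plus a small even term, loosely echoing the affine-antisymmetry conjecture in the abstract above. No diffusion model is involved, and the numbers it prints say nothing about the paper's results.

```python
import random
import statistics


def toy_sampler(z):
    """Hypothetical noise-to-statistic map: constant shift + odd part + small even part."""
    return 2.0 + z + 0.1 * z * z


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def compare(n_pairs=5000, seed=0):
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
    z_other = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
    pos = [toy_sampler(v) for v in z]
    neg = [toy_sampler(-v) for v in z]        # antithetic partner: negated noise
    ind = [toy_sampler(v) for v in z_other]   # independent partner
    anti_means = [(a + b) / 2 for a, b in zip(pos, neg)]
    indep_means = [(a + b) / 2 for a, b in zip(pos, ind)]
    return (pearson(pos, neg),
            statistics.variance(anti_means),
            statistics.variance(indep_means))


corr, var_anti, var_indep = compare()
print(f"correlation={corr:.3f} var_antithetic={var_anti:.4f} var_independent={var_indep:.4f}")
```

Because the paired outputs are negatively correlated, the average over an antithetic pair varies less than the average over two independent draws, which is what narrows confidence intervals for a fixed sample budget.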

Title: Transformative or Conservative? Conservation laws for ResNets and Transformers

Authors: Sibylle Marcotte, Rémi Gribonval, Gabriel Peyré
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06194
Pdf URL: https://arxiv.org/pdf/2506.06194
Copy Paste: [[2506.06194]] Transformative or Conservative? Conservation laws for ResNets and Transformers(https://arxiv.org/abs/2506.06194)
Keywords: transformer
Abstract: While conservation laws in gradient flow training dynamics are well understood for (mostly shallow) ReLU and linear networks, their study remains largely unexplored for more practical architectures. This paper bridges this gap by deriving and analyzing conservation laws for modern architectures, with a focus on convolutional ResNets and Transformer networks. For this, we first show that basic building blocks such as ReLU (or linear) shallow networks, with or without convolution, have easily expressed conservation laws, and no more than the known ones. In the case of a single attention layer, we also completely describe all conservation laws, and we show that residual blocks have the same conservation laws as the same block without a skip connection. We then introduce the notion of conservation laws that depend only on a subset of parameters (corresponding e.g. to a pair of consecutive layers, to a residual block, or to an attention layer). We demonstrate that the characterization of such laws can be reduced to the analysis of the corresponding building block in isolation. Finally, we examine how these newly discovered conservation principles, initially established in the continuous gradient flow regime, persist under discrete optimization dynamics, particularly in the context of Stochastic Gradient Descent (SGD).

Title: How to craft a deep reinforcement learning policy for wind farm flow control

Authors: Elie Kadoche, Pascal Bianchi, Florence Carton, Philippe Ciblat, Damien Ernst
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06204
Pdf URL: https://arxiv.org/pdf/2506.06204
Copy Paste: [[2506.06204]] How to craft a deep reinforcement learning policy for wind farm flow control(https://arxiv.org/abs/2506.06204)
Keywords: robust
Abstract: Within wind farms, wake effects between turbines can significantly reduce overall energy production. Wind farm flow control encompasses methods designed to mitigate these effects through coordinated turbine control. Wake steering, for example, consists in intentionally misaligning certain turbines with the wind to optimize airflow and increase power output. However, designing a robust wake steering controller remains challenging, and existing machine learning approaches are limited to quasi-static wind conditions or small wind farms. This work presents a new deep reinforcement learning methodology to develop a wake steering policy that overcomes these limitations. Our approach introduces a novel architecture that combines graph attention networks and multi-head self-attention blocks, alongside a novel reward function and training strategy. The resulting model computes the yaw angles of each turbine, optimizing energy production in time-varying wind conditions. An empirical study conducted on a steady-state, low-fidelity simulation shows that our model requires approximately 10 times fewer training steps than a fully connected neural network and achieves more robust performance compared to a strong optimization baseline, increasing energy production by up to 14%. To the best of our knowledge, this is the first deep reinforcement learning-based wake steering controller to generalize effectively across any time-varying wind conditions in a low-fidelity, steady-state numerical simulation setting.

Title: Building Models of Neurological Language

Authors: Henry Watkins
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06208
Pdf URL: https://arxiv.org/pdf/2506.06208
Copy Paste: [[2506.06208]] Building Models of Neurological Language(https://arxiv.org/abs/2506.06208)
Keywords: secure, extraction
Abstract: This report documents the development and evaluation of domain-specific language models for neurology. Initially focused on building a bespoke model, the project adapted to rapid advances in open-source and commercial medical LLMs, shifting toward leveraging retrieval-augmented generation (RAG) and representational models for secure, local deployment. Key contributions include the creation of neurology-specific datasets (case reports, QA sets, textbook-derived data), tools for multi-word expression extraction, and graph-based analyses of medical terminology. The project also produced scripts and Docker containers for local hosting. Performance metrics and graph community results are reported, with possible future work on multimodal models using open-source architectures like phi-4.

Title: Model-Driven Graph Contrastive Learning

Authors: Ali Azizpour, Nicolas Zilberstein, Santiago Segarra
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06212
Pdf URL: https://arxiv.org/pdf/2506.06212
Copy Paste: [[2506.06212]] Model-Driven Graph Contrastive Learning(https://arxiv.org/abs/2506.06212)
Keywords: generative
Abstract: We propose MGCL, a model-driven graph contrastive learning (GCL) framework that leverages graphons (probabilistic generative models for graphs) to guide contrastive learning by accounting for the data's underlying generative process. GCL has emerged as a powerful self-supervised framework for learning expressive node or graph representations without relying on annotated labels, which are often scarce in real-world data. By contrasting augmented views of graph data, GCL has demonstrated strong performance across various downstream tasks, such as node and graph classification. However, existing methods typically rely on manually designed or heuristic augmentation strategies that are not tailored to the underlying data distribution and operate at the individual graph level, ignoring similarities among graphs generated from the same model. Conversely, in our proposed approach, MGCL first estimates the graphon associated with the observed data and then defines a graphon-informed augmentation process, enabling data-adaptive and principled augmentations. Additionally, for graph-level tasks, MGCL clusters the dataset and estimates a graphon per group, enabling contrastive pairs to reflect shared semantics and structure. Extensive experiments on benchmark datasets demonstrate that MGCL achieves state-of-the-art performance, highlighting the advantages of incorporating generative models into GCL.
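
Illustrative sketch (not from the paper): a graphon is a symmetric function W(u, v) with values in [0, 1], and a graph is drawn from it by giving each node a latent position in [0, 1] and connecting each pair independently with probability W. The code below shows only that standard sampling step. The two-block `example_graphon` is a hypothetical choice, and the abstract above does not specify how MGCL estimates a graphon or builds augmentations from it, so none of that is attempted here.

```python
import random


def example_graphon(u, v):
    """Hypothetical two-community graphon: dense within a half, sparse across halves."""
    same_block = (u < 0.5) == (v < 0.5)
    return 0.8 if same_block else 0.1


def sample_graph(graphon, n_nodes, rng):
    """Draw latent positions, then an undirected simple graph with edge probability W(u_i, u_j)."""
    latents = [rng.random() for _ in range(n_nodes)]
    adj = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            if rng.random() < graphon(latents[i], latents[j]):
                adj[i][j] = adj[j][i] = 1
    return latents, adj


rng = random.Random(0)
# Two graphs drawn from the same generative model, e.g. as candidate contrastive views.
_, view_a = sample_graph(example_graphon, 60, rng)
_, view_b = sample_graph(example_graphon, 60, rng)
print(sum(map(sum, view_a)) // 2, sum(map(sum, view_b)) // 2)  # edge counts of each view
```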

Title: Can Theoretical Physics Research Benefit from Language Agents?

Authors: Sirui Lu, Zhijing Jin, Terry Jingchen Zhang, Pavel Kos, J. Ignacio Cirac, Bernhard Schölkopf
Subjects: cs.CL, cs.AI, math-ph, quant-ph
Abstract URL: https://arxiv.org/abs/2506.06214
Pdf URL: https://arxiv.org/pdf/2506.06214
Copy Paste: [[2506.06214]] Can Theoretical Physics Research Benefit from Language Agents?(https://arxiv.org/abs/2506.06214)
Keywords: robust, large language model
Abstract: Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and tools. We analyze current LLM capabilities for physics -- from mathematical reasoning to code generation -- identifying critical gaps in physical intuition, constraint satisfaction, and reliable reasoning. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments. Realizing this vision requires addressing fundamental challenges: ensuring physical consistency and developing robust verification methods. We call for collaborative efforts between physics and AI communities to help advance scientific discovery in physics.

Title: STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving

Authors: Christian Fruhwirth-Reisinger, Dušan Malić, Wei Lin, David Schinagl, Samuel Schulter, Horst Possegger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06218
Pdf URL: https://arxiv.org/pdf/2506.06218
Copy Paste: [[2506.06218]] STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving(https://arxiv.org/abs/2506.06218)
Keywords: robust, large language model
Abstract: We introduce STSBench, a scenario-based framework to benchmark the holistic understanding of vision-language models (VLMs) for autonomous driving. The framework automatically mines pre-defined traffic scenarios from any dataset using ground-truth annotations, provides an intuitive user interface for efficient human verification, and generates multiple-choice questions for model evaluation. Applied to the NuScenes dataset, we present STSnu, the first benchmark that evaluates the spatio-temporal reasoning capabilities of VLMs based on comprehensive 3D perception. Existing benchmarks typically target off-the-shelf or fine-tuned VLMs for images or videos from a single viewpoint and focus on semantic tasks such as object recognition, dense captioning, risk assessment, or scene understanding. In contrast, STSnu evaluates driving expert VLMs for end-to-end driving, operating on videos from multi-view cameras or LiDAR. It specifically assesses their ability to reason about both ego-vehicle actions and complex interactions among traffic participants, a crucial capability for autonomous vehicles. The benchmark features 43 diverse scenarios spanning multiple views and frames, resulting in 971 human-verified multiple-choice questions. A thorough evaluation uncovers critical shortcomings in existing models' ability to reason about fundamental traffic dynamics in complex environments. These findings highlight the urgent need for architectural advances that explicitly model spatio-temporal reasoning. By addressing a core gap in spatio-temporal evaluation, STSBench enables the development of more robust and explainable VLMs for autonomous driving.

Title: GenIR: Generative Visual Feedback for Mental Image Retrieval

Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06220
Pdf URL: https://arxiv.org/pdf/2506.06220
Copy Paste: [[2506.06220]] GenIR: Generative Visual Feedback for Mental Image Retrieval(https://arxiv.org/abs/2506.06220)
Keywords: diffusion, generative
Abstract: Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.

Title: PROVSYN: Synthesizing Provenance Graphs for Data Augmentation in Intrusion Detection Systems

Authors: Yi Huang, Wajih Ul Hassan, Yao Guo, Xiangqun Chen, Ding Li
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.06226
Pdf URL: https://arxiv.org/pdf/2506.06226
Copy Paste: [[2506.06226]] PROVSYN: Synthesizing Provenance Graphs for Data Augmentation in Intrusion Detection Systems(https://arxiv.org/abs/2506.06226)
Keywords: attack, large language model
Abstract: Provenance graph analysis plays a vital role in intrusion detection, particularly against Advanced Persistent Threats (APTs), by exposing complex attack patterns. While recent systems combine graph neural networks (GNNs) with natural language processing (NLP) to capture structural and semantic features, their effectiveness is limited by class imbalance in real-world data. To address this, we introduce PROVSYN, an automated framework that synthesizes provenance graphs through a three-phase pipeline: (1) heterogeneous graph structure synthesis with structural-semantic modeling, (2) rule-based topological refinement, and (3) context-aware textual attribute synthesis using large language models (LLMs). PROVSYN includes a comprehensive evaluation framework that integrates structural, textual, temporal, and embedding-based metrics, along with a semantic validation mechanism to assess the correctness of generated attack patterns and system behaviors. To demonstrate practical utility, we use the synthetic graphs to augment training datasets for downstream APT detection models. Experimental results show that PROVSYN produces high-fidelity graphs and improves detection performance through effective data augmentation.

Title: Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge

Authors: Yi Sui, Chaozhuo Li, Chen Zhang, Dawei Song, Qiuchi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06240
Pdf URL: https://arxiv.org/pdf/2506.06240
Copy Paste: [[2506.06240]] Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge(https://arxiv.org/abs/2506.06240)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) is a cost-effective approach to mitigate the hallucination of Large Language Models (LLMs) by incorporating the retrieved external knowledge into the generation process. However, external knowledge may conflict with the parametric knowledge of LLMs. Furthermore, current LLMs lack inherent mechanisms for resolving such knowledge conflicts, making traditional RAG methods suffer from degraded performance and stability. Thus, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to the framework is a novel approach that refines self-attention into a mixed-attention, distinguishing shared and private semantics for a controlled internal-external knowledge integration. To effectively facilitate DSSP in RAG, we further introduce an unsupervised hallucination detection method based on cognitive uncertainty, ensuring the necessity of introducing knowledge, and an Energy Quotient (EQ) based on attention difference matrices to reduce noise in the retrieved external knowledge. Extensive experiments on benchmark datasets show that DSSP-RAG can effectively resolve conflicts and enhance the complementarity of dual-stream knowledge, leading to superior performance over strong baselines.

Title: Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models

Authors: Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, Radu Grosu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06242
Pdf URL: https://arxiv.org/pdf/2506.06242
Copy Paste: [[2506.06242]] Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models(https://arxiv.org/abs/2506.06242)
Keywords: large language model
Abstract: Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists: 'conceptualization', the ability to recognize and reason about the same concept despite variations in visual form, a basic ability of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models totally failed on isomorphism detection and showed limited success in path/cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations in current AI models for visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at this https URL

Title: Cartridges: Lightweight and general-purpose long context representations via self-study

Authors: Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06266
Pdf URL: https://arxiv.org/pdf/2506.06266
Copy Paste: [[2506.06266]] Cartridges: Lightweight and general-purpose long context representations via self-study(https://arxiv.org/abs/2506.06266)
Keywords: large language model
Abstract: Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
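
Illustrative sketch (not from the paper): the statement that KV-cache memory scales with input length follows from the usual size formula for a decoder-only transformer: two tensors (keys and values) per layer, each of shape `n_kv_heads x seq_len x head_dim`. The model configuration below is hypothetical and is not taken from the abstract above; only the linear dependence on `seq_len` is the point.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache for one sequence.

    The leading factor of 2 counts one key tensor and one value tensor per layer;
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration, chosen only to make the arithmetic concrete.
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)

full_context = kv_cache_bytes(seq_len=128 * 1024, **HYPOTHETICAL_MODEL)
print(full_context / 2**30)  # 16.0 (GiB) for this hypothetical configuration

# Doubling the context doubles the cache: the cost is linear in sequence length.
print(kv_cache_bytes(seq_len=256 * 1024, **HYPOTHETICAL_MODEL) / full_context)  # 2.0
```

Under this formula, a trained cache with fewer entries than the corpus has tokens shrinks serving memory in direct proportion to its length.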

Title: AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization

Authors: Mukur Gupta, Nikhil Reddy Varimalla, Nicholas Deas, Melanie Subbiah, Kathleen McKeown
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06273
Pdf URL: https://arxiv.org/pdf/2506.06273
Copy Paste: [[2506.06273]] AdvSumm: Adversarial Training for Bias Mitigation in Text Summarization(https://arxiv.org/abs/2506.06273)
Keywords: robust, fair, transformer, large language model
Abstract: Large Language Models (LLMs) have achieved impressive performance in text summarization and are increasingly deployed in real-world applications. However, these systems often inherit associative and framing biases from pre-training data, leading to inappropriate or unfair outputs in downstream tasks. In this work, we present AdvSumm (Adversarial Summarization), a domain-agnostic training framework designed to mitigate bias in text summarization through improved generalization. Inspired by adversarial robustness, AdvSumm introduces a novel Perturber component that applies gradient-guided perturbations at the embedding level of Sequence-to-Sequence models, enhancing the model's robustness to input variations. We empirically demonstrate that AdvSumm effectively reduces different types of bias in summarization (specifically, name-nationality bias and political framing bias) without compromising summarization quality. Compared to standard transformers and data augmentation techniques like back-translation, AdvSumm achieves stronger bias mitigation performance across benchmark datasets.

Title: STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Authors: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06276
Pdf URL: https://arxiv.org/pdf/2506.06276
Copy Paste: [[2506.06276]] STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis(https://arxiv.org/abs/2506.06276)
Keywords: diffusion, transformer, generative
Abstract: We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.

Title: Distillation Robustifies Unlearning

Authors: Bruce W. Lee, Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, Bryce Woodworth, Alex Cloud, Alexander Matt Turner
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06278
Pdf URL: https://arxiv.org/pdf/2506.06278
Copy Paste: [[2506.06278]] Distillation Robustifies Unlearning(https://arxiv.org/abs/2506.06278)
Keywords: robust
Abstract: Current LLM unlearning methods are not robust: they can be reverted easily with a few steps of finetuning. This is true even for the idealized unlearning method of training to imitate an oracle model that was never exposed to unwanted information, suggesting that output-based finetuning is insufficient to achieve robust unlearning. In a similar vein, we find that training a randomly initialized student to imitate an unlearned model transfers desired behaviors while leaving undesired capabilities behind. In other words, distillation robustifies unlearning. Building on this insight, we propose Unlearn-Noise-Distill-on-Outputs (UNDO), a scalable method that distills an unlearned model into a partially noised copy of itself. UNDO introduces a tunable tradeoff between compute cost and robustness, establishing a new Pareto frontier on synthetic language and arithmetic tasks. At its strongest setting, UNDO matches the robustness of a model retrained from scratch with perfect data filtering while using only 60-80% of the compute and requiring only 0.01% of the pretraining data to be labeled. We also show that UNDO robustifies unlearning on the more realistic Weapons of Mass Destruction Proxy (WMDP) benchmark. Since distillation is widely used in practice, incorporating an unlearning step beforehand offers a convenient path to robust capability removal.

Title: CoMemo: LVLMs Need Image Context with Image Memory

Authors: Shi Liu, Weijie Su, Xizhou Zhu, Wenhai Wang, Jifeng Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06279
Pdf URL: https://arxiv.org/pdf/2506.06279
Copy Paste: [[2506.06279]] CoMemo: LVLMs Need Image Context with Image Memory(https://arxiv.org/abs/2506.06279)
Keywords: large language model
Abstract: Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at this https URL.

Title: Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias

Authors: Yuanzhe Hu, Kinshuk Goel, Vlad Killiakov, Yaoqing Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06280
Pdf URL: https://arxiv.org/pdf/2506.06280
Copy Paste: [[2506.06280]] Eigenspectrum Analysis of Neural Networks without Aspect Ratio Bias(https://arxiv.org/abs/2506.06280)
Keywords: large language model
Abstract: Diagnosing deep neural networks (DNNs) through the eigenspectrum of weight matrices has been an active area of research in recent years. At a high level, eigenspectrum analysis of DNNs involves measuring the heavytailness of the empirical spectral densities (ESD) of weight matrices. It provides insight into how well a model is trained and can guide decisions on assigning better layer-wise training hyperparameters. In this paper, we address a challenge associated with such eigenspectrum methods: the impact of the aspect ratio of weight matrices on estimated heavytailness metrics. We demonstrate that matrices of varying sizes (and aspect ratios) introduce a non-negligible bias in estimating heavytailness metrics, leading to inaccurate model diagnosis and layer-wise hyperparameter assignment. To overcome this challenge, we propose FARMS (Fixed-Aspect-Ratio Matrix Subsampling), a method that normalizes the weight matrices by subsampling submatrices with a fixed aspect ratio. Instead of measuring the heavytailness of the original ESD, we measure the average ESD of these subsampled submatrices. We show that measuring the heavytailness of these submatrices with the fixed aspect ratio can effectively mitigate the aspect ratio bias. We validate our approach across various optimization techniques and application domains that involve eigenspectrum analysis of weights, including image classification in computer vision (CV) models, scientific machine learning (SciML) model training, and large language model (LLM) pruning. Our results show that despite its simplicity, FARMS uniformly improves the accuracy of eigenspectrum analysis while enabling more effective layer-wise hyperparameter assignment in these application domains. In one of the LLM pruning experiments, FARMS reduces the perplexity of the LLaMA-7B model by 17.3% when compared with the state-of-the-art method.
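
Note: the fixed-aspect-ratio subsampling idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the non-overlapping tiling, the function name `farms_pooled_eigenvalues`, and the parameters `sub_cols` and `ratio` are hypothetical. Pooling the eigenvalues of equally sized blocks is used here as one way to form an average ESD. Fitting a heavy-tail metric to the pooled values is not shown.

```python
import numpy as np


def farms_pooled_eigenvalues(W, sub_cols, ratio=1.0):
    """Pool eigenvalues of B^T B over blocks B with a fixed aspect ratio.

    Each block has sub_cols columns and round(ratio * sub_cols) rows, so
    every layer is measured at the same rows/cols ratio regardless of the
    shape of W. Blocks are tiled without overlap (an assumption).
    """
    W = np.asarray(W, dtype=float)
    sub_rows = int(round(ratio * sub_cols))
    m, n = W.shape
    if sub_rows < 1 or sub_rows > m or sub_cols > n:
        raise ValueError("submatrix does not fit inside W")
    pooled = []
    for i in range(0, m - sub_rows + 1, sub_rows):
        for j in range(0, n - sub_cols + 1, sub_cols):
            block = W[i:i + sub_rows, j:j + sub_cols]
            # Squared singular values of the block are the eigenvalues of B^T B.
            singular_values = np.linalg.svd(block, compute_uv=False)
            pooled.append(singular_values ** 2)
    return np.concatenate(pooled)


# Hypothetical 8x4 weight matrix, measured through square 2x2 blocks.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
eigs = farms_pooled_eigenvalues(W, sub_cols=2, ratio=1.0)
```

Because every block has the same shape, two layers with very different aspect ratios would yield spectra that can be compared on equal footing, which is the bias the abstract says FARMS is meant to mitigate.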

Title: TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

Authors: Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06281
Pdf URL: https://arxiv.org/pdf/2506.06281
Copy Paste: [[2506.06281]] TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation(https://arxiv.org/abs/2506.06281)
Keywords: segmentation
Abstract: Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land cover. TerraFM achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at: this https URL.