2025-07-01

Title: Robust Perspective Correction for Real-World Crack Evolution Tracking in Image-Based Structural Health Monitoring

Authors: Xinxin Sun, Peter Chang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22437
Pdf URL: https://arxiv.org/pdf/2506.22437
Copy Paste: [[2506.22437]] Robust Perspective Correction for Real-World Crack Evolution Tracking in Image-Based Structural Health Monitoring(https://arxiv.org/abs/2506.22437)
Keywords: robust, diffusion
Abstract: Accurate image alignment is essential for monitoring crack evolution in structural health monitoring (SHM), particularly under real-world conditions involving perspective distortion, occlusion, and low contrast. However, traditional feature detectors such as SIFT and SURF, which rely on Gaussian-based scale spaces, tend to suppress high-frequency edges, making them unsuitable for thin crack localization. Lightweight binary alternatives like ORB and BRISK, while computationally efficient, often suffer from poor keypoint repeatability on textured or shadowed surfaces. This study presents a physics-informed alignment framework that adapts the open KAZE architecture to SHM-specific challenges. By utilizing nonlinear anisotropic diffusion to construct a crack-preserving scale space, and integrating RANSAC-based homography estimation, the framework enables accurate geometric correction without the need for training, parameter tuning, or prior calibration. The method is validated on time-lapse images of masonry and concrete acquired via handheld smartphone under varied field conditions, including shadow interference, cropping, oblique viewing angles, and surface clutter. Compared to classical detectors, the proposed framework reduces crack area and spine length errors by up to 70 percent and 90 percent, respectively, while maintaining sub-5 percent alignment error in key metrics. Unsupervised, interpretable, and computationally lightweight, this approach supports scalable deployment via UAVs and mobile platforms. By tailoring nonlinear scale-space modeling to SHM image alignment, this work offers a robust and physically grounded alternative to conventional techniques for tracking real-world crack evolution.

Title: Learning Interpretable Rules from Neural Networks: Neurosymbolic AI for Radar Hand Gesture Recognition

Authors: Sarah Seifi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2506.22443
Pdf URL: https://arxiv.org/pdf/2506.22443
Copy Paste: [[2506.22443]] Learning Interpretable Rules from Neural Networks: Neurosymbolic AI for Radar Hand Gesture Recognition(https://arxiv.org/abs/2506.22443)
Keywords: interpretability
Abstract: Rule-based models offer interpretability but struggle with complex data, while deep neural networks excel in performance yet lack transparency. This work investigates a neuro-symbolic rule learning neural network named RL-Net that learns interpretable rule lists through neural optimization, applied for the first time to radar-based hand gesture recognition (HGR). We benchmark RL-Net against a fully transparent rule-based system (MIRA) and an explainable black-box model (XentricAI), evaluating accuracy, interpretability, and user adaptability via transfer learning. Our results show that RL-Net achieves a favorable trade-off, maintaining strong performance (93.03% F1) while significantly reducing rule complexity. We identify optimization challenges specific to rule pruning and hierarchy bias and propose stability-enhancing modifications. Compared to MIRA and XentricAI, RL-Net emerges as a practical middle ground between transparency and performance. This study highlights the real-world feasibility of neuro-symbolic models for interpretable HGR and offers insights for extending explainable AI to edge-deployable sensing systems.

Title: Active Learning for Forecasting Severity among Patients with Post Acute Sequelae of SARS-CoV-2

Authors: Jing Wang, Amar Sra, Jeremy C. Weiss
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2506.22444
Pdf URL: https://arxiv.org/pdf/2506.22444
Copy Paste: [[2506.22444]] Active Learning for Forecasting Severity among Patients with Post Acute Sequelae of SARS-CoV-2(https://arxiv.org/abs/2506.22444)
Keywords: large language model
Abstract: The long-term effects of Postacute Sequelae of SARS-CoV-2, known as PASC, pose a significant challenge to healthcare systems worldwide. Accurate identification of progression events, such as hospitalization and reinfection, is essential for effective patient management and resource allocation. However, traditional models trained on structured data struggle to capture the nuanced progression of PASC. In this study, we introduce the first publicly available cohort of 18 PASC patients, with text time series features based on Large Language Model Llama-3.1-70B-Instruct and clinical risk annotated by clinical expert. We propose an Active Attention Network to predict the clinical risk and identify progression events related to the risk. By integrating human expertise with active learning, we aim to enhance clinical risk prediction accuracy and enable progression events identification with fewer number of annotation. The ultimate goal is to improves patient care and decision-making for SARS-CoV-2 patient.

Title: Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning for Cyber-Physical Systems Security

Authors: Saad Alqithami
Subjects: cs.LG, cs.AI, cs.CR, cs.MA
Abstract URL: https://arxiv.org/abs/2506.22445
Pdf URL: https://arxiv.org/pdf/2506.22445
Copy Paste: [[2506.22445]] Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning for Cyber-Physical Systems Security(https://arxiv.org/abs/2506.22445)
Keywords: security, defense, attack
Abstract: Cyber-Physical Systems play a critical role in the infrastructure of various sectors, including manufacturing, energy distribution, and autonomous transportation systems. However, their increasing connectivity renders them highly vulnerable to sophisticated cyber threats, such as adaptive and zero-day attacks, against which traditional security methods like rule-based intrusion detection and single-agent reinforcement learning prove insufficient. To overcome these challenges, this paper introduces a novel Hierarchical Adversarially-Resilient Multi-Agent Reinforcement Learning (HAMARL) framework. HAMARL employs a hierarchical structure consisting of local agents dedicated to subsystem security and a global coordinator that oversees and optimizes comprehensive, system-wide defense strategies. Furthermore, the framework incorporates an adversarial training loop designed to simulate and anticipate evolving cyber threats, enabling proactive defense adaptation. Extensive experimental evaluations conducted on a simulated industrial IoT testbed indicate that HAMARL substantially outperforms traditional multi-agent reinforcement learning approaches, significantly improving attack detection accuracy, reducing response times, and ensuring operational continuity. The results underscore the effectiveness of combining hierarchical multi-agent coordination with adversarially-aware training to enhance the resilience and security of next-generation CPS.

Title: EAGLE: Efficient Alignment of Generalized Latent Embeddings for Multimodal Survival Prediction with Interpretable Attribution Analysis

Authors: Aakash Tripathi, Asim Waqas, Matthew B. Schabath, Yasin Yilmaz, Ghulam Rasool
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22446
Pdf URL: https://arxiv.org/pdf/2506.22446
Copy Paste: [[2506.22446]] EAGLE: Efficient Alignment of Generalized Latent Embeddings for Multimodal Survival Prediction with Interpretable Attribution Analysis(https://arxiv.org/abs/2506.22446)
Keywords: interpretability
Abstract: Accurate cancer survival prediction requires integration of diverse data modalities that reflect the complex interplay between imaging, clinical parameters, and textual reports. However, existing multimodal approaches suffer from simplistic fusion strategies, massive computational requirements, and lack of interpretability-critical barriers to clinical adoption. We present EAGLE (Efficient Alignment of Generalized Latent Embeddings), a novel deep learning framework that addresses these limitations through attention-based multimodal fusion with comprehensive attribution analysis. EAGLE introduces four key innovations: (1) dynamic cross-modal attention mechanisms that learn hierarchical relationships between modalities, (2) massive dimensionality reduction (99.96%) while maintaining predictive performance, (3) three complementary attribution methods providing patient-level interpretability, and (4) a unified pipeline enabling seamless adaptation across cancer types. We evaluated EAGLE on 911 patients across three distinct malignancies: glioblastoma (GBM, n=160), intraductal papillary mucinous neoplasms (IPMN, n=171), and non-small cell lung cancer (NSCLC, n=580). Patient-level analysis showed high-risk individuals relied more heavily on adverse imaging features, while low-risk patients demonstrated balanced modality contributions. Risk stratification identified clinically meaningful groups with 4-fold (GBM) to 5-fold (NSCLC) differences in median survival, directly informing treatment intensity decisions. By combining state-of-the-art performance with clinical interpretability, EAGLE bridges the gap between advanced AI capabilities and practical healthcare deployment, offering a scalable solution for multimodal survival prediction that enhances both prognostic accuracy and physician trust in automated predictions.

Title: Vision Transformers for Multi-Variable Climate Downscaling: Emulating Regional Climate Models with a Shared Encoder and Multi-Decoder Architecture

Authors: Fabio Merizzi, Harilaos Loukos
Subjects: cs.LG, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22447
Pdf URL: https://arxiv.org/pdf/2506.22447
Copy Paste: [[2506.22447]] Vision Transformers for Multi-Variable Climate Downscaling: Emulating Regional Climate Models with a Shared Encoder and Multi-Decoder Architecture(https://arxiv.org/abs/2506.22447)
Keywords: transformer
Abstract: Global Climate Models (GCMs) are critical for simulating large-scale climate dynamics, but their coarse spatial resolution limits their applicability in regional studies. Regional Climate Models (RCMs) refine this through dynamic downscaling, albeit at considerable computational cost and with limited flexibility. While deep learning has emerged as an efficient data-driven alternative, most existing studies have focused on single-variable models that downscale one variable at a time. This approach can lead to limited contextual awareness, redundant computation, and lack of cross-variable interaction. Our study addresses these limitations by proposing a multi-task, multi-variable Vision Transformer (ViT) architecture with a shared encoder and variable-specific decoders (1EMD). The proposed architecture jointly predicts three key climate variables: surface temperature (tas), wind speed (sfcWind), and 500 hPa geopotential height (zg500), directly from GCM-resolution inputs, emulating RCM-scale downscaling over Europe. We show that our multi-variable approach achieves positive cross-variable knowledge transfer and consistently outperforms single-variable baselines trained under identical conditions, while also improving computational efficiency. These results demonstrate the effectiveness of multi-variable modeling for high-resolution climate downscaling.

Title: Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization

Authors: Weizhi Gao, Zhichao Hou, Junqi Yin, Feiyi Wang, Linyu Peng, Xiaorui Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22463
Pdf URL: https://arxiv.org/pdf/2506.22463
Copy Paste: [[2506.22463]] Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization(https://arxiv.org/abs/2506.22463)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherents the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significant reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at this https URL.

Title: Hallucination Detection with Small Language Models

Authors: Ming Cheung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22486
Pdf URL: https://arxiv.org/pdf/2506.22486
Copy Paste: [[2506.22486]] Hallucination Detection with Small Language Models(https://arxiv.org/abs/2506.22486)
Keywords: large language model
Abstract: Since the introduction of ChatGPT, large language models (LLMs) have demonstrated significant utility in various tasks, such as answering questions through retrieval-augmented generation. Context can be retrieved using a vectorized database, serving as a foundation for LLMs to generate responses. However, hallucinations in responses can undermine the reliability of LLMs in practical applications, and they are not easily detectable in the absence of ground truth, particularly in question-and-answer scenarios. This paper proposes a framework that integrates multiple small language models to verify responses generated by LLMs using the retrieved context from a vectorized database. By breaking down the responses into individual sentences and utilizing the probability of generating "Yes" tokens from the outputs of multiple models for a given set of questions, responses, and relevant context, hallucinations can be detected. The proposed framework is validated through experiments with real datasets comprising over 100 sets of questions, answers, and contexts, including responses with fully and partially correct sentences. The results demonstrate a 10\% improvement in F1 scores for detecting correct responses compared to hallucinations, indicating that multiple small language models can be effectively employed for answer verification, providing a scalable and efficient solution for both academic and practical applications.

Title: PromptAug: Fine-grained Conflict Classification Using Data Augmentation

Authors: Oliver Warke, Joemon M. Jose, Faegheh Hasibi, Jan Breitsohl
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2506.22491
Pdf URL: https://arxiv.org/pdf/2506.22491
Copy Paste: [[2506.22491]] PromptAug: Fine-grained Conflict Classification Using Data Augmentation(https://arxiv.org/abs/2506.22491)
Keywords: robust, large language model
Abstract: Given the rise of conflicts on social media, effective classification models to detect harmful behaviours are essential. Following the garbage-in-garbage-out maxim, machine learning performance depends heavily on training data quality. However, high-quality labelled data, especially for nuanced tasks like identifying conflict behaviours, is limited, expensive, and difficult to obtain. Additionally, as social media platforms increasingly restrict access to research data, text data augmentation is gaining attention as an alternative to generate training data. Augmenting conflict-related data poses unique challenges due to Large Language Model (LLM) guardrails that prevent generation of offensive content. This paper introduces PromptAug, an innovative LLM-based data augmentation method. PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. To thoroughly evaluate PromptAug against other data augmentation methods we conduct a robust evaluation using extreme data scarcity scenarios, quantitative diversity analysis and a qualitative thematic analysis. The thematic analysis identifies four problematic patterns in augmented text: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation. Overall, this work presents PromptAug as an effective method for augmenting data in sensitive tasks like conflict detection, offering a unique, interdisciplinary evaluation grounded in both natural language processing and social science methodology.

Title: ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction

Authors: Hao Liu, Yu Hu, Rakiba Rayhana, Ling Bai, Zheng Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22498
Pdf URL: https://arxiv.org/pdf/2506.22498
Copy Paste: [[2506.22498]] ViFusionTST: Deep Fusion of Time-Series Image Representations from Load Signals for Early Bed-Exit Prediction(https://arxiv.org/abs/2506.22498)
Keywords: privacy, transformer
Abstract: Bed-related falls remain a leading source of injury in hospitals and long-term-care facilities, yet many commercial alarms trigger only after a patient has already left the bed. We show that early bed-exit intent can be predicted using only four low-cost load cells mounted under the bed legs. The resulting load signals are first converted into a compact set of complementary images: an RGB line plot that preserves raw waveforms and three texture maps - recurrence plot, Markov transition field, and Gramian angular field - that expose higher-order dynamics. We introduce ViFusionTST, a dual-stream Swin Transformer that processes the line plot and texture maps in parallel and fuses them through cross-attention to learn data-driven modality weights. To provide a realistic benchmark, we collected six months of continuous data from 95 beds in a long-term-care facility. On this real-world dataset ViFusionTST reaches an accuracy of 0.885 and an F1 score of 0.794, surpassing recent 1D and 2D time-series baselines across F1, recall, accuracy, and AUPRC. The results demonstrate that image-based fusion of load-sensor signals for time series classification is a practical and effective solution for real-time, privacy-preserving fall prevention.

Title: Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models

Authors: Weiyi Zhao, Xiaoyu Tan, Liang Liu, Sijia Li, Youwei Song, Xihe Qiu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22500
Pdf URL: https://arxiv.org/pdf/2506.22500
Copy Paste: [[2506.22500]] Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models(https://arxiv.org/abs/2506.22500)
Keywords: diffusion, large language model
Abstract: Surgical risk identification is critical for patient safety and reducing preventable medical errors. While multimodal large language models (MLLMs) show promise for automated operating room (OR) risk detection, they often exhibit visual-semantic knowledge conflicts (VS-KC), failing to identify visual safety violations despite understanding textual rules. To address this, we introduce a dataset comprising over 34,000 synthetic images generated by diffusion models, depicting operating room scenes containing entities that violate established safety rules. These images were created to alleviate data scarcity and examine MLLMs vulnerabilities. In addition, the dataset includes 214 human-annotated images that serve as a gold-standard reference for validation. This comprehensive dataset, spanning diverse perspectives, stages, and configurations, is designed to expose and study VS-KC. Fine-tuning on OR-VSKC significantly improves MLLMs' detection of trained conflict entities and generalizes well to new viewpoints for these entities, but performance on untrained entity types remains poor, highlighting learning specificity and the need for comprehensive training. The main contributions of this work include: (1) a data generation methodology tailored for rule-violation scenarios; (2) the release of the OR-VSKC dataset and its associated benchmark as open-source resources; and (3) an empirical analysis of violation-sensitive knowledge consistency in representative MLLMs. The dataset and appendix are available at this https URL.

Title: How Can Multimodal Remote Sensing Datasets Transform Classification via SpatialNet-ViT?

Authors: Gautam Siddharth Kashyap, Manaswi Kulahara, Nipun Joshi, Usman Naseem
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22501
Pdf URL: https://arxiv.org/pdf/2506.22501
Copy Paste: [[2506.22501]] How Can Multimodal Remote Sensing Datasets Transform Classification via SpatialNet-ViT?(https://arxiv.org/abs/2506.22501)
Keywords: robust, transformer
Abstract: Remote sensing datasets offer significant promise for tackling key classification tasks such as land-use categorization, object presence detection, and rural/urban classification. However, many existing studies tend to focus on narrow tasks or datasets, which limits their ability to generalize across various remote sensing classification challenges. To overcome this, we propose a novel model, SpatialNet-ViT, leveraging the power of Vision Transformers (ViTs) and Multi-Task Learning (MTL). This integrated approach combines spatial awareness with contextual understanding, improving both classification accuracy and scalability. Additionally, techniques like data augmentation, transfer learning, and multi-task learning are employed to enhance model robustness and its ability to generalize across diverse datasets

Title: What Makes a Dribble Successful? Insights From 3D Pose Tracking Data

Authors: Michiel Schepers, Pieter Robberechts, Jan Van Haaren, Jesse Davis
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22503
Pdf URL: https://arxiv.org/pdf/2506.22503
Copy Paste: [[2506.22503]] What Makes a Dribble Successful? Insights From 3D Pose Tracking Data(https://arxiv.org/abs/2506.22503)
Keywords: attack
Abstract: Data analysis plays an increasingly important role in soccer, offering new ways to evaluate individual and team performance. One specific application is the evaluation of dribbles: one-on-one situations where an attacker attempts to bypass a defender with the ball. While previous research has primarily relied on 2D positional tracking data, this fails to capture aspects like balance, orientation, and ball control, limiting the depth of current insights. This study explores how pose tracking data (capturing players' posture and movement in three dimensions) can improve our understanding of dribbling skills. We extract novel pose-based features from 1,736 dribbles in the 2022/23 Champions League season and evaluate their impact on dribble success. Our results indicate that features capturing the attacker's balance and the alignment of the orientation between the attacker and defender are informative for predicting dribble success. Incorporating these pose-based features on top of features derived from traditional 2D positional data leads to a measurable improvement in model performance.

Title: Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection

Authors: Hassan Baker, Austin J. Brockmeier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22504
Pdf URL: https://arxiv.org/pdf/2506.22504
Copy Paste: [[2506.22504]] Patch2Loc: Learning to Localize Patches for Unsupervised Brain Lesion Detection(https://arxiv.org/abs/2506.22504)
Keywords: segmentation
Abstract: Detecting brain lesions as abnormalities observed in magnetic resonance imaging (MRI) is essential for diagnosis and treatment. In the search of abnormalities, such as tumors and malformations, radiologists may benefit from computer-aided diagnostics that use computer vision systems trained with machine learning to segment normal tissue from abnormal brain tissue. While supervised learning methods require annotated lesions, we propose a new unsupervised approach (Patch2Loc) that learns from normal patches taken from structural MRI. We train a neural network model to map a patch back to its spatial location within a slice of the brain volume. During inference, abnormal patches are detected by the relatively higher error and/or variance of the location prediction. This generates a heatmap that can be integrated into pixel-wise methods to achieve finer-grained segmentation. We demonstrate the ability of our model to segment abnormal brain tissues by applying our approach to the detection of tumor tissues in MRI on T2-weighted images from BraTS2021 and MSLUB datasets and T1-weighted images from ATLAS and WMH datasets. We show that it outperforms the state-of-the art in unsupervised segmentation. The codebase for this work can be found on our \href{this https URL}{GitHub page}.

Title: Weakly Supervised Object Segmentation by Background Conditional Divergence

Authors: Hassan Baker, Matthew S. Emigh, Austin J. Brockmeier
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22505
Pdf URL: https://arxiv.org/pdf/2506.22505
Copy Paste: [[2506.22505]] Weakly Supervised Object Segmentation by Background Conditional Divergence(https://arxiv.org/abs/2506.22505)
Keywords: generative, segmentation
Abstract: As a computer vision task, automatic object segmentation remains challenging in specialized image domains without massive labeled data, such as synthetic aperture sonar images, remote sensing, biomedical imaging, etc. In any domain, obtaining pixel-wise segmentation masks is expensive. In this work, we propose a method for training a masking network to perform binary object segmentation using weak supervision in the form of image-wise presence or absence of an object of interest, which provides less information but may be obtained more quickly from manual or automatic labeling. A key step in our method is that the segmented objects can be placed into background-only images to create realistic, images of the objects with counterfactual backgrounds. To create a contrast between the original and counterfactual background images, we propose to first cluster the background-only images, and then during learning create counterfactual images that blend objects segmented from their original source backgrounds to backgrounds chosen from a targeted cluster. One term in the training loss is the divergence between these counterfactual images and the real object images with backgrounds of the target cluster. The other term is a supervised loss for background-only images. While an adversarial critic could provide the divergence, we use sample-based divergences. We conduct experiments on side-scan and synthetic aperture sonar in which our approach succeeds compared to previous unsupervised segmentation baselines that were only tested on natural images. Furthermore, to show generality we extend our experiments to natural images, obtaining reasonable performance with our method that avoids pretrained networks, generative networks, and adversarial critics. The basecode for this work can be found at \href{GitHub}{this https URL}.

Title: SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning

Authors: Momin Ahmad Khan, Yasra Chandio, Fatima Muhammad Anwar
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22506
Pdf URL: https://arxiv.org/pdf/2506.22506
Copy Paste: [[2506.22506]] SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning(https://arxiv.org/abs/2506.22506)
Keywords: security, privacy, defense, attack, robust, federate
Abstract: Federated Prompt Learning has emerged as a communication-efficient and privacy-preserving paradigm for adapting large vision-language models like CLIP across decentralized clients. However, the security implications of this setup remain underexplored. In this work, we present the first study of backdoor attacks in Federated Prompt Learning. We show that when malicious clients inject visually imperceptible, learnable noise triggers into input images, the global prompt learner becomes vulnerable to targeted misclassification while still maintaining high accuracy on clean inputs. Motivated by this vulnerability, we propose SABRE-FL, a lightweight, modular defense that filters poisoned prompt updates using an embedding-space anomaly detector trained offline on out-of-distribution data. SABRE-FL requires no access to raw client data or labels and generalizes across diverse datasets. We show, both theoretically and empirically, that malicious clients can be reliably identified and filtered using an embedding-based detector. Across five diverse datasets and four baseline defenses, SABRE-FL outperforms all baselines by significantly reducing backdoor accuracy while preserving clean accuracy, demonstrating strong empirical performance and underscoring the need for robust prompt learning in future federated systems.

Title: AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text

Authors: Chenyang Shao, Tianxing Li, Chenhao Pu, Fengli Xu, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22508
Pdf URL: https://arxiv.org/pdf/2506.22508
Copy Paste: [[2506.22508]] AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text(https://arxiv.org/abs/2506.22508)
Keywords: privacy, attack, steal, large language model
Abstract: In today's digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization this http URL, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at this https URL.

Title: FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

Authors: Hang Xu, Jie Huang, Linjiang Huang, Dong Li, Yidi Liu, Feng Zhao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22509
Pdf URL: https://arxiv.org/pdf/2506.22509
Copy Paste: [[2506.22509]] FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment(https://arxiv.org/abs/2506.22509)
Keywords: diffusion
Abstract: Domain Adaptation(DA) for dense prediction tasks is an important topic, which enhances the dense prediction model's performance when tested on its unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, the exploration of DA designs tailored to this framework is worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics to domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence meeting variations of sampling noise, we utilize the statistics from the high-confidence regions progressively to guide the noise statistic adjustment during the sampling process. Notably, our method demonstrates the effectiveness of enhancing the DA capability of DDP models across four common dense prediction tasks. Code is available at \href{this https URL}{this https URL}.

Title: Lightning the Night with Generative Artificial Intelligence

Authors: Tingting Zhou, Feng Zhang, Haoyang Fu, Baoxiang Pan, Renhe Zhang, Feng Lu, Zhixin Yang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22511
Pdf URL: https://arxiv.org/pdf/2506.22511
Copy Paste: [[2506.22511]] Lightning the Night with Generative Artificial Intelligence(https://arxiv.org/abs/2506.22511)
Keywords: diffusion, generative
Abstract: The visible light reflectance data from geostationary satellites is crucial for meteorological observations and plays an important role in weather monitoring and forecasting. However, due to the lack of visible light at night, it is impossible to conduct continuous all-day weather observations using visible light reflectance data. This study pioneers the use of generative diffusion models to address this limitation. Based on the multi-band thermal infrared brightness temperature data from the Advanced Geostationary Radiation Imager (AGRI) onboard the Fengyun-4B (FY4B) geostationary satellite, we developed a high-precision visible light reflectance retrieval model, called Reflectance Diffusion (RefDiff), which enables 0.47~\mu\mathrm{m}, 0.65~\mu\mathrm{m}, and 0.825~\mu\mathrm{m} bands visible light reflectance retrieval at night. Compared to the classical models, RefDiff not only significantly improves accuracy through ensemble averaging but also provides uncertainty estimation. Specifically, the SSIM index of RefDiff can reach 0.90, with particularly significant improvements in areas with complex cloud structures and thick clouds. The model's nighttime retrieval capability was validated using VIIRS nighttime product, demonstrating comparable performance to its daytime counterpart. In summary, this research has made substantial progress in the ability to retrieve visible light reflectance at night, with the potential to expand the application of nighttime visible light data.

Title: In-context learning for the classification of manipulation techniques in phishing emails

Authors: Antony Dalmiere (LAAS-TRUST, LAAS), Guillaume Auriol (LAAS-TRUST, INSA Toulouse), Vincent Nicomette (LAAS-TSF, LAAS), Pascal Marchand (LERASS)
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22515
Pdf URL: https://arxiv.org/pdf/2506.22515
Copy Paste: [[2506.22515]] In-context learning for the classification of manipulation techniques in phishing emails(https://arxiv.org/abs/2506.22515)
Keywords: attack, large language model
Abstract: Traditional phishing detection often overlooks psychological manipulation. This study investigates using Large Language Model (LLM) In-Context Learning (ICL) for fine-grained classification of phishing emails based on a taxonomy of 40 manipulation techniques. Using few-shot examples with GPT-4o-mini on real-world French phishing emails (SignalSpam), we evaluated performance against a human-annotated test set (100 emails). The approach effectively identifies prevalent techniques (e.g., Baiting, Curiosity Appeal, Request For Minor Favor) with a promising accuracy of 0.76. This work demonstrates ICL's potential for nuanced phishing analysis and provides insights into attacker strategies.

Title: Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis

Authors: Jingkai Li
Subjects: cs.CL, cs.AI, cs.NE, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.22516
Pdf URL: https://arxiv.org/pdf/2506.22516
Copy Paste: [[2506.22516]] Can "consciousness" be observed from large language model (LLM) internal states? Dissecting LLM representations obtained from Theory of Mind test with Integrated Information Theory and Span Representation analysis(https://arxiv.org/abs/2506.22516)
Keywords: transformer, large language model
Abstract: Integrated Information Theory (IIT) provides a quantitative framework for explaining consciousness phenomenon, positing that conscious systems comprise elements integrated through causal properties. We apply IIT 3.0 and 4.0 -- the latest iterations of this framework -- to sequences of Large Language Model (LLM) representations, analyzing data derived from existing Theory of Mind (ToM) test results. Our study systematically investigates whether the differences of ToM test performances, when presented in the LLM representations, can be revealed by IIT estimates, i.e., $\Phi^{\max}$ (IIT 3.0), $\Phi$ (IIT 4.0), Conceptual Information (IIT 3.0), and $\Phi$-structure (IIT 4.0). Furthermore, we compare these metrics with the Span Representations independent of any estimate for consciousness. This additional effort aims to differentiate between potential "consciousness" phenomena and inherent separations within LLM representational space. We conduct comprehensive experiments examining variations across LLM transformer layers and linguistic spans from stimuli. Our results suggest that sequences of contemporary Transformer-based LLM representations lack statistically significant indicators of observed "consciousness" phenomena but exhibit intriguing patterns under $\textit{spatio}$-permutational analyses. The Appendix and code are available as Supplementary Materials at: this https URL.

Title: Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation

Authors: Deyu Zou, Yongqiang Chen, Mufei Li, Siqi Miao, Chenxi Liu, Bo Han, James Cheng, Pan Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22518
Pdf URL: https://arxiv.org/pdf/2506.22518
Copy Paste: [[2506.22518]] Weak-to-Strong GraphRAG: Aligning Weak Retrievers with Large Language Models for Graph-based Retrieval Augmented Generation(https://arxiv.org/abs/2506.22518)
Keywords: large language model
Abstract: Graph-based retrieval-augmented generation (RAG) enables large language models (LLMs) to ground responses with structured external knowledge from up-to-date knowledge graphs (KGs) and reduce hallucinations. However, LLMs often rely on a weak retriever in graph-based RAG: I) Due to the lack of ground truth, the retriever is often trained on weak supervision, which often introduces spurious signals to the LLMs. II) Due to the abstraction of graph data, the retrieved knowledge is often presented in unorganized forms. To mitigate the issue, we present Refined Graph-based RAG (ReG) to align weak retrievers to LLMs for graph-based RAG. Specifically, ReG incorporates LLM feedback to get rid of spurious signals and improve the quality of the supervision. Meanwhile, ReG introduces a structure-aware reorganization module to refactor the retrieval results into logically coherent evidence chains. Experiments on prominent benchmarks demonstrate that ReG significantly and consistently brings improvements across different LLM backbones by up to 10%. The improved supervision quality enables ReG to match the state-of-the-art performance with 5% training data and to transfer to out-of-distribution KGs. Notably, when adopted to reasoning-based LLMs, ReG reduces the reasoning token cost by up to 30% and improves the performance by up to 4%.

Title: A Survey on Model Extraction Attacks and Defenses for Large Language Models

Authors: Kaixiang Zhao, Lincan Li, Kaize Ding, Neil Zhenqiang Gong, Yue Zhao, Yushun Dong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22521
Pdf URL: https://arxiv.org/pdf/2506.22521
Copy Paste: [[2506.22521]] A Survey on Model Extraction Attacks and Defenses for Large Language Models(https://arxiv.org/abs/2506.22521)
Keywords: security, privacy, protect, defense, attack, steal, extraction, transformer, generative, large language model
Abstract: Model extraction attacks pose significant security threats to deployed language models, potentially compromising intellectual property and user privacy. This survey provides a comprehensive taxonomy of LLM-specific extraction attacks and defenses, categorizing attacks into functionality extraction, training data extraction, and prompt-targeted attacks. We analyze various attack methodologies including API-based knowledge distillation, direct querying, parameter recovery, and prompt stealing techniques that exploit transformer architectures. We then examine defense mechanisms organized into model protection, data privacy protection, and prompt-targeted strategies, evaluating their effectiveness across different deployment scenarios. We propose specialized metrics for evaluating both attack effectiveness and defense performance, addressing the specific challenges of generative language models. Through our analysis, we identify critical limitations in current approaches and propose promising research directions, including integrated attack methodologies and adaptive defense mechanisms that balance security with model utility. This work serves NLP researchers, ML engineers, and security professionals seeking to protect language models in production environments.

Title: MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

Authors: Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22557
Pdf URL: https://arxiv.org/pdf/2506.22557
Copy Paste: [[2506.22557]] MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs(https://arxiv.org/abs/2506.22557)
Keywords: defense, attack, robust, large language model
Abstract: The growing capabilities of large language models (LLMs) have exposed them to increasingly sophisticated jailbreak attacks. Among these, obfuscation-based attacks -- which encrypt malicious content to evade detection -- remain highly effective. By leveraging the reasoning ability of advanced LLMs to interpret encrypted prompts, such attacks circumvent conventional defenses that rely on keyword detection or context filtering. These methods are very difficult to defend against, as existing safety mechanisms are not designed to interpret or decode ciphered content. In this work, we propose \textbf{MetaCipher}, a novel obfuscation-based jailbreak framework, along with a reinforcement learning-based dynamic cipher selection mechanism that adaptively chooses optimal encryption strategies from a cipher pool. This approach enhances jailbreak effectiveness and generalizability across diverse task types, victim LLMs, and safety guardrails. Our framework is modular and extensible by design, supporting arbitrary cipher families and accommodating evolving adversarial strategies. We complement our method with a large-scale empirical analysis of cipher performance across multiple victim LLMs. Within as few as 10 queries, MetaCipher achieves over 92\% attack success rate (ASR) on most recent standard malicious prompt benchmarks against state-of-the-art non-reasoning LLMs, and over 74\% ASR against reasoning-capable LLMs, outperforming all existing obfuscation-based jailbreak methods. These results highlight the long-term robustness and adaptability of our approach, making it more resilient than prior methods in the face of advancing safety measures.

Title: Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation

Authors: Shansong Wang, Zhecheng Jin, Mingzhe Hu, Mojtaba Safari, Feng Zhao, Chih-Wei Chang, Richard LJ Qiu, Justin Roper, David S. Yu, Xiaofeng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22567
Pdf URL: https://arxiv.org/pdf/2506.22567
Copy Paste: [[2506.22567]] Unifying Biomedical Vision-Language Expertise: Towards a Generalist Foundation Model via Multi-CLIP Knowledge Distillation(https://arxiv.org/abs/2506.22567)
Keywords: robust
Abstract: CLIP models pretrained on natural images with billion-scale image-text pairs have demonstrated impressive capabilities in zero-shot classification, cross-modal retrieval, and open-ended visual answering. However, transferring this success to biomedicine is hindered by the scarcity of large-scale biomedical image-text corpora, the heterogeneity of image modalities, and fragmented data standards across institutions. These limitations hinder the development of a unified and generalizable biomedical foundation model trained from scratch. To overcome this, we introduce MMKD-CLIP, a generalist biomedical foundation model developed via Multiple Medical CLIP Knowledge Distillation. Rather than relying on billion-scale raw data, MMKD-CLIP distills knowledge from nine state-of-the-art domain-specific or generalist biomedical CLIP models, each pretrained on millions of biomedical image-text pairs. Our two-stage training pipeline first performs CLIP-style pretraining on over 2.9 million biomedical image-text pairs from 26 image modalities, followed by feature-level distillation using over 19.2 million feature pairs extracted from teacher models. We evaluate MMKD-CLIP on 58 diverse biomedical datasets, encompassing over 10.8 million biomedical images across nine image modalities. The evaluation spans six core task types: zero-shot classification, linear probing, cross-modal retrieval, visual question answering, survival prediction, and cancer diagnosis. MMKD-CLIP consistently outperforms all teacher models while demonstrating remarkable robustness and generalization across image domains and task settings. These results underscore that multi-teacher knowledge distillation is a scalable and effective paradigm for building high-performing biomedical foundation models under the practical constraints of real-world data availability.

Title: Dual Atrous Separable Convolution for Improving Agricultural Semantic Segmentation

Authors: Chee Mei Ling, Thangarajah Akilan, Aparna Ravinda Phalke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22570
Pdf URL: https://arxiv.org/pdf/2506.22570
Copy Paste: [[2506.22570]] Dual Atrous Separable Convolution for Improving Agricultural Semantic Segmentation(https://arxiv.org/abs/2506.22570)
Keywords: transformer, segmentation
Abstract: Agricultural image semantic segmentation is a pivotal component of modern agriculture, facilitating accurate visual data analysis to improve crop management, optimize resource utilization, and boost overall productivity. This study proposes an efficient image segmentation method for precision agriculture, focusing on accurately delineating farmland anomalies to support informed decision-making and proactive interventions. A novel Dual Atrous Separable Convolution (DAS Conv) module is integrated within the DeepLabV3-based segmentation framework. The DAS Conv module is meticulously designed to achieve an optimal balance between dilation rates and padding size, thereby enhancing model performance without compromising efficiency. The study also incorporates a strategic skip connection from an optimal stage in the encoder to the decoder to bolster the model's capacity to capture fine-grained spatial features. Despite its lower computational complexity, the proposed model outperforms its baseline and achieves performance comparable to highly complex transformer-based state-of-the-art (SOTA) models on the Agriculture Vision benchmark dataset. It achieves more than 66% improvement in efficiency when considering the trade-off between model complexity and performance, compared to the SOTA model. This study highlights an efficient and effective solution for improving semantic segmentation in remote sensing applications, offering a computationally lightweight model capable of high-quality performance in agricultural imagery.

Title: The Hidden Link Between RLHF and Contrastive Learning

Authors: Xufei Lv, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu, Kehai Chen
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22578
Pdf URL: https://arxiv.org/pdf/2506.22578
Copy Paste: [[2506.22578]] The Hidden Link Between RLHF and Contrastive Learning(https://arxiv.org/abs/2506.22578)
Keywords: large language model
Abstract: Alignment of large language models (LLMs) with human values has recently garnered significant attention, with prominent examples including the canonical yet costly Reinforcement Learning from Human Feedback (RLHF) and the simple Direct Preference Optimization (DPO). In this work, we demonstrate that both RLHF and DPO can be interpreted from the perspective of mutual information (MI) maximization, uncovering a profound connection to contrastive learning. Within this framework, both RLHF and DPO can be viewed as methods that perform contrastive learning based on the positive and negative samples derived from the base model, leveraging the Donsker-Varadhan (DV) lower bound on MI (equivalently, the MINE estimator). This paradigm further explains why RLHF may not intrinsically incentivize reasoning capacities in LLMs beyond what is already present in the base model. Building on this perspective, we replace the DV/MINE bound with the Jensen-Shannon MI estimator and propose Mutual Information Optimization (MIO). Comprehensive theoretical analysis and extensive empirical evaluations demonstrate that MIO mitigates the late-stage decline in chosen-likelihood observed in DPO, achieving competitive or superior performance across various challenging reasoning and mathematical benchmarks. We will release the model and code upon acceptance.

Title: LIGHT: Multi-Modal Text Linking on Historical Maps

Authors: Yijun Lin, Rhett Olson, Junhan Wu, Yao-Yi Chiang, Jerod Weinman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22589
Pdf URL: https://arxiv.org/pdf/2506.22589
Copy Paste: [[2506.22589]] LIGHT: Multi-Modal Text Linking on Historical Maps(https://arxiv.org/abs/2506.22589)
Keywords: robust
Abstract: Text on historical maps provides valuable information for studies in history, economics, geography, and other related fields. Unlike structured or semi-structured documents, text on maps varies significantly in orientation, reading order, shape, and placement. Many modern methods can detect and transcribe text regions, but they struggle to effectively ``link'' the recognized text fragments, e.g., determining a multi-word place name. Existing layout analysis methods model word relationships to improve text understanding in structured documents, but they primarily rely on linguistic features and neglect geometric information, which is essential for handling map text. To address these challenges, we propose LIGHT, a novel multi-modal approach that integrates linguistic, image, and geometric features for linking text on historical maps. In particular, LIGHT includes a geometry-aware embedding module that encodes the polygonal coordinates of text regions to capture polygon shapes and their relative spatial positions on an image. LIGHT unifies this geometric information with the visual and linguistic token embeddings from LayoutLMv3, a pretrained layout analysis model. LIGHT uses the cross-modal information to predict the reading-order successor of each text instance directly with a bi-directional learning strategy that enhances sequence robustness. Experimental results show that LIGHT outperforms existing methods on the ICDAR 2024/2025 MapText Competition data, demonstrating the effectiveness of multi-modal learning for historical map text linking.

Title: BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data

Authors: Arunkumar Kannan, Martin A. Lindquist, Brian Caffo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22591
Pdf URL: https://arxiv.org/pdf/2506.22591
Copy Paste: [[2506.22591]] BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data(https://arxiv.org/abs/2506.22591)
Keywords: transformer
Abstract: Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes, sparking significant interest in the neuroimaging community. However, existing approaches, primarily based on convolutional neural networks or transformer architectures, often struggle to model the complex relationships inherent in fMRI data, limited by their inability to capture long-range spatial and temporal dependencies. To overcome these shortcomings, we introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range spatiotemporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships across the deep features processed by the Mamba block. Extensive experiments on two large-scale public datasets, UKBioBank and the Human Connectome Project, demonstrate that BrainMT achieves state-of-the-art performance on both classification (sex prediction) and regression (cognitive intelligence prediction) tasks, outperforming existing methods by a significant margin. Our code and implementation details will be made publicly available at this this https URL

Title: RExBench: Can coding agents autonomously implement AI research extensions?

Authors: Nicholas Edwards, Yukyung Lee, Yujun (Audrey)Mao, Yulu Qin, Sebastian Schuster, Najoung Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22598
Pdf URL: https://arxiv.org/pdf/2506.22598
Copy Paste: [[2506.22598]] RExBench: Can coding agents autonomously implement AI research extensions?(https://arxiv.org/abs/2506.22598)
Keywords: robust, large language model
Abstract: Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research pipeline in machine learning and the natural sciences. We argue that research extension and its implementation is a critical capability for such systems, and introduce RExBench to support the evaluation of this capability. RExBench is a benchmark consisting of 12 realistic research experiment implementation tasks that aim to investigate research hypotheses that have not previously been implemented. Each task is set up as an extension to an existing research paper and codebase, accompanied by domain expert-written instructions. RExBench is robust to data contamination, and supports an automatic evaluation infrastructure that executes agent outputs to determine whether the success criteria are met. We use this benchmark to evaluate nine LLM agents implemented using three different frameworks: aider, Claude Code, and OpenHands. We find that all agents evaluated fail to autonomously implement the majority of the extensions. Although the success rate improves with additional human-written hints, the best performance under this setting remains below 40%. This indicates that current agents are still short of being able to handle realistic research extension tasks without substantial human guidance.

Title: Are Fast Methods Stable in Adversarially Robust Transfer Learning?

Authors: Joshua C. Zhao, Saurabh Bagchi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22602
Pdf URL: https://arxiv.org/pdf/2506.22602
Copy Paste: [[2506.22602]] Are Fast Methods Stable in Adversarially Robust Transfer Learning?(https://arxiv.org/abs/2506.22602)
Keywords: robust
Abstract: Transfer learning is often used to decrease the computational cost of model training, as fine-tuning a model allows a downstream task to leverage the features learned from the pre-training dataset and quickly adapt them to a new task. This is particularly useful for achieving adversarial robustness, as adversarially training models from scratch is very computationally expensive. However, high robustness in transfer learning still requires adversarial training during the fine-tuning phase, which requires up to an order of magnitude more time than standard fine-tuning. In this work, we revisit the use of the fast gradient sign method (FGSM) in robust transfer learning to improve the computational cost of adversarial fine-tuning. We surprisingly find that FGSM is much more stable in adversarial fine-tuning than when training from scratch. In particular, FGSM fine-tuning does not suffer from any issues with catastrophic overfitting at standard perturbation budgets of $\varepsilon=4$ or $\varepsilon=8$. This stability is further enhanced with parameter-efficient fine-tuning methods, where FGSM remains stable even up to $\varepsilon=32$ for linear probing. We demonstrate how this stability translates into performance across multiple datasets. Compared to fine-tuning with the more commonly used method of projected gradient descent (PGD), on average, FGSM only loses 0.39% and 1.39% test robustness for $\varepsilon=4$ and $\varepsilon=8$ while using $4\times$ less training time. Surprisingly, FGSM may not only be a significantly more efficient alternative to PGD in adversarially robust transfer learning but also a well-performing one.

Title: A User-Centric, Privacy-Preserving, and Verifiable Ecosystem for Personal Data Management and Utilization

Authors: Osama Zafar, Mina Namazi, Yuqiao Xu, Youngjin Yoo, Erman Ayday
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22606
Pdf URL: https://arxiv.org/pdf/2506.22606
Copy Paste: [[2506.22606]] A User-Centric, Privacy-Preserving, and Verifiable Ecosystem for Personal Data Management and Utilization(https://arxiv.org/abs/2506.22606)
Keywords: secure, security, privacy, robust, federate
Abstract: In the current paradigm of digital personalized services, the centralized management of personal data raises significant privacy concerns, security vulnerabilities, and diminished individual autonomy over sensitive information. Despite their efficiency, traditional centralized architectures frequently fail to satisfy rigorous privacy requirements and expose users to data breaches and unauthorized access risks. This pressing challenge calls for a fundamental paradigm shift in methodologies for collecting, storing, and utilizing personal data across diverse sectors, including education, healthcare, and finance. This paper introduces a novel decentralized, privacy-preserving architecture that handles heterogeneous personal information, ranging from educational credentials to health records and financial data. Unlike traditional models, our system grants users complete data ownership and control, allowing them to selectively share information without compromising privacy. The architecture's foundation comprises advanced privacy-enhancing technologies, including secure enclaves and federated learning, enabling secure computation, verification, and data sharing. The system supports diverse functionalities, including local computation, model training, and privacy-preserving data sharing, while ensuring data credibility and robust user privacy.

Title: Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks

Authors: Badr Youbi Idrissi, Monica Millunzi, Amelia Sorrenti, Lorenzo Baraldi, Daryna Dementieva
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22623
Pdf URL: https://arxiv.org/pdf/2506.22623
Copy Paste: [[2506.22623]] Temperature Matters: Enhancing Watermark Robustness Against Paraphrasing Attacks(https://arxiv.org/abs/2506.22623)
Keywords: attack, robust, watermark, large language model
Abstract: In the present-day scenario, Large Language Models (LLMs) are establishing their presence as powerful instruments permeating various sectors of society. While their utility offers valuable support to individuals, there are multiple concerns over potential misuse. Consequently, some academic endeavors have sought to introduce watermarking techniques, characterized by the inclusion of markers within machine-generated text, to facilitate algorithmic identification. This research project is focused on the development of a novel methodology for the detection of synthetic text, with the overarching goal of ensuring the ethical application of LLMs in AI-driven text generation. The investigation commences with replicating findings from a previous baseline study, thereby underscoring its susceptibility to variations in the underlying generation model. Subsequently, we propose an innovative watermarking approach and subject it to rigorous evaluation, employing paraphrased generated text to asses its robustness. Experimental results highlight the robustness of our proposal compared to the~\cite{aarson} watermarking method.

Title: Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

Authors: Zuyao You, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22624
Pdf URL: https://arxiv.org/pdf/2506.22624
Copy Paste: [[2506.22624]] Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning(https://arxiv.org/abs/2506.22624)
Keywords: segmentation
Abstract: We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in the next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 achieves remarkable performance with purely RL-based training, achieving .873 S-measure on COD10K without complex model modification. Moreover, we found that pure RL training demonstrates strong open-world generalization. Despite being trained solely on foreground segmentation image-mask pairs without text supervision, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg test and 56.7 gIoU on ReasonSeg test, outperforming models fully supervised on these datasets.

Title: CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Authors: Haoxuan Wang, Zhenghao Zhao, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22637
Pdf URL: https://arxiv.org/pdf/2506.22637
Copy Paste: [[2506.22637]] CaO$_2$: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation(https://arxiv.org/abs/2506.22637)
Keywords: diffusion
Abstract: The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO$_2$), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.

Title: Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training

Authors: Aadim Nepal, Safal Shrestha, Anubhav Shrestha, Minwu Kim, Keith Ross
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22638
Pdf URL: https://arxiv.org/pdf/2506.22638
Copy Paste: [[2506.22638]] Layer Importance for Mathematical Reasoning is Forged in Pre-Training and Invariant after Post-Training(https://arxiv.org/abs/2506.22638)
Keywords: transformer, large language model
Abstract: Large language models can exhibit improved mathematical reasoning capabilities following post-training with instruction tuning, reinforcement learning, or knowledge distillation. However, it remains unclear whether these improvements are driven by major changes in transformer layers or from minor adjustments that leave the relative layer importance structures of the base model largely unchanged. We investigate this question through systematic layer-wise ablation experiments, examining base, instruction-tuned, knowledge-distilled, and reinforcement learning variants on mathematical reasoning benchmarks. Our findings show that mathematical reasoning gives rise to a specific layer importance structure, and this structure persists across all post-training paradigms. Removal of such layers causes accuracy drops of up to 80%. In contrast, non-mathematical tasks like factual recall exhibit no critical layers. This distinction suggests that mathematical reasoning requires specialized layers that emerge during pre-training, while other non-reasoning tasks do not. From an information-theoretic perspective, we also observe that these critical layers are the same layers where major representational transformation occurs.

Title: Fingerprinting SDKs for Mobile Apps and Where to Find Them: Understanding the Market for Device Fingerprinting

Authors: Michael A. Specter, Mihai Christodorescu, Abbie Farr, Bo Ma, Robin Lassonde, Xiaoyang Xu, Xiang Pan, Fengguo Wei, Saswat Anand, Dave Kleidermacher
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.22639
Pdf URL: https://arxiv.org/pdf/2506.22639
Copy Paste: [[2506.22639]] Fingerprinting SDKs for Mobile Apps and Where to Find Them: Understanding the Market for Device Fingerprinting(https://arxiv.org/abs/2506.22639)
Keywords: security, privacy
Abstract: This paper presents a large-scale analysis of fingerprinting-like behavior in the mobile application ecosystem. We take a market-based approach, focusing on third-party tracking as enabled by applications' common use of third-party SDKs. Our dataset consists of over 228,000 SDKs from popular Maven repositories, 178,000 Android applications collected from the Google Play store, and our static analysis pipeline detects exfiltration of over 500 individual signals. To the best of our knowledge, this represents the largest-scale analysis of SDK behavior undertaken to date. We find that Ads SDKs (the ostensible focus of industry efforts such as Apple's App Tracking Transparency and Google's Privacy Sandbox) appear to be the source of only 30.56% of the fingerprinting behaviors. A surprising 23.92% originate from SDKs whose purpose was unknown or unclear. Furthermore, Security and Authentication SDKs are linked to only 11.7% of likely fingerprinting instances. These results suggest that addressing fingerprinting solely in specific market-segment contexts like advertising may offer incomplete benefit. Enforcing anti-fingerprinting policies is also complex, as we observe a sparse distribution of signals and APIs used by likely fingerprinting SDKs. For instance, only 2% of exfiltrated APIs are used by more than 75% of SDKs, making it difficult to rely on user permissions to control fingerprinting behavior.

Title: VERA: Variational Inference Framework for Jailbreaking Large Language Models

Authors: Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang
Subjects: cs.CR, cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22666
Pdf URL: https://arxiv.org/pdf/2506.22666
Copy Paste: [[2506.22666]] VERA: Variational Inference Framework for Jailbreaking Large Language Models(https://arxiv.org/abs/2506.22666)
Keywords: attack, large language model
Abstract: The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM's posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.

Title: 3D Shape Generation: A Survey

Authors: Nicolas Caytuiro, Ivan Sipiran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22678
Pdf URL: https://arxiv.org/pdf/2506.22678
Copy Paste: [[2506.22678]] 3D Shape Generation: A Survey(https://arxiv.org/abs/2506.22678)
Keywords: generative
Abstract: Recent advances in deep learning have significantly transformed the field of 3D shape generation, enabling the synthesis of complex, diverse, and semantically meaningful 3D objects. This survey provides a comprehensive overview of the current state of the art in 3D shape generation, organizing the discussion around three core components: shape representations, generative modeling approaches, and evaluation protocols. We begin by categorizing 3D representations into explicit, implicit, and hybrid setups, highlighting their structural properties, advantages, and limitations. Next, we review a wide range of generation methods, focusing on feedforward architectures. We further summarize commonly used datasets and evaluation metrics that assess fidelity, diversity, and realism of generated shapes. Finally, we identify open challenges and outline future research directions that could drive progress in controllable, efficient, and high-quality 3D shape generation. This survey aims to serve as a valuable reference for researchers and practitioners seeking a structured and in-depth understanding of this rapidly evolving field.

Title: Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions

Authors: Ankush Raut, Projna Paromita, Sydney Begerowski, Suzanne Bell, Theodora Chaspari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22679
Pdf URL: https://arxiv.org/pdf/2506.22679
Copy Paste: [[2506.22679]] Assessing the feasibility of Large Language Models for detecting micro-behaviors in team interactions during space missions(https://arxiv.org/abs/2506.22679)
Keywords: large language model
Abstract: We explore the feasibility of large language models (LLMs) in detecting subtle expressions of micro-behaviors in team conversations using transcripts collected during simulated space missions. Specifically, we examine zero-shot classification, fine-tuning, and paraphrase-augmented fine-tuning with encoder-only sequence classification LLMs, as well as few-shot text generation with decoder-only causal language modeling LLMs, to predict the micro-behavior associated with each conversational turn (i.e., dialogue). Our findings indicate that encoder-only LLMs, such as RoBERTa and DistilBERT, struggled to detect underrepresented micro-behaviors, particularly discouraging speech, even with weighted fine-tuning. In contrast, the instruction fine-tuned version of Llama-3.1, a decoder-only LLM, demonstrated superior performance, with the best models achieving macro F1-scores of 44% for 3-way classification and 68% for binary classification. These results have implications for the development of speech technologies aimed at analyzing team communication dynamics and enhancing training interventions in high-stakes environments such as space missions, particularly in scenarios where text is the only accessible data.

Title: Mitigating Semantic Collapse in Generative Personalization with a Surprisingly Simple Test-Time Embedding Adjustment

Authors: Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2506.22685
Pdf URL: https://arxiv.org/pdf/2506.22685
Copy Paste: [[2506.22685]] Mitigating Semantic Collapse in Generative Personalization with a Surprisingly Simple Test-Time Embedding Adjustment(https://arxiv.org/abs/2506.22685)
Keywords: generative
Abstract: In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V^*$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V^*$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V^*$" but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimisation, which allows the learned embedding $V^*$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at this https URL.

Title: Residual Matrix Transformers: Scaling the Size of the Residual Stream

Authors: Brian Mak, Jeffrey Flanigan
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22696
Pdf URL: https://arxiv.org/pdf/2506.22696
Copy Paste: [[2506.22696]] Residual Matrix Transformers: Scaling the Size of the Residual Stream(https://arxiv.org/abs/2506.22696)
Keywords: transformer
Abstract: The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972, Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at this https URL.

Title: Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report

Authors: Emily Dux Speltz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22698
Pdf URL: https://arxiv.org/pdf/2506.22698
Copy Paste: [[2506.22698]] Text Production and Comprehension by Human and Artificial Intelligence: Interdisciplinary Workshop Report(https://arxiv.org/abs/2506.22698)
Keywords: large language model
Abstract: This report synthesizes the outcomes of a recent interdisciplinary workshop that brought together leading experts in cognitive psychology, language learning, and artificial intelligence (AI)-based natural language processing (NLP). The workshop, funded by the National Science Foundation, aimed to address a critical knowledge gap in our understanding of the relationship between AI language models and human cognitive processes in text comprehension and composition. Through collaborative dialogue across cognitive, linguistic, and technological perspectives, workshop participants examined the underlying processes involved when humans produce and comprehend text, and how AI can both inform our understanding of these processes and augment human capabilities. The workshop revealed emerging patterns in the relationship between large language models (LLMs) and human cognition, with highlights on both the capabilities of LLMs and their limitations in fully replicating human-like language understanding and generation. Key findings include the potential of LLMs to offer insights into human language processing, the increasing alignment between LLM behavior and human language processing when models are fine-tuned with human feedback, and the opportunities and challenges presented by human-AI collaboration in language tasks. By synthesizing these findings, this report aims to guide future research, development, and implementation of LLMs in cognitive psychology, linguistics, and education. It emphasizes the importance of ethical considerations and responsible use of AI technologies while striving to enhance human capabilities in text comprehension and production through effective human-AI collaboration.

Title: General Autonomous Cybersecurity Defense: Learning Robust Policies for Dynamic Topologies and Diverse Attackers

Authors: Arun Ramamurthy, Neil Dhir
Subjects: cs.CR, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22706
Pdf URL: https://arxiv.org/pdf/2506.22706
Copy Paste: [[2506.22706]] General Autonomous Cybersecurity Defense: Learning Robust Policies for Dynamic Topologies and Diverse Attackers(https://arxiv.org/abs/2506.22706)
Keywords: security, defense, attack, robust
Abstract: In the face of evolving cyber threats such as malware, ransomware and phishing, autonomous cybersecurity defense (ACD) systems have become essential for real-time threat detection and response with optional human intervention. However, existing ACD systems rely on limiting assumptions, particularly the stationarity of the underlying network dynamics. In real-world scenarios, network topologies can change due to actions taken by attackers or defenders, system failures, or time evolution of networks, leading to failures in the adaptive capabilities of current defense agents. Moreover, many agents are trained on static environments, resulting in overfitting to specific topologies, which hampers their ability to generalize to out-of-distribution network topologies. This work addresses these challenges by exploring methods for developing agents to learn generalizable policies across dynamic network environments -- general ACD (GACD).

Title: FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets

Authors: Shrenik Jadhav, Birva Sevak, Srijita Das, Akhtar Hussain, Wencong Su, Van-Hai Bui
Subjects: cs.LG, econ.GN, eess.SY
Abstract URL: https://arxiv.org/abs/2506.22708
Pdf URL: https://arxiv.org/pdf/2506.22708
Copy Paste: [[2506.22708]] FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets(https://arxiv.org/abs/2506.22708)
Keywords: robust, fair, large language model
Abstract: Peer-to-peer (P2P) trading is increasingly recognized as a key mechanism for decentralized market regulation, yet existing approaches often lack robust frameworks to ensure fairness. This paper presents FairMarket-RL, a novel hybrid framework that combines Large Language Models (LLMs) with Reinforcement Learning (RL) to enable fairness-aware trading agents. In a simulated P2P microgrid with multiple sellers and buyers, the LLM acts as a real-time fairness critic, evaluating each trading episode using two metrics: Fairness-To-Buyer (FTB) and Fairness-Between-Sellers (FBS). These fairness scores are integrated into agent rewards through scheduled {\lambda}-coefficients, forming an adaptive LLM-guided reward shaping loop that replaces brittle, rule-based fairness constraints. Agents are trained using Independent Proximal Policy Optimization (IPPO) and achieve equitable outcomes, fulfilling over 90% of buyer demand, maintaining fair seller margins, and consistently reaching FTB and FBS scores above 0.80. The training process demonstrates that fairness feedback improves convergence, reduces buyer shortfalls, and narrows profit disparities between sellers. With its language-based critic, the framework scales naturally, and its extension to a large power distribution system with household prosumers illustrates its practical applicability. FairMarket-RL thus offers a scalable, equity-driven solution for autonomous trading in decentralized energy systems.

Title: Generalized Linear Mode Connectivity for Transformers

Authors: Alexander Theus, Alessandro Cabodi, Sotiris Anagnostidis, Antonio Orvieto, Sidak Pal Singh, Valentina Boeva
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22712
Pdf URL: https://arxiv.org/pdf/2506.22712
Copy Paste: [[2506.22712]] Generalized Linear Mode Connectivity for Transformers(https://arxiv.org/abs/2506.22712)
Keywords: transformer
Abstract: Understanding the geometry of neural network loss landscapes is a central question in deep learning, with implications for generalization and optimization. A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths, despite appearing to lie in separate loss basins. However, this is often obscured by symmetries in parameter space -- such as neuron permutations -- which make functionally equivalent models appear dissimilar. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope and fail to capture the richer symmetries exhibited by modern architectures such as Transformers. In this work, we introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, orthogonal transformations, and general invertible maps -- broadening the set of valid reparameterizations and subsuming many previous approaches as special cases. Crucially, this generalization enables, for the first time, the discovery of low- and zero-barrier linear interpolation paths between independently trained Vision Transformers and GPT-2 models. These results reveal deeper structure in the loss landscape and underscore the importance of symmetry-aware analysis for understanding model space geometry.

Title: BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute

Authors: Dujian Ding, Ankur Mallick, Shaokun Zhang, Chi Wang, Daniel Madrigal, Mirian Del Carmen Hipolito Garcia, Menglin Xia, Laks V.S. Lakshmanan, Qingyun Wu, Victor Rühle
Subjects: cs.LG, cs.AI, cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2506.22716
Pdf URL: https://arxiv.org/pdf/2506.22716
Copy Paste: [[2506.22716]] BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute(https://arxiv.org/abs/2506.22716)
Keywords: large language model
Abstract: Large language models (LLMs) are powerful tools but are often expensive to deploy at scale. LLM query routing mitigates this by dynamically assigning queries to models of varying cost and quality to obtain a desired trade-off. Prior query routing approaches generate only one response from the selected model and a single response from a small (inexpensive) model was often not good enough to beat a response from a large (expensive) model due to which they end up overusing the large model and missing out on potential cost savings. However, it is well known that for small models, generating multiple responses and selecting the best can enhance quality while remaining cheaper than a single large-model response. We leverage this idea to propose BEST-Route, a novel routing framework that chooses a model and the number of responses to sample from it based on query difficulty and the quality thresholds. Experiments on real-world datasets demonstrate that our method reduces costs by up to 60% with less than 1% performance drop.

Title: Part Segmentation and Motion Estimation for Articulated Objects with Dynamic 3D Gaussians

Authors: Jun-Jee Chao, Qingyuan Jiang, Volkan Isler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22718
Pdf URL: https://arxiv.org/pdf/2506.22718
Copy Paste: [[2506.22718]] Part Segmentation and Motion Estimation for Articulated Objects with Dynamic 3D Gaussians(https://arxiv.org/abs/2506.22718)
Keywords: robust, segmentation
Abstract: Part segmentation and motion estimation are two fundamental problems for articulated object motion analysis. In this paper, we present a method to solve these two problems jointly from a sequence of observed point clouds of a single articulated object. The main challenge in our problem setting is that the point clouds are not assumed to be generated by a fixed set of moving points. Instead, each point cloud in the sequence could be an arbitrary sampling of the object surface at that particular time step. Such scenarios occur when the object undergoes major occlusions, or if the dataset is collected using measurements from multiple sensors asynchronously. In these scenarios, methods that rely on tracking point correspondences are not appropriate. We present an alternative approach based on a compact but effective representation where we represent the object as a collection of simple building blocks modeled as 3D Gaussians. We parameterize the Gaussians with time-dependent rotations, translations, and scales that are shared across all time steps. With our representation, part segmentation can be achieved by building correspondences between the observed points and the Gaussians. Moreover, the transformation of each point across time can be obtained by following the poses of the assigned Gaussian (even when the point is not observed). Experiments show that our method outperforms existing methods that solely rely on finding point correspondences. Additionally, we extend existing datasets to emulate real-world scenarios by considering viewpoint occlusions. We further demonstrate that our method is more robust to missing points as compared to existing approaches on these challenging datasets, even when some parts are completely occluded in some time-steps. Notably, our part segmentation performance outperforms the state-of-the-art method by 13% on point clouds with occlusions.

Title: Kill Two Birds with One Stone! Trajectory enabled Unified Online Detection of Adversarial Examples and Backdoor Attacks

Authors: Anmin Fu, Fanyu Meng, Huaibing Peng, Hua Ma, Zhi Zhang, Yifeng Zheng, Willy Susilo, Yansong Gao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22722
Pdf URL: https://arxiv.org/pdf/2506.22722
Copy Paste: [[2506.22722]] Kill Two Birds with One Stone! Trajectory enabled Unified Online Detection of Adversarial Examples and Backdoor Attacks(https://arxiv.org/abs/2506.22722)
Keywords: attack
Abstract: The proposed UniGuard is the first unified online detection framework capable of simultaneously addressing adversarial examples and backdoor attacks. UniGuard builds upon two key insights: first, both AE and backdoor attacks have to compromise the inference phase, making it possible to tackle them simultaneously during run-time via online detection. Second, an adversarial input, whether a perturbed sample in AE attacks or a trigger-carrying sample in backdoor attacks, exhibits distinctive trajectory signatures from a benign sample as it propagates through the layers of a DL model in forward inference. The propagation trajectory of the adversarial sample must deviate from that of its benign counterpart; otherwise, the adversarial objective cannot be fulfilled. Detecting these trajectory signatures is inherently challenging due to their subtlety; UniGuard overcomes this by treating the propagation trajectory as a time-series signal, leveraging LSTM and spectrum transformation to amplify differences between adversarial and benign trajectories that are subtle in the time domain. UniGuard exceptional efficiency and effectiveness have been extensively validated across various modalities (image, text, and audio) and tasks (classification and regression), ranging from diverse model architectures against a wide range of AE attacks and backdoor attacks, including challenging partial backdoors and dynamic triggers. When compared to SOTA methods, including ContraNet (NDSS 22) specific for AE detection and TED (IEEE SP 24) specific for backdoor detection, UniGuard consistently demonstrates superior performance, even when matched against each method's strengths in addressing their respective threats-each SOTA fails to parts of attack strategies while UniGuard succeeds for all.

Title: The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure

Authors: Niyati Bafna, Tianjian Li, Kenton Murray, David R. Mortensen, David Yarowsky, Hale Sirin, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22724
Pdf URL: https://arxiv.org/pdf/2506.22724
Copy Paste: [[2506.22724]] The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure(https://arxiv.org/abs/2506.22724)
Keywords: interpretability, large language model
Abstract: Multilingual generation with large language models (LLMs) is often of poor quality for mid- to low-resource languages. Building on insights from interpretability, we demonstrate the existence of an implicit task-solving-->translation pipeline for generation, whereby the model first solves the required task in a largely target-language-agnostic manner, and subsequently translates answer concepts into the intended target language. We hypothesize that the failure of the translation stage is an important culprit for the observed low quality of final outputs, and formalize this as the translation barrier hypothesis. We test this hypothesis for a word translation task across 108 language pairs, using logit lens to observe model processing in intermediate layers. We find that a significant portion of overall failures indeed stems from translation failure, or the model's inability to translate correctly solved intermediate concepts into the target language. This is especially true for low-resource target languages. Our results highlight an important hurdle for end-to-end multilingual generation, and lend guiding insights for future work seeking to improve multilinguality in LLMs.

Title: Convergent Privacy Framework with Contractive GNN Layers for Multi-hop Aggregations

Authors: Yu Zheng, Chenang Li, Zhou Li, Qingsong Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.22727
Pdf URL: https://arxiv.org/pdf/2506.22727
Copy Paste: [[2506.22727]] Convergent Privacy Framework with Contractive GNN Layers for Multi-hop Aggregations(https://arxiv.org/abs/2506.22727)
Keywords: security, privacy, protect
Abstract: Differential privacy (DP) has been integrated into graph neural networks (GNNs) to protect sensitive structural information, e.g., edges, nodes, and associated features across various applications. A common approach is to perturb the message-passing process, which forms the core of most GNN architectures. However, existing methods typically incur a privacy cost that grows linearly with the number of layers (Usenix Security'23), ultimately requiring excessive noise to maintain a reasonable privacy level. This limitation becomes particularly problematic when deep GNNs are necessary to capture complex and long-range interactions in graphs. In this paper, we theoretically establish that the privacy budget can converge with respect to the number of layers by applying privacy amplification techniques to the message-passing process, exploiting the contractive properties inherent to standard GNN operations. Motivated by this analysis, we propose a simple yet effective Contractive Graph Layer (CGL) that ensures the contractiveness required for theoretical guarantees while preserving model utility. Our framework, CARIBOU, supports both training and inference, equipped with a contractive aggregation module, a privacy allocation module, and a privacy auditing module. Experimental evaluations demonstrate that CARIBOU significantly improves the privacy-utility trade-off and achieves superior performance in privacy auditing tasks.

Title: Robust Tensor Completion via Gradient Tensor Nulclear L1-L2 Norm for Traffic Data Recovery

Authors: Hao Shu, Jicheng Li, Tianyv Lei, Lijun Sun
Subjects: cs.LG, eess.SP, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22732
Pdf URL: https://arxiv.org/pdf/2506.22732
Copy Paste: [[2506.22732]] Robust Tensor Completion via Gradient Tensor Nulclear L1-L2 Norm for Traffic Data Recovery(https://arxiv.org/abs/2506.22732)
Keywords: robust
Abstract: In real-world scenarios, spatiotemporal traffic data frequently experiences dual degradation from missing values and noise caused by sensor malfunctions and communication failures. Therefore, effective data recovery methods are essential to ensure the reliability of downstream data-driven applications. while classical tensor completion methods have been widely adopted, they are incapable of modeling noise, making them unsuitable for complex scenarios involving simultaneous data missingness and noise interference. Existing Robust Tensor Completion (RTC) approaches offer potential solutions by separately modeling the actual tensor data and noise. However, their effectiveness is often constrained by the over-relaxation of convex rank surrogates and the suboptimal utilization of local consistency, leading to inadequate model accuracy. To address these limitations, we first introduce the tensor L1-L2 norm, a novel non-convex tensor rank surrogate that functions as an effective low-rank representation tool. Leveraging an advanced feature fusion strategy, we further develop the gradient tensor L1-L2 norm by incorporating the tensor L1-L2 norm in the gradient domain. By integrating the gradient tensor nuclear L1-L2 norm into the RTC framework, we propose the Robust Tensor Completion via Gradient Tensor Nuclear L1-L2 Norm (RTC-GTNLN) model, which not only fully exploits both global low-rankness and local consistency without trade-off parameter, but also effectively handles the dual degradation challenges of missing data and noise in traffic data. Extensive experiments conducted on multiple real-world traffic datasets demonstrate that the RTC-GTNLN model consistently outperforms existing state-of-the-art methods in complex recovery scenarios involving simultaneous missing values and noise.

Title: Enhancing Android Malware Detection with Retrieval-Augmented Generation

Authors: Saraga S., Anagha M. S., Dincy R. Arikkat, Rafidha Rehiman K. A., Serena Nicolazzo, Antonino Nocera, Vinod P
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.22750
Pdf URL: https://arxiv.org/pdf/2506.22750
Copy Paste: [[2506.22750]] Enhancing Android Malware Detection with Retrieval-Augmented Generation(https://arxiv.org/abs/2506.22750)
Keywords: security, privacy, attack, transformer
Abstract: The widespread use of Android applications has made them a prime target for cyberattacks, significantly increasing the risk of malware that threatens user privacy, security, and device functionality. Effective malware detection is thus critical, with static analysis, dynamic analysis, and Machine Learning being widely used approaches. In this work, we focus on a Machine Learning-based method utilizing static features. We first compiled a dataset of benign and malicious APKs and performed static analysis to extract features such as code structure, permissions, and manifest file content, without executing the apps. Instead of relying solely on raw static features, our system uses an LLM to generate high-level functional descriptions of APKs. To mitigate hallucinations, which are a known vulnerability of LLM, we integrated Retrieval-Augmented Generation (RAG), enabling the LLM to ground its output in relevant context. Using carefully designed prompts, we guide the LLM to produce coherent function summaries, which are then analyzed using a transformer-based model, improving detection accuracy over conventional feature-based methods for malware detection.

Title: Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

Authors: Jianing Zhang, Jiayi Zhu, Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22753
Pdf URL: https://arxiv.org/pdf/2506.22753
Copy Paste: [[2506.22753]] Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography(https://arxiv.org/abs/2506.22753)
Keywords: diffusion
Abstract: Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, a lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside \textit{pseudo} data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: this https URL.

Title: RoboPearls: Editable Video Simulation for Robot Manipulation

Authors: Tao Tang, Likui Zhang, Youpeng Wen, Kaidong Zhang, Jia-Wang Bian, xia zhou, Tianyi Yan, Kun Zhan, Peng Jia, Hefeng Wu, Liang Lin, Xiaodan Liang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.22756
Pdf URL: https://arxiv.org/pdf/2506.22756
Copy Paste: [[2506.22756]] RoboPearls: Editable Video Simulation for Robot Manipulation(https://arxiv.org/abs/2506.22756)
Keywords: large language model
Abstract: The development of generalist robot manipulation policies has seen significant progress, driven by large-scale demonstration data across diverse environments. However, the high cost and inefficiency of collecting real-world demonstrations hinder the scalability of data acquisition. While existing simulation platforms enable controlled environments for robotic learning, the challenge of bridging the sim-to-real gap remains. To address these challenges, we propose RoboPearls, an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations from demonstration videos, and supports a wide range of simulation operators, including various object manipulations, powered by advanced modules like Incremental Semantic Distillation (ISD) and 3D regularized NNFM Loss (3D-NNFM). Moreover, by incorporating large language models (LLMs), RoboPearls automates the simulation production process in a user-friendly manner through flexible command interpretation and execution. Furthermore, RoboPearls employs a vision-language model (VLM) to analyze robotic learning issues to close the simulation loop for performance enhancement. To demonstrate the effectiveness of RoboPearls, we conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot, which demonstrate our satisfactory simulation performance.

Title: VSRM: A Robust Mamba-Based Framework for Video Super-Resolution

Authors: Dinh Phu Tran, Dao Duy Hung, Daeyoung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22762
Pdf URL: https://arxiv.org/pdf/2506.22762
Copy Paste: [[2506.22762]] VSRM: A Robust Mamba-Based Framework for Video Super-Resolution(https://arxiv.org/abs/2506.22762)
Keywords: robust, transformer
Abstract: Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel \textbf{V}ideo \textbf{S}uper-\textbf{R}esolution framework that leverages the power of \textbf{M}amba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.

Title: Multimodal Atmospheric Super-Resolution With Deep Generative Models

Authors: Dibyajyoti Chakraborty, Haiwen Guan, Jason Stock, Troy Arcomano, Guido Cervone, Romit Maulik
Subjects: cs.LG, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2506.22780
Pdf URL: https://arxiv.org/pdf/2506.22780
Copy Paste: [[2506.22780]] Multimodal Atmospheric Super-Resolution With Deep Generative Models(https://arxiv.org/abs/2506.22780)
Keywords: diffusion, generative
Abstract: Score-based diffusion modeling is a generative machine learning algorithm that can be used to sample from complex distributions. They achieve this by learning a score function, i.e., the gradient of the log-probability density of the data, and reversing a noising process using the same. Once trained, score-based diffusion models not only generate new samples but also enable zero-shot conditioning of the generated samples on observed data. This promises a novel paradigm for data and model fusion, wherein the implicitly learned distributions of pretrained score-based diffusion models can be updated given the availability of online data in a Bayesian formulation. In this article, we apply such a concept to the super-resolution of a high-dimensional dynamical system, given the real-time availability of low-resolution and experimentally observed sparse sensor measurements from multimodal data. Additional analysis on how score-based sampling can be used for uncertainty estimates is also provided. Our experiments are performed for a super-resolution task that generates the ERA5 atmospheric dataset given sparse observations from a coarse-grained representation of the same and/or from unstructured experimental observations of the IGRA radiosonde dataset. We demonstrate accurate recovery of the high dimensional state given multiple sources of low-fidelity measurements. We also discover that the generative model can balance the influence of multiple dataset modalities during spatiotemporal reconstructions.

Title: PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection

Authors: Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22783
Pdf URL: https://arxiv.org/pdf/2506.22783
Copy Paste: [[2506.22783]] PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection(https://arxiv.org/abs/2506.22783)
Keywords: attack, generative
Abstract: Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. It highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and precise localization beyond existing models as a scalable solution.

Title: Single-Frame Point-Pixel Registration via Supervised Cross-Modal Feature Matching

Authors: Yu Han, Zhiwei Huang, Yanting Zhang, Fangjun Ding, Shen Cai, Rui Fan
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.22784
Pdf URL: https://arxiv.org/pdf/2506.22784
Copy Paste: [[2506.22784]] Single-Frame Point-Pixel Registration via Supervised Cross-Modal Feature Matching(https://arxiv.org/abs/2506.22784)
Keywords: robust
Abstract: Point-pixel registration between LiDAR point clouds and camera images is a fundamental yet challenging task in autonomous driving and robotic perception. A key difficulty lies in the modality gap between unstructured point clouds and structured images, especially under sparse single-frame LiDAR settings. Existing methods typically extract features separately from point clouds and images, then rely on hand-crafted or learned matching strategies. This separate encoding fails to bridge the modality gap effectively, and more critically, these methods struggle with the sparsity and noise of single-frame LiDAR, often requiring point cloud accumulation or additional priors to improve reliability. Inspired by recent progress in detector-free matching paradigms (e.g. MatchAnything), we revisit the projection-based approach and introduce the detector-free framework for direct point-pixel matching between LiDAR and camera views. Specifically, we project the LiDAR intensity map into a 2D view from the LiDAR perspective and feed it into an attention-based detector-free matching network, enabling cross-modal correspondence estimation without relying on multi-frame accumulation. To further enhance matching reliability, we introduce a repeatability scoring mechanism that acts as a soft visibility prior. This guides the network to suppress unreliable matches in regions with low intensity variation, improving robustness under sparse input. Extensive experiments on KITTI, nuScenes, and MIAS-LCEC-TF70 benchmarks demonstrate that our method achieves state-of-the-art performance, outperforming prior approaches on nuScenes (even those relying on accumulated point clouds), despite using only single-frame LiDAR.

Title: What's Privacy Good for? Measuring Privacy as a Shield from Harms due to Personal Data Use

Authors: Sri Harsha Gajavalli, Junichi Koizumi, Rakibul Hasan
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2506.22787
Pdf URL: https://arxiv.org/pdf/2506.22787
Copy Paste: [[2506.22787]] What's Privacy Good for? Measuring Privacy as a Shield from Harms due to Personal Data Use(https://arxiv.org/abs/2506.22787)
Keywords: privacy
Abstract: We propose a harm-centric conceptualization of privacy that asks: What harms from personal data use can privacy prevent? The motivation behind this research is limitations in existing privacy frameworks (e.g., Contextual Integrity) to capture or categorize many of the harms that arise from modern technology's use of personal data. We operationalize this conceptualization in an online study with 400 college and university students. Study participants indicated their perceptions of different harms (e.g., manipulation, discrimination, and harassment) that may arise when artificial intelligence-based algorithms infer personal data (e.g., demographics, personality traits, and cognitive disability) and use it to identify students who are likely to drop out of a course or the best job candidate. The study includes 14 harms and six types of personal data selected based on an extensive literature review. Comprehensive statistical analyses of the study data show that the 14 harms are internally consistent and collectively represent a general notion of privacy harms. The study data also surfaces nuanced perceptions of harms, both across the contexts and participants' demographic factors. Based on these results, we discuss how privacy can be improved equitably. Thus, this research not only contributes to enhancing the understanding of privacy as a concept but also provides practical guidance to improve privacy in the context of education and employment.

Title: ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Authors: Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2506.22791
Pdf URL: https://arxiv.org/pdf/2506.22791
Copy Paste: [[2506.22791]] ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models(https://arxiv.org/abs/2506.22791)
Keywords: large language model
Abstract: Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation of real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.

Title: RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Authors: Sicong Du, Jiarun Liu, Qifeng Chen, Hao-Xiang Chen, Tai-Jiang Mu, Sheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22800
Pdf URL: https://arxiv.org/pdf/2506.22800
Copy Paste: [[2506.22800]] RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors(https://arxiv.org/abs/2506.22800)
Keywords: diffusion
Abstract: A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expanding a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjust Gaussian optimization progress according to scene converge metrics, which achieving better convergence than baseline methods. Extensive evaluations of publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality. Our source-code will be made publicly available at this https URL. (Camera-ready version incorporating reviewer suggestions will be updated soon.)

Title: Riemannian-Geometric Fingerprints of Generative Models

Authors: Hae Jin Song, Laurent Itti
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.22802
Pdf URL: https://arxiv.org/pdf/2506.22802
Copy Paste: [[2506.22802]] Riemannian-Geometric Fingerprints of Generative Models(https://arxiv.org/abs/2506.22802)
Keywords: protect, generative
Abstract: Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training ("regurgitative training"), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models' fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.

Title: Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

Authors: Nuoye Xiong, Anqi Dong, Ning Wang, Cong Hua, Guangming Zhu, Mei Lin, Peiyi Shen, Liang Zhang
Subjects: cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22803
Pdf URL: https://arxiv.org/pdf/2506.22803
Copy Paste: [[2506.22803]] Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding(https://arxiv.org/abs/2506.22803)
Keywords: interpretability, transformer
Abstract: Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at sample-level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64% and a maximum increase in average accuracy across 1.03%. Source code is available at: this https URL.

Title: Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate

Authors: Byung Hyun Lee, Sungjin Lim, Seunggyu Lee, Dong Un Kang, Se Young Chun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22806
Pdf URL: https://arxiv.org/pdf/2506.22806
Copy Paste: [[2506.22806]] Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate(https://arxiv.org/abs/2506.22806)
Keywords: attack, robust, diffusion
Abstract: Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding \emph{linear} modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding \emph{nonlinear} Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts. Code is available at this https URL

Title: FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition

Authors: Yueyang Li, Shengyu Gong, Weiming Zeng, Nizhuan Wang, Wai Ting Siok
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22807
Pdf URL: https://arxiv.org/pdf/2506.22807
Copy Paste: [[2506.22807]] FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition(https://arxiv.org/abs/2506.22807)
Keywords: robust, transformer
Abstract: Electroencephalography (EEG) serves as a reliable and objective signal for emotion recognition in affective brain-computer interfaces, offering unique advantages through its high temporal resolution and ability to capture authentic emotional states that cannot be consciously controlled. However, cross-subject generalization remains a fundamental challenge due to individual variability, cognitive traits, and emotional responses. We propose FreqDGT, a frequency-adaptive dynamic graph transformer that systematically addresses these limitations through an integrated framework. FreqDGT introduces frequency-adaptive processing (FAP) to dynamically weight emotion-relevant frequency bands based on neuroscientific evidence, employs adaptive dynamic graph learning (ADGL) to learn input-specific brain connectivity patterns, and implements multi-scale temporal disentanglement network (MTDN) that combines hierarchical temporal transformers with adversarial feature disentanglement to capture both temporal dynamics and ensure cross-subject robustness. Comprehensive experiments demonstrate that FreqDGT significantly improves cross-subject emotion recognition accuracy, confirming the effectiveness of integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical modeling while ensuring robustness to individual differences. The code is available at this https URL.

Title: MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs

Authors: Jianhui Wei, Zijie Meng, Zikai Xiao, Tianxiang Hu, Yang Feng, Zhijie Zhou, Jian Wu, Zuozhu Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22808
Pdf URL: https://arxiv.org/pdf/2506.22808
Copy Paste: [[2506.22808]] MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs(https://arxiv.org/abs/2506.22808)
Keywords: large language model
Abstract: While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces $\textbf{MedEthicsQA}$, a comprehensive benchmark comprising $\textbf{5,623}$ multiple-choice questions and $\textbf{5,351}$ open-ended questions for evaluation of medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset with a low error rate ($2.72\%$). Evaluation of state-of-the-art MedLLMs exhibit declined performance in answering medical ethics questions compared to their foundation counterparts, elucidating the deficiencies of medical ethics alignment. The dataset, registered under CC BY-NC 4.0 license, is available at this https URL.

Title: BayesLoRA: Task-Specific Uncertainty in Low-Rank Adapters

Authors: Cooper Doyle
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22809
Pdf URL: https://arxiv.org/pdf/2506.22809
Copy Paste: [[2506.22809]] BayesLoRA: Task-Specific Uncertainty in Low-Rank Adapters(https://arxiv.org/abs/2506.22809)
Keywords: transformer
Abstract: We propose BayesLoRA, a task-specific uncertainty quantification framework that integrates MC-Dropout into Low-Rank Adapters (LoRA). Unlike general-purpose transformer uncertainty methods, BayesLoRA provides guardrails tailored to downstream workflows, enabling agents to introspect and modulate behavior under uncertainty. We demonstrate mathematically and empirically that LoRA adapters exhibit amplified variance outside fine-tuning distributions, yielding reliable confidence estimates for agentic decision-making.

Title: Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models

Authors: Zhuojun Ding, Wei Wei, Chenghao Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22813
Pdf URL: https://arxiv.org/pdf/2506.22813
Copy Paste: [[2506.22813]] Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models(https://arxiv.org/abs/2506.22813)
Keywords: extraction, large language model
Abstract: Supervised fine-tuning (SFT) is widely used to align large language models (LLMs) with information extraction (IE) tasks, such as named entity recognition (NER). However, annotating such fine-grained labels and training domain-specific models is costly. Existing works typically train a unified model across multiple domains, but such approaches lack adaptation and scalability since not all training data benefits target domains and scaling trained models remains challenging. We propose the SaM framework, which dynamically Selects and Merges expert models at inference time. Specifically, for a target domain, we select domain-specific experts pre-trained on existing domains based on (i) domain similarity to the target domain and (ii) performance on sampled instances, respectively. The experts are then merged to create task-specific models optimized for the target domain. By dynamically merging experts beneficial to target domains, we improve generalization across various domains without extra training. Additionally, experts can be added or removed conveniently, leading to great scalability. Extensive experiments on multiple benchmarks demonstrate our framework's effectiveness, which outperforms the unified model by an average of 10%. We further provide insights into potential improvements, practical experience, and extensions of our framework.

Title: Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding

Authors: Xingyilang Yin, Jiale Wang, Xi Yang, Mutian Xu, Xu Gu, Nannan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22817
Pdf URL: https://arxiv.org/pdf/2506.22817
Copy Paste: [[2506.22817]] Unleashing the Multi-View Fusion Potential: Noise Correction in VLM for Open-Vocabulary 3D Scene Understanding(https://arxiv.org/abs/2506.22817)
Keywords: segmentation
Abstract: Recent open-vocabulary 3D scene understanding approaches mainly focus on training 3D networks through contrastive learning with point-text pairs or by distilling 2D features into 3D models via point-pixel alignment. While these methods show considerable performance in benchmarks with limited vocabularies, they struggle to handle diverse object categories as the limited amount of 3D data upbound training strong open-vocabulary 3d models. We observe that 2D multi-view fusion methods take precedence in understanding diverse concepts in 3D scenes. However, inherent noises in vision-language models lead multi-view fusion to sub-optimal performance. To this end, we introduce MVOV3D, a novel approach aimed at unleashing the potential of 2D multi-view fusion for open-vocabulary 3D scene understanding. We focus on reducing the inherent noises without training, thereby preserving the generalizability while enhancing open-world capabilities. Specifically, MVOV3D improves multi-view 2D features by leveraging precise region-level image features and text features encoded by CLIP encoders and incorporates 3D geometric priors to optimize multi-view fusion. Extensive experiments on various datasets demonstrate the effectiveness of our method. Notably, our MVOV3D achieves a new record with 14.7% mIoU on ScanNet200 and 16.2% mIoU on Matterport160 for challenge open-vocabulary semantic segmentation, outperforming current leading trained 3D networks by a significant margin.

Title: Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration

Authors: Ramya Hebbalaguppe, Tamoghno Kandar, Abhinav Nagpal, Chetan Arora
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22819
Pdf URL: https://arxiv.org/pdf/2506.22819
Copy Paste: [[2506.22819]] Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration(https://arxiv.org/abs/2506.22819)
Keywords: large language model
Abstract: Vision-language models (VLM) have demonstrated impressive performance in image recognition by leveraging self-supervised training on large datasets. Their performance can be further improved by adapting to the test sample using test-time prompt tuning (TPT). Unfortunately, the singular focus of TPT approaches on improving the accuracy suffers from tunnel vision, and leads to degradation in confidence calibration. This limits the applicability of TPT in critical applications. We make three contributions in this work. (1) We posit that random or naive initialization of prompts leads to overfitting on a particular test sample, and is the main reason for miscalibration of the VLM after TPT. To mitigate the problem, we propose careful initialization of test time prompt using prior knowledge about the target label attributes from a large language model (LLM); (2) To further maintain the quality of prompts during \tpt, we propose a novel regularization loss to reduce intraclass distance, and increase inter-class distance between the learnt Through extensive experiments on different CLIP architectures and 15 datasets, we show that our approach can effectively improve the calibration after TPT. We report an average expected calibration error (ECE) of 4.11 with our method, TCA, compared to 11.7 for vanilla TPT, 6.12 for C-TPT (ICLR'24), 6.78 for DiffTPT (CVPR'23), and 8.43 for PromptAlign (NeurIPS'23). The code is publicly accessible at: this https URL.

Title: Listener-Rewarded Thinking in VLMs for Image Preferences

Authors: Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22832
Pdf URL: https://arxiv.org/pdf/2506.22832
Copy Paste: [[2506.22832]] Listener-Rewarded Thinking in VLMs for Image Preferences(https://arxiv.org/abs/2506.22832)
Keywords: robust, generative
Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner not only to answer correctly, but to produce explanations that are persuasive to an independent model. Our listener-shaped reward scheme achieves best accuracy on the ImageReward benchmark (67.4%), significantly improves out-of-distribution (OOD) performance on a large-scale human preference dataset (1.2M votes, up to +6% over naive reasoner), and reduces reasoning contradictions compared to strong GRPO and SFT baselines. These results demonstrate that listener-based rewards provide a scalable, data-efficient path to aligning vision-language models with nuanced human preferences. We will release our reasoning model here: this https URL.

Title: SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds

Authors: Shashikant Verma, Shanmuganathan Raman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22833
Pdf URL: https://arxiv.org/pdf/2506.22833
Copy Paste: [[2506.22833]] SemFaceEdit: Semantic Face Editing on Generative Radiance Manifolds(https://arxiv.org/abs/2506.22833)
Keywords: generative
Abstract: Despite multiple view consistency offered by 3D-aware GAN techniques, the resulting images often lack the capacity for localized editing. In response, generative radiance manifolds emerge as an efficient approach for constrained point sampling within volumes, effectively reducing computational demands and enabling the learning of fine details. This work introduces SemFaceEdit, a novel method that streamlines the appearance and geometric editing process by generating semantic fields on generative radiance manifolds. Utilizing latent codes, our method effectively disentangles the geometry and appearance associated with different facial semantics within the generated image. In contrast to existing methods that can change the appearance of the entire radiance field, our method enables the precise editing of particular facial semantics while preserving the integrity of other regions. Our network comprises two key modules: the Geometry module, which generates semantic radiance and occupancy fields, and the Appearance module, which is responsible for predicting RGB radiance. We jointly train both modules in adversarial settings to learn semantic-aware geometry and appearance descriptors. The appearance descriptors are then conditioned on their respective semantic latent codes by the Appearance Module, facilitating disentanglement and enhanced control. Our experiments highlight SemFaceEdit's superior performance in semantic field-based editing, particularly in achieving improved radiance field disentanglement.

Title: FOCUS: Fine-grained Optimization with Semantic Guided Understanding for Pedestrian Attributes Recognition

Authors: Hongyan An, Kuan Zhu, Xin He, Haiyun Guo, Chaoyang Zhao, Ming Tang, Jinqiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22836
Pdf URL: https://arxiv.org/pdf/2506.22836
Copy Paste: [[2506.22836]] FOCUS: Fine-grained Optimization with Semantic Guided Understanding for Pedestrian Attributes Recognition(https://arxiv.org/abs/2506.22836)
Keywords: security, extraction
Abstract: Pedestrian attribute recognition (PAR) is a fundamental perception task in intelligent transportation and security. To tackle this fine-grained task, most existing methods focus on extracting regional features to enrich attribute information. However, a regional feature is typically used to predict a fixed set of pre-defined attributes in these methods, which limits the performance and practicality in two aspects: 1) Regional features may compromise fine-grained patterns unique to certain attributes in favor of capturing common characteristics shared across attributes. 2) Regional features cannot generalize to predict unseen attributes in the test time. In this paper, we propose the \textbf{F}ine-grained \textbf{O}ptimization with semanti\textbf{C} g\textbf{U}ided under\textbf{S}tanding (FOCUS) approach for PAR, which adaptively extracts fine-grained attribute-level features for each attribute individually, regardless of whether the attributes are seen or not during training. Specifically, we propose the Multi-Granularity Mix Tokens (MGMT) to capture latent features at varying levels of visual granularity, thereby enriching the diversity of the extracted information. Next, we introduce the Attribute-guided Visual Feature Extraction (AVFE) module, which leverages textual attributes as queries to retrieve their corresponding visual attribute features from the Mix Tokens using a cross-attention mechanism. To ensure that textual attributes focus on the appropriate Mix Tokens, we further incorporate a Region-Aware Contrastive Learning (RACL) method, encouraging attributes within the same region to share consistent attention maps. Extensive experiments on PA100K, PETA, and RAPv1 datasets demonstrate the effectiveness and strong generalization ability of our method.

Title: xLSTMAD: A Powerful xLSTM-based Method for Anomaly Detection

Authors: Kamil Faber, Marcin Pietroń, Dominik Żurek, Roberto Corizzo
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22837
Pdf URL: https://arxiv.org/pdf/2506.22837
Copy Paste: [[2506.22837]] xLSTMAD: A Powerful xLSTM-based Method for Anomaly Detection(https://arxiv.org/abs/2506.22837)
Keywords: transformer
Abstract: The recently proposed xLSTM is a powerful model that leverages expressive multiplicative gating and residual connections, providing the temporal capacity needed for long-horizon forecasting and representation learning. This architecture has demonstrated success in time series forecasting, lossless compression, and even large-scale language modeling tasks, where its linear memory footprint and fast inference make it a viable alternative to Transformers. Despite its growing popularity, no prior work has explored xLSTM for anomaly detection. In this work, we fill this gap by proposing xLSTMAD, the first anomaly detection method that integrates a full encoder-decoder xLSTM architecture, purpose-built for multivariate time series data. Our encoder processes input sequences to capture historical context, while the decoder is devised in two separate variants of the method. In the forecasting approach, the decoder iteratively generates forecasted future values xLSTMAD-F, while the reconstruction approach reconstructs the input time series from its encoded counterpart xLSTMAD-R. We investigate the performance of two loss functions: Mean Squared Error (MSE), and Soft Dynamic Time Warping (SoftDTW) to consider local reconstruction fidelity and global sequence alignment, respectively. We evaluate our method on the comprehensive TSB-AD-M benchmark, which spans 17 real-world datasets, using state-of-the-art challenging metrics such as VUS-PR. In our results, xLSTM showcases state-of-the-art accuracy, outperforming 23 popular anomaly detection baselines. Our paper is the first work revealing the powerful modeling capabilities of xLSTM for anomaly detection, paving the way for exciting new developments on this subject. Our code is available at: this https URL

Title: AG-VPReID 2025: Aerial-Ground Video-based Person Re-identification Challenge Results

Authors: Kien Nguyen, Clinton Fookes, Sridha Sridharan, Huy Nguyen, Feng Liu, Xiaoming Liu, Arun Ross, Dana Michalski, Tamás Endrei, Ivan DeAndres-Tame, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Zijing Gong, Yuhao Wang, Xuehu Liu, Pingping Zhang, Md Rashidunnabi, Hugo Proença, Kailash A. Hambarde, Saeid Rezaei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22843
Pdf URL: https://arxiv.org/pdf/2506.22843
Copy Paste: [[2506.22843]] AG-VPReID 2025: Aerial-Ground Video-based Person Re-identification Challenge Results(https://arxiv.org/abs/2506.22843)
Keywords: transformer
Abstract: Person re-identification (ReID) across aerial and ground vantage points has become crucial for large-scale surveillance and public safety applications. Although significant progress has been made in ground-only scenarios, bridging the aerial-ground domain gap remains a formidable challenge due to extreme viewpoint differences, scale variations, and occlusions. Building upon the achievements of the AG-ReID 2023 Challenge, this paper introduces the AG-VPReID 2025 Challenge - the first large-scale video-based competition focused on high-altitude (80-120m) aerial-ground ReID. Constructed on the new AG-VPReID dataset with 3,027 identities, over 13,500 tracklets, and approximately 3.7 million frames captured from UAVs, CCTV, and wearable cameras, the challenge featured four international teams. These teams developed solutions ranging from multi-stream architectures to transformer-based temporal reasoning and physics-informed modeling. The leading approach, X-TFCLIP from UAM, attained 72.28% Rank-1 accuracy in the aerial-to-ground ReID setting and 70.77% in the ground-to-aerial ReID setting, surpassing existing baselines while highlighting the dataset's complexity. For additional details, please refer to the official website at this https URL.

Title: Quantum Neural Networks for Wind Energy Forecasting: A Comparative Study of Performance and Scalability with Classical Models

Authors: Batuhan Hangun, Oguz Altun, Onder Eyecioglu
Subjects: cs.LG, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2506.22845
Pdf URL: https://arxiv.org/pdf/2506.22845
Copy Paste: [[2506.22845]] Quantum Neural Networks for Wind Energy Forecasting: A Comparative Study of Performance and Scalability with Classical Models(https://arxiv.org/abs/2506.22845)
Keywords: security
Abstract: Quantum Neural Networks (QNNs), a prominent approach in Quantum Machine Learning (QML), are emerging as a powerful alternative to classical machine learning methods. Recent studies have focused on the applicability of QNNs to various tasks, such as time-series forecasting, prediction, and classification, across a wide range of applications, including cybersecurity and medical imaging. With the increased use of smart grids driven by the integration of renewable energy systems, machine learning plays an important role in predicting power demand and detecting system disturbances. This study provides an in-depth investigation of QNNs for predicting the power output of a wind turbine. We assess the predictive performance and simulation time of six QNN configurations that are based on the Z Feature Map for data encoding and varying ansatz structures. Through detailed cross-validation experiments and tests on an unseen hold-out dataset, we experimentally demonstrate that QNNs can achieve predictive performance that is competitive with, and in some cases marginally better than, the benchmarked classical approaches. Our results also reveal the effects of dataset size and circuit complexity on predictive performance and simulation time. We believe our findings will offer valuable insights for researchers in the energy domain who wish to incorporate quantum machine learning into their work.

Title: Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization

Authors: Duygu Altinok
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.22846
Pdf URL: https://arxiv.org/pdf/2506.22846
Copy Paste: [[2506.22846]] Boosting CTC-Based ASR Using LLM-Based Intermediate Loss Regularization(https://arxiv.org/abs/2506.22846)
Keywords: large language model
Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems have revolutionized the field by integrating all components into a single neural network, with attention-based encoder-decoder models achieving state-of-the-art performance. However, their autoregressive decoding process limits inference speed, making them unsuitable for real-time applications. In contrast, CTC-based models offer faster, non-autoregressive decoding but struggle to model linguistic dependencies effectively. Addressing this challenge, we propose a novel auxiliary loss framework called Language-Aware Intermediate Loss (LAIL) to enhance CTC-based ASR using the linguistic knowledge of large language models (LLMs). By attaching connector layers to intermediate encoder layers, LAIL maps outputs to the embedding space of an LLM and computes a causal language modeling loss during training. This approach enhances linguistic modeling while preserving the computational efficiency of CTC decoding. Using the Conformer architecture and various LLaMA models, we demonstrate significant improvements in Word Error Rate (WER) on the LibriSpeech, TEDLIUM2, and WSJ corpora, achieving state-of-the-art performance for CTC-based ASR with minimal computational overhead.

Title: DMD-Net: Deep Mesh Denoising Network

Authors: Aalok Gangopadhyay, Shashikant Verma, Shanmuganathan Raman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22850
Pdf URL: https://arxiv.org/pdf/2506.22850
Copy Paste: [[2506.22850]] DMD-Net: Deep Mesh Denoising Network(https://arxiv.org/abs/2506.22850)
Keywords: robust, transformer
Abstract: We present Deep Mesh Denoising Network (DMD-Net), an end-to-end deep learning framework, for solving the mesh denoising problem. DMD-Net consists of a Graph Convolutional Neural Network in which aggregation is performed in both the primal as well as the dual graph. This is realized in the form of an asymmetric two-stream network, which contains a primal-dual fusion block that enables communication between the primal-stream and the dual-stream. We develop a Feature Guided Transformer (FGT) paradigm, which consists of a feature extractor, a transformer, and a denoiser. The feature extractor estimates the local features, that guide the transformer to compute a transformation, which is applied to the noisy input mesh to obtain a useful intermediate representation. This is further processed by the denoiser to obtain the denoised mesh. Our network is trained on a large scale dataset of 3D objects. We perform exhaustive ablation studies to demonstrate that each component in our network is essential for obtaining the best performance. We show that our method obtains competitive or better results when compared with the state-of-the-art mesh denoising algorithms. We demonstrate that our method is robust to various kinds of noise. We observe that even in the presence of extremely high noise, our method achieves excellent performance.

Title: Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems

Authors: Yucheng Cai, Yuxuan Wu, Yi Huang, Junlan Feng, Zhijian Ou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22852
Pdf URL: https://arxiv.org/pdf/2506.22852
Copy Paste: [[2506.22852]] Knowledge Augmented Finetuning Matters in both RAG and Agent Based Dialog Systems(https://arxiv.org/abs/2506.22852)
Keywords: large language model
Abstract: Large language models (LLMs) have recently been applied to dialog systems. Despite making progress, LLMs are prone to errors in knowledge-intensive scenarios. Recently, approaches based on retrieval augmented generation (RAG) and agent have emerged to improve the factual accuracy by enhancing the LLMs with knowledge retrieved from external knowledge bases (KBs). This is mostly implemented by prompting the LLMs with instructions, examples and the retrieved knowledge. However, LLMs may have difficulty using the retrieved knowledge effectively for response generation, because they are not well trained to do such generation for specific domains. To mitigate this problem, we propose to finetune the LLMs in the RAG-based and agent-based systems with domain-specific data, together with domain-specific external knowledge, which is called knowledge augmented finetuning (KAFT). We base our study on the MobileCS2 dataset, a real-life customer service dialog dataset that features intensive knowledge interactions, to systematically compare the prompting and KAFT techniques in the RAG-based and agent-based systems. Experiment results show that KAFT substantially surpasses prompting in both RAG and agent systems, particularly in terms of factual accuracy. To the best of our knowledge, this paper represents the first solid empirical work to investigate the KAFT idea.

Title: DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

Authors: Kyochul Jang, Donghyeon Lee, Kyusik Kim, Dongseok Heo, Taewhoo Lee, Woojeong Kim, Bongwon Suh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22853
Pdf URL: https://arxiv.org/pdf/2506.22853
Copy Paste: [[2506.22853]] DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues(https://arxiv.org/abs/2506.22853)
Keywords: large language model
Abstract: Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: this https URL.

Title: Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval

Authors: Li-Cheng Shen, Jih-Kang Hsieh, Wei-Hua Li, Chu-Song Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22864
Pdf URL: https://arxiv.org/pdf/2506.22864
Copy Paste: [[2506.22864]] Mask-aware Text-to-Image Retrieval: Referring Expression Segmentation Meets Cross-modal Retrieval(https://arxiv.org/abs/2506.22864)
Keywords: interpretability, large language model, segmentation
Abstract: Text-to-image retrieval (TIR) aims to find relevant images based on a textual query, but existing approaches are primarily based on whole-image captions and lack interpretability. Meanwhile, referring expression segmentation (RES) enables precise object localization based on natural language descriptions but is computationally expensive when applied across large image collections. To bridge this gap, we introduce Mask-aware TIR (MaTIR), a new task that unifies TIR and RES, requiring both efficient image search and accurate object segmentation. To address this task, we propose a two-stage framework, comprising a first stage for segmentation-aware image retrieval and a second stage for reranking and object grounding with a multimodal large language model (MLLM). We leverage SAM 2 to generate object masks and Alpha-CLIP to extract region-level embeddings offline at first, enabling effective and scalable online retrieval. Secondly, MLLM is used to refine retrieval rankings and generate bounding boxes, which are matched to segmentation masks. We evaluate our approach on COCO and D$^3$ datasets, demonstrating significant improvements in both retrieval accuracy and segmentation quality over previous methods.

Title: Region-Aware CAM: High-Resolution Weakly-Supervised Defect Segmentation via Salient Region Perception

Authors: Hang-Cheng Dong, Lu Zou, Bingguo Liu, Dong Ye, Guodong Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22866
Pdf URL: https://arxiv.org/pdf/2506.22866
Copy Paste: [[2506.22866]] Region-Aware CAM: High-Resolution Weakly-Supervised Defect Segmentation via Salient Region Perception(https://arxiv.org/abs/2506.22866)
Keywords: segmentation
Abstract: Surface defect detection plays a critical role in industrial quality inspection. Recent advances in artificial intelligence have significantly enhanced the automation level of detection processes. However, conventional semantic segmentation and object detection models heavily rely on large-scale annotated datasets, which conflicts with the practical requirements of defect detection tasks. This paper proposes a novel weakly supervised semantic segmentation framework comprising two key components: a region-aware class activation map (CAM) and pseudo-label training. To address the limitations of existing CAM methods, especially low-resolution thermal maps, and insufficient detail preservation, we introduce filtering-guided backpropagation (FGBP), which refines target regions by filtering gradient magnitudes to identify areas with higher relevance to defects. Building upon this, we further develop a region-aware weighted module to enhance spatial precision. Finally, pseudo-label segmentation is implemented to refine the model's performance iteratively. Comprehensive experiments on industrial defect datasets demonstrate the superiority of our method. The proposed framework effectively bridges the gap between weakly supervised learning and high-precision defect segmentation, offering a practical solution for resource-constrained industrial scenarios.

Title: STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing

Authors: Junsung Lee, Junoh Kang, Bohyung Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22868
Pdf URL: https://arxiv.org/pdf/2506.22868
Copy Paste: [[2506.22868]] STR-Match: Matching SpatioTemporal Relevance Score for Training-Free Video Editing(https://arxiv.org/abs/2506.22868)
Keywords: diffusion
Abstract: Previous text-guided video editing methods often suffer from temporal inconsistency, motion distortion, and-most notably-limited domain transformation. We attribute these limitations to insufficient modeling of spatiotemporal pixel relevance during the editing process. To address this, we propose STR-Match, a training-free video editing algorithm that produces visually appealing and spatiotemporally coherent videos through latent optimization guided by our novel STR score. The score captures spatiotemporal pixel relevance across adjacent frames by leveraging 2D spatial attention and 1D temporal modules in text-to-video (T2V) diffusion models, without the overhead of computationally expensive 3D attention mechanisms. Integrated into a latent optimization framework with a latent mask, STR-Match generates temporally consistent and visually faithful videos, maintaining strong performance even under significant domain transformations while preserving key visual attributes of the source. Extensive experiments demonstrate that STR-Match consistently outperforms existing methods in both visual quality and spatiotemporal consistency.

Title: P$^2$U: Progressive Precision Update For Efficient Model Distribution

Authors: Homayun Afrabandpey, Hamed Rezazadegan Tavakoli
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.22871
Pdf URL: https://arxiv.org/pdf/2506.22871
Copy Paste: [[2506.22871]] P$^2$U: Progressive Precision Update For Efficient Model Distribution(https://arxiv.org/abs/2506.22871)
Keywords: federate
Abstract: Efficient model distribution is becoming increasingly critical in bandwidth-constrained environments. In this paper, we propose a simple yet effective approach called Progressive Precision Update (P$^2$U) to address this problem. Instead of transmitting the original high-precision model, P$^2$U transmits a lower-bit precision model, coupled with a model update representing the difference between the original high-precision model and the transmitted low precision version. With extensive experiments on various model architectures, ranging from small models ($1 - 6$ million parameters) to a large model (more than $100$ million parameters) and using three different data sets, e.g., chest X-Ray, PASCAL-VOC, and CIFAR-100, we demonstrate that P$^2$U consistently achieves better tradeoff between accuracy, bandwidth usage and latency. Moreover, we show that when bandwidth or startup time is the priority, aggressive quantization (e.g., 4-bit) can be used without severely compromising performance. These results establish P$^2$U as an effective and practical solution for scalable and efficient model distribution in low-resource settings, including federated learning, edge computing, and IoT deployments. Given that P$^2$U complements existing compression techniques and can be implemented alongside any compression method, e.g., sparsification, quantization, pruning, etc., the potential for improvement is even greater.

Title: Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

Authors: Dang Jisheng (1 and 2), Wu Xudong (3), Wang Bimei (4 and 2), Lv Ning (1), Chen Jiayu (1), Jingwen Zhao (3), Yichu liu (5), Jizhao Liu (1), Juncheng Li (6), Teng Wang (7) ((1) Lanzhou University, (2) National University of Singapore, (3) Sun Yat-sen University, (4) Jinan University, (5) South China University of Technology, (6) Zhejiang University, (7) The University of Hong Kong )
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22880
Pdf URL: https://arxiv.org/pdf/2506.22880
Copy Paste: [[2506.22880]] Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder(https://arxiv.org/abs/2506.22880)
Keywords: large language model, segmentation
Abstract: Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, first, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states that generated by a large language model into distinct textual and visual feature subspaces. Finally, a dynamic mask fusion strategy synergistically combines these decoupled features through triple supervision from predicted text/visual masks and ground-truth annotations. Extensive experiments demonstrate state-of-the-art performance across diverse tasks, including image segmentation, image question answering, video segmentation, and video question answering. Our codes are available at this https URL.

Title: CP-Guard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems

Authors: Senkang Hu, Yihang Tao, Guowen Xu, Xinyuan Qian, Yiqin Deng, Xianhao Chen, Sam Tak Wu Kwong, Yuguang Fang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2506.22890
Pdf URL: https://arxiv.org/pdf/2506.22890
Copy Paste: [[2506.22890]] CP-Guard: A Unified, Probability-Agnostic, and Adaptive Framework for Malicious Agent Detection and Defense in Multi-Agent Embodied Perception Systems(https://arxiv.org/abs/2506.22890)
Keywords: defense, attack, segmentation
Abstract: Collaborative Perception (CP) has been shown to be a promising technique for multi-agent autonomous driving and multi-agent robotic systems, where multiple agents share their perception information to enhance the overall perception performance and expand the perception range. However, in CP, an ego agent needs to receive messages from its collaborators, which makes it vulnerable to attacks from malicious agents. To address this critical issue, we propose a unified, probability-agnostic, and adaptive framework, namely, CP-Guard, which is a tailored defense mechanism for CP deployed by each agent to accurately detect and eliminate malicious agents in its collaboration network. Our key idea is to enable CP to reach a consensus rather than a conflict against an ego agent's perception results. Based on this idea, we first develop a probability-agnostic sample consensus (PASAC) method to effectively sample a subset of the collaborators and verify the consensus without prior probabilities of malicious agents. Furthermore, we define collaborative consistency loss (CCLoss) for object detection task and bird's eye view (BEV) segmentation task to capture the discrepancy between an ego agent and its collaborators, which is used as a verification criterion for consensus. In addition, we propose online adaptive threshold via dual sliding windows to dynamically adjust the threshold for consensus verification and ensure the reliability of the systems in dynamic environments. Finally, we conduct extensive experiments and demonstrate the effectiveness of our framework. Code will be released at this https URL

Title: Interpretable Time Series Autoregression for Periodicity Quantification

Authors: Xinyu Chen, Vassilis Digalakis Jr, Lijun Ding, Dingyi Zhuang, Jinhua Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22895
Pdf URL: https://arxiv.org/pdf/2506.22895
Copy Paste: [[2506.22895]] Interpretable Time Series Autoregression for Periodicity Quantification(https://arxiv.org/abs/2506.22895)
Keywords: interpretability
Abstract: Time series autoregression is a classical statistical model for capturing auto-correlations and identifying temporal patterns such as periodicity and seasonality. In this work, we propose a novel sparse autoregression framework from an interpretable machine learning perspective and the model interpretability for periodicity quantification is reinforced by $\ell_0$-norm induced sparsity constraints. On the time-varying time series data, we reformulate the sparse autoregression and convert the involved optimization problem into a mixed-integer optimization (MIO). To accelerate it, we develop a subspace pursuit based decision variable pruning (DVP) strategy to reduce the search space. On the multidimensional time series that involves complicated spatial and temporal dimensions, we propose a spatially- and time-varying sparse autoregression model and resolve the corresponding MIO problem by developing a two-stage optimization scheme. In particular, the proposed scheme makes the model scalable to large problems even with millions of decision variables. Empirically, we conduct extensive experiments to evaluate the proposed models on real-world time series data. First, we demonstrate that the MIO solver can be drastically accelerated through the DVP strategy, while maintaining the same solution quality as a full MIO solver. Applying the time-varying sparse autoregression model to ridesharing trip data, we uncover both daily and weekly periodicities and reveal long-term changes in regularity of human mobility. Second, we demonstrate the spatial patterns of yearly seasonality in climate variable time series such as temperature and precipitation across the past four decades, and our model allows to discover dynamic climate patterns and identify climate phenomena such as El Nino in sea surface temperature.

Title: Neural Cellular Automata: From Cells to Pixels

Authors: Ehsan Pajouheshgar, Yitao Xu, Ali Abbasi, Alexander Mordvintsev, Wenzel Jakob, Sabine Süsstrunk
Subjects: cs.CV, cs.GR, cs.LG, cs.MA, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22899
Pdf URL: https://arxiv.org/pdf/2506.22899
Copy Paste: [[2506.22899]] Neural Cellular Automata: From Cells to Pixels(https://arxiv.org/abs/2506.22899)
Keywords: robust
Abstract: Neural Cellular Automata (NCAs) are bio-inspired systems in which identical cells self-organize to form complex and coherent patterns by repeatedly applying simple local rules. NCAs display striking emergent behaviors including self-regeneration, generalization and robustness to unseen situations, and spontaneous motion. Despite their success in texture synthesis and morphogenesis, NCAs remain largely confined to low-resolution grids. This limitation stems from (1) training time and memory requirements that grow quadratically with grid size, (2) the strictly local propagation of information which impedes long-range cell communication, and (3) the heavy compute demands of real-time inference at high resolution. In this work, we overcome this limitation by pairing NCA with a tiny, shared implicit decoder, inspired by recent advances in implicit neural representations. Following NCA evolution on a coarse grid, a lightweight decoder renders output images at arbitrary resolution. We also propose novel loss functions for both morphogenesis and texture synthesis tasks, specifically tailored for high-resolution output with minimal memory and computation overhead. Combining our proposed architecture and loss functions brings substantial improvement in quality, efficiency, and performance. NCAs equipped with our implicit decoder can generate full-HD outputs in real time while preserving their self-organizing, emergent properties. Moreover, because each MLP processes cell states independently, inference remains highly parallelizable and efficient. We demonstrate the applicability of our approach across multiple NCA variants (on 2D, 3D grids, and 3D meshes) and multiple tasks, including texture generation and morphogenesis (growing patterns from a seed), showing that with our proposed framework, NCAs seamlessly scale to high-resolution outputs with minimal computational overhead.

Title: Point Cloud Compression and Objective Quality Assessment: A Survey

Authors: Yiling Xu, Yujie Zhang, Shuting Xia, Kaifa Yang, He Huang, Ziyu Shan, Wenjie Huang, Qi Yang, Le Yang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.22902
Pdf URL: https://arxiv.org/pdf/2506.22902
Copy Paste: [[2506.22902]] Point Cloud Compression and Objective Quality Assessment: A Survey(https://arxiv.org/abs/2506.22902)
Keywords: extraction
Abstract: The rapid growth of 3D point cloud data, driven by applications in autonomous driving, robotics, and immersive environments, has led to criticals demand for efficient compression and quality assessment techniques. Unlike traditional 2D media, point clouds present unique challenges due to their irregular structure, high data volume, and complex attributes. This paper provides a comprehensive survey of recent advances in point cloud compression (PCC) and point cloud quality assessment (PCQA), emphasizing their significance for real-time and perceptually relevant applications. We analyze a wide range of handcrafted and learning-based PCC algorithms, along with objective PCQA metrics. By benchmarking representative methods on emerging datasets, we offer detailed comparisons and practical insights into their strengths and limitations. Despite notable progress, challenges such as enhancing visual fidelity, reducing latency, and supporting multimodal data remain. This survey outlines future directions, including hybrid compression frameworks and advanced feature extraction strategies, to enable more efficient, immersive, and intelligent 3D applications.

Title: MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances

Authors: Yunzhe Shao, Xinyu Yi, Lu Yin, Shihui Guo, Junhai Yong, Feng Xu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2506.22907
Pdf URL: https://arxiv.org/pdf/2506.22907
Copy Paste: [[2506.22907]] MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances(https://arxiv.org/abs/2506.22907)
Keywords: robust
Abstract: This paper proposes a novel method called MagShield, designed to address the issue of magnetic interference in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, limiting their practical application in real-world scenarios. To address this problem, MagShield employs a "detect-then-correct" strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems.

Title: Attention to Burstiness: Low-Rank Bilinear Prompt Tuning

Authors: Yuzhu Wang, Manni Duan, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22908
Pdf URL: https://arxiv.org/pdf/2506.22908
Copy Paste: [[2506.22908]] Attention to Burstiness: Low-Rank Bilinear Prompt Tuning(https://arxiv.org/abs/2506.22908)
Keywords: transformer
Abstract: Visual Prompt Tuning (VPT) is a parameter-efficient fune-tuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover ``burstiness'' in the values arising from the interaction of image patch embeddings, and the key and query projectors within Transformer's self-attention module. Furthermore, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distribution, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance towards more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., $>$25 accuracy points on the CUB dataset; interestingly, it learns ``bursty prompts''. Extending the bilinear model which is known to introduce burstiness, we present a compact, low-rank version by learning two smaller matrices whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments across multiple benchmark datasets demonstrate that BPT methods not only outperform various VPT methods but also reduce parameter count and computation overhead.

Title: Towards Time Series Generation Conditioned on Unstructured Natural Language

Authors: Jaeyun Woo, Jiseok Lee, Brian Kenji Iwana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22927
Pdf URL: https://arxiv.org/pdf/2506.22927
Copy Paste: [[2506.22927]] Towards Time Series Generation Conditioned on Unstructured Natural Language(https://arxiv.org/abs/2506.22927)
Keywords: diffusion, generative
Abstract: Generative Artificial Intelligence (AI) has rapidly become a powerful tool, capable of generating various types of data, such as images and text. However, despite the significant advancement of generative AI, time series generative AI remains underdeveloped, even though the application of time series is essential in finance, climate, and numerous fields. In this research, we propose a novel method of generating time series conditioned on unstructured natural language descriptions. We use a diffusion model combined with a language model to generate time series from the text. Through the proposed method, we demonstrate that time series generation based on natural language is possible. The proposed method can provide various applications such as custom forecasting, time series manipulation, data augmentation, and transfer learning. Furthermore, we construct and propose a new public dataset for time series generation, consisting of 63,010 time series-description pairs.

Title: Towards Explainable Bilingual Multimodal Misinformation Detection and Localization

Authors: Yiwei He, Xiangtai Li, Zhenglin Huang, Yi Dong, Hao Fei, Jiangning Zhang, Baoyuan Wu, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22930
Pdf URL: https://arxiv.org/pdf/2506.22930
Copy Paste: [[2506.22930]] Towards Explainable Bilingual Multimodal Misinformation Detection and Localization(https://arxiv.org/abs/2506.22930)
Keywords: interpretability
Abstract: The increasing realism of multimodal content has made misinformation more subtle and harder to detect, especially in news media where images are frequently paired with bilingual (e.g., Chinese-English) subtitles. Such content often includes localized image edits and cross-lingual inconsistencies that jointly distort meaning while remaining superficially plausible. We introduce BiMi, a bilingual multimodal framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. To support generalization, BiMi integrates an online retrieval module that supplements model reasoning with up-to-date external context. We further release BiMiBench, a large-scale and comprehensive benchmark constructed by systematically editing real news images and subtitles, comprising 104,000 samples with realistic manipulations across visual and linguistic modalities. To enhance interpretability, we apply Group Relative Policy Optimization (GRPO) to improve explanation quality, marking the first use of GRPO in this domain. Extensive experiments demonstrate that BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore, advancing state-of-the-art performance in realistic, multilingual misinformation detection. Code, models, and datasets will be released.

Title: Efficient Cybersecurity Assessment Using SVM and Fuzzy Evidential Reasoning for Resilient Infrastructure

Authors: Zaydon L. Ali, Wassan Saad Abduljabbar Hayale, Israa Ibraheem Al_Barazanchi, Ravi Sekhar, Pritesh Shah, Sushma Parihar
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22938
Pdf URL: https://arxiv.org/pdf/2506.22938
Copy Paste: [[2506.22938]] Efficient Cybersecurity Assessment Using SVM and Fuzzy Evidential Reasoning for Resilient Infrastructure(https://arxiv.org/abs/2506.22938)
Keywords: secure, security, privacy
Abstract: With current advancement in hybermedia knowledges, the privacy of digital information has developed a critical problem. To overawed the susceptibilities of present security protocols, scholars tend to focus mainly on efforts on alternation of current protocols. Over past decade, various proposed encoding models have been shown insecurity, leading to main threats against significant data. Utilizing the suitable encryption model is very vital means of guard against various such, but algorithm is selected based on the dependency of data which need to be secured. Moreover, testing potentiality of the security assessment one by one to identify the best choice can take a vital time for processing. For faster and precisive identification of assessment algorithm, we suggest a security phase exposure model for cipher encryption technique by invoking Support Vector Machine (SVM). In this work, we form a dataset using usual security components like contrast, homogeneity. To overcome the uncertainty in analysing the security and lack of ability of processing data to a risk assessment mechanism. To overcome with such complications, this paper proposes an assessment model for security issues using fuzzy evidential reasoning (ER) approaches. Significantly, the model can be utilised to process and assemble risk assessment data on various aspects in systematic ways. To estimate the performance of our framework, we have various analyses like, recall, F1 score and accuracy.

Title: A Study on Semi-Supervised Detection of DDoS Attacks under Class Imbalance

Authors: Ehsan Hallaji, Vaishnavi Shanmugam, Roozbeh Razavi-Far, Mehrdad Saif
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.22949
Pdf URL: https://arxiv.org/pdf/2506.22949
Copy Paste: [[2506.22949]] A Study on Semi-Supervised Detection of DDoS Attacks under Class Imbalance(https://arxiv.org/abs/2506.22949)
Keywords: security, attack, robust
Abstract: One of the most difficult challenges in cybersecurity is eliminating Distributed Denial of Service (DDoS) attacks. Automating this task using artificial intelligence is a complex process due to the inherent class imbalance and lack of sufficient labeled samples of real-world datasets. This research investigates the use of Semi-Supervised Learning (SSL) techniques to improve DDoS attack detection when data is imbalanced and partially labeled. In this process, 13 state-of-the-art SSL algorithms are evaluated for detecting DDoS attacks in several scenarios. We evaluate their practical efficacy and shortcomings, including the extent to which they work in extreme environments. The results will offer insight into designing intelligent Intrusion Detection Systems (IDSs) that are robust against class imbalance and handle partially labeled data.

Title: Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models

Authors: Liangyu Wang, Huanyi Xie, Xinhai Wang, Tianjin Huang, Mengdi Li, Di Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22950
Pdf URL: https://arxiv.org/pdf/2506.22950
Copy Paste: [[2506.22950]] Infinite Sampling: Efficient and Stable Grouped RL Training for Large Language Models(https://arxiv.org/abs/2506.22950)
Keywords: large language model
Abstract: Group-based reinforcement learning algorithms such as Group Reward Policy Optimization (GRPO) have proven effective for fine-tuning large language models (LLMs) with human feedback. However, generating and storing multiple responses per prompt incurs substantial memory overhead, especially as the sample group size increases, limiting scalability under constrained hardware. We propose Infinite Sampling, a framework that enables efficient and stable GRPO training by decoupling group size from GPU memory usage. It consists of: (1) micro sampling groups that decompose large groups into memory-feasible rounds; (2) continuous sampling that interleaves generation across groups to improve utilization; and (3) a length-aware scheduler combining token-conditioned sequence length prediction with a two-stage plan: global grouping via FPTAS and runtime refill via SJF. Experiments show that our Micro Sampling Groups reduce peak memory usage by over 50% compared to full-group decoding (e.g., from 21.55 GB to 10.64 GB on Qwen3-1.7B). Building on this, Infinite Sampling improves throughput by over 25% compared to the naive micro sampling group method, reducing decoding steps while maintaining full-length completions and memory usage. Our hybrid scheduling ensures efficient and stable GRPO training with larger groups under realistic GPU memory constraints.

Title: YM-WML: A new Yolo-based segmentation Model with Weighted Multi-class Loss for medical imaging

Authors: Haniyeh Nikkhah, Jafar Tanha, Mahdi Zarrin, SeyedEhsan Roshan, Amin Kazempour
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22955
Pdf URL: https://arxiv.org/pdf/2506.22955
Copy Paste: [[2506.22955]] YM-WML: A new Yolo-based segmentation Model with Weighted Multi-class Loss for medical imaging(https://arxiv.org/abs/2506.22955)
Keywords: robust, extraction, segmentation
Abstract: Medical image segmentation poses significant challenges due to class imbalance and the complex structure of medical images. To address these challenges, this study proposes YM-WML, a novel model for cardiac image segmentation. The model integrates a robust backbone for effective feature extraction, a YOLOv11 neck for multi-scale feature aggregation, and an attention-based segmentation head for precise and accurate segmentation. To address class imbalance, we introduce the Weighted Multi-class Exponential (WME) loss function. On the ACDC dataset, YM-WML achieves a Dice Similarity Coefficient of 91.02, outperforming state-of-the-art methods. The model demonstrates stable training, accurate segmentation, and strong generalization, setting a new benchmark in cardiac segmentation tasks.

Title: Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models

Authors: Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.CY, cs.MA
Abstract URL: https://arxiv.org/abs/2506.22957
Pdf URL: https://arxiv.org/pdf/2506.22957
Copy Paste: [[2506.22957]] Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models(https://arxiv.org/abs/2506.22957)
Keywords: robust, large language model
Abstract: As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness which refers to an LLM's ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions-reasoning patterns, linguistic style, and alignment preferences-and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at this https URL.

Title: Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images

Authors: Shreyas Dixit, Ashhar Aziz, Shashwat Bajpai, Vasu Sharma, Aman Chadha, Vinija Jain, Amitava Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22960
Pdf URL: https://arxiv.org/pdf/2506.22960
Copy Paste: [[2506.22960]] Peccavi: Visual Paraphrase Attack Safe and Distortion Free Image Watermarking Technique for AI-Generated Images(https://arxiv.org/abs/2506.22960)
Keywords: attack, watermark, generative
Abstract: A report by the European Union Law Enforcement Agency predicts that by 2026, up to 90 percent of online content could be synthetically generated, raising concerns among policymakers, who cautioned that "Generative AI could act as a force multiplier for political disinformation. The combined effect of generative text, images, videos, and audio may surpass the influence of any single modality." In response, California's Bill AB 3211 mandates the watermarking of AI-generated images, videos, and audio. However, concerns remain regarding the vulnerability of invisible watermarking techniques to tampering and the potential for malicious actors to bypass them entirely. Generative AI-powered de-watermarking attacks, especially the newly introduced visual paraphrase attack, have shown an ability to fully remove watermarks, resulting in a paraphrase of the original image. This paper introduces PECCAVI, the first visual paraphrase attack-safe and distortion-free image watermarking technique. In visual paraphrase attacks, an image is altered while preserving its core semantic regions, termed Non-Melting Points (NMPs). PECCAVI strategically embeds watermarks within these NMPs and employs multi-channel frequency domain watermarking. It also incorporates noisy burnishing to counter reverse-engineering efforts aimed at locating NMPs to disrupt the embedded watermark, thereby enhancing durability. PECCAVI is model-agnostic. All relevant resources and codes will be open-sourced.

Title: ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Authors: Amir Aghdam, Vincent Tao Hu
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.22967
Pdf URL: https://arxiv.org/pdf/2506.22967
Copy Paste: [[2506.22967]] ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment(https://arxiv.org/abs/2506.22967)
Keywords: large language model
Abstract: We address the task of zero-shot fine-grained video classification, where no video examples or temporal annotations are available for unseen action classes. While contrastive vision-language models such as SigLIP demonstrate strong open-set recognition via mean-pooled image-text similarity, they fail to capture the temporal structure critical for distinguishing fine-grained activities. We introduce ActAlign, a zero-shot framework that formulates video classification as sequence alignment. For each class, a large language model generates an ordered sub-action sequence, which is aligned with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on the extremely challenging ActionAtlas benchmark, where human accuracy is only 61.6%. ActAlign outperforms billion-parameter video-language models while using approximately 8x less parameters. These results demonstrate that structured language priors, combined with classical alignment techniques, offer a scalable and general approach to unlocking the open-set recognition potential of vision-language models for fine-grained video understanding.

Title: A Systematic Study of Compositional Syntactic Transformer Language Models

Authors: Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22978
Pdf URL: https://arxiv.org/pdf/2506.22978
Copy Paste: [[2506.22978]] A Systematic Study of Compositional Syntactic Transformer Language Models(https://arxiv.org/abs/2506.22978)
Keywords: transformer
Abstract: Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at this https URL.

Title: Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation

Authors: Jie Liu, Jiayi Shen, Pan Zhou, Jan-Jakob Sonke, Efstratios Gavves
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22979
Pdf URL: https://arxiv.org/pdf/2506.22979
Copy Paste: [[2506.22979]] Probabilistic Prototype Calibration of Vision-Language Models for Generalized Few-shot Semantic Segmentation(https://arxiv.org/abs/2506.22979)
Keywords: segmentation
Abstract: Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototypes learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose FewCLIP, a probabilistic prototype calibration framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, FewCLIP first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, FewCLIP introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that our proposed FewCLIP significantly outperforms state-of-the-art approaches across both GFSS and class-incremental setting. The code is available at this https URL.

Title: Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models

Authors: Atharv Mittal, Agam Pandey, Amritanshu Tiwari, Sukrit Jindal, Swadesh Swain
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22982
Pdf URL: https://arxiv.org/pdf/2506.22982
Copy Paste: [[2506.22982]] Revisiting CroPA: A Reproducibility Study and Enhancements for Cross-Prompt Adversarial Transferability in Vision-Language Models(https://arxiv.org/abs/2506.22982)
Keywords: security, attack, robust
Abstract: Large Vision-Language Models (VLMs) have revolutionized computer vision, enabling tasks such as image classification, captioning, and visual question answering. However, they remain highly vulnerable to adversarial attacks, particularly in scenarios where both visual and textual modalities can be manipulated. In this study, we conduct a comprehensive reproducibility study of "An Image is Worth 1000 Lies: Adversarial Transferability Across Prompts on Vision-Language Models" validating the Cross-Prompt Attack (CroPA) and confirming its superior cross-prompt transferability compared to existing baselines. Beyond replication we propose several key improvements: (1) A novel initialization strategy that significantly improves Attack Success Rate (ASR). (2) Investigate cross-image transferability by learning universal perturbations. (3) A novel loss function targeting vision encoder attention mechanisms to improve generalization. Our evaluation across prominent VLMs -- including Flamingo, BLIP-2, and InstructBLIP as well as extended experiments on LLaVA validates the original results and demonstrates that our improvements consistently boost adversarial effectiveness. Our work reinforces the importance of studying adversarial vulnerabilities in VLMs and provides a more robust framework for generating transferable adversarial examples, with significant implications for understanding the security of VLMs in real-world applications.

Title: Cybersecurity-Focused Anomaly Detection in Connected Autonomous Vehicles Using Machine Learning

Authors: Prathyush Kumar Reddy Lebaku, Lu Gao, Yunpeng Zhang, Zhixia Li, Yongxin Liu, Tanvir Arafin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.22984
Pdf URL: https://arxiv.org/pdf/2506.22984
Copy Paste: [[2506.22984]] Cybersecurity-Focused Anomaly Detection in Connected Autonomous Vehicles Using Machine Learning(https://arxiv.org/abs/2506.22984)
Keywords: security, attack, interpretability
Abstract: Anomaly detection in connected autonomous vehicles (CAVs) is crucial for maintaining safe and reliable transportation networks, as CAVs can be susceptible to sensor malfunctions, cyber-attacks, and unexpected environmental disruptions. This study explores an anomaly detection approach by simulating vehicle behavior, generating a dataset that represents typical and atypical vehicular interactions. The dataset includes time-series data of position, speed, and acceleration for multiple connected autonomous vehicles. We utilized machine learning models to effectively identify abnormal driving patterns. First, we applied a stacked Long Short-Term Memory (LSTM) model to capture temporal dependencies and sequence-based anomalies. The stacked LSTM model processed the sequential data to learn standard driving behaviors. Additionally, we deployed a Random Forest model to support anomaly detection by offering ensemble-based predictions, which enhanced model interpretability and performance. The Random Forest model achieved an R2 of 0.9830, MAE of 5.746, and a 95th percentile anomaly threshold of 14.18, while the stacked LSTM model attained an R2 of 0.9998, MAE of 82.425, and a 95th percentile anomaly threshold of 265.63. These results demonstrate the models' effectiveness in accurately predicting vehicle trajectories and detecting anomalies in autonomous driving scenarios.

Title: A Reinforcement Learning Approach for Optimal Control in Microgrids

Authors: Davide Salaorni, Federico Bianchi, Francesco Trovò, Marcello Restelli
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.22995
Pdf URL: https://arxiv.org/pdf/2506.22995
Copy Paste: [[2506.22995]] A Reinforcement Learning Approach for Optimal Control in Microgrids(https://arxiv.org/abs/2506.22995)
Keywords: robust
Abstract: The increasing integration of renewable energy sources (RESs) is transforming traditional power grid networks, which require new approaches for managing decentralized energy production and consumption. Microgrids (MGs) provide a promising solution by enabling localized control over energy generation, storage, and distribution. This paper presents a novel reinforcement learning (RL)-based methodology for optimizing microgrid energy management. Specifically, we propose an RL agent that learns optimal energy trading and storage policies by leveraging historical data on energy production, consumption, and market prices. A digital twin (DT) is used to simulate the energy storage system dynamics, incorporating degradation factors to ensure a realistic emulation of the analysed setting. Our approach is validated through an experimental campaign using real-world data from a power grid located in the Italian territory. The results indicate that the proposed RL-based strategy outperforms rule-based methods and existing RL benchmarks, offering a robust solution for intelligent microgrid management.

Title: A Novel Frame Identification and Synchronization Technique for Smartphone Visible Light Communication Systems Based on Convolutional Neural Networks

Authors: Vaigai Nayaki Yokar, Hoa Le-Minh, Xicong Li, Wai Lok Woo, Luis Nero Alves, Stanislav Zvanovec, Tran The Son, Zabih Ghassemlooy
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.23004
Pdf URL: https://arxiv.org/pdf/2506.23004
Copy Paste: [[2506.23004]] A Novel Frame Identification and Synchronization Technique for Smartphone Visible Light Communication Systems Based on Convolutional Neural Networks(https://arxiv.org/abs/2506.23004)
Keywords: robust
Abstract: This paper proposes a novel, robust, and lightweight supervised Convolutional Neural Network (CNN)-based technique for frame identification and synchronization, designed to enhance short-link communication performance in a screen-to-camera (S2C) based visible light communication (VLC) system. Developed using Python and the TensorFlow Keras framework, the proposed CNN model was trained through three real-time experimental investigations conducted in Jupyter Notebook. These experiments incorporated a dataset created from scratch to address various real-time challenges in S2C communication, including blurring, cropping, and rotated images in mobility scenarios. Overhead frames were introduced for synchronization, which leads to enhanced system performance. The experimental results demonstrate that the proposed model achieves an overall accuracy of approximately 98.74%, highlighting its effectiveness in identifying and synchronizing frames in S2C VLC systems.

Title: MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models

Authors: Jian Chen, Wenye Ma, Penghang Liu, Wei Wang, Tengwei Song, Ming Li, Chenguang Wang, Ruiyi Zhang, Changyou Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23009
Pdf URL: https://arxiv.org/pdf/2506.23009
Copy Paste: [[2506.23009]] MusiXQA: Advancing Visual Music Understanding in Multimodal Large Language Models(https://arxiv.org/abs/2506.23009)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable visual reasoning abilities in natural images, text-rich documents, and graphic designs. However, their ability to interpret music sheets remains underexplored. To bridge this gap, we introduce MusiXQA, the first comprehensive dataset for evaluating and advancing MLLMs in music sheet understanding. MusiXQA features high-quality synthetic music sheets generated via MusiXTeX, with structured annotations covering note pitch and duration, chords, clefs, key/time signatures, and text, enabling diverse visual QA tasks. Through extensive evaluations, we reveal significant limitations of current state-of-the-art MLLMs in this domain. Beyond benchmarking, we developed Phi-3-MusiX, an MLLM fine-tuned on our dataset, achieving significant performance gains over GPT-based methods. The proposed dataset and model establish a foundation for future advances in MLLMs for music sheet understanding. Code, data, and model will be released upon acceptance.

Title: Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models

Authors: Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, Irina Rish
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23025
Pdf URL: https://arxiv.org/pdf/2506.23025
Copy Paste: [[2506.23025]] Spectra 1.1: Scaling Laws and Efficient Inference for Ternary Language Models(https://arxiv.org/abs/2506.23025)
Keywords: large language model
Abstract: Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. As the computational power of modern GPU architectures continuously improves, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which demonstrate accelerated inference across various CPU architectures. Also, building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5 times compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the Spectra-1.1 suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.

Title: Feature-Wise Mixing for Mitigating Contextual Bias in Predictive Supervised Learning

Authors: Yash Vardhan Tomar
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.23033
Pdf URL: https://arxiv.org/pdf/2506.23033
Copy Paste: [[2506.23033]] Feature-Wise Mixing for Mitigating Contextual Bias in Predictive Supervised Learning(https://arxiv.org/abs/2506.23033)
Keywords: fair
Abstract: Bias in predictive machine learning (ML) models is a fundamental challenge due to the skewed or unfair outcomes produced by biased models. Existing mitigation strategies rely on either post-hoc corrections or rigid constraints. However, emerging research claims that these techniques can limit scalability and reduce generalizability. To address this, this paper introduces a feature-wise mixing framework to mitigate contextual bias. This was done by redistributing feature representations across multiple contextual datasets. To assess feature-wise mixing's effectiveness, four ML classifiers were trained using cross-validation and evaluated with bias-sensitive loss functions, including disparity metrics and mean squared error (MSE), which served as a standard measure of predictive performance. The proposed method achieved an average bias reduction of 43.35% and a statistically significant decrease in MSE across all classifiers trained on mixed datasets. Additionally, benchmarking against established bias mitigation techniques found that feature-wise mixing consistently outperformed SMOTE oversampling and demonstrated competitive effectiveness without requiring explicit bias attribute identification. Feature-wise mixing efficiently avoids the computational overhead typically associated with fairness-aware learning algorithms. Future work could explore applying feature-wise mixing for real-world fields where accurate predictions are necessary.

Title: Fragile, Robust, and Antifragile: A Perspective from Parameter Responses in Reinforcement Learning Under Stress

Authors: Zain ul Abdeen, Ming Jin
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.23036
Pdf URL: https://arxiv.org/pdf/2506.23036
Copy Paste: [[2506.23036]] Fragile, Robust, and Antifragile: A Perspective from Parameter Responses in Reinforcement Learning Under Stress(https://arxiv.org/abs/2506.23036)
Keywords: attack, robust
Abstract: This paper explores Reinforcement learning (RL) policy robustness by systematically analyzing network parameters under internal and external stresses. Inspired by synaptic plasticity in neuroscience, synaptic filtering introduces internal stress by selectively perturbing parameters, while adversarial attacks apply external stress through modified agent observations. This dual approach enables the classification of parameters as fragile, robust, or antifragile, based on their influence on policy performance in clean and adversarial settings. Parameter scores are defined to quantify these characteristics, and the framework is validated on PPO-trained agents in Mujoco continuous control environments. The results highlight the presence of antifragile parameters that enhance policy performance under stress, demonstrating the potential of targeted filtering techniques to improve RL policy adaptability. These insights provide a foundation for future advancements in the design of robust and antifragile RL systems.

Title: Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation

Authors: Xinrong Hu, Yiyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23038
Pdf URL: https://arxiv.org/pdf/2506.23038
Copy Paste: [[2506.23038]] Inpainting is All You Need: A Diffusion-based Augmentation Method for Semi-supervised Medical Image Segmentation(https://arxiv.org/abs/2506.23038)
Keywords: diffusion, segmentation
Abstract: Collecting pixel-level labels for medical datasets can be a laborious and expensive process, and enhancing segmentation performance with a scarcity of labeled data is a crucial challenge. This work introduces AugPaint, a data augmentation framework that utilizes inpainting to generate image-label pairs from limited labeled data. AugPaint leverages latent diffusion models, known for their ability to generate high-quality in-domain images with low overhead, and adapts the sampling process for the inpainting task without need for retraining. Specifically, given a pair of image and label mask, we crop the area labeled with the foreground and condition on it during reversed denoising process for every noise level. Masked background area would gradually be filled in, and all generated images are paired with the label mask. This approach ensures the accuracy of match between synthetic images and label masks, setting it apart from existing dataset generation methods. The generated images serve as valuable supervision for training downstream segmentation models, effectively addressing the challenge of limited annotations. We conducted extensive evaluations of our data augmentation method on four public medical image segmentation datasets, including CT, MRI, and skin imaging. Results across all datasets demonstrate that AugPaint outperforms state-of-the-art label-efficient methodologies, significantly improving segmentation performance.

Title: ReMem: Mutual Information-Aware Fine-tuning of Pretrained Vision Transformers for Effective Knowledge Distillation

Authors: Chengyu Dong, Huan Gui, Noveen Sachdeva, Long Jin, Ke Yin, Jingbo Shang, Lichan Hong, Ed H.Chi, Zhe Zhao
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23041
Pdf URL: https://arxiv.org/pdf/2506.23041
Copy Paste: [[2506.23041]] ReMem: Mutual Information-Aware Fine-tuning of Pretrained Vision Transformers for Effective Knowledge Distillation(https://arxiv.org/abs/2506.23041)
Keywords: transformer
Abstract: Knowledge distillation from pretrained visual representation models offers an effective approach to improve small, task-specific production models. However, the effectiveness of such knowledge transfer drops significantly when distilling from strong models that are pretrained in a large scale. In this paper, we address this challenge for pretrained Vision Transformers (ViTs) by exploring methods to fine-tune them for more effective knowledge transfer. Motivated by the connection between mutual information and distillation effectiveness, we propose to employ mutual information-aware optimization during finetuning. For small or highly-imbalanced downstream datasets where such optimization becomes less effective, we introduce a simple yet effective heuristic of reweighting MLP blocks. This approach is inspired by our observation that top MLP blocks are primarily responsible for mutual information loss. Our method enables small student models to benefit from those pretrained models among the strongest.

Title: Ovis-U1 Technical Report

Authors: Guo-Hua Wang, Shanshan Zhao, Xinjie Zhang, Liangfu Cao, Pengxin Zhan, Lunhao Duan, Shiyin Lu, Minghao Fu, Xiaohao Chen, Jianshan Zhao, Yang Li, Qing-Guo Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23044
Pdf URL: https://arxiv.org/pdf/2506.23044
Copy Paste: [[2506.23044]] Ovis-U1 Technical Report(https://arxiv.org/abs/2506.23044)
Keywords: diffusion
Abstract: In this report, we introduce Ovis-U1, a 3-billion-parameter unified model that integrates multimodal understanding, text-to-image generation, and image editing capabilities. Building on the foundation of the Ovis series, Ovis-U1 incorporates a diffusion-based visual decoder paired with a bidirectional token refiner, enabling image generation tasks comparable to leading models like GPT-4o. Unlike some previous models that use a frozen MLLM for generation tasks, Ovis-U1 utilizes a new unified training approach starting from a language model. Compared to training solely on understanding or generation tasks, unified training yields better performance, demonstrating the enhancement achieved by integrating these two tasks. Ovis-U1 achieves a score of 69.6 on the OpenCompass Multi-modal Academic Benchmark, surpassing recent state-of-the-art models such as Ristretto-3B and SAIL-VL-1.5-2B. In text-to-image generation, it excels with scores of 83.72 and 0.89 on the DPG-Bench and GenEval benchmarks, respectively. For image editing, it achieves 4.00 and 6.42 on the ImgEdit-Bench and GEdit-Bench-EN, respectively. As the initial version of the Ovis unified model series, Ovis-U1 pushes the boundaries of multimodal understanding, generation, and editing.

Title: Equivalence Classes in AES -- Part 1

Authors: David Cornwell
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.23050
Pdf URL: https://arxiv.org/pdf/2506.23050
Copy Paste: [[2506.23050]] Equivalence Classes in AES -- Part 1(https://arxiv.org/abs/2506.23050)
Keywords: attack
Abstract: We investigate properties of equivalence classes in AES which arise naturally from properties of MixColumns and InvMixColumns. These two operations have the property that the XOR of the 4 input bytes equals the XOR of 4 output bytes. We examine the effect on equivalence classes due to the operation of SubBytes, ShiftRows, MixColumns and AddRoundKey. The next phase of research is to find a key recovery attack using known (plaintext, ciphertext) equivalence class pairs. Keywords: AES, Equivalence, Class, MixColumns, ShiftRows, SubBytes, AddRoundKey, Schedule, State, XOR

Title: Double-Diffusion: Diffusion Conditioned Diffusion Probabilistic Model For Air Quality Prediction

Authors: Hanlin Dong, Arian Prabowo, Hao Xue, Flora D. Salim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23053
Pdf URL: https://arxiv.org/pdf/2506.23053
Copy Paste: [[2506.23053]] Double-Diffusion: Diffusion Conditioned Diffusion Probabilistic Model For Air Quality Prediction(https://arxiv.org/abs/2506.23053)
Keywords: diffusion, generative
Abstract: Air quality prediction is a challenging forecasting task due to its spatio-temporal complexity and the inherent dynamics as well as uncertainty. Most of the current models handle these two challenges by applying Graph Neural Networks or known physics principles, and quantifying stochasticity through probabilistic networks like Diffusion models. Nevertheless, finding the right balancing point between the certainties and uncertainties remains an open question. Therefore, we propose Double-Diffusion, a novel diffusion probabilistic model that harnesses the power of known physics to guide air quality forecasting with stochasticity. To the best of our knowledge, while precedents have been made of using conditional diffusion models to predict air pollution, this is the first attempt to use physics as a conditional generative approach for air quality prediction. Along with a sampling strategy adopted from image restoration and a new denoiser architecture, Double-Diffusion ranks first in most evaluation scenarios across two real-life datasets compared with other probabilistic models, it also cuts inference time by 50% to 30% while enjoying an increase between 3-12% in Continuous Ranked Probabilistic Score (CRPS).

Title: Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis

Authors: Hiro Taiyo Hamada, Ippei Fujisawa, Genji Kawakita, Yuki Yamada
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23055
Pdf URL: https://arxiv.org/pdf/2506.23055
Copy Paste: [[2506.23055]] Measuring How LLMs Internalize Human Psychological Concepts: A preliminary analysis(https://arxiv.org/abs/2506.23055)
Keywords: large language model
Abstract: Large Language Models (LLMs) such as ChatGPT have shown remarkable abilities in producing human-like text. However, it is unclear how accurately these models internalize concepts that shape human thought and behavior. Here, we developed a quantitative framework to assess concept alignment between LLMs and human psychological dimensions using 43 standardized psychological questionnaires, selected for their established validity in measuring distinct psychological constructs. Our method evaluates how accurately language models reconstruct and classify questionnaire items through pairwise similarity analysis. We compared resulting cluster structures with the original categorical labels using hierarchical clustering. A GPT-4 model achieved superior classification accuracy (66.2\%), significantly outperforming GPT-3.5 (55.9\%) and BERT (48.1\%), all exceeding random baseline performance (31.9\%). We also demonstrated that the estimated semantic similarity from GPT-4 is associated with Pearson's correlation coefficients of human responses in multiple psychological questionnaires. This framework provides a novel approach to evaluate the alignment of the human-LLM concept and identify potential representational biases. Our findings demonstrate that modern LLMs can approximate human psychological constructs with measurable accuracy, offering insights for developing more interpretable AI systems.

Title: Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning

Authors: Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23056
Pdf URL: https://arxiv.org/pdf/2506.23056
Copy Paste: [[2506.23056]] Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning(https://arxiv.org/abs/2506.23056)
Keywords: large language model
Abstract: Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs' limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs' coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at this https URL.

Title: CoreMark: Toward Robust and Universal Text Watermarking Technique

Authors: Jiale Meng, Yiming Li, Zheming Lu, Zewei He, Hao Luo, Tianwei Zhang
Subjects: cs.CV, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2506.23066
Pdf URL: https://arxiv.org/pdf/2506.23066
Copy Paste: [[2506.23066]] CoreMark: Toward Robust and Universal Text Watermarking Technique(https://arxiv.org/abs/2506.23066)
Keywords: attack, robust, watermark
Abstract: Text watermarking schemes have gained considerable attention in recent years, yet still face critical challenges in achieving simultaneous robustness, generalizability, and imperceptibility. This paper introduces a new embedding paradigm,termed CORE, which comprises several consecutively aligned black pixel segments. Its key innovation lies in its inherent noise resistance during transmission and broad applicability across languages and fonts. Based on the CORE, we present a text watermarking framework named CoreMark. Specifically, CoreMark first dynamically extracts COREs from characters. Then, the characters with stronger robustness are selected according to the lengths of COREs. By modifying the thickness of the CORE, the hidden data is embedded into the selected characters without causing significant visual distortions. Moreover, a general plug-and-play embedding strength modulator is proposed, which can adaptively enhance the robustness for small font sizes by adjusting the embedding strength according to the font size. Experimental evaluation indicates that CoreMark demonstrates outstanding generalizability across multiple languages and fonts. Compared to existing methods, CoreMark achieves significant improvements in resisting screenshot, print-scan, and print camera attacks, while maintaining satisfactory imperceptibility.

Title: Curious Causality-Seeking Agents Learn Meta Causal World

Authors: Zhiyu Zhao, Haoxuan Li, Haifeng Zhang, Jun Wang, Francesco Faccio, Jürgen Schmidhuber, Mengyue Yang
Subjects: cs.LG, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2506.23068
Pdf URL: https://arxiv.org/pdf/2506.23068
Copy Paste: [[2506.23068]] Curious Causality-Seeking Agents Learn Meta Causal World(https://arxiv.org/abs/2506.23068)
Keywords: robust
Abstract: When building a world model, a common assumption is that the environment has a single, unchanging underlying causal rule, like applying Newton's laws to every situation. In reality, what appears as a drifting causal mechanism is often the manifestation of a fixed underlying mechanism seen through a narrow observational window. This brings about a problem that, when building a world model, even subtle shifts in policy or environment states can alter the very observed causal mechanisms. In this work, we introduce the \textbf{Meta-Causal Graph} as world models, a minimal unified representation that efficiently encodes the transformation rules governing how causal structures shift across different latent world states. A single Meta-Causal Graph is composed of multiple causal subgraphs, each triggered by meta state, which is in the latent state space. Building on this representation, we introduce a \textbf{Causality-Seeking Agent} whose objectives are to (1) identify the meta states that trigger each subgraph, (2) discover the corresponding causal relationships by agent curiosity-driven intervention policy, and (3) iteratively refine the Meta-Causal Graph through ongoing curiosity-driven exploration and agent experiences. Experiments on both synthetic tasks and a challenging robot arm manipulation task demonstrate that our method robustly captures shifts in causal dynamics and generalizes effectively to previously unseen contexts.

Title: Learning Counterfactually Decoupled Attention for Open-World Model Attribution

Authors: Yu Zheng, Boyang Gong, Fanye Kong, Yueqi Duan, Bingyao Yu, Wenzhao Zheng, Lei Chen, Jiwen Lu, Jie Zhou
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23074
Pdf URL: https://arxiv.org/pdf/2506.23074
Copy Paste: [[2506.23074]] Learning Counterfactually Decoupled Attention for Open-World Model Attribution(https://arxiv.org/abs/2506.23074)
Keywords: attack
Abstract: In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks. Source code: this https URL.

Title: Frequency-enhanced Multi-granularity Context Network for Efficient Vertebrae Segmentation

Authors: Jian Shi, Tianqi You, Pingping Zhang, Hongli Zhang, Rui Xu, Haojie Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23086
Pdf URL: https://arxiv.org/pdf/2506.23086
Copy Paste: [[2506.23086]] Frequency-enhanced Multi-granularity Context Network for Efficient Vertebrae Segmentation(https://arxiv.org/abs/2506.23086)
Keywords: segmentation
Abstract: Automated and accurate segmentation of individual vertebra in 3D CT and MRI images is essential for various clinical applications. Due to the limitations of current imaging techniques and the complexity of spinal structures, existing methods still struggle with reducing the impact of image blurring and distinguishing similar vertebrae. To alleviate these issues, we introduce a Frequency-enhanced Multi-granularity Context Network (FMC-Net) to improve the accuracy of vertebrae segmentation. Specifically, we first apply wavelet transform for lossless downsampling to reduce the feature distortion in blurred images. The decomposed high and low-frequency components are then processed separately. For the high-frequency components, we apply a High-frequency Feature Refinement (HFR) to amplify the prominence of key features and filter out noises, restoring fine-grained details in blurred images. For the low-frequency components, we use a Multi-granularity State Space Model (MG-SSM) to aggregate feature representations with different receptive fields, extracting spatially-varying contexts while capturing long-range dependencies with linear complexity. The utilization of multi-granularity contexts is essential for distinguishing similar vertebrae and improving segmentation accuracy. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches on both CT and MRI vertebrae segmentation datasets. The source code is publicly available at this https URL.

Title: Where, What, Why: Towards Explainable Driver Attention Prediction

Authors: Yuchen Zhou, Jiayu Tang, Xiaoyan Xiao, Yueyao Lin, Linkai Liu, Zipeng Guo, Hao Fei, Xiaobo Xia, Chao Gou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23088
Pdf URL: https://arxiv.org/pdf/2506.23088
Copy Paste: [[2506.23088]] Where, What, Why: Towards Explainable Driver Attention Prediction(https://arxiv.org/abs/2506.23088)
Keywords: robust, large language model
Abstract: Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W3DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction.

Title: From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship

Authors: Yue Xu, Wenjie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23101
Pdf URL: https://arxiv.org/pdf/2506.23101
Copy Paste: [[2506.23101]] From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship(https://arxiv.org/abs/2506.23101)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.

Title: DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation

Authors: Jihun Kim, Hoyong Kwon, Hyeokjun Kweon, Wooseong Jeong, Kuk-Jin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23104
Pdf URL: https://arxiv.org/pdf/2506.23104
Copy Paste: [[2506.23104]] DC-TTA: Divide-and-Conquer Framework for Test-Time Adaptation of Interactive Segmentation(https://arxiv.org/abs/2506.23104)
Keywords: segmentation
Abstract: Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC-TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide-and-Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC-TTA significantly outperforms SAM's zero-shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy.

Title: FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes

Authors: Janki Atul Nawale, Mohammed Safi Ur Rahman Khan, Janani D, Mansi Gupta, Danish Pruthi, Mitesh M. Khapra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23111
Pdf URL: https://arxiv.org/pdf/2506.23111
Copy Paste: [[2506.23111]] FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes(https://arxiv.org/abs/2506.23111)
Keywords: fair
Abstract: Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.

Title: MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings

Authors: Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.23115
Pdf URL: https://arxiv.org/pdf/2506.23115
Copy Paste: [[2506.23115]] MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings(https://arxiv.org/abs/2506.23115)
Keywords: robust
Abstract: Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.

Title: Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation

Authors: Zhenhua Ning, Zhuotao Tian, Shaoshuai Shi, Guangming Lu, Daojing He, Wenjie Pei, Li Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23120
Pdf URL: https://arxiv.org/pdf/2506.23120
Copy Paste: [[2506.23120]] Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation(https://arxiv.org/abs/2506.23120)
Keywords: large language model, segmentation
Abstract: Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position for identifying the targets. To tackle this issue, we propose Relevant Reasoning Segmentation (R$^2$S), a reasoning-based segmentation framework. The framework emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying relevant elements, then processing instructions guided by their associated visual priors. Furthermore, acknowledging the inadequacy of existing datasets in complex reasoning tasks, we introduce 3D ReasonSeg, a reasoning-based segmentation dataset comprising 25,185 training samples and 3,966 validation samples with precise annotations. Both quantitative and qualitative experiments demonstrate that the R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities, and we hope that they can serve as a new baseline and benchmark for future work.

Title: Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models

Authors: Shivam Sharma, Tanmoy Chakraborty
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.23122
Pdf URL: https://arxiv.org/pdf/2506.23122
Copy Paste: [[2506.23122]] Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models(https://arxiv.org/abs/2506.23122)
Keywords: transformer
Abstract: This work investigates the challenging task of identifying narrative roles - Hero, Villain, Victim, and Other - in Internet memes, across three diverse test sets spanning English and code-mixed (English-Hindi) languages. Building on an annotated dataset originally skewed toward the 'Other' class, we explore a more balanced and linguistically diverse extension, originally introduced as part of the CLEF 2024 shared task. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes, in contrast to synthetically curated hateful content, which exhibits explicit and repetitive lexical markers. To benchmark the role detection task, we evaluate a wide spectrum of models, including fine-tuned multilingual transformers, sentiment and abuse-aware classifiers, instruction-tuned LLMs, and multimodal vision-language models. Performance is assessed under zero-shot settings using precision, recall, and F1 metrics. While larger models like DeBERTa-v3 and Qwen2.5-VL demonstrate notable gains, results reveal consistent challenges in reliably identifying the 'Victim' class and generalising across cultural and code-mixed content. We also explore prompt design strategies to guide multimodal models and find that hybrid prompts incorporating structured instructions and role definitions offer marginal yet consistent improvements. Our findings underscore the importance of cultural grounding, prompt engineering, and multimodal reasoning in modelling subtle narrative framings in visual-textual content.

Title: Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning

Authors: Zhaoye Fei, Li Ji, Siyin Wang, Junhao Shi, Jingjing Gong, Xipeng Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23127
Pdf URL: https://arxiv.org/pdf/2506.23127
Copy Paste: [[2506.23127]] Unleashing Embodied Task Planning Ability in LLMs via Reinforcement Learning(https://arxiv.org/abs/2506.23127)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they face significant challenges in embodied task planning scenarios that require continuous environmental understanding and action generation. Existing approaches generate open-loop action scripts based on static knowledge, making it difficult to learn causal relationships between actions and environmental feedback, particularly in partially observable environments. We introduce Embodied Planner-R1, a novel outcome-driven reinforcement learning framework that enables LLMs to develop interactive capabilities through autonomous exploration with minimal supervision. Our framework incorporates three key innovations: (1) Without human annotations, we employ pure reinforcement learning with group rollout, incorporating in-environment interaction through parallel exploration; (2) completion-driven sparse reward; and (3) Interactive Policy Optimization (IPO) for efficient learning from grouped trajectories. Across two challenging text-based Embodied planning benchmarks, Embodied Planner-R1 achieves impressive completion rates of 97.78% on ALFWorld and 79.92% on ScienceWorld, surpassing prior methods by a large margin, and suffers only a -3.66% drop in previously unseen environments, evidencing strong generalization.

Title: Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval

Authors: Sophie Zhou, Shu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23132
Pdf URL: https://arxiv.org/pdf/2506.23132
Copy Paste: [[2506.23132]] Dare to Plagiarize? Plagiarized Painting Recognition and Retrieval(https://arxiv.org/abs/2506.23132)
Keywords: protect, generative
Abstract: Art plagiarism detection plays a crucial role in protecting artists' copyrights and intellectual property, yet it remains a challenging problem in forensic analysis. In this paper, we address the task of recognizing plagiarized paintings and explaining the detected plagarisms by retrieving visually similar authentic artworks. To support this study, we construct a dataset by collecting painting photos and synthesizing plagiarized versions using generative AI, tailored to specific artists' styles. We first establish a baseline approach using off-the-shelf features from the visual foundation model DINOv2 to retrieve the most similar images in the database and classify plagiarism based on a similarity threshold. Surprisingly, this non-learned method achieves a high recognition accuracy of 97.2\% but suffers from low retrieval precision 29.0\% average precision (AP). To improve retrieval quality, we finetune DINOv2 with a metric learning loss using positive and negative sample pairs sampled in the database. The finetuned model greatly improves retrieval performance by 12\% AP over the baseline, though it unexpectedly results in a lower recognition accuracy (92.7\%). We conclude with insightful discussions and outline directions for future research.

Title: Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format

Authors: Dingzirui Wang, Xuanliang Zhang, Rongyu Cao, Longxu Dou, Xianzhen Luo, Yingwei Ma, Qingfu Zhu, Wanxiang Che, Binhua Li, Fei Huang, Yongbin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23133
Pdf URL: https://arxiv.org/pdf/2506.23133
Copy Paste: [[2506.23133]] Format-Adapter: Improving Reasoning Capability of LLMs by Adapting Suitable Format(https://arxiv.org/abs/2506.23133)
Keywords: large language model
Abstract: Generating and voting multiple answers is an effective method to mitigate reasoning inconsistencies of large language models (LLMs). Prior works have shown that multiple reasoning formats outperform a single format when generating multiple answers. However, previous works using multiple formats rely on formats labeled by humans, which could be unsuitable for all tasks and have high labeling costs. To address this issue, we adapt suitable formats to the given tasks by generating and selecting formats. We first propose how to measure the reasoning error when generating multiple answers. Then, we introduce Format-Adapter, which utilizes LLMs to generate and select suitable reasoning formats by minimizing the error measurement we present. We conduct experiments on math and commonsense reasoning tasks, where Format-Adapter achieves a 4.3% performance improvement on average over previous works, demonstrating the effectiveness.

Title: LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation

Authors: Shadman Sobhan, Mohammad Ariful Haque
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23136
Pdf URL: https://arxiv.org/pdf/2506.23136
Copy Paste: [[2506.23136]] LLM-Assisted Question-Answering on Technical Documents Using Structured Data-Aware Retrieval Augmented Generation(https://arxiv.org/abs/2506.23136)
Keywords: large language model
Abstract: Large Language Models (LLMs) are capable of natural language understanding and generation. But they face challenges such as hallucination and outdated knowledge. Fine-tuning is one possible solution, but it is resource-intensive and must be repeated with every data update. Retrieval-Augmented Generation (RAG) offers an efficient solution by allowing LLMs to access external knowledge sources. However, traditional RAG pipelines struggle with retrieving information from complex technical documents with structured data such as tables and images. In this work, we propose a RAG pipeline, capable of handling tables and images in documents, for technical documents that support both scanned and searchable formats. Its retrieval process combines vector similarity search with a fine-tuned reranker based on Gemma-2-9b-it. The reranker is trained using RAFT (Retrieval-Augmented Fine-Tuning) on a custom dataset designed to improve context identification for question answering. Our evaluation demonstrates that the proposed pipeline achieves a high faithfulness score of 94% (RAGas) and 96% (DeepEval), and an answer relevancy score of 87% (RAGas) and 93% (DeepEval). Comparative analysis demonstrates that the proposed architecture is superior to general RAG pipelines in terms of table-based questions and handling questions outside context.

Title: VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis

Authors: Shiyu Wu, Mingzhen Sun, Weining Wang, Yequan Wang, Jing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23138
Pdf URL: https://arxiv.org/pdf/2506.23138
Copy Paste: [[2506.23138]] VisualPrompter: Prompt Optimization with Visual Feedback for Text-to-Image Synthesis(https://arxiv.org/abs/2506.23138)
Keywords: diffusion, generative
Abstract: Since there exists a notable gap between user-provided and model-preferred prompts, generating high-quality and satisfactory images using diffusion models often requires prompt engineering to optimize user inputs. Current studies on text-to-image prompt engineering can effectively enhance the style and aesthetics of generated images. However, they often neglect the semantic alignment between generated images and user descriptions, resulting in visually appealing but content-wise unsatisfying outputs. In this work, we propose VisualPrompter, a novel training-free prompt engineering framework that refines user inputs to model-preferred sentences. In particular, VisualPrompter utilizes an automatic self-reflection module to identify the missing concepts in generated images and a target-specific prompt optimization mechanism to revise the prompts in a fine-grained manner. Extensive experiments demonstrate the effectiveness of our VisualPrompter, which achieves new state-of-the-art performance on multiple benchmarks for text-image alignment evaluation. Additionally, our framework features a plug-and-play design, making it highly adaptable to various generative models.

Title: Forget-MI: Machine Unlearning for Forgetting Multimodal Information in Healthcare Settings

Authors: Shahad Hardan, Darya Taratynova, Abdelmajid Essofi, Karthik Nandakumar, Mohammad Yaqub
Subjects: cs.LG, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23145
Pdf URL: https://arxiv.org/pdf/2506.23145
Copy Paste: [[2506.23145]] Forget-MI: Machine Unlearning for Forgetting Multimodal Information in Healthcare Settings(https://arxiv.org/abs/2506.23145)
Keywords: privacy, attack, membership infer
Abstract: Privacy preservation in AI is crucial, especially in healthcare, where models rely on sensitive patient data. In the emerging field of machine unlearning, existing methodologies struggle to remove patient data from trained multimodal architectures, which are widely used in healthcare. We propose Forget-MI, a novel machine unlearning method for multimodal medical data, by establishing loss functions and perturbation techniques. Our approach unlearns unimodal and joint representations of the data requested to be forgotten while preserving knowledge from the remaining data and maintaining comparable performance to the original model. We evaluate our results using performance on the forget dataset, performance on the test dataset, and Membership Inference Attack (MIA), which measures the attacker's ability to distinguish the forget dataset from the training dataset. Our model outperforms the existing approaches that aim to reduce MIA and the performance on the forget dataset while keeping an equivalent performance on the test set. Specifically, our approach reduces MIA by 0.202 and decreases AUC and F1 scores on the forget set by 0.221 and 0.305, respectively. Additionally, our performance on the test set matches that of the retrained model, while allowing forgetting. Code is available at this https URL

Title: Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions

Authors: Dingzriui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23146
Pdf URL: https://arxiv.org/pdf/2506.23146
Copy Paste: [[2506.23146]] Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions(https://arxiv.org/abs/2506.23146)
Keywords: large language model
Abstract: In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.

Title: V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy

Authors: Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23149
Pdf URL: https://arxiv.org/pdf/2506.23149
Copy Paste: [[2506.23149]] V-SYNTHESIS: Task-Agnostic Synthesis of Consistent and Diverse In-Context Demonstrations from Scratch via V-Entropy(https://arxiv.org/abs/2506.23149)
Keywords: large language model
Abstract: High labeling cost for in-context learning (ICL) demonstrations motivates using large language models (LLMs) for synthesis to reduce overhead. However, existing synthesis methods are mainly task-specific or rely on pre-existing demonstrations. So this paper focuses on synthesizing demonstrations from scratch for arbitrary tasks. A major challenge in synthesizing from scratch is ensuring consistency with the target task, as the lack of labeling guidance could lead to synthesis bias. We first propose a consistency metric called V-Score, which has higher performance and lower computation cost compared with the metrics based on grams or embedding vectors. Furthermore, we introduce V-Synthesis, which leverages V-Score for proportional sampling to ensure both high consistency and diversity of synthesized demonstrations. Experimental results demonstrate that V-Synthesis yields an average performance improvement of 2.0% compared to existing synthesis methods confirming the effectiveness of V-Synthesis.

Title: MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation

Authors: Vladislav Bargatin, Egor Chistov, Alexander Yakovenko, Dmitriy Vatolin
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.23151
Pdf URL: https://arxiv.org/pdf/2506.23151
Copy Paste: [[2506.23151]] MEMFOF: High-Resolution Training for Memory-Efficient Multi-Frame Optical Flow Estimation(https://arxiv.org/abs/2506.23151)
Keywords: robust
Abstract: Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289, leads Sintel (clean) with an endpoint error (EPE) of 0.963, and achieves the best Fl-all error on KITTI-2015 at 2.94%. The code is available at this https URL.

Title: Dynamic View Synthesis from Small Camera Motion Videos

Authors: Huiqiang Sun, Xingyi Li, Juewen Peng, Liao Shen, Zhiguo Cao, Ke Xian, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23153
Pdf URL: https://arxiv.org/pdf/2506.23153
Copy Paste: [[2506.23153]] Dynamic View Synthesis from Small Camera Motion Videos(https://arxiv.org/abs/2506.23153)
Keywords: robust
Abstract: Novel view synthesis for dynamic $3$D scenes poses a significant challenge. Many notable efforts use NeRF-based approaches to address this task and yield impressive results. However, these methods rely heavily on sufficient motion parallax in the input images or videos. When the camera motion range becomes limited or even stationary (i.e., small camera motion), existing methods encounter two primary challenges: incorrect representation of scene geometry and inaccurate estimation of camera parameters. These challenges make prior methods struggle to produce satisfactory results or even become invalid. To address the first challenge, we propose a novel Distribution-based Depth Regularization (DDR) that ensures the rendering weight distribution to align with the true distribution. Specifically, unlike previous methods that use depth loss to calculate the error of the expectation, we calculate the expectation of the error by using Gumbel-softmax to differentiably sample points from discrete rendering weight distribution. Additionally, we introduce constraints that enforce the volume density of spatial points before the object boundary along the ray to be near zero, ensuring that our model learns the correct geometry of the scene. To demystify the DDR, we further propose a visualization tool that enables observing the scene geometry representation at the rendering weight level. For the second challenge, we incorporate camera parameter learning during training to enhance the robustness of our model to camera parameters. We conduct extensive experiments to demonstrate the effectiveness of our approach in representing scenes with small camera motion input, and our results compare favorably to state-of-the-art methods.

Title: Self-Supervised Contrastive Learning for Multi-Label Images

Authors: Jiale Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23156
Pdf URL: https://arxiv.org/pdf/2506.23156
Copy Paste: [[2506.23156]] Self-Supervised Contrastive Learning for Multi-Label Images(https://arxiv.org/abs/2506.23156)
Keywords: extraction
Abstract: Self-supervised learning (SSL) has demonstrated its effectiveness in learning representations through comparison methods that align with human intuition. However, mainstream SSL methods heavily rely on high body datasets with single label, such as ImageNet, resulting in intolerable pre-training overhead. Besides, more general multi-label images are frequently overlooked in SSL, despite their potential for richer semantic information and broader applicability in downstream scenarios. Therefore, we tailor the mainstream SSL approach to guarantee excellent representation learning capabilities using fewer multi-label images. Firstly, we propose a block-wise augmentation module aimed at extracting additional potential positive view pairs from multi-label images. Subsequently, an image-aware contrastive loss is devised to establish connections between these views, thereby facilitating the extraction of semantically consistent representations. Comprehensive linear fine-tuning and transfer learning validate the competitiveness of our approach despite challenging sample quality and quantity.

Title: Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes

Authors: David Bossens, Atsushi Nitanda
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2506.23165
Pdf URL: https://arxiv.org/pdf/2506.23165
Copy Paste: [[2506.23165]] Mirror Descent Policy Optimisation for Robust Constrained Markov Decision Processes(https://arxiv.org/abs/2506.23165)
Keywords: robust
Abstract: Safety is an essential requirement for reinforcement learning systems. The newly emerging framework of robust constrained Markov decision processes allows learning policies that satisfy long-term constraints while providing guarantees under epistemic uncertainty. This paper presents mirror descent policy optimisation for robust constrained Markov decision processes (RCMDPs), making use of policy gradient techniques to optimise both the policy (as a maximiser) and the transition kernel (as an adversarial minimiser) on the Lagrangian representing a constrained MDP. In the oracle-based RCMDP setting, we obtain an $\mathcal{O}\left(\frac{1}{T}\right)$ convergence rate for the squared distance as a Bregman divergence, and an $\mathcal{O}\left(e^{-T}\right)$ convergence rate for entropy-regularised objectives. In the sample-based RCMDP setting, we obtain an $\tilde{\mathcal{O}}\left(\frac{1}{T^{1/3}}\right)$ convergence rate. Experiments confirm the benefits of mirror descent policy optimisation in constrained and unconstrained optimisation, and significant improvements are observed in robustness tests when compared to baseline policy optimisation algorithms.

Title: Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data

Authors: Chen Gong, Bo Liang, Wei Gao, Chenren Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23174
Pdf URL: https://arxiv.org/pdf/2506.23174
Copy Paste: [[2506.23174]] Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data(https://arxiv.org/abs/2506.23174)
Keywords: generative
Abstract: Generative models have gained significant attention for their ability to produce realistic synthetic data that supplements the quantity of real-world datasets. While recent studies show performance improvements in wireless sensing tasks by incorporating all synthetic data into training sets, the quality of synthetic data remains unpredictable and the resulting performance gains are not guaranteed. To address this gap, we propose tractable and generalizable metrics to quantify quality attributes of synthetic data - affinity and diversity. Our assessment reveals prevalent affinity limitation in current wireless synthetic data, leading to mislabeled data and degraded task performance. We attribute the quality limitation to generative models' lack of awareness of untrained conditions and domain-specific processing. To mitigate these issues, we introduce SynCheck, a quality-guided synthetic data utilization scheme that refines synthetic data quality during task model training. Our evaluation demonstrates that SynCheck consistently outperforms quality-oblivious utilization of synthetic data, and achieves 4.3% performance improvement even when the previous utilization degrades performance by 13.4%.

Title: Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data

Authors: Robert Frank, Michael Widrich, Rahmad Akbar, Günter Klambauer, Geir Kjetil Sandve, Philippe A. Robert, Victor Greiff
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23182
Pdf URL: https://arxiv.org/pdf/2506.23182
Copy Paste: [[2506.23182]] Attribution assignment for deep-generative sequence models enables interpretability analysis using positive-only data(https://arxiv.org/abs/2506.23182)
Keywords: interpretability, generative
Abstract: Generative machine learning models offer a powerful framework for therapeutic design by efficiently exploring large spaces of biological sequences enriched for desirable properties. Unlike supervised learning methods, which require both positive and negative labeled data, generative models such as LSTMs can be trained solely on positively labeled sequences, for example, high-affinity antibodies. This is particularly advantageous in biological settings where negative data are scarce, unreliable, or biologically ill-defined. However, the lack of attribution methods for generative models has hindered the ability to extract interpretable biological insights from such models. To address this gap, we developed Generative Attribution Metric Analysis (GAMA), an attribution method for autoregressive generative models based on Integrated Gradients. We assessed GAMA using synthetic datasets with known ground truths to characterize its statistical behavior and validate its ability to recover biologically relevant features. We further demonstrated the utility of GAMA by applying it to experimental antibody-antigen binding data. GAMA enables model interpretability and the validation of generative sequence design strategies without the need for negative training data.

Title: A Practical and Secure Byzantine Robust Aggregator

Authors: De Zhang Lee, Aashish Kolluri, Prateek Saxena, Ee-Chien Chang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.23183
Pdf URL: https://arxiv.org/pdf/2506.23183
Copy Paste: [[2506.23183]] A Practical and Secure Byzantine Robust Aggregator(https://arxiv.org/abs/2506.23183)
Keywords: secure, security, defense, attack, robust
Abstract: In machine learning security, one is often faced with the problem of removing outliers from a given set of high-dimensional vectors when computing their average. For example, many variants of data poisoning attacks produce gradient vectors during training that are outliers in the distribution of clean gradients, which bias the computed average used to derive the ML model. Filtering them out before averaging serves as a generic defense strategy. Byzantine robust aggregation is an algorithmic primitive which computes a robust average of vectors, in the presence of an $\epsilon$ fraction of vectors which may have been arbitrarily and adaptively corrupted, such that the resulting bias in the final average is provably bounded. In this paper, we give the first robust aggregator that runs in quasi-linear time in the size of input vectors and provably has near-optimal bias bounds. Our algorithm also does not assume any knowledge of the distribution of clean vectors, nor does it require pre-computing any filtering thresholds from it. This makes it practical to use directly in standard neural network training procedures. We empirically confirm its expected runtime efficiency and its effectiveness in nullifying 10 different ML poisoning attacks.

Title: Trident: Detecting Face Forgeries with Adversarial Triplet Learning

Authors: Mustafa Hakan Kara, Aysegul Dundar, Uğur Güdükbay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23189
Pdf URL: https://arxiv.org/pdf/2506.23189
Copy Paste: [[2506.23189]] Trident: Detecting Face Forgeries with Adversarial Triplet Learning(https://arxiv.org/abs/2506.23189)
Keywords: robust
Abstract: As face forgeries generated by deep neural networks become increasingly sophisticated, detecting face manipulations in digital media has posed a significant challenge, underscoring the importance of maintaining digital media integrity and combating visual disinformation. Current detection models, predominantly based on supervised training with domain-specific data, often falter against forgeries generated by unencountered techniques. In response to this challenge, we introduce \textit{Trident}, a face forgery detection framework that employs triplet learning with a Siamese network architecture for enhanced adaptability across diverse forgery methods. \textit{Trident} is trained on curated triplets to isolate nuanced differences of forgeries, capturing fine-grained features that distinguish pristine samples from manipulated ones while controlling for other variables. To further enhance generalizability, we incorporate domain-adversarial training with a forgery discriminator. This adversarial component guides our embedding model towards forgery-agnostic representations, improving its robustness to unseen manipulations. In addition, we prevent gradient flow from the classifier head to the embedding model, avoiding overfitting induced by artifacts peculiar to certain forgeries. Comprehensive evaluations across multiple benchmarks and ablation studies demonstrate the effectiveness of our framework. We will release our code in a GitHub repository.

Title: External Data-Enhanced Meta-Representation for Adaptive Probabilistic Load Forecasting

Authors: Haoran Li, Muhao Guo, Marija Ilic, Yang Weng, Guangchun Ruan
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.23201
Pdf URL: https://arxiv.org/pdf/2506.23201
Copy Paste: [[2506.23201]] External Data-Enhanced Meta-Representation for Adaptive Probabilistic Load Forecasting(https://arxiv.org/abs/2506.23201)
Keywords: robust, extraction
Abstract: Accurate residential load forecasting is critical for power system reliability with rising renewable integration and demand-side flexibility. However, most statistical and machine learning models treat external factors, such as weather, calendar effects, and pricing, as extra input, ignoring their heterogeneity, and thus limiting the extraction of useful external information. We propose a paradigm shift: external data should serve as meta-knowledge to dynamically adapt the forecasting model itself. Based on this idea, we design a meta-representation framework using hypernetworks that modulate selected parameters of a base Deep Learning (DL) model in response to external conditions. This provides both expressivity and adaptability. We further integrate a Mixture-of-Experts (MoE) mechanism to enhance efficiency through selective expert activation, while improving robustness by filtering redundant external inputs. The resulting model, dubbed as a Meta Mixture of Experts for External data (M2oE2), achieves substantial improvements in accuracy and robustness with limited additional overhead, outperforming existing state-of-the-art methods in diverse load datasets. The dataset and source code are publicly available at this https URL\_load\this http URL.

Title: Transformer-Based Person Search with High-Frequency Augmentation and Multi-Wave Mixing

Authors: Qilin Shu, Qixian Zhang, Qi Zhang, Hongyun Zhang, Duoqian Miao, Cairong Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23202
Pdf URL: https://arxiv.org/pdf/2506.23202
Copy Paste: [[2506.23202]] Transformer-Based Person Search with High-Frequency Augmentation and Multi-Wave Mixing(https://arxiv.org/abs/2506.23202)
Keywords: extraction, transformer
Abstract: The person search task aims to locate a target person within a set of scene images. In recent years, transformer-based models in this field have made some progress. However, they still face three primary challenges: 1) the self-attention mechanism tends to suppress high-frequency components in the features, which severely impacts model performance; 2) the computational cost of transformers is relatively high. To address these issues, we propose a novel High-frequency Augmentation and Multi-Wave mixing (HAMW) method for person search. HAMW is designed to enhance the discriminative feature extraction capabilities of transformers while reducing computational overhead and improving efficiency. Specifically, we develop a three-stage framework that progressively optimizes both detection and re-identification performance. Our model enhances the perception of high-frequency features by learning from augmented inputs containing additional high-frequency components. Furthermore, we replace the self-attention layers in the transformer with a strategy based on multi-level Haar wavelet fusion to capture multi-scale features. This not only lowers the computational complexity but also alleviates the suppression of high-frequency features and enhances the ability to exploit multi-scale information. Extensive experiments demonstrate that HAMW achieves state-of-the-art performance on both the CUHK-SYSU and PRW datasets.

Title: BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion

Authors: Dequan Kong, Zhe Zhu, Honghua Chen, Mingqiang Wei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23205
Pdf URL: https://arxiv.org/pdf/2506.23205
Copy Paste: [[2506.23205]] BridgeShape: Latent Diffusion Schrödinger Bridge for 3D Shape Completion(https://arxiv.org/abs/2506.23205)
Keywords: diffusion
Abstract: Existing diffusion-based 3D shape completion methods typically use a conditional paradigm, injecting incomplete shape information into the denoising network via deep feature interactions (e.g., concatenation, cross-attention) to guide sampling toward complete shapes, often represented by voxel-based distance functions. However, these approaches fail to explicitly model the optimal global transport path, leading to suboptimal completions. Moreover, performing diffusion directly in voxel space imposes resolution constraints, limiting the generation of fine-grained geometric details. To address these challenges, we propose BridgeShape, a novel framework for 3D shape completion via latent diffusion Schrödinger bridge. The key innovations lie in two aspects: (i) BridgeShape formulates shape completion as an optimal transport problem, explicitly modeling the transition between incomplete and complete shapes to ensure a globally coherent transformation. (ii) We introduce a Depth-Enhanced Vector Quantized Variational Autoencoder (VQ-VAE) to encode 3D shapes into a compact latent space, leveraging self-projected multi-view depth information enriched with strong DINOv2 features to enhance geometric structural perception. By operating in a compact yet structurally informative latent space, BridgeShape effectively mitigates resolution constraints and enables more efficient and high-fidelity 3D shape completion. BridgeShape achieves state-of-the-art performance on large-scale 3D shape completion benchmarks, demonstrating superior fidelity at higher resolutions and for unseen object classes.

Title: TVG-SLAM: Robust Gaussian Splatting SLAM with Tri-view Geometric Constraints

Authors: Zhen Tan, Xieyuanli Chen, Lei Feng, Yangbing Ge, Shuaifeng Zhi, Jiaxiong Liu, Dewen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23207
Pdf URL: https://arxiv.org/pdf/2506.23207
Copy Paste: [[2506.23207]] TVG-SLAM: Robust Gaussian Splatting SLAM with Tri-view Geometric Constraints(https://arxiv.org/abs/2506.23207)
Keywords: robust
Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have enabled RGB-only SLAM systems to achieve high-fidelity scene representation. However, the heavy reliance of existing systems on photometric rendering loss for camera tracking undermines their robustness, especially in unbounded outdoor environments with severe viewpoint and illumination changes. To address these challenges, we propose TVG-SLAM, a robust RGB-only 3DGS SLAM system that leverages a novel tri-view geometry paradigm to ensure consistent tracking and high-quality mapping. We introduce a dense tri-view matching module that aggregates reliable pairwise correspondences into consistent tri-view matches, forming robust geometric constraints across frames. For tracking, we propose Hybrid Geometric Constraints, which leverage tri-view matches to construct complementary geometric cues alongside photometric loss, ensuring accurate and stable pose estimation even under drastic viewpoint shifts and lighting variations. For mapping, we propose a new probabilistic initialization strategy that encodes geometric uncertainty from tri-view correspondences into newly initialized Gaussians. Additionally, we design a Dynamic Attenuation of Rendering Trust mechanism to mitigate tracking drift caused by mapping latency. Experiments on multiple public outdoor datasets show that our TVG-SLAM outperforms prior RGB-only 3DGS-based SLAM systems. Notably, in the most challenging dataset, our method improves tracking robustness, reducing the average Absolute Trajectory Error (ATE) by 69.0\% while achieving state-of-the-art rendering quality. The implementation of our method will be released as open-source.

Title: FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model

Authors: Taehwan Yoon, Bongjun Choi
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2506.23210
Pdf URL: https://arxiv.org/pdf/2506.23210
Copy Paste: [[2506.23210]] FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model(https://arxiv.org/abs/2506.23210)
Keywords: privacy, federate
Abstract: Federated learning(FL) is used for distributed scenarios to train artificial intelligence(AI) models while ensuring users' privacy. In federated learning scenario, the server generally never knows about users' data. This type of concept makes the AI training process efficient in terms of data privacy. However, regarding model performance, federated AI models may not sufficiently satisfy AI users' expectations. Furthermore, AI users have a wide range of different needs. It is not easy to satisfy the whole users needs. These types of issues can be addressed through AI model optimization, fine-tuning, or personalization to achieve optimal model performance. To address model optimization challenges, we propose reference model-based federated learning for optimal fine-tuning, which overcomes catastrophic forgetting in each round. This method is derived from Bayesian parameter-efficient transfer learning, which includes an optimal proximal term and enables overcoming the catastrophic forgetting issue in each round by utilizing a reference model that incorporates previous model parameters. As a result, this method achieves both high model performance and low computing cost.

Title: UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding

Authors: Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.23219
Pdf URL: https://arxiv.org/pdf/2506.23219
Copy Paste: [[2506.23219]] UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding(https://arxiv.org/abs/2506.23219)
Keywords: robust, large language model
Abstract: Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from location view to global view of urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we also extend existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via this https URL.

Title: Masked Gated Linear Unit

Authors: Yukito Tajima, Nakamasa Inoue, Yusuke Sekikawa, Ikuro Sato, Rio Yokota
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.23225
Pdf URL: https://arxiv.org/pdf/2506.23225
Copy Paste: [[2506.23225]] Masked Gated Linear Unit(https://arxiv.org/abs/2506.23225)
Keywords: large language model
Abstract: Gated Linear Units (GLUs) have become essential components in the feed-forward networks of state-of-the-art Large Language Models (LLMs). However, they require twice as many memory reads compared to feed-forward layers without gating, due to the use of separate weight matrices for the gate and value streams. To address this bottleneck, we introduce Masked Gated Linear Units (MGLUs), a novel family of GLUs with an efficient kernel implementation. The core contribution of MGLUs include: (1) the Mixture of Element-wise Gating (MoEG) architecture that learns multiple binary masks, each determining gate or value assignments at the element level on a single shared weight matrix resulting in reduced memory transfer, and (2) FlashMGLU, a hardware-friendly kernel that yields up to a 19.7 $\times$ inference-time speed-up over a naive PyTorch MGLU and is 47% more memory-efficient and 34% faster than standard GLUs despite added architectural complexity on an RTX5090 GPU. In LLM experiments, the Swish-activated variant SwiMGLU preserves its memory advantages while matching - or even surpassing - the downstream accuracy of the SwiGLU baseline.

Title: High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation

Authors: Lunhao Duan, Shanshan Zhao, Xingxing Weng, Jing Zhang, Gui-Song Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23227
Pdf URL: https://arxiv.org/pdf/2506.23227
Copy Paste: [[2506.23227]] High-quality Pseudo-labeling for Point Cloud Segmentation with Scene-level Annotation(https://arxiv.org/abs/2506.23227)
Keywords: segmentation
Abstract: This paper investigates indoor point cloud semantic segmentation under scene-level annotation, which is less explored compared to methods relying on sparse point-level labels. In the absence of precise point-level labels, current methods first generate point-level pseudo-labels, which are then used to train segmentation models. However, generating accurate pseudo-labels for each point solely based on scene-level annotations poses a considerable challenge, substantially affecting segmentation performance. Consequently, to enhance accuracy, this paper proposes a high-quality pseudo-label generation framework by exploring contemporary multi-modal information and region-point semantic consistency. Specifically, with a cross-modal feature guidance module, our method utilizes 2D-3D correspondences to align point cloud features with corresponding 2D image pixels, thereby assisting point cloud feature learning. To further alleviate the challenge presented by the scene-level annotation, we introduce a region-point semantic consistency module. It produces regional semantics through a region-voting strategy derived from point-level semantics, which are subsequently employed to guide the point-level semantic predictions. Leveraging the aforementioned modules, our method can rectify inaccurate point-level semantic predictions during training and obtain high-quality pseudo-labels. Significant improvements over previous works on ScanNet v2 and S3DIS datasets under scene-level annotation can demonstrate the effectiveness. Additionally, comprehensive ablation studies validate the contributions of our approach's individual components. The code is available at this https URL .

Title: Generalist Reward Models: Found Inside Large Language Models

Authors: Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23235
Pdf URL: https://arxiv.org/pdf/2506.23235
Copy Paste: [[2506.23235]] Generalist Reward Models: Found Inside Large Language Models(https://arxiv.org/abs/2506.23235)
Keywords: large language model
Abstract: The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To our best knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for LLMs alignment as well as multi-modal models.

Title: VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions

Authors: Marko Mihajlovic, Siwei Zhang, Gen Li, Kaifeng Zhao, Lea Müller, Siyu Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23236
Pdf URL: https://arxiv.org/pdf/2506.23236
Copy Paste: [[2506.23236]] VolumetricSMPL: A Neural Volumetric Body Model for Efficient Interactions, Contacts, and Collisions(https://arxiv.org/abs/2506.23236)
Keywords: robust
Abstract: Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms prior volumetric occupancy model COAP with 10x faster inference, 6x lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL's strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained motion synthesis, and (4) resolving self-intersections. Our results highlight its broad applicability and significant performance and efficiency gains.

Title: Aggregating Local Saliency Maps for Semi-Global Explainable Image Classification

Authors: James Hinns, David Martens
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23247
Pdf URL: https://arxiv.org/pdf/2506.23247
Copy Paste: [[2506.23247]] Aggregating Local Saliency Maps for Semi-Global Explainable Image Classification(https://arxiv.org/abs/2506.23247)
Keywords: watermark, segmentation
Abstract: Deep learning dominates image classification tasks, yet understanding how models arrive at predictions remains a challenge. Much research focuses on local explanations of individual predictions, such as saliency maps, which visualise the influence of specific pixels on a model's prediction. However, reviewing many of these explanations to identify recurring patterns is infeasible, while global methods often oversimplify and miss important local behaviours. To address this, we propose Segment Attribution Tables (SATs), a method for summarising local saliency explanations into (semi-)global insights. SATs take image segments (such as "eyes" in Chihuahuas) and leverage saliency maps to quantify their influence. These segments highlight concepts the model relies on across instances and reveal spurious correlations, such as reliance on backgrounds or watermarks, even when out-of-distribution test performance sees little change. SATs can explain any classifier for which a form of saliency map can be produced, using segmentation maps that provide named segments. SATs bridge the gap between oversimplified global summaries and overly detailed local explanations, offering a practical tool for analysing and debugging image classifiers.

Title: DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection

Authors: Kunwei Lv, Ping Lan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23252
Pdf URL: https://arxiv.org/pdf/2506.23252
Copy Paste: [[2506.23252]] DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection(https://arxiv.org/abs/2506.23252)
Keywords: robust, extraction
Abstract: The rapid proliferation of unmanned aerial vehicles (UAVs) has highlighted the importance of robust and efficient object detection in diverse aerial scenarios. Detecting small objects under complex conditions, however, remains a significant challenge. Existing approaches often prioritize inference speed, leading to degraded performance when handling multi-modal inputs. To address this, we present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. Specifically, we introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales. Additionally, we replace the conventional neck with a Gather-and-Distribute module to mitigate information loss during feature aggregation. Extensive experiments on the Drone Vehicle dataset demonstrate that DGE-YOLO achieves superior performance over state-of-the-art methods, validating its effectiveness in multi-modal UAV object detection tasks.

Title: PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution

Authors: Aradhana Mishra, Bumshik Lee
Subjects: cs.CV, cs.AI, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2506.23254
Pdf URL: https://arxiv.org/pdf/2506.23254
Copy Paste: [[2506.23254]] PixelBoost: Leveraging Brownian Motion for Realistic-Image Super-Resolution(https://arxiv.org/abs/2506.23254)
Keywords: diffusion
Abstract: Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference times by decreasing sampling steps, resulting in less realistic and hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly focusing on texture and edge definitions. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio(PSNR), structural similarity index measure (SSIM), as well as visual quality. To determine the edge enhancement, we evaluated the gradient magnitude and pixel value, and our proposed model exhibited a better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.

Title: From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows

Authors: Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Merouane Debbah
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23260
Pdf URL: https://arxiv.org/pdf/2506.23260
Copy Paste: [[2506.23260]] From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows(https://arxiv.org/abs/2506.23260)
Keywords: security, privacy, defense, attack, robust, membership infer, federate, large language model
Abstract: Autonomous AI agents powered by large language models (LLMs) with structured function-calling interfaces have dramatically expanded capabilities for real-time data retrieval, complex computation, and multi-step orchestration. Yet, the explosive proliferation of plugins, connectors, and inter-agent protocols has outpaced discovery mechanisms and security practices, resulting in brittle integrations vulnerable to diverse threats. In this survey, we introduce the first unified, end-to-end threat model for LLM-agent ecosystems, spanning host-to-tool and agent-to-agent communications, formalize adversary capabilities and attacker objectives, and catalog over thirty attack techniques. Specifically, we organized the threat model into four domains: Input Manipulation (e.g., prompt injections, long-context hijacks, multimodal adversarial inputs), Model Compromise (e.g., prompt- and parameter-level backdoors, composite and encrypted multi-backdoors, poisoning strategies), System and Privacy Attacks (e.g., speculative side-channels, membership inference, retrieval poisoning, social-engineering simulations), and Protocol Vulnerabilities (e.g., exploits in Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent Network Protocol (ANP), and Agent-to-Agent (A2A) protocol). For each category, we review representative scenarios, assess real-world feasibility, and evaluate existing defenses. Building on our threat taxonomy, we identify key open challenges and future research directions, such as securing MCP deployments through dynamic trust management and cryptographic provenance tracking; designing and hardening Agentic Web Interfaces; and achieving resilience in multi-agent and federated environments. Our work provides a comprehensive reference to guide the design of robust defense mechanisms and establish best practices for resilient LLM-agent workflows.

Title: Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis

Authors: Lei-lei Li, Jianwu Fang, Junbin Xiao, Shanmin Pang, Hongkai Yu, Chen Lv, Jianru Xue, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23263
Pdf URL: https://arxiv.org/pdf/2506.23263
Copy Paste: [[2506.23263]] Causal-Entity Reflected Egocentric Traffic Accident Video Synthesis(https://arxiv.org/abs/2506.23263)
Keywords: diffusion
Abstract: Egocentricly comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing causal-entity reflected accident videos can facilitate the capability test to respond to unaffordable accidents in reality. However, incorporating causal relations as seen in real-world videos into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. In this regard, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages the cause descriptions and driver fixations to identify the accident participants and behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (with 1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in terms of frame quality and causal sensitivity in various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.

Title: Token Activation Map to Visually Explain Multimodal LLMs

Authors: Yi Li, Hualiang Wang, Xinpeng Ding, Haonan Wang, Xiaomeng Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23270
Pdf URL: https://arxiv.org/pdf/2506.23270
Copy Paste: [[2506.23270]] Token Activation Map to Visually Explain Multimodal LLMs(https://arxiv.org/abs/2506.23270)
Keywords: explainability, large language model
Abstract: Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. TAM also indicates that it excels at explaining multiple tokens of MLLM, which is different from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual comparison, and model understanding (e.g., color, shape, action, location, visual reasoning, multi-turn conversation, etc). The code is available this http URL.

Title: Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation

Authors: Jinxing Zhou, Zhihui Li, Yongqiang Yu, Yanghao Zhou, Ruohao Guo, Guangyao Li, Yuxin Mao, Mingfei Han, Xiaojun Chang, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23271
Pdf URL: https://arxiv.org/pdf/2506.23271
Copy Paste: [[2506.23271]] Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation(https://arxiv.org/abs/2506.23271)
Keywords: transformer, segmentation
Abstract: We present \textbf{Met}a-\textbf{T}oken \textbf{Le}arning (Mettle), a simple and memory-efficient method for adapting large-scale pretrained transformer models to downstream audio-visual tasks. Instead of sequentially modifying the output feature distribution of the transformer backbone, Mettle utilizes a lightweight \textit{Layer-Centric Distillation (LCD)} module to distill in parallel the intact audio or visual features embedded by each transformer layer into compact meta-tokens. This distillation process considers both pretrained knowledge preservation and task-specific adaptation. The obtained meta-tokens can be directly applied to classification tasks, such as audio-visual event localization and audio-visual video parsing. To further support fine-grained segmentation tasks, such as audio-visual segmentation, we introduce a \textit{Meta-Token Injection (MTI)} module, which utilizes the audio and visual meta-tokens distilled from the top transformer layer to guide feature adaptation in earlier layers. Extensive experiments on multiple audiovisual benchmarks demonstrate that our method significantly reduces memory usage and training time while maintaining parameter efficiency and competitive accuracy.

Title: Why Settle for One? Text-to-ImageSet Generation and Evaluation

Authors: Chengyou Jia, Xin Shen, Zhuohang Dang, Zhuohang Dang, Changliang Xia, Weijia Wu, Xinyu Zhang, Hangwei Qian, Ivor W.Tsang, Minnan Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23275
Pdf URL: https://arxiv.org/pdf/2506.23275
Copy Paste: [[2506.23275]] Why Settle for One? Text-to-ImageSet Generation and Evaluation(https://arxiv.org/abs/2506.23275)
Keywords: diffusion, transformer
Abstract: Despite remarkable progress in Text-to-Image models, many real-world applications require generating coherent image sets with diverse consistency requirements. Existing consistent methods often focus on a specific domain with specific aspects of consistency, which significantly constrains their generalizability to broader applications. In this paper, we propose a more challenging problem, Text-to-ImageSet (T2IS) generation, which aims to generate sets of images that meet various consistency requirements based on user instructions. To systematically study this problem, we first introduce $\textbf{T2IS-Bench}$ with 596 diverse instructions across 26 subcategories, providing comprehensive coverage for T2IS generation. Building on this, we propose $\textbf{T2IS-Eval}$, an evaluation framework that transforms user instructions into multifaceted assessment criteria and employs effective evaluators to adaptively assess consistency fulfillment between criteria and generated sets. Subsequently, we propose $\textbf{AutoT2IS}$, a training-free framework that maximally leverages pretrained Diffusion Transformers' in-context capabilities to harmonize visual elements to satisfy both image-level prompt alignment and set-level visual consistency. Extensive experiments on T2IS-Bench reveal that diverse consistency challenges all existing methods, while our AutoT2IS significantly outperforms current generalized and even specialized approaches. Our method also demonstrates the ability to enable numerous underexplored real-world applications, confirming its substantial practical value. Visit our project in this https URL.

Title: Autoregressive Denoising Score Matching is a Good Video Anomaly Detector

Authors: Hanwen Zhang, Congqi Cao, Qinyi Lv, Lingtong Min, Yanning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23282
Pdf URL: https://arxiv.org/pdf/2506.23282
Copy Paste: [[2506.23282]] Autoregressive Denoising Score Matching is a Good Video Anomaly Detector(https://arxiv.org/abs/2506.23282)
Keywords: transformer, generative
Abstract: Video anomaly detection (VAD) is an important computer vision problem. Thanks to the mode coverage capabilities of generative models, the likelihood-based paradigm is catching growing interest, as it can model normal distribution and detect out-of-distribution anomalies. However, these likelihood-based methods are blind to the anomalies located in local modes near the learned distribution. To handle these ``unseen" anomalies, we dive into three gaps uniquely existing in VAD regarding scene, motion and appearance. Specifically, we first build a noise-conditioned score transformer for denoising score matching. Then, we introduce a scene-dependent and motion-aware score function by embedding the scene condition of input sequences into our model and assigning motion weights based on the difference between key frames of input sequences. Next, to solve the problem of blindness in principle, we integrate unaffected visual information via a novel autoregressive denoising score matching mechanism for inference. Through autoregressively injecting intensifying Gaussian noise into the denoised data and estimating the corresponding score function, we compare the denoised data with the original data to get a difference and aggregate it with the score function for an enhanced appearance perception and accumulate the abnormal context. With all three gaps considered, we can compute a more comprehensive anomaly indicator. Experiments on three popular VAD benchmarks demonstrate the state-of-the-art performance of our method.

Title: Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis

Authors: Zelin Zang, WenZhe Li, Fei Chen, Yongjie Xu, Chang Yu, Zhen Lei, Stan Z. Li
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23287
Pdf URL: https://arxiv.org/pdf/2506.23287
Copy Paste: [[2506.23287]] Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis(https://arxiv.org/abs/2506.23287)
Keywords: diffusion, generative
Abstract: In single-cell research, tracing and analyzing high-throughput single-cell differentiation trajectories is crucial for understanding complex biological processes. Key to this is the modeling and generation of hierarchical data that represents the intrinsic structure within datasets. Traditional methods face limitations in terms of computational cost, performance, generative capacity, and stability. Recent VAEs based approaches have made strides in addressing these challenges but still require specialized network modules for each tree branch, limiting their stability and ability to capture deep hierarchical relationships. To overcome these challenges, we introduce diffusion-based approach called HDTree. HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and quantized diffusion processes to model tree node transitions. This method improves stability by eliminating branch-specific modules and enhancing generative capacity through gradual hierarchical changes simulated by the diffusion process. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets, where it outperforms existing methods in terms of accuracy and performance. These contributions provide a new tool for hierarchical lineage analysis, enabling more accurate and efficient modeling of cellular differentiation paths and offering insights for downstream biological tasks. The code of HDTree is available at anonymous link this https URL.

Title: Two Spelling Normalization Approaches Based on Large Language Models

Authors: Miguel Domingo, Francisco Casacuberta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23288
Pdf URL: https://arxiv.org/pdf/2506.23288
Copy Paste: [[2506.23288]] Two Spelling Normalization Approaches Based on Large Language Models(https://arxiv.org/abs/2506.23288)
Keywords: large language model
Abstract: The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one of which has been trained without a supervised training, and a second one which has been trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both of them yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.

Title: DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios

Authors: Changtao Miao, Yi Zhang, Weize Gao, Man Luo, Weiwei Feng, Zhiya Tan, Jianshu Li, Ajian Liu, Yunfeng Diao, Qi Chu, Tao Gong, Zhe Li, Weibin Yao, Joey Tianyi Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23292
Pdf URL: https://arxiv.org/pdf/2506.23292
Copy Paste: [[2506.23292]] DDL: A Dataset for Interpretable Deepfake Detection and Localization in Real-World Scenarios(https://arxiv.org/abs/2506.23292)
Keywords: interpretability
Abstract: Recent advances in AIGC have exacerbated the misuse of malicious deepfake content, making the development of reliable deepfake detection methods an essential means to address this challenge. Although existing deepfake detection models demonstrate outstanding performance in detection metrics, most methods only provide simple binary classification results, lacking interpretability. In critical domains such as law, interpretability is crucial for enhancing the credibility and authority of decisions. Recent studies attempt to improve the interpretability of classification results by providing spatial manipulation masks or temporal forgery segments. However, the practical effectiveness of these methods remains suboptimal due to limitations of the forgery data. Most current deepfake datasets predominantly offer binary labels, only a few datasets with localization annotations. However, they suffer from restricted forgery scenarios, limited diversity in deepfake types, and insufficient data scale, making them inadequate for complex real-world scenarios. To address this predicament, we construct a novel large-scale deepfake detection and localization ($\textbf{DDL}$) dataset containing over $\textbf{1.8M}$ forged samples and encompassing up to $\textbf{75}$ distinct deepfake methods. The DDL design incorporates four key innovations: (1) $\textbf{Diverse Forgery Scenarios}$, (2) $\textbf{Comprehensive Deepfake Methods}$, (3) $\textbf{Varied Manipulation Modes}$, and (4) $\textbf{Fine-grained Forgery Annotations}$. Through these improvements, our DDL not only provides a more challenging benchmark for complex real-world forgeries, but also offers crucial support for building next-generation deepfake detection, localization, and interpretability methods. The DDL dataset project page is on this https URL.

Title: Objective-Free Local Learning and Emergent Language Structure in Thinking Machines

Authors: P. Myles Eugenio
Subjects: cs.CL, cs.AI, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.23293
Pdf URL: https://arxiv.org/pdf/2506.23293
Copy Paste: [[2506.23293]] Objective-Free Local Learning and Emergent Language Structure in Thinking Machines(https://arxiv.org/abs/2506.23293)
Keywords: generative
Abstract: We present a neuro-symbolic framework for generative language modeling based on local, event-driven emergent learning. At its core is a hierarchical Hopfield memory chain acting as a compositional short-term memory and dynamic tokenizer (retokenizer). Rather than relying on predefined tokens or supervision, the model builds structure from scratch, learning symbol sequences as multi-scale representations. It constructs projection tensors that bind co-occurring features into hierarchical tokens, introducing redundancy (i.e an emergent gauge structure) and enabling compression of local activations into long-range dependencies. Curiously, we find that the retokenizer can filter natural language patterns from noise, generating synthetic languages with coherent internal morphology -- quantifiably the same as human language. Language is learned in a local (Hebbian) fashion, where model constraints dictate allowed emergent structure, and new information is retained in alignment with this structure. The absence of a global objective enables a form of plasticity not found in conventional language models, allowing the system to generalize beyond its initial inference class -- even without explicit data. We demonstrate that briefly activating a new neuron during inference binds distributed multi-scale token features into a symbolic embedding. These emergent embedding neurons act as long-term memory and support a key-value mechanism for compositional inference and generalization. This architecture provides a methodological foundation for studying how symbolic structure can emerge from local neural learning. It offers a new pathway for building scalable, interpretable neuro-symbolic systems -- where tokens, grammar, and reasoning arise as compressed memory traces within a Hopfield hierarchy. This approach advances the development of neuromorphic architectures for generative language models.

Title: Threshold Signatures for Central Bank Digital Currencies

Authors: Mostafa Abdelrahman, Filip Rezabek, Lars Hupel, Kilian Glas, Georg Carle
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.23294
Pdf URL: https://arxiv.org/pdf/2506.23294
Copy Paste: [[2506.23294]] Threshold Signatures for Central Bank Digital Currencies(https://arxiv.org/abs/2506.23294)
Keywords: security
Abstract: Digital signatures are crucial for securing Central Bank Digital Currencies (CBDCs) transactions. Like most forms of digital currencies, CBDC solutions rely on signatures for transaction authenticity and integrity, leading to major issues in the case of private key compromise. Our work explores threshold signature schemes (TSSs) in the context of CBDCs. TSSs allow distributed key management and signing, reducing the risk of a compromised key. We analyze CBDC-specific requirements, considering the applicability of TSSs, and use Filia CBDC solution as a base for a detailed evaluation. As most of the current solutions rely on ECDSA for compatibility, we focus on ECDSA-based TSSs and their supporting libraries. Our performance evaluation measured the computational and communication complexity across key processes, as well as the throughput and latency of end-to-end transactions. The results confirm that TSS can enhance the security of CBDC implementations while maintaining acceptable performance for real-world deployments.

Title: DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On

Authors: Xiang Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23295
Pdf URL: https://arxiv.org/pdf/2506.23295
Copy Paste: [[2506.23295]] DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On(https://arxiv.org/abs/2506.23295)
Keywords: diffusion
Abstract: Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion. While recent advances in latent diffusion models have substantially improved visual quality, existing approaches still struggle with preserving fine-grained garment details, achieving precise garment-body alignment, maintaining inference efficiency, and generalizing to diverse poses and clothing styles. To address these challenges, we propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on. DiffFit adopts a progressive generation strategy: the first stage performs geometry-aware garment warping, aligning the garment with the target body through fine-grained deformation and pose adaptation. The second stage refines texture fidelity via a cross-modal conditional diffusion model that integrates the warped garment, the original garment appearance, and the target person image for high-quality rendering. By decoupling geometric alignment and appearance refinement, DiffFit effectively reduces task complexity and enhances both generation stability and visual realism. It excels in preserving garment-specific attributes such as textures, wrinkles, and lighting, while ensuring accurate alignment with the human body. Extensive experiments on large-scale VTON benchmarks demonstrate that DiffFit achieves superior performance over existing state-of-the-art methods in both quantitative metrics and perceptual evaluations.

Title: Securing AI Systems: A Guide to Known Attacks and Impacts

Authors: Naoto Kiribuchi, Kengo Zenitani, Takayuki Semitsu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23296
Pdf URL: https://arxiv.org/pdf/2506.23296
Copy Paste: [[2506.23296]] Securing AI Systems: A Guide to Known Attacks and Impacts(https://arxiv.org/abs/2506.23296)
Keywords: security, defense, attack, generative
Abstract: Embedded into information systems, artificial intelligence (AI) faces security threats that exploit AI-specific vulnerabilities. This paper provides an accessible overview of adversarial attacks unique to predictive and generative AI systems. We identify eleven major attack types and explicitly link attack techniques to their impacts -- including information leakage, system compromise, and resource exhaustion -- mapped to the confidentiality, integrity, and availability (CIA) security triad. We aim to equip researchers, developers, security practitioners, and policymakers, even those without specialized AI security expertise, with foundational knowledge to recognize AI-specific risks and implement effective defenses, thereby enhancing the overall security posture of AI systems.

Title: Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting

Authors: Yiming Huang, Long Bai, Beilei Cui, Yanheng Li, Tong Chen, Jie Wang, Jinlin Wu, Zhen Lei, Hongbin Liu, Hongliang Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23308
Pdf URL: https://arxiv.org/pdf/2506.23308
Copy Paste: [[2506.23308]] Endo-4DGX: Robust Endoscopic Scene Reconstruction and Illumination Correction with Gaussian Splatting(https://arxiv.org/abs/2506.23308)
Keywords: robust
Abstract: Accurate reconstruction of soft tissue is crucial for advancing automation in image-guided robotic surgery. The recent 3D Gaussian Splatting (3DGS) techniques and their variants, 4DGS, achieve high-quality renderings of dynamic surgical scenes in real-time. However, 3D-GS-based methods still struggle in scenarios with varying illumination, such as low light and over-exposure. Training 3D-GS in such extreme light conditions leads to severe optimization problems and devastating rendering quality. To address these challenges, we present Endo-4DGX, a novel reconstruction method with illumination-adaptive Gaussian Splatting designed specifically for endoscopic scenes with uneven lighting. By incorporating illumination embeddings, our method effectively models view-dependent brightness variations. We introduce a region-aware enhancement module to model the sub-area lightness at the Gaussian level and a spatial-aware adjustment module to learn the view-consistent brightness adjustment. With the illumination adaptive design, Endo-4DGX achieves superior rendering performance under both low-light and over-exposure conditions while maintaining geometric accuracy. Additionally, we employ an exposure control loss to restore the appearance from adverse exposure to the normal level for illumination-adaptive optimization. Experimental results demonstrate that Endo-4DGX significantly outperforms combinations of state-of-the-art reconstruction and restoration methods in challenging lighting environments, underscoring its potential to advance robot-assisted surgical applications. Our code is available at this https URL.

Title: Interpretable by Design: MH-AutoML for Transparent and Efficient Android Malware Detection without Compromising Performance

Authors: Joner Assolin, Gabriel Canto, Diego Kreutz, Eduardo Feitosa, Hendrio Bragança, Angelo Nogueira, Vanderson Rocha
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23314
Pdf URL: https://arxiv.org/pdf/2506.23314
Copy Paste: [[2506.23314]] Interpretable by Design: MH-AutoML for Transparent and Efficient Android Malware Detection without Compromising Performance(https://arxiv.org/abs/2506.23314)
Keywords: security, interpretability, explainability
Abstract: Malware detection in Android systems requires both cybersecurity expertise and machine learning (ML) techniques. Automated Machine Learning (AutoML) has emerged as an approach to simplify ML development by reducing the need for specialized knowledge. However, current AutoML solutions typically operate as black-box systems with limited transparency, interpretability, and experiment traceability. To address these limitations, we present MH-AutoML, a domain-specific framework for Android malware detection. MH-AutoML automates the entire ML pipeline, including data preprocessing, feature engineering, algorithm selection, and hyperparameter tuning. The framework incorporates capabilities for interpretability, debugging, and experiment tracking that are often missing in general-purpose solutions. In this study, we compare MH-AutoML against seven established AutoML frameworks: Auto-Sklearn, AutoGluon, TPOT, HyperGBM, Auto-PyTorch, LightAutoML, and MLJAR. Results show that MH-AutoML achieves better recall rates while providing more transparency and control. The framework maintains computational efficiency comparable to other solutions, making it suitable for cybersecurity applications where both performance and explainability matter.

Title: FastSeg: Efficient Training-Free Open-Vocabulary Segmentation via Hierarchical Attention Refinement Method

Authors: Quang-Huy Che, Vinh-Tiep Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23323
Pdf URL: https://arxiv.org/pdf/2506.23323
Copy Paste: [[2506.23323]] FastSeg: Efficient Training-Free Open-Vocabulary Segmentation via Hierarchical Attention Refinement Method(https://arxiv.org/abs/2506.23323)
Keywords: extraction, diffusion, segmentation
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the number of iterations with the quality of the segmentation. In this work, we propose FastSeg, a novel and efficient training-free framework with only (1+1)-step of reverse process of a pretrained diffusion model (e.g., Stable Diffusion). Moreover, instead of running multiple times for different classes, FastSeg performs segmentation for all classes at once. To further enhance the segmentation quality, FastSeg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances fused cross-attention using scale-aligned selfattention maps, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FastSeg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency.

Title: IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Authors: Parker Liu, Chenxin Li, Zhengxin Li, Yipeng Wu, Wuyang Li, Zhiqin Yang, Zhenyuan Zhang, Yunlong Lin, Sirui Han, Brandon Y. Feng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23329
Pdf URL: https://arxiv.org/pdf/2506.23329
Copy Paste: [[2506.23329]] IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering(https://arxiv.org/abs/2506.23329)
Keywords: generative
Abstract: Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.

Title: VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design

Authors: Malikussaid, Hilal Hudan Nuha
Subjects: cs.LG, cs.AI, physics.chem-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.23339
Pdf URL: https://arxiv.org/pdf/2506.23339
Copy Paste: [[2506.23339]] VALID-Mol: a Systematic Framework for Validated LLM-Assisted Molecular Design(https://arxiv.org/abs/2506.23339)
Keywords: large language model
Abstract: Large Language Models (LLMs) demonstrate remarkable potential for scientific discovery, but their application in domains requiring factual accuracy and domain-specific constraints remains challenging. In molecular design for drug discovery, LLMs can suggest creative molecular modifications but often produce chemically invalid or impractical structures. We present VALID-Mol, a systematic framework for integrating chemical validation with LLM-driven molecular design that increases the rate of generating valid chemical structures from 3% to 83%. Our approach combines methodical prompt engineering, automated chemical validation, and a fine-tuned domain-adapted LLM to ensure reliable generation of synthesizable molecules with improved properties. Beyond the specific implementation, we contribute a generalizable methodology for scientifically-constrained LLM applications, with quantifiable reliability improvements. Computational predictions suggest our framework can generate promising candidates for synthesis with up to 17-fold computationally predicted improvements in target affinity while maintaining synthetic accessibility. We provide a detailed analysis of our prompt engineering process, validation architecture, and fine-tuning approach, offering a reproducible blueprint for applying LLMs to other scientific domains where domain-specific validation is essential.

Title: Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family

Authors: Yumeng Lin, Xufeng Duan, David Haslett, Yige Chen, Zhenguang G. Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23340
Pdf URL: https://arxiv.org/pdf/2506.23340
Copy Paste: [[2506.23340]] Information Loss in LLMs' Multilingual Translation: The Role of Training Data, Language Proximity, and Language Family(https://arxiv.org/abs/2506.23340)
Keywords: robust, large language model
Abstract: Large language models have achieved impressive progress in multilingual translation, yet they continue to face challenges with certain language pairs-particularly those with limited training data or significant linguistic divergence from English. This study systematically investigates how training data, language proximity, and language family affect information loss in multilingual translation. We evaluate two large language models, GPT-4 and Llama 2, by performing round-trip translations. Translation quality was assessed using BLEU scores and BERT similarity metrics. Our results reveal a robust interaction between training data size and language distance: while abundant training data can mitigate the effects of linguistic divergence, languages structurally closer to English consistently yield higher translation quality in low-resource conditions. Among various distance metrics, orthographic, phylogenetic, syntactic, and geographical distances emerge as strong predictors of translation performance. Language family also exerts an independent influence. These findings contribute to a deeper understanding of the linguistic constraints shaping multilingual translation in large language models, emphasizing that translation quality is shaped not only by data volume but also by structural and typological relationships between languages.

Title: ATGen: A Framework for Active Text Generation

Authors: Akim Tsvigun, Daniil Vasilev, Ivan Tsvigun, Ivan Lysenko, Talgat Bektleuov, Aleksandr Medvedev, Uliana Vinogradova, Nikita Severin, Mikhail Mozikov, Andrey Savchenko, Rostislav Grigorev, Ramil Kuleev, Fedor Zhdanov, Artem Shelmanov, Ilya Makarov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23342
Pdf URL: https://arxiv.org/pdf/2506.23342
Copy Paste: [[2506.23342]] ATGen: A Framework for Active Text Generation(https://arxiv.org/abs/2506.23342)
Keywords: large language model
Abstract: Active learning (AL) has demonstrated remarkable potential in reducing the annotation effort required for training machine learning models. However, despite the surging popularity of natural language generation (NLG) tasks in recent years, the application of AL to NLG has been limited. In this paper, we introduce Active Text Generation (ATGen) - a comprehensive framework that bridges AL with text generation tasks, enabling the application of state-of-the-art AL strategies to NLG. Our framework simplifies AL-empowered annotation in NLG tasks using both human annotators and automatic annotation agents based on large language models (LLMs). The framework supports LLMs deployed as services, such as ChatGPT and Claude, or operated on-premises. Furthermore, ATGen provides a unified platform for smooth implementation and benchmarking of novel AL strategies tailored to NLG tasks. Finally, we present evaluation results for state-of-the-art AL strategies across diverse settings and multiple text generation tasks. We show that ATGen reduces both the effort of human annotators and costs associated with API calls to LLM-based annotation agents. The code of the framework is available on GitHub under the MIT license. The video presentation is available at this http URL

Title: CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation

Authors: Yi Liu, Shengqian Li, Zuzeng Lin, Feng Wang, Si Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23347
Pdf URL: https://arxiv.org/pdf/2506.23347
Copy Paste: [[2506.23347]] CycleVAR: Repurposing Autoregressive Model for Unsupervised One-Step Image Translation(https://arxiv.org/abs/2506.23347)
Keywords: transformer
Abstract: The current conditional autoregressive image generation methods have shown promising results, yet their potential remains largely unexplored in the practical unsupervised image translation domain, which operates without explicit cross-domain correspondences. A critical limitation stems from the discrete quantization inherent in traditional Vector Quantization-based frameworks, which disrupts gradient flow between the Variational Autoencoder decoder and causal Transformer, impeding end-to-end optimization during adversarial training in image space. To tackle this issue, we propose using Softmax Relaxed Quantization, a novel approach that reformulates codebook selection as a continuous probability mixing process via Softmax, thereby preserving gradient propagation. Building upon this differentiable foundation, we introduce CycleVAR, which reformulates image-to-image translation as image-conditional visual autoregressive generation by injecting multi-scale source image tokens as contextual prompts, analogous to prefix-based conditioning in language models. CycleVAR exploits two modes to generate the target image tokens, including (1) serial multi-step generation, enabling iterative refinement across scales, and (2) parallel one-step generation synthesizing all resolution outputs in a single forward pass. Experimental findings indicate that the parallel one-step generation mode attains superior translation quality with quicker inference speed than the serial multi-step mode in unsupervised scenarios. Furthermore, both quantitative and qualitative results indicate that CycleVAR surpasses previous state-of-the-art unsupervised image translation models, \textit{e}.\textit{g}., CycleGAN-Turbo.

Title: A case for data valuation transparency via DValCards

Authors: Keziah Naggita, Julienne LaChance
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23349
Pdf URL: https://arxiv.org/pdf/2506.23349
Copy Paste: [[2506.23349]] A case for data valuation transparency via DValCards(https://arxiv.org/abs/2506.23349)
Keywords: fair
Abstract: Following the rise in popularity of data-centric machine learning (ML), various data valuation methods have been proposed to quantify the contribution of each datapoint to desired ML model performance metrics (e.g., accuracy). Beyond the technical applications of data valuation methods (e.g., data cleaning, data acquisition, etc.), it has been suggested that within the context of data markets, data buyers might utilize such methods to fairly compensate data owners. Here we demonstrate that data valuation metrics are inherently biased and unstable under simple algorithmic design choices, resulting in both technical and ethical implications. By analyzing 9 tabular classification datasets and 6 data valuation methods, we illustrate how (1) common and inexpensive data pre-processing techniques can drastically alter estimated data values; (2) subsampling via data valuation metrics may increase class imbalance; and (3) data valuation metrics may undervalue underrepresented group data. Consequently, we argue in favor of increased transparency associated with data valuation in-the-wild and introduce the novel Data Valuation Cards (DValCards) framework towards this aim. The proliferation of DValCards will reduce misuse of data valuation metrics, including in data pricing, and build trust in responsible ML systems.

Title: GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Authors: Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Masato Taki, Yutaka Matsuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23352
Pdf URL: https://arxiv.org/pdf/2506.23352
Copy Paste: [[2506.23352]] GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields(https://arxiv.org/abs/2506.23352)
Keywords: large language model, segmentation
Abstract: The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language. The code is available at this https URL.

Title: Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement

Authors: Siyuan Chai, Xiaodong Guo, Tong Liu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.23353
Pdf URL: https://arxiv.org/pdf/2506.23353
Copy Paste: [[2506.23353]] Layer Decomposition and Morphological Reconstruction for Task-Oriented Infrared Image Enhancement(https://arxiv.org/abs/2506.23353)
Keywords: extraction, segmentation
Abstract: Infrared image helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared image often suffers from low contrast, especially in non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise and losing important information remains a challenge. To address these challenges, we propose a task-oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design an layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction. Then, we propose a morphological reconstruction-based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods.

Title: Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment

Authors: Pawel Renc, Michal K. Grzeszczyk, Linglong Qian, Nassim Oufattole, Jeff Rasley, Arkadiusz Sitek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23358
Pdf URL: https://arxiv.org/pdf/2506.23358
Copy Paste: [[2506.23358]] Federated Timeline Synthesis: Scalable and Private Methodology For Model Training and Deployment(https://arxiv.org/abs/2506.23358)
Keywords: privacy, federate, transformer, generative
Abstract: We present Federated Timeline Synthesis (FTS), a novel framework for training generative foundation models across distributed timeseries data applied to electronic health records (EHR). At its core, FTS represents patient history as tokenized Patient Health Timelines (PHTs), language-agnostic sequences encoding temporal, categorical, and continuous clinical information. Each institution trains an autoregressive transformer on its local PHTs and transmits only model weights to a central server. The server uses the generators to synthesize a large corpus of trajectories and train a Global Generator (GG), enabling zero-shot inference via Monte Carlo simulation of future PHTs. We evaluate FTS on five clinically meaningful prediction tasks using MIMIC-IV data, showing that models trained on synthetic data generated by GG perform comparably to those trained on real data. FTS offers strong privacy guarantees, scalability across institutions, and extensibility to diverse prediction and simulation tasks especially in healthcare, including counterfactual inference, early warning detection, and synthetic trial design.

Title: OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Authors: Yuanhao Cai, He Zhang, Xi Chen, Jinbo Xing, Yiwei Hu, Yuqian Zhou, Kai Zhang, Zhifei Zhang, Soo Ye Kim, Tianyu Wang, Yulun Zhang, Xiaokang Yang, Zhe Lin, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23361
Pdf URL: https://arxiv.org/pdf/2506.23361
Copy Paste: [[2506.23361]] OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions(https://arxiv.org/abs/2506.23361)
Keywords: diffusion, transformer
Abstract: Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem that how to use the signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: this https URL. Our code will be released at this https URL

Title: When Additive Noise Meets Unobserved Mediators: Bivariate Denoising Diffusion for Causal Discovery

Authors: Dominik Meier, Sujai Hiremath, Promit Ghosal, Kyra Gan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23374
Pdf URL: https://arxiv.org/pdf/2506.23374
Copy Paste: [[2506.23374]] When Additive Noise Meets Unobserved Mediators: Bivariate Denoising Diffusion for Causal Discovery(https://arxiv.org/abs/2506.23374)
Keywords: diffusion
Abstract: Distinguishing cause and effect from bivariate observational data is a foundational problem in many disciplines, but challenging without additional assumptions. Additive noise models (ANMs) are widely used to enable sample-efficient bivariate causal discovery. However, conventional ANM-based methods fail when unobserved mediators corrupt the causal relationship between variables. This paper makes three key contributions: first, we rigorously characterize why standard ANM approaches break down in the presence of unmeasured mediators. Second, we demonstrate that prior solutions for hidden mediation are brittle in finite sample settings, limiting their practical utility. To address these gaps, we propose Bivariate Denoising Diffusion (BiDD) for causal discovery, a method designed to handle latent noise introduced by unmeasured mediators. Unlike prior methods that infer directionality through mean squared error loss comparisons, our approach introduces a novel independence test statistic: during the noising and denoising processes for each variable, we condition on the other variable as input and evaluate the independence of the predicted noise relative to this input. We prove asymptotic consistency of BiDD under the ANM, and conjecture that it performs well under hidden mediation. Experiments on synthetic and real-world data demonstrate consistent performance, outperforming existing methods in mediator-corrupted settings while maintaining strong performance in mediator-free settings.

Title: Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs

Authors: Taejin Kim, Siun-Chuon Mau, Konrad Vesey
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23377
Pdf URL: https://arxiv.org/pdf/2506.23377
Copy Paste: [[2506.23377]] Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs(https://arxiv.org/abs/2506.23377)
Keywords: large language model
Abstract: Large language models (LLMs) are used in a variety of mission-critical roles. Due to the rapidly developing nature of LLMs, there is a lack of quantifiable understanding of the bias and perspective associated with LLM output. Inspired by this need, this paper considers the broader issue of perspective or viewpoint of general text and perspective control of large-language model (LLM) output. Perspective-Dial consists of two main components: a (1) metric space, dubbed Perspective Space, that enables quantitative measurements of different perspectives regarding a topic, and the use of (2) Systematic Prompt Engineering that utilizes greedy-coordinate descent to control LLM output perspective based on measurement feedback from the Perspective Space. The empirical nature of the approach allows progress to side step a principled understanding of perspective or bias -- effectively quantifying and adjusting outputs for a variety of topics. Potential applications include detection, tracking and mitigation of LLM bias, narrative detection, sense making and tracking in public discourse, and debate bot advocating given perspective.

Title: Hierarchical Memory Organization for Wikipedia Generation

Authors: Eugene J. Yu, Dawei Zhu, Yifan Song, Xiangyu Wong, Jiebin Zhang, Wenxuan Shi, Xiaoguang Li, Qun Liu, Sujian Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23393
Pdf URL: https://arxiv.org/pdf/2506.23393
Copy Paste: [[2506.23393]] Hierarchical Memory Organization for Wikipedia Generation(https://arxiv.org/abs/2506.23393)
Keywords: robust
Abstract: Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.

Title: Do LLMs Dream of Discrete Algorithms?

Authors: Claudionor Coelho Jr, Yanen Li, Philip Tee
Subjects: cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2506.23408
Pdf URL: https://arxiv.org/pdf/2506.23408
Copy Paste: [[2506.23408]] Do LLMs Dream of Discrete Algorithms?(https://arxiv.org/abs/2506.23408)
Keywords: robust, interpretability, large language model
Abstract: Large Language Models (LLMs) have rapidly transformed the landscape of artificial intelligence, enabling natural language interfaces and dynamic orchestration of software components. However, their reliance on probabilistic inference limits their effectiveness in domains requiring strict logical reasoning, discrete decision-making, and robust interpretability. This paper investigates these limitations and proposes a neurosymbolic approach that augments LLMs with logic-based reasoning modules, particularly leveraging Prolog predicates and composable toolsets. By integrating first-order logic and explicit rule systems, our framework enables LLMs to decompose complex queries into verifiable sub-tasks, orchestrate reliable solutions, and mitigate common failure modes such as hallucination and incorrect step decomposition. We demonstrate the practical benefits of this hybrid architecture through experiments on the DABStep benchmark, showing improved precision, coverage, and system documentation in multi-step reasoning tasks. Our results indicate that combining LLMs with modular logic reasoning restores engineering rigor, enhances system reliability, and offers a scalable path toward trustworthy, interpretable AI agents across complex domains.

Title: Datasets for Fairness in Language Models: An In-Depth Survey

Authors: Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23411
Pdf URL: https://arxiv.org/pdf/2506.23411
Copy Paste: [[2506.23411]] Datasets for Fairness in Language Models: An In-Depth Survey(https://arxiv.org/abs/2506.23411)
Keywords: fair
Abstract: Fairness benchmarks play a central role in shaping how we evaluate language models, yet surprisingly little attention has been given to examining the datasets that these benchmarks rely on. This survey addresses that gap by presenting a broad and careful review of the most widely used fairness datasets in current language model research, characterizing them along several key dimensions including their origin, scope, content, and intended use to help researchers better appreciate the assumptions and limitations embedded in these resources. To support more meaningful comparisons and analyses, we introduce a unified evaluation framework that reveals consistent patterns of demographic disparities across datasets and scoring methods. Applying this framework to twenty four common benchmarks, we highlight the often overlooked biases that can influence conclusions about model fairness and offer practical guidance for selecting, combining, and interpreting these datasets. We also point to opportunities for creating new fairness benchmarks that reflect more diverse social contexts and encourage more thoughtful use of these tools going forward. All code, data, and detailed results are publicly available at this https URL to promote transparency and reproducibility across the research community.

Title: BenchMake: Turn any scientific data set into a reproducible benchmark

Authors: Amanda S Barnard
Subjects: cs.LG, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2506.23419
Pdf URL: https://arxiv.org/pdf/2506.23419
Copy Paste: [[2506.23419]] BenchMake: Turn any scientific data set into a reproducible benchmark(https://arxiv.org/abs/2506.23419)
Keywords: robust
Abstract: Benchmark data sets are a cornerstone of machine learning development and applications, ensuring new methods are robust, reliable and competitive. The relative rarity of benchmark sets in computational science, due to the uniqueness of the problems and the pace of change in the associated domains, makes evaluating new innovations difficult for computational scientists. In this paper a new tool is developed and tested to potentially turn any of the increasing numbers of scientific data sets made openly available into a benchmark accessible to the community. BenchMake uses non-negative matrix factorisation to deterministically identify and isolate challenging edge cases on the convex hull (the smallest convex set that contains all existing data instances) and partitions a required fraction of matched data instances into a testing set that maximises divergence and statistical significance, across tabular, graph, image, signal and textual modalities. BenchMake splits are compared to establish splits and random splits using ten publicly available benchmark sets from different areas of science, with different sizes, shapes, distributions.

Title: TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs

Authors: Felipe Nuti, Tim Franzmeyer, João Henriques
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23423
Pdf URL: https://arxiv.org/pdf/2506.23423
Copy Paste: [[2506.23423]] TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs(https://arxiv.org/abs/2506.23423)
Keywords: attack, large language model
Abstract: Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. Here, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method tracks the model's intermediate hidden states, providing a more fine-grained insight into the effects of fine-tuning than a simple comparison of final outputs from pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass. Motivated by this finding and our theoretical analysis, we define the Tuning Contribution (TuCo) as the ratio of the magnitudes of the fine-tuning component to the pre-training component. We observe that three prominent adversarial attacks on LLMs circumvent safety measures in a way that reduces TuCo, and that TuCo is consistently lower on prompts where these attacks succeed compared to those where they do not. This suggests that attenuating the effect of fine-tuning on model outputs plays a role in the success of such attacks. In summary, TuCo enables the quantitative study of how fine-tuning influences model behavior and safety, and vice versa.

Title: Accurate Parameter-Efficient Test-Time Adaptation for Time Series Forecasting

Authors: Heitor R. Medeiros, Hossein Sharifi-Noghabi, Gabriel L. Oliveira, Saghar Irandoust
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23424
Pdf URL: https://arxiv.org/pdf/2506.23424
Copy Paste: [[2506.23424]] Accurate Parameter-Efficient Test-Time Adaptation for Time Series Forecasting(https://arxiv.org/abs/2506.23424)
Keywords: robust
Abstract: Real-world time series often exhibit a non-stationary nature, degrading the performance of pre-trained forecasting models. Test-Time Adaptation (TTA) addresses this by adjusting models during inference, but existing methods typically update the full model, increasing memory and compute costs. We propose PETSA, a parameter-efficient method that adapts forecasters at test time by only updating small calibration modules on the input and output. PETSA uses low-rank adapters and dynamic gating to adjust representations without retraining. To maintain accuracy despite limited adaptation capacity, we introduce a specialized loss combining three components: (1) a robust term, (2) a frequency-domain term to preserve periodicity, and (3) a patch-wise structural term for structural alignment. PETSA improves the adaptability of various forecasting backbones while requiring fewer parameters than baselines. Experimental results on benchmark datasets show that PETSA achieves competitive or better performance across all horizons. Our code is available at: this https URL

Title: Pipelined Decoder for Efficient Context-Aware Text Generation

Authors: Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, Gong Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23431
Pdf URL: https://arxiv.org/pdf/2506.23431
Copy Paste: [[2506.23431]] Pipelined Decoder for Efficient Context-Aware Text Generation(https://arxiv.org/abs/2506.23431)
Keywords: generative
Abstract: As the basis of generative AI, an autoregressive model requires the generation of a new token depending on all the previously generated tokens, which brings high quality but also restricts the model to generate tokens one by one, forming a bottleneck limiting the generation speed. In this paper, we propose a new decoder architecture that efficiently generates text in parallel for context-aware generation tasks. Our proposed pipelined decoder initiates the generation of multiple subsequences simultaneously, and, at each time-step, it generates a new token for each subsequence to realize parallelism. Experiments on multiple text generation tasks, including question answering, text summarization, and keyphrase generation, show that our pipelined decoder significantly improves the generation speed without a significant loss of generation quality or additional memory consumption.

Title: PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions

Authors: Mahesh Bhosale, Abdul Wasi, Yuanhao Zhai, Yunjie Tian, Samuel Border, Nan Xi, Pinaki Sarder, Junsong Yuan, David Doermann, Xuan Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23440
Pdf URL: https://arxiv.org/pdf/2506.23440
Copy Paste: [[2506.23440]] PathDiff: Histopathology Image Synthesis with Unpaired Text and Mask Conditions(https://arxiv.org/abs/2506.23440)
Keywords: privacy, diffusion, generative, segmentation
Abstract: Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods.

Title: Enhancing Insider Threat Detection Using User-Based Sequencing and Transformer Encoders

Authors: Mohamed Elbasheer, Adewale Akinfaderin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23446
Pdf URL: https://arxiv.org/pdf/2506.23446
Copy Paste: [[2506.23446]] Enhancing Insider Threat Detection Using User-Based Sequencing and Transformer Encoders(https://arxiv.org/abs/2506.23446)
Keywords: transformer
Abstract: Insider threat detection presents unique challenges due to the authorized status of malicious actors and the subtlety of anomalous behaviors. Existing machine learning methods often treat user activity as isolated events, thereby failing to leverage sequential dependencies in user behavior. In this study, we propose a User-Based Sequencing (UBS) methodology, transforming the CERT insider threat dataset into structured temporal sequences suitable for deep sequential modeling. We deploy a Transformer Encoder architecture to model benign user activity and employ its reconstruction errors as anomaly scores. These scores are subsequently evaluated using three unsupervised outlier detection algorithms: One-Class SVM (OCSVM), Local Outlier Factor (LOF), and Isolation Forest (iForest). Across four rigorously designed test sets, including combinations of multiple CERT dataset releases, our UBS-Transformer pipeline consistently achieves state-of-the-art performance - notably 96.61% accuracy, 99.43% recall, 96.38% F1-score, 95.00% AUROC, and exceptionally low false negative (0.0057) and false positive (0.0571) rates. Comparative analyses demonstrate that our approach substantially outperforms tabular and conventional autoencoder baselines, underscoring the efficacy of sequential user modeling and advanced anomaly detection in the insider threat domain.

Title: Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation

Authors: Dewen Zeng, Xinrong Hu, Yu-Jen Chen, Yawen Wu, Xiaowei Xu, Yiyu Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23460
Pdf URL: https://arxiv.org/pdf/2506.23460
Copy Paste: [[2506.23460]] Contrastive Learning with Diffusion Features for Weakly Supervised Medical Image Segmentation(https://arxiv.org/abs/2506.23460)
Keywords: robust, diffusion, segmentation
Abstract: Weakly supervised semantic segmentation (WSSS) methods using class labels often rely on class activation maps (CAMs) to localize objects. However, traditional CAM-based methods struggle with partial activations and imprecise object boundaries due to optimization discrepancies between classification and segmentation. Recently, the conditional diffusion model (CDM) has been used as an alternative for generating segmentation masks in WSSS, leveraging its strong image generation capabilities tailored to specific class distributions. By modifying or perturbing the condition during diffusion sampling, the related objects can be highlighted in the generated images. Yet, the saliency maps generated by CDMs are prone to noise from background alterations during reverse diffusion. To alleviate the problem, we introduce Contrastive Learning with Diffusion Features (CLDF), a novel method that uses contrastive learning to train a pixel decoder to map the diffusion features from a frozen CDM to a low-dimensional embedding space for segmentation. Specifically, we integrate gradient maps generated from CDM external classifier with CAMs to identify foreground and background pixels with fewer false positives/negatives for contrastive learning, enabling robust pixel embedding learning. Experimental results on four segmentation tasks from two public medical datasets demonstrate that our method significantly outperforms existing baselines.

Title: Time-variant Image Inpainting via Interactive Distribution Transition Estimation

Authors: Yun Xing, Qing Guo, Xiaoguang Li, Yihao Huang, Xiaofeng Cao, Di Lin, Ivor Tsang, Lei Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23461
Pdf URL: https://arxiv.org/pdf/2506.23461
Copy Paste: [[2506.23461]] Time-variant Image Inpainting via Interactive Distribution Transition Estimation(https://arxiv.org/abs/2506.23461)
Keywords: diffusion
Abstract: In this work, we focus on a novel and practical task, i.e., Time-vAriant iMage inPainting (TAMP). The aim of TAMP is to restore a damaged target image by leveraging the complementary information from a reference image, where both images captured the same scene but with a significant time gap in between, i.e., time-variant images. Different from conventional reference-guided image inpainting, the reference image under TAMP setup presents significant content distinction to the target image and potentially also suffers from damages. Such an application frequently happens in our daily lives to restore a damaged image by referring to another reference image, where there is no guarantee of the reference image's source and quality. In particular, our study finds that even state-of-the-art (SOTA) reference-guided image inpainting methods fail to achieve plausible results due to the chaotic image complementation. To address such an ill-posed problem, we propose a novel Interactive Distribution Transition Estimation (InDiTE) module which interactively complements the time-variant images with adaptive semantics thus facilitate the restoration of damaged regions. To further boost the performance, we propose our TAMP solution, namely Interactive Distribution Transition Estimation-driven Diffusion (InDiTE-Diff), which integrates InDiTE with SOTA diffusion model and conducts latent cross-reference during sampling. Moreover, considering the lack of benchmarks for TAMP task, we newly assembled a dataset, i.e., TAMP-Street, based on existing image and mask datasets. We conduct experiments on the TAMP-Street datasets under two different time-variant image inpainting settings, which show our method consistently outperform SOTA reference-guided image inpainting methods for solving TAMP.

Title: Can We Predict the Unpredictable? Leveraging DisasterNet-LLM for Multimodal Disaster Classification

Authors: Manaswi Kulahara, Gautam Siddharth Kashyap, Nipun Joshi, Arpita Soni
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23462
Pdf URL: https://arxiv.org/pdf/2506.23462
Copy Paste: [[2506.23462]] Can We Predict the Unpredictable? Leveraging DisasterNet-LLM for Multimodal Disaster Classification(https://arxiv.org/abs/2506.23462)
Keywords: transformer, large language model
Abstract: Effective disaster management requires timely and accurate insights, yet traditional methods struggle to integrate multimodal data such as images, weather records, and textual reports. To address this, we propose DisasterNet-LLM, a specialized Large Language Model (LLM) designed for comprehensive disaster analysis. By leveraging advanced pretraining, cross-modal attention mechanisms, and adaptive transformers, DisasterNet-LLM excels in disaster classification. Experimental results demonstrate its superiority over state-of-the-art models, achieving higher accuracy of 89.5%, an F1 score of 88.0%, AUC of 0.92%, and BERTScore of 0.88% in multimodal disaster classification tasks.

Title: What to Keep and What to Drop: Adaptive Table Filtering Framework

Authors: Jang Won June
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23463
Pdf URL: https://arxiv.org/pdf/2506.23463
Copy Paste: [[2506.23463]] What to Keep and What to Drop: Adaptive Table Filtering Framework(https://arxiv.org/abs/2506.23463)
Keywords: large language model
Abstract: Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by ~70\%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF's ability to adaptively balance informativeness and minimalism across tasks.

Title: Sanitizing Manufacturing Dataset Labels Using Vision-Language Models

Authors: Nazanin Mahjourian, Vinh Nguyen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23465
Pdf URL: https://arxiv.org/pdf/2506.23465
Copy Paste: [[2506.23465]] Sanitizing Manufacturing Dataset Labels Using Vision-Language Models(https://arxiv.org/abs/2506.23465)
Keywords: robust
Abstract: The success of machine learning models in industrial applications is heavily dependent on the quality of the datasets used to train the models. However, large-scale datasets, specially those constructed from crowd-sourcing and web-scraping, often suffer from label noise, inconsistencies, and errors. This problem is particularly pronounced in manufacturing domains, where obtaining high-quality labels is costly and time-consuming. This paper introduces Vision-Language Sanitization and Refinement (VLSR), which is a vision-language-based framework for label sanitization and refinement in multi-label manufacturing image datasets. This method embeds both images and their associated textual labels into a shared semantic space leveraging the CLIP vision-language model. Then two key tasks are addressed in this process by computing the cosine similarity between embeddings. First, label sanitization is performed to identify irrelevant, misspelled, or semantically weak labels, and surface the most semantically aligned label for each image by comparing image-label pairs using cosine similarity between image and label embeddings. Second, the method applies density-based clustering on text embeddings, followed by iterative cluster merging, to group semantically similar labels into unified label groups. The Factorynet dataset, which includes noisy labels from both human annotations and web-scraped sources, is employed to evaluate the effectiveness of the proposed framework. Experimental results demonstrate that the VLSR framework successfully identifies problematic labels and improves label consistency. This method enables a significant reduction in label vocabulary through clustering, which ultimately enhances the dataset's quality for training robust machine learning models in industrial applications with minimal human intervention.

Title: AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays

Authors: Chenlang Yi, Zizhan Xiong, Qi Qi, Xiyuan Wei, Girish Bathla, Ching-Long Lin, Bobak Jack Mortazavi, Tianbao Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23467
Pdf URL: https://arxiv.org/pdf/2506.23467
Copy Paste: [[2506.23467]] AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays(https://arxiv.org/abs/2506.23467)
Keywords: robust, fair
Abstract: Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.

Title: Interactive Interface For Semantic Segmentation Dataset Synthesis

Authors: Ngoc-Do Tran, Minh-Tuan Huynh, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23470
Pdf URL: https://arxiv.org/pdf/2506.23470
Copy Paste: [[2506.23470]] Interactive Interface For Semantic Segmentation Dataset Synthesis(https://arxiv.org/abs/2506.23470)
Keywords: privacy, segmentation
Abstract: The rapid advancement of AI and computer vision has significantly increased the demand for high-quality annotated datasets, particularly for semantic segmentation. However, creating such datasets is resource-intensive, requiring substantial time, labor, and financial investment, and often raises privacy concerns due to the use of real-world data. To mitigate these challenges, we present SynthLab, consisting of a modular platform for visual data synthesis and a user-friendly interface. The modular architecture of SynthLab enables easy maintenance, scalability with centralized updates, and seamless integration of new features. Each module handles distinct aspects of computer vision tasks, enhancing flexibility and adaptability. Meanwhile, its interactive, user-friendly interface allows users to quickly customize their data pipelines through drag-and-drop actions. Extensive user studies involving a diverse range of users across different ages, professions, and expertise levels, have demonstrated flexible usage, and high accessibility of SynthLab, enabling users without deep technical expertise to harness AI for real-world applications.

Title: A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis

Authors: Zhiwei Lin, Bonan Ruan, Jiahao Liu, Weibo Zhao
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.23474
Pdf URL: https://arxiv.org/pdf/2506.23474
Copy Paste: [[2506.23474]] A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis(https://arxiv.org/abs/2506.23474)
Keywords: security
Abstract: The Model Context Protocol (MCP) has recently emerged as a standardized interface for connecting language models with external tools and data. As the ecosystem rapidly expands, the lack of a structured, comprehensive view of existing MCP artifacts presents challenges for research. To bridge this gap, we introduce MCPCorpus, a large-scale dataset containing around 14K MCP servers and 300 MCP clients. Each artifact is annotated with 20+ normalized attributes capturing its identity, interface configuration, GitHub activity, and metadata. MCPCorpus provides a reproducible snapshot of the real-world MCP ecosystem, enabling studies of adoption trends, ecosystem health, and implementation diversity. To keep pace with the rapid evolution of the MCP ecosystem, we provide utility tools for automated data synchronization, normalization, and inspection. Furthermore, to support efficient exploration and exploitation, we release a lightweight web-based search interface. MCPCorpus is publicly available at: this https URL.

Title: Evaluation of Geolocation Capabilities of Multimodal Large Language Models and Analysis of Associated Privacy Risks

Authors: Xian Zhang, Xiang Cheng
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.23481
Pdf URL: https://arxiv.org/pdf/2506.23481
Copy Paste: [[2506.23481]] Evaluation of Geolocation Capabilities of Multimodal Large Language Models and Analysis of Associated Privacy Risks(https://arxiv.org/abs/2506.23481)
Keywords: security, privacy, large language model
Abstract: Objectives: The rapid advancement of Multimodal Large Language Models (MLLMs) has significantly enhanced their reasoning capabilities, enabling a wide range of intelligent applications. However, these advancements also raise critical concerns regarding privacy and ethics. MLLMs are now capable of inferring the geographic location of images -- such as those shared on social media or captured from street views -- based solely on visual content, thereby posing serious risks of privacy invasion, including doxxing, surveillance, and other security threats. Methods: This study provides a comprehensive analysis of existing geolocation techniques based on MLLMs. It systematically reviews relevant litera-ture and evaluates the performance of state-of-the-art visual reasoning models on geolocation tasks, particularly in identifying the origins of street view imagery. Results: Empirical evaluation reveals that the most advanced visual large models can successfully localize the origin of street-level imagery with up to $49\%$ accuracy within a 1-kilometer radius. This performance underscores the models' powerful capacity to extract and utilize fine-grained geographic cues from visual data. Conclusions: Building on these findings, the study identifies key visual elements that contribute to suc-cessful geolocation, such as text, architectural styles, and environmental features. Furthermore, it discusses the potential privacy implications associated with MLLM-enabled geolocation and discuss several technical and policy-based coun-termeasures to mitigate associated risks. Our code and dataset are available at this https URL.

Title: MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

Authors: Jun Huang, Ting Liu, Yihang Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23482
Pdf URL: https://arxiv.org/pdf/2506.23482
Copy Paste: [[2506.23482]] MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting(https://arxiv.org/abs/2506.23482)
Keywords: diffusion, generative
Abstract: Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.

Title: Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent

Authors: Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.23485
Pdf URL: https://arxiv.org/pdf/2506.23485
Copy Paste: [[2506.23485]] Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent(https://arxiv.org/abs/2506.23485)
Keywords: large language model
Abstract: Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the agent's and human experts' experiences. Moreover, we designed a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at:this https URL.

Title: Qwen-GUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding

Authors: ZongHan Hsieh, Tzer-Jen Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23491
Pdf URL: https://arxiv.org/pdf/2506.23491
Copy Paste: [[2506.23491]] Qwen-GUI-3B: A Lightweight Vision-Language Model for Cross-Resolution GUI Grounding(https://arxiv.org/abs/2506.23491)
Keywords: robust
Abstract: This paper introduces Qwen-GUI-3B, a lightweight Vision-Language Model (VLM) specifically designed for Graphical User Interface grounding tasks, achieving performance competitive with significantly larger models. Unlike large-scale VLMs (>7B parameters) that are computationally intensive and impractical for consumer-grade hardware, Qwen-GUI-3B delivers strong grounding accuracy while being fully trainable on a single GPU (RTX 4090). The model incorporates several key innovations: (i) combine cross-platform, multi-resolution dataset of 24K examples from diverse sources including mobile, desktop, and web GUI screenshots to effectively address data scarcity in high-resolution desktop environments; (ii) a two-stage fine-tuning strategy, where initial cross-platform training establishes robust GUI understanding, followed by specialized fine-tuning on high-resolution data to significantly enhance model adaptability; and (iii) data curation and redundancy reduction strategies, demonstrating that randomly sampling a smaller subset with reduced redundancy achieves performance comparable to larger datasets, emphasizing data diversity over sheer volume. Empirical evaluation on standard GUI grounding benchmarks-including ScreenSpot, ScreenSpot-v2, and the challenging ScreenSpot-Pro, highlights Qwen-GUI-3B's exceptional accuracy, achieving 84.9% on ScreenSpot and 86.4% on ScreenSpot-v2, surpassing prior models under 4B parameters. Ablation studies validate the critical role of balanced sampling and two-stage fine-tuning in enhancing robustness, particularly in high-resolution desktop scenarios. The Qwen-GUI-3B is available at: this https URL

Title: Sample Margin-Aware Recalibration of Temperature Scaling

Authors: Haolan Guo, Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23492
Pdf URL: https://arxiv.org/pdf/2506.23492
Copy Paste: [[2506.23492]] Sample Margin-Aware Recalibration of Temperature Scaling(https://arxiv.org/abs/2506.23492)
Keywords: robust
Abstract: Recent advances in deep learning have significantly improved predictive accuracy. However, modern neural networks remain systematically overconfident, posing risks for deployment in safety-critical scenarios. Current post-hoc calibration methods face a fundamental dilemma: global approaches like Temperature Scaling apply uniform adjustments across all samples, introducing high bias despite computational efficiency, while more expressive methods that operate on full logit distributions suffer from high variance due to noisy high-dimensional inputs and insufficient validation data. To address these challenges, we propose Sample Margin-Aware Recalibration of Temperature (SMART), a lightweight, data-efficient recalibration method that precisely scales logits based on the margin between the top two logits -- termed the logit gap. Specifically, the logit gap serves as a denoised, scalar signal directly tied to decision boundary uncertainty, providing a robust indicator that avoids the noise inherent in high-dimensional logit spaces while preserving model prediction invariance. Meanwhile, SMART employs a novel soft-binned Expected Calibration Error (SoftECE) objective that balances model bias and variance through adaptive binning, enabling stable parameter updates even with extremely limited calibration data. Extensive evaluations across diverse datasets and architectures demonstrate that SMART achieves state-of-the-art calibration performance even with substantially fewer parameters compared to existing parametric methods, offering a principled, robust, and highly efficient solution for practical uncertainty quantification in neural network predictions. The source code is available at: this https URL.

Title: LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching

Authors: Mengxiao Tian, Xinxiao Wu, Shuo Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23502
Pdf URL: https://arxiv.org/pdf/2506.23502
Copy Paste: [[2506.23502]] LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching(https://arxiv.org/abs/2506.23502)
Keywords: large language model
Abstract: Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. Due to image-level visual-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge generated by large language models (LLMs). Specifically, we design an action triplet prompt and an action state prompt to exploit compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs. Subsequently, we propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations, which further improves the performance. Comprehensive experimental results on two benchmark datasets demonstrate the effectiveness of our method.

Title: Improve Underwater Object Detection through YOLOv12 Architecture and Physics-informed Augmentation

Authors: Tinh Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23505
Pdf URL: https://arxiv.org/pdf/2506.23505
Copy Paste: [[2506.23505]] Improve Underwater Object Detection through YOLOv12 Architecture and Physics-informed Augmentation(https://arxiv.org/abs/2506.23505)
Keywords: robust
Abstract: Underwater object detection is crucial for autonomous navigation, environmental monitoring, and marine exploration, but it is severely hampered by light attenuation, turbidity, and occlusion. Current methods balance accuracy and computational efficiency, but they have trouble deploying in real-time under low visibility conditions. Through the integration of physics-informed augmentation techniques with the YOLOv12 architecture, this study advances underwater detection. With Residual ELAN blocks to preserve structural features in turbid waters and Area Attention to maintain large receptive fields for occluded objects while reducing computational complexity. Underwater optical properties are addressed by domain-specific augmentations such as turbulence adaptive blurring, biologically grounded occlusion simulation, and spectral HSV transformations for color distortion. Extensive tests on four difficult datasets show state-of-the-art performance, with Brackish data registering 98.30% mAP at 142 FPS. YOLOv12 improves occlusion robustness by 18.9%, small-object recall by 22.4%, and detection precision by up to 7.94% compared to previous models. The crucial role of augmentation strategy is validated by ablation studies. This work offers a precise and effective solution for conservation and underwater robotics applications.

Title: Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably

Authors: Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23508
Pdf URL: https://arxiv.org/pdf/2506.23508
Copy Paste: [[2506.23508]] Reinforcement Fine-Tuning Enables MLLMs Learning Novel Tasks Stably(https://arxiv.org/abs/2506.23508)
Keywords: large language model
Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt multimodal large language models to downstream tasks. While effective at task adaptation, their impact on prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on an open-source multimodal model, Qwen2.5-VL. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly on novel tasks but maintains prior knowledge. We analyze this phenomenon through the lens of learning dynamics, showing that RFT reinforces correct samples that are naturally aligned with the base model's probability landscape, mitigating interference with prior knowledge. Moreover, supervised training on correct RFT-simulated rollouts allows SFT to preserve knowledge while rapidly learning new tasks. These findings suggest that data distribution, rather than algorithmic differences, plays a central role in forgetting, and highlight RFT's potential for stable continual learning in multimodal large language models.

Title: ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models

Authors: Zixun Fang, Kai Zhu, Zhiheng Liu, Yu Liu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23513
Pdf URL: https://arxiv.org/pdf/2506.23513
Copy Paste: [[2506.23513]] ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models(https://arxiv.org/abs/2506.23513)
Keywords: diffusion
Abstract: Panoramic video generation aims to synthesize 360-degree immersive videos, holding significant importance in the fields of VR, world models, and spatial intelligence. Existing works fail to synthesize high-quality panoramic videos due to the inherent modality gap between panoramic data and perspective data, which constitutes the majority of the training data for modern diffusion models. In this paper, we propose a novel framework utilizing pretrained perspective video models for generating panoramic videos. Specifically, we design a novel panorama representation named ViewPoint map, which possesses global spatial continuity and fine-grained visual details simultaneously. With our proposed Pano-Perspective attention mechanism, the model benefits from pretrained perspective priors and captures the panoramic spatial correlations of the ViewPoint map effectively. Extensive experiments demonstrate that our method can synthesize highly dynamic and spatially consistent panoramic videos, achieving state-of-the-art performance and surpassing previous methods.

Title: FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization

Authors: Seung-Wook Kim, Seongyeol Kim, Jiah Kim, Seowon Ji, Se-Ho Lee
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23516
Pdf URL: https://arxiv.org/pdf/2506.23516
Copy Paste: [[2506.23516]] FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization(https://arxiv.org/abs/2506.23516)
Keywords: robust, federate
Abstract: Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.

Title: WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image

Authors: Jiwoo Park, Tae Eun Choi, Youngjun Jun, Seong Jae Hwang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23518
Pdf URL: https://arxiv.org/pdf/2506.23518
Copy Paste: [[2506.23518]] WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image(https://arxiv.org/abs/2506.23518)
Keywords: diffusion
Abstract: Generating high-quality novel views of a scene from a single image requires maintaining structural coherence across different views, referred to as view consistency. While diffusion models have driven advancements in novel view synthesis, they still struggle to preserve spatial continuity across views. Diffusion models have been combined with 3D models to address the issue, but such approaches lack efficiency due to their complex multi-step pipelines. This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization by leveraging view-guided warping to ensure view consistency. Through our comprehensive metric framework suitable for novel-view datasets, we show that our method improves view consistency across various diffusion models, demonstrating its broader applicability.

Title: Lightweight Temporal Transformer Decomposition for Federated Autonomous Driving

Authors: Tuong Do, Binh X. Nguyen, Quang D. Tran, Erman Tjiputra, Te-Chuan Chiu, Anh Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23523
Pdf URL: https://arxiv.org/pdf/2506.23523
Copy Paste: [[2506.23523]] Lightweight Temporal Transformer Decomposition for Federated Autonomous Driving(https://arxiv.org/abs/2506.23523)
Keywords: robust, federate, transformer
Abstract: Traditional vision-based autonomous driving systems often face difficulties in navigating complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. While previous high-performance methods exist, they often rely on resource-intensive fusion networks, making them impractical for training and unsuitable for federated learning. To address these challenges, we propose lightweight temporal transformer decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time predictions while leveraging temporal information to enhance autonomous driving performance. Intensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Additionally, real robot experiments further confirm the effectiveness of our method.

Title: NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning

Authors: Phan Quoc Hung Mai, Quang Hung Nguyen, Phuong Giang Duong, Hong Hanh Nguyen, Nguyen Tuan Long
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23524
Pdf URL: https://arxiv.org/pdf/2506.23524
Copy Paste: [[2506.23524]] NEU-ESC: A Comprehensive Vietnamese dataset for Educational Sentiment analysis and topic Classification toward multitask learning(https://arxiv.org/abs/2506.23524)
Keywords: large language model
Abstract: In the field of education, understanding students' opinions through their comments is crucial, especially in the Vietnamese language, where resources remain limited. Existing educational datasets often lack domain relevance and student slang. To address these gaps, we introduce NEU-ESC, a new Vietnamese dataset for Educational Sentiment Classification and Topic Classification, curated from university forums, which offers more samples, richer class diversity, longer texts, and broader vocabulary. In addition, we explore multitask learning using encoder-only language models (BERT), in which we showed that it achieves performance up to 83.7% and 79.8% accuracy for sentiment and topic classification tasks. We also benchmark our dataset and model with other datasets and models, including Large Language Models, and discuss these benchmarks. The dataset is publicly available at: this https URL.

Title: On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?

Authors: Jan Kvapil, Martin Fajcik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23527
Pdf URL: https://arxiv.org/pdf/2506.23527
Copy Paste: [[2506.23527]] On Recipe Memorization and Creativity in Large Language Models: Is Your Model a Creative Cook, a Bad Cook, or Merely a Plagiator?(https://arxiv.org/abs/2506.23527)
Keywords: large language model
Abstract: This work-in-progress investigates the memorization, creativity, and nonsense found in cooking recipes generated from Large Language Models (LLMs). Precisely, we aim (i) to analyze memorization, creativity, and non-sense in LLMs using a small, high-quality set of human judgments and (ii) to evaluate potential approaches to automate such a human annotation in order to scale our study to hundreds of recipes. To achieve (i), we conduct a detailed human annotation on 20 preselected recipes generated by LLM (Mixtral), extracting each recipe's ingredients and step-by-step actions to assess which elements are memorized--i.e., directly traceable to online sources possibly seen during training--and which arise from genuine creative synthesis or outright nonsense. We find that Mixtral consistently reuses ingredients that can be found in online documents, potentially seen during model training, suggesting strong reliance on memorized content. To achieve aim (ii) and scale our analysis beyond small sample sizes and single LLM validation, we design an ``LLM-as-judge'' pipeline that automates recipe generation, nonsense detection, parsing ingredients and recipe steps, and their annotation. For instance, comparing its output against human annotations, the best ingredient extractor and annotator is Llama 3.1+Gemma 2 9B, achieving up to 78% accuracy on ingredient matching. This automated framework enables large-scale quantification of memorization, creativity, and nonsense in generated recipes, providing rigorous evidence of the models' creative capacities.

Title: Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound

Authors: Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23538
Pdf URL: https://arxiv.org/pdf/2506.23538
Copy Paste: [[2506.23538]] Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound(https://arxiv.org/abs/2506.23538)
Keywords: diffusion
Abstract: Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage, preterm birth, and an increased risk of pregnancy complications. Compared to traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane, providing a clear visualization of the uterine morphology for assessing CUAs accurately. In this paper, we propose an intelligent system for simultaneous automated plane localization and CUA diagnosis. Our highlights are: 1) we develop a denoising diffusion model with local (plane) and global (volume/text) guidance, using an adaptive weighting strategy to optimize attention allocation to different conditions; 2) we introduce a reinforcement learning-based framework with unsupervised rewards to extract the key slice summary from redundant sequences, fully integrating information across multiple planes to reduce learning difficulty; 3) we provide text-driven uncertainty modeling for coarse prediction, and leverage it to adjust the classification probability for overall performance improvement. Extensive experiments on a large 3D uterine US dataset show the efficacy of our method, in terms of plane localization and CUA diagnosis. Code is available at this https URL.

Title: Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention

Authors: Weida Wang, Changyong He, Jin Zeng, Di Qiu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23542
Pdf URL: https://arxiv.org/pdf/2506.23542
Copy Paste: [[2506.23542]] Consistent Time-of-Flight Depth Denoising via Graph-Informed Geometric Attention(https://arxiv.org/abs/2506.23542)
Keywords: robust
Abstract: Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and data fidelity term derived from ToF noise distribution, we formulate a maximum a posterior problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset. Source code will be released at \href{this https URL}{this https URL}.

Title: Pyramidal Patchification Flow for Visual Generation

Authors: Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23543
Pdf URL: https://arxiv.org/pdf/2506.23543
Copy Paste: [[2506.23543]] Pyramidal Patchification Flow for Visual Generation(https://arxiv.org/abs/2506.23543)
Keywords: diffusion, transformer
Abstract: Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations other than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speed over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with small training time. The code and checkpoint are at this https URL.

Title: A unified framework on the universal approximation of transformer-type architectures

Authors: Jingpu Cheng, Qianxiao Li, Ting Lin, Zuowei Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23551
Pdf URL: https://arxiv.org/pdf/2506.23551
Copy Paste: [[2506.23551]] A unified framework on the universal approximation of transformer-type architectures(https://arxiv.org/abs/2506.23551)
Keywords: transformer
Abstract: We investigate the universal approximation property (UAP) of transformer-type architectures, providing a unified theoretical framework that extends prior results on residual networks to models incorporating attention mechanisms. Our work identifies token distinguishability as a fundamental requirement for UAP and introduces a general sufficient condition that applies to a broad class of architectures. Leveraging an analyticity assumption on the attention layer, we can significantly simplify the verification of this condition, providing a non-constructive approach in establishing UAP for such architectures. We demonstrate the applicability of our framework by proving UAP for transformers with various attention mechanisms, including kernel-based and sparse attention mechanisms. The corollaries of our results either generalize prior works or establish UAP for architectures not previously covered. Furthermore, our framework offers a principled foundation for designing novel transformer architectures with inherent UAP guarantees, including those with specific functional symmetries. We propose examples to illustrate these insights.

Title: JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

Authors: Mingi Kwon, Joonghyuk Shin, Jaeseok Jung, Jaesik Park, Youngjung Uh
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.23552
Pdf URL: https://arxiv.org/pdf/2506.23552
Copy Paste: [[2506.23552]] JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching(https://arxiv.org/abs/2506.23552)
Keywords: diffusion, transformer, generative
Abstract: The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs-including text, reference audio, and reference motion-facilitating tasks such as synchronized talking head generation from text, audio-driven animation, and much more, within a single, coherent model. JAM-Flow significantly advances multi-modal generative modeling by providing a practical solution for holistic audio-visual synthesis. project page: this https URL

Title: Metadata, Wavelet, and Time Aware Diffusion Models for Satellite Image Super Resolution

Authors: Luigi Sigillo, Renato Giamba, Danilo Comminiello
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23566
Pdf URL: https://arxiv.org/pdf/2506.23566
Copy Paste: [[2506.23566]] Metadata, Wavelet, and Time Aware Diffusion Models for Satellite Image Super Resolution(https://arxiv.org/abs/2506.23566)
Keywords: diffusion
Abstract: The acquisition of high-resolution satellite imagery is often constrained by the spatial and temporal limitations of satellite sensors, as well as the high costs associated with frequent observations. These challenges hinder applications such as environmental monitoring, disaster response, and agricultural management, which require fine-grained and high-resolution data. In this paper, we propose MWT-Diff, an innovative framework for satellite image super-resolution (SR) that combines latent diffusion models with wavelet transforms to address these challenges. At the core of the framework is a novel metadata-, wavelet-, and time-aware encoder (MWT-Encoder), which generates embeddings that capture metadata attributes, multi-scale frequency information, and temporal relationships. The embedded feature representations steer the hierarchical diffusion dynamics, through which the model progressively reconstructs high-resolution satellite imagery from low-resolution inputs. This process preserves critical spatial characteristics including textural patterns, boundary discontinuities, and high-frequency spectral components essential for detailed remote sensing analysis. The comparative analysis of MWT-Diff across multiple datasets demonstrated favorable performance compared to recent approaches, as measured by standard perceptual quality metrics including FID and LPIPS.

Title: Event-based Tiny Object Detection: A Benchmark Dataset and Baseline

Authors: Nuo Chen, Chao Xiao, Yimian Dai, Shiman He, Miao Li, Wei An
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23575
Pdf URL: https://arxiv.org/pdf/2506.23575
Copy Paste: [[2506.23575]] Event-based Tiny Object Detection: A Benchmark Dataset and Baseline(https://arxiv.org/abs/2506.23575)
Keywords: segmentation
Abstract: Small object detection (SOD) in anti-UAV task is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large targets size, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce a Event-based Small object detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 $\times$ 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose Event based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the superiority of our method and provide a benchmark for future research in EVSOD. The dataset and code are at this https URL.

Title: StackCLIP: Clustering-Driven Stacked Prompt in Zero-Shot Industrial Anomaly Detection

Authors: Yanning Hou, Yanran Ruan, Junfa Li, Shanshan Wang, Jianfeng Qiu, Ke Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23577
Pdf URL: https://arxiv.org/pdf/2506.23577
Copy Paste: [[2506.23577]] StackCLIP: Clustering-Driven Stacked Prompt in Zero-Shot Industrial Anomaly Detection(https://arxiv.org/abs/2506.23577)
Keywords: robust, segmentation
Abstract: Enhancing the alignment between text and image features in the CLIP model is a critical challenge in zero-shot industrial anomaly detection tasks. Recent studies predominantly utilize specific category prompts during pretraining, which can cause overfitting to the training categories and limit model generalization. To address this, we propose a method that transforms category names through multicategory name stacking to create stacked prompts, forming the basis of our StackCLIP model. Our approach introduces two key components. The Clustering-Driven Stacked Prompts (CSP) module constructs generic prompts by stacking semantically analogous categories, while utilizing multi-object textual feature fusion to amplify discriminative anomalies among similar objects. The Ensemble Feature Alignment (EFA) module trains knowledge-specific linear layers tailored for each stack cluster and adaptively integrates them based on the attributes of test categories. These modules work together to deliver superior training speed, stability, and convergence, significantly boosting anomaly segmentation performance. Additionally, our stacked prompt framework offers robust generalization across classification tasks. To further improve performance, we introduce the Regulating Prompt Learning (RPL) module, which leverages the generalization power of stacked prompts to refine prompt learning, elevating results in anomaly detection classification tasks. Extensive testing on seven industrial anomaly detection datasets demonstrates that our method achieves state-of-the-art performance in both zero-shot anomaly detection and segmentation tasks.

Title: Dataset Distillation via Vision-Language Category Prototype

Authors: Yawen Zou, Guang Li, Duo Su, Zi Wang, Jun Yu, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23580
Pdf URL: https://arxiv.org/pdf/2506.23580
Copy Paste: [[2506.23580]] Dataset Distillation via Vision-Language Category Prototype(https://arxiv.org/abs/2506.23580)
Keywords: robust, large language model
Abstract: Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission costs, and computational consumption. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. The disregard for context hinders the model's generalization ability, particularly in tasks involving complex datasets, which may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes utilized in this study are derived from descriptive text information generated by an open-source large language model. This framework demonstrates broad applicability across datasets without pre-existing text descriptions, expanding the potential of dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization. Source code and generated data are available in this https URL

Title: PBCAT: Patch-based composite adversarial training against physically realizable attacks on object detection

Authors: Xiao Li, Yiming Zhu, Yifan Huang, Wei Zhang, Yingzhe He, Jie Shi, Xiaolin Hu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23581
Pdf URL: https://arxiv.org/pdf/2506.23581
Copy Paste: [[2506.23581]] PBCAT: Patch-based composite adversarial training against physically realizable attacks on object detection(https://arxiv.org/abs/2506.23581)
Keywords: security, defense, attack, robust
Abstract: Object detection plays a crucial role in many security-sensitive applications. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, \eg, adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attacks. While AT has been extensively studied in the $l_\infty$ attack settings on classification models, AT against physically realizable attacks on object detectors has received limited exploration. Early attempts are only performed to defend against adversarial patches, leaving AT against a wider range of physically realizable attacks under-explored. In this work, we consider defending against various physically realizable attacks with a unified AT method. We propose PBCAT, a novel Patch-Based Composite Adversarial Training strategy. PBCAT optimizes the model by incorporating the combination of small-area gradient-guided adversarial patches and imperceptible global adversarial perturbations covering the entire image. With these designs, PBCAT has the potential to defend against not only adversarial patches but also unseen physically realizable attacks such as adversarial textures. Extensive experiments in multiple settings demonstrated that PBCAT significantly improved robustness against various physically realizable attacks over state-of-the-art defense methods. Notably, it improved the detection accuracy by 29.7\% over previous defense methods under one recent adversarial texture attack.

Title: Detect \& Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning

Authors: Marvin Xhemrishi, Alexandre Graell i Amat, Balázs Pejó
Subjects: cs.CR, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23583
Pdf URL: https://arxiv.org/pdf/2506.23583
Copy Paste: [[2506.23583]] Detect \& Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning(https://arxiv.org/abs/2506.23583)
Keywords: secure, privacy, robust, federate
Abstract: Federated learning with secure aggregation enables private and collaborative learning from decentralised data without leaking sensitive client information. However, secure aggregation also complicates the detection of malicious client behaviour and the evaluation of individual client contributions to the learning. To address these challenges, QI (Pejo et al.) and FedGT (Xhemrishi et al.) were proposed for contribution evaluation (CE) and misbehaviour detection (MD), respectively. QI, however, lacks adequate MD accuracy due to its reliance on the random selection of clients in each training round, while FedGT lacks the CE ability. In this work, we combine the strengths of QI and FedGT to achieve both robust MD and accurate CE. Our experiments demonstrate superior performance compared to using either method independently.

Title: Transition Matching: Scalable and Flexible Generative Modeling

Authors: Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23589
Pdf URL: https://arxiv.org/pdf/2506.23589
Copy Paste: [[2506.23589]] Transition Matching: Scalable and Flexible Generative Modeling(https://arxiv.org/abs/2506.23589)
Keywords: diffusion, generative
Abstract: Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.

Title: CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Authors: Qiming Li, Zekai Ye, Xiaocheng Feng, Weihong Zhong, Libo Qin, Ruihan Chen, Baohang Li, Kui Jiang, Yaowei Wang, Ting Liu, Bing Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23590
Pdf URL: https://arxiv.org/pdf/2506.23590
Copy Paste: [[2506.23590]] CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models(https://arxiv.org/abs/2506.23590)
Keywords: generative
Abstract: Although Large Vision-Language Models (LVLMs) have demonstrated powerful capabilities in interpreting visual information, they frequently produce content that deviates from visual information, leading to object hallucination. To tackle this, recent works mostly depend on expensive manual annotations and training cost, or significantly increase inference time. In this work, we observe that LVLMs' attention to visual information is significantly stronger when answering caption queries compared to non-caption queries. Inspired by this phenomenon, we propose Caption-sensitive Attention Intervention (CAI), a training-free, plug-and-play hallucination mitigation method that leverages the attention activation pattern in response to caption queries to enhance LVLMs' visual perception capability. Extensive experimental results across four benchmarks covering both discriminative and generative tasks, demonstrate that CAI achieves state-of-the-art (SOTA) hallucination mitigating performance only with minimal additional inference cost.

Title: Cybersecurity AI: The Dangerous Gap Between Automation and Autonomy

Authors: Víctor Mayoral-Vilches
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.23592
Pdf URL: https://arxiv.org/pdf/2506.23592
Copy Paste: [[2506.23592]] Cybersecurity AI: The Dangerous Gap Between Automation and Autonomy(https://arxiv.org/abs/2506.23592)
Keywords: security, attack
Abstract: The cybersecurity industry combines "automated" and "autonomous" AI, creating dangerous misconceptions about system capabilities. Recent milestones like XBOW topping HackerOne's leaderboard showcase impressive progress, yet these systems remain fundamentally semi-autonomous--requiring human oversight. Drawing from robotics principles, where the distinction between automation and autonomy is well-established, I take inspiration from prior work and establish a 6-level taxonomy (Level 0-5) distinguishing automation from autonomy in Cybersecurity AI. Current "autonomous" pentesters operate at Level 3-4: they execute complex attack sequences but need human review for edge cases and strategic decisions. True Level 5 autonomy remains aspirational. Organizations deploying mischaracterized "autonomous" tools risk reducing oversight precisely when it's most needed, potentially creating new vulnerabilities. The path forward requires precise terminology, transparent capabilities disclosure, and human-AI partnership-not replacement.

Title: When Will It Fail?: Anomaly to Prompt for Forecasting Future Anomalies in Time Series

Authors: Min-Yeong Park, Won-Jeong Lee, Seong Tae Kim, Gyeong-Moon Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23596
Pdf URL: https://arxiv.org/pdf/2506.23596
Copy Paste: [[2506.23596]] When Will It Fail?: Anomaly to Prompt for Forecasting Future Anomalies in Time Series(https://arxiv.org/abs/2506.23596)
Keywords: robust
Abstract: Recently, forecasting future abnormal events has emerged as an important scenario to tackle real-world necessities. However, the solution of predicting specific future time points when anomalies will occur, known as Anomaly Prediction (AP), remains under-explored. Existing methods dealing with time series data fail in AP, focusing only on immediate anomalies or failing to provide precise predictions for future anomalies. To address the AP task, we propose a novel framework called Anomaly to Prompt (A2P), comprised of Anomaly-Aware Forecasting (AAF) and Synthetic Anomaly Prompting (SAP). To enable the forecasting model to forecast abnormal time points, we adopt a strategy to learn the relationships of anomalies. For the robust detection of anomalies, our proposed SAP introduces a learnable Anomaly Prompt Pool (APP) that simulates diverse anomaly patterns using signal adaptive prompt. Comprehensive experiments on multiple real-world datasets demonstrate the superiority of A2P over state-of-the-art methods, showcasing its ability to predict future anomalies. Our implementation code is available at this https URL.

Title: Semantic-guided Diverse Decoding for Large Language Model

Authors: Weijie Shi, Yue Cui, Yaguang Wu, Jingzhi Fang, Shibo Zhang, Mengze Li, Sirui Han, Jia Zhu, Jiajie Xu, Xiaofang Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23601
Pdf URL: https://arxiv.org/pdf/2506.23601
Copy Paste: [[2506.23601]] Semantic-guided Diverse Decoding for Large Language Model(https://arxiv.org/abs/2506.23601)
Keywords: large language model
Abstract: Diverse decoding of large language models is crucial for applications requiring multiple semantically distinct responses, yet existing methods primarily achieve lexical rather than semantic diversity. This limitation significantly constrains Best-of-N strategies, group-based reinforcement learning, and data synthesis. While temperature sampling and diverse beam search modify token distributions or apply n-gram penalties, they fail to ensure meaningful semantic differentiation. We introduce Semantic-guided Diverse Decoding (SemDiD), operating directly in embedding space that balances quality with diversity through three complementary mechanisms: orthogonal directional guidance, dynamic inter-group repulsion, and position-debiased probability assessment. SemDiD harmonizes these competing objectives using adaptive gain functions and constraint optimization, ensuring both quality thresholds and maximal semantic differentiation. Experiments show SemDiD consistently outperforms existing methods, improving Best-of-N coverage by 1.4-5.2% across diverse tasks and accelerating RLHF training convergence by 15% while increasing accuracy by up to 2.1%.

Title: SoK: Semantic Privacy in Large Language Models

Authors: Baihe Ma, Yanna Jiang, Xu Wang, Guangshen Yu, Qin Wang, Caijun Sun, Chen Li, Xuelei Qi, Ying He, Wei Ni, Ren Ping Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23603
Pdf URL: https://arxiv.org/pdf/2506.23603
Copy Paste: [[2506.23603]] SoK: Semantic Privacy in Large Language Models(https://arxiv.org/abs/2506.23603)
Keywords: privacy, protect, defense, attack, robust, large language model
Abstract: As Large Language Models (LLMs) are increasingly deployed in sensitive domains, traditional data privacy measures prove inadequate for protecting information that is implicit, contextual, or inferable - what we define as semantic privacy. This Systematization of Knowledge (SoK) introduces a lifecycle-centric framework to analyze how semantic privacy risks emerge across input processing, pretraining, fine-tuning, and alignment stages of LLMs. We categorize key attack vectors and assess how current defenses, such as differential privacy, embedding encryption, edge computing, and unlearning, address these threats. Our analysis reveals critical gaps in semantic-level protection, especially against contextual inference and latent representation leakage. We conclude by outlining open challenges, including quantifying semantic leakage, protecting multimodal inputs, balancing de-identification with generation quality, and ensuring transparency in privacy enforcement. This work aims to inform future research on designing robust, semantically aware privacy-preserving techniques for LLMs.

Title: AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval

Authors: Suyash Maniyar, Vishvesh Trivedi, Ajoy Mondal, Anand Mishra, C.V. Jawahar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23605
Pdf URL: https://arxiv.org/pdf/2506.23605
Copy Paste: [[2506.23605]] AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval(https://arxiv.org/abs/2506.23605)
Keywords: large language model
Abstract: Lecture slide element detection and retrieval are key problems in slide understanding. Training effective models for these tasks often depends on extensive manual annotation. However, annotating large volumes of lecture slides for supervised training is labor intensive and requires domain expertise. To address this, we propose a large language model (LLM)-guided synthetic lecture slide generation pipeline, SynLecSlideGen, which produces high-quality, coherent and realistic slides. We also create an evaluation benchmark, namely RealSlide by manually annotating 1,050 real lecture slides. To assess the utility of our synthetic slides, we perform few-shot transfer learning on real data using models pre-trained on them. Experimental results show that few-shot transfer learning with pretraining on synthetic slides significantly improves performance compared to training only on real data. This demonstrates that synthetic data can effectively compensate for limited labeled lecture slides. The code and resources of our work are publicly available on our project website: this https URL.

Title: SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion

Authors: Zhengkang Xiang, Zizhao Li, Amir Khodabandeh, Kourosh Khoshelham
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23606
Pdf URL: https://arxiv.org/pdf/2506.23606
Copy Paste: [[2506.23606]] SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion(https://arxiv.org/abs/2506.23606)
Keywords: robust, diffusion, generative, segmentation
Abstract: Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models and the proposed lidar translation framework further improves data augmentation performance in the downstream lidar segmentation task.

Title: PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum

Authors: Shiqi Zhang, Sha Zhang, Jiajun Deng, Yedong Shen, Mingxiao MA, Yanyong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23607
Pdf URL: https://arxiv.org/pdf/2506.23607
Copy Paste: [[2506.23607]] PGOV3D: Open-Vocabulary 3D Semantic Segmentation with Partial-to-Global Curriculum(https://arxiv.org/abs/2506.23607)
Keywords: large language model, segmentation
Abstract: Existing open-vocabulary 3D semantic segmentation methods typically supervise 3D segmentation models by merging text-aligned features (e.g., CLIP) extracted from multi-view images onto 3D points. However, such approaches treat multi-view images merely as intermediaries for transferring open-vocabulary information, overlooking their rich semantic content and cross-view correspondences, which limits model effectiveness. To address this, we propose PGOV3D, a novel framework that introduces a Partial-to-Global curriculum for improving open-vocabulary 3D semantic segmentation. The key innovation lies in a two-stage training strategy. In the first stage, we pre-train the model on partial scenes that provide dense semantic information but relatively simple geometry. These partial point clouds are derived from multi-view RGB-D inputs via pixel-wise depth projection. To enable open-vocabulary learning, we leverage a multi-modal large language model (MLLM) and a 2D segmentation foundation model to generate open-vocabulary labels for each viewpoint, offering rich and aligned supervision. An auxiliary inter-frame consistency module is introduced to enforce feature consistency across varying viewpoints and enhance spatial understanding. In the second stage, we fine-tune the model on complete scene-level point clouds, which are sparser and structurally more complex. We aggregate the partial vocabularies associated with each scene and generate pseudo labels using the pre-trained model, effectively bridging the semantic gap between dense partial observations and large-scale 3D environments. Extensive experiments on ScanNet, ScanNet200, and S3DIS benchmarks demonstrate that PGOV3D achieves competitive performance in open-vocabulary 3D semantic segmentation.

Title: Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs

Authors: Manuel Pratelli, Marinella Petrocchi
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.23610
Pdf URL: https://arxiv.org/pdf/2506.23610
Copy Paste: [[2506.23610]] Evaluating the Simulation of Human Personality-Driven Susceptibility to Misinformation with LLMs(https://arxiv.org/abs/2506.23610)
Keywords: large language model
Abstract: Large language models (LLMs) make it possible to generate synthetic behavioural data at scale, offering an ethical and low-cost alternative to human experiments. Whether such data can faithfully capture psychological differences driven by personality traits, however, remains an open question. We evaluate the capacity of LLM agents, conditioned on Big-Five profiles, to reproduce personality-based variation in susceptibility to misinformation, focusing on news discernment, the ability to judge true headlines as true and false headlines as false. Leveraging published datasets in which human participants with known personality profiles rated headline accuracy, we create matching LLM agents and compare their responses to the original human patterns. Certain trait-misinformation associations, notably those involving Agreeableness and Conscientiousness, are reliably replicated, whereas others diverge, revealing systematic biases in how LLMs internalize and express personality. The results underscore both the promise and the limits of personality-aligned LLMs for behavioral simulation, and offer new insight into modeling cognitive diversity in artificial agents.

Title: AttentionGS: Towards Initialization-Free 3D Gaussian Splatting via Structural Attention

Authors: Ziao Liu, Zhenjia Li, Yifeng Shi, Xiangang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23611
Pdf URL: https://arxiv.org/pdf/2506.23611
Copy Paste: [[2506.23611]] AttentionGS: Towards Initialization-Free 3D Gaussian Splatting via Structural Attention(https://arxiv.org/abs/2506.23611)
Keywords: robust
Abstract: 3D Gaussian Splatting (3DGS) is a powerful alternative to Neural Radiance Fields (NeRF), excelling in complex scene reconstruction and efficient rendering. However, it relies on high-quality point clouds from Structure-from-Motion (SfM), limiting its applicability. SfM also fails in texture-deficient or constrained-view scenarios, causing severe degradation in 3DGS reconstruction. To address this limitation, we propose AttentionGS, a novel framework that eliminates the dependency on high-quality initial point clouds by leveraging structural attention for direct 3D reconstruction from randomly initialization. In the early training stage, we introduce geometric attention to rapidly recover the global scene structure. As training progresses, we incorporate texture attention to refine fine-grained details and enhance rendering quality. Furthermore, we employ opacity-weighted gradients to guide Gaussian densification, leading to improved surface reconstruction. Extensive experiments on multiple benchmark datasets demonstrate that AttentionGS significantly outperforms state-of-the-art methods, particularly in scenarios where point cloud initialization is unreliable. Our approach paves the way for more robust and flexible 3D Gaussian Splatting in real-world applications.

Title: TurboVSR: Fantastic Video Upscalers and Where to Find Them

Authors: Zhongdao Wang, Guodongfang Zhao, Jingjing Ren, Bailan Feng, Shifeng Zhang, Wenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23618
Pdf URL: https://arxiv.org/pdf/2506.23618
Copy Paste: [[2506.23618]] TurboVSR: Fantastic Video Upscalers and Where to Find Them(https://arxiv.org/abs/2506.23618)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: (1) We employ an autoencoder with a high compression ratio of 32$\times$32$\times$8 to reduce the number of tokens. (2) Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. (3) We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods, while being 100+ times faster, taking only 7 seconds to process a 2-second long 1080p video. TurboVSR also supports image resolution by considering image as a one-frame video. Our efficient design makes SR beyond 1080p possible, results on 4K (3648$\times$2048) image SR show surprising fine details.

Title: Privacy-Preserving Federated Learning Scheme with Mitigating Model Poisoning Attacks: Vulnerabilities and Countermeasures

Authors: Jiahui Wu, Fucai Luo, Tiecheng Sun, Haiyan Wang, Weizhe Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.23622
Pdf URL: https://arxiv.org/pdf/2506.23622
Copy Paste: [[2506.23622]] Privacy-Preserving Federated Learning Scheme with Mitigating Model Poisoning Attacks: Vulnerabilities and Countermeasures(https://arxiv.org/abs/2506.23622)
Keywords: secure, security, privacy, protect, attack, robust, federate
Abstract: The privacy-preserving federated learning schemes based on the setting of two honest-but-curious and non-colluding servers offer promising solutions in terms of security and efficiency. However, our investigation reveals that these schemes still suffer from privacy leakage when considering model poisoning attacks from malicious users. Specifically, we demonstrate that the privacy-preserving computation process for defending against model poisoning attacks inadvertently leaks privacy to one of the honest-but-curious servers, enabling it to access users' gradients in plaintext. To address both privacy leakage and model poisoning attacks, we propose an enhanced privacy-preserving and Byzantine-robust federated learning (PBFL) scheme, comprising three components: (1) a two-trapdoor fully homomorphic encryption (FHE) scheme to bolster users' privacy protection; (2) a novel secure normalization judgment method to preemptively thwart gradient poisoning; and (3) an innovative secure cosine similarity measurement method for detecting model poisoning attacks without compromising data privacy. Our scheme guarantees privacy preservation and resilience against model poisoning attacks, even in scenarios with heterogeneous, non-IID (Independently and Identically Distributed) datasets. Theoretical analyses substantiate the security and efficiency of our scheme, and extensive experiments corroborate the efficacy of our private attacks. Furthermore, the experimental results demonstrate that our scheme accelerates training speed while reducing communication overhead compared to the state-of-the-art PBFL schemes.

Title: Revisiting Audio-Visual Segmentation with Vision-Centric Transformer

Authors: Shaofei Huang, Rui Ling, Tianrui Hui, Hongyu Li, Xu Zhou, Shifeng Zhang, Si Liu, Richang Hong, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23623
Pdf URL: https://arxiv.org/pdf/2506.23623
Copy Paste: [[2506.23623]] Revisiting Audio-Visual Segmentation with Vision-Centric Transformer(https://arxiv.org/abs/2506.23623)
Keywords: transformer, segmentation
Abstract: Audio-Visual Segmentation (AVS) aims to segment sound-producing objects in video frames based on the associated audio signal. Prevailing AVS methods typically adopt an audio-centric Transformer architecture, where object queries are derived from audio features. However, audio-centric Transformers suffer from two limitations: perception ambiguity caused by the mixed nature of audio, and weakened dense prediction ability due to visual detail loss. To address these limitations, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling queries to better distinguish between different sounding objects from mixed audio and accurately delineate their contours. Additionally, we also introduce a Prototype Prompted Query Generation (PPQG) module within our VCT framework to generate vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation. Extensive experiments demonstrate that our VCT framework achieves new state-of-the-art performances on three subsets of the AVSBench dataset. The code is available at this https URL.

Title: A Nonlinear Low-rank Representation Model with Convolutional Neural Network for Imputing Water Quality Data

Authors: Xin Liao, Bing Yang, Cai Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23629
Pdf URL: https://arxiv.org/pdf/2506.23629
Copy Paste: [[2506.23629]] A Nonlinear Low-rank Representation Model with Convolutional Neural Network for Imputing Water Quality Data(https://arxiv.org/abs/2506.23629)
Keywords: protect
Abstract: The integrity of Water Quality Data (WQD) is critical in environmental monitoring for scientific decision-making and ecological protection. However, water quality monitoring systems are often challenged by large amounts of missing data due to unavoidable problems such as sensor failures and communication delays, which further lead to water quality data becoming High-Dimensional and Sparse (HDS). Traditional data imputation methods are difficult to depict the potential dynamics and fail to capture the deep data features, resulting in unsatisfactory imputation performance. To effectively address the above issues, this paper proposes a Nonlinear Low-rank Representation model (NLR) with Convolutional Neural Networks (CNN) for imputing missing WQD, which utilizes CNNs to implement two ideas: a) fusing temporal features to model the temporal dependence of data between time slots, and b) Extracting nonlinear interactions and local patterns to mine higher-order relationships features and achieve deep fusion of multidimensional information. Experimental studies on three real water quality datasets demonstrate that the proposed model significantly outperforms existing state-of-the-art data imputation models in terms of estimation accuracy. It provides an effective approach for handling water quality monitoring data in complex dynamic environments.

Title: Blending Concepts with Text-to-Image Diffusion Models

Authors: Lorenzo Olearo, Giorgio Longari, Alessandro Raganato, Rafael Peñaloza, Simone Melzi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23630
Pdf URL: https://arxiv.org/pdf/2506.23630
Copy Paste: [[2506.23630]] Blending Concepts with Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.23630)
Keywords: diffusion
Abstract: Diffusion models have dramatically advanced text-to-image generation in recent years, translating abstract concepts into high-fidelity images with remarkable ease. In this work, we examine whether they can also blend distinct concepts, ranging from concrete objects to intangible ideas, into coherent new visual entities under a zero-shot framework. Specifically, concept blending merges the key attributes of multiple concepts (expressed as textual prompts) into a single, novel image that captures the essence of each concept. We investigate four blending methods, each exploiting different aspects of the diffusion pipeline (e.g., prompt scheduling, embedding interpolation, or layer-wise conditioning). Through systematic experimentation across diverse concept categories, such as merging concrete concepts, synthesizing compound words, transferring artistic styles, and blending architectural landmarks, we show that modern diffusion models indeed exhibit creative blending capabilities without further training or fine-tuning. Our extensive user study, involving 100 participants, reveals that no single approach dominates in all scenarios: each blending technique excels under certain conditions, with factors like prompt ordering, conceptual distance, and random seed affecting the outcome. These findings highlight the remarkable compositional potential of diffusion models while exposing their sensitivity to seemingly minor input variations.

Title: gMBA: Expression Semantic Guided Mixed Boolean-Arithmetic Deobfuscation Using Transformer Architectures

Authors: Youjeong Noh, Joon-Young Paik, Jingun Kwon, Eun-Sun Cho
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23634
Pdf URL: https://arxiv.org/pdf/2506.23634
Copy Paste: [[2506.23634]] gMBA: Expression Semantic Guided Mixed Boolean-Arithmetic Deobfuscation Using Transformer Architectures(https://arxiv.org/abs/2506.23634)
Keywords: protect, transformer
Abstract: Mixed Boolean-Arithmetic (MBA) obfuscation protects intellectual property by converting programs into forms that are more complex to analyze. However, MBA has been increasingly exploited by malware developers to evade detection and cause significant real-world problems. Traditional MBA deobfuscation methods often consider these expressions as part of a black box and overlook their internal semantic information. To bridge this gap, we propose a truth table, which is an automatically constructed semantic representation of an expression's behavior that does not rely on external resources. The truth table is a mathematical form that represents the output of expression for all possible combinations of input. We also propose a general and extensible guided MBA deobfuscation framework (gMBA) that modifies a Transformer-based neural encoder-decoder Seq2Seq architecture to incorporate this semantic guidance. Experimental results and in-depth analysis show that integrating expression semantics significantly improves performance and highlights the importance of internal semantic expressions in recovering obfuscated code to its original form.

Title: Unified Multimodal Understanding via Byte-Pair Visual Encoding

Authors: Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23639
Pdf URL: https://arxiv.org/pdf/2506.23639
Copy Paste: [[2506.23639]] Unified Multimodal Understanding via Byte-Pair Visual Encoding(https://arxiv.org/abs/2506.23639)
Keywords: transformer, large language model
Abstract: Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance across diverse vision-language tasks. By bridging the gap between visual and textual representations, our approach contributes to the advancement of more capable and efficient multimodal foundation models.

Title: VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation

Authors: Peng Huang, Junhu Fu, Bowen Guo, Zeju Li, Yuanyuan Wang, Yi Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23641
Pdf URL: https://arxiv.org/pdf/2506.23641
Copy Paste: [[2506.23641]] VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation(https://arxiv.org/abs/2506.23641)
Keywords: robust, diffusion, generative, large language model
Abstract: As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always accessible. To address this, we explore a framework, termed Visual Attribute Prompts (VAP)-Diffusion, to leverage external knowledge from pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality and diversity of medical image generation. First, to derive descriptions from MLLMs without hallucination, we design a series of prompts following Chain-of-Thoughts for common medical imaging tasks, including dermatologic, colorectal, and chest X-ray images. Generated descriptions are utilized during training and stored across different categories. During testing, descriptions are randomly retrieved from the corresponding category for inference. Moreover, to make the generator robust to unseen combination of descriptions at the test time, we propose a Prototype Condition Mechanism that restricts test embeddings to be similar to those from training. Experiments on three common types of medical imaging across four datasets verify the effectiveness of VAP-Diffusion.

Title: MReg: A Novel Regression Model with MoE-based Video Feature Mining for Mitral Regurgitation Diagnosis

Authors: Zhe Liu, Yuhao Huang, Lian Liu, Chengrui Zhang, Haotian Lin, Tong Han, Zhiyuan Zhu, Yanlin Chen, Yuerui Chen, Dong Ni, Zhongshan Gou, Xin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23648
Pdf URL: https://arxiv.org/pdf/2506.23648
Copy Paste: [[2506.23648]] MReg: A Novel Regression Model with MoE-based Video Feature Mining for Mitral Regurgitation Diagnosis(https://arxiv.org/abs/2506.23648)
Keywords: interpretability
Abstract: Color Doppler echocardiography is a crucial tool for diagnosing mitral regurgitation (MR). Recent studies have explored intelligent methods for MR diagnosis to minimize user dependence and improve accuracy. However, these approaches often fail to align with clinical workflow and may lead to suboptimal accuracy and interpretability. In this study, we introduce an automated MR diagnosis model (MReg) developed on the 4-chamber cardiac color Doppler echocardiography video (A4C-CDV). It follows comprehensive feature mining strategies to detect MR and assess its severity, considering clinical realities. Our contribution is threefold. First, we formulate the MR diagnosis as a regression task to capture the continuity and ordinal relationships between categories. Second, we design a feature selection and amplification mechanism to imitate the sonographer's diagnostic logic for accurate MR grading. Third, inspired by the Mixture-of-Experts concept, we introduce a feature summary module to extract the category-level features, enhancing the representational capacity for more accurate grading. We trained and evaluated our proposed MReg on a large in-house A4C-CDV dataset comprising 1868 cases with three graded regurgitation labels. Compared to other weakly supervised video anomaly detection and supervised classification methods, MReg demonstrated superior performance in MR diagnosis. Our code is available at: this https URL.

Title: Towards Markerless Intraoperative Tracking of Deformable Spine Tissue

Authors: Connor Daly, Elettra Marconi, Marco Riva, Jinendra Ekanayake, Daniel S. Elson, Ferdinando Rodriguez y Baena
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23657
Pdf URL: https://arxiv.org/pdf/2506.23657
Copy Paste: [[2506.23657]] Towards Markerless Intraoperative Tracking of Deformable Spine Tissue(https://arxiv.org/abs/2506.23657)
Keywords: segmentation
Abstract: Consumer-grade RGB-D imaging for intraoperative orthopedic tissue tracking is a promising method with high translational potential. Unlike bone-mounted tracking devices, markerless tracking can reduce operating time and complexity. However, its use has been limited to cadaveric studies. This paper introduces the first real-world clinical RGB-D dataset for spine surgery and develops SpineAlign, a system for capturing deformation between preoperative and intraoperative spine states. We also present an intraoperative segmentation network trained on this data and introduce CorrespondNet, a multi-task framework for predicting key regions for registration in both intraoperative and preoperative scenes.

Title: Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack

Authors: Arnisa Fazla, Lucas Krauter, David Guzman Piedrahita, Andrianos Michail
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23661
Pdf URL: https://arxiv.org/pdf/2506.23661
Copy Paste: [[2506.23661]] Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack(https://arxiv.org/abs/2506.23661)
Keywords: attack, robust
Abstract: We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves over a 99\% attack success rate while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack's effectiveness and its limitations. Our implementation is available at this https URL

Title: Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation

Authors: Philip Lippmann, Jie Yang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.23662
Pdf URL: https://arxiv.org/pdf/2506.23662
Copy Paste: [[2506.23662]] Zero-Shot Contextual Embeddings via Offline Synthetic Corpus Generation(https://arxiv.org/abs/2506.23662)
Keywords: privacy
Abstract: Context-aware embedding methods boost retrieval accuracy by conditioning on corpus statistics (e.g., term co-occurrence and topical patterns) extracted from neighboring documents. However, this context-aware approach requires access to the target corpus or requires domain-specific finetuning, posing practical barriers in privacy-sensitive or resource-constrained settings. We present ZEST, a zero-shot contextual adaptation framework that replaces real corpus access with a one-time offline synthesis of a compact proxy. Given only a handful exemplar documents representative of the general target domain, we use a multi-step hierarchical procedure to generate a synthetic context corpus of several hundred documents that aims to emulate key domain-specific distributions. At inference, the frozen context-aware encoder uses this proxy corpus -- without any finetuning or target corpus access -- to produce domain-adapted embeddings. Across the MTEB benchmark, ZEST's zero-shot synthetic context adaptation using only five example documents performs within 0.5% of models leveraging full target corpus access -- demonstrating remarkable efficacy without any retraining. ZEST thus provides a practical method for deploying high-performance, adaptable embeddings in constrained environments.

Title: On the Domain Robustness of Contrastive Vision-Language Models

Authors: Mario Koddenbrock, Rudolf Hoffmann, David Brodmann, Erik Rodner
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23663
Pdf URL: https://arxiv.org/pdf/2506.23663
Copy Paste: [[2506.23663]] On the Domain Robustness of Contrastive Vision-Language Models(https://arxiv.org/abs/2506.23663)
Keywords: robust, large language model
Abstract: In real-world vision-language applications, practitioners increasingly rely on large, pretrained foundation models rather than custom-built solutions, despite limited transparency regarding their training data and processes. While these models achieve impressive performance on general benchmarks, their effectiveness can decline notably under specialized domain shifts, such as unique imaging conditions or environmental variations. In this work, we introduce Deepbench, a framework designed to assess domain-specific robustness of vision-language models (VLMs). Deepbench leverages a large language model (LLM) to generate realistic, context-aware image corruptions tailored to specific deployment domains without requiring labeled data. We evaluate a range of contrastive vision-language architectures and architectural variants across six real-world domains and observe substantial variability in robustness, highlighting the need for targeted, domain-aware evaluation. Deepbench is released as open-source software to support further research into domain-aware robustness assessment.

Title: L0: Reinforcement Learning to Become General Agents

Authors: Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23667
Pdf URL: https://arxiv.org/pdf/2506.23667
Copy Paste: [[2506.23667]] L0: Reinforcement Learning to Become General Agents(https://arxiv.org/abs/2506.23667)
Keywords: robust, large language model
Abstract: Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks remains significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a "code-as-action" fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30 % to 80 % and on HotpotQA from 22 % to 41 %. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes on (this https URL).

Title: Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation

Authors: Patrick Glandorf, Bodo Rosenhahn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23675
Pdf URL: https://arxiv.org/pdf/2506.23675
Copy Paste: [[2506.23675]] Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation(https://arxiv.org/abs/2506.23675)
Keywords: transformer
Abstract: Vision Transformer have set new benchmarks in several tasks, but these models come with the lack of high computational costs which makes them impractical for resource limited hardware. Network pruning reduces the computational complexity by removing less important operations while maintaining performance. However, pruning a model on an unseen data domain, leads to a misevaluation of weight significance, resulting in suboptimal resource assignment. In this work, we find that task-sensitive layers initially fail to improve the feature representation on downstream tasks, leading to performance loss for early pruning decisions. To address this problem, we introduce Pruning by Block Benefit (P3B), a pruning method that utilizes the relative contribution on block level to globally assign parameter resources. P3B identifies low-impact components to reduce parameter allocation while preserving critical ones. Classical pruning mask optimization struggles to reactivate zero-mask-elements. In contrast, P3B sets a layerwise keep ratio based on global performance metrics, ensuring the reactivation of late-converging blocks. We show in extensive experiments that P3B is a state of the art pruning method with most noticeable gains in transfer learning tasks. Notably, P3B is able to conserve high performance, even in high sparsity regimes of 70% parameter reduction while only losing 0.64% in accuracy.

Title: A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement

Authors: Gaozheng Pei, Ke Ma, Dongpeng Zhang, Chengzhi Sun, Qianqian Xu, Qingming Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23676
Pdf URL: https://arxiv.org/pdf/2506.23676
Copy Paste: [[2506.23676]] A Unified Framework for Stealthy Adversarial Generation via Latent Optimization and Transferability Enhancement(https://arxiv.org/abs/2506.23676)
Keywords: attack, steal, diffusion
Abstract: Due to their powerful image generation capabilities, diffusion-based adversarial example generation methods through image editing are rapidly gaining popularity. However, due to reliance on the discriminative capability of the diffusion model, these diffusion-based methods often struggle to generalize beyond conventional image classification tasks, such as in Deepfake detection. Moreover, traditional strategies for enhancing adversarial example transferability are challenging to adapt to these methods. To address these challenges, we propose a unified framework that seamlessly incorporates traditional transferability enhancement strategies into diffusion model-based adversarial example generation via image editing, enabling their application across a wider range of downstream tasks. Our method won first place in the "1st Adversarial Attacks on Deepfake Detectors: A Challenge in the Era of AI-Generated Media" competition at ACM MM25, which validates the effectiveness of our approach.

Title: Learning Modular Exponentiation with Transformers

Authors: David Demitri Africa, Sara M. Kapoor, Theo Simon Sorg
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2506.23679
Pdf URL: https://arxiv.org/pdf/2506.23679
Copy Paste: [[2506.23679]] Learning Modular Exponentiation with Transformers(https://arxiv.org/abs/2506.23679)
Keywords: interpretability, transformer
Abstract: Modular exponentiation is crucial to number theory and cryptography, yet remains largely unexplored from a mechanistic interpretability standpoint. We train a 4-layer encoder-decoder Transformer model to perform this operation and investigate the emergence of numerical reasoning during training. Utilizing principled sampling strategies, PCA-based embedding analysis, and activation patching, we examine how number-theoretic properties are encoded within the model. We find that reciprocal operand training leads to strong performance gains, with sudden generalization across related moduli. These synchronized accuracy surges reflect grokking-like dynamics, suggesting the model internalizes shared arithmetic structure. We also find a subgraph consisting entirely of attention heads in the final layer sufficient to achieve full performance on the task of regular exponentiation. These results suggest that transformer models learn modular arithmetic through specialized computational circuits, paving the way for more interpretable and efficient neural approaches to modular exponentiation.

Title: Not quite a piece of CHERI-cake: Are new digital security by design architectures usable?

Authors: Maysara Alhindi, Joseph Hallett
Subjects: cs.CR, cs.AR, cs.HC
Abstract URL: https://arxiv.org/abs/2506.23682
Pdf URL: https://arxiv.org/pdf/2506.23682
Copy Paste: [[2506.23682]] Not quite a piece of CHERI-cake: Are new digital security by design architectures usable?(https://arxiv.org/abs/2506.23682)
Keywords: security
Abstract: A digital security-by-design computer architecture, like CHERI, lets you program without fear of buffer overflows or other memory safety errors, but CHERI also rewrites some of the assumptions about how C works and how fundamental types (such as pointers) are implemented in hardware. We conducted a usability study to examine how developers react to the changes required by CHERI when porting software to run on it. We find that developers struggle with CHERI's display of warnings and errors and a lack of diverse documentation.

Title: Threadbox: Sandboxing for Modular Security

Authors: Maysara Alhindi, Joseph Hallett
Subjects: cs.CR, cs.OS, cs.SE
Abstract URL: https://arxiv.org/abs/2506.23683
Pdf URL: https://arxiv.org/pdf/2506.23683
Copy Paste: [[2506.23683]] Threadbox: Sandboxing for Modular Security(https://arxiv.org/abs/2506.23683)
Keywords: security
Abstract: There are many sandboxing mechanisms provided by operating systems to limit what resources applications can access, however, sometimes the use of these mechanisms requires developers to refactor their code to fit the sandboxing model. In this work, we investigate what makes existing sandboxing mechanisms challenging to apply to certain types of applications, and propose Threadbox, a sandboxing mechanism that enables having modular and independent sandboxes, and can be applied to threads and sandbox specific functions. We present case studies to illustrate the applicability of the idea and discuss its limitations.

Title: SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation

Authors: Shuai Tan, Biao Gong, Yujie Wei, Shiwei Zhang, Zhuoxin Liu, Dandan Zheng, Jingdong Chen, Yan Wang, Hao Ouyang, Kecheng Zheng, Yujun Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23690
Pdf URL: https://arxiv.org/pdf/2506.23690
Copy Paste: [[2506.23690]] SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation(https://arxiv.org/abs/2506.23690)
Keywords: diffusion, generative
Abstract: Diffusion-based video motion customization facilitates the acquisition of human motion representations from a few video samples, while achieving arbitrary subjects transfer through precise textual conditioning. Existing approaches often rely on semantic-level alignment, expecting the model to learn new motion concepts and combine them with other entities (e.g., ''cats'' or ''dogs'') to produce visually appealing results. However, video data involve complex spatio-temporal patterns, and focusing solely on semantics cause the model to overlook the visual complexity of motion. Conversely, tuning only the visual representation leads to semantic confusion in representing the intended action. To address these limitations, we propose SynMotion, a new motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-embedding semantic comprehension mechanism which disentangles subject and motion representations, allowing the model to learn customized motion features while preserving its generative capabilities for diverse subjects. At the visual level, we integrate parameter-efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence. Furthermore, we introduce a new embedding-specific training strategy which \textbf{alternately optimizes} subject and motion embeddings, supported by the manually constructed Subject Prior Video (SPV) training dataset. This strategy promotes motion specificity while preserving generalization across diverse subjects. Lastly, we introduce MotionBench, a newly curated benchmark with diverse motion patterns. Experimental results across both T2V and I2V settings demonstrate that \method outperforms existing baselines. Project page: this https URL

Title: Single Image Test-Time Adaptation via Multi-View Co-Training

Authors: Smriti Joshi, Richard Osuala, Lidia Garrucho, Kaisar Kushibar, Dimitri Kessler, Oliver Diaz, Karim Lekadir
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23705
Pdf URL: https://arxiv.org/pdf/2506.23705
Copy Paste: [[2506.23705]] Single Image Test-Time Adaptation via Multi-View Co-Training(https://arxiv.org/abs/2506.23705)
Keywords: segmentation
Abstract: Test-time adaptation enables a trained model to adjust to a new domain during inference, making it particularly valuable in clinical settings where such on-the-fly adaptation is required. However, existing techniques depend on large target domain datasets, which are often impractical and unavailable in medical scenarios that demand per-patient, real-time inference. Moreover, current methods commonly focus on two-dimensional images, failing to leverage the volumetric richness of medical imaging data. Bridging this gap, we propose a Patch-Based Multi-View Co-Training method for Single Image Test-Time adaptation. Our method enforces feature and prediction consistency through uncertainty-guided self-training, enabling effective volumetric segmentation in the target domain with only a single test-time image. Validated on three publicly available breast magnetic resonance imaging datasets for tumor segmentation, our method achieves performance close to the upper bound supervised benchmark while also outperforming all existing state-of-the-art methods, on average by a Dice Similarity Coefficient of 3.75%. We publicly share our accessible codebase, readily integrable with the popular nnUNet framework, at this https URL.

Title: Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

Authors: Haoyang Chen, Dongfang Sun, Caoyuan Ma, Shiqin Wang, Kewei Zhang, Zheng Wang, Zhixiang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23711
Pdf URL: https://arxiv.org/pdf/2506.23711
Copy Paste: [[2506.23711]] Subjective Camera: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion(https://arxiv.org/abs/2506.23711)
Keywords: robust, diffusion
Abstract: We propose Subjective Camera, a human-as-imaging-device paradigm that reconstructs real-world scenes from mental impressions through synergistic use of verbal descriptions and progressive rough sketches. This approach overcomes dual limitations of language ambiguity and sketch abstraction by treating the user's drawing sequence as priors, effectively translating subjective perceptual expectations into photorealistic images. Existing approaches face three fundamental barriers: (1) user-specific subjective input biases, (2) huge modality gap between planar sketch and 3D priors in diffusion, and (3) sketch quality-sensitive performance degradation. Current solutions either demand resource-intensive model adaptation or impose impractical requirements on sketch precision. Our framework addresses these challenges through concept-sequential generation. (1) We establish robust appearance priors through text-reward optimization, and then implement sequence-aware disentangled generation that processes concepts in sketching order; these steps accommodate user-specific subjective expectation in a train-free way. (2) We employ latent optimization that effectively bridges the modality gap between planar sketches and 3D priors in diffusion. (3) Our hierarchical reward-guided framework enables the use of rough sketches without demanding artistic expertise. Comprehensive evaluation across diverse datasets demonstrates that our approach achieves state-of-the-art performance in maintaining both semantic and spatial coherence.

Title: System-Embedded Diffusion Bridge Models

Authors: Bartlomiej Sobieski, Matthew Tivnan, Yuang Wang, Siyeop Yoon, Pengfei Jin, Dufan Wu, Quanzheng Li, Przemyslaw Biecek
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23726
Pdf URL: https://arxiv.org/pdf/2506.23726
Copy Paste: [[2506.23726]] System-Embedded Diffusion Bridge Models(https://arxiv.org/abs/2506.23726)
Keywords: robust, diffusion, generative
Abstract: Solving inverse problems -- recovering signals from incomplete or noisy measurements -- is fundamental in science and engineering. Score-based generative models (SGMs) have recently emerged as a powerful framework for this task. Two main paradigms have formed: unsupervised approaches that adapt pretrained generative models to inverse problems, and supervised bridge methods that train stochastic processes conditioned on paired clean and corrupted data. While the former typically assume knowledge of the measurement model, the latter have largely overlooked this structural information. We introduce System embedded Diffusion Bridge Models (SDBs), a new class of supervised bridge methods that explicitly embed the known linear measurement system into the coefficients of a matrix-valued SDE. This principled integration yields consistent improvements across diverse linear inverse problems and demonstrates robust generalization under system misspecification between training and deployment, offering a promising solution to real-world applications.

Title: Proteus-ID: ID-Consistent and Motion-Coherent Video Customization

Authors: Guiyu Zhang, Chen Shi, Zijian Jiang, Xunzhi Xiang, Jingjing Qian, Shaoshuai Shi, Li Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23729
Pdf URL: https://arxiv.org/pdf/2506.23729
Copy Paste: [[2506.23729]] Proteus-ID: ID-Consistent and Motion-Coherent Video Customization(https://arxiv.org/abs/2506.23729)
Keywords: diffusion
Abstract: Video identity customization seeks to synthesize realistic, temporally coherent videos of a specific subject, given a single reference image and a text prompt. This task presents two core challenges: (1) maintaining identity consistency while aligning with the described appearance and actions, and (2) generating natural, fluid motion without unrealistic stiffness. To address these challenges, we introduce Proteus-ID, a novel diffusion-based framework for identity-consistent and motion-coherent video customization. First, we propose a Multimodal Identity Fusion (MIF) module that unifies visual and textual cues into a joint identity representation using a Q-Former, providing coherent guidance to the diffusion model and eliminating modality imbalance. Second, we present a Time-Aware Identity Injection (TAII) mechanism that dynamically modulates identity conditioning across denoising steps, improving fine-detail reconstruction. Third, we propose Adaptive Motion Learning (AML), a self-supervised strategy that reweights the training loss based on optical-flow-derived motion heatmaps, enhancing motion realism without requiring additional inputs. To support this task, we construct Proteus-Bench, a high-quality dataset comprising 200K curated clips for training and 150 individuals from diverse professions and ethnicities for evaluation. Extensive experiments demonstrate that Proteus-ID outperforms prior methods in identity preservation, text alignment, and motion quality, establishing a new benchmark for video identity customization. Codes and data are publicly available at this https URL.

Title: Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models

Authors: Michel Meintz, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.23731
Pdf URL: https://arxiv.org/pdf/2506.23731
Copy Paste: [[2506.23731]] Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models(https://arxiv.org/abs/2506.23731)
Keywords: robust, watermark, diffusion, generative, large language model
Abstract: Image generative models have become increasingly popular, but training them requires large datasets that are costly to collect and curate. To circumvent these costs, some parties may exploit existing models by using the generated images as training data for their own models. In general, watermarking is a valuable tool for detecting unauthorized use of generated images. However, when these images are used to train a new model, watermarking can only enable detection if the watermark persists through training and remains identifiable in the outputs of the newly trained model - a property known as radioactivity. We analyze the radioactivity of watermarks in images generated by diffusion models (DMs) and image autoregressive models (IARs). We find that existing watermarking methods for DMs fail to retain radioactivity, as watermarks are either erased during encoding into the latent space or lost in the noising-denoising process (during the training in the latent space). Meanwhile, despite IARs having recently surpassed DMs in image generation quality and efficiency, no radioactive watermarking methods have been proposed for them. To overcome this limitation, we propose the first watermarking method tailored for IARs and with radioactivity in mind - drawing inspiration from techniques in large language models (LLMs), which share IARs' autoregressive paradigm. Our extensive experimental evaluation highlights our method's effectiveness in preserving radioactivity within IARs, enabling robust provenance tracking, and preventing unauthorized use of their generated images.

Title: AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data

Authors: JiaRu Wu, Mingwei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23735
Pdf URL: https://arxiv.org/pdf/2506.23735
Copy Paste: [[2506.23735]] AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data(https://arxiv.org/abs/2506.23735)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283\%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932\%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: this https URL.

Title: Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences

Authors: Tiziano Labruna, Simone Gallo, Giovanni Da San Martino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23743
Pdf URL: https://arxiv.org/pdf/2506.23743
Copy Paste: [[2506.23743]] Positional Bias in Binary Question Answering: How Uncertainty Shapes Model Preferences(https://arxiv.org/abs/2506.23743)
Keywords: fair, large language model
Abstract: Positional bias in binary question answering occurs when a model systematically favors one choice over another based solely on the ordering of presented options. In this study, we quantify and analyze positional bias across five large language models under varying degrees of answer uncertainty. We re-adapted the SQuAD-it dataset by adding an extra incorrect answer option and then created multiple versions with progressively less context and more out-of-context answers, yielding datasets that range from low to high uncertainty. Additionally, we evaluate two naturally higher-uncertainty benchmarks: (1) WebGPT - question pairs with unequal human-assigned quality scores, and (2) Winning Arguments - where models predict the more persuasive argument in Reddit's r/ChangeMyView exchanges. Across each dataset, the order of the "correct" (or higher-quality/persuasive) option is systematically flipped (first placed in position 1, then in position 2) to compute both Preference Fairness and Position Consistency. We observe that positional bias is nearly absent under low-uncertainty conditions, but grows exponentially when it becomes doubtful to decide which option is correct.

Title: Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?

Authors: Annika Mütze, Sadia Ilyas, Christian Dörpelkus, Matthias Rottmann
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23751
Pdf URL: https://arxiv.org/pdf/2506.23751
Copy Paste: [[2506.23751]] Can We Challenge Open-Vocabulary Object Detectors with Generated Content in Street Scenes?(https://arxiv.org/abs/2506.23751)
Keywords: diffusion
Abstract: Open-vocabulary object detectors such as Grounding DINO are trained on vast and diverse data, achieving remarkable performance on challenging datasets. Due to that, it is unclear where to find their limitations, which is of major concern when using in safety-critical applications. Real-world data does not provide sufficient control, required for a rigorous evaluation of model generalization. In contrast, synthetically generated data allows to systematically explore the boundaries of model competence/generalization. In this work, we address two research questions: 1) Can we challenge open-vocabulary object detectors with generated image content? 2) Can we find systematic failure modes of those models? To address these questions, we design two automated pipelines using stable diffusion to inpaint unusual objects with high diversity in semantics, by sampling multiple substantives from WordNet and ChatGPT. On the synthetically generated data, we evaluate and compare multiple open-vocabulary object detectors as well as a classical object detector. The synthetic data is derived from two real-world datasets, namely LostAndFound, a challenging out-of-distribution (OOD) detection benchmark, and the NuImages dataset. Our results indicate that inpainting can challenge open-vocabulary object detectors in terms of overlooking objects. Additionally, we find a strong dependence of open-vocabulary models on object location, rather than on object semantics. This provides a systematic approach to challenge open-vocabulary models and gives valuable insights on how data could be acquired to effectively improve these models.

Title: Model-driven Stochastic Trace Clustering

Authors: Jari Peeperkorn, Johannes De Smedt, Jochen De Weerdt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23776
Pdf URL: https://arxiv.org/pdf/2506.23776
Copy Paste: [[2506.23776]] Model-driven Stochastic Trace Clustering(https://arxiv.org/abs/2506.23776)
Keywords: interpretability
Abstract: Process discovery algorithms automatically extract process models from event logs, but high variability often results in complex and hard-to-understand models. To mitigate this issue, trace clustering techniques group process executions into clusters, each represented by a simpler and more understandable process model. Model-driven trace clustering improves on this by assigning traces to clusters based on their conformity to cluster-specific process models. However, most existing clustering techniques rely on either no process model discovery, or non-stochastic models, neglecting the frequency or probability of activities and transitions, thereby limiting their capability to capture real-world execution dynamics. We propose a novel model-driven trace clustering method that optimizes stochastic process models within each cluster. Our approach uses entropic relevance, a stochastic conformance metric based on directly-follows probabilities, to guide trace assignment. This allows clustering decisions to consider both structural alignment with a cluster's process model and the likelihood that a trace originates from a given stochastic process model. The method is computationally efficient, scales linearly with input size, and improves model interpretability by producing clusters with clearer control-flow patterns. Extensive experiments on public real-life datasets show that our method outperforms existing alternatives in representing process behavior and reveals how clustering performance rankings can shift when stochasticity is considered.

Title: Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking

Authors: Shiao Wang, Ju Huang, Qingchuan Ma, Jinfeng Gao, Chunyi Xu, Xiao Wang, Lan Chen, Bo Jiang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23783
Pdf URL: https://arxiv.org/pdf/2506.23783
Copy Paste: [[2506.23783]] Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking(https://arxiv.org/abs/2506.23783)
Keywords: robust, extraction, transformer
Abstract: Combining traditional RGB cameras with bio-inspired event cameras for robust object tracking has garnered increasing attention in recent years. However, most existing multimodal tracking algorithms depend heavily on high-complexity Vision Transformer architectures for feature extraction and fusion across modalities. This not only leads to substantial computational overhead but also limits the effectiveness of cross-modal interactions. In this paper, we propose an efficient RGB-Event object tracking framework based on the linear-complexity Vision Mamba network, termed Mamba-FETrack V2. Specifically, we first design a lightweight Prompt Generator that utilizes embedded features from each modality, together with a shared prompt pool, to dynamically generate modality-specific learnable prompt vectors. These prompts, along with the modality-specific embedded features, are then fed into a Vision Mamba-based FEMamba backbone, which facilitates prompt-guided feature extraction, cross-modal interaction, and fusion in a unified manner. Finally, the fused representations are passed to the tracking head for accurate target localization. Extensive experimental evaluations on multiple RGB-Event tracking benchmarks, including short-term COESOT dataset and long-term datasets, i.e., FE108 and FELT V2, demonstrate the superior performance and efficiency of the proposed tracking framework. The source code and pre-trained models will be released on this https URL

Title: Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors

Authors: Ce Wang, Wanjie Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23801
Pdf URL: https://arxiv.org/pdf/2506.23801
Copy Paste: [[2506.23801]] Controllable Reference-Based Real-World Remote Sensing Image Super-Resolution with Generative Diffusion Priors(https://arxiv.org/abs/2506.23801)
Keywords: diffusion, generative, segmentation
Abstract: Super-resolution (SR) techniques can enhance the spatial resolution of remote sensing images by utilizing low-resolution (LR) images to reconstruct high-resolution (HR) images, enabling more efficient large-scale earth observation applications. While single-image super-resolution (SISR) methods have shown progress, reference-based super-resolution (RefSR) offers superior performance by incorporating historical HR images alongside current LR observations. However, existing RefSR methods struggle with real-world complexities, such as cross-sensor resolution gap and significant land cover changes, often leading to under-generation or over-reliance on reference image. To address these challenges, we propose CRefDiff, a novel controllable reference-based diffusion model for real-world remote sensing image SR. To address the under-generation problem, CRefDiff is built upon the pretrained Stable Diffusion model, leveraging its powerful generative prior to produce accurate structures and textures. To mitigate over-reliance on the reference, we introduce a dual-branch fusion mechanism that adaptively integrates both local and global information from the reference image. Moreover, this novel dual-branch design enables reference strength control during inference, enhancing interactivity and flexibility of the model. Finally, a strategy named Better Start is proposed to significantly reduce the number of denoising steps, thereby accelerating the inference process. To support further research, we introduce Real-RefRSSRD, a new real-world RefSR dataset for remote sensing images, consisting of HR NAIP and LR Sentinel-2 image pairs with diverse land cover changes and significant temporal gaps. Extensive experiments on Real-RefRSSRD show that CRefDiff achieves state-of-the-art performance across various metrics and improves downstream tasks such as scene classification and semantic segmentation.

Title: MadCLIP: Few-shot Medical Anomaly Detection with CLIP

Authors: Mahshid Shiri, Cigdem Beyan, Vittorio Murino
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23810
Pdf URL: https://arxiv.org/pdf/2506.23810
Copy Paste: [[2506.23810]] MadCLIP: Few-shot Medical Anomaly Detection with CLIP(https://arxiv.org/abs/2506.23810)
Keywords: segmentation
Abstract: An innovative few-shot anomaly detection approach is presented, leveraging the pre-trained CLIP model for medical data, and adapting it for both image-level anomaly classification (AC) and pixel-level anomaly segmentation (AS). A dual-branch design is proposed to separately capture normal and abnormal features through learnable adapters in the CLIP vision encoder. To improve semantic alignment, learnable text prompts are employed to link visual features. Furthermore, SigLIP loss is applied to effectively handle the many-to-one relationship between images and unpaired text prompts, showcasing its adaptation in the medical field for the first time. Our approach is validated on multiple modalities, demonstrating superior performance over existing methods for AC and AS, in both same-dataset and cross-dataset evaluations. Unlike prior work, it does not rely on synthetic data or memory banks, and an ablation study confirms the contribution of each component. The code is available at this https URL.

Title: Breaking Out from the TESSERACT: Reassessing ML-based Malware Detection under Spatio-Temporal Drift

Authors: Theo Chow, Mario D'Onghia, Lorenz Linhardt, Zeliang Kan, Daniel Arp, Lorenzo Cavallaro, Fabio Pierazzi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.23814
Pdf URL: https://arxiv.org/pdf/2506.23814
Copy Paste: [[2506.23814]] Breaking Out from the TESSERACT: Reassessing ML-based Malware Detection under Spatio-Temporal Drift(https://arxiv.org/abs/2506.23814)
Keywords: security
Abstract: Several recent works focused on the best practices for applying machine learning to cybersecurity. In the context of malware, TESSERACT highlighted the impact of concept drift on detection performance and suggested temporal and spatial constraints to be enforced to ensure realistic time-aware evaluations, which have been adopted by the community. In this paper, we demonstrate striking discrepancies in the performance of learning-based malware detection across the same time frame when evaluated on two representative Android malware datasets used in top-tier security conferences, both adhering to established sampling and evaluation guidelines. This questions our ability to understand how current state-of-the-art approaches would perform in realistic scenarios. To address this, we identify five novel temporal and spatial bias factors that affect realistic evaluations. We thoroughly evaluate the impact of these factors in the Android malware domain on two representative datasets and five Android malware classifiers used or proposed in top-tier security conferences. For each factor, we provide practical and actionable recommendations that the community should integrate in their methodology for more realistic and reproducible settings.

Title: Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

Authors: Shiming Chen, Bowen Duan, Salman Khan, Fahad Shahbaz Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23822
Pdf URL: https://arxiv.org/pdf/2506.23822
Copy Paste: [[2506.23822]] Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model(https://arxiv.org/abs/2506.23822)
Keywords: interpretability
Abstract: Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization. Codes available at: this https URL.

Title: Flash-VStream: Efficient Real-Time Understanding for Long Video Streams

Authors: Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Xiaojie Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23825
Pdf URL: https://arxiv.org/pdf/2506.23825
Copy Paste: [[2506.23825]] Flash-VStream: Efficient Real-Time Understanding for Long Video Streams(https://arxiv.org/abs/2506.23825)
Keywords: large language model
Abstract: Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is inefficient for real-world applications and hard to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. Code is available at this https URL.

Title: Low-latency vision transformers via large-scale multi-head attention

Authors: Ronit D. Gross, Tal Halevi, Ella Koresh, Yarden Tzach, Ido Kanter
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23832
Pdf URL: https://arxiv.org/pdf/2506.23832
Copy Paste: [[2506.23832]] Low-latency vision transformers via large-scale multi-head attention(https://arxiv.org/abs/2506.23832)
Keywords: transformer
Abstract: The emergence of spontaneous symmetry breaking among a few heads of multi-head attention (MHA) across transformer blocks in classification tasks was recently demonstrated through the quantification of single-nodal performance (SNP). This finding indicates that each head focuses its attention on a subset of labels through cooperation among its SNPs. This underlying learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance (SHP), analogous to single-filter performance in convolutional neural networks (CNNs). The results indicate that each SHP matrix comprises multiple unit clusters such that each label being explicitly recognized by a few heads with negligible noise. This leads to an increased signal-to-noise ratio (SNR) along the transformer blocks, thereby improving classification accuracy. These features give rise to several distinct vision transformer (ViT) architectures that achieve the same accuracy but differ in their LS-MHA structures. As a result, their soft committee yields superior accuracy, an outcome not typically observed in CNNs which rely on hundreds of filters. In addition, a significant reduction in latency is achieved without affecting the accuracy by replacing the initial transformer blocks with convolutional layers. This substitution accelerates early-stage learning, which is then improved by subsequent transformer layers. The extension of this learning mechanism to natural language processing tasks, based on quantitative differences between CNNs and ViT architectures, has the potential to yield new insights in deep learning. The findings are demonstrated using compact convolutional transformer architectures trained on the CIFAR-100 dataset.

Title: PointSSIM: A novel low dimensional resolution invariant image-to-image comparison metric

Authors: Oscar Ovanger, Ragnar Hauge, Jacob Skauvold, Michael J. Pyrcz, Jo Eidsvik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23833
Pdf URL: https://arxiv.org/pdf/2506.23833
Copy Paste: [[2506.23833]] PointSSIM: A novel low dimensional resolution invariant image-to-image comparison metric(https://arxiv.org/abs/2506.23833)
Keywords: robust
Abstract: This paper presents PointSSIM, a novel low-dimensional image-to-image comparison metric that is resolution invariant. Drawing inspiration from the structural similarity index measure and mathematical morphology, PointSSIM enables robust comparison across binary images of varying resolutions by transforming them into marked point pattern representations. The key features of the image, referred to as anchor points, are extracted from binary images by identifying locally adaptive maxima from the minimal distance transform. Image comparisons are then performed using a summary vector, capturing intensity, connectivity, complexity, and structural attributes. Results show that this approach provides an efficient and reliable method for image comparison, particularly suited to applications requiring structural analysis across different resolutions.

Title: Refine Any Object in Any Scene

Authors: Ziwei Chen, Ziling Liu, Zitong Huang, Mingqi Gao, Feng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23835
Pdf URL: https://arxiv.org/pdf/2506.23835
Copy Paste: [[2506.23835]] Refine Any Object in Any Scene(https://arxiv.org/abs/2506.23835)
Keywords: generative
Abstract: Viewpoint missing of objects is common in scene reconstruction, as camera paths typically prioritize capturing the overall scene structure rather than individual objects. This makes it highly challenging to achieve high-fidelity object-level modeling while maintaining accurate scene-level representation. Addressing this issue is critical for advancing downstream tasks requiring detailed object understanding and appearance modeling. In this paper, we introduce Refine Any object In any ScenE (RAISE), a novel 3D enhancement framework that leverages 3D generative priors to recover fine-grained object geometry and appearance under missing views. Starting from substituting degraded objects with proxies, via a 3D generative model with strong 3D understanding, RAISE progressively refines geometry and texture by aligning each proxy to its degraded counterpart in 7-DOF pose, followed by correcting spatial and appearance inconsistencies via registration-constrained enhancement. This two-stage refinement ensures the high-fidelity geometry and appearance of the original object in unseen views while maintaining consistency in spatial positioning, observed geometry, and appearance. Extensive experiments on challenging benchmarks show that RAISE significantly outperforms state-of-the-art methods in both novel view synthesis and geometry completion tasks. RAISE is made publicly available at this https URL.

Title: An ontological lens on attack trees: Toward adequacy and interoperability

Authors: Ítalo Oliveira, Stefano M. Nicoletti, Gal Engelberg, Mattia Fumagalli, Dan Klein, Giancarlo Guizzardi
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2506.23841
Pdf URL: https://arxiv.org/pdf/2506.23841
Copy Paste: [[2506.23841]] An ontological lens on attack trees: Toward adequacy and interoperability(https://arxiv.org/abs/2506.23841)
Keywords: security, attack
Abstract: Attack Trees (AT) are a popular formalism for security analysis. They are meant to display an attacker's goal decomposed into attack steps needed to achieve it and compute certain security metrics (e.g., attack cost, probability, and damage). ATs offer three important services: (a) conceptual modeling capabilities for representing security risk management scenarios, (b) a qualitative assessment to find root causes and minimal conditions of successful attacks, and (c) quantitative analyses via security metrics computation under formal semantics, such as minimal time and cost among all attacks. Still, the AT language presents limitations due to its lack of ontological foundations, thus compromising associated services. Via an ontological analysis grounded in the Common Ontology of Value and Risk (COVER) -- a reference core ontology based on the Unified Foundational Ontology (UFO) -- we investigate the ontological adequacy of AT and reveal four significant shortcomings: (1) ambiguous syntactical terms that can be interpreted in various ways; (2) ontological deficit concerning crucial domain-specific concepts; (3) lacking modeling guidance to construct ATs decomposing a goal; (4) lack of semantic interoperability, resulting in ad hoc stand-alone tools. We also discuss existing incremental solutions and how our analysis paves the way for overcoming those issues through a broader approach to risk management modeling.

Title: Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

Authors: Kenny Peng, Rajiv Movva, Jon Kleinberg, Emma Pierson, Nikhil Garg
Subjects: cs.LG, cs.AI, cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.23845
Pdf URL: https://arxiv.org/pdf/2506.23845
Copy Paste: [[2506.23845]] Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts(https://arxiv.org/abs/2506.23845)
Keywords: fair, interpretability, explainability
Abstract: While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, SAEs are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.

Title: Differentially Private Synthetic Data Release for Topics API Outputs

Authors: Travis Dick, Alessandro Epasto, Adel Javanmard, Josh Karlin, Andres Munoz Medina, Vahab Mirrokni, Sergei Vassilvitskii, Peilin Zhong
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23855
Pdf URL: https://arxiv.org/pdf/2506.23855
Copy Paste: [[2506.23855]] Differentially Private Synthetic Data Release for Topics API Outputs(https://arxiv.org/abs/2506.23855)
Keywords: privacy, protect
Abstract: The analysis of the privacy properties of Privacy-Preserving Ads APIs is an area of research that has received strong interest from academics, industry, and regulators. Despite this interest, the empirical study of these methods is hindered by the lack of publicly available data. Reliable empirical analysis of the privacy properties of an API, in fact, requires access to a dataset consisting of realistic API outputs; however, privacy concerns prevent the general release of such data to the public. In this work, we develop a novel methodology to construct synthetic API outputs that are simultaneously realistic enough to enable accurate study and provide strong privacy protections. We focus on one Privacy-Preserving Ads APIs: the Topics API, part of Google Chrome's Privacy Sandbox. We developed a methodology to generate a differentially-private dataset that closely matches the re-identification risk properties of the real Topics API data. The use of differential privacy provides strong theoretical bounds on the leakage of private user information from this release. Our methodology is based on first computing a large number of differentially-private statistics describing how output API traces evolve over time. Then, we design a parameterized distribution over sequences of API traces and optimize its parameters so that they closely match the statistics obtained. Finally, we create the synthetic data by drawing from this distribution. Our work is complemented by an open-source release of the anonymized dataset obtained by this methodology. We hope this will enable external researchers to analyze the API in-depth and replicate prior and future work on a realistic large-scale dataset. We believe that this work will contribute to fostering transparency regarding the privacy properties of Privacy-Preserving Ads APIs.

Title: VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Authors: Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23858
Pdf URL: https://arxiv.org/pdf/2506.23858
Copy Paste: [[2506.23858]] VMoBA: Mixture-of-Block Attention for Video Diffusion Models(https://arxiv.org/abs/2506.23858)
Keywords: diffusion, transformer
Abstract: The quadratic complexity of full attention mechanisms poses a significant bottleneck for Video Diffusion Models (VDMs) aiming to generate long-duration, high-resolution videos. While various sparse attention methods have been proposed, many are designed as training-free inference accelerators or do not optimally capture the unique spatio-temporal characteristics inherent in video data when trained natively. This paper introduces Video Mixture of Block Attention (VMoBA), a novel sparse attention mechanism specifically adapted for VDMs. Motivated by an in-depth analysis of attention patterns within pre-trained video transformers, which revealed strong spatio-temporal locality, varying query importance, and head-specific concentration levels, VMoBA enhances the original MoBA framework with three key modifications: (1) a layer-wise recurrent block partition scheme (1D-2D-3D) to dynamically adapt to diverse spatio-temporal attention patterns and improve efficiency; (2) global block selection to prioritize the most salient query-key block interactions across an entire attention head; and (3) threshold-based block selection to dynamically determine the number of attended blocks based on their cumulative similarity. Extensive experiments demonstrate that VMoBA significantly accelerates the training of VDMs on longer sequences, achieving 2.92x FLOPs and 1.48x latency speedup, while attaining comparable or even superior generation quality to full attention. Furthermore, VMoBA exhibits competitive performance in training-free inference, offering 2.40x FLOPs and 1.35x latency speedup for high-res video generation.

Title: Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions

Authors: Jason Kayembe, Iness Ben Guirat, Jan Tobias Mühlberg
Subjects: cs.CR, cs.CY, cs.SE
Abstract URL: https://arxiv.org/abs/2506.23866
Pdf URL: https://arxiv.org/pdf/2506.23866
Copy Paste: [[2506.23866]] Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions(https://arxiv.org/abs/2506.23866)
Keywords: security, privacy
Abstract: In this paper, we explore the intersection of privacy, security, and environmental sustainability in cloud-based office solutions, focusing on quantifying user- and network-side energy use and associated carbon emissions. We hypothesise that privacy-focused services are typically more energy-efficient than those funded through data collection and advertising. To evaluate this, we propose a framework that systematically measures environmental costs based on energy usage and network data traffic during well-defined, automated usage scenarios. To test our hypothesis, we first analyse how underlying architectures and business models, such as monetisation through personalised advertising, contribute to the environmental footprint of these services. We then explore existing methodologies and tools for software environmental impact assessment. We apply our framework to three mainstream email services selected to reflect different privacy policies, from ad-supported tracking-intensive models to privacy-focused designs: Microsoft Outlook, Google Mail (Gmail), and Proton Mail. We extend this comparison to a self-hosted email solution, evaluated with and without end-to-end encryption. We show that the self-hosted solution, even with 14% of device energy and 15% of emissions overheads from PGP encryption, remains the most energy-efficient, saving up to 33% of emissions per session compared to Gmail. Among commercial providers, Proton Mail is the most efficient, saving up to 0.1 gCO2 e per session compared to Outlook, whose emissions can be further reduced by 2% through ad-blocking.

Title: Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic

Authors: Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23875
Pdf URL: https://arxiv.org/pdf/2506.23875
Copy Paste: [[2506.23875]] Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic(https://arxiv.org/abs/2506.23875)
Keywords: transformer
Abstract: The chain of thought is fundamental in Transformers, which is to perform step-by-step reasoning. Besides what intermediate steps work, the order of these steps critically affects the difficulty of the reasoning. This study addresses a novel task of unraveling chain of thought - reordering decoder input tokens to a learning-friendly sequence for Transformers to learn arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage. As the search space grows factorially with sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on four order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, on the multiplication task, it recovered the reverse-digit order reported in prior studies.

Title: Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection

Authors: Reihaneh Zohrabi, Hosein Hasani, Mahdieh Soleymani Baghshah, Anna Rohrbach, Marcus Rohrbach, Mohammad Hossein Rohban
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23881
Pdf URL: https://arxiv.org/pdf/2506.23881
Copy Paste: [[2506.23881]] Spurious-Aware Prototype Refinement for Reliable Out-of-Distribution Detection(https://arxiv.org/abs/2506.23881)
Keywords: robust
Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications, where they frequently face data distributions unseen during training. Despite progress, existing methods are often vulnerable to spurious correlations that mislead models and compromise robustness. To address this, we propose SPROD, a novel prototype-based OOD detection approach that explicitly addresses the challenge posed by unknown spurious correlations. Our post-hoc method refines class prototypes to mitigate bias from spurious features without additional data or hyperparameter tuning, and is broadly applicable across diverse backbones and OOD detection settings. We conduct a comprehensive spurious correlation OOD detection benchmarking, comparing our method against existing approaches and demonstrating its superior performance across challenging OOD datasets, such as CelebA, Waterbirds, UrbanCars, Spurious Imagenet, and the newly introduced Animals MetaCoCo. On average, SPROD improves AUROC by 4.7% and FPR@95 by 9.3% over the second best.

Title: Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting

Authors: André de Souza Loureiro, Jorge Valverde-Rebaza, Julieta Noguez, David Escarcega, Ricardo Marcacini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23888
Pdf URL: https://arxiv.org/pdf/2506.23888
Copy Paste: [[2506.23888]] Advancing Multi-Step Mathematical Reasoning in Large Language Models through Multi-Layered Self-Reflection with Auto-Prompting(https://arxiv.org/abs/2506.23888)
Keywords: large language model
Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved their problem-solving capabilities. However, these models still struggle when faced with complex multi-step reasoning tasks. In this paper, we propose the Multi-Layered Self-Reflection with Auto-Prompting (MAPS) framework, a novel approach designed to enhance multi-step mathematical reasoning in LLMs by integrating techniques such as Chain of Thought (CoT), Self-Reflection, and Auto-Prompting. Unlike traditional static prompting methods, MAPS employs an iterative refinement process. Initially, the model generates a solution using CoT prompting. When errors are detected, an adaptive self-reflection mechanism identifies and analyzes them, generating tailored prompts to guide corrections. These dynamically adjusted prompts enable the model to iteratively refine its reasoning. Experiments on four well-established benchmarks across multiple LLMs show that MAPS significantly outperforms standard CoT and achieves competitive results with reasoning-optimized models. In addition, MAPS enables general-purpose LLMs to reach performance levels comparable to specialized reasoning models. While deeper reflection layers improve accuracy, they also increase token usage and costs. To balance this trade-off, MAPS strategically limits reflection depth, ensuring an optimal balance between cost and reasoning performance.

Title: GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Authors: Hamza Rasaee, Taha Koleilat, Hassan Rivaz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23903
Pdf URL: https://arxiv.org/pdf/2506.23903
Copy Paste: [[2506.23903]] GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models(https://arxiv.org/abs/2506.23903)
Keywords: robust, segmentation
Abstract: Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. These datasets were divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for testing to evaluate performance in unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on this http URL after acceptance.

Title: RawMal-TF: Raw Malware Dataset Labeled by Type and Family

Authors: David Bálik, Martin Jureček, Mark Stamp
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.23909
Pdf URL: https://arxiv.org/pdf/2506.23909
Copy Paste: [[2506.23909]] RawMal-TF: Raw Malware Dataset Labeled by Type and Family(https://arxiv.org/abs/2506.23909)
Keywords: robust, extraction
Abstract: This work addresses the challenge of malware classification using machine learning by developing a novel dataset labeled at both the malware type and family levels. Raw binaries were collected from sources such as VirusShare, VX Underground, and MalwareBazaar, and subsequently labeled with family information parsed from binary names and type-level labels integrated from ClarAVy. The dataset includes 14 malware types and 17 malware families, and was processed using a unified feature extraction pipeline based on static analysis, particularly extracting features from Portable Executable headers, to support advanced classification tasks. The evaluation was focused on three key classification tasks. In the binary classification of malware versus benign samples, Random Forest and XGBoost achieved high accuracy on the full datasets, reaching 98.5% for type-based detection and 98.98% for family-based detection. When using truncated datasets of 1,000 samples to assess performance under limited data conditions, both models still performed strongly, achieving 97.6% for type-based detection and 98.66% for family-based detection. For interclass classification, which distinguishes between malware types or families, the models reached up to 97.5% accuracy on type-level tasks and up to 93.7% on family-level tasks. In the multiclass classification setting, which assigns samples to the correct type or family, SVM achieved 81.1% accuracy on type labels, while Random Forest and XGBoost reached approximately 73.4% on family labels. The results highlight practical trade-offs between accuracy and computational cost, and demonstrate that labeling at both the type and family levels enables more fine-grained and insightful malware classification. The work establishes a robust foundation for future research on advanced malware detection and classification.

Title: Three-dimensional end-to-end deep learning for brain MRI analysis

Authors: Radhika Juglan, Marta Ligero, Zunamys I. Carrero, Asier Rabasco, Tim Lenz, Leo Misera, Gregory Patrick Veldhuizen, Paul Kuntke, Hagen H. Kitzler, Sven Nebelung, Daniel Truhn, Jakob Nikolas Kather
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23916
Pdf URL: https://arxiv.org/pdf/2506.23916
Copy Paste: [[2506.23916]] Three-dimensional end-to-end deep learning for brain MRI analysis(https://arxiv.org/abs/2506.23916)
Keywords: extraction, explainability, transformer
Abstract: Deep learning (DL) methods are increasingly outperforming classical approaches in brain imaging, yet their generalizability across diverse imaging cohorts remains inadequately assessed. As age and sex are key neurobiological markers in clinical neuroscience, influencing brain structure and disease risk, this study evaluates three of the existing three-dimensional architectures, namely Simple Fully Connected Network (SFCN), DenseNet, and Shifted Window (Swin) Transformers, for age and sex prediction using T1-weighted MRI from four independent cohorts: UK Biobank (UKB, n=47,390), Dallas Lifespan Brain Study (DLBS, n=132), Parkinson's Progression Markers Initiative (PPMI, n=108 healthy controls), and Information eXtraction from Images (IXI, n=319). We found that SFCN consistently outperformed more complex architectures with AUC of 1.00 [1.00-1.00] in UKB (internal test set) and 0.85-0.91 in external test sets for sex classification. For the age prediction task, SFCN demonstrated a mean absolute error (MAE) of 2.66 (r=0.89) in UKB and 4.98-5.81 (r=0.55-0.70) across external datasets. Pairwise DeLong and Wilcoxon signed-rank tests with Bonferroni corrections confirmed SFCN's superiority over Swin Transformer across most cohorts (p<0.017, for three comparisons). Explainability analysis further demonstrates the regional consistency of model attention across cohorts and specific to each task. Our findings reveal that simpler convolutional networks outperform the denser and more complex attention-based DL architectures in brain image analysis by demonstrating better generalizability across different datasets.

Title: The Trilemma of Truth in Large Language Models

Authors: Germans Savcisens, Tina Eliassi-Rad
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.23921
Pdf URL: https://arxiv.org/pdf/2506.23921
Copy Paste: [[2506.23921]] The Trilemma of Truth in Large Language Models(https://arxiv.org/abs/2506.23921)
Keywords: large language model
Abstract: We often attribute human characteristics to large language models (LLMs) and claim that they "know" certain things. LLMs have an internal probabilistic knowledge that represents information retained during training. How can we assess the veracity of this knowledge? We examine two common methods for probing the veracity of LLMs and discover several assumptions that are flawed. To address these flawed assumptions, we introduce sAwMIL (short for Sparse Aware Multiple-Instance Learning), a probing method that utilizes the internal activations of LLMs to separate statements into true, false, and neither. sAwMIL is based on multiple-instance learning and conformal prediction. We evaluate sAwMIL on 5 validity criteria across 16 open-source LLMs, including both default and chat-based variants, as well as on 3 new datasets. Among the insights we provide are: (1) the veracity signal is often concentrated in the third quarter of an LLM's depth; (2) truth and falsehood signals are not always symmetric; (3) linear probes perform better on chat models than on default models; (4) nonlinear probes may be required to capture veracity signals for some LLMs with reinforcement learning from human feedback or knowledge distillation; and (5) LLMs capture a third type of signal that is distinct from true and false and is neither true nor false. These findings provide a reliable method for verifying what LLMs "know" and how certain they are of their probabilistic internal knowledge.

Title: IMPACT: Inflectional Morphology Probes Across Complex Typologies

Authors: Mohammed J. Saeed, Tommi Vehvilainen, Evgeny Fedoseev, Sevil Caliskan, Tatiana Vodolazova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23929
Pdf URL: https://arxiv.org/pdf/2506.23929
Copy Paste: [[2506.23929]] IMPACT: Inflectional Morphology Probes Across Complex Typologies(https://arxiv.org/abs/2506.23929)
Keywords: large language model
Abstract: Large Language Models (LLMs) have shown significant progress on various multilingual benchmarks and are increasingly used to generate and evaluate text in non-English languages. However, while they may produce fluent outputs, it remains unclear to what extent these models truly grasp the underlying linguistic complexity of those languages, particularly in morphology. To investigate this, we introduce IMPACT, a synthetically generated evaluation framework focused on inflectional morphology, which we publicly release, designed to evaluate LLM performance across five morphologically rich languages: Arabic, Russian, Finnish, Turkish, and Hebrew. IMPACT includes unit-test-style cases covering both shared and language-specific phenomena, from basic verb inflections (e.g., tense, number, gender) to unique features like Arabic's reverse gender agreement and vowel harmony in Finnish and Turkish. We assess eight multilingual LLMs that, despite strong English performance, struggle with other languages and uncommon morphological patterns, especially when judging ungrammatical examples. We also show that Chain of Thought and Thinking Models can degrade performance. Our work exposes gaps in LLMs' handling of linguistic complexity, pointing to clear room for improvement. To support further research, we publicly release the IMPACT framework.

Title: Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages

Authors: Ruhina Tabasshum Prome (Bangladesh Institute of Governance and Management), Tarikul Islam Tamiti (George Mason University), Anomadarshi Barua (George Mason University)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.23930
Pdf URL: https://arxiv.org/pdf/2506.23930
Copy Paste: [[2506.23930]] Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages(https://arxiv.org/abs/2506.23930)
Keywords: large language model
Abstract: The rapid expansion of social media leads to a marked increase in hate speech, which threatens personal lives and results in numerous hate crimes. Detecting hate speech presents several challenges: diverse dialects, frequent code-mixing, and the prevalence of misspelled words in user-generated content on social media platforms. Recent progress in hate speech detection is typically concentrated on high-resource languages. However, low-resource languages still face significant challenges due to the lack of large-scale, high-quality datasets. This paper investigates how we can overcome this limitation via prompt engineering on large language models (LLMs) focusing on low-resource Bengali language. We investigate six prompting strategies - zero-shot prompting, refusal suppression, flattering the classifier, multi-shot prompting, role prompting, and finally our innovative metaphor prompting to detect hate speech effectively in low-resource languages. We pioneer the metaphor prompting to circumvent the built-in safety mechanisms of LLMs that marks a significant departure from existing jailbreaking methods. We investigate all six different prompting strategies on the Llama2-7B model and compare the results extensively with three pre-trained word embeddings - GloVe, Word2Vec, and FastText for three different deep learning models - multilayer perceptron (MLP), convolutional neural network (CNN), and bidirectional gated recurrent unit (BiGRU). To prove the effectiveness of our metaphor prompting in the low-resource Bengali language, we also evaluate it in another low-resource language - Hindi, and two high-resource languages - English and German. The performance of all prompting techniques is evaluated using the F1 score, and environmental impact factor (IF), which measures CO$_2$ emissions, electricity usage, and computational time.

Title: Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs

Authors: Yang Dai, Jianxiang An, Tianwei Lin, Hongyang He, Hongzhe Huang, Wenqiao Zhang, Zheqi Lv, Siliang Tang, Yueting Zhuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23940
Pdf URL: https://arxiv.org/pdf/2506.23940
Copy Paste: [[2506.23940]] Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs(https://arxiv.org/abs/2506.23940)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have achieved success across various domains. However, their applicability tends to degrade when confronted with different types of data inputs, especially for MLLMs that have been fine-tuned for specific tasks. Despite its importance, the study of knowledge sharing among domain-specific MLLMs--such as those trained for mathematics or code--remains largely underexplored. To address the fragmentation of knowledge across domain-specialized MLLMs, we propose a unified parameter integration framework that enables modular composition of expert capabilities. Our method is grounded in a novel Compatibility-Aware Parameter Splicing (CAPS) strategy, which leverages both local functional attribution and global information-theoretic signals to guide selective parameter fusion. By extending this mechanism to the low-rank adaptation layer granularity, we ensure efficient integration with minimal inference overhead. Furthermore, we introduce a domain compatibility scoring mechanism that quantifies inter-expert alignment at the activation level and correlates with downstream task utility. This principled fusion protocol allows the final model to synergize heterogeneous expertise while preserving structural modularity. Extensive evaluations across diverse multimodal benchmarks validate the effectiveness of our framework, offering a scalable path toward compositional, domain-adaptive MLLMs.

Title: Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

Authors: Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23951
Pdf URL: https://arxiv.org/pdf/2506.23951
Copy Paste: [[2506.23951]] Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders(https://arxiv.org/abs/2506.23951)
Keywords: extraction, interpretability, explainability, large language model
Abstract: Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that our architecture improves both the causality and interpretability of the extracted features.

Title: Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages

Authors: Ikechukwu Ogbonna, Lesley Davidson, Soumya Banerjee, Abhishek Dasgupta, Laurence Kenney, Vikranth Harthikote Nagaraja
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23958
Pdf URL: https://arxiv.org/pdf/2506.23958
Copy Paste: [[2506.23958]] Bridging the Gap with Retrieval-Augmented Generation: Making Prosthetic Device User Manuals Available in Marginalised Languages(https://arxiv.org/abs/2506.23958)
Keywords: generative
Abstract: Millions of people in African countries face barriers to accessing healthcare due to language and literacy gaps. This research tackles this challenge by transforming complex medical documents -- in this case, prosthetic device user manuals -- into accessible formats for underserved populations. This case study in cross-cultural translation is particularly pertinent/relevant for communities that receive donated prosthetic devices but may not receive the accompanying user documentation. Or, if available online, may only be available in formats (e.g., language and readability) that are inaccessible to local populations (e.g., English-language, high resource settings/cultural context). The approach is demonstrated using the widely spoken Pidgin dialect, but our open-source framework has been designed to enable rapid and easy extension to other languages/dialects. This work presents an AI-powered framework designed to process and translate complex medical documents, e.g., user manuals for prosthetic devices, into marginalised languages. The system enables users -- such as healthcare workers or patients -- to upload English-language medical equipment manuals, pose questions in their native language, and receive accurate, localised answers in real time. Technically, the system integrates a Retrieval-Augmented Generation (RAG) pipeline for processing and semantic understanding of the uploaded manuals. It then employs advanced Natural Language Processing (NLP) models for generative question-answering and multilingual translation. Beyond simple translation, it ensures accessibility to device instructions, treatment protocols, and safety information, empowering patients and clinicians to make informed healthcare decisions.

Title: ADReFT: Adaptive Decision Repair for Safe Autonomous Driving via Reinforcement Fine-Tuning

Authors: Mingfei Cheng, Xiaofei Xie, Renzhi Wang, Yuan Zhou, Ming Hu
Subjects: cs.LG, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2506.23960
Pdf URL: https://arxiv.org/pdf/2506.23960
Copy Paste: [[2506.23960]] ADReFT: Adaptive Decision Repair for Safe Autonomous Driving via Reinforcement Fine-Tuning(https://arxiv.org/abs/2506.23960)
Keywords: transformer
Abstract: Autonomous Driving Systems (ADSs) continue to face safety-critical risks due to the inherent limitations in their design and performance capabilities. Online repair plays a crucial role in mitigating such limitations, ensuring the runtime safety and reliability of ADSs. Existing online repair solutions enforce ADS compliance by transforming unacceptable trajectories into acceptable ones based on predefined specifications, such as rule-based constraints or training datasets. However, these approaches often lack generalizability, adaptability and tend to be overly conservative, resulting in ineffective repairs that not only fail to mitigate safety risks sufficiently but also degrade the overall driving experience. To address this issue, we propose Adaptive Decision Repair (ADReFT), a novel and effective repair method that identifies safety-critical states through offline learning from failed tests and generates appropriate mitigation actions to improve ADS safety. Specifically, ADReFT incorporates a transformer-based model with two joint heads, State Monitor and Decision Adapter, designed to capture complex driving environment interactions to evaluate state safety severity and generate adaptive repair actions. Given the absence of oracles for state safety identification, we first pretrain ADReFT using supervised learning with coarse annotations, i.e., labeling states preceding violations as positive samples and others as negative samples. It establishes ADReFT's foundational capability to mitigate safety-critical violations, though it may result in somewhat conservative mitigation strategies. Therefore, we subsequently finetune ADReFT using reinforcement learning to improve its initial capability and generate more precise and contextually appropriate repair decisions. Our evaluation results illustrate that ADReFT achieves better repair performance.

Title: Evaluating the Impact of Khmer Font Types on Text Recognition

Authors: Vannkinh Nom, Souhail Bakkali, Muhammad Muzzamil Luqman, Mickael Coustaty, Jean-Marc Ogier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23963
Pdf URL: https://arxiv.org/pdf/2506.23963
Copy Paste: [[2506.23963]] Evaluating the Impact of Khmer Font Types on Text Recognition(https://arxiv.org/abs/2506.23963)
Keywords: robust
Abstract: Text recognition is significantly influenced by font types, especially for complex scripts like Khmer. The variety of Khmer fonts, each with its unique character structure, presents challenges for optical character recognition (OCR) systems. In this study, we evaluate the impact of 19 randomly selected Khmer font types on text recognition accuracy using Pytesseract. The fonts include Angkor, Battambang, Bayon, Bokor, Chenla, Dangrek, Freehand, Kh Kompong Chhnang, Kh SN Kampongsom, Khmer, Khmer CN Stueng Songke, Khmer Savuth Pen, Metal, Moul, Odor MeanChey, Preah Vihear, Siemreap, Sithi Manuss, and iSeth First. Our comparison of OCR performance across these fonts reveals that Khmer, Odor MeanChey, Siemreap, Sithi Manuss, and Battambang achieve high accuracy, while iSeth First, Bayon, and Dangrek perform poorly. This study underscores the critical importance of font selection in optimizing Khmer text recognition and provides valuable insights for developing more robust OCR systems.

Title: UMA: A Family of Universal Models for Atoms

Authors: Brandon M. Wood, Misko Dzamba, Xiang Fu, Meng Gao, Muhammed Shuaibi, Luis Barroso-Luque, Kareem Abdelmaqsoud, Vahe Gharakhanyan, John R. Kitchin, Daniel S. Levine, Kyle Michel, Anuroop Sriram, Taco Cohen, Abhishek Das, Ammar Rizvi, Sushree Jagriti Sahoo, Zachary W. Ulissi, C. Lawrence Zitnick
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23971
Pdf URL: https://arxiv.org/pdf/2506.23971
Copy Paste: [[2506.23971]] UMA: A Family of Universal Models for Atoms(https://arxiv.org/abs/2506.23971)
Keywords: fair
Abstract: The ability to quickly and accurately compute properties from atomic simulations is critical for advancing a large number of applications in chemistry and materials science including drug discovery, energy storage, and semiconductor manufacturing. To address this need, Meta FAIR presents a family of Universal Models for Atoms (UMA), designed to push the frontier of speed, accuracy, and generalization. UMA models are trained on half a billion unique 3D atomic structures (the largest training runs to date) by compiling data across multiple chemical domains, e.g. molecules, materials, and catalysts. We develop empirical scaling laws to help understand how to increase model capacity alongside dataset size to achieve the best accuracy. The UMA small and medium models utilize a novel architectural design we refer to as mixture of linear experts that enables increasing model capacity without sacrificing speed. For example, UMA-medium has 1.4B parameters but only ~50M active parameters per atomic structure. We evaluate UMA models on a diverse set of applications across multiple domains and find that, remarkably, a single model without any fine-tuning can perform similarly or better than specialized models. We are releasing the UMA code, weights, and associated data to accelerate computational workflows and enable the community to continue to build increasingly capable AI models.

Title: Visual and Memory Dual Adapter for Multi-Modal Object Tracking

Authors: Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23972
Pdf URL: https://arxiv.org/pdf/2506.23972
Copy Paste: [[2506.23972]] Visual and Memory Dual Adapter for Multi-Modal Object Tracking(https://arxiv.org/abs/2506.23972)
Keywords: robust
Abstract: Prompt-learning-based multi-modal trackers have achieved promising progress by employing lightweight visual adapters to incorporate auxiliary modality features into frozen foundation models. However, existing approaches often struggle to learn reliable prompts due to limited exploitation of critical cues across frequency and temporal domains. In this paper, we propose a novel visual and memory dual adapter (VMDA) to construct more robust and discriminative representations for multi-modal tracking. Specifically, we develop a simple but effective visual adapter that adaptively transfers discriminative cues from auxiliary modality to dominant modality by jointly modeling the frequency, spatial, and channel-wise features. Additionally, we design the memory adapter inspired by the human memory mechanism, which stores global temporal cues and performs dynamic update and retrieval operations to ensure the consistent propagation of reliable temporal information across video sequences. Extensive experiments demonstrate that our method achieves state-of-the-art performance on the various multi-modal tracking tasks, including RGB-Thermal, RGB-Depth, and RGB-Event tracking. Code and models are available at this https URL.

Title: Toward Simple and Robust Contrastive Explanations for Image Classification by Leveraging Instance Similarity and Concept Relevance

Authors: Yuliia Kaidashova, Bettina Finzel, Ute Schmid
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.23975
Pdf URL: https://arxiv.org/pdf/2506.23975
Copy Paste: [[2506.23975]] Toward Simple and Robust Contrastive Explanations for Image Classification by Leveraging Instance Similarity and Concept Relevance(https://arxiv.org/abs/2506.23975)
Keywords: robust
Abstract: Understanding why a classification model prefers one class over another for an input instance is the challenge of contrastive explanation. This work implements concept-based contrastive explanations for image classification by leveraging the similarity of instance embeddings and relevance of human-understandable concepts used by a fine-tuned deep learning model. Our approach extracts concepts with their relevance score, computes contrasts for similar instances, and evaluates the resulting contrastive explanations based on explanation complexity. Robustness is tested for different image augmentations. Two research questions are addressed: (1) whether explanation complexity varies across different relevance ranges, and (2) whether explanation complexity remains consistent under image augmentations such as rotation and noise. The results confirm that for our experiments higher concept relevance leads to shorter, less complex explanations, while lower relevance results in longer, more diffuse explanations. Additionally, explanations show varying degrees of robustness. The discussion of these findings offers insights into the potential of building more interpretable and robust AI systems.

Title: A Scalable Approach for Safe and Robust Learning via Lipschitz-Constrained Networks

Authors: Zain ul Abdeen, Vassilis Kekatos, Ming Jin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.23977
Pdf URL: https://arxiv.org/pdf/2506.23977
Copy Paste: [[2506.23977]] A Scalable Approach for Safe and Robust Learning via Lipschitz-Constrained Networks(https://arxiv.org/abs/2506.23977)
Keywords: robust
Abstract: Certified robustness is a critical property for deploying neural networks (NN) in safety-critical applications. A principle approach to achieving such guarantees is to constrain the global Lipschitz constant of the network. However, accurate methods for Lipschitz-constrained training often suffer from non-convex formulations and poor scalability due to reliance on global semidefinite programs (SDPs). In this letter, we propose a convex training framework that enforces global Lipschitz constraints via semidefinite relaxation. By reparameterizing the NN using loop transformation, we derive a convex admissibility condition that enables tractable and certifiable training. While the resulting formulation guarantees robustness, its scalability is limited by the size of global SDP. To overcome this, we develop a randomized subspace linear matrix inequalities (RS-LMI) approach that decomposes the global constraints into sketched layerwise constraints projected onto low-dimensional subspaces, yielding a smooth and memory-efficient training objective. Empirical results on MNIST, CIFAR-10, and ImageNet demonstrate that the proposed framework achieves competitive accuracy with significantly improved Lipschitz bounds and runtime performance.

Title: LLM Agents Are the Antidote to Walled Gardens

Authors: Samuele Marro, Philip Torr
Subjects: cs.LG, cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2506.23978
Pdf URL: https://arxiv.org/pdf/2506.23978
Copy Paste: [[2506.23978]] LLM Agents Are the Antidote to Walled Gardens(https://arxiv.org/abs/2506.23978)
Keywords: security
Abstract: While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.

Title: TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation

Authors: Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun, Wuwei Huang, Quandong Wang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23979
Pdf URL: https://arxiv.org/pdf/2506.23979
Copy Paste: [[2506.23979]] TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation(https://arxiv.org/abs/2506.23979)
Keywords: large language model
Abstract: Conducting supervised fine-tuning and preference fine-tuning on large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values. However, constructing such datasets is resource-intensive, and most available datasets for supervised and preference fine-tuning are in English. To address these challenges, we propose the \underline{\textbf{Ta}}xonomy-Guided \underline{\textbf{P}}reference Data Generation (TaP) framework, which facilitates automated and scalable construction of preference datasets across various languages. TaP is grounded in a structured taxonomy that allows fine-grained control over dataset composition, thereby ensuring both diversity and comprehensive coverage. We employ TaP-generated datasets to perform supervised and preference fine-tuning on various LLMs. Experimental results demonstrate that LLMs trained on TaP-generated datasets outperform those trained on existing open-source datasets. Remarkably, LLMs trained on TaP-generated datasets surpass the performance of those trained on an open-source dataset that is 180 times larger.

Title: Lock Prediction for Zero-Downtime Database Encryption

Authors: Mohamed Sami Rakha, Adam Sorrenti, Greg Stager, Walid Rjaibi, Andriy Miranskyy
Subjects: cs.CR, cs.DB
Abstract URL: https://arxiv.org/abs/2506.23985
Pdf URL: https://arxiv.org/pdf/2506.23985
Copy Paste: [[2506.23985]] Lock Prediction for Zero-Downtime Database Encryption(https://arxiv.org/abs/2506.23985)
Keywords: secure, security, protect, robust, transformer
Abstract: Modern enterprise database systems face significant challenges in balancing data security and performance. Ensuring robust encryption for sensitive information is critical for systems' compliance with security standards. Although holistic database encryption provides strong protection, existing database systems often require a complete backup and restore cycle, resulting in prolonged downtime and increased storage usage. This makes it difficult to implement online encryption techniques in high-throughput environments without disrupting critical operations. To address this challenge, we envision a solution that enables online database encryption aligned with system activity, eliminating the need for downtime, storage overhead, or full-database reprocessing. Central to this vision is the ability to predict which parts of the database will be accessed next, allowing encryption to be applied online. As a step towards this solution, this study proposes a predictive approach that leverages deep learning models to forecast database lock sequences, using IBM Db2 as the database system under study. In this study, we collected a specialized dataset from TPC-C benchmark workloads, leveraging lock event logs for model training and evaluation. We applied deep learning architectures, such as Transformer and LSTM, to evaluate models for various table-level and page-level lock predictions. We benchmark the accuracy of the trained models versus a Naive Baseline across different prediction horizons and timelines. The study experiments demonstrate that the proposed deep learning-based models achieve up to 49% average accuracy for table-level and 66% for page-level predictions, outperforming a Naive Baseline. By anticipating which tables and pages will be locked next, the proposed approach is a step toward online encryption, offering a practical path toward secure, low-overhead database systems.

Title: Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning

Authors: Seungjun Yi, Joakim Nguyen, Huimin Xu, Terence Lim, Andrew Well, Mia Markey, Ying Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.23998
Pdf URL: https://arxiv.org/pdf/2506.23998
Copy Paste: [[2506.23998]] Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning(https://arxiv.org/abs/2506.23998)
Keywords: large language model
Abstract: Congenital heart disease (CHD) presents complex, lifelong challenges often underrepresented in traditional clinical metrics. While unstructured narratives offer rich insights into patient and caregiver experiences, manual thematic analysis (TA) remains labor-intensive and unscalable. We propose a fully automated large language model (LLM) pipeline that performs end-to-end TA on clinical narratives, which eliminates the need for manual coding or full transcript review. Our system employs a novel multi-agent framework, where specialized LLM agents assume roles to enhance theme quality and alignment with human analysis. To further improve thematic relevance, we optionally integrate reinforcement learning from human feedback (RLHF). This supports scalable, patient-centered analysis of large qualitative datasets and allows LLMs to be fine-tuned for specific clinical contexts.

Title: The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models

Authors: Lijun Sheng, Jian Liang, Ran He, Zilei Wang, Tieniu Tan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.24000
Pdf URL: https://arxiv.org/pdf/2506.24000
Copy Paste: [[2506.24000]] The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models(https://arxiv.org/abs/2506.24000)
Keywords: robust, fair
Abstract: Test-time adaptation (TTA) methods have gained significant attention for enhancing the performance of vision-language models (VLMs) such as CLIP during inference, without requiring additional labeled data. However, current TTA researches generally suffer from major limitations such as duplication of baseline results, limited evaluation metrics, inconsistent experimental settings, and insufficient analysis. These problems hinder fair comparisons between TTA methods and obscure their practical strengths and weaknesses. To address these challenges, we introduce TTA-VLM, a comprehensive benchmark for evaluating TTA methods on VLMs. Our benchmark implements 8 episodic TTA and 7 online TTA methods within a unified and reproducible framework, and evaluates them across 15 widely used datasets. Unlike prior studies focused solely on CLIP, we extend the evaluation to SigLIP--a model trained with a Sigmoid loss--and include training-time tuning methods such as CoOp, MaPLe, and TeCoA to assess generality. Beyond classification accuracy, TTA-VLM incorporates various evaluation metrics, including robustness, calibration, out-of-distribution detection, and stability, enabling a more holistic assessment of TTA methods. Through extensive experiments, we find that 1) existing TTA methods produce limited gains compared to the previous pioneering work; 2) current TTA methods exhibit poor collaboration with training-time fine-tuning methods; 3) accuracy gains frequently come at the cost of reduced model trustworthiness. We release TTA-VLM to provide fair comparison and comprehensive evaluation of TTA methods for VLMs, and we hope it encourages the community to develop more reliable and generalizable TTA strategies.

Title: Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective

Authors: Anselm R. Strohmaier, Wim Van Dooren, Kathrin Seßler, Brian Greer, Lieven Verschaffel
Subjects: cs.CL, math.HO
Abstract URL: https://arxiv.org/abs/2506.24006
Pdf URL: https://arxiv.org/pdf/2506.24006
Copy Paste: [[2506.24006]] Large Language Models Don't Make Sense of Word Problems. A Scoping Review from a Mathematics Education Perspective(https://arxiv.org/abs/2506.24006)
Keywords: large language model
Abstract: The progress of Large Language Models (LLMs) like ChatGPT raises the question of how they can be integrated into education. One hope is that they can support mathematics learning, including word-problem solving. Since LLMs can handle textual input with ease, they appear well-suited for solving mathematical word problems. Yet their real competence, whether they can make sense of the real-world context, and the implications for classrooms remain unclear. We conducted a scoping review from a mathematics-education perspective, including three parts: a technical overview, a systematic review of word problems used in research, and a state-of-the-art empirical evaluation of LLMs on mathematical word problems. First, in the technical overview, we contrast the conceptualization of word problems and their solution processes between LLMs and students. In computer-science research this is typically labeled mathematical reasoning, a term that does not align with usage in mathematics education. Second, our literature review of 213 studies shows that the most popular word-problem corpora are dominated by s-problems, which do not require a consideration of realities of their real-world context. Finally, our evaluation of GPT-3.5-turbo, GPT-4o-mini, GPT-4.1, and o3 on 287 word problems shows that most recent LLMs solve these s-problems with near-perfect accuracy, including a perfect score on 20 problems from PISA. LLMs still showed weaknesses in tackling problems where the real-world context is problematic or non-sensical. In sum, we argue based on all three aspects that LLMs have mastered a superficial solution process but do not make sense of word problems, which potentially limits their value as instructional tools in mathematics classrooms.

Title: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations

Authors: Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.24016
Pdf URL: https://arxiv.org/pdf/2506.24016
Copy Paste: [[2506.24016]] EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations(https://arxiv.org/abs/2506.24016)
Keywords: large language model
Abstract: Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at this https URL.

Title: Poisoning Attacks to Local Differential Privacy for Ranking Estimation

Authors: Pei Zhan (1, 2 and 3), Peng Tang (1, 2 and 3), Yangzhuo Li (1 and 3), Puwen Wei (1 and 3), Shanqing Guo (1 and 3) ((1) School of Cyber Science and Technology, Shandong University, (2) Quan Cheng Laboratory, Jinan, China, (3) State Key Laboratory of Cryptography and Digital Economy Security, Shandong University, Qingdao, China)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.24033
Pdf URL: https://arxiv.org/pdf/2506.24033
Copy Paste: [[2506.24033]] Poisoning Attacks to Local Differential Privacy for Ranking Estimation(https://arxiv.org/abs/2506.24033)
Keywords: privacy, defense, attack
Abstract: Local differential privacy (LDP) involves users perturbing their inputs to provide plausible deniability of their data. However, this also makes LDP vulnerable to poisoning attacks. In this paper, we first introduce novel poisoning attacks for ranking estimation. These attacks are intricate, as fake attackers do not merely adjust the frequency of target items. Instead, they leverage a limited number of fake users to precisely modify frequencies, effectively altering item rankings to maximize gains. To tackle this challenge, we introduce the concepts of attack cost and optimal attack item (set), and propose corresponding strategies for kRR, OUE, and OLH protocols. For kRR, we iteratively select optimal attack items and allocate suitable fake users. For OUE, we iteratively determine optimal attack item sets and consider the incremental changes in item frequencies across different sets. Regarding OLH, we develop a harmonic cost function based on the pre-image of a hash to select that supporting a larger number of effective attack items. Lastly, we present an attack strategy based on confidence levels to quantify the probability of a successful attack and the number of attack iterations more precisely. We demonstrate the effectiveness of our attacks through theoretical and empirical evidence, highlighting the necessity for defenses against these attacks. The source code and data have been made available at this https URL.

Title: Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data

Authors: Shubhabrata Mukherjee, Jack Lang, Obeen Kwon, Iryna Zenyuk, Valerie Brogden, Adam Weber, Daniela Ushizima
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2506.24039
Pdf URL: https://arxiv.org/pdf/2506.24039
Copy Paste: [[2506.24039]] Foundation Models for Zero-Shot Segmentation of Scientific Images without AI-Ready Data(https://arxiv.org/abs/2506.24039)
Keywords: segmentation
Abstract: Zero-shot and prompt-based technologies capitalized on using frequently occurring images to transform visual reasoning tasks, which explains why such technologies struggle with valuable yet scarce scientific image sets. In this work, we propose Zenesis, a comprehensive no-code interactive platform designed to minimize barriers posed by data readiness for scientific images. We develop lightweight multi-modal adaptation techniques that enable zero-shot operation on raw scientific data, along with human-in-the-loop refinement and heuristic-based temporal enhancement options. We demonstrate the performance of our approach through comprehensive comparison and validation on challenging Focused Ion Beam Scanning Electron Microscopy (FIB-SEM) data of catalyst-loaded membranes. Zenesis significantly outperforms baseline methods, achieving an average accuracy of 0.947, an Intersection over Union (IOU) of 0.858, and a Dice score of 0.923 for amorphous catalyst samples and accuracy of 0.987, an IOU of 0.857, and a Dice score of 0.923 for crystalline samples. These results mark a substantial improvement over traditional methods like Otsu thresholding and even advanced models like Segment Anything Model (SAM) when used in isolation. Our results demonstrate that Zenesis is a powerful tool for scientific applications, particularly in fields where high-quality annotated datasets are unavailable, accelerating accurate analysis of experimental imaging.

Title: Faster Diffusion Models via Higher-Order Approximation

Authors: Gen Li, Yuchen Zhou, Yuting Wei, Yuxin Chen
Subjects: cs.LG, math.NA, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2506.24042
Pdf URL: https://arxiv.org/pdf/2506.24042
Copy Paste: [[2506.24042]] Faster Diffusion Models via Higher-Order Approximation(https://arxiv.org/abs/2506.24042)
Keywords: robust, diffusion
Abstract: In this paper, we explore provable acceleration of diffusion models without any additional retraining. Focusing on the task of approximating a target data distribution in $\mathbb{R}^d$ to within $\varepsilon$ total-variation distance, we propose a principled, training-free sampling algorithm that requires only the order of $$ d^{1+2/K} \varepsilon^{-1/K} $$ score function evaluations (up to log factor) in the presence of accurate scores, where $K$ is an arbitrarily large fixed integer. This result applies to a broad class of target data distributions, without the need for assumptions such as smoothness or log-concavity. Our theory is robust vis-a-vis inexact score estimation, degrading gracefully as the score estimation error increases -- without demanding higher-order smoothness on the score estimates as assumed in previous work. The proposed algorithm draws insight from high-order ODE solvers, leveraging high-order Lagrange interpolation and successive refinement to approximate the integral derived from the probability flow ODE.

Title: A Survey on Vision-Language-Action Models for Autonomous Driving

Authors: Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.24044
Pdf URL: https://arxiv.org/pdf/2506.24044
Copy Paste: [[2506.24044]] A Survey on Vision-Language-Action Models for Autonomous Driving(https://arxiv.org/abs/2506.24044)
Keywords: robust, large language model
Abstract: The rapid progress of multimodal large language models (MLLM) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is rapidly expanding. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer to reasoning-centric VLA models, and (iii) compare over 20 representative models according to VLA's progress in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges - robustness, real-time efficiency, and formal verification - and outline future directions of VLA4AD. This survey provides a concise yet complete reference for advancing interpretable socially aligned autonomous vehicles. Github repo is available at \href{this https URL}{SicongJiang/Awesome-VLA4AD}.

Title: Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models

Authors: Tung-Ling Li, Hongliang Liu
Subjects: cs.CR, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.24056
Pdf URL: https://arxiv.org/pdf/2506.24056
Copy Paste: [[2506.24056]] Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models(https://arxiv.org/abs/2506.24056)
Keywords: attack, large language model
Abstract: We introduce logit-gap steering, a fast jailbreak framework that casts the refusal-affirmation gap of RLHF-aligned language models as a single pass over the vocabulary. A forward-computable score blends gap reduction with lightweight proxies for KL penalty and reward shift, allowing a "sort-sum-stop" sweep to complete in under a second and return a short suffix--two orders of magnitude fewer model calls than beam or gradient attacks. The same suffix generalises to unseen prompts and scales from 0.5 B to 70 B checkpoints, lifting one-shot attack success from baseline levels to 80-100% while preserving topical coherence. Beyond efficiency, these suffixes expose sentence-boundary reward cliffs and other alignment artefacts, offering a lightweight probe into how safety tuning reshapes internal representations.

Title: Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios

Authors: Deng Li, Aming Wu, Yang Li, Yaowei Wang, Yahong Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24063
Pdf URL: https://arxiv.org/pdf/2506.24063
Copy Paste: [[2506.24063]] Continual Adaptation: Environment-Conditional Parameter Generation for Object Detection in Dynamic Scenarios(https://arxiv.org/abs/2506.24063)
Keywords: diffusion
Abstract: In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process to a specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter's parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness. Meanwhile, visualization results show that the representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.

Title: STACK: Adversarial Attacks on LLM Safeguard Pipelines

Authors: Ian R. McKenzie, Oskar J. Hollinsworth, Tom Tseng, Xander Davies, Stephen Casper, Aaron D. Tucker, Robert Kirk, Adam Gleave
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24068
Pdf URL: https://arxiv.org/pdf/2506.24068
Copy Paste: [[2506.24068]] STACK: Adversarial Attacks on LLM Safeguard Pipelines(https://arxiv.org/abs/2506.24068)
Keywords: security, protect, defense, attack
Abstract: Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.

Title: Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention

Authors: Wonwoong Cho, Yanxia Zhang, Yan-Ying Chen, David I. Inouye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24085
Pdf URL: https://arxiv.org/pdf/2506.24085
Copy Paste: [[2506.24085]] Imagine for Me: Creative Conceptual Blending of Real Images and Text via Blended Attention(https://arxiv.org/abs/2506.24085)
Keywords: diffusion, generative
Abstract: Blending visual and textual concepts into a new visual concept is a unique and powerful trait of human beings that can fuel creativity. However, in practice, cross-modal conceptual blending for humans is prone to cognitive biases, like design fixation, which leads to local minima in the design space. In this paper, we propose a T2I diffusion adapter "IT-Blender" that can automate the blending process to enhance human creativity. Prior works related to cross-modal conceptual blending are limited in encoding a real image without loss of details or in disentangling the image and text inputs. To address these gaps, IT-Blender leverages pretrained diffusion models (SD and FLUX) to blend the latent representations of a clean reference image with those of the noisy generated image. Combined with our novel blended attention, IT-Blender encodes the real reference image without loss of details and blends the visual concept with the object specified by the text in a disentangled way. Our experiment results show that IT-Blender outperforms the baselines by a large margin in blending visual and textual concepts, shedding light on the new application of image generative models to augment human creativity.

Title: MotionGPT3: Human Motion as a Second Modality

Authors: Bingfan Zhu, Biao Jiang, Sunyi Wang, Shixiang Tang, Tao Chen, Linjie Luo, Youyi Zheng, Xin Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.24086
Pdf URL: https://arxiv.org/pdf/2506.24086
Copy Paste: [[2506.24086]] MotionGPT3: Human Motion as a Second Modality(https://arxiv.org/abs/2506.24086)
Keywords: diffusion
Abstract: Though recent advances in multimodal models have demonstrated strong capabilities and opportunities in unified understanding and generation, the development of unified motion-language models remains underexplored. To enable such models with high-fidelity human motion, two core challenges must be addressed. The first is the reconstruction gap between the continuous motion modality and discrete representation in an autoregressive manner, and the second is the degradation of language intelligence during unified training. Inspired by the mixture of experts, we propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality, decoupling motion modeling via separate model parameters and enabling both effective cross-modal interaction and efficient multimodal scaling training. To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model, while a new motion branch is integrated via a shared attention mechanism, enabling bidirectional information flow between two modalities. We first employ a motion Variational Autoencoder (VAE) to encode raw human motion into latent representations. Based on this continuous latent space, the motion branch predicts motion latents directly from intermediate hidden states using a diffusion head, bypassing discrete tokenization. Extensive experiments show that our approach achieves competitive performance on both motion understanding and generation tasks while preserving strong language capabilities, establishing a unified bimodal motion diffusion framework within an autoregressive manner.

Title: WaRA: Wavelet Low Rank Adaptation

Authors: Moein Heidari, Yasamin Medghalchi, Mahdi Khoursha, Reza Rezaeian, Ilker Hacihaliloglu
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2506.24092
Pdf URL: https://arxiv.org/pdf/2506.24092
Copy Paste: [[2506.24092]] WaRA: Wavelet Low Rank Adaptation(https://arxiv.org/abs/2506.24092)
Keywords: segmentation
Abstract: Parameter-efficient fine-tuning (PEFT) has gained widespread adoption across various applications. Among PEFT techniques, Low-Rank Adaptation (LoRA) and its extensions have emerged as particularly effective, allowing efficient model adaptation while significantly reducing computational overhead. However, existing approaches typically rely on global low-rank factorizations, which overlook local or multi-scale structure, failing to capture complex patterns in the weight updates. To address this, we propose WaRA, a novel PEFT method that leverages wavelet transforms to decompose the weight update matrix into a multi-resolution representation. By performing low-rank factorization in the wavelet domain and reconstructing updates through an inverse transform, WaRA obtains compressed adaptation parameters that harness multi-resolution analysis, enabling it to capture both coarse and fine-grained features while providing greater flexibility and sparser representations than standard LoRA. Through comprehensive experiments and analysis, we demonstrate that WaRA performs superior on diverse vision tasks, including image generation, classification, and semantic segmentation, significantly enhancing generated image quality while reducing computational complexity. Although WaRA was primarily designed for vision tasks, we further showcase its effectiveness in language tasks, highlighting its broader applicability and generalizability. The code is publicly available at \href{GitHub}{this https URL}.

Title: Development of Hybrid Artificial Intelligence Training on Real and Synthetic Data: Benchmark on Two Mixed Training Strategies

Authors: Paul Wachter, Lukas Niehaus, Julius Schöning
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.24093
Pdf URL: https://arxiv.org/pdf/2506.24093
Copy Paste: [[2506.24093]] Development of Hybrid Artificial Intelligence Training on Real and Synthetic Data: Benchmark on Two Mixed Training Strategies(https://arxiv.org/abs/2506.24093)
Keywords: robust
Abstract: Synthetic data has emerged as a cost-effective alternative to real data for training artificial neural networks (ANN). However, the disparity between synthetic and real data results in a domain gap. That gap leads to poor performance and generalization of the trained ANN when applied to real-world scenarios. Several strategies have been developed to bridge this gap, which combine synthetic and real data, known as mixed training using hybrid datasets. While these strategies have been shown to mitigate the domain gap, a systematic evaluation of their generalizability and robustness across various tasks and architectures remains underexplored. To address this challenge, our study comprehensively analyzes two widely used mixing strategies on three prevalent architectures and three distinct hybrid datasets. From these datasets, we sample subsets with varying proportions of synthetic to real data to investigate the impact of synthetic and real components. The findings of this paper provide valuable insights into optimizing the use of synthetic data in the training process of any ANN, contributing to enhancing robustness and efficacy.

Title: MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction

Authors: Antoine Guédon, Diego Gomez, Nissim Maruani, Bingchen Gong, George Drettakis, Maks Ovsjanikov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24096
Pdf URL: https://arxiv.org/pdf/2506.24096
Copy Paste: [[2506.24096]] MILo: Mesh-In-the-Loop Gaussian Splatting for Detailed and Efficient Surface Reconstruction(https://arxiv.org/abs/2506.24096)
Keywords: extraction
Abstract: While recent advances in Gaussian Splatting have enabled fast reconstruction of high-quality 3D scenes from images, extracting accurate surface meshes remains a challenge. Current approaches extract the surface through costly post-processing steps, resulting in the loss of fine geometric details or requiring significant time and leading to very dense meshes with millions of vertices. More fundamentally, the a posteriori conversion from a volumetric to a surface representation limits the ability of the final mesh to preserve all geometric structures captured during training. We present MILo, a novel Gaussian Splatting framework that bridges the gap between volumetric and surface representations by differentiably extracting a mesh from the 3D Gaussians. We design a fully differentiable procedure that constructs the mesh-including both vertex locations and connectivity-at every iteration directly from the parameters of the Gaussians, which are the only quantities optimized during training. Our method introduces three key technical contributions: a bidirectional consistency framework ensuring both representations-Gaussians and the extracted mesh-capture the same underlying geometry during training; an adaptive mesh extraction process performed at each training iteration, which uses Gaussians as differentiable pivots for Delaunay triangulation; a novel method for computing signed distance values from the 3D Gaussians that enables precise surface extraction while avoiding geometric erosion. Our approach can reconstruct complete scenes, including backgrounds, with state-of-the-art quality while requiring an order of magnitude fewer mesh vertices than previous methods. Due to their light weight and empty interior, our meshes are well suited for downstream applications such as physics simulations or animation.

Title: DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

Authors: Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24102
Pdf URL: https://arxiv.org/pdf/2506.24102
Copy Paste: [[2506.24102]] DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World(https://arxiv.org/abs/2506.24102)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities. Several grounded caption datasets face the problems of missing detailed descriptions, relations, and massive object descriptions on high-resolution images. To fill this gap for the community, we present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline, containing open-world perception, detailed object caption generation, and dense caption merging. The first stage obtains entity-level masks and labels. The second stage generates the object-level, detailed captions with the guidance of masks and labels from the first stage. The final stage merges object captions and masks into spatial and relational dense captions. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model. Extensive experiments on various settings, including vision-language understanding, visual grounding, and region caption generation, demonstrate the effectiveness of our DenseWorld-1M dataset and labeling models.

Title: Epona: Autoregressive Diffusion World Model for Autonomous Driving

Authors: Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu, Xingang Pan, Xiaoyang Guo, Yuan Liu, Jingwei Huang, Li Yuan, Qian Zhang, Xiao-Xiao Long, Xun Cao, Wei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24113
Pdf URL: https://arxiv.org/pdf/2506.24113
Copy Paste: [[2506.24113]] Epona: Autoregressive Diffusion World Model for Autonomous Driving(https://arxiv.org/abs/2506.24113)
Keywords: diffusion
Abstract: Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4\% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks. Code will be publicly available at \href{this https URL}{this https URL}.

Title: Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models

Authors: David M. Smiley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.24117
Pdf URL: https://arxiv.org/pdf/2506.24117
Copy Paste: [[2506.24117]] Computational Detection of Intertextual Parallels in Biblical Hebrew: A Benchmark Study Using Transformer-Based Language Models(https://arxiv.org/abs/2506.24117)
Keywords: transformer
Abstract: Identifying parallel passages in biblical Hebrew is foundational in biblical scholarship for uncovering intertextual relationships. Traditional methods rely on manual comparison, which is labor-intensive and prone to human error. This study evaluates the potential of pre-trained transformer-based language models, including E5, AlephBERT, MPNet, and LaBSE, for detecting textual parallels in the Hebrew Bible. Focusing on known parallels between the books of Samuel/Kings and Chronicles, I assessed each model's capability to generate word embeddings that delineate parallel from non-parallel passages. Utilizing cosine similarity and Wasserstein Distance measures, I found that E5 and AlephBERT show significant promise, with E5 excelling in parallel detection and AlephBERT demonstrating stronger non-parallel differentiation. These findings indicate that pre-trained models can enhance the efficiency and accuracy of detecting intertextual parallels in ancient texts, suggesting broader applications for ancient language studies.

Title: Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime

Authors: Yuqing Wang, Shangding Gu
Subjects: cs.LG, cs.AI, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2506.24120
Pdf URL: https://arxiv.org/pdf/2506.24120
Copy Paste: [[2506.24120]] Data Uniformity Improves Training Efficiency and More, with a Convergence Framework Beyond the NTK Regime(https://arxiv.org/abs/2506.24120)
Keywords: transformer, large language model
Abstract: Data selection plays a crucial role in data-driven decision-making, including in large language models (LLMs), and is typically task-dependent. Properties such as data quality and diversity have been extensively studied and are known to enhance model performance. However, it remains unclear whether there exist other quantitative and general principles of data selection that can consistently improve performance, especially for complex tasks with limited prior knowledge. In this paper, we demonstrate that selecting more uniformly distributed data can improve training efficiency while enhancing performance. Specifically, we establish that more uniform (less biased) distribution leads to a larger minimum pairwise distance between data points, denoted by $h_{\min}$, and prove that a smaller $h_{\min}$ can slow down the training dynamics of gradient descent (GD). Moreover, we theoretically show that the approximation error of neural networks decreases as $h_{\min}$ increases. Our analysis introduces a convergence framework for GD beyond the Neural Tangent Kernel (NTK) regime, applicable to a broad class of architectures, including transformers, without requiring Lipschitz smoothness. This framework further provides theoretical justification for the use of residual connections and function compositions in deep neural architectures. In the end, we conduct comprehensive experiments for supervised fine-tuning across various settings, including different optimization strategies, model sizes, and training datasets. The results consistently demonstrate that selecting data by maximizing pairwise distance significantly accelerates training and achieves comparable or better performance in LLMs across diverse datasets. Code and Datasets are available at the link: this https URL.

Title: TextMesh4D: High-Quality Text-to-4D Mesh Generation

Authors: Sisi Dai, Xinxin Su, Boyan Wan, Ruizhen Hu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24121
Pdf URL: https://arxiv.org/pdf/2506.24121
Copy Paste: [[2506.24121]] TextMesh4D: High-Quality Text-to-4D Mesh Generation(https://arxiv.org/abs/2506.24121)
Keywords: robust, diffusion, generative
Abstract: Recent advancements in diffusion generative models significantly advanced image, video, and 3D content creation from user-provided text prompts. However, the challenging problem of dynamic 3D content generation (text-to-4D) with diffusion guidance remains largely unexplored. In this paper, we introduce TextMesh4D, a novel framework for high-quality text-to-4D generation. Our approach leverages per-face Jacobians as a differentiable mesh representation and decomposes 4D generation into two stages: static object creation and dynamic motion synthesis. We further propose a flexibility-rigidity regularization term to stabilize Jacobian optimization under video diffusion priors, ensuring robust geometric performance. Experiments demonstrate that TextMesh4D achieves state-of-the-art results in terms of temporal consistency, structural fidelity, and visual realism. Moreover, TextMesh4D operates with a low GPU memory overhead-requiring only a single 24GB GPU-offering a cost-effective yet high-quality solution for text-driven 4D mesh generation. The code will be released to facilitate future research in text-to-4D generation.

Title: Calligrapher: Freestyle Text Image Customization

Authors: Yue Ma, Qingyan Bai, Hao Ouyang, Ka Leong Cheng, Qiuyu Wang, Hongyu Liu, Zichen Liu, Haofan Wang, Jingye Chen, Yujun Shen, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.24123
Pdf URL: https://arxiv.org/pdf/2506.24123
Copy Paste: [[2506.24123]] Calligrapher: Freestyle Text Image Customization(https://arxiv.org/abs/2506.24123)
Keywords: robust, diffusion, generative, large language model
Abstract: We introduce Calligrapher, a novel diffusion-based framework that innovatively integrates advanced text customization with artistic typography for digital calligraphy and design applications. Addressing the challenges of precise style control and data dependency in typographic customization, our framework incorporates three key technical contributions. First, we develop a self-distillation mechanism that leverages the pre-trained text-to-image generative model itself alongside the large language model to automatically construct a style-centric typography benchmark. Second, we introduce a localized style injection framework via a trainable style encoder, which comprises both Qformer and linear layers, to extract robust style features from reference images. An in-context generation mechanism is also employed to directly embed reference images into the denoising process, further enhancing the refined alignment of target styles. Extensive quantitative and qualitative evaluations across diverse fonts and design contexts confirm Calligrapher's accurate reproduction of intricate stylistic details and precise glyph positioning. By automating high-quality, visually consistent typography, Calligrapher surpasses traditional models, empowering creative practitioners in digital art, branding, and contextual typographic design.

Title: Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives

Authors: Dong Sixun, Fan Wei, Teresa Wu, Fu Yanjie
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.24124
Pdf URL: https://arxiv.org/pdf/2506.24124
Copy Paste: [[2506.24124]] Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives(https://arxiv.org/abs/2506.24124)
Keywords: large language model
Abstract: Time series forecasting traditionally relies on unimodal numerical inputs, which often struggle to capture high-level semantic patterns due to their dense and unstructured nature. While recent approaches have explored representing time series as text using large language models (LLMs), these methods remain limited by the discrete nature of token sequences and lack the perceptual intuition humans typically apply, such as interpreting visual patterns. In this paper, we propose a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives. Rather than using natural language or real-world images, we construct both modalities directly from numerical sequences. We then align these views in a shared semantic space via contrastive learning, enabling the model to capture richer and more complementary representations. Furthermore, we introduce a variate selection module that leverages the aligned representations to identify the most informative variables for multivariate forecasting. Extensive experiments on fifteen short-term and six long-term forecasting benchmarks demonstrate that our approach consistently outperforms strong unimodal and cross-modal baselines, highlighting the effectiveness of multimodal alignment in enhancing time series forecasting. Code is available at: this https URL.