2025-09-22

Title: Pre-Forgettable Models: Prompt Learning as a Native Mechanism for Unlearning

Authors: Rutger Hendrix, Giovanni Patanè, Leonardo G. Russo, Simone Carnemolla, Giovanni Bellitto, Federica Proietto Salanitri, Concetto Spampinato, Matteo Pennisi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15230
Pdf URL: https://arxiv.org/pdf/2509.15230
Copy Paste: [[2509.15230]] Pre-Forgettable Models: Prompt Learning as a Native Mechanism for Unlearning(https://arxiv.org/abs/2509.15230)
Keywords: security, privacy, protect, attack, robust, extraction, membership infer
Abstract: Foundation models have transformed multimedia analysis by enabling robust and transferable representations across diverse modalities and tasks. However, their static deployment conflicts with growing societal and regulatory demands -- particularly the need to unlearn specific data upon request, as mandated by privacy frameworks such as the GDPR. Traditional unlearning approaches, including retraining, activation editing, or distillation, are often computationally expensive, fragile, and ill-suited for real-time or continuously evolving systems. In this paper, we propose a paradigm shift: rethinking unlearning not as a retroactive intervention but as a built-in capability. We introduce a prompt-based learning framework that unifies knowledge acquisition and removal within a single training phase. Rather than encoding information in model weights, our approach binds class-level semantics to dedicated prompt tokens. This design enables instant unlearning simply by removing the corresponding prompt -- without retraining, model modification, or access to original data. Experiments demonstrate that our framework preserves predictive performance on retained classes while effectively erasing forgotten ones. Beyond utility, our method exhibits strong privacy and security guarantees: it is resistant to membership inference attacks, and prompt removal prevents any residual knowledge extraction, even under adversarial conditions. This ensures compliance with data protection principles and safeguards against unauthorized access to forgotten information, making the framework suitable for deployment in sensitive and regulated environments. Overall, by embedding removability into the architecture itself, this work establishes a new foundation for designing modular, scalable and ethically responsive AI models.

Title: Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays

Authors: Hanbin Ko, Gihun Cho, Inhyeok Baek, Donguk Kim, Joonbeom Koo, Changi Kim, Dongheon Lee, Chang Min Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15234
Pdf URL: https://arxiv.org/pdf/2509.15234
Copy Paste: [[2509.15234]] Exploring the Capabilities of LLM Encoders for Image-Text Retrieval in Chest X-rays(https://arxiv.org/abs/2509.15234)
Keywords: robust, large language model
Abstract: Vision-language pretraining has advanced image-text alignment, yet progress in radiology remains constrained by the heterogeneity of clinical reports, including abbreviations, impression-only notes, and stylistic variability. Unlike general-domain settings where more data often leads to better performance, naively scaling to large collections of noisy reports can plateau or even degrade model learning. We ask whether large language model (LLM) encoders can provide robust clinical representations that transfer across diverse styles and better guide image-text alignment. We introduce LLM2VEC4CXR, a domain-adapted LLM encoder for chest X-ray reports, and LLM2CLIP4CXR, a dual-tower framework that couples this encoder with a vision backbone. LLM2VEC4CXR improves clinical text understanding over BERT-based baselines, handles abbreviations and style variation, and achieves strong clinical alignment on report-level metrics. LLM2CLIP4CXR leverages these embeddings to boost retrieval accuracy and clinically oriented scores, with stronger cross-dataset generalization than prior medical CLIP variants. Trained on 1.6M CXR studies from public and private sources with heterogeneous and noisy reports, our models demonstrate that robustness -- not scale alone -- is the key to effective multimodal learning. We release models to support further research in medical image-text representation learning.

Title: ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.15235
Pdf URL: https://arxiv.org/pdf/2509.15235
Copy Paste: [[2509.15235]] ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding(https://arxiv.org/abs/2509.15235)
Keywords: large language model
Abstract: Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding.

Title: M-PACE: Mother Child Framework for Multimodal Compliance

Authors: Shreyash Verma, Amit Kesari, Vinayak Trivedi, Anupam Purwar, Ratnesh Jamidar
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2509.15241
Pdf URL: https://arxiv.org/pdf/2509.15241
Copy Paste: [[2509.15241]] M-PACE: Mother Child Framework for Multimodal Compliance(https://arxiv.org/abs/2509.15241)
Keywords: extraction, large language model
Abstract: Ensuring that multi-modal content adheres to brand, legal, or platform-specific compliance standards is an increasingly complex challenge across domains. Traditional compliance frameworks typically rely on disjointed, multi-stage pipelines that integrate separate modules for image classification, text extraction, audio transcription, hand-crafted checks, and rule-based merges. This architectural fragmentation increases operational overhead, hampers scalability, and hinders the ability to adapt to dynamic guidelines efficiently. With the emergence of Multimodal Large Language Models (MLLMs), there is growing potential to unify these workflows under a single, general-purpose framework capable of jointly processing visual and textual content. In light of this, we propose Multimodal Parameter Agnostic Compliance Engine (M-PACE), a framework designed for assessing attributes across vision-language inputs in a single pass. As a representative use case, we apply M-PACE to advertisement compliance, demonstrating its ability to evaluate over 15 compliance-related attributes. To support structured evaluation, we introduce a human-annotated benchmark enriched with augmented samples that simulate challenging real-world conditions, including visual obstructions and profanity injection. M-PACE employs a mother-child MLLM setup, demonstrating that a stronger parent MLLM evaluating the outputs of smaller child models can significantly reduce dependence on human reviewers, thereby automating quality control. Our analysis reveals that inference costs reduce by over 31 times, with the most efficient models (Gemini 2.0 Flash as child MLLM selected by mother MLLM) operating at 0.0005 per image, compared to 0.0159 for Gemini 2.5 Pro with comparable accuracy, highlighting the trade-off between cost and output quality achieved in real time by M-PACE in real life deployment over advertising data.

Title: ProFusion: 3D Reconstruction of Protein Complex Structures from Multi-view AFM Images

Authors: Jaydeep Rade, Md Hasibul Hasan Hasib, Meric Ozturk, Baboucarr Faal, Sheng Yang, Dipali G. Sashital, Vincenzo Venditti, Baoyu Chen, Soumik Sarkar, Adarsh Krishnamurthy, Anwesha Sarkar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15242
Pdf URL: https://arxiv.org/pdf/2509.15242
Copy Paste: [[2509.15242]] ProFusion: 3D Reconstruction of Protein Complex Structures from Multi-view AFM Images(https://arxiv.org/abs/2509.15242)
Keywords: diffusion
Abstract: AI-based in silico methods have improved protein structure prediction but often struggle with large protein complexes (PCs) involving multiple interacting proteins due to missing 3D spatial cues. Experimental techniques like Cryo-EM are accurate but costly and time-consuming. We present ProFusion, a hybrid framework that integrates a deep learning model with Atomic Force Microscopy (AFM), which provides high-resolution height maps from random orientations, naturally yielding multi-view data for 3D reconstruction. However, generating a large-scale AFM imaging data set sufficient to train deep learning models is impractical. Therefore, we developed a virtual AFM framework that simulates the imaging process and generated a dataset of ~542,000 proteins with multi-view synthetic AFM images. We train a conditional diffusion model to synthesize novel views from unposed inputs and an instance-specific Neural Radiance Field (NeRF) model to reconstruct 3D structures. Our reconstructed 3D protein structures achieve an average Chamfer Distance within the AFM imaging resolution, reflecting high structural fidelity. Our method is extensively validated on experimental AFM images of various PCs, demonstrating strong potential for accurate, cost-effective protein complex structure prediction and rapid iterative validation using AFM experiments.

Title: Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models

Authors: Muhammad Imran, Yugyung Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15243
Pdf URL: https://arxiv.org/pdf/2509.15243
Copy Paste: [[2509.15243]] Multi-Modal Interpretability for Enhanced Localization in Vision-Language Models(https://arxiv.org/abs/2509.15243)
Keywords: interpretability, transformer
Abstract: Recent advances in vision-language models have significantly expanded the frontiers of automated image analysis. However, applying these models in safety-critical contexts remains challenging due to the complex relationships between objects, subtle visual cues, and the heightened demand for transparency and reliability. This paper presents the Multi-Modal Explainable Learning (MMEL) framework, designed to enhance the interpretability of vision-language models while maintaining high performance. Building upon prior work in gradient-based explanations for transformer architectures (Grad-eclip), MMEL introduces a novel Hierarchical Semantic Relationship Module that enhances model interpretability through multi-scale feature processing, adaptive attention weighting, and cross-modal alignment. Our approach processes features at multiple semantic levels to capture relationships between image regions at different granularities, applying learnable layer-specific weights to balance contributions across the model's depth. This results in more comprehensive visual explanations that highlight both primary objects and their contextual relationships with improved precision. Through extensive experiments on standard datasets, we demonstrate that by incorporating semantic relationship information into gradient-based attribution maps, MMEL produces more focused and contextually aware visualizations that better reflect how vision-language models process complex scenes. The MMEL framework generalizes across various domains, offering valuable insights into model decisions for applications requiring high interpretability and reliability.

Title: Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning

Authors: Wenda Qin, Andrea Burns, Bryan A. Plummer, Margrit Betke
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15250
Pdf URL: https://arxiv.org/pdf/2509.15250
Copy Paste: [[2509.15250]] Walk and Read Less: Improving the Efficiency of Vision-and-Language Navigation via Tuning-Free Multimodal Token Pruning(https://arxiv.org/abs/2509.15250)
Keywords: large language model
Abstract: Large models achieve strong performance on Vision-and-Language Navigation (VLN) tasks, but are costly to run in resource-limited environments. Token pruning offers appealing tradeoffs for efficiency with minimal performance loss by reducing model input size, but prior work overlooks VLN-specific challenges. For example, information loss from pruning can effectively increase computational cost due to longer walks. Thus, the inability to identify uninformative tokens undermines the supposed efficiency gains from pruning. To address this, we propose Navigation-Aware Pruning (NAP), which uses navigation-specific traits to simplify the pruning process by pre-filtering tokens into foreground and background. For example, image views are filtered based on whether the agent can navigate in that direction. We also extract navigation-relevant instructions using a Large Language Model. After filtering, we focus pruning on background tokens, minimizing information loss. To further help avoid increases in navigation length, we discourage backtracking by removing low-importance navigation nodes. Experiments on standard VLN benchmarks show NAP significantly outperforms prior work, preserving higher success rates while saving more than 50% FLOPS.

Title: Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha

Authors: Tandin Wangchuk, Tad Gonsalves
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15255
Pdf URL: https://arxiv.org/pdf/2509.15255
Copy Paste: [[2509.15255]] Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha(https://arxiv.org/abs/2509.15255)
Keywords: large language model
Abstract: Large Language Models (LLMs) are gaining popularity and improving rapidly. Tokenizers are crucial components of natural language processing, especially for LLMs. Tokenizers break down input text into tokens that models can easily process while ensuring the text is accurately represented, capturing its meaning and structure. Effective tokenizers enhance the capabilities of LLMs by improving a model's understanding of context and semantics, ultimately leading to better performance in various downstream tasks, such as translation, classification, sentiment analysis, and text generation. Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages. Dzongkha, Bhutan's national language spoken by around seven hundred thousand people, is a low-resource language, and its linguistic complexity poses unique NLP challenges. Despite some progress, significant research in Dzongkha NLP is lacking, particularly in tokenization. This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods. Specifically, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram) were evaluated for their suitability for Dzongkha. Performance was assessed using metrics like Subword Fertility, Proportion of Continued Words, Normalized Sequence Length, and execution time. The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization, paving the way for further NLP advancements. This underscores the need for tailored approaches for low-resource languages and ongoing research. In this study, we presented three tokenization algorithms for Dzongkha, paving the way for building Dzongkha Large Language Models.

Title: RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation

Authors: Silpa Vadakkeeveetil Sreelatha, Sauradip Nag, Muhammad Awais, Serge Belongie, Anjan Dutta
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15257
Pdf URL: https://arxiv.org/pdf/2509.15257
Copy Paste: [[2509.15257]] RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation(https://arxiv.org/abs/2509.15257)
Keywords: fair, diffusion
Abstract: The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.

Title: Generative AI Meets Wireless Sensing: Towards Wireless Foundation Model

Authors: Zheng Yang, Guoxuan Chi, Chenshu Wu, Hanyu Liu, Yuchong Gao, Yunhao Liu, Jie Xu, Tony Xiao Han
Subjects: cs.LG, cs.AI, eess.SP
Abstract URL: https://arxiv.org/abs/2509.15258
Pdf URL: https://arxiv.org/pdf/2509.15258
Copy Paste: [[2509.15258]] Generative AI Meets Wireless Sensing: Towards Wireless Foundation Model(https://arxiv.org/abs/2509.15258)
Keywords: diffusion, generative
Abstract: Generative Artificial Intelligence (GenAI) has made significant advancements in fields such as computer vision (CV) and natural language processing (NLP), demonstrating its capability to synthesize high-fidelity data and improve generalization. Recently, there has been growing interest in integrating GenAI into wireless sensing systems. By leveraging generative techniques such as data augmentation, domain adaptation, and denoising, wireless sensing applications, including device localization, human activity recognition, and environmental monitoring, can be significantly improved. This survey investigates the convergence of GenAI and wireless sensing from two complementary perspectives. First, we explore how GenAI can be integrated into wireless sensing pipelines, focusing on two modes of integration: as a plugin to augment task-specific models and as a solver to directly address sensing tasks. Second, we analyze the characteristics of mainstream generative models, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, and discuss their applicability and unique advantages across various wireless sensing tasks. We further identify key challenges in applying GenAI to wireless sensing and outline a future direction toward a wireless foundation model: a unified, pre-trained design capable of scalable, adaptable, and efficient signal understanding across diverse sensing tasks.

Title: IEFS-GMB: Gradient Memory Bank-Guided Feature Selection Based on Information Entropy for EEG Classification of Neurological Disorders

Authors: Liang Zhang, Hanyang Dong, Jia-Hong Gao, Yi Sun, Kuntao Xiao, Wanli Yang, Zhao Lv, Shurong Sheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15259
Pdf URL: https://arxiv.org/pdf/2509.15259
Copy Paste: [[2509.15259]] IEFS-GMB: Gradient Memory Bank-Guided Feature Selection Based on Information Entropy for EEG Classification of Neurological Disorders(https://arxiv.org/abs/2509.15259)
Keywords: robust, interpretability
Abstract: Deep learning-based EEG classification is crucial for the automated detection of neurological disorders, improving diagnostic accuracy and enabling early intervention. However, the low signal-to-noise ratio of EEG signals limits model performance, making feature selection (FS) vital for optimizing representations learned by neural network encoders. Existing FS methods are seldom designed specifically for EEG diagnosis; many are architecture-dependent and lack interpretability, limiting their applicability. Moreover, most rely on single-iteration data, resulting in limited robustness to variability. To address these issues, we propose IEFS-GMB, an Information Entropy-based Feature Selection method guided by a Gradient Memory Bank. This approach constructs a dynamic memory bank storing historical gradients, computes feature importance via information entropy, and applies entropy-based weighting to select informative EEG features. Experiments on four public neurological disease datasets show that encoders enhanced with IEFS-GMB achieve accuracy improvements of 0.64% to 6.45% over baseline models. The method also outperforms four competing FS techniques and improves model interpretability, supporting its practical use in clinical settings.

Title: Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Authors: Yujia Hu, Ming Shan Hee, Preslav Nakov, Roy Ka-Wei Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15260
Pdf URL: https://arxiv.org/pdf/2509.15260
Copy Paste: [[2509.15260]] Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages(https://arxiv.org/abs/2509.15260)
Keywords: large language model
Abstract: The advancement of Large Language Models (LLMs) has transformed natural language processing; however, their safety mechanisms remain under-explored in low-resource, multilingual settings. Here, we aim to bridge this gap. In particular, we introduce \textsf{SGToxicGuard}, a novel dataset and evaluation framework for benchmarking LLM safety in Singapore's diverse linguistic context, including Singlish, Chinese, Malay, and Tamil. SGToxicGuard adopts a red-teaming approach to systematically probe LLM vulnerabilities in three real-world scenarios: \textit{conversation}, \textit{question-answering}, and \textit{content composition}. We conduct extensive experiments with state-of-the-art multilingual LLMs, and the results uncover critical gaps in their safety guardrails. By offering actionable insights into cultural sensitivity and toxicity mitigation, we lay the foundation for safer and more inclusive AI systems in linguistically diverse environments.\footnote{Link to the dataset: this https URL.} \textcolor{red}{Disclaimer: This paper contains sensitive content that may be disturbing to some readers.}

Title: A Weak Supervision Approach for Monitoring Recreational Drug Use Effects in Social Media

Authors: Lucía Prieto-Santamaría, Alba Cortés Iglesias, Claudio Vidal Giné, Fermín Fernández Calderón, Óscar M. Lozano, Alejandro Rodríguez-González
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15266
Pdf URL: https://arxiv.org/pdf/2509.15266
Copy Paste: [[2509.15266]] A Weak Supervision Approach for Monitoring Recreational Drug Use Effects in Social Media(https://arxiv.org/abs/2509.15266)
Keywords: extraction
Abstract: Understanding the real-world effects of recreational drug use remains a critical challenge in public health and biomedical research, especially as traditional surveillance systems often underrepresent user experiences. In this study, we leverage social media (specifically Twitter) as a rich and unfiltered source of user-reported effects associated with three emerging psychoactive substances: ecstasy, GHB, and 2C-B. By combining a curated list of slang terms with biomedical concept extraction via MetaMap, we identified and weakly annotated over 92,000 tweets mentioning these substances. Each tweet was labeled with a polarity reflecting whether it reported a positive or negative effect, following an expert-guided heuristic process. We then performed descriptive and comparative analyses of the reported phenotypic outcomes across substances and trained multiple machine learning classifiers to predict polarity from tweet content, accounting for strong class imbalance using techniques such as cost-sensitive learning and synthetic oversampling. The top performance on the test set was obtained from eXtreme Gradient Boosting with cost-sensitive learning (F1 = 0.885, AUPRC = 0.934). Our findings reveal that Twitter enables the detection of substance-specific phenotypic effects, and that polarity classification models can support real-time pharmacovigilance and drug effect characterization with high accuracy.

Title: Autoguided Online Data Curation for Diffusion Model Training

Authors: Valeria Pais, Luis Oala, Daniele Faccio, Marco Aversa
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15267
Pdf URL: https://arxiv.org/pdf/2509.15267
Copy Paste: [[2509.15267]] Autoguided Online Data Curation for Diffusion Model Training(https://arxiv.org/abs/2509.15267)
Keywords: robust, diffusion, generative
Abstract: The costs of generative model compute rekindled promises and hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation on a controlled 2-D synthetic data generation task as well as (3x64x64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.

Title: Modeling Transformers as complex networks to analyze learning dynamics

Authors: Elisabetta Rocchetti
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15269
Pdf URL: https://arxiv.org/pdf/2509.15269
Copy Paste: [[2509.15269]] Modeling Transformers as complex networks to analyze learning dynamics(https://arxiv.org/abs/2509.15269)
Keywords: interpretability, transformer, large language model
Abstract: The process by which Large Language Models (LLMs) acquire complex capabilities during training remains a key open question in mechanistic interpretability. This project investigates whether these learning dynamics can be characterized through the lens of Complex Network Theory (CNT). I introduce a novel methodology to represent a Transformer-based LLM as a directed, weighted graph where nodes are the model's computational components (attention heads and MLPs) and edges represent causal influence, measured via an intervention-based ablation technique. By tracking the evolution of this component-graph across 143 training checkpoints of the Pythia-14M model on a canonical induction task, I analyze a suite of graph-theoretic metrics. The results reveal that the network's structure evolves through distinct phases of exploration, consolidation, and refinement. Specifically, I identify the emergence of a stable hierarchy of information spreader components and a dynamic set of information gatherer components, whose roles reconfigure at key learning junctures. This work demonstrates that a component-level network perspective offers a powerful macroscopic lens for visualizing and understanding the self-organizing principles that drive the formation of functional circuits in LLMs.

Title: PRISM: Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images

Authors: Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15270
Pdf URL: https://arxiv.org/pdf/2509.15270
Copy Paste: [[2509.15270]] PRISM: Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images(https://arxiv.org/abs/2509.15270)
Keywords: diffusion, generative
Abstract: A critical need has emerged for generative AI: attribution methods. That is, solutions that can identify the model originating AI-generated content. This feature, generally relevant in multimodal applications, is especially sensitive in commercial settings where users subscribe to paid proprietary services and expect guarantees about the source of the content they receive. To address these issues, we introduce PRISM, a scalable Phase-enhanced Radial-based Image Signature Mapping framework for fingerprinting AI-generated images. PRISM is based on a radial reduction of the discrete Fourier transform that leverages amplitude and phase information to capture model-specific signatures. The output of the above process is subsequently clustered via linear discriminant analysis to achieve reliable model attribution in diverse settings, even if the model's internal details are inaccessible. To support our work, we construct PRISM-36K, a novel dataset of 36,000 images generated by six text-to-image GAN- and diffusion-based models. On this dataset, PRISM achieves an attribution accuracy of 92.04%. We additionally evaluate our method on four benchmarks from the literature, reaching an average accuracy of 81.60%. Finally, we evaluate our methodology also in the binary task of detecting real vs fake images, achieving an average accuracy of 88.41%. We obtain our best result on GenImage with an accuracy of 95.06%, whereas the original benchmark achieved 82.20%. Our results demonstrate the effectiveness of frequency-domain fingerprinting for cross-architecture and cross-dataset model attribution, offering a viable solution for enforcing accountability and trust in generative AI systems.

Title: Large Vision Models Can Solve Mental Rotation Problems

Authors: Sebastian Ray Mason, Anders Gjølbye, Phillip Chavarria Højbjerg, Lenka Tětková, Lars Kai Hansen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15271
Pdf URL: https://arxiv.org/pdf/2509.15271
Copy Paste: [[2509.15271]] Large Vision Models Can Solve Mental Rotation Problems(https://arxiv.org/abs/2509.15271)
Keywords: transformer
Abstract: Mental rotation is a key test of spatial reasoning in humans and has been central to understanding how perception supports cognition. Despite the success of modern vision transformers, it is still unclear how well these models develop similar abilities. In this work, we present a systematic evaluation of ViT, CLIP, DINOv2, and DINOv3 across a range of mental-rotation tasks, from simple block structures similar to those used by Shepard and Metzler to study human cognition, to more complex block figures, three types of text, and photo-realistic objects. By probing model representations layer by layer, we examine where and how these networks succeed. We find that i) self-supervised ViTs capture geometric structure better than supervised ViTs; ii) intermediate layers perform better than final layers; iii) task difficulty increases with rotation complexity and occlusion, mirroring human reaction times and suggesting similar constraints in embedding space representations.

Title: Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks

Authors: Yannis Kaltampanidis, Alexandros Doumanoglou, Dimitrios Zarpalas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15272
Pdf URL: https://arxiv.org/pdf/2509.15272
Copy Paste: [[2509.15272]] Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks(https://arxiv.org/abs/2509.15272)
Keywords: transformer, segmentation
Abstract: Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in standard and few-shot downstream contexts. Two pre-training objectives dominate the landscape of SSL techniques: Contrastive Learning and Masked Image Modeling. Features (or tokens) extracted from the final transformer attention block -- specifically, the keys, queries, and values -- as well as features obtained after the final block's feed-forward layer, have become a common foundation for addressing downstream tasks. However, in many existing approaches, these pre-trained ViT features are further processed through additional transformation layers, often involving lightweight heads or combined with distillation, to achieve superior task performance. Although such methods can improve task outcomes, to the best of our knowledge, a comprehensive analysis of the intrinsic representation capabilities of unaltered ViT features has yet to be conducted. This study aims to bridge this gap by systematically evaluating the use of these unmodified features across image classification and segmentation tasks, in both standard and few-shot contexts. The classification and segmentation rules that we use are either hyperplane based (as in logistic regression) or cosine-similarity based, both of which rely on the presence of interpretable directions in the ViT's latent space. Based on the previous rules and without the use of additional feature transformations, we conduct an analysis across token types, tasks, and pre-trained ViT models. This study provides insights into the optimal choice for token type and decision rule based on the task, context, and the pre-training objective, while reporting detailed findings on two widely-used datasets.

Title: Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning

Authors: Chi Liu, Derek Li, Yan Shu, Robin Chen, Derek Duan, Teng Fang, Bryan Dai
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2509.15279
Pdf URL: https://arxiv.org/pdf/2509.15279
Copy Paste: [[2509.15279]] Fleming-R1: Toward Expert-Level Medical Reasoning via Reinforcement Learning(https://arxiv.org/abs/2509.15279)
Keywords: robust, large language model
Abstract: While large language models show promise in medical applications, achieving expert-level clinical reasoning remains challenging due to the need for both accurate answers and transparent reasoning processes. To address this challenge, we introduce Fleming-R1, a model designed for verifiable medical reasoning through three complementary innovations. First, our Reasoning-Oriented Data Strategy (RODS) combines curated medical QA datasets with knowledge-graph-guided synthesis to improve coverage of underrepresented diseases, drugs, and multi-hop reasoning chains. Second, we employ Chain-of-Thought (CoT) cold start to distill high-quality reasoning trajectories from teacher models, establishing robust inference priors. Third, we implement a two-stage Reinforcement Learning from Verifiable Rewards (RLVR) framework using Group Relative Policy Optimization, which consolidates core reasoning skills while targeting persistent failure modes through adaptive hard-sample mining. Across diverse medical benchmarks, Fleming-R1 delivers substantial parameter-efficient improvements: the 7B variant surpasses much larger baselines, while the 32B model achieves near-parity with GPT-4o and consistently outperforms strong open-source alternatives. These results demonstrate that structured data design, reasoning-oriented initialization, and verifiable reinforcement learning can advance clinical reasoning beyond simple accuracy optimization. We release Fleming-R1 publicly to promote transparent, reproducible, and auditable progress in medical AI, enabling safer deployment in high-stakes clinical environments.

Title: Kuramoto Orientation Diffusion Models

Authors: Yue Song, T. Anderson Keller, Sevan Brodjian, Takeru Miyato, Yisong Yue, Pietro Perona, Max Welling
Subjects: cs.LG, cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2509.15328
Pdf URL: https://arxiv.org/pdf/2509.15328
Copy Paste: [[2509.15328]] Kuramoto Orientation Diffusion Models(https://arxiv.org/abs/2509.15328)
Keywords: diffusion, generative
Abstract: Orientation-rich images, such as fingerprints and textures, often exhibit coherent angular directional patterns that are challenging to model using standard generative approaches based on isotropic Euclidean diffusion. Motivated by the role of phase synchronization in biological systems, we propose a score-based generative model built on periodic domains by leveraging stochastic Kuramoto dynamics in the diffusion process. In neural and physical systems, Kuramoto models capture synchronization phenomena across coupled oscillators -- a behavior that we re-purpose here as an inductive bias for structured image generation. In our framework, the forward process performs \textit{synchronization} among phase variables through globally or locally coupled oscillator interactions and attraction to a global reference phase, gradually collapsing the data into a low-entropy von Mises distribution. The reverse process then performs \textit{desynchronization}, generating diverse patterns by reversing the dynamics with a learned score function. This approach enables structured destruction during forward diffusion and a hierarchical generation process that progressively refines global coherence into fine-scale details. We implement wrapped Gaussian transition kernels and periodicity-aware networks to account for the circular geometry. Our method achieves competitive results on general image benchmarks and significantly improves generation quality on orientation-dense datasets like fingerprints and textures. Ultimately, this work demonstrates the promise of biologically inspired synchronization dynamics as structured priors in generative modeling.

Title: CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization

Authors: Min Zhang, Bo Jiang, Jie Zhou, Yimeng Liu, Xin Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15330
Pdf URL: https://arxiv.org/pdf/2509.15330
Copy Paste: [[2509.15330]] CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization(https://arxiv.org/abs/2509.15330)
Keywords: robust
Abstract: Recent advances in pre-training vision-language models (VLMs), e.g., contrastive language-image pre-training (CLIP) methods, have shown great potential in learning out-of-distribution (OOD) representations. Despite showing competitive performance, the prompt-based CLIP methods still suffer from: i) inaccurate text descriptions, which leads to degraded accuracy and robustness, and poses a challenge for zero-shot CLIP methods. ii) limited vision-language embedding alignment, which significantly affects the generalization performance. To tackle the above issues, this paper proposes a novel Conditional Domain prompt Learning (CoDoL) method, which utilizes readily-available domain information to form prompts and improves the vision-language embedding alignment for improving OOD generalization. To capture both instance-specific and domain-specific information, we further propose a lightweight Domain Meta Network (DMN) to generate input-conditional tokens for images in each domain. Extensive experiments on four OOD benchmarks (PACS, VLCS, OfficeHome and DigitDG) validate the effectiveness of our proposed CoDoL in terms of improving the vision-language embedding alignment as well as the out-of-distribution generalization performance.

Title: Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception

Authors: Yulin Wang, Yang Yue, Yang Yue, Huanqian Wang, Haojun Jiang, Yizeng Han, Zanlin Ni, Yifan Pu, Minglei Shi, Rui Lu, Qisen Yang, Andrew Zhao, Zhuofan Xia, Shiji Song, Gao Huang
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2509.15333
Pdf URL: https://arxiv.org/pdf/2509.15333
Copy Paste: [[2509.15333]] Emulating Human-like Adaptive Vision for Efficient and Flexible Machine Visual Perception(https://arxiv.org/abs/2509.15333)
Keywords: interpretability
Abstract: Human vision is highly adaptive, efficiently sampling intricate environments by sequentially fixating on task-relevant regions. In contrast, prevailing machine vision models passively process entire scenes at once, resulting in excessive resource demands scaling with spatial-temporal input resolution and model size, yielding critical limitations impeding both future advancements and real-world application. Here we introduce AdaptiveNN, a general framework aiming to drive a paradigm shift from 'passive' to 'active, adaptive' vision models. AdaptiveNN formulates visual perception as a coarse-to-fine sequential decision-making process, progressively identifying and attending to regions pertinent to the task, incrementally combining information across fixations, and actively concluding observation when sufficient. We establish a theory integrating representation learning with self-rewarding reinforcement learning, enabling end-to-end training of the non-differentiable AdaptiveNN without additional supervision on fixation locations. We assess AdaptiveNN on 17 benchmarks spanning 9 tasks, including large-scale visual recognition, fine-grained discrimination, visual search, processing images from real driving and medical scenarios, language-driven embodied AI, and side-by-side comparisons with humans. AdaptiveNN achieves up to 28x inference cost reduction without sacrificing accuracy, flexibly adapts to varying task demands and resource budgets without retraining, and provides enhanced interpretability via its fixation patterns, demonstrating a promising avenue toward efficient, flexible, and interpretable computer vision. Furthermore, AdaptiveNN exhibits closely human-like perceptual behaviors in many cases, revealing its potential as a valuable tool for investigating visual cognition. Code is available at this https URL.

Title: PolBiX: Detecting LLMs' Political Bias in Fact-Checking through X-phemisms

Authors: Charlott Jakob, David Harbecke, Patrick Parschan, Pia Wenzel Neves, Vera Schmitt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15335
Pdf URL: https://arxiv.org/pdf/2509.15335
Copy Paste: [[2509.15335]] PolBiX: Detecting LLMs' Political Bias in Fact-Checking through X-phemisms(https://arxiv.org/abs/2509.15335)
Keywords: large language model
Abstract: Large Language Models are increasingly used in applications requiring objective assessment, which could be compromised by political bias. Many studies found preferences for left-leaning positions in LLMs, but downstream effects on tasks like fact-checking remain underexplored. In this study, we systematically investigate political bias through exchanging words with euphemisms or dysphemisms in German claims. We construct minimal pairs of factually equivalent claims that differ in political connotation, to assess the consistency of LLMs in classifying them as true or false. We evaluate six LLMs and find that, more than political leaning, the presence of judgmental words significantly influences truthfulness assessment. While a few models show tendencies of political bias, this is not mitigated by explicitly calling for objectivism in prompts.

Title: Quantifying Self-Awareness of Knowledge in Large Language Models

Authors: Yeongbin Seo, Dongha Lee, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15339
Pdf URL: https://arxiv.org/pdf/2509.15339
Copy Paste: [[2509.15339]] Quantifying Self-Awareness of Knowledge in Large Language Models(https://arxiv.org/abs/2509.15339)
Keywords: large language model
Abstract: Hallucination prediction in large language models (LLMs) is often interpreted as a sign of self-awareness. However, we argue that such performance can arise from question-side shortcuts rather than true model-side introspection. To disentangle these factors, we propose the Approximate Question-side Effect (AQE), which quantifies the contribution of question-awareness. Our analysis across multiple datasets reveals that much of the reported success stems from exploiting superficial patterns in questions. We further introduce SCAO (Semantic Compression by Answering in One word), a method that enhances the use of model-side signals. Experiments show that SCAO achieves strong and consistent performance, particularly in settings with reduced question-side cues, highlighting its effectiveness in fostering genuine self-awareness in LLMs.

Title: LowDiff: Efficient Diffusion Sampling with Low-Resolution Condition

Authors: Jiuyi Xu, Qing Jin, Meida Chen, Andrew Feng, Yang Sui, Yangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15342
Pdf URL: https://arxiv.org/pdf/2509.15342
Copy Paste: [[2509.15342]] LowDiff: Efficient Diffusion Sampling with Low-Resolution Condition(https://arxiv.org/abs/2509.15342)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in image generation but their practical application is often hindered by the slow sampling speed. Prior efforts of improving efficiency primarily focus on compressing models or reducing the total number of denoising steps, largely neglecting the possibility to leverage multiple input resolutions in the generation process. In this work, we propose LowDiff, a novel and efficient diffusion framework based on a cascaded approach by generating increasingly higher resolution outputs. Besides, LowDiff employs a unified model to progressively refine images from low resolution to the desired resolution. With the proposed architecture design and generation techniques, we achieve comparable or even superior performance with much fewer high-resolution sampling steps. LowDiff is applicable to diffusion models in both pixel space and latent space. Extensive experiments on both conditional and unconditional generation tasks across CIFAR-10, FFHQ and ImageNet demonstrate the effectiveness and generality of our method. Results show over 50% throughput improvement across all datasets and settings while maintaining comparable or better quality. On unconditional CIFAR-10, LowDiff achieves an FID of 2.11 and IS of 9.87, while on conditional CIFAR-10, an FID of 1.94 and IS of 10.03. On FFHQ 64x64, LowDiff achieves an FID of 2.43, and on ImageNet 256x256, LowDiff built on LightningDiT-B/1 produces high-quality samples with a FID of 4.00 and an IS of 195.06, together with substantial efficiency gains.

Title: Real, Fake, or Manipulated? Detecting Machine-Influenced Text

Authors: Yitong Wang, Zhongping Zhang, Margherita Piana, Zheng Zhou, Peter Gerstoft, Bryan A. Plummer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15350
Pdf URL: https://arxiv.org/pdf/2509.15350
Copy Paste: [[2509.15350]] Real, Fake, or Manipulated? Detecting Machine-Influenced Text(https://arxiv.org/abs/2509.15350)
Keywords: robust, large language model
Abstract: Large Language Model (LLMs) can be used to write or modify documents, presenting a challenge for understanding the intent behind their use. For example, benign uses may involve using LLM on a human-written document to improve its grammar or to translate it into another language. However, a document entirely produced by a LLM may be more likely to be used to spread misinformation than simple translation (\eg, from use by malicious actors or simply by hallucinating). Prior works in Machine Generated Text (MGT) detection mostly focus on simply identifying whether a document was human or machine written, ignoring these fine-grained uses. In this paper, we introduce a HiErarchical, length-RObust machine-influenced text detector (HERO), which learns to separate text samples of varying lengths from four primary types: human-written, machine-generated, machine-polished, and machine-translated. HERO accomplishes this by combining predictions from length-specialist models that have been trained with Subcategory Guidance. Specifically, for categories that are easily confused (\eg, different source languages), our Subcategory Guidance module encourages separation of the fine-grained categories, boosting performance. Extensive experiments across five LLMs and six domains demonstrate the benefits of our HERO, outperforming the state-of-the-art by 2.5-3 mAP on average.

Title: Predicting Language Models' Success at Zero-Shot Probabilistic Prediction

Authors: Kevin Ren, Santiago Cortes-Gomez, Carlos Miguel Patiño, Ananya Joshi, Ruiqi Lyu, Jingjing Tang, Alistair Turcan, Khurram Yamin, Steven Wu, Bryan Wilder
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15356
Pdf URL: https://arxiv.org/pdf/2509.15356
Copy Paste: [[2509.15356]] Predicting Language Models' Success at Zero-Shot Probabilistic Prediction(https://arxiv.org/abs/2509.15356)
Keywords: large language model
Abstract: Recent work has investigated the capabilities of large language models (LLMs) as zero-shot models for generating individual-level characteristics (e.g., to serve as risk models or augment survey datasets). However, when should a user have confidence that an LLM will provide high-quality predictions for their particular task? To address this question, we conduct a large-scale empirical study of LLMs' zero-shot predictive capabilities across a wide range of tabular prediction tasks. We find that LLMs' performance is highly variable, both on tasks within the same dataset and across different datasets. However, when the LLM performs well on the base prediction task, its predicted probabilities become a stronger signal for individual-level accuracy. Then, we construct metrics to predict LLMs' performance at the task level, aiming to distinguish between tasks where LLMs may perform well and where they are likely unsuitable. We find that some of these metrics, each of which are assessed without labeled data, yield strong signals of LLMs' predictive performance on new tasks.

Title: MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation

Authors: Yu Chang, Jiahao Chen, Anzhe Cheng, Paul Bogdan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15357
Pdf URL: https://arxiv.org/pdf/2509.15357
Copy Paste: [[2509.15357]] MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation(https://arxiv.org/abs/2509.15357)
Keywords: diffusion
Abstract: Text-to-image diffusion models achieve impressive realism but often suffer from compositional failures on prompts with multiple objects, attributes, and spatial relations, resulting in cross-token interference where entities entangle, attributes mix across objects, and spatial cues are violated. To address these failures, we propose MaskAttn-SDXL,a region-level gating mechanism applied to the cross-attention logits of Stable Diffusion XL(SDXL)'s UNet. MaskAttn-SDXL learns a binary mask per layer, injecting it into each cross-attention logit map before softmax to sparsify token-to-latent interactions so that only semantically relevant connections remain active. The method requires no positional encodings, auxiliary tokens, or external region masks, and preserves the original inference path with negligible overhead. In practice, our model improves spatial compliance and attribute binding in multi-object prompts while preserving overall image quality and diversity. These findings demonstrate that logit-level maksed cross-attention is an data-efficient primitve for enforcing compositional control, and our method thus serves as a practical extension for spatial control in text-to-image generation.

Title: Beyond Spurious Signals: Debiasing Multimodal Large Language Models via Counterfactual Inference and Adaptive Expert Routing

Authors: Zichen Wu, Hsiu-Yuan Huang, Yunfang Wu
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2509.15361
Pdf URL: https://arxiv.org/pdf/2509.15361
Copy Paste: [[2509.15361]] Beyond Spurious Signals: Debiasing Multimodal Large Language Models via Counterfactual Inference and Adaptive Expert Routing(https://arxiv.org/abs/2509.15361)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown substantial capabilities in integrating visual and textual information, yet frequently rely on spurious correlations, undermining their robustness and generalization in complex multimodal reasoning tasks. This paper addresses the critical challenge of superficial correlation bias in MLLMs through a novel causal mediation-based debiasing framework. Specially, we distinguishing core semantics from spurious textual and visual contexts via counterfactual examples to activate training-stage debiasing and employ a Mixture-of-Experts (MoE) architecture with dynamic routing to selectively engages modality-specific debiasing experts. Empirical evaluation on multimodal sarcasm detection and sentiment analysis tasks demonstrates that our framework significantly surpasses unimodal debiasing strategies and existing state-of-the-art models.

Title: Stochastic Sample Approximations of (Local) Moduli of Continuity

Authors: Rodion Nazarov, Allen Gehret, Robert Shorten, Jakub Marecek
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15368
Pdf URL: https://arxiv.org/pdf/2509.15368
Copy Paste: [[2509.15368]] Stochastic Sample Approximations of (Local) Moduli of Continuity(https://arxiv.org/abs/2509.15368)
Keywords: robust, fair
Abstract: Modulus of local continuity is used to evaluate the robustness of neural networks and fairness of their repeated uses in closed-loop models. Here, we revisit a connection between generalized derivatives and moduli of local continuity, and present a non-uniform stochastic sample approximation for moduli of local continuity. This is of importance in studying robustness of neural networks and fairness of their repeated uses.

Title: Adversarial generalization of unfolding (model-based) networks

Authors: Vicky Kouni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15370
Pdf URL: https://arxiv.org/pdf/2509.15370
Copy Paste: [[2509.15370]] Adversarial generalization of unfolding (model-based) networks(https://arxiv.org/abs/2509.15370)
Keywords: attack, robust
Abstract: Unfolding networks are interpretable networks emerging from iterative algorithms, incorporate prior knowledge of data structure, and are designed to solve inverse problems like compressed sensing, which deals with recovering data from noisy, missing observations. Compressed sensing finds applications in critical domains, from medical imaging to cryptography, where adversarial robustness is crucial to prevent catastrophic failures. However, a solid theoretical understanding of the performance of unfolding networks in the presence of adversarial attacks is still in its infancy. In this paper, we study the adversarial generalization of unfolding networks when perturbed with $l_2$-norm constrained attacks, generated by the fast gradient sign method. Particularly, we choose a family of state-of-the-art overaparameterized unfolding networks and deploy a new framework to estimate their adversarial Rademacher complexity. Given this estimate, we provide adversarial generalization error bounds for the networks under study, which are tight with respect to the attack level. To our knowledge, this is the first theoretical analysis on the adversarial generalization of unfolding networks. We further present a series of experiments on real-world data, with results corroborating our derived theory, consistently for all data. Finally, we observe that the family's overparameterization can be exploited to promote adversarial robustness, shedding light on how to efficiently robustify neural networks.

Title: RaceGAN: A Framework for Preserving Individuality while Converting Racial Information for Image-to-Image Translation

Authors: Mst Tasnim Pervin, George Bebis, Fang Jiang, Alireza Tavakkoli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15391
Pdf URL: https://arxiv.org/pdf/2509.15391
Copy Paste: [[2509.15391]] RaceGAN: A Framework for Preserving Individuality while Converting Racial Information for Image-to-Image Translation(https://arxiv.org/abs/2509.15391)
Keywords: generative
Abstract: Generative adversarial networks (GANs) have demonstrated significant progress in unpaired image-to-image translation in recent years for several applications. CycleGAN was the first to lead the way, although it was restricted to a pair of domains. StarGAN overcame this constraint by tackling image-to-image translation across various domains, although it was not able to map in-depth low-level style changes for these domains. Style mapping via reference-guided image synthesis has been made possible by the innovations of StarGANv2 and StyleGAN. However, these models do not maintain individuality and need an extra reference image in addition to the input. Our study aims to translate racial traits by means of multi-domain image-to-image translation. We present RaceGAN, a novel framework capable of mapping style codes over several domains during racial attribute translation while maintaining individuality and high level semantics without relying on a reference image. RaceGAN outperforms other models in translating racial features (i.e., Asian, White, and Black) when tested on Chicago Face Dataset. We also give quantitative findings utilizing InceptionReNetv2-based classification to demonstrate the effectiveness of our racial translation. Moreover, we investigate how well the model partitions the latent space into distinct clusters of faces for each ethnic group.

Title: VMDNet: Time Series Forecasting with Leakage-Free Samplewise Variational Mode Decomposition and Multibranch Decoding

Authors: Weibin Feng, Ran Tao, John Cartlidge, Jin Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15394
Pdf URL: https://arxiv.org/pdf/2509.15394
Copy Paste: [[2509.15394]] VMDNet: Time Series Forecasting with Leakage-Free Samplewise Variational Mode Decomposition and Multibranch Decoding(https://arxiv.org/abs/2509.15394)
Keywords: robust
Abstract: In time series forecasting, capturing recurrent temporal patterns is essential; decomposition techniques make such structure explicit and thereby improve predictive performance. Variational Mode Decomposition (VMD) is a powerful signal-processing method for periodicity-aware decomposition and has seen growing adoption in recent years. However, existing studies often suffer from information leakage and rely on inappropriate hyperparameter tuning. To address these issues, we propose VMDNet, a causality-preserving framework that (i) applies sample-wise VMD to avoid leakage; (ii) represents each decomposed mode with frequency-aware embeddings and decodes it using parallel temporal convolutional networks (TCNs), ensuring mode independence and efficient learning; and (iii) introduces a bilevel, Stackelberg-inspired optimisation to adaptively select VMD's two core hyperparameters: the number of modes (K) and the bandwidth penalty (alpha). Experiments on two energy-related datasets demonstrate that VMDNet achieves state-of-the-art results when periodicity is strong, showing clear advantages in capturing structured periodic patterns while remaining robust under weak periodicity.

Title: Quantifying Uncertainty in Natural Language Explanations of Large Language Models for Question Answering

Authors: Yangyi Li, Mengdi Huai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15403
Pdf URL: https://arxiv.org/pdf/2509.15403
Copy Paste: [[2509.15403]] Quantifying Uncertainty in Natural Language Explanations of Large Language Models for Question Answering(https://arxiv.org/abs/2509.15403)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have shown strong capabilities, enabling concise, context-aware answers in question answering (QA) tasks. The lack of transparency in complex LLMs has inspired extensive research aimed at developing methods to explain large language behaviors. Among existing explanation methods, natural language explanations stand out due to their ability to explain LLMs in a self-explanatory manner and enable the understanding of model behaviors even when the models are closed-source. However, despite these promising advancements, there is no existing work studying how to provide valid uncertainty guarantees for these generated natural language explanations. Such uncertainty quantification is critical in understanding the confidence behind these explanations. Notably, generating valid uncertainty estimates for natural language explanations is particularly challenging due to the auto-regressive generation process of LLMs and the presence of noise in medical inquiries. To bridge this gap, in this work, we first propose a novel uncertainty estimation framework for these generated natural language explanations, which provides valid uncertainty guarantees in a post-hoc and model-agnostic manner. Additionally, we also design a novel robust uncertainty estimation method that maintains valid uncertainty guarantees even under noise. Extensive experiments on QA tasks demonstrate the desired performance of our methods.

Title: Causal Fingerprints of AI Generative Models

Authors: Hui Xu, Chi Liu, Congcong Zhu, Minghao Wang, Youyang Qu, Longxiang Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15406
Pdf URL: https://arxiv.org/pdf/2509.15406
Copy Paste: [[2509.15406]] Causal Fingerprints of AI Generative Models(https://arxiv.org/abs/2509.15406)
Keywords: protect, diffusion, generative
Abstract: AI generative models leave implicit traces in their generated images, which are commonly referred to as model fingerprints and are exploited for source attribution. Prior methods rely on model-specific cues or synthesis artifacts, yielding limited fingerprints that may generalize poorly across different generative models. We argue that a complete model fingerprint should reflect the causality between image provenance and model traces, a direction largely unexplored. To this end, we conceptualize the \emph{causal fingerprint} of generative models, and propose a causality-decoupling framework that disentangles it from image-specific content and style in a semantic-invariant latent space derived from pre-trained diffusion reconstruction residual. We further enhance fingerprint granularity with diverse feature representations. We validate causality by assessing attribution performance across representative GANs and diffusion models and by achieving source anonymization using counterfactual examples generated from causal fingerprints. Experiments show our approach outperforms existing methods in model attribution, indicating strong potential for forgery detection, model copyright tracing, and identity protection.

Title: NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training

Authors: Moinak Bhattacharya, Angelica P. Kurtz, Fabio M. Iwamoto, Prateek Prasanna, Gagandeep Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15416
Pdf URL: https://arxiv.org/pdf/2509.15416
Copy Paste: [[2509.15416]] NeuroRAD-FM: A Foundation Model for Neuro-Oncology with Distributionally Robust Training(https://arxiv.org/abs/2509.15416)
Keywords: robust, interpretability
Abstract: Neuro-oncology poses unique challenges for machine learning due to heterogeneous data and tumor complexity, limiting the ability of foundation models (FMs) to generalize across cohorts. Existing FMs also perform poorly in predicting uncommon molecular markers, which are essential for treatment response and risk stratification. To address these gaps, we developed a neuro-oncology specific FM with a distributionally robust loss function, enabling accurate estimation of tumor phenotypes while maintaining cross-institution generalization. We pretrained self-supervised backbones (BYOL, DINO, MAE, MoCo) on multi-institutional brain tumor MRI and applied distributionally robust optimization (DRO) to mitigate site and class imbalance. Downstream tasks included molecular classification of common markers (MGMT, IDH1, 1p/19q, EGFR), uncommon alterations (ATRX, TP53, CDKN2A/2B, TERT), continuous markers (Ki-67, TP53), and overall survival prediction in IDH1 wild-type glioblastoma at UCSF, UPenn, and CUIMC. Our method improved molecular prediction and reduced site-specific embedding differences. At CUIMC, mean balanced accuracy rose from 0.744 to 0.785 and AUC from 0.656 to 0.676, with the largest gains for underrepresented endpoints (CDKN2A/2B accuracy 0.86 to 0.92, AUC 0.73 to 0.92; ATRX AUC 0.69 to 0.82; Ki-67 accuracy 0.60 to 0.69). For survival, c-index improved at all sites: CUIMC 0.592 to 0.597, UPenn 0.647 to 0.672, UCSF 0.600 to 0.627. Grad-CAM highlighted tumor and peri-tumoral regions, confirming interpretability. Overall, coupling FMs with DRO yields more site-invariant representations, improves prediction of common and uncommon markers, and enhances survival discrimination, underscoring the need for prospective validation and integration of longitudinal and interventional signals to advance precision neuro-oncology.

Title: Deep learning and abstractive summarisation for radiological reports: an empirical study for adapting the PEGASUS models' family with scarce data

Authors: Claudio Benzoni, Martina Langhals, Martin Boeker, Luise Modersohn, Máté E. Maros
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15419
Pdf URL: https://arxiv.org/pdf/2509.15419
Copy Paste: [[2509.15419]] Deep learning and abstractive summarisation for radiological reports: an empirical study for adapting the PEGASUS models' family with scarce data(https://arxiv.org/abs/2509.15419)
Keywords: robust
Abstract: Regardless of the rapid development of artificial intelligence, abstractive summarisation is still challenging for sensitive and data-restrictive domains like medicine. With the increasing number of imaging, the relevance of automated tools for complex medical text summarisation is expected to become highly relevant. In this paper, we investigated the adaptation via fine-tuning process of a non-domain-specific abstractive summarisation encoder-decoder model family, and gave insights to practitioners on how to avoid over- and underfitting. We used PEGASUS and PEGASUS-X, on a medium-sized radiological reports public dataset. For each model, we comprehensively evaluated two different checkpoints with varying sizes of the same training data. We monitored the models' performances with lexical and semantic metrics during the training history on the fixed-size validation set. PEGASUS exhibited different phases, which can be related to epoch-wise double-descent, or peak-drop-recovery behaviour. For PEGASUS-X, we found that using a larger checkpoint led to a performance detriment. This work highlights the challenges and risks of fine-tuning models with high expressivity when dealing with scarce training data, and lays the groundwork for future investigations into more robust fine-tuning strategies for summarisation models in specialised domains.

Title: Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data

Authors: Victor Chardès
Subjects: cs.LG, physics.bio-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2509.15429
Pdf URL: https://arxiv.org/pdf/2509.15429
Copy Paste: [[2509.15429]] Random Matrix Theory-guided sparse PCA for single-cell RNA-seq data(https://arxiv.org/abs/2509.15429)
Keywords: robust, interpretability, diffusion
Abstract: Single-cell RNA-seq provides detailed molecular snapshots of individual cells but is notoriously noisy. Variability stems from biological differences, PCR amplification bias, limited sequencing depth, and low capture efficiency, making it challenging to adapt computational pipelines to heterogeneous datasets or evolving technologies. As a result, most studies still rely on principal component analysis (PCA) for dimensionality reduction, valued for its interpretability and robustness. Here, we improve upon PCA with a Random Matrix Theory (RMT)-based approach that guides the inference of sparse principal components using existing sparse PCA algorithms. We first introduce a novel biwhitening method, inspired by the Sinkhorn-Knopp algorithm, that simultaneously stabilizes variance across genes and cells. This enables the use of an RMT-based criterion to automatically select the sparsity level, rendering sparse PCA nearly parameter-free. Our mathematically grounded approach retains the interpretability of PCA while enabling robust, hands-off inference of sparse principal components. Across seven single-cell RNA-seq technologies and four sparse PCA algorithms, we show that this method systematically improves the reconstruction of the principal subspace and consistently outperforms PCA-, autoencoder-, and diffusion-based methods in cell-type classification tasks.

Title: Synergizing Static Analysis with Large Language Models for Vulnerability Discovery and beyond

Authors: Vaibhav Agrawal, Kiarash Ahi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2509.15433
Pdf URL: https://arxiv.org/pdf/2509.15433
Copy Paste: [[2509.15433]] Synergizing Static Analysis with Large Language Models for Vulnerability Discovery and beyond(https://arxiv.org/abs/2509.15433)
Keywords: security, large language model
Abstract: This report examines the synergy between Large Language Models (LLMs) and Static Application Security Testing (SAST) to improve vulnerability discovery. Traditional SAST tools, while effective for proactive security, are limited by high false-positive rates and a lack of contextual understanding. Conversely, LLMs excel at code analysis and pattern recognition but can be prone to inconsistencies and hallucinations. By integrating these two technologies, a more intelligent and efficient system is created. This combination moves beyond mere vulnerability detection optimization, transforming security into a deeply integrated, contextual process that provides tangible benefits like improved triage, dynamic bug descriptions, bug validation via exploit generation and enhanced analysis of complex codebases. The result is a more effective security approach that leverages the strengths of both technologies while mitigating their weaknesses. SAST-Genius reduced false positives by about 91 % (225 to 20) compared to Semgrep alone.

Title: ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models

Authors: Chung-En Johnny Yu, Hsuan-Chih (Neil)Chen, Brian Jalaian, Nathaniel D. Bastian
Subjects: cs.CV, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2509.15435
Pdf URL: https://arxiv.org/pdf/2509.15435
Copy Paste: [[2509.15435]] ORCA: Agentic Reasoning For Hallucination and Adversarial Robustness in Vision-Language Models(https://arxiv.org/abs/2509.15435)
Keywords: defense, attack, robust
Abstract: Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through test-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe--Reason--Critique--Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLM performance by +3.64\% to +40.67\% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11\% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20\% to +48.00\% across evaluation metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.

Title: PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting

Authors: Caitlin Cisar, Emily Sheffield, Joshua Drake, Alden Harrell, Subramanian Chidambaram, Nikita Nangia, Vinayak Arannil, Alex Williams
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15447
Pdf URL: https://arxiv.org/pdf/2509.15447
Copy Paste: [[2509.15447]] PILOT: Steering Synthetic Data Generation with Psychological & Linguistic Output Targeting(https://arxiv.org/abs/2509.15447)
Keywords: generative, large language model
Abstract: Generative AI applications commonly leverage user personas as a steering mechanism for synthetic data generation, but reliance on natural language representations forces models to make unintended inferences about which attributes to emphasize, limiting precise control over outputs. We introduce PILOT (Psychological and Linguistic Output Targeting), a two-phase framework for steering large language models with structured psycholinguistic profiles. In Phase 1, PILOT translates natural language persona descriptions into multidimensional profiles with normalized scores across linguistic and psychological dimensions. In Phase 2, these profiles guide generation along measurable axes of variation. We evaluate PILOT across three state-of-the-art LLMs (Mistral Large 2, Deepseek-R1, LLaMA 3.3 70B) using 25 synthetic personas under three conditions: Natural-language Persona Steering (NPS), Schema-Based Steering (SBS), and Hybrid Persona-Schema Steering (HPS). Results demonstrate that schema-based approaches significantly reduce artificial-sounding persona repetition while improving output coherence, with silhouette scores increasing from 0.098 to 0.237 and topic purity from 0.773 to 0.957. Our analysis reveals a fundamental trade-off: SBS produces more concise outputs with higher topical consistency, while NPS offers greater lexical diversity but reduced predictability. HPS achieves a balance between these extremes, maintaining output variety while preserving structural consistency. Expert linguistic evaluation confirms that PILOT maintains high response quality across all conditions, with no statistically significant differences between steering approaches.

Title: Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

Authors: Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida
Subjects: cs.LG, cs.AI, cs.NE, stat.ML
Abstract URL: https://arxiv.org/abs/2509.15448
Pdf URL: https://arxiv.org/pdf/2509.15448
Copy Paste: [[2509.15448]] Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems(https://arxiv.org/abs/2509.15448)
Keywords: transformer
Abstract: Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for the language data, they quickly found their way to the image, video, graph, etc. data modalities with various signal geometries. Despite this versatility, generalizing the attention mechanism to scenarios where data is presented at different scales from potentially different modalities is not straightforward. The attempts to incorporate hierarchy and multi-modality within transformers are largely based on ad hoc heuristics, which are not seamlessly generalizable to similar problems with potentially different structures. To address this problem, in this paper, we take a fundamentally different approach: we first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention while incorporating the inductive biases originating from the hierarchical/geometric information of the problem. We further propose an efficient algorithm based on dynamic programming to compute our derived attention mechanism. By incorporating it within transformers, we show that the proposed hierarchical attention mechanism not only can be employed to train transformer models in hierarchical/multi-modal settings from scratch, but it can also be used to inject hierarchical information into classical, pre-trained transformer models post training, resulting in more efficient models in zero-shot manner.

Title: IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs

Authors: Junchen Zhao, Ali Derakhshan, Dushyant Bharadwaj, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, Ian Harris
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15455
Pdf URL: https://arxiv.org/pdf/2509.15455
Copy Paste: [[2509.15455]] IMPQ: Interaction-Aware Layerwise Mixed Precision Quantization for LLMs(https://arxiv.org/abs/2509.15455)
Keywords: large language model
Abstract: Large Language Models (LLMs) promise impressive capabilities, yet their multi-billion-parameter scale makes on-device or low-resource deployment prohibitive. Mixed-precision quantization offers a compelling solution, but existing methods struggle when the average precision drops below four bits, as they rely on isolated, layer-specific metrics that overlook critical inter-layer interactions affecting overall performance. In this paper, we propose two innovations to address these limitations. First, we frame the mixed-precision quantization problem as a cooperative game among layers and introduce Shapley-based Progressive Quantization Estimation (SPQE) to efficiently obtain accurate Shapley estimates of layer sensitivities and inter-layer interactions. Second, building upon SPQE, we propose Interaction-aware Mixed-Precision Quantization (IMPQ) which translates these Shapley estimates into a binary quadratic optimization formulation, assigning either 2 or 4-bit precision to layers under strict memory constraints. Comprehensive experiments conducted on Llama-3, Gemma-2, and Qwen-3 models across three independent PTQ backends (Quanto, HQQ, GPTQ) demonstrate IMPQ's scalability and consistently superior performance compared to methods relying solely on isolated metrics. Across average precisions spanning 4 bit down to 2 bit, IMPQ cuts Perplexity by 20 to 80 percent relative to the best baseline, with the margin growing as the bit-width tightens.

Title: CAGE: Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction

Authors: Yiyi Liu, Chunyang Liu, Weiqin Jiao, Bojian Wu, Fashuai Li, Biao Xiong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15459
Pdf URL: https://arxiv.org/pdf/2509.15459
Copy Paste: [[2509.15459]] CAGE: Continuity-Aware edGE Network Unlocks Robust Floorplan Reconstruction(https://arxiv.org/abs/2509.15459)
Keywords: robust, transformer
Abstract: We present \textbf{CAGE} (\textit{Continuity-Aware edGE}) network, a \textcolor{red}{robust} framework for reconstructing vector floorplans directly from point-cloud density maps. Traditional corner-based polygon representations are highly sensitive to noise and incomplete observations, often resulting in fragmented or implausible layouts. Recent line grouping methods leverage structural cues to improve robustness but still struggle to recover fine geometric details. To address these limitations, we propose a \textit{native} edge-centric formulation, modeling each wall segment as a directed, geometrically continuous edge. This representation enables inference of coherent floorplan structures, ensuring watertight, topologically valid room boundaries while improving robustness and reducing artifacts. Towards this design, we develop a dual-query transformer decoder that integrates perturbed and latent queries within a denoising framework, which not only stabilizes optimization but also accelerates convergence. Extensive experiments on Structured3D and SceneCAD show that \textbf{CAGE} achieves state-of-the-art performance, with F1 scores of 99.1\% (rooms), 91.7\% (corners), and 89.3\% (angles). The method also demonstrates strong cross-dataset generalization, underscoring the efficacy of our architectural innovations. Code and pretrained models will be released upon acceptance.

Title: Temporal Reasoning with Large Language Models Augmented by Evolving Knowledge Graphs

Authors: Junhong Lin, Song Wang, Xiaojie Guo, Julian Shun, Yada Zhu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15464
Pdf URL: https://arxiv.org/pdf/2509.15464
Copy Paste: [[2509.15464]] Temporal Reasoning with Large Language Models Augmented by Evolving Knowledge Graphs(https://arxiv.org/abs/2509.15464)
Keywords: robust, large language model
Abstract: Large language models (LLMs) excel at many language understanding tasks but struggle to reason over knowledge that evolves. To address this, recent work has explored augmenting LLMs with knowledge graphs (KGs) to provide structured, up-to-date information. However, many existing approaches assume a static snapshot of the KG and overlook the temporal dynamics and factual inconsistencies inherent in real-world data. To address the challenge of reasoning over temporally shifting knowledge, we propose EvoReasoner, a temporal-aware multi-hop reasoning algorithm that performs global-local entity grounding, multi-route decomposition, and temporally grounded scoring. To ensure that the underlying KG remains accurate and up-to-date, we introduce EvoKG, a noise-tolerant KG evolution module that incrementally updates the KG from unstructured documents through confidence-based contradiction resolution and temporal trend tracking. We evaluate our approach on temporal QA benchmarks and a novel end-to-end setting where the KG is dynamically updated from raw documents. Our method outperforms both prompting-based and KG-enhanced baselines, effectively narrowing the gap between small and large LLMs on dynamic question answering. Notably, an 8B-parameter model using our approach matches the performance of a 671B model prompted seven months later. These results highlight the importance of combining temporal reasoning with KG evolution for robust and up-to-date LLM performance. Our code is publicly available at this http URL.

Title: Efficient Multimodal Dataset Distillation via Generative Models

Authors: Zhenghao Zhao, Haoxuan Wang, Junyi Wu, Yuzhang Shang, Gaowen Liu, Yan Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15472
Pdf URL: https://arxiv.org/pdf/2509.15472
Copy Paste: [[2509.15472]] Efficient Multimodal Dataset Distillation via Generative Models(https://arxiv.org/abs/2509.15472)
Keywords: generative, large language model
Abstract: Dataset distillation aims to synthesize a small dataset from a large dataset, enabling the model trained on it to perform well on the original dataset. With the blooming of large language models and multimodal large language models, the importance of multimodal datasets, particularly image-text datasets, has grown significantly. However, existing multimodal dataset distillation methods are constrained by the Matching Training Trajectories algorithm, which significantly increases the computing resource requirement, and takes days to process the distillation. In this work, we introduce EDGE, a generative distillation method for efficient multimodal dataset distillation. Specifically, we identify two key challenges of distilling multimodal datasets with generative models: 1) The lack of correlation between generated images and captions. 2) The lack of diversity among generated samples. To address the aforementioned issues, we propose a novel generative model training workflow with a bi-directional contrastive loss and a diversity loss. Furthermore, we propose a caption synthesis strategy to further improve text-to-image retrieval performance by introducing more text information. Our method is evaluated on Flickr30K, COCO, and CC3M datasets, demonstrating superior performance and efficiency compared to existing approaches. Notably, our method achieves results 18x faster than the state-of-the-art method.

Title: Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding

Authors: Zhu Li, Xiyuan Gao, Yuqing Zhang, Shekhar Nayak, Matt Coler
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2509.15476
Pdf URL: https://arxiv.org/pdf/2509.15476
Copy Paste: [[2509.15476]] Evaluating Multimodal Large Language Models on Spoken Sarcasm Understanding(https://arxiv.org/abs/2509.15476)
Keywords: large language model
Abstract: Sarcasm detection remains a challenge in natural language understanding, as sarcastic intent often relies on subtle cross-modal cues spanning text, speech, and vision. While prior work has primarily focused on textual or visual-textual sarcasm, comprehensive audio-visual-textual sarcasm understanding remains underexplored. In this paper, we systematically evaluate large language models (LLMs) and multimodal LLMs for sarcasm detection on English (MUStARD++) and Chinese (MCSD 1.0) in zero-shot, few-shot, and LoRA fine-tuning settings. In addition to direct classification, we explore models as feature encoders, integrating their representations through a collaborative gating fusion module. Experimental results show that audio-based models achieve the strongest unimodal performance, while text-audio and audio-vision combinations outperform unimodal and trimodal models. Furthermore, MLLMs such as Qwen-Omni show competitive zero-shot and fine-tuned performance. Our findings highlight the potential of MLLMs for cross-lingual, audio-visual-textual sarcasm understanding.

Title: Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models

Authors: Madison Van Doren, Casey Ford, Emily Dix
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15478
Pdf URL: https://arxiv.org/pdf/2509.15478
Copy Paste: [[2509.15478]] Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models(https://arxiv.org/abs/2509.15478)
Keywords: robust, large language model
Abstract: Multimodal large language models (MLLMs) are increasingly used in real world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs (GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus) when exposed to adversarial prompts across text-only and multimodal formats. A team of 26 red teamers generated 726 prompts targeting three harm categories: illegal activity, disinformation, and unethical behaviour. These prompts were submitted to each model, and 17 annotators rated 2,904 model outputs for harmfulness using a 5-point scale. Results show significant differences in vulnerability across models and modalities. Pixtral 12B exhibited the highest rate of harmful responses (~62%), while Claude Sonnet 3.5 was the most resistant (~10%). Contrary to expectations, text-only prompts were slightly more effective at bypassing safety mechanisms than multimodal ones. Statistical analysis confirmed that both model type and input modality were significant predictors of harmfulness. These findings underscore the urgent need for robust, multimodal safety benchmarks as MLLMs are deployed more widely.

Title: Solar Forecasting with Causality: A Graph-Transformer Approach to Spatiotemporal Dependencies

Authors: Yanan Niu, Demetri Psaltis, Christophe Moser, Luisa Lambertini
Subjects: cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2509.15481
Pdf URL: https://arxiv.org/pdf/2509.15481
Copy Paste: [[2509.15481]] Solar Forecasting with Causality: A Graph-Transformer Approach to Spatiotemporal Dependencies(https://arxiv.org/abs/2509.15481)
Keywords: transformer
Abstract: Accurate solar forecasting underpins effective renewable energy management. We present SolarCAST, a causally informed model predicting future global horizontal irradiance (GHI) at a target site using only historical GHI from site X and nearby stations S - unlike prior work that relies on sky-camera or satellite imagery requiring specialized hardware and heavy preprocessing. To deliver high accuracy with only public sensor data, SolarCAST models three classes of confounding factors behind X-S correlations using scalable neural components: (i) observable synchronous variables (e.g., time of day, station identity), handled via an embedding module; (ii) latent synchronous factors (e.g., regional weather patterns), captured by a spatio-temporal graph neural network; and (iii) time-lagged influences (e.g., cloud movement across stations), modeled with a gated transformer that learns temporal shifts. It outperforms leading time-series and multimodal baselines across diverse geographical conditions, and achieves a 25.9% error reduction over the top commercial forecaster, Solcast. SolarCAST offers a lightweight, practical, and generalizable solution for localized solar forecasting.

Title: Comparing Computational Pathology Foundation Models using Representational Similarity Analysis

Authors: Vaibhav Mishra, William Lotter
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15482
Pdf URL: https://arxiv.org/pdf/2509.15482
Copy Paste: [[2509.15482]] Comparing Computational Pathology Foundation Models using Representational Similarity Analysis(https://arxiv.org/abs/2509.15482)
Keywords: robust
Abstract: Foundation models are increasingly developed in computational pathology (CPath) given their promise in facilitating many downstream tasks. While recent studies have evaluated task performance across models, less is known about the structure and variability of their learned representations. Here, we systematically analyze the representational spaces of six CPath foundation models using techniques popularized in computational neuroscience. The models analyzed span vision-language contrastive learning (CONCH, PLIP, KEEP) and self-distillation (UNI (v2), Virchow (v2), Prov-GigaPath) approaches. Through representational similarity analysis using H&E image patches from TCGA, we find that UNI2 and Virchow2 have the most distinct representational structures, whereas Prov-Gigapath has the highest average similarity across models. Having the same training paradigm (vision-only vs. vision-language) did not guarantee higher representational similarity. The representations of all models showed a high slide-dependence, but relatively low disease-dependence. Stain normalization decreased slide-dependence for all models by a range of 5.5% (CONCH) to 20.5% (PLIP). In terms of intrinsic dimensionality, vision-language models demonstrated relatively compact representations, compared to the more distributed representations of vision-only models. These findings highlight opportunities to improve robustness to slide-specific features, inform model ensembling strategies, and provide insights into how training paradigms shape model representations. Our framework is extendable across medical imaging domains, where probing the internal representations of foundation models can help ensure effective development and deployment.

Title: SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters

Authors: Abdarahmane Traore, Éric Hervet, Andy Couturier
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15490
Pdf URL: https://arxiv.org/pdf/2509.15490
Copy Paste: [[2509.15490]] SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters(https://arxiv.org/abs/2509.15490)
Keywords: robust
Abstract: Recent advances in vision-language models (VLMs) have enabled powerful multimodal reasoning, but state-of-the-art approaches typically rely on extremely large models with prohibitive computational and memory requirements. This makes their deployment challenging in resource-constrained environments such as warehouses, robotics, and industrial applications, where both efficiency and robust spatial understanding are critical. In this work, we present SmolRGPT, a compact vision-language architecture that explicitly incorporates region-level spatial reasoning by integrating both RGB and depth cues. SmolRGPT employs a three-stage curriculum that progressively align visual and language features, enables spatial relationship understanding, and adapts to task-specific datasets. We demonstrate that with only 600M parameters, SmolRGPT achieves competitive results on challenging warehouse spatial reasoning benchmarks, matching or exceeding the performance of much larger alternatives. These findings highlight the potential for efficient, deployable multimodal intelligence in real-world settings without sacrificing core spatial reasoning capabilities. The code of the experimentation will be available at: this https URL

Title: Lynx: Towards High-Fidelity Personalized Video Generation

Authors: Shen Sang, Tiancheng Zhi, Tianpei Gu, Jing Liu, Linjie Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15496
Pdf URL: https://arxiv.org/pdf/2509.15496
Copy Paste: [[2509.15496]] Lynx: Towards High-Fidelity Personalized Video Generation(https://arxiv.org/abs/2509.15496)
Keywords: robust, diffusion, transformer
Abstract: We present Lynx, a high-fidelity model for personalized video synthesis from a single input image. Built on an open-source Diffusion Transformer (DiT) foundation model, Lynx introduces two lightweight adapters to ensure identity fidelity. The ID-adapter employs a Perceiver Resampler to convert ArcFace-derived facial embeddings into compact identity tokens for conditioning, while the Ref-adapter integrates dense VAE features from a frozen reference pathway, injecting fine-grained details across all transformer layers through cross-attention. These modules collectively enable robust identity preservation while maintaining temporal coherence and visual realism. Through evaluation on a curated benchmark of 40 subjects and 20 unbiased prompts, which yielded 800 test cases, Lynx has demonstrated superior face resemblance, competitive prompt following, and strong video quality, thereby advancing the state of personalized video generation.

Title: Backdoor Mitigation via Invertible Pruning Masks

Authors: Kealan Dunnett, Reza Arablouei, Dimity Miller, Volkan Dedeoglu, Raja Jurdak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15497
Pdf URL: https://arxiv.org/pdf/2509.15497
Copy Paste: [[2509.15497]] Backdoor Mitigation via Invertible Pruning Masks(https://arxiv.org/abs/2509.15497)
Keywords: defense, attack, robust, interpretability
Abstract: Model pruning has gained traction as a promising defense strategy against backdoor attacks in deep learning. However, existing pruning-based approaches often fall short in accurately identifying and removing the specific parameters responsible for inducing backdoor behaviors. Despite the dominance of fine-tuning-based defenses in recent literature, largely due to their superior performance, pruning remains a compelling alternative, offering greater interpretability and improved robustness in low-data regimes. In this paper, we propose a novel pruning approach featuring a learned \emph{selection} mechanism to identify parameters critical to both main and backdoor tasks, along with an \emph{invertible} pruning mask designed to simultaneously achieve two complementary goals: eliminating the backdoor task while preserving it through the inverse mask. We formulate this as a bi-level optimization problem that jointly learns selection variables, a sparse invertible mask, and sample-specific backdoor perturbations derived from clean data. The inner problem synthesizes candidate triggers using the inverse mask, while the outer problem refines the mask to suppress backdoor behavior without impairing clean-task accuracy. Extensive experiments demonstrate that our approach outperforms existing pruning-based backdoor mitigation approaches, maintains strong performance under limited data conditions, and achieves competitive results compared to state-of-the-art fine-tuning approaches. Notably, the proposed approach is particularly effective in restoring correct predictions for compromised samples after successful backdoor mitigation.

Title: Mental Accounts for Actions: EWA-Inspired Attention in Decision Transformers

Authors: Zahra Aref, Narayan B. Mandayam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15498
Pdf URL: https://arxiv.org/pdf/2509.15498
Copy Paste: [[2509.15498]] Mental Accounts for Actions: EWA-Inspired Attention in Decision Transformers(https://arxiv.org/abs/2509.15498)
Keywords: transformer
Abstract: Transformers have emerged as a compelling architecture for sequential decision-making by modeling trajectories via self-attention. In reinforcement learning (RL), they enable return-conditioned control without relying on value function approximation. Decision Transformers (DTs) exploit this by casting RL as supervised sequence modeling, but they are restricted to offline data and lack exploration. Online Decision Transformers (ODTs) address this limitation through entropy-regularized training on on-policy rollouts, offering a stable alternative to traditional RL methods like Soft Actor-Critic, which depend on bootstrapped targets and reward shaping. Despite these advantages, ODTs use standard attention, which lacks explicit memory of action-specific outcomes. This leads to inefficiencies in learning long-term action effectiveness. Inspired by cognitive models such as Experience-Weighted Attraction (EWA), we propose Experience-Weighted Attraction with Vector Quantization for Online Decision Transformers (EWA-VQ-ODT), a lightweight module that maintains per-action mental accounts summarizing recent successes and failures. Continuous actions are routed via direct grid lookup to a compact vector-quantized codebook, where each code stores a scalar attraction updated online through decay and reward-based reinforcement. These attractions modulate attention by biasing the columns associated with action tokens, requiring no change to the backbone or training objective. On standard continuous-control benchmarks, EWA-VQ-ODT improves sample efficiency and average return over ODT, particularly in early training. The module is computationally efficient, interpretable via per-code traces, and supported by theoretical guarantees that bound the attraction dynamics and its impact on attention drift.

Title: Adversarially Robust Assembly Language Model for Packed Executables Detection

Authors: Shijia Li, Jiang Ming, Lanqing Liu, Longwei Yang, Ni Zhang, Chunfu Jia
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2509.15499
Pdf URL: https://arxiv.org/pdf/2509.15499
Copy Paste: [[2509.15499]] Adversarially Robust Assembly Language Model for Packed Executables Detection(https://arxiv.org/abs/2509.15499)
Keywords: robust
Abstract: Detecting packed executables is a critical component of large-scale malware analysis and antivirus engine workflows, as it identifies samples that warrant computationally intensive dynamic unpacking to reveal concealed malicious behavior. Traditionally, packer detection techniques have relied on empirical features, such as high entropy or specific binary patterns. However, these empirical, feature-based methods are increasingly vulnerable to evasion by adversarial samples or unknown packers (e.g., low-entropy packers). Furthermore, the dependence on expert-crafted features poses challenges in sustaining and evolving these methods over time. In this paper, we examine the limitations of existing packer detection methods and propose Pack-ALM, a novel deep-learning-based approach for detecting packed executables. Inspired by the linguistic concept of distinguishing between real and pseudo words, we reformulate packer detection as a task of differentiating between legitimate and "pseudo" instructions. To achieve this, we preprocess native data and packed data into "pseudo" instructions and design a pre-trained assembly language model that recognizes features indicative of packed data. We evaluate Pack-ALM against leading industrial packer detection tools and state-of-the-art assembly language models. Extensive experiments on over 37,000 samples demonstrate that Pack-ALM effectively identifies packed binaries, including samples created with adversarial or previously unseen packing techniques. Moreover, Pack-ALM outperforms traditional entropy-based methods and advanced assembly language models in both detection accuracy and adversarial robustness.

Title: KoopCast: Trajectory Forecasting via Koopman Operators

Authors: Jungjin Lee, Jaeuk Shin, Gihwan Kim, Joonho Han, Insoon Yang
Subjects: cs.LG, cs.RO, eess.SY
Abstract URL: https://arxiv.org/abs/2509.15513
Pdf URL: https://arxiv.org/pdf/2509.15513
Copy Paste: [[2509.15513]] KoopCast: Trajectory Forecasting via Koopman Operators(https://arxiv.org/abs/2509.15513)
Keywords: interpretability
Abstract: We present KoopCast, a lightweight yet efficient model for trajectory forecasting in general dynamic environments. Our approach leverages Koopman operator theory, which enables a linear representation of nonlinear dynamics by lifting trajectories into a higher-dimensional space. The framework follows a two-stage design: first, a probabilistic neural goal estimator predicts plausible long-term targets, specifying where to go; second, a Koopman operator-based refinement module incorporates intention and history into a nonlinear feature space, enabling linear prediction that dictates how to go. This dual structure not only ensures strong predictive accuracy but also inherits the favorable properties of linear operators while faithfully capturing nonlinear dynamics. As a result, our model offers three key advantages: (i) competitive accuracy, (ii) interpretability grounded in Koopman spectral theory, and (iii) low-latency deployment. We validate these benefits on ETH/UCY, the Waymo Open Motion Dataset, and nuScenes, which feature rich multi-agent interactions and map-constrained nonlinear motion. Across benchmarks, KoopCast consistently delivers high predictive accuracy together with mode-level interpretability and practical efficiency.

Title: How do Language Models Generate Slang: A Systematic Comparison between Human and Machine-Generated Slang Usages

Authors: Siyang Wu, Zhewei Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15518
Pdf URL: https://arxiv.org/pdf/2509.15518
Copy Paste: [[2509.15518]] How do Language Models Generate Slang: A Systematic Comparison between Human and Machine-Generated Slang Usages(https://arxiv.org/abs/2509.15518)
Keywords: large language model
Abstract: Slang is a commonly used type of informal language that poses a daunting challenge to NLP systems. Recent advances in large language models (LLMs), however, have made the problem more approachable. While LLM agents are becoming more widely applied to intermediary tasks such as slang detection and slang interpretation, their generalizability and reliability are heavily dependent on whether these models have captured structural knowledge about slang that align well with human attested slang usages. To answer this question, we contribute a systematic comparison between human and machine-generated slang usages. Our evaluative framework focuses on three core aspects: 1) Characteristics of the usages that reflect systematic biases in how machines perceive slang, 2) Creativity reflected by both lexical coinages and word reuses employed by the slang usages, and 3) Informativeness of the slang usages when used as gold-standard examples for model distillation. By comparing human-attested slang usages from the Online Slang Dictionary (OSD) and slang generated by GPT-4o and Llama-3, we find significant biases in how LLMs perceive slang. Our results suggest that while LLMs have captured significant knowledge about the creative aspects of slang, such knowledge does not align with humans sufficiently to enable LLMs for extrapolative tasks such as linguistic analyses.

Title: SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models

Authors: Sen Wang, Jingyi Tian, Le Wang, Zhimin Liao, Jiayi Li, Huaiyi Dong, Kun Xia, Sanping Zhou, Wei Tang, Hua Gang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2509.15536
Pdf URL: https://arxiv.org/pdf/2509.15536
Copy Paste: [[2509.15536]] SAMPO:Scale-wise Autoregression with Motion PrOmpt for generative world models(https://arxiv.org/abs/2509.15536)
Keywords: generative
Abstract: World models allow agents to simulate the consequences of actions in imagined environments for planning, control, and long-horizon decision-making. However, existing autoregressive world models struggle with visually coherent predictions due to disrupted spatial structure, inefficient decoding, and inadequate motion modeling. In response, we propose \textbf{S}cale-wise \textbf{A}utoregression with \textbf{M}otion \textbf{P}r\textbf{O}mpt (\textbf{SAMPO}), a hybrid framework that combines visual autoregressive modeling for intra-frame generation with causal modeling for next-frame generation. Specifically, SAMPO integrates temporal causal decoding with bidirectional spatial attention, which preserves spatial locality and supports parallel decoding within each scale. This design significantly enhances both temporal consistency and rollout efficiency. To further improve dynamic scene understanding, we devise an asymmetric multi-scale tokenizer that preserves spatial details in observed frames and extracts compact dynamic representations for future frames, optimizing both memory usage and model performance. Additionally, we introduce a trajectory-aware motion prompt module that injects spatiotemporal cues about object and robot trajectories, focusing attention on dynamic regions and improving temporal consistency and physical realism. Extensive experiments show that SAMPO achieves competitive performance in action-conditioned video prediction and model-based control, improving generation quality with 4.4$\times$ faster inference. We also evaluate SAMPO's zero-shot generalization and scaling behavior, demonstrating its ability to generalize to unseen tasks and benefit from larger model sizes.

Title: Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track

Authors: Ran Hong, Feng Lu, Leilei Cao, An Yan, Youhai Jiang, Fengjie Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15546
Pdf URL: https://arxiv.org/pdf/2509.15546
Copy Paste: [[2509.15546]] Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track(https://arxiv.org/abs/2509.15546)
Keywords: large language model, segmentation
Abstract: Referential Video Object Segmentation (RVOS) aims to segment all objects in a video that match a given natural language description, bridging the gap between vision and language understanding. Recent work, such as Sa2VA, combines Large Language Models (LLMs) with SAM~2, leveraging the strong video reasoning capability of LLMs to guide video segmentation. In this work, we present a training-free framework that substantially improves Sa2VA's performance on the RVOS task. Our method introduces two key components: (1) a Video-Language Checker that explicitly verifies whether the subject and action described in the query actually appear in the video, thereby reducing false positives; and (2) a Key-Frame Sampler that adaptively selects informative frames to better capture both early object appearances and long-range temporal context. Without any additional training, our approach achieves a J&F score of 64.14% on the MeViS test set, ranking 2nd place in the RVOS track of the 7th LSVOS Challenge at ICCV 2025.

Title: A method for improving multilingual quality and diversity of instruction fine-tuning datasets

Authors: Chunguang Zhao, Yilun Liu, Pufan Zeng, Yuanchang Luo, Shimin Tao, Minggui He, Weibin Meng, Song Xu, Ziang Chen, Chen Liu, Hongxia Ma, Li Zhang, Boxing Chen, Daimeng Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15549
Pdf URL: https://arxiv.org/pdf/2509.15549
Copy Paste: [[2509.15549]] A method for improving multilingual quality and diversity of instruction fine-tuning datasets(https://arxiv.org/abs/2509.15549)
Keywords: large language model
Abstract: Multilingual Instruction Fine-Tuning (IFT) is essential for enabling large language models (LLMs) to generalize effectively across diverse linguistic and cultural contexts. However, the scarcity of high-quality multilingual training data and corresponding building method remains a critical bottleneck. While data selection has shown promise in English settings, existing methods often fail to generalize across languages due to reliance on simplistic heuristics or language-specific assumptions. In this work, we introduce Multilingual Data Quality and Diversity (M-DaQ), a novel method for improving LLMs multilinguality, by selecting high-quality and semantically diverse multilingual IFT samples. We further conduct the first systematic investigation of the Superficial Alignment Hypothesis (SAH) in multilingual setting. Empirical results across 18 languages demonstrate that models fine-tuned with M-DaQ method achieve significant performance gains over vanilla baselines over 60% win rate. Human evaluations further validate these gains, highlighting the increment of cultural points in the response. We release the M-DaQ code to support future research.

Title: DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm

Authors: Xiaowei Zhu, Yubing Ren, Fang Fang, Qingfeng Tan, Shi Wang, Yanan Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15550
Pdf URL: https://arxiv.org/pdf/2509.15550
Copy Paste: [[2509.15550]] DNA-DetectLLM: Unveiling AI-Generated Text via a DNA-Inspired Mutation-Repair Paradigm(https://arxiv.org/abs/2509.15550)
Keywords: attack, robust, generative, large language model
Abstract: The rapid advancement of large language models (LLMs) has blurred the line between AI-generated and human-written text. This progress brings societal risks such as misinformation, authorship ambiguity, and intellectual property concerns, highlighting the urgent need for reliable AI-generated text detection methods. However, recent advances in generative language modeling have resulted in significant overlap between the feature distributions of human-written and AI-generated text, blurring classification boundaries and making accurate detection increasingly challenging. To address the above challenges, we propose a DNA-inspired perspective, leveraging a repair-based process to directly and interpretably capture the intrinsic differences between human-written and AI-generated text. Building on this perspective, we introduce DNA-DetectLLM, a zero-shot detection method for distinguishing AI-generated and human-written text. The method constructs an ideal AI-generated sequence for each input, iteratively repairs non-optimal tokens, and quantifies the cumulative repair effort as an interpretable detection signal. Empirical evaluations demonstrate that our method achieves state-of-the-art detection performance and exhibits strong robustness against various adversarial attacks and input lengths. Specifically, DNA-DetectLLM achieves relative improvements of 5.55% in AUROC and 2.08% in F1 score across multiple public benchmark datasets.

Title: PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors

Authors: Sepehr Dehdashtian, Mashrur M. Morshed, Jacob H. Seidman, Gaurav Bharaj, Vishnu Naresh Boddeti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15551
Pdf URL: https://arxiv.org/pdf/2509.15551
Copy Paste: [[2509.15551]] PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors(https://arxiv.org/abs/2509.15551)
Keywords: defense, attack
Abstract: Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves SID's effectiveness by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID's failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) compared to their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).

Title: Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification

Authors: Tian Lan, Yiming Zheng, Jianxin Yin
Subjects: cs.CV, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2509.15553
Pdf URL: https://arxiv.org/pdf/2509.15553
Copy Paste: [[2509.15553]] Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification(https://arxiv.org/abs/2509.15553)
Keywords: extraction, diffusion, transformer
Abstract: Multi-label classification has broad applications and depends on powerful representations capable of capturing multi-label interactions. We introduce \textit{Diff-Feat}, a simple but powerful framework that extracts intermediate features from pre-trained diffusion-Transformer models for images and text, and fuses them for downstream tasks. We observe that for vision tasks, the most discriminative intermediate feature along the diffusion process occurs at the middle step and is located in the middle block in Transformer. In contrast, for language tasks, the best feature occurs at the noise-free step and is located in the deepest block. In particular, we observe a striking phenomenon across varying datasets: a mysterious "Layer $12$" consistently yields the best performance on various downstream classification tasks for images (under DiT-XL/2-256$\times$256). We devise a heuristic local-search algorithm that pinpoints the locally optimal "image-text"$\times$"block-timestep" pair among a few candidates, avoiding an exhaustive grid search. A simple fusion-linear projection followed by addition-of the selected representations yields state-of-the-art performance: 98.6\% mAP on MS-COCO-enhanced and 45.7\% mAP on Visual Genome 500, surpassing strong CNN, graph, and Transformer baselines by a wide margin. t-SNE and clustering metrics further reveal that \textit{Diff-Feat} forms tighter semantic clusters than unimodal counterparts. The code is available at this https URL.

Title: Hybrid Deep Learning-Federated Learning Powered Intrusion Detection System for IoT/5G Advanced Edge Computing Network

Authors: Rasil Baidar, Sasa Maric, Robert Abbas
Subjects: cs.CR, cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2509.15555
Pdf URL: https://arxiv.org/pdf/2509.15555
Copy Paste: [[2509.15555]] Hybrid Deep Learning-Federated Learning Powered Intrusion Detection System for IoT/5G Advanced Edge Computing Network(https://arxiv.org/abs/2509.15555)
Keywords: security, privacy, attack, federate, explainability
Abstract: The exponential expansion of IoT and 5G-Advanced applications has enlarged the attack surface for DDoS, malware, and zero-day intrusions. We propose an intrusion detection system that fuses a convolutional neural network (CNN), a bidirectional LSTM (BiLSTM), and an autoencoder (AE) bottleneck within a privacy-preserving federated learning (FL) framework. The CNN-BiLSTM branch captures local and gated cross-feature interactions, while the AE emphasizes reconstruction-based anomaly sensitivity. Training occurs across edge devices without sharing raw data. On UNSW-NB15 (binary), the fused model attains AUC 99.59 percent and F1 97.36 percent; confusion-matrix analysis shows balanced error rates with high precision and recall. Average inference time is approximately 0.0476 ms per sample on our test hardware, which is well within the less than 10 ms URLLC budget, supporting edge deployment. We also discuss explainability, drift tolerance, and FL considerations for compliant, scalable 5G-Advanced IoT security.

Title: Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining

Authors: Ping Guo, Yubing Ren, Binbin Liu, Fengze Liu, Haobin Lin, Yifan Zhang, Bingni Zhang, Taifeng Wang, Yin Zheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15556
Pdf URL: https://arxiv.org/pdf/2509.15556
Copy Paste: [[2509.15556]] Exploring Polyglot Harmony: On Multilingual Data Allocation for Large Language Models Pretraining(https://arxiv.org/abs/2509.15556)
Keywords: robust, large language model
Abstract: Large language models (LLMs) have become integral to a wide range of applications worldwide, driving an unprecedented global demand for effective multilingual capabilities. Central to achieving robust multilingual performance is the strategic allocation of language proportions within training corpora. However, determining optimal language ratios is highly challenging due to intricate cross-lingual interactions and sensitivity to dataset scale. This paper introduces Climb (Cross-Lingual Interaction-aware Multilingual Balancing), a novel framework designed to systematically optimize multilingual data allocation. At its core, Climb introduces a cross-lingual interaction-aware language ratio, explicitly quantifying each language's effective allocation by capturing inter-language dependencies. Leveraging this ratio, Climb proposes a principled two-step optimization procedure--first equalizing marginal benefits across languages, then maximizing the magnitude of the resulting language allocation vectors--significantly simplifying the inherently complex multilingual optimization problem. Extensive experiments confirm that Climb can accurately measure cross-lingual interactions across various multilingual settings. LLMs trained with Climb-derived proportions consistently achieve state-of-the-art multilingual performance, even achieving competitive performance with open-sourced LLMs trained with more tokens.

Title: Reward Hacking Mitigation using Verifiable Composite Rewards

Authors: Mirza Farhan Bin Tarek, Rahmatollah Beheshti
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15557
Pdf URL: https://arxiv.org/pdf/2509.15557
Copy Paste: [[2509.15557]] Reward Hacking Mitigation using Verifiable Composite Rewards(https://arxiv.org/abs/2509.15557)
Keywords: large language model
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. However, applications in the medical domain, specifically for question answering, are susceptible to significant reward hacking during the reasoning phase. Our work addresses two primary forms of this behavior: i) providing a final answer without preceding reasoning, and ii) employing non-standard reasoning formats to exploit the reward mechanism. To mitigate these, we introduce a composite reward function with specific penalties for these behaviors. Our experiments show that extending RLVR with our proposed reward model leads to better-formatted reasoning with less reward hacking and good accuracy compared to the baselines. This approach marks a step toward reducing reward hacking and enhancing the reliability of models utilizing RLVR.

Title: From Development to Deployment of AI-assisted Telehealth and Screening for Vision- and Hearing-threatening diseases in resource-constrained settings: Field Observations, Challenges and Way Forward

Authors: Mahesh Shakya, Bijay Adhikari, Nirsara Shrestha, Bipin Koirala, Arun Adhikari, Prasanta Poudyal, Luna Mathema, Sarbagya Buddhacharya, Bijay Khatri, Bishesh Khanal
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2509.15558
Pdf URL: https://arxiv.org/pdf/2509.15558
Copy Paste: [[2509.15558]] From Development to Deployment of AI-assisted Telehealth and Screening for Vision- and Hearing-threatening diseases in resource-constrained settings: Field Observations, Challenges and Way Forward(https://arxiv.org/abs/2509.15558)
Keywords: robust
Abstract: Vision- and hearing-threatening diseases cause preventable disability, especially in resource-constrained settings(RCS) with few specialists and limited screening setup. Large scale AI-assisted screening and telehealth has potential to expand early detection, but practical deployment is challenging in paper-based workflows and limited documented field experience exist to build upon. We provide insights on challenges and ways forward in development to adoption of scalable AI-assisted Telehealth and screening in such settings. Specifically, we find that iterative, interdisciplinary collaboration through early prototyping, shadow deployment and continuous feedback is important to build shared understanding as well as reduce usability hurdles when transitioning from paper-based to AI-ready workflows. We find public datasets and AI models highly useful despite poor performance due to domain shift. In addition, we find the need for automated AI-based image quality check to capture gradable images for robust screening in high-volume camps. Our field learning stress the importance of treating AI development and workflow digitization as an end-to-end, iterative co-design process. By documenting these practical challenges and lessons learned, we aim to address the gap in contextual, actionable field knowledge for building real-world AI-assisted telehealth and mass-screening programs in RCS.

Title: Small LLMs with Expert Blocks Are Good Enough for Hyperparamter Tuning

Authors: Om Naphade, Saksham Bansal, Parikshit Pareek
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2509.15561
Pdf URL: https://arxiv.org/pdf/2509.15561
Copy Paste: [[2509.15561]] Small LLMs with Expert Blocks Are Good Enough for Hyperparamter Tuning(https://arxiv.org/abs/2509.15561)
Keywords: large language model
Abstract: Hyper-parameter Tuning (HPT) is a necessary step in machine learning (ML) pipelines but becomes computationally expensive and opaque with larger models. Recently, Large Language Models (LLMs) have been explored for HPT, yet most rely on models exceeding 100 billion parameters. We propose an Expert Block Framework for HPT using Small LLMs. At its core is the Trajectory Context Summarizer (TCS), a deterministic block that transforms raw training trajectories into structured context, enabling small LLMs to analyze optimization progress with reliability comparable to larger models. Using two locally-run LLMs (phi4:reasoning14B and qwen2.5-coder:32B) and a 10-trial budget, our TCS-enabled HPT pipeline achieves average performance within ~0.9 percentage points of GPT-4 across six diverse tasks.

Title: DC-Mamba: Bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection

Authors: Min Sun, Fenghui Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15563
Pdf URL: https://arxiv.org/pdf/2509.15563
Copy Paste: [[2509.15563]] DC-Mamba: Bi-temporal deformable alignment and scale-sparse enhancement for remote sensing change detection(https://arxiv.org/abs/2509.15563)
Keywords: robust
Abstract: Remote sensing change detection (RSCD) is vital for identifying land-cover changes, yet existing methods, including state-of-the-art State Space Models (SSMs), often lack explicit mechanisms to handle geometric misalignments and struggle to distinguish subtle, true changes from this http URL address this, we introduce DC-Mamba, an "align-then-enhance" framework built upon the ChangeMamba backbone. It integrates two lightweight, plug-and-play modules: (1) Bi-Temporal Deformable Alignment (BTDA), which explicitly introduces geometric awareness to correct spatial misalignments at the semantic feature level; and (2) a Scale-Sparse Change Amplifier(SSCA), which uses multi-source cues to selectively amplify high-confidence change signals while suppressing noise before the final classification. This synergistic design first establishes geometric consistency with BTDA to reduce pseudo-changes, then leverages SSCA to sharpen boundaries and enhance the visibility of small or subtle targets. Experiments show our method significantly improves performance over the strong ChangeMamba baseline, increasing the F1-score from 0.5730 to 0.5903 and IoU from 0.4015 to 0.4187. The results confirm the effectiveness of our "align-then-enhance" strategy, offering a robust and easily deployable solution that transparently addresses both geometric and feature-level challenges in RSCD.

Title: BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Authors: Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, Jian Luan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15566
Pdf URL: https://arxiv.org/pdf/2509.15566
Copy Paste: [[2509.15566]] BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent(https://arxiv.org/abs/2509.15566)
Keywords: large language model
Abstract: In the field of AI-driven human-GUI interaction automation, while rapid advances in multimodal large language models and reinforcement fine-tuning techniques have yielded remarkable progress, a fundamental challenge persists: their interaction logic significantly deviates from natural human-GUI communication patterns. To fill this gap, we propose "Blink-Think-Link" (BTL), a brain-inspired framework for human-GUI interaction that mimics the human cognitive process between users and graphical interfaces. The system decomposes interactions into three biologically plausible phases: (1) Blink - rapid detection and attention to relevant screen areas, analogous to saccadic eye movements; (2) Think - higher-level reasoning and decision-making, mirroring cognitive planning; and (3) Link - generation of executable commands for precise motor control, emulating human action selection mechanisms. Additionally, we introduce two key technical innovations for the BTL framework: (1) Blink Data Generation - an automated annotation pipeline specifically optimized for blink data, and (2) BTL Reward -- the first rule-based reward mechanism that enables reinforcement learning driven by both process and outcome. Building upon this framework, we develop a GUI agent model named BTL-UI, which demonstrates consistent state-of-the-art performance across both static GUI understanding and dynamic interaction tasks in comprehensive benchmarks. These results provide conclusive empirical validation of the framework's efficacy in developing advanced GUI Agents.

Title: LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs

Authors: Junlong Jia, Xing Wu, Chaochen Gao, Ziyang Chen, Zijia Lin, Zhongzhi Li, Weinong Wang, Haotian Xu, Donghui Jin, Debing Zhang, Binghui Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15568
Pdf URL: https://arxiv.org/pdf/2509.15568
Copy Paste: [[2509.15568]] LiteLong: Resource-Efficient Long-Context Data Synthesis for LLMs(https://arxiv.org/abs/2509.15568)
Keywords: large language model
Abstract: High-quality long-context data is essential for training large language models (LLMs) capable of processing extensive documents, yet existing synthesis approaches using relevance-based aggregation face challenges of computational efficiency. We present LiteLong, a resource-efficient method for synthesizing long-context data through structured topic organization and multi-agent debate. Our approach leverages the BISAC book classification system to provide a comprehensive hierarchical topic organization, and then employs a debate mechanism with multiple LLMs to generate diverse, high-quality topics within this structure. For each topic, we use lightweight BM25 retrieval to obtain relevant documents and concatenate them into 128K-token training samples. Experiments on HELMET and Ruler benchmarks demonstrate that LiteLong achieves competitive long-context performance and can seamlessly integrate with other long-dependency enhancement methods. LiteLong makes high-quality long-context data synthesis more accessible by reducing both computational and data engineering costs, facilitating further research in long-context language training.

Title: Cuckoo Attack: Stealthy and Persistent Attacks Against AI-IDE

Authors: Xinpeng Liu, Junming Liu, Peiyu Liu, Han Zheng, Qinying Wang, Mathias Payer, Shouling Ji, Wenhai Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2509.15572
Pdf URL: https://arxiv.org/pdf/2509.15572
Copy Paste: [[2509.15572]] Cuckoo Attack: Stealthy and Persistent Attacks Against AI-IDE(https://arxiv.org/abs/2509.15572)
Keywords: security, attack, steal
Abstract: Modern AI-powered Integrated Development Environments (AI-IDEs) are increasingly defined by an Agent-centric architecture, where an LLM-powered Agent is deeply integrated to autonomously execute complex tasks. This tight integration, however, also introduces a new and critical attack surface. Attackers can exploit these components by injecting malicious instructions into untrusted external sources, effectively hijacking the Agent to perform harmful operations beyond the user's intention or awareness. This emerging threat has quickly attracted research attention, leading to various proposed attack vectors, such as hijacking Model Context Protocol (MCP) Servers to access private data. However, most existing approaches lack stealth and persistence, limiting their practical impact. We propose the Cuckoo Attack, a novel attack that achieves stealthy and persistent command execution by embedding malicious payloads into configuration files. These files, commonly used in AI-IDEs, execute system commands during routine operations, without displaying execution details to the user. Once configured, such files are rarely revisited unless an obvious runtime error occurs, creating a blind spot for attackers to exploit. We formalize our attack paradigm into two stages, including initial infection and persistence. Based on these stages, we analyze the practicality of the attack execution process and identify the relevant exploitation techniques. Furthermore, we analyze the impact of Cuckoo Attack, which can not only invade the developer's local computer but also achieve supply chain attacks through the spread of configuration files. We contribute seven actionable checkpoints for vendors to evaluate their product security. The critical need for these checks is demonstrated by our end-to-end Proof of Concept, which validated the proposed attack across nine mainstream Agent and AI-IDE pairs.

Title: Relevance to Utility: Process-Supervised Rewrite for RAG

Authors: Jaeyoung Kim, Jongho Kim, Seung-won Hwang, Seoho Song, Young-In Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15577
Pdf URL: https://arxiv.org/pdf/2509.15577
Copy Paste: [[2509.15577]] Relevance to Utility: Process-Supervised Rewrite for RAG(https://arxiv.org/abs/2509.15577)
Keywords: generative
Abstract: Retrieval-Augmented Generation systems often suffer from a gap between optimizing retrieval relevance and generative utility: retrieved documents may be topically relevant but still lack the content needed for effective reasoning during generation. While existing "bridge" modules attempt to rewrite the retrieved text for better generation, we show how they fail to capture true document utility. In this work, we propose R2U, with a key distinction of directly optimizing to maximize the probability of generating a correct answer through process supervision. As such direct observation is expensive, we also propose approximating an efficient distillation pipeline by scaling the supervision from LLMs, which helps the smaller rewriter model generalize better. We evaluate our method across multiple open-domain question-answering benchmarks. The empirical results demonstrate consistent improvements over strong bridging baselines.

Title: Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion

Authors: Shanghong Li, Chiam Wen Qi Ruth, Hong Xu, Fang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15578
Pdf URL: https://arxiv.org/pdf/2509.15578
Copy Paste: [[2509.15578]] Multimodal Learning for Fake News Detection in Short Videos Using Linguistically Verified Data and Heterogeneous Modality Fusion(https://arxiv.org/abs/2509.15578)
Keywords: robust
Abstract: The rapid proliferation of short video platforms has necessitated advanced methods for detecting fake news. This need arises from the widespread influence and ease of sharing misinformation, which can lead to significant societal harm. Current methods often struggle with the dynamic and multimodal nature of short video content. This paper presents HFN, Heterogeneous Fusion Net, a novel multimodal framework that integrates video, audio, and text data to evaluate the authenticity of short video content. HFN introduces a Decision Network that dynamically adjusts modality weights during inference and a Weighted Multi-Modal Feature Fusion module to ensure robust performance even with incomplete data. Additionally, we contribute a comprehensive dataset VESV (VEracity on Short Videos) specifically designed for short video fake news detection. Experiments conducted on the FakeTT and newly collected VESV datasets demonstrate improvements of 2.71% and 4.14% in Marco F1 over state-of-the-art methods. This work establishes a robust solution capable of effectively identifying fake news in the complex landscape of short video platforms, paving the way for more reliable and comprehensive approaches in combating misinformation.

Title: DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Authors: Tsz Ting Chung, Lemao Liu, Mo Yu, Dit-Yan Yeung
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15587
Pdf URL: https://arxiv.org/pdf/2509.15587
Copy Paste: [[2509.15587]] DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models(https://arxiv.org/abs/2509.15587)
Keywords: large language model
Abstract: Logic reasoning in natural language has been recognized as an important measure of human intelligence for Large Language Models (LLMs). Popular benchmarks may entangle multiple reasoning skills and thus provide unfaithful evaluations on the logic reasoning skill. Meanwhile, existing logic reasoning benchmarks are limited in language diversity and their distributions are deviated from the distribution of an ideal logic reasoning benchmark, which may lead to biased evaluation results. This paper thereby proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way. To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in LLMs. Through experiments, we demonstrate the extent to which logical reasoning is required to answer the questions in DivLogicEval and compare the performance of different popular LLMs in conducting logical reasoning.

Title: Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Authors: Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2509.15591
Pdf URL: https://arxiv.org/pdf/2509.15591
Copy Paste: [[2509.15591]] Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification(https://arxiv.org/abs/2509.15591)
Keywords: generative
Abstract: Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at this https URL. The project website is at this https URL.

Title: Personalized Prediction By Learning Halfspace Reference Classes Under Well-Behaved Distribution

Authors: Jizhou Huang, Brendan Juba
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15592
Pdf URL: https://arxiv.org/pdf/2509.15592
Copy Paste: [[2509.15592]] Personalized Prediction By Learning Halfspace Reference Classes Under Well-Behaved Distribution(https://arxiv.org/abs/2509.15592)
Keywords: interpretability
Abstract: In machine learning applications, predictive models are trained to serve future queries across the entire data distribution. Real-world data often demands excessively complex models to achieve competitive performance, however, sacrificing interpretability. Hence, the growing deployment of machine learning models in high-stakes applications, such as healthcare, motivates the search for methods for accurate and explainable predictions. This work proposes a Personalized Prediction scheme, where an easy-to-interpret predictor is learned per query. In particular, we wish to produce a "sparse linear" classifier with competitive performance specifically on some sub-population that includes the query point. The goal of this work is to study the PAC-learnability of this prediction model for sub-populations represented by "halfspaces" in a label-agnostic setting. We first give a distribution-specific PAC-learning algorithm for learning reference classes for personalized prediction. By leveraging both the reference-class learning algorithm and a list learner of sparse linear representations, we prove the first upper bound, $O(\mathrm{opt}^{1/4} )$, for personalized prediction with sparse linear classifiers and homogeneous halfspace subsets. We also evaluate our algorithms on a variety of standard benchmark data sets.

Title: EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery

Authors: Gui Wang, Yang Wennuo, Xusen Ma, Zehao Zhong, Zhuoru Wu, Ende Wu, Rong Qu, Wooi Ping Cheah, Jianfeng Ren, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15596
Pdf URL: https://arxiv.org/pdf/2509.15596
Copy Paste: [[2509.15596]] EyePCR: A Comprehensive Benchmark for Fine-Grained Perception, Knowledge Comprehension and Clinical Reasoning in Ophthalmic Surgery(https://arxiv.org/abs/2509.15596)
Keywords: large language model
Abstract: MLLMs (Multimodal Large Language Models) have showcased remarkable capabilities, but their performance in high-stakes, domain-specific scenarios like surgical settings, remains largely under-explored. To address this gap, we develop \textbf{EyePCR}, a large-scale benchmark for ophthalmic surgery analysis, grounded in structured clinical knowledge to evaluate cognition across \textit{Perception}, \textit{Comprehension} and \textit{Reasoning}. EyePCR offers a richly annotated corpus with more than 210k VQAs, which cover 1048 fine-grained attributes for multi-view perception, medical knowledge graph of more than 25k triplets for comprehension, and four clinically grounded reasoning tasks. The rich annotations facilitate in-depth cognitive analysis, simulating how surgeons perceive visual cues and combine them with domain knowledge to make decisions, thus greatly improving models' cognitive ability. In particular, \textbf{EyePCR-MLLM}, a domain-adapted variant of Qwen2.5-VL-7B, achieves the highest accuracy on MCQs for \textit{Perception} among compared models and outperforms open-source models in \textit{Comprehension} and \textit{Reasoning}, rivalling commercial models like GPT-4.1. EyePCR reveals the limitations of existing MLLMs in surgical cognition and lays the foundation for benchmarking and enhancing clinical reliability of surgical video understanding models.

Title: TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

Authors: Zhongyuan Bao, Lejun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15602
Pdf URL: https://arxiv.org/pdf/2509.15602
Copy Paste: [[2509.15602]] TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?(https://arxiv.org/abs/2509.15602)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporal-ordered sequence of consecutive stroke events, using automated pipelines for filtering and question generation. It covers 8 tasks at rally and stroke levels and includes 2,500 human-verified questions. Evaluating 16 representative MLLMs, we provide the first systematic assessment of tennis video understanding. Results reveal substantial shortcomings and yield two key insights: (i) frame-sampling density should be tailored and balanced across tasks, and (ii) improving temporal grounding is essential for stronger reasoning.

Title: Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation

Authors: Zheng Wang, Hong Liu, Zheng Wang, Danyi Li, Min Cen, Baptiste Magnier, Li Liang, Liansheng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15608
Pdf URL: https://arxiv.org/pdf/2509.15608
Copy Paste: [[2509.15608]] Enhancing WSI-Based Survival Analysis with Report-Auxiliary Self-Distillation(https://arxiv.org/abs/2509.15608)
Keywords: large language model
Abstract: Survival analysis based on Whole Slide Images (WSIs) is crucial for evaluating cancer prognosis, as they offer detailed microscopic information essential for predicting patient outcomes. However, traditional WSI-based survival analysis usually faces noisy features and limited data accessibility, hindering their ability to capture critical prognostic features effectively. Although pathology reports provide rich patient-specific information that could assist analysis, their potential to enhance WSI-based survival analysis remains largely unexplored. To this end, this paper proposes a novel Report-auxiliary self-distillation (Rasa) framework for WSI-based survival analysis. First, advanced large language models (LLMs) are utilized to extract fine-grained, WSI-relevant textual descriptions from original noisy pathology reports via a carefully designed task prompt. Next, a self-distillation-based pipeline is designed to filter out irrelevant or redundant WSI features for the student model under the guidance of the teacher model's textual knowledge. Finally, a risk-aware mix-up strategy is incorporated during the training of the student model to enhance both the quantity and diversity of the training data. Extensive experiments carried out on our collected data (CRC) and public data (TCGA-BRCA) demonstrate the superior effectiveness of Rasa against state-of-the-art methods. Our code is available at this https URL.

Title: SciEvent: Benchmarking Multi-domain Scientific Event Extraction

Authors: Bofu Dong, Pritesh Shah, Sumedh Sonawane, Tiyasha Banerjee, Erin Brady, Xinya Du, Ming Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15620
Pdf URL: https://arxiv.org/pdf/2509.15620
Copy Paste: [[2509.15620]] SciEvent: Benchmarking Multi-domain Scientific Event Extraction(https://arxiv.org/abs/2509.15620)
Keywords: extraction, large language model
Abstract: Scientific information extraction (SciIE) has primarily relied on entity-relation extraction in narrow domains, limiting its applicability to interdisciplinary research and struggling to capture the necessary context of scientific information, often resulting in fragmented or conflicting statements. In this paper, we introduce SciEvent, a novel multi-domain benchmark of scientific abstracts annotated via a unified event extraction (EE) schema designed to enable structured and context-aware understanding of scientific content. It includes 500 abstracts across five research domains, with manual annotations of event segments, triggers, and fine-grained arguments. We define SciIE as a multi-stage EE pipeline: (1) segmenting abstracts into core scientific activities--Background, Method, Result, and Conclusion; and (2) extracting the corresponding triggers and arguments. Experiments with fine-tuned EE models, large language models (LLMs), and human annotators reveal a performance gap, with current models struggling in domains such as sociology and humanities. SciEvent serves as a challenging benchmark and a step toward generalizable, multi-domain SciIE.

Title: Concept Unlearning in Large Language Models via Self-Constructed Knowledge Triplets

Authors: Tomoya Yamashita, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara, Tomoharu Iwata
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15621
Pdf URL: https://arxiv.org/pdf/2509.15621
Copy Paste: [[2509.15621]] Concept Unlearning in Large Language Models via Self-Constructed Knowledge Triplets(https://arxiv.org/abs/2509.15621)
Keywords: privacy, large language model
Abstract: Machine Unlearning (MU) has recently attracted considerable attention as a solution to privacy and copyright issues in large language models (LLMs). Existing MU methods aim to remove specific target sentences from an LLM while minimizing damage to unrelated knowledge. However, these approaches require explicit target sentences and do not support removing broader concepts, such as persons or events. To address this limitation, we introduce Concept Unlearning (CU) as a new requirement for LLM unlearning. We leverage knowledge graphs to represent the LLM's internal knowledge and define CU as removing the forgetting target nodes and associated edges. This graph-based formulation enables a more intuitive unlearning and facilitates the design of more effective methods. We propose a novel method that prompts the LLM to generate knowledge triplets and explanatory sentences about the forgetting target and applies the unlearning process to these representations. Our approach enables more precise and comprehensive concept removal by aligning the unlearning process with the LLM's internal knowledge representations. Experiments on real-world and synthetic datasets demonstrate that our method effectively achieves concept-level unlearning while preserving unrelated knowledge.

Title: PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning

Authors: Zhuoyao Liu, Yang Liu, Wentao Feng, Shudong Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15623
Pdf URL: https://arxiv.org/pdf/2509.15623
Copy Paste: [[2509.15623]] PCSR: Pseudo-label Consistency-Guided Sample Refinement for Noisy Correspondence Learning(https://arxiv.org/abs/2509.15623)
Keywords: robust
Abstract: Cross-modal retrieval aims to align different modalities via semantic similarity. However, existing methods often assume that image-text pairs are perfectly aligned, overlooking Noisy Correspondences in real data. These misaligned pairs misguide similarity learning and degrade retrieval performance. Previous methods often rely on coarse-grained categorizations that simply divide data into clean and noisy samples, overlooking the intrinsic diversity within noisy instances. Moreover, they typically apply uniform training strategies regardless of sample characteristics, resulting in suboptimal sample utilization for model optimization. To address the above challenges, we introduce a novel framework, called Pseudo-label Consistency-Guided Sample Refinement (PCSR), which enhances correspondence reliability by explicitly dividing samples based on pseudo-label consistency. Specifically, we first employ a confidence-based estimation to distinguish clean and noisy pairs, then refine the noisy pairs via pseudo-label consistency to uncover structurally distinct subsets. We further proposed a Pseudo-label Consistency Score (PCS) to quantify prediction stability, enabling the separation of ambiguous and refinable samples within noisy pairs. Accordingly, we adopt Adaptive Pair Optimization (APO), where ambiguous samples are optimized with robust loss functions and refinable ones are enhanced via text replacement during training. Extensive experiments on CC152K, MS-COCO and Flickr30K validate the effectiveness of our method in improving retrieval robustness under noisy supervision.

Title: Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models

Authors: Tomoya Yamashita, Akira Ito, Yuuki Yamanaka, Masanori Yamada, Takayuki Miura, Toshiki Shibahara
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15631
Pdf URL: https://arxiv.org/pdf/2509.15631
Copy Paste: [[2509.15631]] Sparse-Autoencoder-Guided Internal Representation Unlearning for Large Language Models(https://arxiv.org/abs/2509.15631)
Keywords: privacy, large language model
Abstract: As large language models (LLMs) are increasingly deployed across various applications, privacy and copyright concerns have heightened the need for more effective LLM unlearning techniques. Many existing unlearning methods aim to suppress undesirable outputs through additional training (e.g., gradient ascent), which reduces the probability of generating such outputs. While such suppression-based approaches can control model outputs, they may not eliminate the underlying knowledge embedded in the model's internal activations; muting a response is not the same as forgetting it. Moreover, such suppression-based methods often suffer from model collapse. To address these issues, we propose a novel unlearning method that directly intervenes in the model's internal activations. In our formulation, forgetting is defined as a state in which the activation of a forgotten target is indistinguishable from that of ``unknown'' entities. Our method introduces an unlearning objective that modifies the activation of the target entity away from those of known entities and toward those of unknown entities in a sparse autoencoder latent space. By aligning the target's internal activation with those of unknown entities, we shift the model's recognition of the target entity from ``known'' to ``unknown'', achieving genuine forgetting while avoiding over-suppression and model collapse. Empirically, we show that our method effectively aligns the internal activations of the forgotten target, a result that the suppression-based approaches do not reliably achieve. Additionally, our method effectively reduces the model's recall of target knowledge in question-answering tasks without significant damage to the non-target knowledge.

Title: pFedSAM: Personalized Federated Learning of Segment Anything Model for Medical Image Segmentation

Authors: Tong Wang, Xingyue Zhao, Linghao Zhuang, Haoyu Zhao, Jiayi Yin, Yuyang He, Gang Yu, Bo Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15638
Pdf URL: https://arxiv.org/pdf/2509.15638
Copy Paste: [[2509.15638]] pFedSAM: Personalized Federated Learning of Segment Anything Model for Medical Image Segmentation(https://arxiv.org/abs/2509.15638)
Keywords: privacy, robust, federate, segmentation
Abstract: Medical image segmentation is crucial for computer-aided diagnosis, yet privacy constraints hinder data sharing across institutions. Federated learning addresses this limitation, but existing approaches often rely on lightweight architectures that struggle with complex, heterogeneous data. Recently, the Segment Anything Model (SAM) has shown outstanding segmentation capabilities; however, its massive encoder poses significant challenges in federated settings. In this work, we present the first personalized federated SAM framework tailored for heterogeneous data scenarios in medical image segmentation. Our framework integrates two key innovations: (1) a personalized strategy that aggregates only the global parameters to capture cross-client commonalities while retaining the designed L-MoE (Localized Mixture-of-Experts) component to preserve domain-specific features; and (2) a decoupled global-local fine-tuning mechanism that leverages a teacher-student paradigm via knowledge distillation to bridge the gap between the global shared model and the personalized local models, thereby mitigating overgeneralization. Extensive experiments on two public datasets validate that our approach significantly improves segmentation performance, achieves robust cross-domain adaptation, and reduces communication overhead.

Title: Information Geometry of Variational Bayes

Authors: Mohammad Emtiyaz Khan
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2509.15641
Pdf URL: https://arxiv.org/pdf/2509.15641
Copy Paste: [[2509.15641]] Information Geometry of Variational Bayes(https://arxiv.org/abs/2509.15641)
Keywords: large language model
Abstract: We highlight a fundamental connection between information geometry and variational Bayes (VB) and discuss its consequences for machine learning. Under certain conditions, a VB solution always requires estimation or computation of natural gradients. We show several consequences of this fact by using the natural-gradient descent algorithm of Khan and Rue (2023) called the Bayesian Learning Rule (BLR). These include (i) a simplification of Bayes' rule as addition of natural gradients, (ii) a generalization of quadratic surrogates used in gradient-based methods, and (iii) a large-scale implementation of VB algorithms for large language models. Neither the connection nor its consequences are new but we further emphasize the common origins of the two fields of information geometry and Bayes with a hope to facilitate more work at the intersection of the two fields.

Title: UNIV: Unified Foundation Model for Infrared and Visible Modalities

Authors: Fangyuan Mao, Shuo Wang, Jilin Mei, Chen Min, Shun Lu, Fuyang Liu, Yu Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15642
Pdf URL: https://arxiv.org/pdf/2509.15642
Copy Paste: [[2509.15642]] UNIV: Unified Foundation Model for Infrared and Visible Modalities(https://arxiv.org/abs/2509.15642)
Keywords: robust, transformer, segmentation
Abstract: The demand for joint RGB-visible and infrared perception is growing rapidly, particularly to achieve robust performance under diverse weather conditions. Although pre-trained models for RGB-visible and infrared data excel in their respective domains, they often underperform in multimodal scenarios, such as autonomous vehicles equipped with both sensors. To address this challenge, we propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV), featuring two key innovations. First, we introduce Patch-wise Cross-modality Contrastive Learning (PCCL), an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition, which enables effective cross-modal feature alignment while remaining compatible with any transformer-based architecture. Second, our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing - combining LoRA adapters (2% added parameters) with synchronous distillation to prevent catastrophic forgetting, thereby replicating the retina's photopic (cone-driven) and scotopic (rod-driven) functionality. To support cross-modal learning, we introduce the MVIP dataset, the most comprehensive visible-infrared benchmark to date. It contains 98,992 precisely aligned image pairs spanning diverse scenarios. Extensive experiments demonstrate UNIV's superior performance on infrared tasks (+1.7 mIoU in semantic segmentation and +0.7 mAP in object detection) while maintaining 99%+ of the baseline performance on visible RGB tasks. Our code is available at this https URL.

Title: Future-Proofing Cloud Security Against Quantum Attacks: Risk, Transition, and Mitigation Strategies

Authors: Yaser Baseri, Abdelhakim Hafid, Arash Habibi Lashkari
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2509.15653
Pdf URL: https://arxiv.org/pdf/2509.15653
Copy Paste: [[2509.15653]] Future-Proofing Cloud Security Against Quantum Attacks: Risk, Transition, and Mitigation Strategies(https://arxiv.org/abs/2509.15653)
Keywords: secure, security, attack
Abstract: Quantum Computing (QC) introduces a transformative threat to digital security, with the potential to compromise widely deployed classical cryptographic systems. This survey offers a comprehensive and systematic examination of quantumsafe security for Cloud Computing (CC), focusing on the vulnerabilities, transition strategies, and mitigation mechanisms required to secure cloud infrastructures in the quantum era. We evaluated the landscape of quantum threats across the entire CC stack, demonstrating how quantum algorithms can undermine classical encryption and compromise cloud security at multiple architectural layers. Using a structured risk assessment methodology based on the STRIDE model, we evaluate quantum-induced attack vectors and their impact on cloud environments. To address these challenges, we propose a layered security framework that integrates hybrid cryptographic transition strategies, cryptographic agility, and proactive risk mitigation. We analyze the preparation and implementation approaches of the major Cloud Service Providers (CSPs), including AWS, Azure and GCP, synthesizing platform-specific initiatives toward Post-Quantum Cryptography (PQC). Furthermore, we provide a detailed evaluation of standardized PQC algorithms, exploring their resilience to side-channel and active attacks within cloud-native deployments. This survey serves as a strategic reference for cloud architects, policymakers, and researchers, offering actionable insights for navigating the complex transition to quantum-resilient cloud systems. We conclude by identifying six key future research directions: standardization and interoperability, performance and scalability, implementation security, integration with emerging technologies, systemic preparedness, and crypto-agile migration frameworks.

Title: Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations

Authors: Linyang He, Qiaolin Wang, Xilin Jiang, Nima Mesgarani
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2509.15655
Pdf URL: https://arxiv.org/pdf/2509.15655
Copy Paste: [[2509.15655]] Layer-wise Minimal Pair Probing Reveals Contextual Grammatical-Conceptual Hierarchy in Speech Representations(https://arxiv.org/abs/2509.15655)
Keywords: robust, transformer, large language model
Abstract: Transformer-based speech language models (SLMs) have significantly improved neural speech recognition and understanding. While existing research has examined how well SLMs encode shallow acoustic and phonetic features, the extent to which SLMs encode nuanced syntactic and conceptual features remains unclear. By drawing parallels with linguistic competence assessments for large language models, this study is the first to systematically evaluate the presence of contextual syntactic and semantic features across SLMs for self-supervised learning (S3M), automatic speech recognition (ASR), speech compression (codec), and as the encoder for auditory large language models (AudioLLMs). Through minimal pair designs and diagnostic feature analysis across 71 tasks spanning diverse linguistic levels, our layer-wise and time-resolved analysis uncovers that 1) all speech encode grammatical features more robustly than conceptual ones.

Title: VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion

Authors: Dimitrios Damianos, Leon Voukoutis, Georgios Paraskevopoulos, Vassilis Katsouros
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2509.15667
Pdf URL: https://arxiv.org/pdf/2509.15667
Copy Paste: [[2509.15667]] VOX-KRIKRI: Unifying Speech and Language through Continuous Fusion(https://arxiv.org/abs/2509.15667)
Keywords: large language model
Abstract: We present a multimodal fusion framework that bridges pre-trained decoder-based large language models (LLM) and acoustic encoder-decoder architectures such as Whisper, with the aim of building speech-enabled LLMs. Instead of directly using audio embeddings, we explore an intermediate audio-conditioned text space as a more effective mechanism for alignment. Our method operates fully in continuous text representation spaces, fusing Whisper's hidden decoder states with those of an LLM through cross-modal attention, and supports both offline and streaming modes. We introduce \textit{VoxKrikri}, the first Greek speech LLM, and show through analysis that our approach effectively aligns representations across modalities. These results highlight continuous space fusion as a promising path for multilingual and low-resource speech LLMs, while achieving state-of-the-art results for Automatic Speech Recognition in Greek, providing an average $\sim20\%$ relative improvement across benchmarks.

Title: Inference Offloading for Cost-Sensitive Binary Classification at the Edge

Authors: Vishnu Narayanan Moothedath, Umang Agarwal, Umeshraja N, James Richard Gross, Jaya Prakash Champati, Sharayu Moharir
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2509.15674
Pdf URL: https://arxiv.org/pdf/2509.15674
Copy Paste: [[2509.15674]] Inference Offloading for Cost-Sensitive Binary Classification at the Edge(https://arxiv.org/abs/2509.15674)
Keywords: robust
Abstract: We focus on a binary classification problem in an edge intelligence system where false negatives are more costly than false positives. The system has a compact, locally deployed model, which is supplemented by a larger, remote model, which is accessible via the network by incurring an offloading cost. For each sample, our system first uses the locally deployed model for inference. Based on the output of the local model, the sample may be offloaded to the remote model. This work aims to understand the fundamental trade-off between classification accuracy and these offloading costs within such a hierarchical inference (HI) system. To optimize this system, we propose an online learning framework that continuously adapts a pair of thresholds on the local model's confidence scores. These thresholds determine the prediction of the local model and whether a sample is classified locally or offloaded to the remote model. We present a closed-form solution for the setting where the local model is calibrated. For the more general case of uncalibrated models, we introduce H2T2, an online two-threshold hierarchical inference policy, and prove it achieves sublinear regret. H2T2 is model-agnostic, requires no training, and learns in the inference phase using limited feedback. Simulations on real-world datasets show that H2T2 consistently outperforms naive and single-threshold HI policies, sometimes even surpassing offline optima. The policy also demonstrates robustness to distribution shifts and adapts effectively to mismatched classifiers.

Title: KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning

Authors: Vaibhav Singh, Soumya Suvra Ghosal, Kapu Nirmal Joshua, Soumyabrata Pal, Sayak Ray Chowdhury
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.15676
Pdf URL: https://arxiv.org/pdf/2509.15676
Copy Paste: [[2509.15676]] KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning(https://arxiv.org/abs/2509.15676)
Keywords: large language model
Abstract: In-context learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new and data-scarce tasks using only a few carefully selected task-specific examples presented in the prompt. However, given the limited context size of LLMs, a fundamental question arises: Which examples should be selected to maximize performance on a given user query? While nearest-neighbor-based methods like KATE have been widely adopted for this purpose, they suffer from well-known drawbacks in high-dimensional embedding spaces, including poor generalization and a lack of diversity. In this work, we study this problem of example selection in ICL from a principled, information theory-driven perspective. We first model an LLM as a linear function over input embeddings and frame the example selection task as a query-specific optimization problem: selecting a subset of exemplars from a larger example bank that minimizes the prediction error on a specific query. This formulation departs from traditional generalization-focused learning theoretic approaches by targeting accurate prediction for a specific query instance. We derive a principled surrogate objective that is approximately submodular, enabling the use of a greedy algorithm with an approximation guarantee. We further enhance our method by (i) incorporating the kernel trick to operate in high-dimensional feature spaces without explicit mappings, and (ii) introducing an optimal design-based regularizer to encourage diversity in the selected examples. Empirically, we demonstrate significant improvements over standard retrieval methods across a suite of classification tasks, highlighting the benefits of structure-aware, diverse example selection for ICL in real-world, label-scarce scenarios.

Title: Layout Stroke Imitation: A Layout Guided Handwriting Stroke Generation for Style Imitation with Diffusion Model

Authors: Sidra Hanif, Longin Jan Latecki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15678
Pdf URL: https://arxiv.org/pdf/2509.15678
Copy Paste: [[2509.15678]] Layout Stroke Imitation: A Layout Guided Handwriting Stroke Generation for Style Imitation with Diffusion Model(https://arxiv.org/abs/2509.15678)
Keywords: diffusion
Abstract: Handwriting stroke generation is crucial for improving the performance of tasks such as handwriting recognition and writers order recovery. In handwriting stroke generation, it is significantly important to imitate the sample calligraphic style. The previous studies have suggested utilizing the calligraphic features of the handwriting. However, they had not considered word spacing (word layout) as an explicit handwriting feature, which results in inconsistent word spacing for style imitation. Firstly, this work proposes multi-scale attention features for calligraphic style imitation. These multi-scale feature embeddings highlight the local and global style features. Secondly, we propose to include the words layout, which facilitates word spacing for handwriting stroke generation. Moreover, we propose a conditional diffusion model to predict strokes in contrast to previous work, which directly generated style images. Stroke generation provides additional temporal coordinate information, which is lacking in image generation. Hence, our proposed conditional diffusion model for stroke generation is guided by calligraphic style and word layout for better handwriting imitation and stroke generation in a calligraphic style. Our experimentation shows that the proposed diffusion model outperforms the current state-of-the-art stroke generation and is competitive with recent image generation networks.

Title: SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions

Authors: Cristian Sbrolli, Matteo Matteucci
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2509.15693
Pdf URL: https://arxiv.org/pdf/2509.15693
Copy Paste: [[2509.15693]] SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions(https://arxiv.org/abs/2509.15693)
Keywords: robust, large language model, segmentation
Abstract: The whole is greater than the sum of its parts-even in 3D-text contrastive learning. We introduce SceneForge, a novel framework that enhances contrastive alignment between 3D point clouds and text through structured multi-object scene compositions. SceneForge leverages individual 3D shapes to construct multi-object scenes with explicit spatial relations, pairing them with coherent multi-object descriptions refined by a large language model. By augmenting contrastive training with these structured, compositional samples, SceneForge effectively addresses the scarcity of large-scale 3D-text datasets, significantly enriching data complexity and diversity. We systematically investigate critical design elements, such as the optimal number of objects per scene, the proportion of compositional samples in training batches, and scene construction strategies. Extensive experiments demonstrate that SceneForge delivers substantial performance gains across multiple tasks, including zero-shot classification on ModelNet, ScanObjNN, Objaverse-LVIS, and ScanNet, as well as few-shot part segmentation on ShapeNetPart. SceneForge's compositional augmentations are model-agnostic, consistently improving performance across multiple encoder architectures. Moreover, SceneForge improves 3D visual question answering on ScanQA, generalizes robustly to retrieval scenarios with increasing scene complexity, and showcases spatial reasoning capabilities by adapting spatial configurations to align precisely with textual instructions.

Title: Inference Attacks on Encrypted Online Voting via Traffic Analysis

Authors: Anastasiia Belousova, Francesco Marchiori, Mauro Conti
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2509.15694
Pdf URL: https://arxiv.org/pdf/2509.15694
Copy Paste: [[2509.15694]] Inference Attacks on Encrypted Online Voting via Traffic Analysis(https://arxiv.org/abs/2509.15694)
Keywords: secure, security, privacy, protect, attack
Abstract: Online voting enables individuals to participate in elections remotely, offering greater efficiency and accessibility in both governmental and organizational settings. As this method gains popularity, ensuring the security of online voting systems becomes increasingly vital, as the systems supporting it must satisfy a demanding set of security requirements. Most research in this area emphasizes the design and verification of cryptographic protocols to protect voter integrity and system confidentiality. However, other vectors, such as network traffic analysis, remain relatively understudied, even though they may pose significant threats to voter privacy and the overall trustworthiness of the system. In this paper, we examine how adversaries can exploit metadata from encrypted network traffic to uncover sensitive information during online voting. Our analysis reveals that, even without accessing the encrypted content, it is possible to infer critical voter actions, such as whether a person votes, the exact moment a ballot is submitted, and whether the ballot is valid or spoiled. We test these attacks with both rule-based techniques and machine learning methods. We evaluate our attacks on two widely used online voting platforms, one proprietary and one partially open source, achieving classification accuracy as high as 99.5%. These results expose a significant privacy vulnerability that threatens key properties of secure elections, including voter secrecy and protection against coercion or vote-buying. We explore mitigations to our attacks, demonstrating that countermeasures such as payload padding and timestamp equalization can substantially limit their effectiveness.

Title: Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method

Authors: Shuaibo Li, Zhaohu Xing, Hongqiu Wang, Pengfei Hao, Xingyu Li, Zekai Liu, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15711
Pdf URL: https://arxiv.org/pdf/2509.15711
Copy Paste: [[2509.15711]] Toward Medical Deepfake Detection: A Comprehensive Dataset and Novel Method(https://arxiv.org/abs/2509.15711)
Keywords: generative
Abstract: The rapid advancement of generative AI in medical imaging has introduced both significant opportunities and serious challenges, especially the risk that fake medical images could undermine healthcare systems. These synthetic images pose serious risks, such as diagnostic deception, financial fraud, and misinformation. However, research on medical forensics to counter these threats remains limited, and there is a critical lack of comprehensive datasets specifically tailored for this field. Additionally, existing media forensic methods, which are primarily designed for natural or facial images, are inadequate for capturing the distinct characteristics and subtle artifacts of AI-generated medical images. To tackle these challenges, we introduce \textbf{MedForensics}, a large-scale medical forensics dataset encompassing six medical modalities and twelve state-of-the-art medical generative models. We also propose \textbf{DSKI}, a novel \textbf{D}ual-\textbf{S}tage \textbf{K}nowledge \textbf{I}nfusing detector that constructs a vision-language feature space tailored for the detection of AI-generated medical images. DSKI comprises two core components: 1) a cross-domain fine-trace adapter (CDFA) for extracting subtle forgery clues from both spatial and noise domains during training, and 2) a medical forensic retrieval module (MFRM) that boosts detection accuracy through few-shot retrieval during testing. Experimental results demonstrate that DSKI significantly outperforms both existing methods and human experts, achieving superior accuracy across multiple medical modalities.

Title: Once Upon a Time: Interactive Learning for Storytelling with Small Language Models

Authors: Jonas Mayer Martins, Ali Hamza Bashir, Muhammad Rehan Khalid, Lisa Beinborn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15714
Pdf URL: https://arxiv.org/pdf/2509.15714
Copy Paste: [[2509.15714]] Once Upon a Time: Interactive Learning for Storytelling with Small Language Models(https://arxiv.org/abs/2509.15714)
Keywords: large language model
Abstract: Children efficiently acquire language not just by listening, but by interacting with others in their social environment. Conversely, large language models are typically trained with next-word prediction on massive amounts of text. Motivated by this contrast, we investigate whether language models can be trained with less data by learning not only from next-word prediction but also from high-level, cognitively inspired feedback. We train a student model to generate stories, which a teacher model rates on readability, narrative coherence, and creativity. By varying the amount of pretraining before the feedback loop, we assess the impact of this interactive learning on formal and functional linguistic competence. We find that the high-level feedback is highly data efficient: With just 1 M words of input in interactive learning, storytelling skills can improve as much as with 410 M words of next-word prediction.

Title: REFER: Mitigating Bias in Opinion Summarisation via Frequency Framed Prompting

Authors: Nannan Huang, Haytham M. Fayek, Xiuzhen Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15723
Pdf URL: https://arxiv.org/pdf/2509.15723
Copy Paste: [[2509.15723]] REFER: Mitigating Bias in Opinion Summarisation via Frequency Framed Prompting(https://arxiv.org/abs/2509.15723)
Keywords: fair, large language model
Abstract: Individuals express diverse opinions, a fair summary should represent these viewpoints comprehensively. Previous research on fairness in opinion summarisation using large language models (LLMs) relied on hyperparameter tuning or providing ground truth distributional information in prompts. However, these methods face practical limitations: end-users rarely modify default model parameters, and accurate distributional information is often unavailable. Building upon cognitive science research demonstrating that frequency-based representations reduce systematic biases in human statistical reasoning by making reference classes explicit and reducing cognitive load, this study investigates whether frequency framed prompting (REFER) can similarly enhance fairness in LLM opinion summarisation. Through systematic experimentation with different prompting frameworks, we adapted techniques known to improve human reasoning to elicit more effective information processing in language models compared to abstract probabilistic this http URL results demonstrate that REFER enhances fairness in language models when summarising opinions. This effect is particularly pronounced in larger language models and using stronger reasoning instructions.

Title: Flying Drones to Locate Cyber-Attackers in LoRaWAN Metropolitan Networks

Authors: Matteo Repetto, Enrico Cambiaso, Fabio Patrone, Sandro Zappatore
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2509.15725
Pdf URL: https://arxiv.org/pdf/2509.15725
Copy Paste: [[2509.15725]] Flying Drones to Locate Cyber-Attackers in LoRaWAN Metropolitan Networks(https://arxiv.org/abs/2509.15725)
Keywords: security, attack
Abstract: Today, many critical services and industrial systems rely on wireless networks for interaction with the IoT, hence becoming vulnerable to a broad number of cyber-threats. While detecting this kind of attacks is not difficult with common cyber-security tools, and even trivial for jamming, finding their origin and identifying culprits is almost impossible today, yet indispensable to stop them, especially when attacks are generated with portable or self-made devices that continuously move around. To address this open challenge, the FOLLOWME project investigates the feasibility of using UAV to locate and even chase attackers during illicit usage of the radio spectrum. The main objective is to develop a cyber-physical security framework that integrates network telemetry with wireless localization. The former triggers alarms in case of anomalies or known attack patterns and provides a coarse-grained indication of the physical area (i.e., the position of affected access gateways), whereas the latter systematically scans such area to identify the exact location of the attacker. The project will specifically address long-range metropolitan area networks and focus on the LoRaWAN protocol, which is the typical scenario for Smart City services.

Title: EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs

Authors: Davide Ettori, Nastaran Darabi, Sina Tayebati, Ranganath Krishnan, Mahesh Subedar, Omesh Tickoo, Amit Ranjan Trivedi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15735
Pdf URL: https://arxiv.org/pdf/2509.15735
Copy Paste: [[2509.15735]] EigenTrack: Spectral Activation Feature Tracking for Hallucination and Out-of-Distribution Detection in LLMs and VLMs(https://arxiv.org/abs/2509.15735)
Keywords: large language model
Abstract: Large language models (LLMs) offer broad utility but remain prone to hallucination and out-of-distribution (OOD) errors. We propose EigenTrack, an interpretable real-time detector that uses the spectral geometry of hidden activations, a compact global signature of model dynamics. By streaming covariance-spectrum statistics such as entropy, eigenvalue gaps, and KL divergence from random baselines into a lightweight recurrent classifier, EigenTrack tracks temporal shifts in representation structure that signal hallucination and OOD drift before surface errors appear. Unlike black- and grey-box methods, it needs only a single forward pass without resampling. Unlike existing white-box detectors, it preserves temporal context, aggregates global signals, and offers interpretable accuracy-latency trade-offs.

Title: GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning

Authors: Musen Lin, Minghao Liu, Taoran Lu, Lichen Yuan, Yiwei Liu, Haonan Xu, Yu Miao, Yuhao Chao, Zhaojian Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15738
Pdf URL: https://arxiv.org/pdf/2509.15738
Copy Paste: [[2509.15738]] GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning(https://arxiv.org/abs/2509.15738)
Keywords: robust
Abstract: Graphical User Interface (GUI) Agents, powered by large language and vision-language models, hold promise for enabling end-to-end automation in digital environments. However, their progress is fundamentally constrained by the scarcity of scalable, high-quality trajectory data. Existing data collection strategies either rely on costly and inconsistent manual annotations or on synthetic generation methods that trade off between diversity and meaningful task coverage. To bridge this gap, we present GUI-ReWalk: a reasoning-enhanced, multi-stage framework for synthesizing realistic and diverse GUI trajectories. GUI-ReWalk begins with a stochastic exploration phase that emulates human trial-and-error behaviors, and progressively transitions into a reasoning-guided phase where inferred goals drive coherent and purposeful interactions. Moreover, it supports multi-stride task generation, enabling the construction of long-horizon workflows across multiple applications. By combining randomness for diversity with goal-aware reasoning for structure, GUI-ReWalk produces data that better reflects the intent-aware, adaptive nature of human-computer interaction. We further train Qwen2.5-VL-7B on the GUI-ReWalk dataset and evaluate it across multiple benchmarks, including Screenspot-Pro, OSWorld-G, UI-Vision, AndroidControl, and GUI-Odyssey. Results demonstrate that GUI-ReWalk enables superior coverage of diverse interaction flows, higher trajectory entropy, and more realistic user intent. These findings establish GUI-ReWalk as a scalable and data-efficient framework for advancing GUI agent research and enabling robust real-world automation.

Title: Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics

Authors: Reza Sanayei, Srdjan Vesic, Eduardo Blanco, Mihai Surdeanu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15739
Pdf URL: https://arxiv.org/pdf/2509.15739
Copy Paste: [[2509.15739]] Can LLMs Judge Debates? Evaluating Non-Linear Reasoning via Argumentation Theory Semantics(https://arxiv.org/abs/2509.15739)
Keywords: attack, large language model
Abstract: Large Language Models (LLMs) excel at linear reasoning tasks but remain underexplored on non-linear structures such as those found in natural debates, which are best expressed as argument graphs. We evaluate whether LLMs can approximate structured reasoning from Computational Argumentation Theory (CAT). Specifically, we use Quantitative Argumentation Debate (QuAD) semantics, which assigns acceptability scores to arguments based on their attack and support relations. Given only dialogue-formatted debates from two NoDE datasets, models are prompted to rank arguments without access to the underlying graph. We test several LLMs under advanced instruction strategies, including Chain-of-Thought and In-Context Learning. While models show moderate alignment with QuAD rankings, performance degrades with longer inputs or disrupted discourse flow. Advanced prompting helps mitigate these effects by reducing biases related to argument length and position. Our findings highlight both the promise and limitations of LLMs in modeling formal argumentation semantics and motivate future work on graph-aware reasoning.

Title: TrueMoE: Dual-Routing Mixture of Discriminative Experts for Synthetic Image Detection

Authors: Laixin Zhang, Shuaibo Li, Wei Ma, Hongbin Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15741
Pdf URL: https://arxiv.org/pdf/2509.15741
Copy Paste: [[2509.15741]] TrueMoE: Dual-Routing Mixture of Discriminative Experts for Synthetic Image Detection(https://arxiv.org/abs/2509.15741)
Keywords: robust, generative
Abstract: The rapid progress of generative models has made synthetic image detection an increasingly critical task. Most existing approaches attempt to construct a single, universal discriminative space to separate real from fake content. However, such unified spaces tend to be complex and brittle, often struggling to generalize to unseen generative patterns. In this work, we propose TrueMoE, a novel dual-routing Mixture-of-Discriminative-Experts framework that reformulates the detection task as a collaborative inference across multiple specialized and lightweight discriminative subspaces. At the core of TrueMoE is a Discriminative Expert Array (DEA) organized along complementary axes of manifold structure and perceptual granularity, enabling diverse forgery cues to be captured across subspaces. A dual-routing mechanism, comprising a granularity-aware sparse router and a manifold-aware dense router, adaptively assigns input images to the most relevant experts. Extensive experiments across a wide spectrum of generative models demonstrate that TrueMoE achieves superior generalization and robustness.

Title: FloorSAM: SAM-Guided Floorplan Reconstruction with Semantic-Geometric Fusion

Authors: Han Ye, Haofu Wang, Yunchi Zhang, Jiangjian Xiao, Yuqiang Jin, Jinyuan Liu, Wen-An Zhang, Uladzislau Sychou, Alexander Tuzikov, Vladislav Sobolevskii, Valerii Zakharov, Boris Sokolov, Minglei Fu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15750
Pdf URL: https://arxiv.org/pdf/2509.15750
Copy Paste: [[2509.15750]] FloorSAM: SAM-Guided Floorplan Reconstruction with Semantic-Geometric Fusion(https://arxiv.org/abs/2509.15750)
Keywords: robust, extraction, segmentation
Abstract: Reconstructing building floor plans from point cloud data is key for indoor navigation, BIM, and precise measurements. Traditional methods like geometric algorithms and Mask R-CNN-based deep learning often face issues with noise, limited generalization, and loss of geometric details. We propose FloorSAM, a framework that integrates point cloud density maps with the Segment Anything Model (SAM) for accurate floor plan reconstruction from LiDAR data. Using grid-based filtering, adaptive resolution projection, and image enhancement, we create robust top-down density maps. FloorSAM uses SAM's zero-shot learning for precise room segmentation, improving reconstruction across diverse layouts. Room masks are generated via adaptive prompt points and multistage filtering, followed by joint mask and point cloud analysis for contour extraction and regularization. This produces accurate floor plans and recovers room topological relationships. Tests on Giblayout and ISPRS datasets show better accuracy, recall, and robustness than traditional methods, especially in noisy and complex settings. Code and materials: this http URL.

Title: MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection

Authors: Yang Li, Tingfa Xu, Shuyan Bai, Peifu Liu, Jianan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15753
Pdf URL: https://arxiv.org/pdf/2509.15753
Copy Paste: [[2509.15753]] MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection(https://arxiv.org/abs/2509.15753)
Keywords: robust
Abstract: Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at this https URL.

Title: An Adversarial Robust Behavior Sequence Anomaly Detection Approach Based on Critical Behavior Unit Learning

Authors: Dongyang Zhan, Kai Tan, Lin Ye, Xiangzhan Yu, Hongli Zhang, Zheng He
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2509.15756
Pdf URL: https://arxiv.org/pdf/2509.15756
Copy Paste: [[2509.15756]] An Adversarial Robust Behavior Sequence Anomaly Detection Approach Based on Critical Behavior Unit Learning(https://arxiv.org/abs/2509.15756)
Keywords: attack, robust
Abstract: Sequential deep learning models (e.g., RNN and LSTM) can learn the sequence features of software behaviors, such as API or syscall sequences. However, recent studies have shown that these deep learning-based approaches are vulnerable to adversarial samples. Attackers can use adversarial samples to change the sequential characteristics of behavior sequences and mislead malware classifiers. In this paper, an adversarial robustness anomaly detection method based on the analysis of behavior units is proposed to overcome this problem. We extract related behaviors that usually perform a behavior intention as a behavior unit, which contains the representative semantic information of local behaviors and can be used to improve the robustness of behavior analysis. By learning the overall semantics of each behavior unit and the contextual relationships among behavior units based on a multilevel deep learning model, our approach can mitigate perturbation attacks that target local and large-scale behaviors. In addition, our approach can be applied to both low-level and high-level behavior logs (e.g., API and syscall logs). The experimental results show that our approach outperforms all the compared methods, which indicates that our approach has better performance against obfuscation attacks.

Title: On Optimal Steering to Achieve Exact Fairness

Authors: Mohit Sharma, Amit Jayant Deshpande, Chiranjib Bhattacharyya, Rajiv Ratn Shah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15759
Pdf URL: https://arxiv.org/pdf/2509.15759
Copy Paste: [[2509.15759]] On Optimal Steering to Achieve Exact Fairness(https://arxiv.org/abs/2509.15759)
Keywords: fair, generative, large language model
Abstract: To fix the 'bias in, bias out' problem in fair machine learning, it is important to steer feature distributions of data or internal representations of Large Language Models (LLMs) to ideal ones that guarantee group-fair outcomes. Previous work on fair generative models and representation steering could greatly benefit from provable fairness guarantees on the model output. We define a distribution as ideal if the minimizer of any cost-sensitive risk on it is guaranteed to have exact group-fair outcomes (e.g., demographic parity, equal opportunity)-in other words, it has no fairness-utility trade-off. We formulate an optimization program for optimal steering by finding the nearest ideal distribution in KL-divergence, and provide efficient algorithms for it when the underlying distributions come from well-known parametric families (e.g., normal, log-normal). Empirically, our optimal steering techniques on both synthetic and real-world datasets improve fairness without diminishing utility (and sometimes even improve utility). We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs towards desired outputs so that it works equally well across different groups.

Title: UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression

Authors: Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Zhicheng Dou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15763
Pdf URL: https://arxiv.org/pdf/2509.15763
Copy Paste: [[2509.15763]] UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression(https://arxiv.org/abs/2509.15763)
Keywords: large language model
Abstract: Large language models are increasingly capable of handling long-context inputs, but the memory overhead of key-value (KV) cache remains a major bottleneck for general-purpose deployment. While various compression strategies have been explored, sequence-level compression, which drops the full KV caches for certain tokens, is particularly challenging as it can lead to the loss of important contextual information. To address this, we introduce UniGist, a sequence-level long-context compression framework that efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner. We adopt a chunk-free training strategy and design an efficient kernel with a gist shift trick, enabling optimized GPU training. Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings. Experiments across multiple long-context tasks demonstrate that UniGist significantly improves compression quality, with especially strong performance in detail-recalling tasks and long-range dependency modeling.

Title: Learning to Optimize Capacity Planning in Semiconductor Manufacturing

Authors: Philipp Andelfinger, Jieyi Bi, Qiuyu Zhu, Jianan Zhou, Bo Zhang, Fei Fei Zhang, Chew Wye Chan, Boon Ping Gan, Wentong Cai, Jie Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15767
Pdf URL: https://arxiv.org/pdf/2509.15767
Copy Paste: [[2509.15767]] Learning to Optimize Capacity Planning in Semiconductor Manufacturing(https://arxiv.org/abs/2509.15767)
Keywords: interpretability
Abstract: In manufacturing, capacity planning is the process of allocating production resources in accordance with variable demand. The current industry practice in semiconductor manufacturing typically applies heuristic rules to prioritize actions, such as future change lists that account for incoming machine and recipe dedications. However, while offering interpretability, heuristics cannot easily account for the complex interactions along the process flow that can gradually lead to the formation of bottlenecks. Here, we present a neural network-based model for capacity planning on the level of individual machines, trained using deep reinforcement learning. By representing the policy using a heterogeneous graph neural network, the model directly captures the diverse relationships among machines and processing steps, allowing for proactive decision-making. We describe several measures taken to achieve sufficient scalability to tackle the vast space of possible machine-level actions. Our evaluation results cover Intel's small-scale Minifab model and preliminary experiments using the popular SMT2020 testbed. In the largest tested scenario, our trained policy increases throughput and decreases cycle time by about 1.8% each.

Title: Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images

Authors: Herve Goeau, Vincent Espitalier, Pierre Bonnet, Alexis Joly
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15768
Pdf URL: https://arxiv.org/pdf/2509.15768
Copy Paste: [[2509.15768]] Overview of PlantCLEF 2024: multi-species plant identification in vegetation plot images(https://arxiv.org/abs/2509.15768)
Keywords: transformer
Abstract: Plot images are essential for ecological studies, enabling standardized sampling, biodiversity assessment, long-term monitoring and remote, large-scale surveys. Plot images are typically fifty centimetres or one square meter in size, and botanists meticulously identify all the species found there. The integration of AI could significantly improve the efficiency of specialists, helping them to extend the scope and coverage of ecological studies. To evaluate advances in this regard, the PlantCLEF 2024 challenge leverages a new test set of thousands of multi-label images annotated by experts and covering over 800 species. In addition, it provides a large training set of 1.7 million individual plant images as well as state-of-the-art vision transformer models pre-trained on this data. The task is evaluated as a (weakly-labeled) multi-label classification task where the aim is to predict all the plant species present on a high-resolution plot image (using the single-label training data). In this paper, we provide an detailed description of the data, the evaluation methodology, the methods and models employed by the participants and the results achieved.

Title: Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

Authors: Weimin Bai, Yubo Li, Weijian Luo, Wenzheng Chen, He Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15772
Pdf URL: https://arxiv.org/pdf/2509.15772
Copy Paste: [[2509.15772]] Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation(https://arxiv.org/abs/2509.15772)
Keywords: diffusion
Abstract: Score Distillation Sampling (SDS) enables high-quality text-to-3D generation by supervising 3D models through the denoising of multi-view 2D renderings, using a pretrained text-to-image diffusion model to align with the input prompt and ensure 3D consistency. However, existing SDS-based methods face two fundamental limitations: (1) their reliance on CLIP-style text encoders leads to coarse semantic alignment and struggles with fine-grained prompts; and (2) 2D diffusion priors lack explicit 3D spatial constraints, resulting in geometric inconsistencies and inaccurate object relationships in multi-object scenes. To address these challenges, we propose VLM3D, a novel text-to-3D generation framework that integrates large vision-language models (VLMs) into the SDS pipeline as differentiable semantic and spatial priors. Unlike standard text-to-image diffusion priors, VLMs leverage rich language-grounded supervision that enables fine-grained prompt alignment. Moreover, their inherent vision language modeling provides strong spatial understanding, which significantly enhances 3D consistency for single-object generation and improves relational reasoning in multi-object scenes. We instantiate VLM3D based on the open-source Qwen2.5-VL model and evaluate it on the GPTeval3D benchmark. Experiments across diverse objects and complex scenes show that VLM3D significantly outperforms prior SDS-based methods in semantic fidelity, geometric coherence, and spatial correctness.

Title: Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution

Authors: Chang Soo Lim, Joonyoung Moon, Donghyeon Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15781
Pdf URL: https://arxiv.org/pdf/2509.15781
Copy Paste: [[2509.15781]] Enriched Feature Representation and Motion Prediction Module for MOSEv2 Track of 7th LSVOS Challenge: 3rd Place Solution(https://arxiv.org/abs/2509.15781)
Keywords: robust, segmentation
Abstract: Video object segmentation (VOS) is a challenging task with wide applications such as video editing and autonomous driving. While Cutie provides strong query-based segmentation and SAM2 offers enriched representations via a pretrained ViT encoder, each has limitations in feature capacity and temporal modeling. In this report, we propose a framework that integrates their complementary strengths by replacing the encoder of Cutie with the ViT encoder of SAM2 and introducing a motion prediction module for temporal stability. We further adopt an ensemble strategy combining Cutie, SAM2, and our variant, achieving 3rd place in the MOSEv2 track of the 7th LSVOS Challenge. We refer to our final model as SCOPE (SAM2-CUTIE Object Prediction Ensemble). This demonstrates the effectiveness of enriched feature representation and motion prediction for robust video object segmentation. The code is available at this https URL.

Title: Ideal Registration? Segmentation is All You Need

Authors: Xiang Chen, Fengting Zhang, Qinghao Liu, Min Liu, Kun Wu, Yaonan Wang, Hang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15784
Pdf URL: https://arxiv.org/pdf/2509.15784
Copy Paste: [[2509.15784]] Ideal Registration? Segmentation is All You Need(https://arxiv.org/abs/2509.15784)
Keywords: segmentation
Abstract: Deep learning has revolutionized image registration by its ability to handle diverse tasks while achieving significant speed advantages over conventional approaches. Current approaches, however, often employ globally uniform smoothness constraints that fail to accommodate the complex, regionally varying deformations characteristic of anatomical motion. To address this limitation, we propose SegReg, a Segmentation-driven Registration framework that implements anatomically adaptive regularization by exploiting region-specific deformation patterns. Our SegReg first decomposes input moving and fixed images into anatomically coherent subregions through segmentation. These localized domains are then processed by the same registration backbone to compute optimized partial deformation fields, which are subsequently integrated into a global deformation field. SegReg achieves near-perfect structural alignment (98.23% Dice on critical anatomies) using ground-truth segmentation, and outperforms existing methods by 2-12% across three clinical registration scenarios (cardiac, abdominal, and lung images) even with automatic segmentation. Our SegReg demonstrates a near-linear dependence of registration accuracy on segmentation quality, transforming the registration challenge into a segmentation problem. The source code will be released upon manuscript acceptance.

Title: TASAM: Terrain-and-Aware Segment Anything Model for Temporal-Scale Remote Sensing Segmentation

Authors: Tianyang Wang, Xi Xiao, Gaofei Chen, Hanzhang Chi, Qi Zhang, Guo Cheng, Yingrui Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15795
Pdf URL: https://arxiv.org/pdf/2509.15795
Copy Paste: [[2509.15795]] TASAM: Terrain-and-Aware Segment Anything Model for Temporal-Scale Remote Sensing Segmentation(https://arxiv.org/abs/2509.15795)
Keywords: robust, segmentation
Abstract: Segment Anything Model (SAM) has demonstrated impressive zero-shot segmentation capabilities across natural image domains, but it struggles to generalize to the unique challenges of remote sensing data, such as complex terrain, multi-scale objects, and temporal dynamics. In this paper, we introduce TASAM, a terrain and temporally-aware extension of SAM designed specifically for high-resolution remote sensing image segmentation. TASAM integrates three lightweight yet effective modules: a terrain-aware adapter that injects elevation priors, a temporal prompt generator that captures land-cover changes over time, and a multi-scale fusion strategy that enhances fine-grained object delineation. Without retraining the SAM backbone, our approach achieves substantial performance gains across three remote sensing benchmarks-LoveDA, iSAID, and WHU-CD-outperforming both zero-shot SAM and task-specific models with minimal computational overhead. Our results highlight the value of domain-adaptive augmentation for foundation models and offer a scalable path toward more robust geospatial segmentation.

Title: Monte Carlo Tree Diffusion with Multiple Experts for Protein Design

Authors: Xuefeng Liu, Mingxuan Cao, Songhao Jiang, Xiao Luo, Xiaotian Duan, Mengdi Wang, Tobin R. Sosnick, Jinbo Xu, Rick Stevens
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2509.15796
Pdf URL: https://arxiv.org/pdf/2509.15796
Copy Paste: [[2509.15796]] Monte Carlo Tree Diffusion with Multiple Experts for Protein Design(https://arxiv.org/abs/2509.15796)
Keywords: diffusion
Abstract: The goal of protein design is to generate amino acid sequences that fold into functional structures with desired properties. Prior methods combining autoregressive language models with Monte Carlo Tree Search (MCTS) struggle with long-range dependencies and suffer from an impractically large search space. We propose MCTD-ME, Monte Carlo Tree Diffusion with Multiple Experts, which integrates masked diffusion models with tree search to enable multi-token planning and efficient exploration. Unlike autoregressive planners, MCTD-ME uses biophysical-fidelity-enhanced diffusion denoising as the rollout engine, jointly revising multiple positions and scaling to large sequence spaces. It further leverages experts of varying capacities to enrich exploration, guided by a pLDDT-based masking schedule that targets low-confidence regions while preserving reliable residues. We propose a novel multi-expert selection rule (PH-UCT-ME) extends predictive-entropy UCT to expert ensembles. On the inverse folding task (CAMEO and PDB benchmarks), MCTD-ME outperforms single-expert and unguided baselines in both sequence recovery (AAR) and structural similarity (scTM), with gains increasing for longer proteins and benefiting from multi-expert guidance. More generally, the framework is model-agnostic and applicable beyond inverse folding, including de novo protein engineering and multi-objective molecular generation.

Title: CIDER: A Causal Cure for Brand-Obsessed Text-to-Image Models

Authors: Fangjian Shen, Zifeng Liang, Chao Wang, Wushao Wen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15803
Pdf URL: https://arxiv.org/pdf/2509.15803
Copy Paste: [[2509.15803]] CIDER: A Causal Cure for Brand-Obsessed Text-to-Image Models(https://arxiv.org/abs/2509.15803)
Keywords: generative
Abstract: Text-to-image (T2I) models exhibit a significant yet under-explored "brand bias", a tendency to generate contents featuring dominant commercial brands from generic prompts, posing ethical and legal risks. We propose CIDER, a novel, model-agnostic framework to mitigate bias at inference-time through prompt refinement to avoid costly retraining. CIDER uses a lightweight detector to identify branded content and a Vision-Language Model (VLM) to generate stylistically divergent alternatives. We introduce the Brand Neutrality Score (BNS) to quantify this issue and perform extensive experiments on leading T2I models. Results show CIDER significantly reduces both explicit and implicit biases while maintaining image quality and aesthetic appeal. Our work offers a practical solution for more original and equitable content, contributing to the development of trustworthy generative AI.

Title: Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning

Authors: Sara Rajaee, Rochelle Choenni, Ekaterina Shutova, Christof Monz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15811
Pdf URL: https://arxiv.org/pdf/2509.15811
Copy Paste: [[2509.15811]] Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning(https://arxiv.org/abs/2509.15811)
Keywords: large language model
Abstract: While the reasoning abilities of large language models (LLMs) continue to advance, it remains unclear how such ability varies across languages in multilingual LLMs and whether different languages produce reasoning paths that complement each other. To investigate this question, we train a reward model to rank generated responses for a given question across languages. Our results show that our cross-lingual reward model substantially improves mathematical reasoning performance compared to using reward modeling within a single language, benefiting even high-resource languages. While English often exhibits the highest performance in multilingual models, we find that cross-lingual sampling particularly benefits English under low sampling budgets. Our findings reveal new opportunities to improve multilingual reasoning by leveraging the complementary strengths of diverse languages.

Title: SolarCrossFormer: Improving day-ahead Solar Irradiance Forecasting by Integrating Satellite Imagery and Ground Sensors

Authors: Baptiste Schubnel, Jelena Simeunović, Corentin Tissier, Pierre-Jean Alet, Rafael E. Carrillo
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2509.15827
Pdf URL: https://arxiv.org/pdf/2509.15827
Copy Paste: [[2509.15827]] SolarCrossFormer: Improving day-ahead Solar Irradiance Forecasting by Integrating Satellite Imagery and Ground Sensors(https://arxiv.org/abs/2509.15827)
Keywords: robust
Abstract: Accurate day-ahead forecasts of solar irradiance are required for the large-scale integration of solar photovoltaic (PV) systems into the power grid. However, current forecasting solutions lack the temporal and spatial resolution required by system operators. In this paper, we introduce SolarCrossFormer, a novel deep learning model for day-ahead irradiance forecasting, that combines satellite images and time series from a ground-based network of meteorological stations. SolarCrossFormer uses novel graph neural networks to exploit the inter- and intra-modal correlations of the input data and improve the accuracy and resolution of the forecasts. It generates probabilistic forecasts for any location in Switzerland with a 15-minute resolution for horizons up to 24 hours ahead. One of the key advantages of SolarCrossFormer its robustness in real life operations. It can incorporate new time-series data without retraining the model and, additionally, it can produce forecasts for locations without input data by using only their coordinates. Experimental results over a dataset of one year and 127 locations across Switzerland show that SolarCrossFormer yield a normalized mean absolute error of 6.1 % over the forecasting horizon. The results are competitive with those achieved by a commercial numerical weather prediction service.

Title: Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems

Authors: Zhongze Luo, Zhenshuai Yin, Yongxin Guo, Zhichao Wang, Jionghao Zhu, Xiaoying Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.15839
Pdf URL: https://arxiv.org/pdf/2509.15839
Copy Paste: [[2509.15839]] Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems(https://arxiv.org/abs/2509.15839)
Keywords: robust
Abstract: While multimodal LLMs (MLLMs) demonstrate remarkable reasoning progress, their application in specialized scientific domains like physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric, failing to systematically evaluate the role of visual information. Therefore, we introduce \textbf {Multi-Physics} for Chinese physics reasoning, a comprehensive benchmark that includes 5 difficulty levels, featuring 1,412 image-associated, multiple-choice questions spanning 11 high-school physics subjects. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing the model performance before and after changing the input mode. Our work provides not only a fine-grained resource for the community but also offers a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs, and our dataset and code have been open-sourced: this https URL.

Title: FedHK-MVFC: Federated Heat Kernel Multi-View Clustering

Authors: Kristina P. Sinaga
Subjects: cs.LG, cs.CV, cs.DC, math.AG
Abstract URL: https://arxiv.org/abs/2509.15844
Pdf URL: https://arxiv.org/pdf/2509.15844
Copy Paste: [[2509.15844]] FedHK-MVFC: Federated Heat Kernel Multi-View Clustering(https://arxiv.org/abs/2509.15844)
Keywords: secure, privacy, federate
Abstract: In the realm of distributed AI and privacy-focused medical applications, we propose a framework for multi-view clustering that links quantum field theory with federated healthcare analytics. Our method uses heat-kernel coefficients from spectral analysis to convert Euclidean distances into geometry-aware similarity measures, capturing the structure of diverse medical data. We lay this out through the Heat Kernel Distance (HKD) transformation with convergence guarantees. Two algorithms are developed: Heat Kernel-Enhanced Multi-View Fuzzy Clustering (HK-MVFC) for central analysis, and Federated Heat Kernel Multi-View Fuzzy Clustering (FedHK-MVFC) for secure, privacy-preserving learning across hospitals using differential privacy and secure aggregation to facilitate HIPAA-compliant collaboration. Tests on synthetic datasets of cardiovascular patients show an $8-12 \%$ increase in clustering accuracy, $70 \%$ reduced communication, and $98.2 \%$ efficiency retention over centralized methods. Validated on 10,000 patient records across two hospitals, it proves useful for collaborative phenotyping involving ECG, cardiac imaging, and behavioral data. Our theoretical contributions include update rules with proven convergence, adaptive view weighting, and privacy-preserving protocols. This presents a new standard for geometry-aware federated learning in healthcare, turning advanced math into workable solutions for analyzing sensitive medical data while ensuring both rigor and clinical relevance.

Title: ToFU: Transforming How Federated Learning Systems Forget User Data

Authors: Van-Tuan Tran, Hong-Hanh Nguyen-Le, Quoc-Viet Pham
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2509.15861
Pdf URL: https://arxiv.org/pdf/2509.15861
Copy Paste: [[2509.15861]] ToFU: Transforming How Federated Learning Systems Forget User Data(https://arxiv.org/abs/2509.15861)
Keywords: privacy, attack, federate
Abstract: Neural networks unintentionally memorize training data, creating privacy risks in federated learning (FL) systems, such as inference and reconstruction attacks on sensitive data. To mitigate these risks and to comply with privacy regulations, Federated Unlearning (FU) has been introduced to enable participants in FL systems to remove their data's influence from the global model. However, current FU methods primarily act post-hoc, struggling to efficiently erase information deeply memorized by neural networks. We argue that effective unlearning necessitates a paradigm shift: designing FL systems inherently amenable to forgetting. To this end, we propose a learning-to-unlearn Transformation-guided Federated Unlearning (ToFU) framework that incorporates transformations during the learning process to reduce memorization of specific instances. Our theoretical analysis reveals how transformation composition provably bounds instance-specific information, directly simplifying subsequent unlearning. Crucially, ToFU can work as a plug-and-play framework that improves the performance of existing FU methods. Experiments on CIFAR-10, CIFAR-100, and the MUFAC benchmark show that ToFU outperforms existing FU baselines, enhances performance when integrated with current methods, and reduces unlearning time.

Title: SAGE: Semantic-Aware Shared Sampling for Efficient Diffusion

Authors: Haoran Zhao, Tong Bai, Lei Huang, Xiaoyu Liang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15865
Pdf URL: https://arxiv.org/pdf/2509.15865
Copy Paste: [[2509.15865]] SAGE: Semantic-Aware Shared Sampling for Efficient Diffusion(https://arxiv.org/abs/2509.15865)
Keywords: diffusion
Abstract: Diffusion models manifest evident benefits across diverse domains, yet their high sampling cost, requiring dozens of sequential model evaluations, remains a major limitation. Prior efforts mainly accelerate sampling via optimized solvers or distillation, which treat each query independently. In contrast, we reduce total number of steps by sharing early-stage sampling across semantically similar queries. To enable such efficiency gains without sacrificing quality, we propose SAGE, a semantic-aware shared sampling framework that integrates a shared sampling scheme for efficiency and a tailored training strategy for quality preservation. Extensive experiments show that SAGE reduces sampling cost by 25.5%, while improving generation quality with 5.0% lower FID, 5.4% higher CLIP, and 160% higher diversity over baselines.

Title: LC-SLab -- An Object-based Deep Learning Framework for Large-scale Land Cover Classification from Satellite Imagery and Sparse In-situ Labels

Authors: Johannes Leonhardt, Juergen Gall, Ribana Roscher
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15868
Pdf URL: https://arxiv.org/pdf/2509.15868
Copy Paste: [[2509.15868]] LC-SLab -- An Object-based Deep Learning Framework for Large-scale Land Cover Classification from Satellite Imagery and Sparse In-situ Labels(https://arxiv.org/abs/2509.15868)
Keywords: robust, segmentation
Abstract: Large-scale land cover maps generated using deep learning play a critical role across a wide range of Earth science applications. Open in-situ datasets from principled land cover surveys offer a scalable alternative to manual annotation for training such models. However, their sparse spatial coverage often leads to fragmented and noisy predictions when used with existing deep learning-based land cover mapping approaches. A promising direction to address this issue is object-based classification, which assigns labels to semantically coherent image regions rather than individual pixels, thereby imposing a minimum mapping unit. Despite this potential, object-based methods remain underexplored in deep learning-based land cover mapping pipelines, especially in the context of medium-resolution imagery and sparse supervision. To address this gap, we propose LC-SLab, the first deep learning framework for systematically exploring object-based deep learning methods for large-scale land cover classification under sparse supervision. LC-SLab supports both input-level aggregation via graph neural networks, and output-level aggregation by postprocessing results from established semantic segmentation models. Additionally, we incorporate features from a large pre-trained network to improve performance on small datasets. We evaluate the framework on annual Sentinel-2 composites with sparse LUCAS labels, focusing on the tradeoff between accuracy and fragmentation, as well as sensitivity to dataset size. Our results show that object-based methods can match or exceed the accuracy of common pixel-wise models while producing substantially more coherent maps. Input-level aggregation proves more robust on smaller datasets, whereas output-level aggregation performs best with more data. Several configurations of LC-SLab also outperform existing land cover products, highlighting the framework's practical utility.

Title: ENSAM: an efficient foundation model for interactive segmentation of 3D medical images

Authors: Elias Stenhede, Agnar Martin Bjørnstad, Arian Ranjbar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15874
Pdf URL: https://arxiv.org/pdf/2509.15874
Copy Paste: [[2509.15874]] ENSAM: an efficient foundation model for interactive segmentation of 3D medical images(https://arxiv.org/abs/2509.15874)
Keywords: segmentation
Abstract: We present ENSAM (Equivariant, Normalized, Segment Anything Model), a lightweight and promptable model for universal 3D medical image segmentation. ENSAM combines a SegResNet-based encoder with a prompt encoder and mask decoder in a U-Net-style architecture, using latent cross-attention, relative positional encoding, normalized attention, and the Muon optimizer for training. ENSAM is designed to achieve good performance under limited data and computational budgets, and is trained from scratch on under 5,000 volumes from multiple modalities (CT, MRI, PET, ultrasound, microscopy) on a single 32 GB GPU in 6 hours. As part of the CVPR 2025 Foundation Models for Interactive 3D Biomedical Image Segmentation Challenge, ENSAM was evaluated on hidden test set with multimodal 3D medical images, obtaining a DSC AUC of 2.404, NSD AUC of 2.266, final DSC of 0.627, and final NSD of 0.597, outperforming two previously published baseline models (VISTA3D, SAM-Med3D) and matching the third (SegVol), surpassing its performance in final DSC but trailing behind in the other three metrics. In the coreset track of the challenge, ENSAM ranks 5th of 10 overall and best among the approaches not utilizing pretrained weights. Ablation studies confirm that our use of relative positional encodings and the Muon optimizer each substantially speed up convergence and improve segmentation quality.

Title: Self-Supervised Cross-Modal Learning for Image-to-Point Cloud Registration

Authors: Xingmei Wang, Xiaoyu Hu, Chengkai Huang, Ziyan Zeng, Guohao Nie, Quan Z. Sheng, Lina Yao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15882
Pdf URL: https://arxiv.org/pdf/2509.15882
Copy Paste: [[2509.15882]] Self-Supervised Cross-Modal Learning for Image-to-Point Cloud Registration(https://arxiv.org/abs/2509.15882)
Keywords: robust
Abstract: Bridging 2D and 3D sensor modalities is critical for robust perception in autonomous systems. However, image-to-point cloud (I2P) registration remains challenging due to the semantic-geometric gap between texture-rich but depth-ambiguous images and sparse yet metrically precise point clouds, as well as the tendency of existing methods to converge to local optima. To overcome these limitations, we introduce CrossI2P, a self-supervised framework that unifies cross-modal learning and two-stage registration in a single end-to-end pipeline. First, we learn a geometric-semantic fused embedding space via dual-path contrastive learning, enabling annotation-free, bidirectional alignment of 2D textures and 3D structures. Second, we adopt a coarse-to-fine registration paradigm: a global stage establishes superpoint-superpixel correspondences through joint intra-modal context and cross-modal interaction modeling, followed by a geometry-constrained point-level refinement for precise registration. Third, we employ a dynamic training mechanism with gradient normalization to balance losses for feature alignment, correspondence refinement, and pose estimation. Extensive experiments demonstrate that CrossI2P outperforms state-of-the-art methods by 23.7% on the KITTI Odometry benchmark and by 37.9% on nuScenes, significantly improving both accuracy and robustness.

Title: RangeSAM: Leveraging Visual Foundation Models for Range-View repesented LiDAR segmentation

Authors: Paul Julius Kühn, Duc Anh Nguyen, Arjan Kuijper, Holger Graf, Dieter Fellner, Saptarshi Neil Sinha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15886
Pdf URL: https://arxiv.org/pdf/2509.15886
Copy Paste: [[2509.15886]] RangeSAM: Leveraging Visual Foundation Models for Range-View repesented LiDAR segmentation(https://arxiv.org/abs/2509.15886)
Keywords: extraction, segmentation
Abstract: Point cloud segmentation is central to autonomous driving and 3D scene understanding. While voxel- and point-based methods dominate recent research due to their compatibility with deep architectures and ability to capture fine-grained geometry, they often incur high computational cost, irregular memory access, and limited real-time efficiency. In contrast, range-view methods, though relatively underexplored - can leverage mature 2D semantic segmentation techniques for fast and accurate predictions. Motivated by the rapid progress in Visual Foundation Models (VFMs) for captioning, zero-shot recognition, and multimodal tasks, we investigate whether SAM2, the current state-of-the-art VFM for segmentation tasks, can serve as a strong backbone for LiDAR point cloud segmentation in the range view. We present , to our knowledge, the first range-view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back-projection to operate on point clouds. To optimize SAM2 for range-view representations, we implement several architectural modifications to the encoder: (1) a novel module that emphasizes horizontal spatial dependencies inherent in LiDAR range images, (2) a customized configuration of tailored to the geometric properties of spherical projections, and (3) an adapted mechanism in the encoder backbone specifically designed to capture the unique spatial patterns and discontinuities present in range-view pseudo-images. Our approach achieves competitive performance on SemanticKITTI while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines. This work highlights the viability of VFMs as general-purpose backbones for 3D perception and opens a path toward unified, foundation-model-driven LiDAR segmentation. Results lets us conclude that range-view segmentation methods using VFMs leads to promising results.

Title: Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Authors: Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Sam Tak Wu Kwong, Yuguang Fang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15888
Pdf URL: https://arxiv.org/pdf/2509.15888
Copy Paste: [[2509.15888]] Distribution-Aligned Decoding for Efficient LLM Task Adaptation(https://arxiv.org/abs/2509.15888)
Keywords: large language model
Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVD), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model's output distribution towards the task distribution. We theoretically prove that SVD is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVD paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 points and open-ended truthfulness by 2 points, with similar gains (1-2 points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVD thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models.

Title: Global Regulation and Excitation via Attention Tuning for Stereo Matching

Authors: Jiahao Li, Xinhong Chen, Zhengmin Jiang, Qian Zhou, Yung-Hui Li, Jianping Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15891
Pdf URL: https://arxiv.org/pdf/2509.15891
Copy Paste: [[2509.15891]] Global Regulation and Excitation via Attention Tuning for Stereo Matching(https://arxiv.org/abs/2509.15891)
Keywords: robust
Abstract: Stereo matching achieves significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless, or repetitive patterns, due to a lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and achieves second on the Middlebury benchmark. Code is available at this https URL.

Title: The Psychology of Falsehood: A Human-Centric Survey of Misinformation Detection

Authors: Arghodeep Nandi, Megha Sundriyal, Euna Mehnaz Khan, Jikai Sun, Emily Vraga, Jaideep Srivastava, Tanmoy Chakraborty
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2509.15896
Pdf URL: https://arxiv.org/pdf/2509.15896
Copy Paste: [[2509.15896]] The Psychology of Falsehood: A Human-Centric Survey of Misinformation Detection(https://arxiv.org/abs/2509.15896)
Keywords: robust
Abstract: Misinformation remains one of the most significant issues in the digital age. While automated fact-checking has emerged as a viable solution, most current systems are limited to evaluating factual accuracy. However, the detrimental effect of misinformation transcends simple falsehoods; it takes advantage of how individuals perceive, interpret, and emotionally react to information. This underscores the need to move beyond factuality and adopt more human-centered detection frameworks. In this survey, we explore the evolving interplay between traditional fact-checking approaches and psychological concepts such as cognitive biases, social dynamics, and emotional responses. By analyzing state-of-the-art misinformation detection systems through the lens of human psychology and behavior, we reveal critical limitations of current methods and identify opportunities for improvement. Additionally, we outline future research directions aimed at creating more robust and adaptive frameworks, such as neuro-behavioural models that integrate technological factors with the complexities of human cognition and social influence. These approaches offer promising pathways to more effectively detect and mitigate the societal harms of misinformation.

Title: Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions

Authors: Frederic Kirstein, Sonu Kumar, Terry Ruas, Bela Gipp
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15901
Pdf URL: https://arxiv.org/pdf/2509.15901
Copy Paste: [[2509.15901]] Re-FRAME the Meeting Summarization SCOPE: Fact-Based Summarization and Personalization via Questions(https://arxiv.org/abs/2509.15901)
Keywords: large language model
Abstract: Meeting summarization with large language models (LLMs) remains error-prone, often producing outputs with hallucinations, omissions, and irrelevancies. We present FRAME, a modular pipeline that reframes summarization as a semantic enrichment task. FRAME extracts and scores salient facts, organizes them thematically, and uses these to enrich an outline into an abstractive summary. To personalize summaries, we introduce SCOPE, a reason-out-loud protocol that has the model build a reasoning trace by answering nine questions before content selection. For evaluation, we propose P-MESA, a multi-dimensional, reference-free evaluation framework to assess if a summary fits a target reader. P-MESA reliably identifies error instances, achieving >= 89% balanced accuracy against human annotations and strongly aligns with human severity ratings (r >= 0.70). On QMSum and FAME, FRAME reduces hallucination and omission by 2 out of 5 points (measured with MESA), while SCOPE improves knowledge fit and goal alignment over prompt-only baselines. Our findings advocate for rethinking summarization to improve control, faithfulness, and personalization.

Title: Deep Feedback Models

Authors: David Calhas, Arlindo L. Oliveira
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15905
Pdf URL: https://arxiv.org/pdf/2509.15905
Copy Paste: [[2509.15905]] Deep Feedback Models(https://arxiv.org/abs/2509.15905)
Keywords: robust, segmentation
Abstract: Deep Feedback Models (DFMs) are a new class of stateful neural networks that combine bottom up input with high level representations over time. This feedback mechanism introduces dynamics into otherwise static architectures, enabling DFMs to iteratively refine their internal state and mimic aspects of biological decision making. We model this process as a differential equation solved through a recurrent neural network, stabilized via exponential decay to ensure convergence. To evaluate their effectiveness, we measure DFMs under two key conditions: robustness to noise and generalization with limited data. In both object recognition and segmentation tasks, DFMs consistently outperform their feedforward counterparts, particularly in low data or high noise regimes. In addition, DFMs translate to medical imaging settings, while being robust against various types of noise corruption. These findings highlight the importance of feedback in achieving stable, robust, and generalizable learning. Code is available at this https URL.

Title: Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds

Authors: Remo Sasso, Michelangelo Conserva, Dominik Jeurissen, Paulo Rauber
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15915
Pdf URL: https://arxiv.org/pdf/2509.15915
Copy Paste: [[2509.15915]] Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds(https://arxiv.org/abs/2509.15915)
Keywords: large language model
Abstract: While reinforcement learning from scratch has shown impressive results in solving sequential decision-making tasks with efficient simulators, real-world applications with expensive interactions require more sample-efficient agents. Foundation models (FMs) are natural candidates to improve sample efficiency as they possess broad knowledge and reasoning capabilities, but it is yet unclear how to effectively integrate them into the reinforcement learning framework. In this paper, we anticipate and, most importantly, evaluate two promising strategies. First, we consider the use of foundation world models (FWMs) that exploit the prior knowledge of FMs to enable training and evaluating agents with simulated interactions. Second, we consider the use of foundation agents (FAs) that exploit the reasoning capabilities of FMs for decision-making. We evaluate both approaches empirically in a family of grid-world environments that are suitable for the current generation of large language models (LLMs). Our results suggest that improvements in LLMs already translate into better FWMs and FAs; that FAs based on current LLMs can already provide excellent policies for sufficiently simple environments; and that the coupling of FWMs and reinforcement learning agents is highly promising for more complex settings with partial observability and stochastic elements.

Title: Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment

Authors: Ahmed Karim, Qiao Wang (Judy), Zheng Yuan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15926
Pdf URL: https://arxiv.org/pdf/2509.15926
Copy Paste: [[2509.15926]] Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment(https://arxiv.org/abs/2509.15926)
Keywords: large language model
Abstract: Automated Essay Scoring (AES) systems now reach near human agreement on some public benchmarks, yet real-world adoption, especially in high-stakes examinations, remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs and formal coverage guarantees. Two open-source large language models (Llama-3 8B and Qwen-2.5 3B) are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated at a 90 percent risk level. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, indicating that open-source, mid-sized LLMs can already support teacher-in-the-loop AES; we discuss scaling and broader user studies as future work.

Title: Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Authors: Zhiyu Mou, Yiqin Lv, Miao Xu, Cheems Wang, Yixiu Mao, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15927
Pdf URL: https://arxiv.org/pdf/2509.15927
Copy Paste: [[2509.15927]] Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search(https://arxiv.org/abs/2509.15927)
Keywords: diffusion, generative, large language model
Abstract: Auto-bidding is an essential tool for advertisers to enhance their advertising performance. Recent progress has shown that AI-Generated Bidding (AIGB), which formulates the auto-bidding as a trajectory generation task and trains a conditional diffusion-based planner on offline data, achieves superior and stable performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still encounter a performance bottleneck due to their neglect of fine-grained generation quality evaluation and inability to explore beyond static datasets. To address this, we propose AIGB-Pearl (\emph{Planning with EvAluator via RL}), a novel method that integrates generative planning and policy optimization. The key to AIGB-Pearl is to construct a non-bootstrapped \emph{trajectory evaluator} to assign rewards and guide policy search, enabling the planner to optimize its generation quality iteratively through interaction. Furthermore, to enhance trajectory evaluator accuracy in offline settings, we incorporate three key techniques: (i) a Large Language Model (LLM)-based architecture for better representational capacity, (ii) hybrid point-wise and pair-wise losses for better score learning, and (iii) adaptive integration of expert feedback for better generalization ability. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

Title: Improving Monte Carlo Tree Search for Symbolic Regression

Authors: Zhengyao Huang, Daniel Zhengyu Huang, Tiannan Xiao, Dina Ma, Zhenyu Ming, Hao Shi, Yuanhui Wen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15929
Pdf URL: https://arxiv.org/pdf/2509.15929
Copy Paste: [[2509.15929]] Improving Monte Carlo Tree Search for Symbolic Regression(https://arxiv.org/abs/2509.15929)
Keywords: robust
Abstract: Symbolic regression aims to discover concise, interpretable mathematical expressions that satisfy desired objectives, such as fitting data, posing a highly combinatorial optimization problem. While genetic programming has been the dominant approach, recent efforts have explored reinforcement learning methods for improving search efficiency. Monte Carlo Tree Search (MCTS), with its ability to balance exploration and exploitation through guided search, has emerged as a promising technique for symbolic expression discovery. However, its traditional bandit strategies and sequential symbol construction often limit performance. In this work, we propose an improved MCTS framework for symbolic regression that addresses these limitations through two key innovations: (1) an extreme bandit allocation strategy tailored for identifying globally optimal expressions, with finite-time performance guarantees under polynomial reward decay assumptions; and (2) evolution-inspired state-jumping actions such as mutation and crossover, which enable non-local transitions to promising regions of the search space. These state-jumping actions also reshape the reward landscape during the search process, improving both robustness and efficiency. We conduct a thorough numerical study to the impact of these improvements and benchmark our approach against existing symbolic regression methods on a variety of datasets, including both ground-truth and black-box datasets. Our approach achieves competitive performance with state-of-the-art libraries in terms of recovery rate, attains favorable positions on the Pareto frontier of accuracy versus model complexity. Code is available at this https URL.

Title: The Alignment Bottleneck

Authors: Wenjun Cao
Subjects: cs.LG, cs.AI, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2509.15932
Pdf URL: https://arxiv.org/pdf/2509.15932
Copy Paste: [[2509.15932]] The Alignment Bottleneck(https://arxiv.org/abs/2509.15932)
Keywords: large language model
Abstract: Large language models improve with scale, yet feedback-based alignment still exhibits systematic deviations from intended behavior. Motivated by bounded rationality in economics and cognitive science, we view judgment as resource-limited and feedback as a constrained channel. On this basis, we model the loop as a two-stage cascade $U \to H \to Y$ given $S$, with cognitive capacity $C_{\text{cog}|S}$ and average total capacity $\bar{C}_{\text{tot}|S}$. Our main result is a capacity-coupled Alignment Performance Interval. It pairs a data size-independent Fano lower bound proved on a separable codebook mixture with a PAC-Bayes upper bound whose KL term is controlled by the same channel via $m \, \bar{C}_{\text{tot}|S}$. The PAC-Bayes bound becomes an upper bound on the same true risk when the canonical observable loss is used and the dataset is drawn from the same mixture. Under these matched conditions, both limits are governed by a single capacity. Consequences include that, with value complexity and capacity fixed, adding labels alone cannot cross the bound; attaining lower risk on more complex targets requires capacity that grows with $\log M$; and once useful signal saturates capacity, further optimization tends to fit channel regularities, consistent with reports of sycophancy and reward hacking. The analysis views alignment as interface engineering: measure and allocate limited capacity, manage task complexity, and decide where information is spent.

Title: Bayesian Physics Informed Neural Networks for Reliable Transformer Prognostics

Authors: Ibai Ramirez, Jokin Alcibar, Joel Pino, Mikel Sanz, David Pardo, Jose I. Aizpurua
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2509.15933
Pdf URL: https://arxiv.org/pdf/2509.15933
Copy Paste: [[2509.15933]] Bayesian Physics Informed Neural Networks for Reliable Transformer Prognostics(https://arxiv.org/abs/2509.15933)
Keywords: robust, diffusion, transformer
Abstract: Scientific Machine Learning (SciML) integrates physics and data into the learning process, offering improved generalization compared with purely data-driven models. Despite its potential, applications of SciML in prognostics remain limited, partly due to the complexity of incorporating partial differential equations (PDEs) for ageing physics and the scarcity of robust uncertainty quantification methods. This work introduces a Bayesian Physics-Informed Neural Network (B-PINN) framework for probabilistic prognostics estimation. By embedding Bayesian Neural Networks into the PINN architecture, the proposed approach produces principled, uncertainty-aware predictions. The method is applied to a transformer ageing case study, where insulation degradation is primarily driven by thermal stress. The heat diffusion PDE is used as the physical residual, and different prior distributions are investigated to examine their impact on predictive posterior distributions and their ability to encode a priori physical knowledge. The framework is validated against a finite element model developed and tested with real measurements from a solar power plant. Results, benchmarked against a dropout-PINN baseline, show that the proposed B-PINN delivers more reliable prognostic predictions by accurately quantifying predictive uncertainty. This capability is crucial for supporting robust and informed maintenance decision-making in critical power assets.

Title: UniTac2Pose: A Unified Approach Learned in Simulation for Category-level Visuotactile In-hand Pose Estimation

Authors: Mingdong Wu, Long Yang, Jin Liu, Weiyao Huang, Lehong Wu, Zelin Chen, Daolin Ma, Hao Dong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15934
Pdf URL: https://arxiv.org/pdf/2509.15934
Copy Paste: [[2509.15934]] UniTac2Pose: A Unified Approach Learned in Simulation for Category-level Visuotactile In-hand Pose Estimation(https://arxiv.org/abs/2509.15934)
Keywords: robust, diffusion
Abstract: Accurate estimation of the in-hand pose of an object based on its CAD model is crucial in both industrial applications and everyday tasks, ranging from positioning workpieces and assembling components to seamlessly inserting devices like USB connectors. While existing methods often rely on regression, feature matching, or registration techniques, achieving high precision and generalizability to unseen CAD models remains a significant challenge. In this paper, we propose a novel three-stage framework for in-hand pose estimation. The first stage involves sampling and pre-ranking pose candidates, followed by iterative refinement of these candidates in the second stage. In the final stage, post-ranking is applied to identify the most likely pose candidates. These stages are governed by a unified energy-based diffusion model, which is trained solely on simulated data. This energy model simultaneously generates gradients to refine pose estimates and produces an energy scalar that quantifies the quality of the pose estimates. Additionally, borrowing the idea from the computer vision domain, we incorporate a render-compare architecture within the energy-based score network to significantly enhance sim-to-real performance, as demonstrated by our ablation studies. We conduct comprehensive experiments to show that our method outperforms conventional baselines based on regression, matching, and registration techniques, while also exhibiting strong intra-category generalization to previously unseen CAD models. Moreover, our approach integrates tactile object pose estimation, pose tracking, and uncertainty estimation into a unified framework, enabling robust performance across a variety of real-world conditions.

Title: PAN: Pillars-Attention-Based Network for 3D Object Detection

Authors: Ruan Bispo, Dane Mitrev, Letizia Mariotti, Clément Botty, Denver Humphrey, Anthony Scanlan, Ciarán Eising
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15935
Pdf URL: https://arxiv.org/pdf/2509.15935
Copy Paste: [[2509.15935]] PAN: Pillars-Attention-Based Network for 3D Object Detection(https://arxiv.org/abs/2509.15935)
Keywords: robust
Abstract: Camera-radar fusion offers a robust and low-cost alternative to Camera-lidar fusion for the 3D object detection task in real-time under adverse weather and lighting conditions. However, currently, in the literature, it is possible to find few works focusing on this modality and, most importantly, developing new architectures to explore the advantages of the radar point cloud, such as accurate distance estimation and speed information. Therefore, this work presents a novel and efficient 3D object detection algorithm using cameras and radars in the bird's-eye-view (BEV). Our algorithm exploits the advantages of radar before fusing the features into a detection head. A new backbone is introduced, which maps the radar pillar features into an embedded dimension. A self-attention mechanism allows the backbone to model the dependencies between the radar points. We are using a simplified convolutional layer to replace the FPN-based convolutional layers used in the PointPillars-based architectures with the main goal of reducing inference time. Our results show that with this modification, our approach achieves the new state-of-the-art in the 3D object detection problem, reaching 58.2 of the NDS metric for the use of ResNet-50, while also setting a new benchmark for inference time on the nuScenes dataset for the same category.

Title: Targeted Fine-Tuning of DNN-Based Receivers via Influence Functions

Authors: Marko Tuononen, Heikki Penttinen, Ville Hautamäki
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2509.15950
Pdf URL: https://arxiv.org/pdf/2509.15950
Copy Paste: [[2509.15950]] Targeted Fine-Tuning of DNN-Based Receivers via Influence Functions(https://arxiv.org/abs/2509.15950)
Keywords: interpretability
Abstract: We present the first use of influence functions for deep learning-based wireless receivers. Applied to DeepRx, a fully convolutional receiver, influence analysis reveals which training samples drive bit predictions, enabling targeted fine-tuning of poorly performing cases. We show that loss-relative influence with capacity-like binary cross-entropy loss and first-order updates on beneficial samples most consistently improves bit error rate toward genie-aided performance, outperforming random fine-tuning in single-target scenarios. Multi-target adaptation proved less effective, underscoring open challenges. Beyond experiments, we connect influence to self-influence corrections and propose a second-order, influence-aligned update strategy. Our results establish influence functions as both an interpretability tool and a basis for efficient receiver adaptation.

Title: Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation

Authors: Zhangqi Jiang, Tingjin Luo, Xu Yang, Xinyan Liang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.15955
Pdf URL: https://arxiv.org/pdf/2509.15955
Copy Paste: [[2509.15955]] Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation(https://arxiv.org/abs/2509.15955)
Keywords: robust
Abstract: View missing remains a significant challenge in graph-based multi-view semi-supervised learning, hindering their real-world applications. To address this issue, traditional methods introduce a missing indicator matrix and focus on mining partial structure among existing samples in each view for label propagation (LP). However, we argue that these disregarded missing samples sometimes induce discontinuous local structures, i.e., sub-clusters, breaking the fundamental smoothness assumption in LP. Consequently, such a Sub-Cluster Problem (SCP) would distort graph fusion and degrade classification performance. To alleviate SCP, we propose a novel incomplete multi-view semi-supervised learning method, termed AGF-TI. Firstly, we design an adversarial graph fusion scheme to learn a robust consensus graph against the distorted local structure through a min-max framework. By stacking all similarity matrices into a tensor, we further recover the incomplete structure from the high-order consistency information based on the low-rank tensor learning. Additionally, the anchor-based strategy is incorporated to reduce the computational complexity. An efficient alternative optimization algorithm combining a reduced gradient descent method is developed to solve the formulated objective, with theoretical convergence. Extensive experimental results on various datasets validate the superiority of our proposed AGF-TI as compared to state-of-the-art methods. Code is available at this https URL.

Title: Localmax dynamics for attention in transformers and its asymptotic behavior

Authors: Henri Cimetière, Maria Teresa Chiri, Bahman Gharesifard
Subjects: cs.CL, cs.LG, math.DS, math.OC
Abstract URL: https://arxiv.org/abs/2509.15958
Pdf URL: https://arxiv.org/pdf/2509.15958
Copy Paste: [[2509.15958]] Localmax dynamics for attention in transformers and its asymptotic behavior(https://arxiv.org/abs/2509.15958)
Keywords: transformer
Abstract: We introduce a new discrete-time attention model, termed the localmax dynamics, which interpolates between the classic softmax dynamics and the hardmax dynamics, where only the tokens that maximize the influence toward a given token have a positive weight. As in hardmax, uniform weights are determined by a parameter controlling neighbor influence, but the key extension lies in relaxing neighborhood interactions through an alignment-sensitivity parameter, which allows controlled deviations from pure hardmax behavior. As we prove, while the convex hull of the token states still converges to a convex polytope, its structure can no longer be fully described by a maximal alignment set, prompting the introduction of quiescent sets to capture the invariant behavior of tokens near vertices. We show that these sets play a key role in understanding the asymptotic behavior of the system, even under time-varying alignment sensitivity parameters. We further show that localmax dynamics does not exhibit finite-time convergence and provide results for vanishing, nonzero, time-varying alignment-sensitivity parameters, recovering the limiting behavior of hardmax as a by-product. Finally, we adapt Lyapunov-based methods from classical opinion dynamics, highlighting their limitations in the asymmetric setting of localmax interactions and outlining directions for future research.

Title: A multi-temporal multi-spectral attention-augmented deep convolution neural network with contrastive learning for crop yield prediction

Authors: Shalini Dangi, Surya Karthikeya Mullapudi, Chandravardhan Singh Raghaw, Shahid Shafi Dar, Mohammad Zia Ur Rehman, Nagendra Kumar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.15966
Pdf URL: https://arxiv.org/pdf/2509.15966
Copy Paste: [[2509.15966]] A multi-temporal multi-spectral attention-augmented deep convolution neural network with contrastive learning for crop yield prediction(https://arxiv.org/abs/2509.15966)
Keywords: security
Abstract: Precise yield prediction is essential for agricultural sustainability and food security. However, climate change complicates accurate yield prediction by affecting major factors such as weather conditions, soil fertility, and farm management systems. Advances in technology have played an essential role in overcoming these challenges by leveraging satellite monitoring and data analysis for precise yield estimation. Current methods rely on spatio-temporal data for predicting crop yield, but they often struggle with multi-spectral data, which is crucial for evaluating crop health and growth patterns. To resolve this challenge, we propose a novel Multi-Temporal Multi-Spectral Yield Prediction Network, MTMS-YieldNet, that integrates spectral data with spatio-temporal information to effectively capture the correlations and dependencies between them. While existing methods that rely on pre-trained models trained on general visual data, MTMS-YieldNet utilizes contrastive learning for feature discrimination during pre-training, focusing on capturing spatial-spectral patterns and spatio-temporal dependencies from remote sensing data. Both quantitative and qualitative assessments highlight the excellence of the proposed MTMS-YieldNet over seven existing state-of-the-art methods. MTMS-YieldNet achieves MAPE scores of 0.336 on Sentinel-1, 0.353 on Landsat-8, and an outstanding 0.331 on Sentinel-2, demonstrating effective yield prediction performance across diverse climatic and seasonal conditions. The outstanding performance of MTMS-YieldNet improves yield predictions and provides valuable insights that can assist farmers in making better decisions, potentially improving crop yields.

Title: BEFT: Bias-Efficient Fine-Tuning of Language Models

Authors: Baichuan Huang, Ananth Balashankar, Amir Aminifar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2509.15974
Pdf URL: https://arxiv.org/pdf/2509.15974
Copy Paste: [[2509.15974]] BEFT: Bias-Efficient Fine-Tuning of Language Models(https://arxiv.org/abs/2509.15974)
Keywords: large language model
Abstract: Fine-tuning all-bias-terms stands out among various parameter-efficient fine-tuning (PEFT) techniques, owing to its out-of-the-box usability and competitive performance, especially in low-data regimes. Bias-only fine-tuning has the potential for unprecedented parameter efficiency. However, the link between fine-tuning different bias terms (i.e., bias terms in the query, key, or value projections) and downstream performance remains unclear. The existing approaches, e.g., based on the magnitude of bias change or empirical Fisher information, provide limited guidance for selecting the particular bias term for effective fine-tuning. In this paper, we propose an approach for selecting the bias term to be fine-tuned, forming the foundation of our bias-efficient fine-tuning (BEFT). We extensively evaluate our bias-efficient approach against other bias-selection approaches, across a wide range of large language models (LLMs) spanning encoder-only and decoder-only architectures from 110M to 6.7B parameters. Our results demonstrate the effectiveness and superiority of our bias-efficient approach on diverse downstream tasks, including classification, multiple-choice, and generation tasks.

Title: Shedding Light on Depth: Explainability Assessment in Monocular Depth Estimation

Authors: Lorenzo Cirillo, Claudio Schiavella, Lorenzo Papa, Paolo Russo, Irene Amerini
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.15980
Pdf URL: https://arxiv.org/pdf/2509.15980
Copy Paste: [[2509.15980]] Shedding Light on Depth: Explainability Assessment in Monocular Depth Estimation(https://arxiv.org/abs/2509.15980)
Keywords: explainability
Abstract: Explainable artificial intelligence is increasingly employed to understand the decision-making process of deep learning models and create trustworthiness in their adoption. However, the explainability of Monocular Depth Estimation (MDE) remains largely unexplored despite its wide deployment in real-world applications. In this work, we study how to analyze MDE networks to map the input image to the predicted depth map. More in detail, we investigate well-established feature attribution methods, Saliency Maps, Integrated Gradients, and Attention Rollout on different computationally complex models for MDE: METER, a lightweight network, and PixelFormer, a deep network. We assess the quality of the generated visual explanations by selectively perturbing the most relevant and irrelevant pixels, as identified by the explainability methods, and analyzing the impact of these perturbations on the model's output. Moreover, since existing evaluation metrics can have some limitations in measuring the validity of visual explanations for MDE, we additionally introduce the Attribution Fidelity. This metric evaluates the reliability of the feature attribution by assessing their consistency with the predicted depth map. Experimental results demonstrate that Saliency Maps and Integrated Gradients have good performance in highlighting the most important input features for MDE lightweight and deep models, respectively. Furthermore, we show that Attribution Fidelity effectively identifies whether an explainability method fails to produce reliable visual maps, even in scenarios where conventional metrics might suggest satisfactory results.

Title: Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations

Authors: Yujie Zhu, Charles A. Hepburn, Matthew Thorpe, Giovanni Montana
Subjects: cs.LG, cs.AI, cs.RO, stat.ML
Abstract URL: https://arxiv.org/abs/2509.15981
Pdf URL: https://arxiv.org/pdf/2509.15981
Copy Paste: [[2509.15981]] Uncertainty-Based Smooth Policy Regularisation for Reinforcement Learning with Few Demonstrations(https://arxiv.org/abs/2509.15981)
Keywords: robust
Abstract: In reinforcement learning with sparse rewards, demonstrations can accelerate learning, but determining when to imitate them remains challenging. We propose Smooth Policy Regularisation from Demonstrations (SPReD), a framework that addresses the fundamental question: when should an agent imitate a demonstration versus follow its own policy? SPReD uses ensemble methods to explicitly model Q-value distributions for both demonstration and policy actions, quantifying uncertainty for comparisons. We develop two complementary uncertainty-aware methods: a probabilistic approach estimating the likelihood of demonstration superiority, and an advantage-based approach scaling imitation by statistical significance. Unlike prevailing methods (e.g. Q-filter) that make binary imitation decisions, SPReD applies continuous, uncertainty-proportional regularisation weights, reducing gradient variance during training. Despite its computational simplicity, SPReD achieves remarkable gains in experiments across eight robotics tasks, outperforming existing approaches by up to a factor of 14 in complex tasks while maintaining robustness to demonstration quality and quantity. Our code is available at this https URL.

Title: Towards Robust Visual Continual Learning with Multi-Prototype Supervision

Authors: Xiwei Liu, Yulong Li, Yichen Li, Xinlin Zhuang, Haolin Yang, Huifa Li, Imran Razzak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16011
Pdf URL: https://arxiv.org/pdf/2509.16011
Copy Paste: [[2509.16011]] Towards Robust Visual Continual Learning with Multi-Prototype Supervision(https://arxiv.org/abs/2509.16011)
Keywords: robust
Abstract: Language-guided supervision, which utilizes a frozen semantic target from a Pretrained Language Model (PLM), has emerged as a promising paradigm for visual Continual Learning (CL). However, relying on a single target introduces two critical limitations: 1) semantic ambiguity, where a polysemous category name results in conflicting visual representations, and 2) intra-class visual diversity, where a single prototype fails to capture the rich variety of visual appearances within a class. To this end, we propose MuproCL, a novel framework that replaces the single target with multiple, context-aware prototypes. Specifically, we employ a lightweight LLM agent to perform category disambiguation and visual-modal expansion to generate a robust set of semantic prototypes. A LogSumExp aggregation mechanism allows the vision model to adaptively align with the most relevant prototype for a given image. Extensive experiments across various CL baselines demonstrate that MuproCL consistently enhances performance and robustness, establishing a more effective path for language-guided continual learning.

Title: DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching

Authors: Meng Yang, Fan Fan, Zizhuo Li, Songchu Deng, Yong Ma, Jiayi Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16017
Pdf URL: https://arxiv.org/pdf/2509.16017
Copy Paste: [[2509.16017]] DistillMatch: Leveraging Knowledge Distillation from Vision Foundation Model for Multimodal Image Matching(https://arxiv.org/abs/2509.16017)
Keywords: robust
Abstract: Multimodal image matching seeks pixel-level correspondences between images of different modalities, crucial for cross-modal perception, fusion and analysis. However, the significant appearance differences between modalities make this task challenging. Due to the scarcity of high-quality annotated datasets, existing deep learning methods that extract modality-common features for matching perform poorly and lack adaptability to diverse scenarios. Vision Foundation Model (VFM), trained on large-scale data, yields generalizable and robust feature representations adapted to data and tasks of various modalities, including multimodal matching. Thus, we propose DistillMatch, a multimodal image matching method using knowledge distillation from VFM. DistillMatch employs knowledge distillation to build a lightweight student model that extracts high-level semantic features from VFM (including DINOv2 and DINOv3) to assist matching across modalities. To retain modality-specific information, it extracts and injects modality category information into the other modality's features, which enhances the model's understanding of cross-modal correlations. Furthermore, we design V2I-GAN to boost the model's generalization by translating visible to pseudo-infrared images for data augmentation. Experiments show that DistillMatch outperforms existing algorithms on public datasets.

Title: Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning

Authors: Hong-Yun Lin, Jhen-Ke Lin, Chung-Chun Wang, Hao-Chien Lu, Berlin Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16025
Pdf URL: https://arxiv.org/pdf/2509.16025
Copy Paste: [[2509.16025]] Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning(https://arxiv.org/abs/2509.16025)
Keywords: robust
Abstract: Spoken Language Assessment (SLA) estimates a learner's oral proficiency from spontaneous speech. The growing population of L2 English speakers has intensified the demand for reliable SLA, a critical component of Computer Assisted Language Learning (CALL). Existing efforts often rely on cascaded pipelines, which are prone to error propagation, or end-to-end models that often operate on a short audio window, which might miss discourse-level evidence. This paper introduces a novel multimodal foundation model approach that performs session-level evaluation in a single pass. Our approach couples multi-target learning with a frozen, Whisper ASR model-based speech prior for acoustic-aware calibration, allowing for jointly learning holistic and trait-level objectives of SLA without resorting to handcrafted features. By coherently processing the entire response session of an L2 speaker, the model excels at predicting holistic oral proficiency. Experiments conducted on the Speak & Improve benchmark demonstrate that our proposed approach outperforms the previous state-of-the-art cascaded system and exhibits robust cross-part generalization, producing a compact deployable grader that is tailored for CALL applications.

Title: Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech

Authors: Sang Hoon Woo, Sehun Lee, Kang-wook Kim, Gunhee Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16028
Pdf URL: https://arxiv.org/pdf/2509.16028
Copy Paste: [[2509.16028]] Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech(https://arxiv.org/abs/2509.16028)
Keywords: large language model
Abstract: Spoken dialogue systems increasingly employ large language models (LLMs) to leverage their advanced reasoning capabilities. However, direct application of LLMs in spoken communication often yield suboptimal results due to mismatches between optimal textual and verbal delivery. While existing approaches adapt LLMs to produce speech-friendly outputs, their impact on reasoning performance remains underexplored. In this work, we propose Think-Verbalize-Speak, a framework that decouples reasoning from spoken delivery to preserve the full reasoning capacity of LLMs. Central to our method is verbalizing, an intermediate step that translates thoughts into natural, speech-ready text. We also introduce ReVerT, a latency-efficient verbalizer based on incremental and asynchronous summarization. Experiments across multiple benchmarks show that our method enhances speech naturalness and conciseness with minimal impact on reasoning. The project page with the dataset and the source code is available at this https URL

Title: A High-performance Real-time Container File Monitoring Approach Based on Virtual Machine Introspection

Authors: Kai Tan, Dongyang Zhan, Lin Ye, Hongli Zhang, Binxing Fang, Zhihong Tian
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2509.16030
Pdf URL: https://arxiv.org/pdf/2509.16030
Copy Paste: [[2509.16030]] A High-performance Real-time Container File Monitoring Approach Based on Virtual Machine Introspection(https://arxiv.org/abs/2509.16030)
Keywords: secure, security, protect, attack
Abstract: As cloud computing continues to advance and become an integral part of modern IT infrastructure, container security has emerged as a critical factor in ensuring the smooth operation of cloud-native applications. An attacker can attack the service in the container or even perform the container escape attack by tampering with the files. Monitoring container files is important for APT detection and cyberspace security. Existing file monitoring methods are usually based on host operating system or virtual machine introspection to protect file security in real time. The methods based on the host operating system usually monitor file operations in the host operating system. However, when the container escapes to the host, the host operating system will no longer be secure, so these methods face the problem of weak security. Aiming at the problems of low security and high overload introduced in existing container file monitoring, a high-performance container file monitoring method based on virtual machine introspection is proposed. The experimental results show that the proposed approach can effectively monitor the container files and introduce an acceptable monitoring overload.

Title: GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition

Authors: Tianyue Wang, Shuang Yang, Shiguang Shan, Xilin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16031
Pdf URL: https://arxiv.org/pdf/2509.16031
Copy Paste: [[2509.16031]] GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition(https://arxiv.org/abs/2509.16031)
Keywords: robust, extraction
Abstract: Visual speech recognition (VSR), also known as lip reading, is the task of recognizing speech from silent video. Despite significant advancements in VSR over recent decades, most existing methods pay limited attention to real-world visual challenges such as illumination variations, occlusions, blurring, and pose changes. To address these challenges, we propose GLip, a Global-Local Integrated Progressive framework designed for robust VSR. GLip is built upon two key insights: (i) learning an initial \textit{coarse} alignment between visual features across varying conditions and corresponding speech content facilitates the subsequent learning of \textit{precise} visual-to-speech mappings in challenging environments; (ii) under adverse conditions, certain local regions (e.g., non-occluded areas) often exhibit more discriminative cues for lip reading than global features. To this end, GLip introduces a dual-path feature extraction architecture that integrates both global and local features within a two-stage progressive learning framework. In the first stage, the model learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data, establishing a coarse yet semantically robust foundation. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context across both spatial and temporal dimensions, refining the coarse representations into precise visual-speech mappings. Our framework uniquely exploits discriminative local regions through a progressive learning strategy, demonstrating enhanced robustness against various visual challenges and consistently outperforming existing methods on the LRS2 and LRS3 benchmarks. We further validate its effectiveness on a newly introduced challenging Mandarin dataset.

Title: ConCap: Practical Network Traffic Generation for Flow-based Intrusion Detection Systems

Authors: Miel Verkerken, Laurens D'hooge, Bruno Volckaert, Filip De Turck, Giovanni Apruzzese
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2509.16038
Pdf URL: https://arxiv.org/pdf/2509.16038
Copy Paste: [[2509.16038]] ConCap: Practical Network Traffic Generation for Flow-based Intrusion Detection Systems(https://arxiv.org/abs/2509.16038)
Keywords: security, attack
Abstract: Network Intrusion Detection Systems (NIDS) have been studied in research for almost four decades. Yet, despite thousands of papers claiming scientific advances, a non-negligible number of recent works suggest that the findings of prior literature may be questionable. At the root of such a disagreement is the well-known challenge of obtaining data representative of a real-world network-and, hence, usable for security assessments. We tackle such a challenge in this paper. We propose ConCap, a practical tool meant to facilitate experimental research on NIDS. Through ConCap, a researcher can set up an isolated and lightweight network environment and configure it to produce network-related data, such as packets or NetFlows, that are automatically labeled, hence ready for fine-grained experiments. ConCap is rooted on open-source software and is designed to foster experimental reproducibility across the scientific community by sharing just one configuration file. Through comprehensive experiments on 10 different network activities, further expanded via in-depth analyses of 21 variants of two specific activities and of 100 repetitions of four other ones, we empirically verify that ConCap produces network data resembling that of a real-world network. We also carry out experiments on well-known benchmark datasets as well as on a real "smart-home" network, showing that, from a cyber-detection viewpoint, ConCap's automatically-labeled NetFlows are functionally equivalent to those collected in other environments. Finally, we show that ConCap enables to safely reproduce sophisticated attack chains (e.g., to test/enhance existing NIDS). Altogether, ConCap is a solution to the "data problem" that is plaguing NIDS research.

Title: Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model

Authors: Jihua Peng, Qianxiong Xu, Yichen Liu, Chenxi Liu, Cheng Long, Rui Zhao, Ziyue Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16054
Pdf URL: https://arxiv.org/pdf/2509.16054
Copy Paste: [[2509.16054]] Language-Instructed Reasoning for Group Activity Detection via Multimodal Large Language Model(https://arxiv.org/abs/2509.16054)
Keywords: explainability, transformer, large language model
Abstract: Group activity detection (GAD) aims to simultaneously identify group members and categorize their collective activities within video sequences. Existing deep learning-based methods develop specialized architectures (e.g., transformer networks) to model the dynamics of individual roles and semantic dependencies between individuals and groups. However, they rely solely on implicit pattern recognition from visual features and struggle with contextual reasoning and explainability. In this work, we propose LIR-GAD, a novel framework of language-instructed reasoning for GAD via Multimodal Large Language Model (MLLM). Our approach expand the original vocabulary of MLLM by introducing an activity-level token and multiple cluster-specific tokens. We process video frames alongside two specially designed tokens and language instructions, which are then integrated into the MLLM. The pretrained commonsense knowledge embedded in the MLLM enables the token and tokens to effectively capture the semantic information of collective activities and learn distinct representational features of different groups, respectively. Also, we introduce a multi-label classification loss to further enhance the token's ability to learn discriminative semantic representations. Then, we design a Multimodal Dual-Alignment Fusion (MDAF) module that integrates MLLM's hidden embeddings corresponding to the designed tokens with visual features, significantly enhancing the performance of GAD. Both quantitative and qualitative experiments demonstrate the superior performance of our proposed method in GAD taks.

Title: SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection

Authors: Maithili Joshi, Palash Nandi, Tanmoy Chakraborty
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2509.16060
Pdf URL: https://arxiv.org/pdf/2509.16060
Copy Paste: [[2509.16060]] SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection(https://arxiv.org/abs/2509.16060)
Keywords: attack, robust, large language model
Abstract: Large Language Models (LLMs) with safe-alignment training are powerful instruments with robust language comprehension capabilities. These models typically undergo meticulous alignment procedures involving human feedback to ensure the acceptance of safe inputs while rejecting harmful or unsafe ones. However, despite their massive scale and alignment efforts, LLMs remain vulnerable to jailbreak attacks, where malicious users manipulate the model to produce harmful outputs that it was explicitly trained to avoid. In this study, we find that the safety mechanisms in LLMs are predominantly embedded in the middle-to-late layers. Building on this insight, we introduce a novel white-box jailbreak method, SABER (Safety Alignment Bypass via Extra Residuals), which connects two intermediate layers $s$ and $e$ such that $s < e$, through a residual connection. Our approach achieves a 51% improvement over the best-performing baseline on the HarmBench test set. Furthermore, SABER induces only a marginal shift in perplexity when evaluated on the HarmBench validation set. The source code is publicly available at this https URL.

Title: Communications to Circulations: 3D Wind Field Retrieval and Real-Time Prediction Using 5G GNSS Signals and Deep Learning

Authors: Yuchen Ye, Hong Liang, Chaoxia Yuan, Mingyu Li, Aoqi Zhou, Chunqing Shang, Hua Cai, Peixi Liu, Kezuan Wang, Yifeng Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16068
Pdf URL: https://arxiv.org/pdf/2509.16068
Copy Paste: [[2509.16068]] Communications to Circulations: 3D Wind Field Retrieval and Real-Time Prediction Using 5G GNSS Signals and Deep Learning(https://arxiv.org/abs/2509.16068)
Keywords: robust, transformer
Abstract: Accurate atmospheric wind field information is crucial for various applications, including weather forecasting, aviation safety, and disaster risk reduction. However, obtaining high spatiotemporal resolution wind data remains challenging due to limitations in traditional in-situ observations and remote sensing techniques, as well as the computational expense and biases of numerical weather prediction (NWP) models. This paper introduces G-WindCast, a novel deep learning framework that leverages signal strength variations from 5G Global Navigation Satellite System (GNSS) signals to retrieve and forecast three-dimensional (3D) atmospheric wind fields. The framework utilizes Forward Neural Networks (FNN) and Transformer networks to capture complex, nonlinear, and spatiotemporal relationships between GNSS-derived features and wind dynamics. Our preliminary results demonstrate promising accuracy in both wind retrieval and short-term wind forecasting (up to 30 minutes lead time), with skill scores comparable to high-resolution NWP outputs in certain scenarios. The model exhibits robustness across different forecast horizons and pressure levels, and its predictions for wind speed and direction show superior agreement with observations compared to concurrent ERA5 reanalysis data. Furthermore, we show that the system can maintain excellent performance for localized forecasting even with a significantly reduced number of GNSS stations (e.g., around 100), highlighting its cost-effectiveness and scalability. This interdisciplinary approach underscores the transformative potential of exploiting non-traditional data sources and deep learning for advanced environmental monitoring and real-time atmospheric applications.

Title: Rethinking Molecule Synthesizability with Chain-of-Reaction

Authors: Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Saee Paliwal, Weili Nie, Arash Vahdat
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.16084
Pdf URL: https://arxiv.org/pdf/2509.16084
Copy Paste: [[2509.16084]] Rethinking Molecule Synthesizability with Chain-of-Reaction(https://arxiv.org/abs/2509.16084)
Keywords: generative, large language model
Abstract: A well-known pitfall of molecular generative models is that they are not guaranteed to generate synthesizable molecules. There have been considerable attempts to address this problem, but given the exponentially large combinatorial space of synthesizable molecules, existing methods have shown limited coverage of the space and poor molecular optimization performance. To tackle these problems, we introduce ReaSyn, a generative framework for synthesizable projection where the model explores the neighborhood of given molecules in the synthesizable space by generating pathways that result in synthesizable analogs. To fully utilize the chemical knowledge contained in the synthetic pathways, we propose a novel perspective that views synthetic pathways akin to reasoning paths in large language models (LLMs). Specifically, inspired by chain-of-thought (CoT) reasoning in LLMs, we introduce the chain-of-reaction (CoR) notation that explicitly states reactants, reaction types, and intermediate products for each step in a pathway. With the CoR notation, ReaSyn can get dense supervision in every reaction step to explicitly learn chemical reaction rules during supervised training and perform step-by-step reasoning. In addition, to further enhance the reasoning capability of ReaSyn, we propose reinforcement learning (RL)-based finetuning and goal-directed test-time compute scaling tailored for synthesizable projection. ReaSyn achieves the highest reconstruction rate and pathway diversity in synthesizable molecule reconstruction and the highest optimization performance in synthesizable goal-directed molecular optimization, and significantly outperforms previous synthesizable projection methods in synthesizable hit expansion. These results highlight ReaSyn's superior ability to navigate combinatorially-large synthesizable chemical space.

Title: See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model

Authors: Pengteng Li, Pinhao Song, Wuyang Li, Weiyu Guo, Huizai Yao, Yijie Xu, Dugang Liu, Hui Xiong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16087
Pdf URL: https://arxiv.org/pdf/2509.16087
Copy Paste: [[2509.16087]] See&Trek: Training-Free Spatial Prompting for Multimodal Large Language Model(https://arxiv.org/abs/2509.16087)
Keywords: large language model
Abstract: We introduce SEE&TREK, the first training-free prompting framework tailored to enhance the spatial understanding of Multimodal Large Language Models (MLLMS) under vision-only constraints. While prior efforts have incorporated modalities like depth or point clouds to improve spatial reasoning, purely visualspatial understanding remains underexplored. SEE&TREK addresses this gap by focusing on two core principles: increasing visual diversity and motion reconstruction. For visual diversity, we conduct Maximum Semantic Richness Sampling, which employs an off-the-shell perception model to extract semantically rich keyframes that capture scene structure. For motion reconstruction, we simulate visual trajectories and encode relative spatial positions into keyframes to preserve both spatial relations and temporal coherence. Our method is training&GPU-free, requiring only a single forward pass, and can be seamlessly integrated into existing MLLM'S. Extensive experiments on the VSI-B ENCH and STI-B ENCH show that S EE &T REK consistently boosts various MLLM S performance across diverse spatial reasoning tasks with the most +3.5% improvement, offering a promising path toward stronger spatial intelligence.

Title: Randomized Smoothing Meets Vision-Language Models

Authors: Emmanouil Seferis, Changshun Wu, Stefanos Kollias, Saddek Bensalem, Chih-Hong Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.16088
Pdf URL: https://arxiv.org/pdf/2509.16088
Copy Paste: [[2509.16088]] Randomized Smoothing Meets Vision-Language Models(https://arxiv.org/abs/2509.16088)
Keywords: attack, robust, generative
Abstract: Randomized smoothing (RS) is one of the prominent techniques to ensure the correctness of machine learning models, where point-wise robustness certificates can be derived analytically. While RS is well understood for classification, its application to generative models is unclear, since their outputs are sequences rather than labels. We resolve this by connecting generative outputs to an oracle classification task and showing that RS can still be enabled: the final response can be classified as a discrete action (e.g., service-robot commands in VLAs), as harmful vs. harmless (content moderation or toxicity detection in VLMs), or even applying oracles to cluster answers into semantically equivalent ones. Provided that the error rate for the oracle classifier comparison is bounded, we develop the theory that associates the number of samples with the corresponding robustness radius. We further derive improved scaling laws analytically relating the certified radius and accuracy to the number of samples, showing that the earlier result of 2 to 3 orders of magnitude fewer samples sufficing with minimal loss remains valid even under weaker assumptions. Together, these advances make robustness certification both well-defined and computationally feasible for state-of-the-art VLMs, as validated against recent jailbreak-style adversarial attacks.

Title: Blind-Spot Guided Diffusion for Self-supervised Real-World Denoising

Authors: Shen Cheng, Haipeng Li, Haibin Huang, Xiaohong Liu, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16091
Pdf URL: https://arxiv.org/pdf/2509.16091
Copy Paste: [[2509.16091]] Blind-Spot Guided Diffusion for Self-supervised Real-World Denoising(https://arxiv.org/abs/2509.16091)
Keywords: diffusion
Abstract: In this work, we present Blind-Spot Guided Diffusion, a novel self-supervised framework for real-world image denoising. Our approach addresses two major challenges: the limitations of blind-spot networks (BSNs), which often sacrifice local detail and introduce pixel discontinuities due to spatial independence assumptions, and the difficulty of adapting diffusion models to self-supervised denoising. We propose a dual-branch diffusion framework that combines a BSN-based diffusion branch, generating semi-clean images, with a conventional diffusion branch that captures underlying noise distributions. To enable effective training without paired data, we use the BSN-based branch to guide the sampling process, capturing noise structure while preserving local details. Extensive experiments on the SIDD and DND datasets demonstrate state-of-the-art performance, establishing our method as a highly effective self-supervised solution for real-world denoising. Code and pre-trained models are released at: this https URL.

Title: SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features

Authors: Jinyuan Qu, Hongyang Li, Xingyu Chen, Shilong Liu, Yukai Shi, Tianhe Ren, Ruitao Jing, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16098
Pdf URL: https://arxiv.org/pdf/2509.16098
Copy Paste: [[2509.16098]] SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features(https://arxiv.org/abs/2509.16098)
Keywords: transformer, segmentation
Abstract: In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. As 3D training data is generally not as sufficient as 2D training images, SegDINO3D is designed to fully leverage 2D representation from a pre-trained 2D detection model, including both image-level and object-level features, for improving 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in the memory while faithfully preserving the knowledge of the pre-trained 2D model. The introducing of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves the state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.

Title: Personalized Federated Learning with Heat-Kernel Enhanced Tensorized Multi-View Clustering

Authors: Kristina P. Sinaga
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2509.16101
Pdf URL: https://arxiv.org/pdf/2509.16101
Copy Paste: [[2509.16101]] Personalized Federated Learning with Heat-Kernel Enhanced Tensorized Multi-View Clustering(https://arxiv.org/abs/2509.16101)
Keywords: privacy, robust, federate
Abstract: We present a robust personalized federated learning framework that leverages heat-kernel enhanced tensorized multi-view fuzzy c-means clustering with advanced tensor decomposition techniques. Our approach integrates heat-kernel coefficients adapted from quantum field theory with Tucker decomposition and canonical polyadic decomposition (CANDECOMP/PARAFAC) to transform conventional distance metrics and efficiently represent high-dimensional multi-view structures. The framework employs matriculation and vectorization techniques to facilitate the discovery of hidden structures and multilinear relationships via N-way generalized tensors. The proposed method introduces a dual-level optimization scheme: local heat-kernel enhanced fuzzy clustering with tensor decomposition operating on order-N input tensors, and federated aggregation of tensor factors with privacy-preserving personalization mechanisms. The local stage employs tensorized kernel Euclidean distance transformations and Tucker decomposition to discover client-specific patterns in multi-view tensor data, while the global aggregation process coordinates tensor factors (core tensors and factor matrices) across clients through differential privacy-preserving protocols. This tensorized approach enables efficient handling of high-dimensional multi-view data with significant communication savings through low-rank tensor approximations.

Title: It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge

Authors: Lukas Ellinger, Georg Groh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2509.16107
Pdf URL: https://arxiv.org/pdf/2509.16107
Copy Paste: [[2509.16107]] It Depends: Resolving Referential Ambiguity in Minimal Contexts with Commonsense Knowledge(https://arxiv.org/abs/2509.16107)
Keywords: robust, large language model
Abstract: Ambiguous words or underspecified references require interlocutors to resolve them, often by relying on shared context and commonsense knowledge. Therefore, we systematically investigate whether Large Language Models (LLMs) can leverage commonsense to resolve referential ambiguity in multi-turn conversations and analyze their behavior when ambiguity persists. Further, we study how requests for simplified language affect this capacity. Using a novel multilingual evaluation dataset, we test DeepSeek v3, GPT-4o, Qwen3-32B, GPT-4o-mini, and Llama-3.1-8B via LLM-as-Judge and human annotations. Our findings indicate that current LLMs struggle to resolve ambiguity effectively: they tend to commit to a single interpretation or cover all possible references, rather than hedging or seeking clarification. This limitation becomes more pronounced under simplification prompts, which drastically reduce the use of commonsense reasoning and diverse response strategies. Fine-tuning Llama-3.1-8B with Direct Preference Optimization substantially improves ambiguity resolution across all request types. These results underscore the need for advanced fine-tuning to improve LLMs' handling of ambiguity and to ensure robust performance across diverse communication styles.

Title: CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion

Authors: Sheng Zhang, Yifan Ding, Shuquan Lian, Shun Song, Hui Li
Subjects: cs.CL, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2509.16112
Pdf URL: https://arxiv.org/pdf/2509.16112
Copy Paste: [[2509.16112]] CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion(https://arxiv.org/abs/2509.16112)
Keywords: large language model
Abstract: Repository-level code completion automatically predicts the unfinished code based on the broader information from the repository. Recent strides in Code Large Language Models (code LLMs) have spurred the development of repository-level code completion methods, yielding promising results. Nevertheless, they suffer from issues such as inappropriate query construction, single-path code retrieval, and misalignment between code retriever and code LLM. To address these problems, we introduce CodeRAG, a framework tailored to identify relevant and necessary knowledge for retrieval-augmented repository-level code completion. Its core components include log probability guided query construction, multi-path code retrieval, and preference-aligned BestFit reranking. Extensive experiments on benchmarks ReccEval and CCEval demonstrate that CodeRAG significantly and consistently outperforms state-of-the-art methods. The implementation of CodeRAG is available at this https URL.

Title: DiffusionNFT: Online Diffusion Reinforcement with Forward Process

Authors: Kaiwen Zheng, Huayu Chen, Haotian Ye, Haoxiang Wang, Qinsheng Zhang, Kai Jiang, Hang Su, Stefano Ermon, Jun Zhu, Ming-Yu Liu
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2509.16117
Pdf URL: https://arxiv.org/pdf/2509.16117
Copy Paste: [[2509.16117]] DiffusionNFT: Online Diffusion Reinforcement with Forward Process(https://arxiv.org/abs/2509.16117)
Keywords: diffusion
Abstract: Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks, including solver restrictions, forward-reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and additional CFG employment. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.

Title: RadarGaussianDet3D: An Efficient and Effective Gaussian-based 3D Detector with 4D Automotive Radars

Authors: Weiyi Xiong, Bing Zhu, Tao Huang, Zewei Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16119
Pdf URL: https://arxiv.org/pdf/2509.16119
Copy Paste: [[2509.16119]] RadarGaussianDet3D: An Efficient and Effective Gaussian-based 3D Detector with 4D Automotive Radars(https://arxiv.org/abs/2509.16119)
Keywords: robust, extraction
Abstract: 4D automotive radars have gained increasing attention for autonomous driving due to their low cost, robustness, and inherent velocity measurement capability. However, existing 4D radar-based 3D detectors rely heavily on pillar encoders for BEV feature extraction, where each point contributes to only a single BEV grid, resulting in sparse feature maps and degraded representation quality. In addition, they also optimize bounding box attributes independently, leading to sub-optimal detection accuracy. Moreover, their inference speed, while sufficient for high-end GPUs, may fail to meet the real-time requirement on vehicle-mounted embedded devices. To overcome these limitations, an efficient and effective Gaussian-based 3D detector, namely RadarGaussianDet3D is introduced, leveraging Gaussian primitives and distributions as intermediate representations for radar points and bounding boxes. In RadarGaussianDet3D, a novel Point Gaussian Encoder (PGE) is designed to transform each point into a Gaussian primitive after feature aggregation and employs the 3D Gaussian Splatting (3DGS) technique for BEV rasterization, yielding denser feature maps. PGE exhibits exceptionally low latency, owing to the optimized algorithm for point feature aggregation and fast rendering of 3DGS. In addition, a new Box Gaussian Loss (BGL) is proposed, which converts bounding boxes into 3D Gaussian distributions and measures their distance to enable more comprehensive and consistent optimization. Extensive experiments on TJ4DRadSet and View-of-Delft demonstrate that RadarGaussianDet3D achieves state-of-the-art detection accuracy while delivering substantially faster inference, highlighting its potential for real-time deployment in autonomous driving.

Title: Network-Based Detection of Autism Spectrum Disorder Using Sustainable and Non-invasive Salivary Biomarkers

Authors: Janayna M. Fernandes, Robinson Sabino-Silva, Murillo G. Carneiro
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16126
Pdf URL: https://arxiv.org/pdf/2509.16126
Copy Paste: [[2509.16126]] Network-Based Detection of Autism Spectrum Disorder Using Sustainable and Non-invasive Salivary Biomarkers(https://arxiv.org/abs/2509.16126)
Keywords: robust
Abstract: Autism Spectrum Disorder (ASD) lacks reliable biological markers, delaying early diagnosis. Using 159 salivary samples analyzed by ATR-FTIR spectroscopy, we developed GANet, a genetic algorithm-based network optimization framework leveraging PageRank and Degree for importance-based feature characterization. GANet systematically optimizes network structure to extract meaningful patterns from high-dimensional spectral data. It achieved superior performance compared to linear discriminant analysis, support vector machines, and deep learning models, reaching 0.78 accuracy, 0.61 sensitivity, 0.90 specificity, and a 0.74 harmonic mean. These results demonstrate GANet's potential as a robust, bio-inspired, non-invasive tool for precise ASD detection and broader spectral-based health applications.

Title: BaseReward: A Strong Baseline for Multimodal Reward Model

Authors: Yi-Fan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Haotian Wang, Kai Wu, Bo Cui, Xu Wang, Jianfei Pan, Haotian Wang, Zhang Zhang, Liang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16127
Pdf URL: https://arxiv.org/pdf/2509.16127
Copy Paste: [[2509.16127]] BaseReward: A Strong Baseline for Multimodal Reward Model(https://arxiv.org/abs/2509.16127)
Keywords: robust, generative, large language model
Abstract: The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear ``recipe'' for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including \textit{reward modeling paradigms} (e.g., Naive-RM, Critic-based RM, and Generative RM), \textit{reward head architecture}, \textit{training strategies}, \textit{data curation} (covering over ten multimodal and text-only preference datasets), \textit{backbone model} and \textit{model scale}, and \textit{ensemble methods}. Based on these experimental insights, we introduce \textbf{BaseReward}, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a {Qwen2.5-VL} backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new SOTA on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM's performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically-backed guide for developing robust reward models for the next generation of MLLMs.

Title: Dynamic Classifier-Free Diffusion Guidance via Online Feedback

Authors: Pinelopi Papalampidi, Olivia Wiles, Ira Ktena, Aleksandar Shtedritski, Emanuele Bugliarello, Ivana Kajic, Isabela Albuquerque, Aida Nematzadeh
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2509.16131
Pdf URL: https://arxiv.org/pdf/2509.16131
Copy Paste: [[2509.16131]] Dynamic Classifier-Free Diffusion Guidance via Online Feedback(https://arxiv.org/abs/2509.16131)
Keywords: diffusion
Abstract: Classifier-free guidance (CFG) is a cornerstone of text-to-image diffusion models, yet its effectiveness is limited by the use of static guidance scales. This "one-size-fits-all" approach fails to adapt to the diverse requirements of different prompts; moreover, prior solutions like gradient-based correction or fixed heuristic schedules introduce additional complexities and fail to generalize. In this work, we challeng this static paradigm by introducing a framework for dynamic CFG scheduling. Our method leverages online feedback from a suite of general-purpose and specialized small-scale latent-space evaluations, such as CLIP for alignment, a discriminator for fidelity and a human preference reward model, to assess generation quality at each step of the reverse diffusion process. Based on this feedback, we perform a greedy search to select the optimal CFG scale for each timestep, creating a unique guidance schedule tailored to every prompt and sample. We demonstrate the effectiveness of our approach on both small-scale models and the state-of-the-art Imagen 3, showing significant improvements in text alignment, visual quality, text rendering and numerical reasoning. Notably, when compared against the default Imagen 3 baseline, our method achieves up to 53.8% human preference win-rate for overall preference, a figure that increases up to to 55.5% on prompts targeting specific capabilities like text rendering. Our work establishes that the optimal guidance schedule is inherently dynamic and prompt-dependent, and provides an efficient and generalizable framework to achieve it.

Title: Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media

Authors: M. Giselle Fernández-Godino, Meir H. Shachar, Kevin Korner, Jonathan L. Belof, Mukul Kumar, Jonathan Lind, William J. Schill
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.16139
Pdf URL: https://arxiv.org/pdf/2509.16139
Copy Paste: [[2509.16139]] Spatio-temporal, multi-field deep learning of shock propagation in meso-structured media(https://arxiv.org/abs/2509.16139)
Keywords: security, defense
Abstract: The ability to predict how shock waves traverse porous and architected materials is a decisive factor in planetary defense, national security, and the race to achieve inertial fusion energy. Yet capturing pore collapse, anomalous Hugoniot responses, and localized heating -- phenomena that can determine the success of asteroid deflection or fusion ignition -- has remained a major challenge despite recent advances in single-field and reduced representations. We introduce a multi-field spatio-temporal deep learning model (MSTM) that unifies seven coupled fields -- pressure, density, temperature, energy, material distribution, and two velocity components -- into a single autoregressive surrogate. Trained on high-fidelity hydrocode data, MSTM runs about a thousand times faster than direct simulation, achieving errors below 4\% in porous materials and below 10\% in lattice structures. Unlike prior single-field or operator-based surrogates, MSTM resolves sharp shock fronts while preserving integrated quantities such as mass-averaged pressure and temperature to within 5\%. This advance transforms problems once considered intractable into tractable design studies, establishing a practical framework for optimizing meso-structured materials in planetary impact mitigation, inertial fusion energy, and national security.

Title: AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models

Authors: Vatsal Malaviya, Agneet Chatterjee, Maitreya Patel, Yezhou Yang, Chitta Baral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16141
Pdf URL: https://arxiv.org/pdf/2509.16141
Copy Paste: [[2509.16141]] AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models(https://arxiv.org/abs/2509.16141)
Keywords: large language model
Abstract: Text-to-Image (T2I) models have recently achieved remarkable success in generating images from textual descriptions. However, challenges still persist in accurately rendering complex scenes where actions and interactions form the primary semantic focus. Our key observation in this work is that T2I models frequently struggle to capture nuanced and often implicit attributes inherent in action depiction, leading to generating images that lack key contextual details. To enable systematic evaluation, we introduce AcT2I, a benchmark designed to evaluate the performance of T2I models in generating images from action-centric prompts. We experimentally validate that leading T2I models do not fare well on AcT2I. We further hypothesize that this shortcoming arises from the incomplete representation of the inherent attributes and contextual dependencies in the training corpora of existing T2I models. We build upon this by developing a training-free, knowledge distillation technique utilizing Large Language Models to address this limitation. Specifically, we enhance prompts by incorporating dense information across three dimensions, observing that injecting prompts with temporal details significantly improves image generation accuracy, with our best model achieving an increase of 72%. Our findings highlight the limitations of current T2I methods in generating images that require complex reasoning and demonstrate that integrating linguistic knowledge in a systematic way can notably advance the generation of nuanced and contextually accurate images.

Title: Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models

Authors: Renjie Pi, Kehao Miao, Li Peihang, Runtao Liu, Jiahui Gao, Jipeng Zhang, Xiaofang Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16149
Pdf URL: https://arxiv.org/pdf/2509.16149
Copy Paste: [[2509.16149]] Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models(https://arxiv.org/abs/2509.16149)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have demonstrated extraordinary capabilities in conducting conversations based on image inputs. However, we observe that MLLMs exhibit a pronounced form of visual sycophantic behavior. While similar behavior has also been noted in text-based large language models (LLMs), it becomes significantly more prominent when MLLMs process image inputs. We refer to this phenomenon as the "sycophantic modality gap." To better understand this issue, we further analyze the factors that contribute to the exacerbation of this gap. To mitigate the visual sycophantic behavior, we first experiment with naive supervised fine-tuning to help the MLLM resist misleading instructions from the user. However, we find that this approach also makes the MLLM overly resistant to corrective instructions (i.e., stubborn even if it is wrong). To alleviate this trade-off, we propose Sycophantic Reflective Tuning (SRT), which enables the MLLM to engage in reflective reasoning, allowing it to determine whether a user's instruction is misleading or corrective before drawing a conclusion. After applying SRT, we observe a significant reduction in sycophantic behavior toward misleading instructions, without resulting in excessive stubbornness when receiving corrective instructions.

Title: Automated Cyber Defense with Generalizable Graph-based Reinforcement Learning Agents

Authors: Isaiah J. King, Benjamin Bowman, H. Howie Huang
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2509.16151
Pdf URL: https://arxiv.org/pdf/2509.16151
Copy Paste: [[2509.16151]] Automated Cyber Defense with Generalizable Graph-based Reinforcement Learning Agents(https://arxiv.org/abs/2509.16151)
Keywords: defense
Abstract: Deep reinforcement learning (RL) is emerging as a viable strategy for automated cyber defense (ACD). The traditional RL approach represents networks as a list of computers in various states of safety or threat. Unfortunately, these models are forced to overfit to specific network topologies, rendering them ineffective when faced with even small environmental perturbations. In this work, we frame ACD as a two-player context-based partially observable Markov decision problem with observations represented as attributed graphs. This approach allows our agents to reason through the lens of relational inductive bias. Agents learn how to reason about hosts interacting with other system entities in a more general manner, and their actions are understood as edits to the graph representing the environment. By introducing this bias, we will show that our agents can better reason about the states of networks and zero-shot adapt to new ones. We show that this approach outperforms the state-of-the-art by a wide margin, and makes our agents capable of defending never-before-seen networks against a wide range of adversaries in a variety of complex, and multi-agent environments.

Title: Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks

Authors: Het Patel, Muzammil Allie, Qian Zhang, Jia Chen, Evangelos E. Papalexakis
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2509.16163
Pdf URL: https://arxiv.org/pdf/2509.16163
Copy Paste: [[2509.16163]] Robust Vision-Language Models via Tensor Decomposition: A Defense Against Adversarial Attacks(https://arxiv.org/abs/2509.16163)
Keywords: defense, attack, robust
Abstract: Vision language models (VLMs) excel in multimodal understanding but are prone to adversarial attacks. Existing defenses often demand costly retraining or significant architecture changes. We introduce a lightweight defense using tensor decomposition suitable for any pre-trained VLM, requiring no retraining. By decomposing and reconstructing vision encoder representations, it filters adversarial noise while preserving meaning. Experiments with CLIP on COCO and Flickr30K show improved robustness. On Flickr30K, it restores 12.3\% performance lost to attacks, raising Recall@1 accuracy from 7.5\% to 19.8\%. On COCO, it recovers 8.1\% performance, improving accuracy from 3.8\% to 11.9\%. Analysis shows Tensor Train decomposition with low rank (8-32) and low residual strength ($\alpha=0.1-0.2$) is optimal. This method is a practical, plug-and-play solution with minimal overhead for existing VLMs.

Title: UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation

Authors: Xiaoqi Zhao, Youwei Pang, Chenyang Yu, Lihe Zhang, Huchuan Lu, Shijian Lu, Georges El Fakhri, Xiaofeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2509.16170
Pdf URL: https://arxiv.org/pdf/2509.16170
Copy Paste: [[2509.16170]] UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation(https://arxiv.org/abs/2509.16170)
Keywords: segmentation
Abstract: Multi-modal image segmentation faces real-world deployment challenges from incomplete/corrupted modalities degrading performance. While existing methods address training-inference modality gaps via specialized per-combination models, they introduce high deployment costs by requiring exhaustive model subsets and model-modality matching. In this work, we propose a unified modality-relax segmentation network (UniMRSeg) through hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across input, feature and output levels. % First, we adopt modality reconstruction with the hybrid shuffled-masking augmentation, encouraging the model to learn the intrinsic modality characteristics and generate meaningful representations for missing modalities through cross-modal fusion. % Next, modality-invariant contrastive learning implicitly compensates the feature space distance among incomplete-complete modality pairs. Furthermore, the proposed lightweight reverse attention adapter explicitly compensates for the weak perceptual semantics in the frozen encoder. Last, UniMRSeg is fine-tuned under the hybrid consistency constraint to ensure stable prediction under all modality combinations without large performance fluctuations. Without bells and whistles, UniMRSeg significantly outperforms the state-of-the-art methods under diverse missing modality scenarios on MRI-based brain tumor segmentation, RGB-D semantic segmentation, RGB-D/T salient object segmentation. The code will be released at this https URL.

Title: Fast OTSU Thresholding Using Bisection Method

Authors: Sai Varun Kodathala
Subjects: cs.CV, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2509.16179
Pdf URL: https://arxiv.org/pdf/2509.16179
Copy Paste: [[2509.16179]] Fast OTSU Thresholding Using Bisection Method(https://arxiv.org/abs/2509.16179)
Keywords: segmentation
Abstract: The Otsu thresholding algorithm represents a fundamental technique in image segmentation, yet its computational efficiency is severely limited by exhaustive search requirements across all possible threshold values. This work presents an optimized implementation that leverages the bisection method to exploit the unimodal characteristics of the between-class variance function. Our approach reduces the computational complexity from O(L) to O(log L) evaluations while preserving segmentation accuracy. Experimental validation on 48 standard test images demonstrates a 91.63% reduction in variance computations and 97.21% reduction in algorithmic iterations compared to conventional exhaustive search. The bisection method achieves exact threshold matches in 66.67% of test cases, with 95.83% exhibiting deviations within 5 gray levels. The algorithm maintains universal convergence within theoretical logarithmic bounds while providing deterministic performance guarantees suitable for real-time applications. This optimization addresses critical computational bottlenecks in large-scale image processing systems without compromising the theoretical foundations or segmentation quality of the original Otsu method.

Title: CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs

Authors: Jinghao Zhang, Sihang Jiang, Shiwei Guo, Shisong Chen, Yanghua Xiao, Hongwei Feng, Jiaqing Liang, Minggui HE, Shimin Tao, Hongxia Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2509.16188
Pdf URL: https://arxiv.org/pdf/2509.16188
Copy Paste: [[2509.16188]] CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs(https://arxiv.org/abs/2509.16188)
Keywords: large language model
Abstract: As large language models (LLMs) are increasingly deployed in diverse cultural environments, evaluating their cultural understanding capability has become essential for ensuring trustworthy and culturally aligned applications. However, most existing benchmarks lack comprehensiveness and are challenging to scale and adapt across different cultural contexts, because their frameworks often lack guidance from well-established cultural theories and tend to rely on expert-driven manual annotations. To address these issues, we propose CultureScope, the most comprehensive evaluation framework to date for assessing cultural understanding in LLMs. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification, comprising 3 layers and 140 dimensions, which guides the automated construction of culture-specific knowledge bases and corresponding evaluation datasets for any given languages and cultures. Experimental results demonstrate that our method can effectively evaluate cultural understanding. They also reveal that existing large language models lack comprehensive cultural competence, and merely incorporating multilingual data does not necessarily enhance cultural understanding. All code and data files are available at this https URL

Title: MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Authors: Yanghao Li, Rui Qian, Bowen Pan, Haotian Zhang, Haoshuo Huang, Bowen Zhang, Jialing Tong, Haoxuan You, Xianzhi Du, Zhe Gan, Hyunjik Kim, Chao Jia, Zhenbang Wang, Yinfei Yang, Mingfei Gao, Zi-Yi Dou, Wenze Hu, Chang Gao, Dongxu Li, Philipp Dufter, Zirui Wang, Guoli Yin, Zhengdong Zhang, Chen Chen, Yang Zhao, Ruoming Pang, Zhifeng Chen
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2509.16197
Pdf URL: https://arxiv.org/pdf/2509.16197
Copy Paste: [[2509.16197]] MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer(https://arxiv.org/abs/2509.16197)
Keywords: diffusion, large language model
Abstract: Unified multimodal Large Language Models (LLMs) that can both understand and generate visual content hold immense potential. However, existing open-source models often suffer from a performance trade-off between these capabilities. We present Manzano, a simple and scalable unified framework that substantially reduces this tension by coupling a hybrid image tokenizer with a well-curated training recipe. A single shared vision encoder feeds two lightweight adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation within a common semantic space. A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels. The architecture, together with a unified training recipe over understanding and generation data, enables scalable joint learning of both capabilities. Manzano achieves state-of-the-art results among unified models, and is competitive with specialist models, particularly on text-rich evaluation. Our studies show minimal task conflicts and consistent gains from scaling model size, validating our design choice of a hybrid tokenizer.

Title: RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation

Authors: Jane Luo, Xin Zhang, Steven Liu, Jie Wu, Yiming Huang, Yangyu Huang, Chengyu Yin, Ying Xin, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Qi Chen, Scarlett Li, Mao Yang
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2509.16198
Pdf URL: https://arxiv.org/pdf/2509.16198
Copy Paste: [[2509.16198]] RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation(https://arxiv.org/abs/2509.16198)
Keywords: large language model
Abstract: Large language models excel at function- and file-level code generation, yet generating complete repositories from scratch remains a fundamental challenge. This process demands coherent and reliable planning across proposal- and implementation-level stages, while natural language, due to its ambiguity and verbosity, is ill-suited for faithfully representing complex software structures. To address this, we introduce the Repository Planning Graph (RPG), a persistent representation that unifies proposal- and implementation-level planning by encoding capabilities, file structures, data flows, and functions in one graph. RPG replaces ambiguous natural language with an explicit blueprint, enabling long-horizon planning and scalable repository generation. Building on RPG, we develop ZeroRepo, a graph-driven framework for repository generation from scratch. It operates in three stages: proposal-level planning and implementation-level refinement to construct the graph, followed by graph-guided code generation with test validation. To evaluate this setting, we construct RepoCraft, a benchmark of six real-world projects with 1,052 tasks. On RepoCraft, ZeroRepo produces repositories averaging nearly 36K LOC, roughly 3.9$\times$ the strongest baseline (Claude Code) and about 64$\times$ other baselines. It attains 81.5% functional coverage and a 69.7% pass rate, exceeding Claude Code by 27.3 and 35.8 percentage points, respectively. Further analysis shows that RPG models complex dependencies, enables progressively more sophisticated planning through near-linear scaling, and enhances LLM understanding of repositories, thereby accelerating agent localization.

Title: Inverting Trojans in LLMs

Authors: Zhengxing Li, Guangmingmei Yang, Jayaram Raghuram, David J. Miller, George Kesidis
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2509.16203
Pdf URL: https://arxiv.org/pdf/2509.16203
Copy Paste: [[2509.16203]] Inverting Trojans in LLMs(https://arxiv.org/abs/2509.16203)
Keywords: attack
Abstract: While effective backdoor detection and inversion schemes have been developed for AIs used e.g. for images, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes gradient-based search over this space, central to many backdoor inversion methods. Second, there are ~30,000^k k-tuples to consider, k the token-length of a putative trigger. Third, for LLMs there is the need to blacklist tokens that have strong marginal associations with the putative target response (class) of an attack, as such tokens give false detection signals. However, good blacklists may not exist for some domains. We propose a LLM trigger inversion approach with three key components: i) discrete search, with putative triggers greedily accreted, starting from a select list of singletons; ii) implicit blacklisting, achieved by evaluating the average cosine similarity, in activation space, between a candidate trigger and a small clean set of samples from the putative target class; iii) detection when a candidate trigger elicits high misclassifications, and with unusually high decision confidence. Unlike many recent works, we demonstrate that our approach reliably detects and successfully inverts ground-truth backdoor trigger phrases.