2025-08-12

Title: Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction

Authors: Juliana Resplande Sant'anna Gomes, Arlindo Rodrigues Galvão Filho
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.06495
Pdf URL: https://arxiv.org/pdf/2508.06495
Copy Paste: [[2508.06495]] Semi-automated Fact-checking in Portuguese: Corpora Enrichment using Retrieval with Claim extraction(https://arxiv.org/abs/2508.06495)
Keywords: robust, extraction, large language model
Abstract: The accelerated dissemination of disinformation often outpaces the capacity for manual fact-checking, highlighting the urgent need for Semi-Automated Fact-Checking (SAFC) systems. Within the Portuguese language context, there is a noted scarcity of publicly available datasets that integrate external evidence, an essential component for developing robust AFC systems, as many existing resources focus solely on classification based on intrinsic text features. This dissertation addresses this gap by developing, applying, and analyzing a methodology to enrich Portuguese news corpora (this http URL, this http URL, MuMiN-PT) with external evidence. The approach simulates a user's verification process, employing Large Language Models (LLMs, specifically Gemini 1.5 Flash) to extract the main claim from texts and search engine APIs (Google Search API, Google FactCheck Claims Search API) to retrieve relevant external documents (evidence). Additionally, a data validation and preprocessing framework, including near-duplicate detection, is introduced to enhance the quality of the base corpora.
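The abstract does not say how near-duplicates are detected. A minimal sketch of one common approach, assumed here, is Jaccard similarity over word shingles; the function names, the shingle size, and the 0.8 threshold are illustrative choices, not taken from the dissertation.

```python
import re


def shingles(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())  # \w also matches accented letters
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8, n=3):
    """Return index pairs (i, j), i < j, whose shingle sets overlap heavily."""
    sets = [shingles(d, n) for d in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

The pairwise loop is quadratic in the number of documents, so a large corpus would likely need an approximate scheme such as MinHash instead.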

Title: Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG

Authors: Rakesh Raj Madavan, Akshat Kaimal, Hashim Faisal, Chandrakala S
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2508.06496
Pdf URL: https://arxiv.org/pdf/2508.06496
Copy Paste: [[2508.06496]] Med-GRIM: Enhanced Zero-Shot Medical VQA using prompt-embedded Multimodal Graph RAG(https://arxiv.org/abs/2508.06496)
Keywords: robust, large language model
Abstract: An ensemble of trained multimodal encoders and vision-language models (VLMs) has become a standard approach for visual question answering (VQA) tasks. However, such models often fail to produce responses with the detailed precision necessary for complex, domain-specific applications such as medical VQA. Our representation model, BIND: BLIVA Integrated with Dense Encoding, extends prior multimodal work by refining the joint embedding space through dense, query-token-based encodings inspired by contrastive pretraining techniques. This refined encoder powers Med-GRIM, a model designed for medical VQA tasks that leverages graph-based retrieval and prompt engineering to integrate domain-specific knowledge. Rather than relying on compute-heavy fine-tuning of vision and language models on specific datasets, Med-GRIM applies a low-compute, modular workflow with small language models (SLMs) for efficiency. Med-GRIM employs prompt-based retrieval to dynamically inject relevant knowledge, ensuring both accuracy and robustness in its responses. By assigning distinct roles to each agent within the VQA system, Med-GRIM achieves large language model performance at a fraction of the computational cost. Additionally, to support scalable research in zero-shot multimodal medical applications, we introduce DermaGraph, a novel Graph-RAG dataset comprising diverse dermatological conditions. This dataset facilitates both multimodal and unimodal querying. The code and dataset are available at: this https URL

Title: Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models

Authors: Yao Ge, Sudeshna Das, Yuting Guo, Abeed Sarker
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06504
Pdf URL: https://arxiv.org/pdf/2508.06504
Copy Paste: [[2508.06504]] Retrieval augmented generation based dynamic prompting for few-shot biomedical named entity recognition using large language models(https://arxiv.org/abs/2508.06504)
Keywords: large language model
Abstract: Biomedical named entity recognition (NER) is a high-utility natural language processing (NLP) task, and large language models (LLMs) show promise particularly in few-shot settings (i.e., limited training data). In this article, we address the performance challenges of LLMs for few-shot biomedical NER by investigating a dynamic prompting strategy involving retrieval-augmented generation (RAG). In our approach, the annotated in-context learning examples are selected based on their similarities with the input texts, and the prompt is dynamically updated for each instance during inference. We implemented and optimized static and dynamic prompt engineering techniques and evaluated them on five biomedical NER datasets. Static prompting with structured components increased average F1-scores by 12% for GPT-4, and 11% for GPT-3.5 and LLaMA 3-70B, relative to basic static prompting. Dynamic prompting further improved performance, with TF-IDF and SBERT retrieval methods yielding the best results, improving average F1-scores by 7.3% and 5.6% in 5-shot and 10-shot settings, respectively. These findings highlight the utility of contextually adaptive prompts via RAG for biomedical NER.
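The dynamic prompting idea described above (pick the annotated examples most similar to each input, then rebuild the prompt) can be sketched with a small TF-IDF retriever. This is an illustration under assumptions: the tokenizer, the smoothed IDF formula, the prompt wording, and all function names are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict) per text, using smoothed IDF."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1.0 for tok, d in df.items()}
    return [{tok: c * idf[tok] for tok, c in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_examples(query, pool, k=2):
    """pool is a list of (sentence, annotation) pairs; return the k most similar."""
    vectors = tfidf_vectors([sentence for sentence, _ in pool] + [query])
    query_vec = vectors[-1]
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(vectors[i], query_vec), reverse=True)
    return [pool[i] for i in ranked[:k]]


def build_prompt(query, pool, k=2):
    """Assemble a few-shot prompt whose examples depend on the query."""
    parts = ["Extract the biomedical entities from the input."]
    for sentence, annotation in select_examples(query, pool, k):
        parts.append(f"Input: {sentence}\nEntities: {annotation}")
    parts.append(f"Input: {query}\nEntities:")
    return "\n\n".join(parts)
```

A static prompt would reuse the same examples for every input; here `build_prompt` is called once per instance, so the examples change with the query.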

Title: DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation

Authors: He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su, Xiangqian Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06511
Pdf URL: https://arxiv.org/pdf/2508.06511
Copy Paste: [[2508.06511]] DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation(https://arxiv.org/abs/2508.06511)
Keywords: diffusion
Abstract: Portrait animation aims to synthesize talking videos from a static reference face, conditioned on audio and style frame cues (e.g., emotion and head poses), while ensuring precise lip synchronization and faithful reproduction of speaking styles. Existing diffusion-based portrait animation methods primarily focus on lip synchronization or static emotion transformation, often overlooking dynamic styles such as head movements. Moreover, most of these methods rely on a dual U-Net architecture, which preserves identity consistency but incurs additional computational overhead. To this end, we propose DiTalker, a unified DiT-based framework for speaking style-controllable portrait animation. We design a Style-Emotion Encoding Module that employs two separate branches: a style branch extracting identity-specific style information (e.g., head poses and movements), and an emotion branch extracting identity-agnostic emotion features. We further introduce an Audio-Style Fusion Module that decouples audio and speaking styles via two parallel cross-attention layers, using these features to guide the animation process. To enhance the quality of results, we adopt and modify two optimization constraints: one to improve lip synchronization and the other to preserve fine-grained identity and background details. Extensive experiments demonstrate the superiority of DiTalker in terms of lip synchronization and speaking style controllability. Project Page: this https URL

Title: Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation

Authors: Haoran Xi, Chen Liu, Xiaolin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06517
Pdf URL: https://arxiv.org/pdf/2508.06517
Copy Paste: [[2508.06517]] Frequency Prior Guided Matching: A Data Augmentation Approach for Generalizable Semi-Supervised Polyp Segmentation(https://arxiv.org/abs/2508.06517)
Keywords: robust, segmentation
Abstract: Automated polyp segmentation is essential for early diagnosis of colorectal cancer, yet developing robust models remains challenging due to limited annotated data and significant performance degradation under domain shift. Although semi-supervised learning (SSL) reduces annotation requirements, existing methods rely on generic augmentations that ignore polyp-specific structural properties, resulting in poor generalization to new imaging centers and devices. To address this, we introduce Frequency Prior Guided Matching (FPGM), a novel augmentation framework built on a key discovery: polyp edges exhibit a remarkably consistent frequency signature across diverse datasets. FPGM leverages this intrinsic regularity in a two-stage process. It first learns a domain-invariant frequency prior from the edge regions of labeled polyps. Then, it performs principled spectral perturbations on unlabeled images, aligning their amplitude spectra with this learned prior while preserving phase information to maintain structural integrity. This targeted alignment normalizes domain-specific textural variations, thereby compelling the model to learn the underlying, generalizable anatomical structure. Validated on six public datasets, FPGM establishes a new state-of-the-art against ten competing methods. It demonstrates exceptional zero-shot generalization capabilities, achieving over 10% absolute gain in Dice score in data-scarce scenarios. By significantly enhancing cross-domain robustness, FPGM presents a powerful solution for clinically deployable polyp segmentation under limited supervision.
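The core spectral step described above, moving an input's amplitude spectrum toward a prior while keeping its phase, can be illustrated on a 1D signal. This is a simplified sketch under assumptions: the paper operates on images and learns its prior from polyp edge regions, whereas here the prior is simply the amplitude spectrum of a reference signal, and `strength` and all function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def align_amplitude(signal, prior_amplitude, strength=1.0):
    """Blend the amplitude spectrum of signal toward prior_amplitude.

    The phase of every frequency component is left unchanged. strength=0
    returns the input; strength=1 adopts the prior amplitude entirely.
    prior_amplitude should come from a real signal of the same length so
    that the result stays real.
    """
    aligned = []
    for coeff, prior in zip(dft(signal), prior_amplitude):
        amplitude = (1.0 - strength) * abs(coeff) + strength * prior
        aligned.append(cmath.rect(amplitude, cmath.phase(coeff)))
    return [value.real for value in idft(aligned)]
```

Keeping the phase is what preserves where structures sit in the signal; only the energy per frequency is changed.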

Title: CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

Authors: Lei Jiang, Fan Chen
Subjects: cs.CL, cs.AI, cs.CY, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06524
Pdf URL: https://arxiv.org/pdf/2508.06524
Copy Paste: [[2508.06524]] CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models(https://arxiv.org/abs/2508.06524)
Keywords: large language model
Abstract: Neural scaling laws have driven the development of increasingly large language models (LLMs) by linking accuracy improvements to growth in parameter count, dataset size, and compute. However, these laws overlook the carbon emissions that scale exponentially with LLM size. This paper presents CarbonScaling, an analytical framework that extends neural scaling laws to incorporate both operational and embodied carbon in LLM training. By integrating models for neural scaling, GPU hardware evolution, parallelism optimization, and carbon estimation, CarbonScaling quantitatively connects model accuracy to carbon footprint. Results show that while a power-law relationship between accuracy and carbon holds, real-world inefficiencies significantly increase the scaling factor. Hardware technology scaling reduces carbon emissions for small to mid-sized models, but offers diminishing returns for extremely large LLMs due to communication overhead and underutilized GPUs. Training optimizations, especially aggressive critical batch size scaling, help alleviate this inefficiency. CarbonScaling offers key insights for training more sustainable and carbon-efficient LLMs.

Title: Large Language Models Facilitate Vision Reflection in Image Classification

Authors: Guoyuan An, JaeYoon Kim, SungEui Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06525
Pdf URL: https://arxiv.org/pdf/2508.06525
Copy Paste: [[2508.06525]] Large Language Models Facilitate Vision Reflection in Image Classification(https://arxiv.org/abs/2508.06525)
Keywords: robust, explainability, large language model
Abstract: This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition accuracy, even on benchmarks like ImageNet, despite prior evidence that LMMs typically underperform dedicated vision encoders. Second, we analyze the internal behavior of vision reflection and find that the vision-language connector maps visual features into explicit textual concepts, allowing the language model to reason about prediction plausibility using commonsense knowledge. We further observe that replacing a large number of vision tokens with only a few text tokens still enables LLaVA to generate similar answers, suggesting that LMMs may rely primarily on a compact set of distilled textual representations rather than raw vision features. Third, we show that a training-free connector can enhance LMM performance in fine-grained recognition tasks, without extensive feature-alignment training. Together, these findings offer new insights into the explainability of vision-language models and suggest that vision reflection is a promising strategy for achieving robust and interpretable visual recognition.

Title: A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition

Authors: Xiuliang Zhang, Tadiwa Elisha Nyamasvisva, Chuntao Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06528
Pdf URL: https://arxiv.org/pdf/2508.06528
Copy Paste: [[2508.06528]] A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition(https://arxiv.org/abs/2508.06528)
Keywords: transformer
Abstract: Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle with modeling long-range dependencies. Conversely, Transformers excel at learning global contextual information but face challenges with high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity. Ablation studies further validate the complementary strengths of the two modules. This hybrid framework offers an effective and scalable solution for video-based behavior recognition.

Title: RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving

Authors: Jiayuan Wang, Q. M. Jonathan Wu, Katsuya Suto, Ning Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06529
Pdf URL: https://arxiv.org/pdf/2508.06529
Copy Paste: [[2508.06529]] RMT-PPAD: Real-time Multi-task Learning for Panoptic Perception in Autonomous Driving(https://arxiv.org/abs/2508.06529)
Keywords: fair, transformer, segmentation
Abstract: Autonomous driving systems rely on panoptic driving perception that requires both precision and real-time performance. In this work, we propose RMT-PPAD, a real-time, transformer-based multi-task model that jointly performs object detection, drivable area segmentation, and lane line segmentation. We introduce a lightweight module, a gate control with an adapter to adaptively fuse shared and task-specific features, effectively alleviating negative transfer between tasks. Additionally, we design an adaptive segmentation decoder to learn the weights over multi-scale features automatically during the training stage. This avoids the manual design of task-specific structures for different segmentation tasks. We also identify and resolve the inconsistency between training and testing labels in lane line segmentation. This allows fairer evaluation. Experiments on the BDD100K dataset demonstrate that RMT-PPAD achieves state-of-the-art results with mAP50 of 84.9% and Recall of 95.4% for object detection, mIoU of 92.6% for drivable area segmentation, and IoU of 56.8% and accuracy of 84.7% for lane line segmentation. The inference speed reaches 32.6 FPS. Moreover, we introduce real-world scenarios to evaluate RMT-PPAD performance in practice. The results show that RMT-PPAD consistently delivers stable performance. The source codes and pre-trained models are released at this https URL.
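For readers unfamiliar with the segmentation metrics quoted above, the following sketch shows how IoU and mean IoU are commonly computed from label maps. It is a generic illustration, not the paper's evaluation code; the function names and the choice to skip classes absent from both maps are assumptions.

```python
def iou(pred, target, cls):
    """Intersection over union for one class on 2D label maps (lists of rows)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = p == cls, t == cls
            intersection += in_pred and in_target
            union += in_pred or in_target
    # A class absent from both maps has no defined IoU.
    return intersection / union if union else float("nan")


def mean_iou(pred, target, classes):
    """Average the per-class IoU, skipping classes with undefined IoU."""
    scores = [iou(pred, target, c) for c in classes]
    valid = [s for s in scores if s == s]  # NaN is the only value != itself
    return sum(valid) / len(valid) if valid else float("nan")
```

For a binary task such as lane line segmentation, `iou(pred, target, 1)` on the foreground class gives the single IoU figure, while `mean_iou` averages over foreground and background.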

Title: What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?

Authors: Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, Masashi Sugiyama
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06530
Pdf URL: https://arxiv.org/pdf/2508.06530
Copy Paste: [[2508.06530]] What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?(https://arxiv.org/abs/2508.06530)
Keywords: large language model
Abstract: Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite the great advances in LVLMs, they still suffer from the unavoidable object hallucination issue, in which they tend to generate objects inconsistent with the image content. The most commonly used Polling-based Object Probing Evaluation (POPE) benchmark evaluates this issue by sampling negative categories according to category-level statistics, e.g., category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, aiming to generate the most misleading distractors (i.e., non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, which serves as a means to more rigorously assess their immunity to hallucination. To explore the image-specific information, the content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, the description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at this https URL.
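The content-aware selection step, choosing as distractors the absent objects that a scoring model finds most likely for a given image, reduces to a filtered top-k. The sketch below assumes the per-image scores have already been computed (for example by an image-text similarity model); the scores, object names, and function name are hypothetical.

```python
def pick_distractors(scores, present, k=3):
    """Return the k highest-scoring objects that are NOT in the image.

    scores  -- dict mapping object name to a predicted likelihood for this image
    present -- set of objects that really appear in the image (ground truth)
    """
    negatives = {obj: s for obj, s in scores.items() if obj not in present}
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(negatives, key=lambda obj: (-negatives[obj], obj))[:k]
```

Unlike sampling by category frequency or co-occurrence, this ranking depends on the individual image, which is the image-specific information the abstract says POPE overlooks.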

Title: The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

Authors: Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06533
Pdf URL: https://arxiv.org/pdf/2508.06533
Copy Paste: [[2508.06533]] The Art of Breaking Words: Rethinking Multilingual Tokenizer Design(https://arxiv.org/abs/2508.06533)
Keywords: large language model
Abstract: While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pretokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% with respect to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement on average token-to-word ratio against state-of-the-art multilingual Indic models. This improvement yields measurable gains in both model performance and inference speed. This highlights tokenization alongside architecture and training objectives as a critical lever for building efficient, scalable multilingual LLMs.
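The token-to-word ratio used above as the efficiency metric can be computed as total tokens divided by total whitespace-separated words. The sketch below assumes that definition; the two toy tokenizers and all function names are placeholders for illustration, not the tokenizers studied in the paper.

```python
def token_to_word_ratio(texts, tokenize):
    """Total tokens produced by `tokenize` divided by total whitespace words."""
    total_tokens = sum(len(tokenize(text)) for text in texts)
    total_words = sum(len(text.split()) for text in texts)
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def relative_reduction(baseline_ratio, new_ratio):
    """Fractional reduction of the ratio relative to a baseline (0.06 = 6%)."""
    return (baseline_ratio - new_ratio) / baseline_ratio


# Two toy tokenizers marking the extremes: one token per character
# versus one token per word.
def char_tokenizer(text):
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text):
    return text.split()
```

A ratio of 1.0 means every word is a single token; higher values mean words are split into more pieces, which consumes more context length per sentence.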

Title: Transfer Learning with EfficientNet for Accurate Leukemia Cell Classification

Authors: Faisal Ahmed
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06535
Pdf URL: https://arxiv.org/pdf/2508.06535
Copy Paste: [[2508.06535]] Transfer Learning with EfficientNet for Accurate Leukemia Cell Classification(https://arxiv.org/abs/2508.06535)
Keywords: robust
Abstract: Accurate classification of Acute Lymphoblastic Leukemia (ALL) from peripheral blood smear images is essential for early diagnosis and effective treatment planning. This study investigates the use of transfer learning with pretrained convolutional neural networks (CNNs) to improve diagnostic performance. To address the class imbalance in the dataset of 3,631 Hematologic and 7,644 ALL images, we applied extensive data augmentation techniques to create a balanced training set of 10,000 images per class. We evaluated several models, including ResNet50, ResNet101, and EfficientNet variants B0, B1, and B3. EfficientNet-B3 achieved the best results, with an F1-score of 94.30%, accuracy of 92.02%, and AUC of 94.79%, outperforming previously reported methods in the C-NMC Challenge. These findings demonstrate the effectiveness of combining data augmentation with advanced transfer learning models, particularly EfficientNet-B3, in developing accurate and robust diagnostic tools for hematologic malignancy detection.

Title: MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing

Authors: Jinghan Yu, Zhiyuan Ma, Yue Ma, Kaiqi Liu, Yuhan Wang, Jianjun Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06543
Pdf URL: https://arxiv.org/pdf/2508.06543
Copy Paste: [[2508.06543]] MILD: Multi-Layer Diffusion Strategy for Complex and Precise Multi-IP Aware Human Erasing(https://arxiv.org/abs/2508.06543)
Keywords: diffusion
Abstract: Recent years have witnessed the success of diffusion models in image-customized tasks. Prior works have achieved notable progress on human-oriented erasing using explicit mask guidance and semantic-aware inpainting. However, they struggle under complex multi-IP scenarios involving human-human occlusions, human-object entanglements, and background interferences. These challenges are mainly due to: 1) Dataset limitations, as existing datasets rarely cover dense occlusions, camouflaged backgrounds, and diverse interactions; 2) Lack of spatial decoupling, where foreground instances cannot be effectively disentangled, limiting clean background restoration. In this work, we introduce a high-quality multi-IP human erasing dataset with diverse pose variations and complex backgrounds. We then propose Multi-Layer Diffusion (MILD), a novel strategy that decomposes generation into semantically separated pathways for each instance and the background. To enhance human-centric understanding, we introduce Human Morphology Guidance, integrating pose, parsing, and spatial relations. We further present Spatially-Modulated Attention to better guide attention flow. Extensive experiments show that MILD outperforms state-of-the-art methods on challenging human erasing benchmarks.

Title: Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images

Authors: Qi Xun Yeo, Yanyan Li, Gim Hee Lee
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.06546
Pdf URL: https://arxiv.org/pdf/2508.06546
Copy Paste: [[2508.06546]] Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images(https://arxiv.org/abs/2508.06546)
Keywords: robust
Abstract: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and through neighboring relations. We obtain semantic masks to guide feature aggregation and filter out background features, and design a novel method that incorporates neighboring node information to improve the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our project page is available at this https URL.

Title: Factor Augmented Supervised Learning with Text Embeddings

Authors: Zhanye Luo, Yuefeng Han, Xiufan Yu
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.06548
Pdf URL: https://arxiv.org/pdf/2508.06548
Copy Paste: [[2508.06548]] Factor Augmented Supervised Learning with Text Embeddings(https://arxiv.org/abs/2508.06548)
Keywords: large language model
Abstract: Large language models (LLMs) generate text embeddings from text data, producing vector representations that capture the semantic meaning and contextual relationships of words. However, the high dimensionality of these embeddings often impedes efficiency and drives up computational cost in downstream tasks. To address this, we propose AutoEncoder-Augmented Learning with Text (AEALT), a supervised, factor-augmented framework that incorporates dimension reduction directly into pre-trained LLM workflows. First, we extract embeddings from text documents; next, we pass them through a supervised augmented autoencoder to learn low-dimensional, task-relevant latent factors. By modeling the nonlinear structure of complex embeddings, AEALT outperforms conventional deep-learning approaches that rely on raw embeddings. We validate its broad applicability with extensive experiments on classification, anomaly detection, and prediction tasks using multiple real-world public datasets. Numerical results demonstrate that AEALT yields substantial gains over both vanilla embeddings and several standard dimension reduction methods.

Title: Slice or the Whole Pie? Utility Control for AI Models

Authors: Ye Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06551
Pdf URL: https://arxiv.org/pdf/2508.06551
Copy Paste: [[2508.06551]] Slice or the Whole Pie? Utility Control for AI Models(https://arxiv.org/abs/2508.06551)
Keywords: diffusion, segmentation
Abstract: Training deep neural networks (DNNs) has become an increasingly resource-intensive task, requiring large volumes of labeled data, substantial computational power, and considerable fine-tuning efforts to achieve optimal performance across diverse use cases. Although pre-trained models offer a useful starting point, adapting them to meet specific user needs often demands extensive customization and infrastructure overhead. This challenge grows when a single model must support diverse applications with differing requirements for performance. Traditional solutions often involve training multiple model versions to meet varying requirements, which can be inefficient and difficult to maintain. In order to overcome this challenge, we propose NNObfuscator, a novel utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. It is different from traditional methods that need separate models for each user. Instead, NNObfuscator allows a single model to be adapted in real time, giving users controlled access to multiple levels of performance. This mechanism enables model owners to set up tiered access, ensuring that free-tier users receive a baseline level of performance while premium users benefit from enhanced capabilities. The approach improves resource allocation, reduces unnecessary computation, and supports sustainable business models in AI deployment. To validate our approach, we conducted experiments on multiple tasks, including image classification, semantic segmentation, and text-to-image generation, using well-established models such as ResNet, DeepLab, VGG16, FCN and Stable Diffusion. Experimental results show that NNObfuscator successfully makes a model more adaptable, so that a single trained model can handle a broad range of tasks without requiring extensive changes.

Title: Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection

Authors: Unisha Joshi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06552
Pdf URL: https://arxiv.org/pdf/2508.06552
Copy Paste: [[2508.06552]] Age-Diverse Deepfake Dataset: Bridging the Age Gap in Deepfake Detection(https://arxiv.org/abs/2508.06552)
Keywords: fair
Abstract: The challenges associated with deepfake detection are increasing significantly with the latest advancements in technology and the growing popularity of deepfake videos and images. Despite the presence of numerous detection models, demographic bias in the deepfake dataset remains largely unaddressed. This paper focuses on the mitigation of age-specific bias in the deepfake dataset by introducing an age-diverse deepfake dataset that will improve fairness across age groups. The dataset is constructed through a modular pipeline incorporating the existing Celeb-DF, FaceForensics++, and UTKFace datasets, and the creation of synthetic data to fill the age distribution gaps. The effectiveness and generalizability of this dataset are evaluated using three deepfake detection models: XceptionNet, EfficientNet, and LipForensics. Evaluation metrics, including AUC, pAUC, and EER, revealed that models trained on the age-diverse dataset demonstrated fairer performance across age groups, improved overall accuracy, and higher generalization across datasets. This study contributes a reproducible, fairness-aware deepfake dataset and model pipeline that can serve as a foundation for future research in fairer deepfake detection. The complete dataset and implementation code are available at this https URL.

Title: On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications

Authors: Simon Baur, Alexandra Benova, Emilio Dolgener Cantú, Jackie Ma
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06558
Pdf URL: https://arxiv.org/pdf/2508.06558
Copy Paste: [[2508.06558]] On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications(https://arxiv.org/abs/2508.06558)
Keywords: robust, transformer
Abstract: Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the zero-shot capability of the resulting attention maps to localize ROIs in input images, but that, contrary to what prior research suggests, this effect does not generalize across domains.

Title: Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC

Authors: Guanyu Hu, Dimitrios Kollias, Xinyu Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06564
Pdf URL: https://arxiv.org/pdf/2508.06564
Copy Paste: [[2508.06564]] Grounding Emotion Recognition with Visual Prototypes: VEGA -- Revisiting CLIP in MERC(https://arxiv.org/abs/2508.06564)
Keywords: robust
Abstract: Multimodal Emotion Recognition in Conversations remains a challenging task due to the complex interplay of textual, acoustic and visual signals. While recent models have improved performance via advanced fusion strategies, they often lack psychologically meaningful priors to guide multimodal alignment. In this paper, we revisit the use of CLIP and propose a novel Visual Emotion Guided Anchoring (VEGA) mechanism that introduces class-level visual semantics into the fusion and classification process. Distinct from prior work that primarily utilizes CLIP's textual encoder, our approach leverages its image encoder to construct emotion-specific visual anchors based on facial exemplars. These anchors guide unimodal and multimodal features toward a perceptually grounded and psychologically aligned representation space, drawing inspiration from cognitive theories (prototypical emotion categories and multisensory integration). A stochastic anchor sampling strategy further enhances robustness by balancing semantic stability and intra-class diversity. Integrated into a dual-branch architecture with self-distillation, our VEGA-augmented model achieves state-of-the-art performance on IEMOCAP and MELD. Code is available at: this https URL.

Title: Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features

Authors: Manish Kansana, Elias Hossain, Shahram Rahimi, Noorbakhsh Amiri Golilarz
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06566
Pdf URL: https://arxiv.org/pdf/2508.06566
Copy Paste: [[2508.06566]] Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features(https://arxiv.org/abs/2508.06566)
Keywords: transformer
Abstract: Surface material recognition is a key component in robotic perception and physical interaction, particularly when leveraging both tactile and visual sensory inputs. In this work, we propose Surformer v1, a transformer-based architecture designed for surface classification using structured tactile features and PCA-reduced visual embeddings extracted via ResNet-50. The model integrates modality-specific encoders with cross-modal attention layers, enabling rich interactions between vision and touch. Currently, state-of-the-art deep learning models for vision tasks have achieved remarkable performance. With this in mind, our first set of experiments focused exclusively on tactile-only surface classification. Using feature engineering, we trained and evaluated multiple machine learning models, assessing their accuracy and inference time. We then implemented an encoder-only Transformer model tailored for tactile features. This model not only achieved the highest accuracy but also demonstrated significantly faster inference time compared to other evaluated models, highlighting its potential for real-time applications. To extend this investigation, we introduced a multimodal fusion setup by combining vision and tactile inputs. We trained both Surformer v1 (using structured features) and Multimodal CNN (using raw images) to examine the impact of feature-based versus image-based multimodal learning on classification accuracy and computational efficiency. The results showed that Surformer v1 achieved 99.4% accuracy with an inference time of 0.77 ms, while the Multimodal CNN achieved slightly higher accuracy but required significantly more inference time. These findings suggest Surformer v1 offers a compelling balance between accuracy, efficiency, and computational cost for surface material recognition.

Title: Semi-Supervised Supply Chain Fraud Detection with Unsupervised Pre-Filtering

Authors: Fatemeh Moradi, Mehran Tarif, Mohammadhossein Homaei
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2508.06574
Pdf URL: https://arxiv.org/pdf/2508.06574
Copy Paste: [[2508.06574]] Semi-Supervised Supply Chain Fraud Detection with Unsupervised Pre-Filtering(https://arxiv.org/abs/2508.06574)
Keywords: robust
Abstract: Detecting fraud in modern supply chains is a growing challenge, driven by the complexity of global networks and the scarcity of labeled data. Traditional detection methods often struggle with class imbalance and limited supervision, reducing their effectiveness in real-world applications. This paper proposes a novel two-phase learning framework to address these challenges. In the first phase, the Isolation Forest algorithm performs unsupervised anomaly detection to identify potential fraud cases and reduce the volume of data requiring further analysis. In the second phase, a self-training Support Vector Machine (SVM) refines the predictions using both labeled and high-confidence pseudo-labeled samples, enabling robust semi-supervised learning. The proposed method is evaluated on the DataCo Smart Supply Chain Dataset, a comprehensive real-world supply chain dataset with fraud indicators. It achieves an F1-score of 0.817 while maintaining a false positive rate below 3.0%. These results demonstrate the effectiveness and efficiency of combining unsupervised pre-filtering with semi-supervised refinement for supply chain fraud detection under real-world constraints, though we acknowledge limitations regarding concept drift and the need for comparison with deep learning approaches.

Title: GFlowNets for Learning Better Drug-Drug Interaction Representations

Authors: Azmine Toushik Wasi
Subjects: cs.LG, q-bio.BM, q-bio.MN
Abstract URL: https://arxiv.org/abs/2508.06576
Pdf URL: https://arxiv.org/pdf/2508.06576
Copy Paste: [[2508.06576]] GFlowNets for Learning Better Drug-Drug Interaction Representations(https://arxiv.org/abs/2508.06576)
Keywords: generative
Abstract: Drug-drug interactions (DDIs) pose a significant challenge in clinical pharmacology, with severe class imbalance among interaction types limiting the effectiveness of predictive models. Common interactions dominate datasets, while rare but critical interactions remain underrepresented, leading to poor model performance on infrequent cases. Existing methods often treat DDI prediction as a binary problem, ignoring class-specific nuances and exacerbating bias toward frequent interactions. To address this, we propose a framework combining Generative Flow Networks (GFlowNet) with Variational Graph Autoencoders (VGAE) to generate synthetic samples for rare classes, improving model balance and generating effective and novel DDI pairs. Our approach enhances predictive performance across interaction types, ensuring better clinical reliability.

Title: Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs

Authors: Ying Liu, Can Li, Ting Zhang, Mei Wang, Qiannan Zhu, Jian Li, Hua Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06583
Pdf URL: https://arxiv.org/pdf/2508.06583
Copy Paste: [[2508.06583]] Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs(https://arxiv.org/abs/2508.06583)
Keywords: large language model
Abstract: The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their capacity for Socratic questioning, it often overlooks a critical dimension: adaptively guiding learners based on their cognitive states. This study shifts focus from mere question generation to the broader instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners' understanding? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical findings reveal that existing LLMs frequently fail to provide effective adaptive scaffolding when learners exhibit confusion or require redirection. Furthermore, we introduce a behavior-guided finetuning strategy that leverages behavior-prompted instructional dialogues, significantly enhancing guidance performance. By shifting the focus from isolated content evaluation to learner-centered interaction, our work advocates a more dialogic paradigm for evaluating Socratic LLMs.

Title: Hypergraph Neural Network with State Space Models for Node Classification

Authors: A. Quadir, M. Tanveer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06587
Pdf URL: https://arxiv.org/pdf/2508.06587
Copy Paste: [[2508.06587]] Hypergraph Neural Network with State Space Models for Node Classification(https://arxiv.org/abs/2508.06587)
Keywords: transformer
Abstract: In recent years, graph neural networks (GNNs) have gained significant attention for node classification tasks on graph-structured data. However, traditional GNNs primarily focus on adjacency relationships between nodes, often overlooking the rich role-based characteristics that are crucial for learning more expressive node representations. Existing methods for capturing role-based features are largely unsupervised and fail to achieve optimal performance in downstream tasks. To address these limitations, we propose a novel hypergraph neural network with state space model (HGMN) that effectively integrates role-aware representations into GNNs and the state space model. HGMN utilizes hypergraph construction techniques to model higher-order relationships and combines role-based and adjacency-based representations through a learnable mamba transformer mechanism. By leveraging two distinct hypergraph construction methods, based on node degree and neighborhood levels, it strengthens the connections among nodes with similar roles, enhancing the model's representational power. Additionally, the inclusion of hypergraph convolution layers enables the model to capture complex dependencies within hypergraph structures. To mitigate the over-smoothing problem inherent in deep GNNs, we incorporate a residual network, ensuring improved stability and better feature propagation across layers. Extensive experiments conducted on one newly introduced dataset and four benchmark datasets demonstrate the superiority of HGMN. The model achieves significant performance improvements on node classification tasks compared to state-of-the-art GNN methods. These results highlight HGMN's ability to provide enriched node representations by effectively embedding role-based features alongside adjacency information, making it a versatile and powerful tool for a variety of graph-based learning applications.

Title: A Federated Learning Framework for Handling Subtype Confounding and Heterogeneity in Large-Scale Neuroimaging Diagnosis

Authors: Xinglin Zhao, Yanwen Wang, Xiaobo Liu, Yanrong Hao, Rui Cao, Xin Wen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06589
Pdf URL: https://arxiv.org/pdf/2508.06589
Copy Paste: [[2508.06589]] A Federated Learning Framework for Handling Subtype Confounding and Heterogeneity in Large-Scale Neuroimaging Diagnosis(https://arxiv.org/abs/2508.06589)
Keywords: robust, federate
Abstract: Computer-aided diagnosis (CAD) systems play a crucial role in analyzing neuroimaging data for neurological and psychiatric disorders. However, small-sample studies suffer from low reproducibility, while large-scale datasets introduce confounding heterogeneity due to multiple disease subtypes being labeled under a single category. To address these challenges, we propose a novel federated learning framework tailored for neuroimaging CAD systems. Our approach includes a dynamic navigation module that routes samples to the most suitable local models based on latent subtype representations, and a meta-integration module that combines predictions from heterogeneous local models into a unified diagnostic output. We evaluated our framework using a comprehensive dataset comprising fMRI data from over 1300 MDD patients and 1100 healthy controls across multiple study cohorts. Experimental results demonstrate significant improvements in diagnostic accuracy and robustness compared to traditional methods. Specifically, our framework achieved an average accuracy of 74.06% across all tested sites, showcasing its effectiveness in handling subtype heterogeneity and enhancing model generalizability. Ablation studies further confirmed the importance of both the dynamic navigation and meta-integration modules in improving performance. By addressing data heterogeneity and subtype confounding, our framework advances reliable and reproducible neuroimaging CAD systems, offering significant potential for personalized medicine and clinical decision-making in neurology and psychiatry.

Title: Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials

Authors: Rachel K. Luu, Jingyu Deng, Mohammed Shahrudin Ibrahim, Nam-Joon Cho, Ming Dao, Subra Suresh, Markus J. Buehler
Subjects: cs.LG, cond-mat.dis-nn, cond-mat.mtrl-sci, cond-mat.other, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.06591
Pdf URL: https://arxiv.org/pdf/2508.06591
Copy Paste: [[2508.06591]] Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials(https://arxiv.org/abs/2508.06591)
Keywords: generative, large language model
Abstract: Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-generated procedures, materials designs, and mechanical predictions were tested in the laboratory, culminating in the fabrication of a novel pollen-based adhesive with tunable morphology and measured shear strength, establishing a foundation for future plant-derived adhesive design. This work demonstrates how AI-assisted ideation can drive real-world materials design and enable effective human-AI collaboration.

Title: LLM Unlearning Without an Expert Curated Dataset

Authors: Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06595
Pdf URL: https://arxiv.org/pdf/2508.06595
Copy Paste: [[2508.06595]] LLM Unlearning Without an Expert Curated Dataset(https://arxiv.org/abs/2508.06595)
Keywords: security, large language model
Abstract: Modern large language models often encode sensitive, harmful, or copyrighted knowledge, raising the need for post-hoc unlearning: the ability to remove specific domains of knowledge from a model without full retraining. A major bottleneck in current unlearning pipelines is constructing effective forget sets, that is, datasets that approximate the target domain and guide the model to forget it. In this work, we introduce a scalable, automated approach to generate high-quality forget sets using language models themselves. Our method synthesizes textbook-style data through a structured prompting pipeline, requiring only a domain name as input. Through experiments on unlearning biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic datasets consistently outperform the baseline synthetic alternatives and are comparable to the expert-curated ones. Additionally, ablation studies reveal that the multi-step generation pipeline significantly boosts data diversity, which in turn improves unlearning utility. Overall, our findings suggest that synthetic datasets offer a promising path toward practical, scalable unlearning for a wide range of emerging domains without the need for manual intervention. We release our code and dataset at this https URL.

Title: BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent

Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Ping Nie, Kai Zou, Andrew Liu, Joshua Green, Kshama Patel, Ruoxi Meng, Mingyi Su, Sahel Sharifymoghaddam, Yanxi Li, Haoran Hong, Xinyu Shi, Xuye Liu, Nandan Thakur, Crystina Zhang, Luyu Gao, Wenhu Chen, Jimmy Lin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.06600
Pdf URL: https://arxiv.org/pdf/2508.06600
Copy Paste: [[2508.06600]] BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent(https://arxiv.org/abs/2508.06600)
Keywords: fair, large language model
Abstract: Deep-Research agents, which integrate large language models (LLMs) with search tools, have shown success in improving the effectiveness of handling complex queries that require iterative search planning and reasoning over search results. Evaluations on current benchmarks like BrowseComp, which rely on black-box live web search APIs, have notable limitations in (1) fairness: dynamic and opaque web APIs hinder fair comparisons and reproducibility of deep research methods; (2) transparency: lack of control over the document corpus makes it difficult to isolate retriever contributions. In other words, the current evaluations may compare a complete deep research system at a given time, but they do not foster well-controlled experiments to provide insights into the capability of underlying deep research LLMs. To address these challenges, we introduce BrowseComp-Plus, a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. Each query in BrowseComp-Plus includes human-verified supporting documents and mined challenging negatives, enabling controlled experimentation. The benchmark is shown to be effective in distinguishing the performance of deep research systems. For instance, the open-source model Search-R1, when paired with the BM25 retriever, achieves 3.86% accuracy, whereas GPT-5 achieves 55.9%. Integrating GPT-5 with the Qwen3-Embedding-8B retriever further enhances its accuracy to 70.1% with fewer search calls. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods, fostering insights into retrieval effectiveness, citation accuracy, and context engineering in Deep-Research systems.
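
Note: the abstract above names BM25 as one of the retrievers run over the fixed corpus. For readers unfamiliar with it, the following is a minimal sketch of Okapi BM25 scoring. It is not the benchmark's retrieval code: whitespace tokenization, the defaults `k1=1.5` and `b=0.75`, and the smoothed "+1" IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25.

    Assumptions: lowercase whitespace tokenization, a non-empty corpus,
    and the smoothed IDF log((N - n + 0.5) / (n + 0.5) + 1).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Because the corpus is fixed, scores like these are reproducible across runs, which is the property the benchmark relies on to compare retrievers independently of the agent.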

Title: Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs

Authors: Kyle O'Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06601
Pdf URL: https://arxiv.org/pdf/2508.06601
Copy Paste: [[2508.06601]] Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs(https://arxiv.org/abs/2508.06601)
Keywords: defense, attack, robust
Abstract: Open-weight AI systems offer unique benefits, including enhanced transparency, open research, and decentralized access. However, they are vulnerable to tampering attacks which can efficiently elicit harmful behaviors by modifying weights or activations. Currently, there is not yet a robust science of open-weight model risk management. Existing safety fine-tuning methods and other post-training techniques have struggled to make LLMs resistant to more than a few dozen steps of adversarial fine-tuning. In this paper, we investigate whether filtering text about dual-use topics from training data can prevent unwanted capabilities and serve as a more tamper-resistant safeguard. We introduce a multi-stage pipeline for scalable data filtering and show that it offers a tractable and effective method for minimizing biothreat proxy knowledge in LLMs. We pretrain multiple 6.9B-parameter models from scratch and find that they exhibit substantial resistance to adversarial fine-tuning attacks on up to 10,000 steps and 300M tokens of biothreat-related text -- outperforming existing post-training baselines by over an order of magnitude -- with no observed degradation to unrelated capabilities. However, while filtered models lack internalized dangerous knowledge, we find that they can still leverage such information when it is provided in context (e.g., via search tool augmentation), demonstrating a need for a defense-in-depth approach. Overall, these findings help to establish pretraining data curation as a promising layer of defense for open-weight AI systems.

Title: Local Diffusion Models and Phases of Data Distributions

Authors: Fangjun Hu, Guangkuo Liu, Yifan Zhang, Xun Gao
Subjects: cs.LG, cond-mat.stat-mech, quant-ph
Abstract URL: https://arxiv.org/abs/2508.06614
Pdf URL: https://arxiv.org/pdf/2508.06614
Copy Paste: [[2508.06614]] Local Diffusion Models and Phases of Data Distributions(https://arxiv.org/abs/2508.06614)
Keywords: diffusion, generative
Abstract: As a class of generative artificial intelligence frameworks inspired by statistical physics, diffusion models have shown extraordinary performance in synthesizing complicated data distributions through a denoising process gradually guided by score functions. Real-life data, like images, is often spatially structured in low-dimensional spaces. However, ordinary diffusion models ignore this local structure and learn spatially global score functions, which are often computationally expensive. In this work, we introduce a new perspective on the phases of data distributions, which provides insight into constructing local denoisers with reduced computational costs. We define two distributions as belonging to the same data distribution phase if they can be mutually connected via spatially local operations such as local denoisers. Then, we show that the reverse denoising process consists of an early trivial phase and a late data phase, sandwiching a rapid phase transition where local denoisers must fail. To diagnose such phase transitions, we prove an information-theoretic bound on the fidelity of local denoisers based on conditional mutual information, and conduct numerical experiments on a real-world dataset. This work suggests simpler and more efficient architectures of diffusion models: far from the phase transition point, we can use small local neural networks to compute the score function; global neural networks are only necessary around the narrow time interval of phase transitions. This result also opens up new directions for studying phases of data distributions, the broader science of generative artificial intelligence, and guiding the design of neural networks inspired by physics concepts.

Title: Generalizing Scaling Laws for Dense and Sparse Large Language Models

Authors: Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari
Subjects: cs.LG, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2508.06617
Pdf URL: https://arxiv.org/pdf/2508.06617
Copy Paste: [[2508.06617]] Generalizing Scaling Laws for Dense and Sparse Large Language Models(https://arxiv.org/abs/2508.06617)
Keywords: large language model
Abstract: Over the past few years, the size of language models has grown exponentially, as has the computational cost to train these large models. This rapid growth has motivated researchers to develop new techniques aimed at enhancing the efficiency of the training process. Despite these advancements, optimally predicting the model size or allocating optimal resources remains a challenge. Several efforts have addressed the challenge by proposing different scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws to demonstrate its effectiveness.

Title: Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models

Authors: Tomohiro Sawada, Kartik Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06621
Pdf URL: https://arxiv.org/pdf/2508.06621
Copy Paste: [[2508.06621]] Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models(https://arxiv.org/abs/2508.06621)
Keywords: privacy, attack
Abstract: Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about a language model's training data. In this paper, we explore the downstream impact of BPE inference algorithms that do not rely on this merge list at all, and hence differ from the encoding process during BPE training. To address this question, we investigate two broad classes of BPE inference schemes that differ from BPE application during training: a) targeted deviation from merge lists, including random merge orders and various corruptions of the merge list involving deletion/truncation, and b) non-targeted BPE inference algorithms that do not depend on the merge list but focus on compressing the text either greedily or exactly. Extensive experiments across diverse language modeling tasks like accuracy-based QA benchmarks, machine translation, and open-ended generation reveal that while targeted deviation from the merge lists exhibits significant degradation in language model performance, the non-targeted merge-list-free inference algorithms result in minimal impact on downstream performance that is often much smaller than expected. These findings pave the way for simpler and potentially more privacy-preserving tokenization schemes that do not catastrophically compromise model performance.
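
Note: the abstract above contrasts merge-list-free inference that compresses text "greedily" with inference that compresses it "exactly". The sketch below shows one plausible reading of each: greedy longest-prefix matching, and a dynamic program that finds a segmentation with the fewest tokens. These are illustrative interpretations, not the authors' algorithms; the function names and the toy vocabulary in the usage note are assumptions.

```python
def greedy_encode(text, vocab):
    """Tokenize left to right, always taking the longest vocabulary entry that matches."""
    max_len = max(len(token) for token in vocab)
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


def shortest_encode(text, vocab):
    """Find a segmentation with the minimum number of tokens via dynamic programming."""
    max_len = max(len(token) for token in vocab)
    n = len(text)
    # best[i] = (fewest tokens covering text[:i], start index of the last token)
    best = [None] * (n + 1)
    best[0] = (0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None or text[start:end] not in vocab:
                continue
            candidate = best[start][0] + 1
            if best[end] is None or candidate < best[end][0]:
                best[end] = (candidate, start)
    if best[n] is None:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk the back-pointers to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]
```

With the toy vocabulary `{"a", "b", "c", "d", "ab", "bcd"}` and the text `"abcd"`, the greedy scheme commits to `"ab"` first and ends with three tokens, while the exact scheme finds the two-token split `"a"`, `"bcd"`. Neither function consults a merge list; both need only the vocabulary.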

Title: ContextGuard-LVLM: Enhancing News Veracity through Fine-grained Cross-modal Contextual Consistency Verification

Authors: Sihan Ma, Qiming Wu, Ruotong Jiang, Frank Burns
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06623
Pdf URL: https://arxiv.org/pdf/2508.06623
Copy Paste: [[2508.06623]] ContextGuard-LVLM: Enhancing News Veracity through Fine-grained Cross-modal Contextual Consistency Verification(https://arxiv.org/abs/2508.06623)
Keywords: robust
Abstract: The proliferation of digital news media necessitates robust methods for verifying content veracity, particularly regarding the consistency between visual and textual information. Traditional approaches often fall short in addressing the fine-grained cross-modal contextual consistency (FCCC) problem, which encompasses deeper alignment of visual narrative, emotional tone, and background information with text, beyond mere entity matching. To address this, we propose ContextGuard-LVLM, a novel framework built upon advanced Vision-Language Large Models (LVLMs) and integrating a multi-stage contextual reasoning mechanism. Our model is uniquely enhanced through reinforced or adversarial learning paradigms, enabling it to detect subtle contextual misalignments that evade zero-shot baselines. We extend and augment three established datasets (TamperedNews-Ent, News400-Ent, MMG-Ent) with new fine-grained contextual annotations, including "contextual sentiment," "visual narrative theme," and "scene-event logical coherence," and introduce a comprehensive CTXT (Contextual Coherence) entity type. Extensive experiments demonstrate that ContextGuard-LVLM consistently outperforms state-of-the-art zero-shot LVLM baselines (InstructBLIP and LLaVA 1.5) across nearly all fine-grained consistency tasks, showing significant improvements in complex logical reasoning and nuanced contextual understanding. Furthermore, our model exhibits superior robustness to subtle perturbations and a higher agreement rate with human expert judgments on challenging samples, affirming its efficacy in discerning sophisticated forms of context detachment.

Title: VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis

Authors: Kexin Yu, Zihan Xu, Jialei Xie, Carter Adams
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06624
Pdf URL: https://arxiv.org/pdf/2508.06624
Copy Paste: [[2508.06624]] VL-MedGuide: A Visual-Linguistic Large Model for Intelligent and Explainable Skin Disease Auxiliary Diagnosis(https://arxiv.org/abs/2508.06624)
Keywords: interpretability
Abstract: Accurate diagnosis of skin diseases remains a significant challenge due to the complex and diverse visual features present in dermatoscopic images, often compounded by a lack of interpretability in existing purely visual diagnostic models. To address these limitations, this study introduces VL-MedGuide (Visual-Linguistic Medical Guide), a novel framework leveraging the powerful multi-modal understanding and reasoning capabilities of Visual-Language Large Models (LVLMs) for intelligent and inherently interpretable auxiliary diagnosis of skin conditions. VL-MedGuide operates in two interconnected stages: a Multi-modal Concept Perception Module, which identifies and linguistically describes dermatologically relevant visual features through sophisticated prompt engineering, and an Explainable Disease Reasoning Module, which integrates these concepts with raw visual information via Chain-of-Thought prompting to provide precise disease diagnoses alongside transparent rationales. Comprehensive experiments on the Derm7pt dataset demonstrate that VL-MedGuide achieves state-of-the-art performance in both disease diagnosis (83.55% BACC, 80.12% F1) and concept detection (76.10% BACC, 67.45% F1), surpassing existing baselines. Furthermore, human evaluations confirm the high clarity, completeness, and trustworthiness of its generated explanations, bridging the gap between AI performance and clinical utility by offering actionable, explainable insights for dermatological practice.

Title: CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation

Authors: Shilong Zou, Yuhang Huang, Renjiao Yi, Chenyang Zhu, Kai Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06625
Pdf URL: https://arxiv.org/pdf/2508.06625
Copy Paste: [[2508.06625]] CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation(https://arxiv.org/abs/2508.06625)
Keywords: diffusion, generative
Abstract: We introduce a diffusion-based cross-domain image translator in the absence of paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing for more coverable modeling of the data distribution and performance improvement of the cross-domain translation. However, incorporating the translation process within the diffusion process is still challenging since the two processes are not aligned exactly, i.e., the diffusion process is applied to the noisy signal while the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, yet this may trap the translation optimization in local minima, constraining the effectiveness of diffusion models. To address the problem, we propose a novel joint learning framework that aligns the diffusion and the translation process, thereby improving the global optimality. Specifically, we propose to extract the image components with diffusion models to represent the clean signal and employ the translation process with the image components, enabling end-to-end joint learning. In addition, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance improvement. Benefiting from the design of joint learning, our method enables global optimization of both processes, enhancing the optimality and achieving improved fidelity and structural consistency. We have conducted extensive experiments on RGB$\leftrightarrow$RGB and diverse cross-modality translation tasks including RGB$\leftrightarrow$Edge, RGB$\leftrightarrow$Semantics and RGB$\leftrightarrow$Depth, showcasing better generative performance than the state of the art.

Title: Using Imperfect Synthetic Data in Downstream Inference Tasks

Authors: Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2508.06635
Pdf URL: https://arxiv.org/pdf/2508.06635
Copy Paste: [[2508.06635]] Using Imperfect Synthetic Data in Downstream Inference Tasks(https://arxiv.org/abs/2508.06635)
Keywords: large language model
Abstract: Predictions and generations from large language models are increasingly being explored as an aid to computational social science and human subject research in limited data regimes. While previous technical work has explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (also termed synthetic simulations), such as in responses to surveys. However, it is not immediately clear how practitioners can combine such data with real data and still draw statistically valid conclusions. In this work, we introduce a new estimator based on the generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address the challenge at hand. Surprisingly, we find that interactions between the moment residuals of synthetic data and those of real data can improve estimates of the target parameter. We empirically validate the finite-sample performance of our estimator across different regression tasks in computational social science applications, demonstrating large empirical gains.

Title: Segmented Confidence Sequences and Multi-Scale Adaptive Confidence Segments for Anomaly Detection in Nonstationary Time Series

Authors: Muyan Anna Li, Aditi Gautam
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06638
Pdf URL: https://arxiv.org/pdf/2508.06638
Copy Paste: [[2508.06638]] Segmented Confidence Sequences and Multi-Scale Adaptive Confidence Segments for Anomaly Detection in Nonstationary Time Series(https://arxiv.org/abs/2508.06638)
Keywords: robust, segmentation
Abstract: As time series data become increasingly prevalent in domains such as manufacturing, IT, and infrastructure monitoring, anomaly detection must adapt to nonstationary environments where statistical properties shift over time. Traditional static thresholds are easily rendered obsolete by regime shifts, concept drift, or multi-scale changes. To address these challenges, we introduce and empirically evaluate two novel adaptive thresholding frameworks: Segmented Confidence Sequences (SCS) and Multi-Scale Adaptive Confidence Segments (MACS). Both leverage statistical online learning and segmentation principles for local, contextually sensitive adaptation, maintaining guarantees on false alarm rates even under evolving distributions. Our experiments across Wafer Manufacturing benchmark datasets show significant F1-score improvement compared to traditional percentile and rolling quantile approaches. This work demonstrates that robust, statistically principled adaptive thresholds enable reliable, interpretable, and timely detection of diverse real-world anomalies.
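
For orientation, the rolling quantile baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' SCS or MACS method; the function name, window size, and quantile level are assumptions chosen for the example.

```python
from collections import deque


def rolling_quantile_flags(series, window=20, q=0.95):
    """Flag points that exceed the q-quantile of the previous `window` values.

    Hypothetical baseline sketch: the threshold adapts only through the
    sliding window, with no segmentation or confidence-sequence guarantees.
    """
    history = deque(maxlen=window)
    flags = []
    for value in series:
        if len(history) == window:
            ordered = sorted(history)
            # Empirical quantile taken as an order statistic of the window.
            threshold = ordered[min(window - 1, int(q * window))]
            flags.append(value > threshold)
        else:
            # Not enough history yet, so no decision is made.
            flags.append(False)
        history.append(value)
    return flags


# A flat signal with a single spike at index 30.
signal = [1.0] * 30 + [10.0] + [1.0] * 5
flags = rolling_quantile_flags(signal)
```

Once the spike enters the window it inflates the threshold for the next `window` points, which is one way a fixed-window rule can lag behind regime changes.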

Title: Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors

Authors: Zheyuan Zhang, Weihao Tang, Hong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06640
Pdf URL: https://arxiv.org/pdf/2508.06640
Copy Paste: [[2508.06640]] Rethinking Key-frame-based Micro-expression Recognition: A Robust and Accurate Framework Against Key-frame Errors(https://arxiv.org/abs/2508.06640)
Keywords: robust
Abstract: Micro-expression recognition (MER) is a highly challenging task in affective computing. With the reduced-sized micro-expression (ME) input that contains key information based on key-frame indexes, key-frame-based methods have significantly improved the performance of MER. However, most of these methods focus on improving the performance with relatively accurate key-frame indexes, while ignoring the difficulty of obtaining accurate key-frame indexes and the objective existence of key-frame index errors, which impedes them from moving towards practical applications. In this paper, we propose CausalNet, a novel framework to achieve robust MER facing key-frame index errors while maintaining accurate recognition. To enhance robustness, CausalNet takes the representation of the entire ME sequence as the input. To address the information redundancy brought by the complete ME range input and maintain accurate recognition, first, the Causal Motion Position Learning Module (CMPLM) is proposed to help the model locate the muscle movement areas related to Action Units (AUs), thereby reducing the attention to other redundant areas. Second, the Causal Attention Block (CAB) is proposed to deeply learn the causal relationships between the muscle contraction and relaxation movements in MEs. Empirical experiments have demonstrated that on popular ME benchmarks, the CausalNet has achieved robust MER under different levels of key-frame index noise. Meanwhile, it has surpassed state-of-the-art (SOTA) methods on several standard MER benchmarks when using the provided annotated key-frames. Code is available at this https URL.

Title: Fractal Language Modelling by Universal Sequence Maps (USM)

Authors: Jonas S Almeida, Daniel E Russ, Susana Vinga, Ines Duarte, Lee Mason, Praphulla Bhawsar, Aaron Ge, Arlindo Oliveira, Jeya Balaji Balasubramanian
Subjects: cs.LG, cs.AI, math.NA, q-bio.QM
Abstract URL: https://arxiv.org/abs/2508.06641
Pdf URL: https://arxiv.org/pdf/2508.06641
Copy Paste: [[2508.06641]] Fractal Language Modelling by Universal Sequence Maps (USM)(https://arxiv.org/abs/2508.06641)
Keywords: transformer
Abstract: Motivation: With the advent of Language Models using Transformers, popularized by ChatGPT, there is a renewed interest in exploring encoding procedures that numerically represent symbolic sequences at multiple scales and embedding dimensions. The challenge that encoding addresses is the need for mechanisms that uniquely retain contextual information about the succession of individual symbols, which can then be modeled by nonlinear formulations such as neural networks. Context: Universal Sequence Maps (USM) are iterated functions that bijectively encode symbolic sequences onto embedded numerical spaces. USM is composed of two Chaos Game Representations (CGR), iterated forwardly and backwardly, that can be projected into the frequency domain (FCGR). The corresponding USM coordinates can be used to compute a Chebyshev distance metric as well as k-mer frequencies, without having to recompute the embedded numeric coordinates, and, paradoxically, allowing for non-integer values of k. Results: This report advances the bijective fractal encoding by Universal Sequence Maps (USM) by resolving seeding biases affecting the iterated process. The resolution had two results, the first expected, the second an intriguing outcome: 1) full reconciliation of numeric positioning with sequence identity; and 2) uncovering the nature of USM as an efficient numeric process converging towards a steady state sequence embedding solution. We illustrate these results for genomic sequences because of the convenience of a planar representation defined by an alphabet with only 4 tokens (the 4 nucleotides). Nevertheless, the application to alphabets of arbitrary cardinality was found to be straightforward.
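
The forward Chaos Game Representation mentioned in the abstract can be illustrated with a small sketch. It covers only the classic forward CGR iteration for a 4-letter alphabet, not the full bidirectional USM or the seeding fix the report describes; the corner assignment and the midpoint seed `(0.5, 0.5)` are assumptions made for the example.

```python
# Corners of the unit square assigned to nucleotides (an assumed convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}
SYMBOL_AT = {corner: symbol for symbol, corner in CORNERS.items()}


def cgr_forward(sequence, seed=(0.5, 0.5)):
    """Iterate the chaos game: move halfway toward the corner of each symbol."""
    x, y = seed
    coords = []
    for symbol in sequence:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        coords.append((x, y))
    return coords


def decode_suffix(point, n):
    """Recover the last n symbols from one coordinate by undoing the halving."""
    x, y = point
    symbols = []
    for _ in range(n):
        # The half of the square a point lies in identifies the last corner used.
        corner = (1.0 if x > 0.5 else 0.0, 1.0 if y > 0.5 else 0.0)
        symbols.append(SYMBOL_AT[corner])
        x, y = 2.0 * x - corner[0], 2.0 * y - corner[1]
    return "".join(reversed(symbols))


coords = cgr_forward("GATTACA")
recovered = decode_suffix(coords[-1], 7)
```

The round trip shows why the encoding is described as bijective: a single coordinate retains the preceding context, up to floating-point precision for long sequences.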

Title: Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN

Authors: Andrey Sidorenko, Paul Tiwald
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06647
Pdf URL: https://arxiv.org/pdf/2508.06647
Copy Paste: [[2508.06647]] Privacy-Preserving Tabular Synthetic Data Generation Using TabularARGN(https://arxiv.org/abs/2508.06647)
Keywords: secure, privacy, attack, robust, generative
Abstract: Synthetic data generation has become essential for securely sharing and analyzing sensitive data sets. Traditional anonymization techniques, however, often fail to adequately preserve privacy. We introduce the Tabular Auto-Regressive Generative Network (TabularARGN), a neural network architecture specifically designed for generating high-quality synthetic tabular data. Using a discretization-based auto-regressive approach, TabularARGN achieves high data fidelity while remaining computationally efficient. We evaluate TabularARGN against existing synthetic data generation methods, showing competitive results in statistical similarity, machine learning utility, and detection robustness. We further perform an in-depth privacy evaluation using systematic membership-inference attacks, highlighting the robustness and effective privacy-utility balance of our approach.

Title: Measuring Stereotype and Deviation Biases in Large Language Models

Authors: Daniel Wang, Eli Brignac, Minjia Mao, Xiao Fang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06649
Pdf URL: https://arxiv.org/pdf/2508.06649
Copy Paste: [[2508.06649]] Measuring Stereotype and Deviation Biases in Large Language Models(https://arxiv.org/abs/2508.06649)
Keywords: large language model
Abstract: Large language models (LLMs) are widely applied across diverse domains, raising concerns about their limitations and potential risks. In this study, we investigate two types of bias that LLMs may display: stereotype bias and deviation bias. Stereotype bias refers to when LLMs consistently associate specific traits with a particular demographic group. Deviation bias reflects the disparity between the demographic distributions extracted from LLM-generated content and real-world demographic distributions. By asking four advanced LLMs to generate profiles of individuals, we examine the associations between each demographic group and attributes such as political affiliation, religion, and sexual orientation. Our experimental results show that all examined LLMs exhibit both significant stereotype bias and deviation bias towards multiple groups. Our findings uncover the biases that occur when LLMs infer user attributes and shed light on the potential harms of LLM-generated outputs.

Title: Towards Robust Red-Green Watermarking for Autoregressive Image Generators

Authors: Denis Lukovnikov, Andreas Müller, Erwin Quiring, Asja Fischer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06656
Pdf URL: https://arxiv.org/pdf/2508.06656
Copy Paste: [[2508.06656]] Towards Robust Red-Green Watermarking for Autoregressive Image Generators(https://arxiv.org/abs/2508.06656)
Keywords: attack, robust, watermark, diffusion, large language model
Abstract: In-generation watermarking for detecting and attributing generated content has recently been explored for latent diffusion models (LDMs), demonstrating high robustness. However, the use of in-generation watermarks in autoregressive (AR) image models has not been explored yet. AR models generate images by autoregressively predicting a sequence of visual tokens that are then decoded into pixels using a vector-quantized decoder. Inspired by red-green watermarks for large language models, we examine token-level watermarking schemes that bias the next-token prediction based on prior tokens. We find that a direct transfer of these schemes works in principle, but the detectability of the watermarks decreases considerably under common image perturbations. As a remedy, we propose two novel watermarking methods that rely on visual token clustering to assign similar tokens to the same set. Firstly, we investigate a training-free approach that relies on a cluster lookup table, and secondly, we finetune VAE encoders to predict token clusters directly from perturbed images. Overall, our experiments show that cluster-level watermarks improve robustness against perturbations and regeneration attacks while preserving image quality. Cluster classification further boosts watermark detectability, outperforming a set of baselines. Moreover, our methods offer fast verification runtime, comparable to lightweight post-hoc watermarking methods.
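
The token-level red-green scheme that the abstract takes as its starting point can be sketched generically. This is not the paper's cluster-based method: the hash-based green test, the key, the uniform placeholder logits, and all parameter values below are assumptions for illustration.

```python
import hashlib
import math
import random


def is_green(prev_token, token, key="demo-key", gamma=0.5):
    """Pseudo-randomly mark `token` as green given the previous token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def sample_watermarked(logits, prev_token, delta, rng):
    """Add `delta` to green-token logits, then sample from the softmax."""
    biased = [l + delta if is_green(prev_token, t) else l
              for t, l in enumerate(logits)]
    top = max(biased)
    weights = [math.exp(b - top) for b in biased]
    return rng.choices(range(len(logits)), weights=weights)[0]


def generate(n, vocab_size, delta, seed=0):
    """Generate n tokens; uniform logits stand in for a real model."""
    rng = random.Random(seed)
    tokens = [0]
    for _ in range(n):
        logits = [0.0] * vocab_size  # placeholder, not a trained generator
        tokens.append(sample_watermarked(logits, tokens[-1], delta, rng))
    return tokens


def z_score(tokens, gamma=0.5):
    """One-proportion z-test on the number of green transitions."""
    n = len(tokens) - 1
    green = sum(is_green(p, t, gamma=gamma) for p, t in zip(tokens, tokens[1:]))
    return (green - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


z_marked = z_score(generate(400, 64, delta=4.0))
z_plain = z_score(generate(400, 64, delta=0.0))
```

Detection depends on recovering the exact tokens, so a perturbation that changes which visual tokens are decoded erodes the green count; that fragility is the motivation the abstract gives for grouping similar tokens into the same set.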

Title: Testing the Limits of Machine Translation from One Book

Authors: Jonathan Shaw, Dillon Mee, Timothy Khouw, Zackary Leech, Daniel Wilson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06665
Pdf URL: https://arxiv.org/pdf/2508.06665
Copy Paste: [[2508.06665]] Testing the Limits of Machine Translation from One Book(https://arxiv.org/abs/2508.06665)
Keywords: large language model
Abstract: Current state-of-the-art models demonstrate the capacity to leverage in-context learning to translate into previously unseen language contexts. Tanzer et al. [2024] utilize language materials (e.g. a grammar) to improve translation quality for Kalamang using large language models (LLMs). We focus on Kanuri, a language that, despite having a substantial speaker population, has minimal digital resources. We design two datasets for evaluation: one focused on health and humanitarian terms, and another containing generalized terminology, investigating how domain-specific tasks impact LLM translation quality. By providing different combinations of language resources (grammar, dictionary, and parallel sentences), we measure LLM translation effectiveness, comparing results to native speaker translations and human linguist performance. We evaluate using both automatic metrics and native speaker assessments of fluency and accuracy. Results demonstrate that parallel sentences remain the most effective data source, outperforming other methods in human evaluations and automatic metrics. While incorporating grammar improves over zero-shot translation, it fails as an effective standalone data source. Human evaluations reveal that LLMs achieve accuracy (meaning) more effectively than fluency (grammaticality). These findings suggest LLM translation evaluation benefits from multidimensional assessment beyond simple accuracy metrics, and that grammar alone, without parallel sentences, does not provide sufficient context for effective domain-specific translation.

Title: Do Biased Models Have Biased Thoughts?

Authors: Swati Rajwal, Shivank Garg, Reem Abdel-Salam, Abdelrahman Zayed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06671
Pdf URL: https://arxiv.org/pdf/2508.06671
Copy Paste: [[2508.06671]] Do Biased Models Have Biased Thoughts?(https://arxiv.org/abs/2508.06671)
Keywords: fair, large language model
Abstract: The impressive performance of language models is undeniable. However, the presence of biases based on gender, race, socio-economic status, physical appearance, and sexual orientation makes the deployment of language models challenging. This paper studies the effect of chain-of-thought prompting, a recent approach that studies the steps followed by the model before it responds, on fairness. More specifically, we ask the following question: Do biased models have biased thoughts? To answer our question, we conduct experiments on $5$ popular large language models using fairness metrics to quantify $11$ different biases in the model's thoughts and output. Our results show that the bias in the thinking steps is not highly correlated with the output bias (less than $0.6$ correlation with a $p$-value smaller than $0.001$ in most cases). In other words, unlike human beings, the tested models with biased decisions do not always possess biased thoughts.

Title: Watermarking Kolmogorov-Arnold Networks for Emerging Networked Applications via Activation Perturbation

Authors: Chia-Hsun Lu, Guan-Jhih Wu, Ya-Chi Ho, Chih-Ya Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06676
Pdf URL: https://arxiv.org/pdf/2508.06676
Copy Paste: [[2508.06676]] Watermarking Kolmogorov-Arnold Networks for Emerging Networked Applications via Activation Perturbation(https://arxiv.org/abs/2508.06676)
Keywords: protect, attack, robust, watermark
Abstract: With the increasing importance of protecting intellectual property in machine learning, watermarking techniques have gained significant attention. As advanced models are increasingly deployed in domains such as social network analysis, the need for robust model protection becomes even more critical. While existing watermarking methods have demonstrated effectiveness for conventional deep neural networks, they often fail to adapt to the novel architecture, Kolmogorov-Arnold Networks (KAN), which feature learnable activation functions. KAN holds strong potential for modeling complex relationships in network-structured data. However, their unique design also introduces new challenges for watermarking. Therefore, we propose a novel watermarking method, Discrete Cosine Transform-based Activation Watermarking (DCT-AW), tailored for KAN. Leveraging the learnable activation functions of KAN, our method embeds watermarks by perturbing activation outputs using discrete cosine transform, ensuring compatibility with diverse tasks and achieving task independence. Experimental results demonstrate that DCT-AW has a small impact on model performance and provides superior robustness against various watermark removal attacks, including fine-tuning, pruning, and retraining after pruning.

Title: Stabilizing Federated Learning under Extreme Heterogeneity with HeteRo-Select

Authors: Md. Akmol Masud, Md Abrar Jahin, Mahmud Hasan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06692
Pdf URL: https://arxiv.org/pdf/2508.06692
Copy Paste: [[2508.06692]] Stabilizing Federated Learning under Extreme Heterogeneity with HeteRo-Select(https://arxiv.org/abs/2508.06692)
Keywords: federate, fair
Abstract: Federated Learning (FL) is a machine learning technique that often suffers from training instability due to the diverse nature of client data. Although utility-based client selection methods like Oort are used to converge by prioritizing high-loss clients, they frequently experience significant drops in accuracy during later stages of training. We propose a theoretical HeteRo-Select framework designed to maintain high performance and ensure long-term training stability. We provide a theoretical analysis showing that when client data is very different (high heterogeneity), choosing a smart subset of client participation can reduce communication more effectively compared to full participation. Our HeteRo-Select method uses a clear, step-by-step scoring system that considers client usefulness, fairness, update speed, and data variety. It also shows convergence guarantees under strong regularization. Our experimental results on the CIFAR-10 dataset under significant label skew ($\alpha=0.1$) support the theoretical findings. The HeteRo-Select method performs better than existing approaches in terms of peak accuracy, final accuracy, and training stability. Specifically, HeteRo-Select achieves a peak accuracy of $74.75\%$, a final accuracy of $72.76\%$, and a minimal stability drop of $1.99\%$. In contrast, Oort records a lower peak accuracy of $73.98\%$, a final accuracy of $71.25\%$, and a larger stability drop of $2.73\%$. The theoretical foundations and empirical performance in our study make HeteRo-Select a reliable solution for real-world heterogeneous FL problems.
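
The reported figures are mutually consistent if the "stability drop" is read as peak accuracy minus final accuracy. That definition is an assumption inferred from the numbers, not stated in the abstract; the short check below reproduces both reported drops under it.

```python
def stability_drop(peak_accuracy, final_accuracy):
    """Assumed definition: percentage points lost between peak and final accuracy."""
    return round(peak_accuracy - final_accuracy, 2)


# Accuracies (in percent) as reported in the abstract.
hetero_select_drop = stability_drop(74.75, 72.76)
oort_drop = stability_drop(73.98, 71.25)
```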

Title: Learning More by Seeing Less: Line Drawing Pretraining for Efficient, Transferable, and Human-Aligned Vision

Authors: Tianqin Li, George Liu, Tai Sing Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06696
Pdf URL: https://arxiv.org/pdf/2508.06696
Copy Paste: [[2508.06696]] Learning More by Seeing Less: Line Drawing Pretraining for Efficient, Transferable, and Human-Aligned Vision(https://arxiv.org/abs/2508.06696)
Keywords: robust, segmentation
Abstract: Despite remarkable progress in computer vision, modern recognition systems remain limited by their dependence on rich, redundant visual inputs. In contrast, humans can effortlessly understand sparse, minimal representations like line drawings - suggesting that structure, rather than appearance, underlies efficient visual understanding. In this work, we propose using line drawings as a structure-first pretraining modality to induce more compact and generalizable visual representations. We show that models pretrained on line drawings develop stronger shape bias, more focused attention, and greater data efficiency across classification, detection, and segmentation tasks. Notably, these models also exhibit lower intrinsic dimensionality, requiring significantly fewer principal components to capture representational variance - echoing similar observations of low-dimensional, efficient representations in the brain. Beyond performance improvements, line drawing pretraining produces more compressible representations, enabling better distillation into lightweight student models. Students distilled from line-pretrained teachers consistently outperform those trained from color-supervised teachers, highlighting the benefits of structurally compact knowledge. Finally, we demonstrate that line-drawing pretraining can also be extended to the unsupervised setting via our proposed method "learning to draw". Together, our results support the view that structure-first visual learning fosters efficiency, generalization, and human-aligned inductive biases - offering a simple yet powerful strategy for building more robust and adaptable vision systems.

Title: MMFformer: Multimodal Fusion Transformer Network for Depression Detection

Authors: Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Subjects: cs.CV, cs.AI, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2508.06701
Pdf URL: https://arxiv.org/pdf/2508.06701
Copy Paste: [[2508.06701]] MMFformer: Multimodal Fusion Transformer Network for Depression Detection(https://arxiv.org/abs/2508.06701)
Keywords: extraction, transformer
Abstract: Depression is a serious mental health illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult, as it is based primarily on subjective evaluations during clinical interviews. Hence, early diagnosis of depression from social network content has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to retrieve depressive spatio-temporal high-level patterns from multimodal social media information. The transformer network with residual connections captures spatial features from videos, and a transformer encoder is used to model important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results clearly reveal that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is made available publicly at this https URL.

Title: Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge

Authors: Evangelia Spiliopoulou, Riccardo Fogliato, Hanna Burnsky, Tamer Soliman, Jie Ma, Graham Horwood, Miguel Ballesteros
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06709
Pdf URL: https://arxiv.org/pdf/2508.06709
Copy Paste: [[2508.06709]] Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge(https://arxiv.org/abs/2508.06709)
Keywords: large language model
Abstract: Large language models (LLMs) can serve as judges that offer rapid and reliable assessments of other LLM outputs. However, models may systematically assign overly favorable ratings to their own outputs, a phenomenon known as self-bias, which can distort evaluations of true model performance. Previous studies often conflate genuine differences in model quality with bias or incorrectly assume that evaluations from LLMs and humans follow the same rating distributions. In this work, we present a statistical framework that explicitly formalizes assumptions under which self-bias can be identified and estimated. Our method models the difference in the scoring distribution that LLM-as-a-judge assigns to its own completions compared to other models, while accounting for the underlying quality of the completions provided by an independent, third-party judge (e.g., humans). Our method reliably isolates and quantifies self-bias, even when models vary in ability, ensuring that genuine performance differences are not mistaken for self-bias. We conduct an empirical analysis of self-bias on a large dataset (>5000 prompt-completion pairs) consisting of expert human annotations and judgments from nine different LLM judges. We find that some models, such as GPT-4o and Claude 3.5 Sonnet, systematically assign higher scores to their own outputs. These models also display family-bias, systematically assigning higher ratings to outputs produced by other models of the same family. Our findings highlight potential pitfalls of using LLM judges and offer practical guidance to mitigate biases when interpreting automated evaluations.

Title: Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video

Authors: Jixuan He, Chieh Hubert Lin, Lu Qi, Ming-Hsuan Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06715
Pdf URL: https://arxiv.org/pdf/2508.06715
Copy Paste: [[2508.06715]] Restage4D: Reanimating Deformable 3D Reconstruction from a Single Video(https://arxiv.org/abs/2508.06715)
Keywords: generative
Abstract: Creating deformable 3D content has gained increasing attention with the rise of text-to-image and image-to-video generative models. While these models provide rich semantic priors for appearance, they struggle to capture the physical realism and motion dynamics needed for authentic 4D scene synthesis. In contrast, real-world videos can provide physically grounded geometry and articulation cues that are difficult to hallucinate. This raises a question: Can we generate physically consistent 4D content by leveraging the motion priors of the real-world video? In this work, we explore the task of reanimating deformable 3D scenes from a single video, using the original sequence as a supervisory signal to correct artifacts from synthetic motion. We introduce Restage4D, a geometry-preserving pipeline for video-conditioned 4D restaging. Our approach uses a video-rewinding training strategy to temporally bridge a real base video and a synthetic driving video via a shared motion representation. We further incorporate an occlusion-aware rigidity loss and a disocclusion backtracing mechanism to improve structural and geometry consistency under challenging motion. We validate Restage4D on DAVIS and PointOdyssey, demonstrating improved geometry consistency, motion quality, and 3D tracking performance. Our method not only preserves deformable structure under novel motion, but also automatically corrects errors introduced by generative models, revealing the potential of video priors in the 4D restaging task. Source code and trained models will be released.

Title: Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis

Authors: Komala Subramanyam Cherukuri, Pranav Abishai Moses, Aisa Sakata, Jiangping Chen, Haihua Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06729
Pdf URL: https://arxiv.org/pdf/2508.06729
Copy Paste: [[2508.06729]] Large Language Models for Oral History Understanding with Text Classification and Sentiment Analysis(https://arxiv.org/abs/2508.06729)
Keywords: large language model
Abstract: Oral histories are vital records of lived experience, particularly within communities affected by systemic injustice and historical erasure. Effective and efficient analysis of these archives can promote access to and understanding of the oral histories. However, large-scale analysis of these archives remains limited due to their unstructured format, emotional complexity, and high annotation costs. This paper presents a scalable framework to automate semantic and sentiment annotation for Japanese American Incarceration Oral History. Using LLMs, we construct a high-quality dataset, evaluate multiple models, and test prompt engineering strategies in historically sensitive contexts. Our multiphase approach combines expert annotation, prompt design, and LLM evaluation with ChatGPT, Llama, and Qwen. We labeled 558 sentences from 15 narrators for sentiment and semantic classification, then evaluated zero-shot, few-shot, and RAG strategies. For semantic classification, ChatGPT achieved the highest F1 score (88.71%), followed by Llama (84.99%) and Qwen (83.72%). For sentiment analysis, Llama slightly outperformed Qwen (82.66%) and ChatGPT (82.29%), with all models showing comparable results. The best prompt configurations were used to annotate 92,191 sentences from 1,002 interviews in the JAIOH collection. Our findings show that LLMs can effectively perform semantic and sentiment annotation across large oral history collections when guided by well-designed prompts. This study provides a reusable annotation pipeline and practical guidance for applying LLMs in culturally sensitive archival analysis. By bridging archival ethics with scalable NLP techniques, this work lays the groundwork for responsible use of artificial intelligence in digital humanities and preservation of collective memory. GitHub: this https URL.

Title: Mitigating Distribution Shift in Graph-Based Android Malware Classification via Function Metadata and LLM Embeddings

Authors: Ngoc N. Tran, Anwar Said, Waseem Abbas, Tyler Derr, Xenofon D. Koutsoukos
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06734
Pdf URL: https://arxiv.org/pdf/2508.06734
Copy Paste: [[2508.06734]] Mitigating Distribution Shift in Graph-Based Android Malware Classification via Function Metadata and LLM Embeddings(https://arxiv.org/abs/2508.06734)
Keywords: robust, large language model
Abstract: Graph-based malware classifiers can achieve over 94% accuracy on standard Android datasets, yet we find they suffer accuracy drops of up to 45% when evaluated on previously unseen malware variants from the same family - a scenario where strong generalization would typically be expected. This highlights a key limitation in existing approaches: both the model architectures and their structure-only representations often fail to capture deeper semantic patterns. In this work, we propose a robust semantic enrichment framework that enhances function call graphs with contextual features, including function-level metadata and, when available, code embeddings derived from large language models. The framework is designed to operate under real-world constraints where feature availability is inconsistent, and supports flexible integration of semantic signals. To evaluate generalization under realistic domain and temporal shifts, we introduce two new benchmarks: MalNet-Tiny-Common and MalNet-Tiny-Distinct, constructed using malware family partitioning to simulate cross-family generalization and evolving threat behavior. Experiments across multiple graph neural network backbones show that our method improves classification performance by up to 8% under distribution shift and consistently enhances robustness when integrated with adaptation-based methods. These results offer a practical path toward building resilient malware detection systems in evolving threat environments.

Title: Analysis of Schedule-Free Nonconvex Optimization

Authors: Connor Brown
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06743
Pdf URL: https://arxiv.org/pdf/2508.06743
Copy Paste: [[2508.06743]] Analysis of Schedule-Free Nonconvex Optimization(https://arxiv.org/abs/2508.06743)
Keywords: robust
Abstract: First-order methods underpin most large-scale learning algorithms, yet their classical convergence guarantees hinge on carefully scheduled step-sizes that depend on the total horizon $T$, which is rarely known in advance. The Schedule-Free (SF) method promises optimal performance with hyperparameters that are independent of $T$ by interpolating between Polyak--Ruppert averaging and momentum, but nonconvex analysis of SF has been limited or reliant on strong global assumptions. We introduce a robust Lyapunov framework that, under only $L$-smoothness and lower-boundedness, reduces SF analysis to a single-step descent inequality. This yields horizon-agnostic bounds in the nonconvex setting: $O(1/\log T)$ for constant step + PR averaging, $O(\log T/T)$ for a linearly growing step-size, and a continuum of $O(T^{-(1-\alpha)})$ rates for polynomial averaging. We complement these proofs with Performance Estimation Problem (PEP) experiments that numerically validate our rates and suggest that our $O(1/\log T)$ bound on the original nonconvex SF algorithm may tighten to $O(1/T)$. Our work extends SF's horizon-free guarantees to smooth nonconvex optimization and charts future directions for optimal nonconvex rates.
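
To make the "interpolating between Polyak--Ruppert averaging and momentum" remark concrete, the following is a minimal sketch of the Schedule-Free SGD recursion as it is commonly stated; the abstract itself does not spell out the update. The toy objective `toy_grad`, the step size `gamma`, and the interpolation weight `beta` are illustrative assumptions and are not taken from the paper.

```python
import math


def schedule_free_sgd(grad, x0, gamma=0.1, beta=0.9, steps=2000):
    """Schedule-Free SGD sketch with uniform averaging weights c = 1/(t+1).

    z is the base gradient-descent sequence, x is the running average of z
    (the iterate that gets returned), and y is where the gradient is taken.
    beta = 0 evaluates gradients at z (Polyak-Ruppert averaging);
    beta = 1 evaluates them at the average x (primal averaging).
    """
    z = x0
    x = x0
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x   # interpolation point
        z = z - gamma * grad(y)           # base step, no horizon-dependent schedule
        c = 1.0 / (t + 1)                 # uniform averaging weight
        x = (1.0 - c) * x + c * z         # running average of the z sequence
    return x


def toy_grad(x):
    """Gradient of the assumed toy objective f(x) = x^2 + 3*sin(x)^2.

    f is smooth, bounded below by 0, and nonconvex; x = 0 is its only
    stationary point.
    """
    return 2.0 * x + 3.0 * math.sin(2.0 * x)


x_final = schedule_free_sgd(toy_grad, x0=2.0)
print("final iterate:", x_final, "gradient there:", toy_grad(x_final))
```

Note that neither `gamma` nor `beta` depends on `steps`, which is the horizon-free property the abstract refers to.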

Title: Many-Turn Jailbreaking

Authors: Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06755
Pdf URL: https://arxiv.org/pdf/2508.06755
Copy Paste: [[2508.06755]] Many-Turn Jailbreaking(https://arxiv.org/abs/2508.06755)
Keywords: large language model
Abstract: Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (first draft completed in June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.

Title: FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI

Authors: Somayeh Farahani, Marjaneh Hejazi, Antonio Di Ieva, Sidong Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06756
Pdf URL: https://arxiv.org/pdf/2508.06756
Copy Paste: [[2508.06756]] FoundBioNet: A Foundation-Based Model for IDH Genotyping of Glioma from Multi-Parametric MRI(https://arxiv.org/abs/2508.06756)
Keywords: interpretability
Abstract: Accurate, noninvasive detection of isocitrate dehydrogenase (IDH) mutation is essential for effective glioma management. Traditional methods rely on invasive tissue sampling, which may fail to capture a tumor's spatial heterogeneity. While deep learning models have shown promise in molecular profiling, their performance is often limited by scarce annotated data. In contrast, foundation deep learning models offer a more generalizable approach for glioma imaging biomarkers. We propose a Foundation-based Biomarker Network (FoundBioNet) that utilizes a SWIN-UNETR-based architecture to noninvasively predict IDH mutation status from multi-parametric MRI. Two key modules are incorporated: Tumor-Aware Feature Encoding (TAFE) for extracting multi-scale, tumor-focused features, and Cross-Modality Differential (CMD) for highlighting subtle T2-FLAIR mismatch signals associated with IDH mutation. The model was trained and validated on a diverse, multi-center cohort of 1705 glioma patients from six public datasets. Our model achieved AUCs of 90.58%, 88.08%, 65.41%, and 80.31% on independent test sets from EGD, TCGA, Ivy GAP, RHUH, and UPenn, consistently outperforming baseline approaches (p <= 0.05). Ablation studies confirmed that both the TAFE and CMD modules are essential for improving predictive accuracy. By integrating large-scale pretraining and task-specific fine-tuning, FoundBioNet enables generalizable glioma characterization. This approach enhances diagnostic accuracy and interpretability, with the potential to enable more personalized patient care.

Title: VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions

Authors: Yash Garg, Saketh Bachu, Arindam Dutta, Rohit Lal, Sarosij Bose, Calvin-Khang Ta, M. Salman Asif, Amit Roy-Chowdhury
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.06757
Pdf URL: https://arxiv.org/pdf/2508.06757
Copy Paste: [[2508.06757]] VOccl3D: A Video Benchmark Dataset for 3D Human Pose and Shape Estimation under real Occlusions(https://arxiv.org/abs/2508.06757)
Keywords: robust
Abstract: Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a Video-based human Occlusion dataset with 3D body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end HPS estimation system under occlusions. Overall, this dataset serves as a valuable resource for future research aimed at benchmarking methods designed to handle occlusions, offering a more realistic alternative to existing occlusion datasets. See the Project page for code and dataset: this https URL

Title: SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding

Authors: Zihao Sheng, Zilin Huang, Yen-Jung Chen, Yansong Qu, Yuhao Luo, Yue Leng, Sikai Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06763
Pdf URL: https://arxiv.org/pdf/2508.06763
Copy Paste: [[2508.06763]] SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding(https://arxiv.org/abs/2508.06763)
Keywords: large language model, segmentation
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: this https URL

Title: Fed MobiLLM: Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning

Authors: Xingke Yang, Liang Li, Sicong Li, Liwei Guan, Hao Wang, Xiaoqi Qi, Jiang Liu, Xin Fu, Miao Pan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06765
Pdf URL: https://arxiv.org/pdf/2508.06765
Copy Paste: [[2508.06765]] Fed MobiLLM: Efficient Federated LLM Fine-Tuning over Heterogeneous Mobile Devices via Server Assisted Side-Tuning(https://arxiv.org/abs/2508.06765)
Keywords: robust, federate, large language model
Abstract: Collaboratively fine-tuning (FT) large language models (LLMs) over heterogeneous mobile devices fosters immense potential applications of personalized intelligence. However, such a vision faces critical system challenges. Conventional federated LLM FT approaches place prohibitive computational and memory burdens on mobile hardware, and their synchronous model aggregation protocols stall for slower devices. In this paper, we propose Fed MobiLLM, a novel design to facilitate efficient federated LLM FT across mobile devices with diverse computing/communication speeds and local model architectures. In particular, Fed MobiLLM implements a pioneering server-assisted federated side-tuning paradigm. Briefly, mobile devices perform lightweight forward propagation computations on local data using their frozen pre-scaled backbone LLMs, and then upload selected intermediate activations. The server trains a shared side-network independently, eliminating client-side backpropagation and enabling asynchronous updates. To bridge model heterogeneity across different devices, we introduce an adaptive layer-wise feature alignment method, which ensures consistent representations for collaboratively tuning a shared side network. Extensive experimental results demonstrate that Fed MobiLLM can maintain robust fine-tuning performance while achieving extremely low on-device memory, with at least 95.2% reduction in computation overhead, 93.2% reduction in communication costs and 5.1x faster convergence compared to existing methods, validating its efficacy for practical LLM adaptation over heterogeneous mobile devices.

Title: PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems

Authors: Arman Dogru, R. Irem Bor-Yaliniz, Nimal Gamini Senarath
Subjects: cs.LG, cs.AI, cs.DC, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2508.06767
Pdf URL: https://arxiv.org/pdf/2508.06767
Copy Paste: [[2508.06767]] PANAMA: A Network-Aware MARL Framework for Multi-Agent Path Finding in Digital Twin Ecosystems(https://arxiv.org/abs/2508.06767)
Keywords: robust
Abstract: Digital Twins (DTs) are transforming industries through advanced data processing and analysis, positioning the world of DTs, Digital World, as a cornerstone of next-generation technologies including embodied AI. As robotics and automated systems scale, efficient data-sharing frameworks and robust algorithms become critical. We explore the pivotal role of data handling in next-gen networks, focusing on dynamics between application and network providers (AP/NP) in DT ecosystems. We introduce PANAMA, a novel algorithm with Priority Asymmetry for Network Aware Multi-agent Reinforcement Learning (MARL) based multi-agent path finding (MAPF). By adopting a Centralized Training with Decentralized Execution (CTDE) framework and asynchronous actor-learner architectures, PANAMA accelerates training while enabling autonomous task execution by embodied AI. Our approach demonstrates superior pathfinding performance in accuracy, speed, and scalability compared to existing benchmarks. Through simulations, we highlight optimized data-sharing strategies for scalable, automated systems, ensuring resilience in complex, real-world environments. PANAMA bridges the gap between network-aware decision-making and robust multi-agent coordination, advancing the synergy between DTs, wireless networks, and AI-driven automation.

Title: DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging

Authors: Noe Bertramo, Gabriel Duguey, Vivek Gopalakrishnan
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.06768
Pdf URL: https://arxiv.org/pdf/2508.06768
Copy Paste: [[2508.06768]] DiffUS: Differentiable Ultrasound Rendering from Volumetric Imaging(https://arxiv.org/abs/2508.06768)
Keywords: extraction
Abstract: Intraoperative ultrasound imaging provides real-time guidance during numerous surgical procedures, but its interpretation is complicated by noise, artifacts, and poor alignment with high-resolution preoperative MRI/CT scans. To bridge the gap between preoperative planning and intraoperative guidance, we present DiffUS, a physics-based, differentiable ultrasound renderer that synthesizes realistic B-mode images from volumetric imaging. DiffUS first converts MRI 3D scans into acoustic impedance volumes using a machine learning approach. Next, we simulate ultrasound beam propagation using ray tracing with coupled reflection-transmission equations. DiffUS formulates wave propagation as a sparse linear system that captures multiple internal reflections. Finally, we reconstruct B-mode images via depth-resolved echo extraction across fan-shaped acquisition geometry, incorporating realistic artifacts including speckle noise and depth-dependent degradation. DiffUS is entirely implemented as differentiable tensor operations in PyTorch, enabling gradient-based optimization for downstream applications such as slice-to-volume registration and volumetric reconstruction. Evaluation on the ReMIND dataset demonstrates DiffUS's ability to generate anatomically accurate ultrasound images from brain MRI data.

Title: Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift

Authors: Amit Pandey
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2508.06776
Pdf URL: https://arxiv.org/pdf/2508.06776
Copy Paste: [[2508.06776]] Zero-Direction Probing: A Linear-Algebraic Framework for Deep Analysis of Large-Language-Model Drift(https://arxiv.org/abs/2508.06776)
Keywords: transformer
Abstract: We present Zero-Direction Probing (ZDP), a theory-only framework for detecting model drift from null directions of transformer activations without task labels or output evaluations. Under assumptions A1--A6, we prove: (i) the Variance--Leak Theorem, (ii) Fisher Null-Conservation, (iii) a Rank--Leak bound for low-rank updates, and (iv) a logarithmic-regret guarantee for online null-space trackers. We derive a Spectral Null-Leakage (SNL) metric with non-asymptotic tail bounds and a concentration inequality, yielding a-priori thresholds for drift under a Gaussian null model. These results show that monitoring right/left null spaces of layer activations and their Fisher geometry provides concrete, testable guarantees on representational change.

Title: PROPS: Progressively Private Self-alignment of Large Language Models

Authors: Noel Teku, Fengwei Tian, Payel Bhattacharjee, Souradip Chakraborty, Amrit Singh Bedi, Ravi Tandon
Subjects: cs.LG, cs.AI, cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2508.06783
Pdf URL: https://arxiv.org/pdf/2508.06783
Copy Paste: [[2508.06783]] PROPS: Progressively Private Self-alignment of Large Language Models(https://arxiv.org/abs/2508.06783)
Keywords: privacy, large language model
Abstract: Alignment is a key step in developing Large Language Models (LLMs) using human feedback to ensure adherence to human values and societal norms. Dependence on human feedback raises privacy concerns about how much a labeler's preferences may reveal about their personal values, beliefs, and personality traits. Existing approaches, such as Differentially Private SGD (DP-SGD), provide rigorous privacy guarantees by privatizing gradients during fine-tuning and alignment but can provide more privacy than necessary as human preferences are tied only to labels of (prompt, response) pairs and can degrade model utility. This work focuses on LLM alignment with preference-level privacy, which preserves the privacy of preference labels provided by humans. We propose PROPS (PROgressively Private Self-alignment), a multi-stage privacy preserving alignment framework where privately aligned models in previous stages can serve as labelers for supplementing training data in the subsequent stages of alignment. We present theoretical guarantees for PROPS as well as comprehensive validation using multiple models (Pythia and GPT) and datasets (AlpacaEval, Anthropic HH-RLHF, truthy-dpo-v0.1) to demonstrate the utility of PROPS over existing methods while still providing high privacy. For the same privacy budget, alignment via PROPS can achieve up to 3x higher win-rates compared to DP-SGD, and 2.5x higher win-rates compared to Randomized Response (RR) based alignment.
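
As a point of reference for the Randomized Response (RR) baseline named above, here is a minimal sketch of classic binary randomized response applied to preference labels. It is not the PROPS procedure, whose details the abstract does not give. The label encoding (1 if the first response is preferred, 0 otherwise) and all function names are assumptions made for illustration.

```python
import math
import random


def keep_probability(epsilon):
    """Probability of reporting the true label under epsilon-label privacy."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(label, epsilon, rng):
    """Report a binary preference label truthfully with probability
    e^eps / (1 + e^eps), and flip it otherwise."""
    if rng.random() < keep_probability(epsilon):
        return label
    return 1 - label


def debiased_fraction(noisy_labels, epsilon):
    """Unbiased estimate of the true fraction of 1-labels (requires epsilon > 0).

    Uses E[noisy] = (2p - 1) * mu + (1 - p), where p is the keep probability.
    """
    p = keep_probability(epsilon)
    mean = sum(noisy_labels) / len(noisy_labels)
    return (mean - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
true_labels = [1] * 700 + [0] * 300
noisy = [randomized_response(y, 1.0, rng) for y in true_labels]
print("estimated fraction of 1-labels:", debiased_fraction(noisy, 1.0))
```

Only the labels are perturbed here; the (prompt, response) pairs are left untouched, which is the sense of "preference-level privacy" in the abstract.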

Title: Label Inference Attacks against Federated Unlearning

Authors: Wei Wang, Xiangyun Tang, Yajie Wang, Yijing Lin, Tao Zhang, Meng Shen, Dusit Niyato, Liehuang Zhu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.06789
Pdf URL: https://arxiv.org/pdf/2508.06789
Copy Paste: [[2508.06789]] Label Inference Attacks against Federated Unlearning(https://arxiv.org/abs/2508.06789)
Keywords: privacy, attack, federate
Abstract: Federated Unlearning (FU) has emerged as a promising solution to respond to the right to be forgotten of clients, by allowing clients to erase their data from global models without compromising model performance. Unfortunately, researchers find that the parameter variations of models induced by FU expose clients' data information, enabling attackers to infer the label of unlearning data, while label inference attacks against FU remain unexplored. In this paper, we introduce and analyze a new privacy threat against FU and propose a novel label inference attack, ULIA, which can infer unlearning data labels across three FU levels. To address the unique challenges of inferring labels via model variations, we design a gradient-label mapping mechanism in ULIA that establishes a relationship between gradient variations and unlearning labels, enabling label inference on accumulated model variations. We evaluate ULIA on both IID and non-IID settings. Experimental results show that in the IID setting, ULIA achieves a 100% Attack Success Rate (ASR) under both class-level and client-level unlearning. Even when only 1% of a user's local data is forgotten, ULIA still attains an ASR ranging from 93% to 62.3%.

Title: Towards Practical Data-Dependent Memory-Hard Functions with Optimal Sustained Space Trade-offs in the Parallel Random Oracle Model

Authors: Jeremiah Blocki, Blake Holman
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.06795
Pdf URL: https://arxiv.org/pdf/2508.06795
Copy Paste: [[2508.06795]] Towards Practical Data-Dependent Memory-Hard Functions with Optimal Sustained Space Trade-offs in the Parallel Random Oracle Model(https://arxiv.org/abs/2508.06795)
Keywords: protect, attack
Abstract: Memory-Hard Functions (MHF) are a useful cryptographic primitive to build egalitarian proofs-of-work and to help protect low entropy secrets (e.g., user passwords) against brute-force attacks. Ideally, we would like for a MHF to have the property that (1) an honest party can evaluate the function in sequential time $\Omega(N)$, and (2) any parallel party that evaluates the function is forced to lock up $\Omega(N)$ memory for $\Omega(N)$ sequential steps. Unfortunately, this goal is not quite achievable, so prior work of Blocki and Holman [BH22] focused on designing MHFs with strong tradeoff guarantees between sustained-space complexity (SSC) and cumulative memory costs (CMC). However, their theoretical construction is not suitable for practical deployment due to the reliance on expensive constructions of combinatorial graphs. Furthermore, there is no formal justification for the heuristic use of the dynamic pebbling game in MHF analysis so we cannot rule out the possibility that there are more efficient attacks in the Parallel Random Oracle Model (PROM). Towards the goal of developing a practical MHF with provably strong SSC/CMC tradeoffs we develop a new MHF called EGSample which does not rely on expensive combinatorial constructions like [BH22]. In the dynamic pebbling model, we prove equivalent SSC/CMC tradeoffs for EGSample, i.e., any dynamic pebbling strategy either (1) locks up $\Omega(N)$ memory for $\Omega(N)$ steps, or (2) incurs cumulative memory cost at least $\Omega(N^{3-\epsilon})$. We also develop new techniques to directly establish SSC/CMC tradeoffs in the parallel random oracle model. In particular, we prove that {\em any} PROM algorithm evaluating our MHF either (1) locks up $\Omega(N)$ blocks of memory for $\Omega(N)$ steps or (2) incurs cumulative memory cost at least $\Omega(N^{2.5-\epsilon})$.
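
The two cost measures traded off above can be illustrated on a small example. The sketch below uses the standard parallel black pebbling game on a fixed (static) DAG, which is a simplification: the paper works with dynamic, data-dependent graphs. The graph, the pebbling sequence, and all function names are hypothetical.

```python
def is_legal(parents, pebbling):
    """Check a parallel black pebbling: a node may receive a new pebble in a
    round only if all of its parents held pebbles in the previous round."""
    prev = set()
    for cur in pebbling:
        for node in cur - prev:
            if not set(parents.get(node, ())) <= prev:
                return False
        prev = cur
    return True


def cumulative_memory_cost(pebbling):
    """Sum of the number of pebbles on the graph over all rounds."""
    return sum(len(config) for config in pebbling)


def sustained_space(pebbling, s):
    """Number of rounds in which at least s pebbles are on the graph."""
    return sum(1 for config in pebbling if len(config) >= s)


# Hypothetical path graph 0 -> 1 -> 2 -> 3, given as node -> list of parents.
parents = {1: [0], 2: [1], 3: [2]}
pebbling = [{0}, {0, 1}, {1, 2}, {2, 3}]

print(is_legal(parents, pebbling))        # True
print(cumulative_memory_cost(pebbling))   # 7
print(sustained_space(pebbling, 2))       # 3
```

In these terms, the tradeoffs quoted in the abstract say that a strategy keeping `sustained_space` small must pay for it with a large `cumulative_memory_cost`.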

Title: Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities

Authors: Rui Liu, Haolin Zuo, Zheng Lian, Hongyu Yuan, Qi Fan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06800
Pdf URL: https://arxiv.org/pdf/2508.06800
Copy Paste: [[2508.06800]] Hardness-Aware Dynamic Curriculum Learning for Robust Multimodal Emotion Recognition with Missing Modalities(https://arxiv.org/abs/2508.06800)
Keywords: robust
Abstract: Missing modalities have recently emerged as a critical research direction in multimodal emotion recognition (MER). Conventional approaches typically address this issue through missing modality reconstruction. However, these methods fail to account for variations in reconstruction difficulty across different samples, consequently limiting the model's ability to handle hard samples effectively. To overcome this limitation, we propose a novel Hardness-Aware Dynamic Curriculum Learning framework, termed HARDY-MER. Our framework operates in two key stages: first, it estimates the hardness level of each sample, and second, it strategically emphasizes hard samples during training to enhance model performance on these challenging instances. Specifically, we first introduce a Multi-view Hardness Evaluation mechanism that quantifies reconstruction difficulty by considering both Direct Hardness (modality reconstruction errors) and Indirect Hardness (cross-modal mutual information). Meanwhile, we introduce a Retrieval-based Dynamic Curriculum Learning strategy that dynamically adjusts the training curriculum by retrieving samples with similar semantic information and balancing the learning focus between easy and hard instances. Extensive experiments on benchmark datasets demonstrate that HARDY-MER consistently outperforms existing methods in missing-modality scenarios. Our code will be made publicly available at this https URL.

Title: SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection

Authors: Ziqi Liu, Yangbin Chen, Ziyang Zhou, Yilin Li, Mingxuan Hu, Yushan Pan, Zhijie Xu
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2508.06803
Pdf URL: https://arxiv.org/pdf/2508.06803
Copy Paste: [[2508.06803]] SEVADE: Self-Evolving Multi-Agent Analysis with Decoupled Evaluation for Hallucination-Resistant Irony Detection(https://arxiv.org/abs/2508.06803)
Keywords: large language model
Abstract: Sarcasm detection is a crucial yet challenging Natural Language Processing task. Existing Large Language Model methods are often limited by single-perspective analysis, static reasoning pathways, and a susceptibility to hallucination when processing complex ironic rhetoric, which impacts their accuracy and reliability. To address these challenges, we propose **SEVADE**, a novel **S**elf-**Ev**olving multi-agent **A**nalysis framework with **D**ecoupled **E**valuation for hallucination-resistant sarcasm detection. The core of our framework is a Dynamic Agentive Reasoning Engine (DARE), which utilizes a team of specialized agents grounded in linguistic theory to perform a multifaceted deconstruction of the text and generate a structured reasoning chain. Subsequently, a separate lightweight rationale adjudicator (RA) performs the final classification based solely on this reasoning chain. This decoupled architecture is designed to mitigate the risk of hallucination by separating complex reasoning from the final judgment. Extensive experiments on four benchmark datasets demonstrate that our framework achieves state-of-the-art performance, with average improvements of **6.75%** in Accuracy and **6.29%** in Macro-F1 score.

Title: Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling

Authors: Aarav Mehta, Priya Deshmukh, Vikram Singh, Siddharth Malhotra, Krishnan Menon Iyer, Tanvi Iyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06805
Pdf URL: https://arxiv.org/pdf/2508.06805
Copy Paste: [[2508.06805]] Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling(https://arxiv.org/abs/2508.06805)
Keywords: segmentation
Abstract: Accurate localization of organ boundaries is critical in medical imaging for segmentation, registration, surgical planning, and radiotherapy. While deep convolutional networks (ConvNets) have advanced general-purpose edge detection to near-human performance on natural images, their outputs often lack precise localization, a limitation that is particularly harmful in medical applications where millimeter-level accuracy is required. Building on a systematic analysis of ConvNet edge outputs, we propose a medically focused crisp edge detector that adapts a novel top-down backward refinement architecture to medical images (2D and volumetric). Our method progressively upsamples and fuses high-level semantic features with fine-grained low-level cues through a backward refinement pathway, producing high-resolution, well-localized organ boundaries. We further extend the design to handle anisotropic volumes by combining 2D slice-wise refinement with light 3D context aggregation to retain computational efficiency. Evaluations on several CT and MRI organ datasets demonstrate substantially improved boundary localization under strict criteria (boundary F-measure, Hausdorff distance) compared to baseline ConvNet detectors and contemporary medical edge/contour methods. Importantly, integrating our crisp edge maps into downstream pipelines yields consistent gains in organ segmentation (higher Dice scores, lower boundary errors), more accurate image registration, and improved delineation of lesions near organ interfaces. The proposed approach produces clinically valuable, crisp organ edges that materially enhance common medical-imaging tasks.

Title: Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

Authors: Xiao Huang, Xu Liu, Enze Zhang, Tong Yu, Shuai Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06806
Pdf URL: https://arxiv.org/pdf/2508.06806
Copy Paste: [[2508.06806]] Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation(https://arxiv.org/abs/2508.06806)
Keywords: diffusion
Abstract: Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing work used offline datasets to generate data that conform to the online data distribution for data augmentation. However, generated data still exhibits a gap with the online data, limiting overall performance. To address this, we propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Without introducing additional classifier training overhead, CFDG leverages classifier-free guidance diffusion to significantly enhance the generation quality of offline and online data with different distributions. Additionally, it employs a reweighting method to enable more generated data to align with the online data, enhancing performance while maintaining the agent's stability. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By applying CFDG to the popular methods IQL, PEX, and APL, we achieve a notable 15% average improvement in empirical performance on D4RL benchmark tasks such as MuJoCo and AntMaze.
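
For readers unfamiliar with classifier-free guidance, the sketch below shows the generic rule for combining a conditional and an unconditional noise prediction at one denoising step. It illustrates the general technique only; how CFDG conditions on offline versus online data, and its reweighting step, are not described in the abstract and are not modeled here. The function name and the plain-list representation of the predictions are assumptions.

```python
def cfg_combine(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance in the form
        eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 gives the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates further toward the condition.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Hypothetical noise predictions for a two-dimensional sample.
eps_cond = [1.0, -2.0]
eps_uncond = [0.5, 0.0]
print(cfg_combine(eps_cond, eps_uncond, 2.0))
```

Some papers write the same rule as (1 + w) * eps_cond - w * eps_uncond, which shifts the meaning of the scale by one; the form used here is stated in the docstring.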

Title: Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems

Authors: Steven Coyne, Diana Galvan-Sosa, Ryan Spring, Camélia Guerraoui, Michael Zock, Keisuke Sakaguchi, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06810
Pdf URL: https://arxiv.org/pdf/2508.06810
Copy Paste: [[2508.06810]] Annotating Errors in English Learners' Written Language Production: Advancing Automated Written Feedback Systems(https://arxiv.org/abs/2508.06810)
Keywords: large language model
Abstract: Recent advances in natural language processing (NLP) have contributed to the development of automated writing evaluation (AWE) systems that can correct grammatical errors. However, while these systems are effective at improving text, they are not optimally designed for language learning. They favor direct revisions, often with a click-to-fix functionality that can be applied without considering the reason for the correction. Meanwhile, depending on the error type, learners may benefit most from simple explanations and strategically indirect hints, especially on generalizable grammatical rules. To support the generation of such feedback, we introduce an annotation framework that models each error's error type and generalizability. For error type classification, we introduce a typology focused on inferring learners' knowledge gaps by connecting their errors to specific grammatical patterns. Following this framework, we collect a dataset of annotated learner errors and corresponding human-written feedback comments, each labeled as a direct correction or hint. With this data, we evaluate keyword-guided, keyword-free, and template-guided methods of generating feedback using large language models (LLMs). Human teachers examined each system's outputs, assessing them on grounds including relevance, factuality, and comprehensibility. We report on the development of the dataset and the comparative performance of the systems investigated.

Title: Technical Report: Full-Stack Fine-Tuning for the Q Programming Language

Authors: Brendan R. Hogan, Will Brown, Adel Boyarsky, Anderson Schneider, Yuriy Nevmyvaka
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06813
Pdf URL: https://arxiv.org/pdf/2508.06813
Copy Paste: [[2508.06813]] Technical Report: Full-Stack Fine-Tuning for the Q Programming Language(https://arxiv.org/abs/2508.06813)
Keywords: large language model
Abstract: Even though large language models are becoming increasingly capable, it is still unreasonable to expect them to excel at tasks that are under-represented on the Internet. Leveraging LLMs for specialized applications, particularly in niche programming languages and private domains, remains challenging and largely unsolved. In this work, we address this gap by presenting a comprehensive, open-source approach for adapting LLMs to the Q programming language, a popular tool in quantitative finance that is much less present on the Internet compared to Python, C, Java, and other "mainstream" languages and is therefore not a strong suit of general-purpose AI models. We introduce a new Leetcode-style evaluation dataset for Q, benchmark major frontier models on the dataset, then do pretraining, supervised fine-tuning, and reinforcement learning to train a suite of reasoning and non-reasoning models based on the Qwen-2.5 series, spanning five parameter sizes (1.5B, 3B, 7B, 14B, 32B). Our best model achieves a pass@1 accuracy of 59 percent on our Q benchmark, surpassing the best-performing frontier model, Claude Opus-4, by 29.5 percent. Additionally, all models, even our 1.5B model, outperform GPT-4.1 on this task. In addition to releasing models, code, and data, we provide a detailed blueprint for dataset construction, model pretraining, supervised fine-tuning, and reinforcement learning. Our methodology is broadly applicable, and we discuss how these techniques can be extended to other tasks, including those where evaluation may rely on soft or subjective signals.
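
The abstract reports pass@1 accuracy without stating how it is estimated. As a minimal sketch, the snippet below uses the widely used unbiased pass@k estimator (n samples per problem, c of them passing); whether the report uses this exact estimator is an assumption, and the tallies are made-up illustrative values, not results from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem tallies: (samples drawn, samples that passed).
toy_results = [(4, 4), (4, 2), (4, 0), (4, 1)]
print(benchmark_pass_at_k(toy_results, k=1))
```

For k = 1 the estimator reduces to the fraction of passing samples per problem, averaged over problems.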

Title: DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation

Authors: Vikram Singh, Kabir Malhotra, Rohan Desai, Ananya Shankaracharya, Priyadarshini Chatterjee, Krishnan Menon Iyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06816
Pdf URL: https://arxiv.org/pdf/2508.06816
Copy Paste: [[2508.06816]] DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation(https://arxiv.org/abs/2508.06816)
Keywords: robust, segmentation
Abstract: Accurate segmentation of melanocytic tumors in dermoscopic images is a critical step for automated skin cancer screening and clinical decision support. Unlike natural scene segmentation, lesion delineation must reconcile subtle texture and color variations, frequent artifacts (hairs, rulers, bubbles), and a strong need for precise boundary localization to support downstream diagnosis. In this paper we introduce our method, a novel ResNet-inspired dual-resolution architecture specifically designed for melanocytic tumor segmentation. Our method maintains a full-resolution stream that preserves fine-grained boundary information while a complementary pooled stream aggregates multi-scale contextual cues for robust lesion recognition. The streams are tightly coupled by boundary-aware residual connections that inject high-frequency edge information into deep feature maps, and by a channel attention module that adapts color and texture sensitivity to dermoscopic appearance. To further address common imaging artifacts and the limited size of clinical datasets, we propose a lightweight artifact suppression block and a multi-task training objective that combines a Dice Tversky segmentation loss with an explicit boundary loss and a contrastive regularizer for feature stability. The combined design yields pixel-accurate masks without requiring heavy post-processing or complex pre-training protocols. Extensive experiments on public dermoscopic benchmarks demonstrate that our method significantly improves boundary adherence and clinically relevant segmentation metrics compared to standard encoder-decoder baselines, making it a practical building block for automated melanoma assessment systems.
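
The abstract names a Dice Tversky segmentation loss but does not give its form. The sketch below shows the standard Tversky index on flattened soft masks, which reduces to the Dice coefficient when alpha = beta = 0.5. The function names, default weights, and toy masks are assumptions for illustration; how the paper combines the Dice and Tversky terms is not stated.

```python
def tversky_index(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky index between a soft prediction and a binary target (flat lists).

    alpha weights false positives, beta weights false negatives.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def tversky_loss(pred, target, alpha=0.5, beta=0.5):
    """Loss form: zero for a perfect match, approaching one for no overlap."""
    return 1.0 - tversky_index(pred, target, alpha, beta)


def dice_coefficient(pred, target, eps=1e-7):
    """Dice coefficient, included to show the alpha = beta = 0.5 special case."""
    tp = sum(p * t for p, t in zip(pred, target))
    return (2 * tp + eps) / (sum(pred) + sum(target) + eps)


# Toy masks (illustrative only): one true positive, one false positive.
pred = [1.0, 1.0, 0.0, 0.0]
target = [1.0, 0.0, 0.0, 0.0]
print(tversky_index(pred, target), dice_coefficient(pred, target))
```

Raising beta above alpha penalizes missed lesion pixels more than spurious ones, which is the usual reason to prefer a Tversky term when foreground pixels are rare.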

Title: VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation

Authors: Ayaan Nooruddin Siddiqui, Mahnoor Zaidi, Ayesha Nazneen Shahbaz, Priyadarshini Chatterjee, Krishnan Menon Iyer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06819
Pdf URL: https://arxiv.org/pdf/2508.06819
Copy Paste: [[2508.06819]] VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation(https://arxiv.org/abs/2508.06819)
Keywords: segmentation
Abstract: Accurate segmentation of subcutaneous vessels from clinical images is hampered by scarce, expensive ground truth and by the low-contrast, noisy appearance of vessels across patients and modalities. We present a novel weakly supervised training framework tailored for subcutaneous vessel segmentation that leverages inexpensive sparse annotations (e.g., centerline traces, dot markers, or short scribbles). Sparse labels are expanded into dense, probabilistic supervision via a differentiable random walk label propagation model whose transition weights incorporate image-driven vesselness cues and tubular continuity priors. The propagation yields per-pixel hitting probabilities together with calibrated uncertainty estimates; these are incorporated into an uncertainty-weighted loss to avoid overfitting to ambiguous regions. Crucially, the label propagator is learned jointly with a CNN-based segmentation predictor, enabling the system to discover vessel edges and continuity constraints without explicit edge supervision. We further introduce a topology-aware regularizer that encourages centerline connectivity and penalizes spurious branches, improving clinical usability. In experiments on clinical subcutaneous imaging datasets, our method consistently outperforms naive training on sparse labels and conventional dense pseudo-labeling, producing more complete vascular maps and better calibrated uncertainty for downstream decision making. The approach substantially reduces annotation burden while preserving clinically relevant vessel topology.
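
To illustrate what per-pixel hitting probabilities mean, here is a minimal sketch of classical random walk label propagation on a 1D chain of pixels. It uses fixed, hand-set edge weights and plain Gauss-Seidel iteration; the learned, differentiable transition weights and the uncertainty estimates described in the abstract are not reproduced, and all names and values are hypothetical.

```python
def hitting_probabilities(weights, seeds, n_iters=500):
    """Probability that a random walk from each node reaches a vessel seed first.

    weights: edge weights on a chain; weights[i] links node i and node i + 1.
    seeds:   dict mapping node index to 1.0 (vessel) or 0.0 (background).
    """
    n = len(weights) + 1
    p = [seeds.get(i, 0.5) for i in range(n)]
    for _ in range(n_iters):
        for i in range(n):
            if i in seeds:
                continue  # seed values stay fixed
            left = weights[i - 1] if i > 0 else 0.0
            right = weights[i] if i < n - 1 else 0.0
            total = left + right
            if total == 0.0:
                continue  # isolated node, nothing to propagate
            acc = 0.0
            if i > 0:
                acc += left * p[i - 1]
            if i < n - 1:
                acc += right * p[i + 1]
            # Each unlabeled node becomes the weighted mean of its neighbors.
            p[i] = acc / total
    return p


# Uniform weights: probabilities fall off linearly between the two seeds.
uniform = hitting_probabilities([1.0, 1.0, 1.0, 1.0], {0: 1.0, 4: 0.0})
# A weak edge between nodes 2 and 3 acts like a vessel boundary.
with_edge = hitting_probabilities([1.0, 1.0, 0.01, 1.0], {0: 1.0, 4: 0.0})
print(uniform, with_edge)
```

A low weight on an edge makes the walk unlikely to cross it, so labels spread freely inside a region and stop at the boundary. This is why the choice of transition weights matters.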

Title: Who's the Evil Twin? Differential Auditing for Undesired Behavior

Authors: Ishwar Balappanawar, Venkata Hasith Vattikuti, Greta Kintzley, Ronan Azimi-Mancel, Satvik Golechha
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.06827
Pdf URL: https://arxiv.org/pdf/2508.06827
Copy Paste: [[2508.06827]] Who's the Evil Twin? Differential Auditing for Undesired Behavior(https://arxiv.org/abs/2508.06827)
Keywords: attack
Abstract: Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behaviour, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then be used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits.

Title: Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification

Authors: Taha Mustapha Nehdi, Nairouz Mrabah, Atif Belal, Marco Pedersoli, Eric Granger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06831
Pdf URL: https://arxiv.org/pdf/2508.06831
Copy Paste: [[2508.06831]] Low-Rank Expert Merging for Multi-Source Domain Adaptation in Person Re-Identification(https://arxiv.org/abs/2508.06831)
Keywords: robust
Abstract: Adapting person re-identification (reID) models to new target environments remains a challenging problem that is typically addressed using unsupervised domain adaptation (UDA) methods. Recent works show that when labeled data originates from several distinct sources (e.g., datasets and cameras), considering each source separately and applying multi-source domain adaptation (MSDA) typically yields higher accuracy and robustness compared to blending the sources and performing conventional UDA. However, state-of-the-art MSDA methods learn domain-specific backbone models or require access to source domain data during adaptation, resulting in significant growth in training parameters and computational cost. In this paper, a Source-free Adaptive Gated Experts (SAGE-reID) method is introduced for person reID. Our SAGE-reID is a cost-effective, source-free MSDA method that first trains individual source-specific low-rank adapters (LoRA) through source-free UDA. Next, a lightweight gating network is introduced and trained to dynamically assign optimal merging weights for fusion of LoRA experts, enabling effective cross-domain knowledge transfer. While the number of backbone parameters remains constant across source domains, LoRA experts scale linearly but remain negligible in size (<= 2% of the backbone), reducing both the memory consumption and risk of overfitting. Extensive experiments conducted on three challenging benchmarks: Market-1501, DukeMTMC-reID, and MSMT17 indicate that SAGE-reID outperforms state-of-the-art methods while being computationally efficient.
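
As a rough illustration of merging LoRA experts with gating weights, the sketch below adds a softmax-weighted sum of low-rank updates B_k A_k to a frozen base weight matrix. The gate logits are passed in directly, whereas the abstract describes a trained gating network; the matrices, names, and values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_lora_experts(base, experts, gate_logits):
    """Return base + sum_k g_k * (B_k @ A_k), with g = softmax(gate_logits).

    base:    d_out x d_in frozen backbone weight.
    experts: list of (B, A) pairs, B is d_out x r and A is r x d_in.
    """
    gates = softmax(gate_logits)
    merged = [[float(v) for v in row] for row in base]
    for g, (b, a) in zip(gates, experts):
        delta = matmul(b, a)  # low-rank update of one source-specific expert
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += g * v
    return merged


# Two rank-1 experts on a 2 x 2 layer (illustrative values only).
base = [[1, 0], [0, 1]]
experts = [([[1], [0]], [[0, 2]]), ([[0], [1]], [[4, 0]])]
merged = merge_lora_experts(base, experts, gate_logits=[0.0, 0.0])
print(merged)
```

Each expert stores only r * (d_out + d_in) values, which is how the adapter count can grow with the number of sources while the backbone stays fixed.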

Title: Towards Effective Prompt Stealing Attack against Text-to-Image Diffusion Models

Authors: Shiqian Zhao, Chong Wang, Yiming Li, Yihao Huang, Wenjie Qu, Siew-Kei Lam, Yi Xie, Kangjie Chen, Jie Zhang, Tianwei Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.06837
Pdf URL: https://arxiv.org/pdf/2508.06837
Copy Paste: [[2508.06837]] Towards Effective Prompt Stealing Attack against Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.06837)
Keywords: defense, attack, steal, diffusion
Abstract: Text-to-Image (T2I) models, represented by DALL·E and Midjourney, have gained huge popularity for creating realistic images. The quality of these images relies on the carefully engineered prompts, which have become valuable intellectual property. While skilled prompters showcase their AI-generated art on markets to attract buyers, this business incidentally exposes them to prompt stealing attacks. Existing state-of-the-art attack techniques reconstruct the prompts from a fixed set of modifiers (i.e., style descriptions) with model-specific training, which exhibit restricted adaptability and effectiveness to diverse showcases (i.e., target images) and diffusion models. To alleviate these limitations, we propose Prometheus, a training-free, proxy-in-the-loop, search-based prompt-stealing attack, which reverse-engineers the valuable prompts of the showcases by interacting with a local proxy model. It consists of three innovative designs. First, we introduce dynamic modifiers, as a supplement to static modifiers used in prior works. These dynamic modifiers provide more details specific to the showcases, and we exploit NLP analysis to generate them on the fly. Second, we design a contextual matching algorithm to sort both dynamic and static modifiers. This offline process helps reduce the search space of the subsequent step. Third, we interact with a local proxy model to invert the prompts with a greedy search algorithm. Based on the feedback guidance, we refine the prompt to achieve higher fidelity. The evaluation results show that Prometheus successfully extracts prompts from popular platforms like PromptBase and AIFrog against diverse victim models, including Midjourney, this http URL, and DALL·E, with an ASR improvement of 25.0%. We also validate that Prometheus is resistant to extensive potential defenses, further highlighting its severity in practice.

Title: Hybrid Machine Learning Framework for Predicting Geometric Deviations from 3D Surface Metrology

Authors: Hamidreza Samadi, Md Manjurul Ahsan, Shivakumar Raman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06845
Pdf URL: https://arxiv.org/pdf/2508.06845
Copy Paste: [[2508.06845]] Hybrid Machine Learning Framework for Predicting Geometric Deviations from 3D Surface Metrology(https://arxiv.org/abs/2508.06845)
Keywords: extraction
Abstract: This study addresses the challenge of accurately forecasting geometric deviations in manufactured components using advanced 3D surface analysis. Despite progress in modern manufacturing, maintaining dimensional precision remains difficult, particularly for complex geometries. We present a methodology that employs a high-resolution 3D scanner to acquire multi-angle surface data from 237 components produced across different batches. The data were processed through precise alignment, noise reduction, and merging techniques to generate accurate 3D representations. A hybrid machine learning framework was developed, combining convolutional neural networks for feature extraction with gradient-boosted decision trees for predictive modeling. The proposed system achieved a prediction accuracy of 0.012 mm at a 95% confidence level, representing a 73% improvement over conventional statistical process control methods. In addition to improved accuracy, the model revealed hidden correlations between manufacturing parameters and geometric deviations. This approach offers significant potential for automated quality control, predictive maintenance, and design optimization in precision manufacturing, and the resulting dataset provides a strong foundation for future predictive modeling research.

Title: A Joint Sparse Self-Representation Learning Method for Multiview Clustering

Authors: Mengxue Jia, Zhihua Allen-Zhao, You Zhao, Sanyang Liu
Subjects: cs.CV, cs.DS
Abstract URL: https://arxiv.org/abs/2508.06857
Pdf URL: https://arxiv.org/pdf/2508.06857
Copy Paste: [[2508.06857]] A Joint Sparse Self-Representation Learning Method for Multiview Clustering(https://arxiv.org/abs/2508.06857)
Keywords: extraction
Abstract: Multiview clustering (MC) aims to group samples using consistent and complementary information across various views. Subspace clustering, a fundamental technique of MC, has attracted significant attention. In this paper, we propose a novel joint sparse self-representation learning model for MC, whose distinguishing feature is the extraction of view-specific local information by introducing cardinality (i.e., $\ell_0$-norm) constraints instead of Graph-Laplacian regularization. Specifically, under each view, cardinality constraints directly restrict the samples used in the self-representation stage to extract reliable local and global structure information, while the low-rank constraint aids in revealing a global coherent structure in the consensus affinity matrix during merging. The attendant challenge is that Augmented Lagrange Method (ALM)-based alternating minimization algorithms cannot guarantee convergence when applied directly to our nonconvex, nonsmooth model, thus resulting in poor generalization ability. To address it, we develop an alternating quadratic penalty (AQP) method with global convergence, where two subproblems are iteratively solved by closed-form solutions. Empirical results on six standard datasets demonstrate the superiority of our model and AQP method, compared to eight state-of-the-art algorithms.

Title: VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

Authors: Jianxiang He, Shaoguang Wang, Weiyu Guo, Meisheng Hong, Jungang Li, Yijie Xu, Ziyang Chen, Hui Xiong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06869
Pdf URL: https://arxiv.org/pdf/2508.06869
Copy Paste: [[2508.06869]] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding(https://arxiv.org/abs/2508.06869)
Keywords: robust, large language model
Abstract: Long video understanding presents a significant challenge to multimodal large language models (MLLMs) primarily due to the immense data scale. A critical and widely adopted strategy for making this task computationally tractable is keyframe retrieval, which seeks to identify a sparse set of video frames that are most salient to a given textual query. However, the efficacy of this approach is hindered by weak multimodal alignment between textual queries and visual content, and it fails to capture the complex temporal semantic information required for precise reasoning. To address this, we propose Visual-Subtitle Integration (VSI), a multimodal keyframe search method that integrates subtitles, timestamps, and scene boundaries into a unified multimodal search process. The proposed method captures the visual information of video frames as well as the complementary textual information through a dual-stream search mechanism, using a Video Search Stream and a Subtitle Match Stream, respectively, and improves the keyframe search accuracy through the interaction of the two search streams. Experimental results show that VSI achieves 40.00% keyframe localization accuracy on the text-relevant subset of LongVideoBench and 68.48% accuracy on downstream long Video-QA tasks, surpassing competitive baselines by 20.35% and 15.79%, respectively. Furthermore, on LongVideoBench, VSI achieved state-of-the-art (SOTA) results in medium-to-long video-QA tasks, demonstrating the robustness and generalizability of the proposed multimodal search strategy.

Title: Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning

Authors: Aleksandar Todorov, Juan Cardenas-Cartagena, Rafael F. Cunha, Marco Zullich, Matthia Sabatelli
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06871
Pdf URL: https://arxiv.org/pdf/2508.06871
Copy Paste: [[2508.06871]] Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning(https://arxiv.org/abs/2508.06871)
Keywords: robust
Abstract: Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where higher representational flexibility is crucial for managing diverse and potentially conflicting task demands. We systematically explore how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MTRL benchmarks, comparing against dense baselines, and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL designs, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems.
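
For readers unfamiliar with Gradual Magnitude Pruning, the sketch below pairs the commonly used cubic sparsity schedule with a simple magnitude-based pruning step on a flat weight list. The schedule endpoints, target sparsity, and weights are hypothetical, and the abstract does not say which schedule or pruning granularity the authors use.

```python
def gmp_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic schedule: sparsity rises quickly at first, then levels off."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    n_prune = int(sparsity * len(weights))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


# Hypothetical run: prune towards 80% sparsity over 100 training steps.
weights = [0.5, -0.1, 2.0, 0.05]
for step in (0, 50, 100):
    s = gmp_sparsity(step, start_step=0, end_step=100, final_sparsity=0.8)
    print(step, round(s, 3), magnitude_prune(weights, s))
```

In an actual training loop the pruning step would be reapplied periodically between gradient updates, so the set of surviving weights can still change while the sparsity target rises.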

Title: ESNERA: Empirical and semantic named entity alignment for named entity dataset merging

Authors: Xiaobo Zhang (1 and 2), Congqing He (2), Ying He (1 and 2), Jian Peng (1), Dajie Fu (1), Tien-Ping Tan (2) ((1) School of Information Engineering, Jiangxi Vocational College of Finance & Economics, Jiujiang, China, (2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06877
Pdf URL: https://arxiv.org/pdf/2508.06877
Copy Paste: [[2508.06877]] ESNERA: Empirical and semantic named entity alignment for named entity dataset merging(https://arxiv.org/abs/2508.06877)
Keywords: interpretability
Abstract: Named Entity Recognition (NER) is a fundamental task in natural language processing. It remains a research hotspot due to its wide applicability across domains. Although recent advances in deep learning have significantly improved NER performance, they rely heavily on large, high-quality annotated datasets. However, building these datasets is expensive and time-consuming, posing a major bottleneck for further research. Current dataset merging approaches mainly focus on strategies like manual label mapping or constructing label graphs, which lack interpretability and scalability. To address this, we propose an automatic label alignment method based on label similarity. The method combines empirical and semantic similarities, using a greedy pairwise merging strategy to unify label spaces across different datasets. Experiments are conducted in two stages: first, merging three existing NER datasets into a unified corpus with minimal impact on NER performance; second, integrating this corpus with a small-scale, self-built dataset in the financial domain. The results show that our method enables effective dataset merging and enhances NER performance in the low-resource financial domain. This study presents an efficient, interpretable, and scalable solution for integrating multi-source NER corpora.
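
The greedy pairwise merging idea can be sketched as follows: score each label pair by a weighted mix of an empirical and a semantic similarity, then merge pairs from highest to lowest score until the score drops below a threshold. The mixing weight, threshold, label names, and scores are all hypothetical, and the paper's actual similarity definitions and stopping rule are not given in the abstract.

```python
def combined_similarity(empirical, semantic, weight=0.5):
    """Weighted mix of two similarity scores, both assumed to lie in [0, 1]."""
    return weight * empirical + (1.0 - weight) * semantic


def greedy_merge_labels(labels, pair_scores, threshold=0.7):
    """Greedily merge label pairs in descending score order.

    pair_scores maps (label_a, label_b) tuples to a combined similarity.
    Returns the merged label groups as sorted lists.
    """
    parent = {label: label for label in labels}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ranked = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), score in ranked:
        if score < threshold:
            break  # remaining pairs are even less similar
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    groups = {}
    for label in labels:
        groups.setdefault(find(label), []).append(label)
    return sorted(sorted(group) for group in groups.values())


# Illustrative labels from two imaginary datasets with made-up scores.
labels = ["PER", "PERSON", "ORG", "COMPANY", "LOC"]
pair_scores = {
    ("PER", "PERSON"): combined_similarity(0.9, 0.8),
    ("ORG", "COMPANY"): combined_similarity(0.7, 0.9),
    ("PER", "ORG"): combined_similarity(0.1, 0.3),
}
print(greedy_merge_labels(labels, pair_scores))
```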

Title: NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective

Authors: Maoxun Yuan, Duanni Meng, Ziteng Xi, Tianyi Zhao, Shiji Zhao, Yimian Dai, Xingxing Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06878
Pdf URL: https://arxiv.org/pdf/2508.06878
Copy Paste: [[2508.06878]] NS-FPN: Improving Infrared Small Target Detection and Segmentation from Noise Suppression Perspective(https://arxiv.org/abs/2508.06878)
Keywords: defense, segmentation
Abstract: Infrared small target detection and segmentation (IRSTDS) is a critical yet challenging task in defense and civilian applications, owing to the dim, shapeless appearance of targets and severe background clutter. Recent CNN-based methods have achieved promising target perception results, but they only focus on enhancing feature representation to offset the impact of noise, which increases false alarms. In this paper, by analyzing the problem in the frequency domain, we pioneer improving performance from a noise-suppression perspective and propose a novel noise-suppression feature pyramid network (NS-FPN), which integrates a low-frequency guided feature purification (LFP) module and a spiral-aware feature sampling (SFS) module into the original FPN structure. The LFP module suppresses the noise features by purifying high-frequency components to achieve feature enhancement devoid of noise interference, while the SFS module further adopts spiral sampling to fuse target-relevant features in the feature fusion process. Our NS-FPN is designed to be lightweight yet effective and can be easily plugged into existing IRSTDS frameworks. Extensive experiments on the public IRSTDS datasets demonstrate that our method significantly reduces false alarms and achieves superior performance on IRSTDS tasks.

Title: Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores

Authors: Arpita Saggar, Jonathan C. Darling, Vania Dimitrova, Duygu Sarikaya, David C. Hogg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06886
Pdf URL: https://arxiv.org/pdf/2508.06886
Copy Paste: [[2508.06886]] Score Before You Speak: Improving Persona Consistency in Dialogue Generation using Response Quality Scores(https://arxiv.org/abs/2508.06886)
Keywords: large language model
Abstract: Persona-based dialogue generation is an important milestone towards building conversational artificial intelligence. Despite the ever-improving capabilities of large language models (LLMs), effectively integrating persona fidelity in conversations remains challenging due to the limited diversity in existing dialogue data. We propose a novel framework SBS (Score-Before-Speaking), which outperforms previous methods and yields improvements for both million and billion-parameter models. Unlike previous methods, SBS unifies the learning of responses and their relative quality into a single step. The key innovation is to train a dialogue model to correlate augmented responses with a quality score during training and then leverage this knowledge at inference. We use noun-based substitution for augmentation and semantic similarity-based scores as a proxy for response quality. Through extensive experiments with benchmark datasets (PERSONA-CHAT and ConvAI2), we show that score-conditioned training allows existing models to better capture a spectrum of persona-consistent dialogues. Our ablation studies also demonstrate that including scores in the input prompt during training is superior to conventional training setups. Code and further details are available at this https URL

Title: Fusion-Based Brain Tumor Classification Using Deep Learning and Explainable AI, and Rule-Based Reasoning

Authors: Melika Filvantorkaman, Mohsen Piri, Maral Filvan Torkaman, Ashkan Zabihi, Hamidreza Moradi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06891
Pdf URL: https://arxiv.org/pdf/2508.06891
Copy Paste: [[2508.06891]] Fusion-Based Brain Tumor Classification Using Deep Learning and Explainable AI, and Rule-Based Reasoning(https://arxiv.org/abs/2508.06891)
Keywords: robust, interpretability
Abstract: Accurate and interpretable classification of brain tumors from magnetic resonance imaging (MRI) is critical for effective diagnosis and treatment planning. This study presents an ensemble-based deep learning framework that combines MobileNetV2 and DenseNet121 convolutional neural networks (CNNs) using a soft voting strategy to classify three common brain tumor types: glioma, meningioma, and pituitary adenoma. The models were trained and evaluated on the Figshare dataset using a stratified 5-fold cross-validation protocol. To enhance transparency and clinical trust, the framework integrates an Explainable AI (XAI) module employing Grad-CAM++ for class-specific saliency visualization, alongside a symbolic Clinical Decision Rule Overlay (CDRO) that maps predictions to established radiological heuristics. The ensemble classifier achieved superior performance compared to individual CNNs, with an accuracy of 91.7%, precision of 91.9%, recall of 91.7%, and F1-score of 91.6%. Grad-CAM++ visualizations revealed strong spatial alignment between model attention and expert-annotated tumor regions, supported by Dice coefficients up to 0.88 and IoU scores up to 0.78. Clinical rule activation further validated model predictions in cases with distinct morphological features. A human-centered interpretability assessment involving five board-certified radiologists yielded high Likert-scale scores for both explanation usefulness (mean = 4.4) and heatmap-region correspondence (mean = 4.0), reinforcing the framework's clinical relevance. Overall, the proposed approach offers a robust, interpretable, and generalizable solution for automated brain tumor classification, advancing the integration of deep learning into clinical neurodiagnostics.
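
Two ingredients named in the abstract are easy to show in a few lines: soft voting, which averages class probabilities across models before taking the argmax, and the Dice and IoU overlap scores used to compare saliency regions with annotations. The probabilities and masks below are made up for illustration and do not come from the paper.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across models and pick the best class.

    prob_lists: one probability vector per model, all of equal length.
    Returns (predicted_class_index, averaged_probabilities).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


def dice_and_iou(mask_a, mask_b):
    """Dice coefficient and IoU between two binary masks given as flat lists."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    union = size_a + size_b - inter
    dice = 2 * inter / (size_a + size_b) if size_a + size_b else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


# Hypothetical outputs of two models over (glioma, meningioma, pituitary).
label, avg = soft_vote([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])
# Hypothetical thresholded heatmap versus annotated region.
dice, iou = dice_and_iou([1, 1, 1, 0], [0, 1, 1, 1])
print(label, avg, dice, iou)
```

For a single mask pair the two overlap scores are linked by Dice = 2 * IoU / (1 + IoU), so an IoU of 0.78 corresponds to a Dice of about 0.88.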

Title: BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models

Authors: Jianting Tang, Yubo Wang, Haoyu Cao, Linli Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06895
Pdf URL: https://arxiv.org/pdf/2508.06895
Copy Paste: [[2508.06895]] BASIC: Boosting Visual Alignment with Intrinsic Refined Embeddings in Multimodal Large Language Models(https://arxiv.org/abs/2508.06895)
Keywords: large language model
Abstract: Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM's shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision.

Title: eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos

Authors: Xuecheng Wu, Dingkang Yang, Danlei Huang, Xinyi Yin, Yifan Wang, Jia Zhang, Jiayu Nie, Liangyu Fu, Yang Liu, Junxiao Xue, Hadi Amirpour, Wei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06902
Pdf URL: https://arxiv.org/pdf/2508.06902
Copy Paste: [[2508.06902]] eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos(https://arxiv.org/abs/2508.06902)
Keywords: transformer
Abstract: Short-form videos (SVs) have become a vital part of our online routine for acquiring and sharing information. Their multimodal complexity poses new challenges for video analysis, highlighting the need for video emotion analysis (VEA) within the community. Given the limited availability of SV emotion data, we introduce eMotions, a large-scale dataset consisting of 27,996 videos with full-scale annotations. To ensure quality and reduce subjective bias, we emphasize better personnel allocation and propose a multi-stage annotation procedure. Additionally, we provide the category-balanced and test-oriented variants through targeted sampling to meet diverse needs. While there have been significant studies on videos with clear emotional cues (e.g., facial expressions), analyzing emotions in SVs remains a challenging task. The challenge arises from the broader content diversity, which introduces more distinct semantic gaps and complicates the representation learning of emotion-related features. Furthermore, the prevalence of audio-visual co-expressions in SVs leads to the local biases and collective information gaps caused by the inconsistencies in emotional expressions. To tackle this, we propose AV-CANet, an end-to-end audio-visual fusion network that leverages a video transformer to capture semantically relevant representations. We further introduce the Local-Global Fusion Module designed to progressively capture the correlations of audio-visual features. Besides, EP-CE Loss is constructed to globally steer optimizations with tripolar penalties. Extensive experiments across three eMotions-related datasets and four public VEA datasets demonstrate the effectiveness of our proposed AV-CANet, while providing broad insights for future research. Moreover, we conduct ablation studies to examine the critical components of our method. Dataset and code will be made available at GitHub.

Title: A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation

Authors: Chao Yin, Jide Li, Xiaoqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06904
Pdf URL: https://arxiv.org/pdf/2508.06904
Copy Paste: [[2508.06904]] A Simple yet Powerful Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation(https://arxiv.org/abs/2508.06904)
Keywords: large language model, segmentation
Abstract: Camouflaged Object Segmentation (COS) remains highly challenging due to the intrinsic visual similarity between target objects and their surroundings. While training-based COS methods achieve good performance, their performance degrades rapidly with increased annotation sparsity. To circumvent this limitation, recent studies have explored training-free COS methods, leveraging the Segment Anything Model (SAM) by automatically generating visual prompts from a single task-generic prompt (e.g., "camouflaged animal") uniformly applied across all test images. However, these methods typically produce only semantic-level visual prompts, causing SAM to output coarse semantic masks and thus failing to handle scenarios with multiple discrete camouflaged instances effectively. To address this critical limitation, we propose a simple yet powerful Instance-Aware Prompting Framework (IAPF), the first training-free COS pipeline that explicitly converts a task-generic prompt into fine-grained instance masks. Specifically, the IAPF comprises three steps: (1) Text Prompt Generator, utilizing task-generic queries to prompt a Multimodal Large Language Model (MLLM) for generating image-specific foreground and background tags; (2) Instance Mask Generator, leveraging Grounding DINO to produce precise instance-level bounding box prompts, alongside the proposed Single-Foreground Multi-Background Prompting strategy to sample region-constrained point prompts within each box, enabling SAM to yield a candidate instance mask; (3) Self-consistency Instance Mask Voting, which selects the final COS prediction by identifying the candidate mask most consistent across multiple candidate instance masks. Extensive evaluations on standard COS benchmarks demonstrate that the proposed IAPF significantly surpasses existing state-of-the-art training-free COS methods.
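
Step (3) above, picking the candidate mask most consistent with the others, could be sketched as follows. The abstract does not say how consistency is measured, so this sketch assumes mean pairwise IoU between binary masks; the names `iou` and `vote_most_consistent` are hypothetical and not taken from the paper.

```python
from typing import List, Set, Tuple

# A binary mask represented as the set of its foreground (row, col) pixels.
Mask = Set[Tuple[int, int]]


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union; two empty masks are treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def vote_most_consistent(candidates: List[Mask]) -> int:
    """Return the index of the candidate with the highest mean IoU to the others.

    Assumption: 'consistency' is mean pairwise IoU. Ties go to the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate mask")
    if len(candidates) == 1:
        return 0
    best_idx, best_score = 0, -1.0
    for i, mask in enumerate(candidates):
        scores = [iou(mask, other) for j, other in enumerate(candidates) if j != i]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Toy candidates: three overlapping masks and one outlier.
outlier = {(5, 5)}
small = {(0, 0), (0, 1), (1, 0)}
square = {(0, 0), (0, 1), (1, 0), (1, 1)}
large = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)}
winner = vote_most_consistent([outlier, small, square, large])
```

In this toy case the 2x2 `square` mask wins because it overlaps well with both the smaller and the larger candidate, while the outlier scores zero against all of them.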

Title: MultiRef: Controllable Image Generation with Multiple Visual References

Authors: Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06905
Pdf URL: https://arxiv.org/pdf/2508.06905
Copy Paste: [[2508.06905]] MultiRef: Controllable Image Generation with Multiple Visual References(https://arxiv.org/abs/2508.06905)
Keywords: generative
Abstract: Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated through our data engine RefBlend, with 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% in synthetic samples and 79.0% in real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: this https URL.

Title: MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification

Authors: Jinhao Li, Zijian Chen, Lirong Deng, Changbo Wang, Guangtao Zhai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06908
Pdf URL: https://arxiv.org/pdf/2508.06908
Copy Paste: [[2508.06908]] MMReID-Bench: Unleashing the Power of MLLMs for Effective and Versatile Person Re-identification(https://arxiv.org/abs/2508.06908)
Keywords: security, robust, large language model
Abstract: Person re-identification (ReID) aims to retrieve the images of a person of interest from the gallery images, with wide applications in medical rehabilitation, abnormal behavior detection, and public security. However, traditional person ReID models suffer from uni-modal capability, leading to poor generalization ability in multi-modal data, such as RGB, thermal, infrared, sketch images, textual descriptions, etc. Recently, the emergence of multi-modal large language models (MLLMs) shows a promising avenue for addressing this problem. Despite this potential, existing methods merely regard MLLMs as feature extractors or caption generators, which do not fully unleash their reasoning, instruction-following, and cross-modal understanding capabilities. To bridge this gap, we introduce MMReID-Bench, the first multi-task multi-modal benchmark specifically designed for person ReID. The MMReID-Bench includes 20,710 multi-modal queries and gallery images covering 10 different person ReID tasks. Comprehensive experiments demonstrate the remarkable capabilities of MLLMs in delivering effective and versatile person ReID. Nevertheless, they also have limitations in handling a few modalities, particularly thermal and infrared data. We hope MMReID-Bench can help the community develop more robust and generalizable multimodal foundation models for person ReID.

Title: Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection

Authors: Siyuan Li, Xi Lin, Guangyan Li, Zehao Liu, Aodu Wulianghai, Li Ding, Jun Wu, Jianhua Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.06913
Pdf URL: https://arxiv.org/pdf/2508.06913
Copy Paste: [[2508.06913]] Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection(https://arxiv.org/abs/2508.06913)
Keywords: attack, robust, large language model
Abstract: The rapid advancement of large language models (LLMs) has resulted in increasingly sophisticated AI-generated content, posing significant challenges in distinguishing LLM-generated text from human-written language. Existing detection methods, primarily based on lexical heuristics or fine-tuned classifiers, often suffer from limited generalizability and are vulnerable to paraphrasing, adversarial perturbations, and cross-domain shifts. In this work, we propose SentiDetect, a model-agnostic framework for detecting LLM-generated text by analyzing the divergence in sentiment distribution stability. Our method is motivated by the empirical observation that LLM outputs tend to exhibit emotionally consistent patterns, whereas human-written texts display greater emotional variability. To capture this phenomenon, we define two complementary metrics: sentiment distribution consistency and sentiment distribution preservation, which quantify stability under sentiment-altering and semantic-preserving transformations. We evaluate SentiDetect on five diverse datasets and a range of advanced LLMs, including Gemini-1.5-Pro, Claude-3, GPT-4-0613, and LLaMa-3.3. Experimental results demonstrate its superiority over state-of-the-art baselines, with over 16% and 11% F1 score improvements on Gemini-1.5-Pro and GPT-4-0613, respectively. Moreover, SentiDetect also shows greater robustness to paraphrasing, adversarial attacks, and text length variations, outperforming existing detectors in challenging scenarios.

Title: AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning

Authors: Shihao Yuan, Yahui Liu, Yang Yue, Jingyuan Zhang, Wangmeng Zuo, Qi Wang, Fuzheng Zhang, Guorui Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06924
Pdf URL: https://arxiv.org/pdf/2508.06924
Copy Paste: [[2508.06924]] AR-GRPO: Training Autoregressive Image Generation Models via Reinforcement Learning(https://arxiv.org/abs/2508.06924)
Keywords: large language model
Abstract: Inspired by the success of reinforcement learning (RL) in refining large language models (LLMs), we propose AR-GRPO, an approach to integrate online RL training into autoregressive (AR) image generation models. We adapt the Group Relative Policy Optimization (GRPO) algorithm to refine the vanilla autoregressive models' outputs by carefully designed reward functions that evaluate generated images across multiple quality dimensions, including perceptual quality, realism, and semantic fidelity. We conduct comprehensive experiments on both class-conditional (i.e., class-to-image) and text-conditional (i.e., text-to-image) image generation tasks, demonstrating that our RL-enhanced framework significantly improves both the image quality and human preference of generated images compared to the standard AR baselines. Our results show consistent improvements across various evaluation metrics, establishing the viability of RL-based optimization for AR image generation and opening new avenues for controllable and high-quality image synthesis. The source codes and models are available at: this https URL.
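
The group-relative part of GRPO can be sketched in a few lines. In the commonly used formulation, several outputs are sampled for the same prompt and each reward is normalized by the mean and standard deviation of its group. The abstract does not say how AR-GRPO combines its perceptual, realism, and semantic rewards, so the reward values below are made-up placeholders; `group_relative_advantages` is a hypothetical helper, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples drawn for the same prompt.

    Assumption: population standard deviation, with eps guarding the case
    where every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder scalar rewards for four images generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages.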

Title: CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing

Authors: Weiyan Xie, Han Gao, Didan Deng, Kaican Li, April Hua Liu, Yongxiang Huang, Nevin L. Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06937
Pdf URL: https://arxiv.org/pdf/2508.06937
Copy Paste: [[2508.06937]] CannyEdit: Selective Canny Control and Dual-Prompt Guidance for Training-Free Image Editing(https://arxiv.org/abs/2508.06937)
Keywords: generative
Abstract: Recent advances in text-to-image (T2I) models have enabled training-free regional image editing by leveraging the generative priors of foundation models. However, existing methods struggle to balance text adherence in edited regions, context fidelity in unedited areas, and seamless integration of edits. We introduce CannyEdit, a novel training-free framework that addresses these challenges through two key innovations: (1) Selective Canny Control, which masks the structural guidance of Canny ControlNet in user-specified editable regions while strictly preserving details of the source images in unedited areas via inversion-phase ControlNet information retention. This enables precise, text-driven edits without compromising contextual integrity. (2) Dual-Prompt Guidance, which combines local prompts for object-specific edits with a global target prompt to maintain coherent scene interactions. On real-world image editing tasks (addition, replacement, removal), CannyEdit outperforms prior methods like KV-Edit, achieving a 2.93 to 10.49 percent improvement in the balance of text adherence and context fidelity. In terms of editing seamlessness, user studies reveal only 49.2 percent of general users and 42.0 percent of AIGC experts identified CannyEdit's results as AI-edited when paired with real images without edits, versus 76.08 to 89.09 percent for competitor methods.

Title: Class Unbiasing for Generalization in Medical Diagnosis

Authors: Lishi Zuo, Man-Wai Mak, Lu Yi, Youzhi Tu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06943
Pdf URL: https://arxiv.org/pdf/2508.06943
Copy Paste: [[2508.06943]] Class Unbiasing for Generalization in Medical Diagnosis(https://arxiv.org/abs/2508.06943)
Keywords: robust
Abstract: Medical diagnosis might fail due to bias. In this work, we identified class-feature bias, which refers to models' potential reliance on features that are strongly correlated with only a subset of classes, leading to biased performance and poor generalization on other classes. We aim to train a class-unbiased model (Cls-unbias) that mitigates both class imbalance and class-feature bias simultaneously. Specifically, we propose a class-wise inequality loss which promotes equal contributions of classification loss from positive-class and negative-class samples. We propose to optimize a class-wise group distributionally robust optimization objective (a class-weighted training objective that upweights underperforming classes) to enhance the effectiveness of the inequality loss under class imbalance. Through synthetic and real-world datasets, we empirically demonstrate that class-feature bias can negatively impact model performance. Our proposed method effectively mitigates both class-feature bias and class imbalance, thereby improving the model's generalization ability.
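
The idea of a class-weighted objective that upweights underperforming classes can be illustrated with the exponentiated-gradient weight update commonly used for group distributionally robust optimization. This is a sketch under that assumption, not the paper's exact objective; the function names, the step size, and the loss values are hypothetical.

```python
import math
from typing import Dict


def update_class_weights(
    weights: Dict[str, float],
    class_losses: Dict[str, float],
    step_size: float = 1.0,
) -> Dict[str, float]:
    """Multiply each class weight by exp(step_size * loss) and renormalize.

    Classes with a higher current loss end up with a larger share of the weight.
    """
    scaled = {c: weights[c] * math.exp(step_size * class_losses[c]) for c in weights}
    total = sum(scaled.values())
    return {c: v / total for c, v in scaled.items()}


def weighted_objective(weights: Dict[str, float], class_losses: Dict[str, float]) -> float:
    """Class-weighted training loss: the weighted sum of per-class average losses."""
    return sum(weights[c] * class_losses[c] for c in weights)


# Placeholder per-class average losses; the positive class is underperforming.
uniform = {"pos": 0.5, "neg": 0.5}
losses = {"pos": 2.0, "neg": 0.5}
new_weights = update_class_weights(uniform, losses)
```

After the update, the poorly fit class dominates the objective, so the next gradient step concentrates on it.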

Title: AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance

Authors: Lixuan He, Jie Feng, Yong Li
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.06944
Pdf URL: https://arxiv.org/pdf/2508.06944
Copy Paste: [[2508.06944]] AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance(https://arxiv.org/abs/2508.06944)
Keywords: large language model
Abstract: Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), a process fraught with catastrophic forgetting and suboptimal trade-offs between imitation and exploration. Recent single-stage methods attempt to unify SFT and RL using heuristics, but lack a principled mechanism for dynamically balancing the two paradigms. In this paper, we reframe this challenge through the theoretical lens of implicit rewards, viewing SFT and RL not as distinct methods but as complementary reward signals. We introduce Adaptive Meta Fine-Tuning (AMFT), a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward. The core of AMFT is a meta-gradient adaptive weight controller that treats the SFT-RL balance as a learnable parameter, dynamically optimizing it to maximize long-term task performance. This forward-looking approach, regularized by policy entropy for stability, autonomously discovers an effective training curriculum. We conduct a comprehensive evaluation on challenging benchmarks spanning mathematical reasoning, abstract visual reasoning (General Points), and vision-language navigation (V-IRL). AMFT consistently establishes a new state-of-the-art and demonstrates superior generalization on out-of-distribution (OOD) tasks. Ablation studies and training dynamic analysis confirm that the meta-learning controller is crucial for AMFT's stability, sample efficiency, and performance, offering a more principled and effective paradigm for LLM this http URL codes are open-sourced via this https URL.

Title: SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work

Authors: Harry Walsh, Ed Fish, Ozge Mercanoglu Sincan, Mohamed Ilyes Lakhal, Richard Bowden, Neil Fox, Bencie Woll, Kepeng Wu, Zecheng Li, Weichao Zhao, Haodong Wang, Wengang Zhou, Houqiang Li, Shengeng Tang, Jiayi He, Xu Wang, Ruobei Zhang, Yaxiong Wang, Lechao Cheng, Meryem Tasyurek, Tugce Kiziltepe, Hacer Yalim Keles
Subjects: cs.CV, eess.IV, eess.SP
Abstract URL: https://arxiv.org/abs/2508.06951
Pdf URL: https://arxiv.org/pdf/2508.06951
Copy Paste: [[2508.06951]] SLRTP2025 Sign Language Production Challenge: Methodology, Results, and Future Work(https://arxiv.org/abs/2508.06951)
Keywords: extraction
Abstract: Sign Language Production (SLP) is the task of generating sign language video from spoken language inputs. The field has seen a range of innovations over the last few years, with the introduction of deep learning-based approaches providing significant improvements in the realism and naturalness of generated outputs. However, the lack of standardized evaluation metrics for SLP approaches hampers meaningful comparisons across different systems. To address this, we introduce the first Sign Language Production Challenge, held as part of the third SLRTP Workshop at CVPR 2025. The competition's aims are to evaluate architectures that translate from spoken language sentences to a sequence of skeleton poses, known as Text-to-Pose (T2P) translation, over a range of metrics. For our evaluation data, we use the RWTH-PHOENIX-Weather-2014T dataset, a German Sign Language (Deutsche Gebärdensprache, DGS) weather broadcast dataset. In addition, we curate a custom hidden test set from a similar domain of discourse. This paper presents the challenge design and the winning methodologies. The challenge attracted 33 participants who submitted 231 solutions, with the top-performing team achieving a BLEU-1 score of 31.40 and a DTW-MJE of 0.0574. The winning approach utilized a retrieval-based framework and a pre-trained language model. As part of the workshop, we release a standardized evaluation network, including high-quality skeleton extraction-based keypoints establishing a consistent baseline for the SLP field, which will enable future researchers to compare their work against a broader range of methods.

Title: BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity

Authors: Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He, Ruixuan Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06953
Pdf URL: https://arxiv.org/pdf/2508.06953
Copy Paste: [[2508.06953]] BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity(https://arxiv.org/abs/2508.06953)
Keywords: large language model
Abstract: Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). It approximates the update of a pretrained weight matrix $W\in\mathbb{R}^{m\times n}$ by the product of two low-rank matrices, $BA$, where $A \in\mathbb{R}^{r\times n}$ and $B\in\mathbb{R}^{m\times r} (r\ll\min\{m,n\})$. Increasing the dimension $r$ can raise the rank of LoRA weights (i.e., $BA$), which typically improves fine-tuning performance but also significantly increases the number of trainable parameters. In this paper, we propose Block Diversified Low-Rank Adaptation (BoRA), which improves the rank of LoRA weights with a small number of additional parameters. Specifically, BoRA treats the product $BA$ as a block matrix multiplication, where $A$ and $B$ are partitioned into $b$ blocks along the columns and rows, respectively (i.e., $A=[A_1,\dots,A_b]$ and $B=[B_1,\dots,B_b]^\top$). Consequently, the product $BA$ becomes the concatenation of the block products $B_iA_j$ for $i,j\in[b]$. To enhance the diversity of different block products, BoRA introduces a unique diagonal matrix $\Sigma_{i,j} \in \mathbb{R}^{r\times r}$ for each block multiplication, resulting in $B_i \Sigma_{i,j} A_j$. By leveraging these block-wise diagonal matrices, BoRA increases the rank of LoRA weights by a factor of $b$ while only requiring $b^2r$ additional parameters. Extensive experiments across multiple datasets and models demonstrate the superiority of BoRA, and ablation studies further validate its scalability.
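
The block construction described above can be checked on a toy example. The sketch below builds the $m\times n$ update whose $(i,j)$ block is $B_i \Sigma_{i,j} A_j$ and compares its rank with that of plain $BA$. It uses exact rational arithmetic from the Python standard library; the helper names (`matmul`, `rank`, `bora_weight`) and the toy values are hypothetical, and the sketch assumes $b$ divides both $m$ and $n$.

```python
from fractions import Fraction
from typing import List, Sequence

Matrix = List[List[int]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product of x (p x q) and y (q x s)."""
    return [[sum(row[k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for row in x]


def rank(mat: Sequence[Sequence[int]]) -> int:
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in mat]
    rk = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * p for a, p in zip(rows[i], rows[rk])]
        rk += 1
        if rk == len(rows):
            break
    return rk


def bora_weight(B: Matrix, A: Matrix, sigma, b: int) -> Matrix:
    """Assemble the update whose (i, j) block is B_i diag(sigma[i][j]) A_j."""
    m, r, n = len(B), len(A), len(A[0])
    mb, nb = m // b, n // b  # block heights and widths (assumes b divides m and n)
    out = [[0] * n for _ in range(m)]
    for i in range(b):
        for j in range(b):
            s = sigma[i][j]  # the r diagonal entries of Sigma_{i,j}
            for p in range(i * mb, (i + 1) * mb):
                for q in range(j * nb, (j + 1) * nb):
                    out[p][q] = sum(B[p][k] * s[k] * A[k][q] for k in range(r))
    return out


# Toy sizes: m = n = 4, r = 1, b = 2.
B = [[1], [2], [3], [4]]  # m x r
A = [[1, 2, 3, 4]]        # r x n
ones = [[[1], [1]], [[1], [1]]]   # all-ones diagonals recover plain BA
sigma = [[[1], [2]], [[3], [5]]]  # b*b diagonals of length r: b^2 * r extra parameters
lora_rank = rank(matmul(B, A))
bora_rank = rank(bora_weight(B, A, sigma, b=2))
```

With these toy values the plain product $BA$ has rank 1, while the block-diversified weight reaches rank 2, i.e. $b \cdot r$, using only $b^2 r = 4$ extra scalars.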

Title: Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification

Authors: Qin Xu, Lili Zhu, Xiaoxia Cheng, Bo Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06959
Pdf URL: https://arxiv.org/pdf/2508.06959
Copy Paste: [[2508.06959]] Beyond Frequency: Seeing Subtle Cues Through the Lens of Spatial Decomposition for Fine-Grained Visual Classification(https://arxiv.org/abs/2508.06959)
Keywords: extraction
Abstract: The crux of resolving fine-grained visual classification (FGVC) lies in capturing discriminative and class-specific cues that correspond to subtle visual characteristics. Recently, frequency decomposition/transform based approaches have attracted considerable interest owing to their appealing ability to mine discriminative cues. However, the frequency-domain methods are based on fixed basis functions, lacking adaptability to image content and unable to dynamically adjust feature extraction according to the discriminative requirements of different images. To address this, we propose a novel method for FGVC, named Subtle-Cue Oriented Perception Engine (SCOPE), which adaptively enhances the representational capability of low-level details and high-level semantics in the spatial domain, breaking through the limitations of fixed scales in the frequency domain and improving the flexibility of multi-scale fusion. The core of SCOPE lies in two modules: the Subtle Detail Extractor (SDE), which dynamically enhances subtle details such as edges and textures from shallow features, and the Salient Semantic Refiner (SSR), which learns semantically coherent and structure-aware refinement features from the high-level features guided by the enhanced shallow features. The SDE and SSR are cascaded stage-by-stage to progressively combine local details with global semantics. Extensive experiments demonstrate that our method achieves new state-of-the-art on four popular fine-grained image classification benchmarks.

Title: Adversarial Video Promotion Against Text-to-Video Retrieval

Authors: Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Qian Li, Shuai Liu, Chao Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06964
Pdf URL: https://arxiv.org/pdf/2508.06964
Copy Paste: [[2508.06964]] Adversarial Video Promotion Against Text-to-Video Retrieval(https://arxiv.org/abs/2508.06964)
Keywords: attack, robust
Abstract: Thanks to the development of cross-modal models, text-to-video retrieval (T2VR) is advancing rapidly, but its robustness remains largely unexamined. Existing attacks against T2VR are designed to push videos away from queries, i.e., suppressing the ranks of videos, while the attacks that pull videos towards selected queries, i.e., promoting the ranks of videos, remain largely unexplored. These attacks can be more impactful as attackers may gain more views/clicks for financial benefits and widespread (mis)information. To this end, we pioneer the first attack against T2VR to promote videos adversarially, dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement (MoRe) to capture the finer-grained, intricate interaction between visual and textual modalities to enhance black-box transferability. Comprehensive experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing datasets with over 10k videos, evaluated under 3 scenarios. All experiments are conducted in a multi-target setting to reflect realistic scenarios where attackers seek to promote the video regarding multiple queries simultaneously. We also evaluated our attacks for defences and imperceptibility. Overall, ViPro surpasses other baselines by over $30/10/4\%$ for white/grey/black-box settings on average. Our work highlights an overlooked vulnerability, provides a qualitative analysis on the upper/lower bound of our attacks, and offers insights into potential counterplays. Code will be publicly available at this https URL.

Title: Can Multitask Learning Enhance Model Explainability?

Authors: Hiba Najjar, Bushra Alshbib, Andreas Dengel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06966
Pdf URL: https://arxiv.org/pdf/2508.06966
Copy Paste: [[2508.06966]] Can Multitask Learning Enhance Model Explainability?(https://arxiv.org/abs/2508.06966)
Keywords: interpretability, explainability, segmentation
Abstract: Remote sensing provides satellite data in diverse types and formats. The usage of multimodal learning networks exploits this diversity to improve model performance, except that the complexity of such networks comes at the expense of their interpretability. In this study, we explore how modalities can be leveraged through multitask learning to intrinsically explain model behavior. In particular, instead of additional inputs, we use certain modalities as additional targets to be predicted along with the main task. The success of this approach relies on the rich information content of satellite data, which remains as input modalities. We show how this modeling context provides numerous benefits: (1) in case of data scarcity, the additional modalities do not need to be collected for model inference at deployment, (2) the model performance remains comparable to the multimodal baseline performance, and in some cases achieves better scores, (3) prediction errors in the main task can be explained via the model behavior in the auxiliary task(s). We demonstrate the efficiency of our approach on three datasets, including segmentation, classification, and regression tasks. Code available at this http URL.

Title: Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction

Authors: Mohamed Basem, Islam Oshallah, Ali Hamdi, Khaled Shaban, Hozaifa Kassab
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.06971
Pdf URL: https://arxiv.org/pdf/2508.06971
Copy Paste: [[2508.06971]] Two-Stage Quranic QA via Ensemble Retrieval and Instruction-Tuned Answer Extraction(https://arxiv.org/abs/2508.06971)
Keywords: extraction, large language model
Abstract: Quranic Question Answering presents unique challenges due to the linguistic complexity of Classical Arabic and the semantic richness of religious texts. In this paper, we propose a novel two-stage framework that addresses both passage retrieval and answer extraction. For passage retrieval, we ensemble fine-tuned Arabic language models to achieve superior ranking performance. For answer extraction, we employ instruction-tuned large language models with few-shot prompting to overcome the limitations of fine-tuning on small datasets. Our approach achieves state-of-the-art results on the Quran QA 2023 Shared Task, with a MAP@10 of 0.3128 and MRR@10 of 0.5763 for retrieval, and a pAP@10 of 0.669 for extraction, substantially outperforming previous methods. These results demonstrate that combining model ensembling and instruction-tuned language models effectively addresses the challenges of low-resource question answering in specialized domains.
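
For readers unfamiliar with the retrieval metrics quoted above, the sketch below shows one common way to compute reciprocal rank and average precision at a cutoff of 10; MRR@10 and MAP@10 are their means over all questions. The normalization of AP@k varies between evaluation scripts, so dividing by min(number of relevant passages, k) is an assumption here, and the passage IDs are made up. The extraction metric pAP@10 is not covered.

```python
from typing import List, Set


def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """1 / position of the first relevant item within the top k, else 0."""
    for position, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Mean of precision values at each relevant hit within the top k.

    Assumption: normalized by min(len(relevant), k).
    """
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for position, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / min(len(relevant), k)


def mean_over_queries(values: List[float]) -> float:
    """MAP@k or MRR@k is the plain mean of the per-question scores."""
    return sum(values) / len(values) if values else 0.0


# One made-up question: relevant passages are ranked 2nd and 4th.
ranked = ["p3", "p7", "p1", "p9"]
relevant = {"p7", "p9"}
rr = reciprocal_rank_at_k(ranked, relevant)
ap = average_precision_at_k(ranked, relevant)
```

Here the first relevant passage sits at rank 2, so the reciprocal rank is 0.5. The precision values at the two hits are 1/2 and 2/4, which also gives an average precision of 0.5.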

Title: Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models

Authors: Zhijun Tu, Hanting Chen, Siqi Liu, Chuanjian Liu, Jian Li, Jie Hu, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.06974
Pdf URL: https://arxiv.org/pdf/2508.06974
Copy Paste: [[2508.06974]] Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models(https://arxiv.org/abs/2508.06974)
Keywords: large language model
Abstract: 1-bit LLM quantization offers significant advantages in reducing storage and computational costs. However, existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. This results in high training costs and notable accuracy degradation. We identify that the large gap between full precision and 1-bit representations makes direct adaptation difficult. In this paper, we introduce a consistent progressive training for both forward and backward, smoothly converting the floating-point weights into the binarized ones. Additionally, we incorporate binary-aware initialization and dual-scaling compensation to reduce the difficulty of progressive training and improve the performance. Experimental results on LLMs of various sizes demonstrate that our method outperforms existing approaches. Our results show that high-performance 1-bit LLMs can be achieved using pre-trained models, eliminating the need for expensive training from scratch.
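
One simple way to picture "smoothly converting the floating-point weights into the binarized ones" is a linear blend controlled by a schedule parameter. The abstract does not give the actual schedule, scaling, or gradient treatment, so the blend below, the mean-absolute-value scale `alpha`, and the name `progressive_binarize` are all assumptions made for illustration.

```python
from typing import List


def sign(x: float) -> float:
    """Binarization to +1 / -1 (zero is mapped to +1)."""
    return 1.0 if x >= 0 else -1.0


def progressive_binarize(weights: List[float], lam: float) -> List[float]:
    """Blend full-precision weights with their scaled 1-bit version.

    lam = 0 returns the original weights; lam = 1 returns alpha * sign(w).
    Assumption: alpha is the mean absolute value of the weights.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [(1.0 - lam) * w + lam * alpha * sign(w) for w in weights]


# Toy weights; in training, lam would be increased gradually from 0 to 1.
weights = [0.5, -1.5, 1.0]
halfway = progressive_binarize(weights, lam=0.5)
binary = progressive_binarize(weights, lam=1.0)
```

Raising `lam` over the course of training moves each weight from its pre-trained value toward one of two levels, which is the intuition behind starting from a pre-trained model instead of training a 1-bit model from scratch.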

Title: Structure-Preserving Digital Twins via Conditional Neural Whitney Forms

Authors: Brooks Kinch, Benjamin Shaffer, Elizabeth Armstrong, Michael Meehan, John Hewson, Nathaniel Trask
Subjects: cs.LG, math.NA, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2508.06981
Pdf URL: https://arxiv.org/pdf/2508.06981
Copy Paste: [[2508.06981]] Structure-Preserving Digital Twins via Conditional Neural Whitney Forms(https://arxiv.org/abs/2508.06981)
Keywords: diffusion
Abstract: We present a framework for constructing real-time digital twins based on structure-preserving reduced finite element models conditioned on a latent variable Z. The approach uses conditional attention mechanisms to learn both a reduced finite element basis and a nonlinear conservation law within the framework of finite element exterior calculus (FEEC). This guarantees numerical well-posedness and exact preservation of conserved quantities, regardless of data sparsity or optimization error. The conditioning mechanism supports real-time calibration to parametric variables, allowing the construction of digital twins which support closed loop inference and calibration to sensor data. The framework interfaces with conventional finite element machinery in a non-invasive manner, allowing treatment of complex geometries and integration of learned models with conventional finite element techniques. Benchmarks include advection diffusion, shock hydrodynamics, electrostatics, and a complex battery thermal runaway problem. The method achieves accurate predictions on complex geometries with sparse data (25 LES simulations), including capturing the transition to turbulence and achieving real-time inference ~0.1s with a speedup of 3.1x10^8 relative to LES. An open-source implementation is available on GitHub.

Title: WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering

Authors: Yixin Zhu, Zuoliang Zhu, Miloš Hašan, Jian Yang, Jin Xie, Beibei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.06982
Pdf URL: https://arxiv.org/pdf/2508.06982
Copy Paste: [[2508.06982]] WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering(https://arxiv.org/abs/2508.06982)
Keywords: robust, diffusion, segmentation
Abstract: Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (i.e., WeatherSynthetic) and a real-world dataset (i.e., WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.

Title: UniMove: A Unified Model for Multi-city Human Mobility Prediction

Authors: Chonghua Han, Yuan Yuan, Yukun Liu, Jingtao Ding, Jie Feng, Yong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06986
Pdf URL: https://arxiv.org/pdf/2508.06986
Copy Paste: [[2508.06986]] UniMove: A Unified Model for Multi-city Human Mobility Prediction(https://arxiv.org/abs/2508.06986)
Keywords: transformer
Abstract: Human mobility prediction is vital for urban planning, transportation optimization, and personalized services. However, the inherent randomness, non-uniform time intervals, and complex patterns of human mobility, compounded by the heterogeneity introduced by varying city structures, infrastructure, and population densities, present significant challenges in modeling. Existing solutions often require training separate models for each city due to distinct spatial representations and geographic coverage. In this paper, we propose UniMove, a unified model for multi-city human mobility prediction, addressing two challenges: (1) constructing universal spatial representations for effective token sharing across cities, and (2) modeling heterogeneous mobility patterns from varying city characteristics. We propose a trajectory-location dual-tower architecture, with a location tower for universal spatial encoding and a trajectory tower for sequential mobility modeling. We also design MoE Transformer blocks to adaptively select experts to handle diverse movement patterns. Extensive experiments across multiple datasets from diverse cities demonstrate that UniMove truly embodies the essence of a unified model. By enabling joint training on multi-city data with mutual data enhancement, it significantly improves mobility prediction accuracy by over 10.2%. UniMove represents a key advancement toward realizing a true foundational model with a unified architecture for human mobility. We release the implementation at this https URL.

Title: TADoc: Robust Time-Aware Document Image Dewarping

Authors: Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06988
Pdf URL: https://arxiv.org/pdf/2508.06988
Copy Paste: [[2508.06988]] TADoc: Robust Time-Aware Document Image Dewarping(https://arxiv.org/abs/2508.06988)
Keywords: robust
Abstract: Flattening curved, wrinkled, and rotated document images captured by portable photographing devices, termed document image dewarping, has become an increasingly important task with the rise of digital economy and online working. Although many methods have been proposed recently, they often struggle to achieve satisfactory results when confronted with intricate document structures and higher degrees of deformation in real-world scenarios. Our main insight is that, unlike other document restoration tasks (e.g., deblurring), dewarping in real physical scenes is a progressive motion rather than a one-step transformation. Based on this, we have undertaken two key initiatives. Firstly, we reformulate this task, modeling it for the first time as a dynamic process that encompasses a series of intermediate states. Secondly, we design a lightweight framework called TADoc (Time-Aware Document Dewarping Network) to address the geometric distortion of document images. In addition, due to the inadequacy of OCR metrics for document images containing sparse text, the comprehensiveness of evaluation is insufficient. To address this shortcoming, we propose a new metric -- DLS (Document Layout Similarity) -- to evaluate the effectiveness of document dewarping in downstream tasks. Extensive experiments and in-depth evaluations have been conducted and the results indicate that our model possesses strong robustness, achieving superiority on several benchmarks with different document types and degrees of distortion.

Title: A Comparative Study of Feature Selection in Tsetlin Machines

Authors: Vojtech Halenka, Ole-Christoffer Granmo, Lei Jiao, Per-Arne Andersen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.06991
Pdf URL: https://arxiv.org/pdf/2508.06991
Copy Paste: [[2508.06991]] A Comparative Study of Feature Selection in Tsetlin Machines(https://arxiv.org/abs/2508.06991)
Keywords: interpretability
Abstract: Feature Selection (FS) is crucial for improving model interpretability, reducing complexity, and sometimes for enhancing accuracy. The recently introduced Tsetlin machine (TM) offers interpretable clause-based learning, but lacks established tools for estimating feature importance. In this paper, we adapt and evaluate a range of FS techniques for TMs, including classical filter and embedded methods as well as post-hoc explanation methods originally developed for neural networks (e.g., SHAP and LIME) and a novel family of embedded scorers derived from TM clause weights and Tsetlin automaton (TA) states. We benchmark all methods across 12 datasets, using evaluation protocols such as Remove and Retrain (ROAR) and Remove and Debias (ROAD) to assess causal impact. Our results show that TM-internal scorers not only perform competitively but also exploit the interpretability of clauses to reveal interacting feature patterns. Simpler TM-specific scorers achieve similar accuracy retention at a fraction of the computational cost. This study establishes the first comprehensive baseline for FS in TM and paves the way for developing specialized TM-specific interpretability techniques.

Title: OctreeNCA: Single-Pass 184 MP Segmentation on Consumer Hardware

Authors: Nick Lemke, John Kalkhof, Niklas Babendererde, Anirban Mukhopadhyay
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06993
Pdf URL: https://arxiv.org/pdf/2508.06993
Copy Paste: [[2508.06993]] OctreeNCA: Single-Pass 184 MP Segmentation on Consumer Hardware(https://arxiv.org/abs/2508.06993)
Keywords: transformer, segmentation
Abstract: Medical applications demand segmentation of large inputs, like prostate MRIs, pathology slices, or videos of surgery. These inputs should ideally be inferred at once to provide the model with proper spatial or temporal context. When segmenting large inputs, the VRAM consumption of the GPU becomes the bottleneck. Architectures like UNets or Vision Transformers scale very poorly in VRAM consumption, resulting in patch- or frame-wise approaches that compromise global consistency and inference speed. The lightweight Neural Cellular Automaton (NCA) is a bio-inspired model that is by construction size-invariant. However, due to its local-only communication rules, it lacks global knowledge. We propose OctreeNCA by generalizing the neighborhood definition using an octree data structure. Our generalized neighborhood definition enables the efficient traversal of global knowledge. Since deep learning frameworks are mainly developed for large multi-layer networks, their implementation does not fully leverage the advantages of NCAs. We implement an NCA inference function in CUDA that further reduces VRAM demands and increases inference speed. Our OctreeNCA segments high-resolution images and videos quickly while occupying 90% less VRAM than a UNet during evaluation. This allows us to segment 184 Megapixel pathology slices or 1-minute surgical videos at once.

Title: S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision

Authors: Huihui Xu, Jin Ye, Hongqiu Wang, Changkai Ji, Jiashi Lin, Ming Hu, Ziyan Huang, Ying Chen, Chenglong Ma, Tianbin Li, Lihao Liu, Junjun He, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.06995
Pdf URL: https://arxiv.org/pdf/2508.06995
Copy Paste: [[2508.06995]] S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision(https://arxiv.org/abs/2508.06995)
Keywords: segmentation
Abstract: Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between each training epoch. This time-consuming offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these problems, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds for one image. Based on the fast UniAP, we propose the Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn the local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of AP+6.9 on COCO, AR+11.1 on UVO, PixelAcc+4.5 on COCOStuff-27, RQ+8.0 on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg further achieves performance gains on all four benchmarks. Our code and pretrained models are available at this https URL

Title: Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments

Authors: Gian Mario Favero, Ge Ya Luo, Nima Fathi, Justin Szeto, Douglas L. Arnold, Brennan Nichyporuk, Chris Pal, Tal Arbel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07006
Pdf URL: https://arxiv.org/pdf/2508.07006
Copy Paste: [[2508.07006]] Spatio-Temporal Conditional Diffusion Models for Forecasting Future Multiple Sclerosis Lesion Masks Conditioned on Treatments(https://arxiv.org/abs/2508.07006)
Keywords: diffusion, generative
Abstract: Image-based personalized medicine has the potential to transform healthcare, particularly for diseases that exhibit heterogeneous progression such as Multiple Sclerosis (MS). In this work, we introduce the first treatment-aware spatio-temporal diffusion model that is able to generate future masks demonstrating lesion evolution in MS. Our voxel-space approach incorporates multi-modal patient data, including MRI and treatment information, to forecast new and enlarging T2 (NET2) lesion masks at a future time point. Extensive experiments on a multi-centre dataset of 2131 patient 3D MRIs from randomized clinical trials for relapsing-remitting MS demonstrate that our generative model is able to accurately predict NET2 lesion masks for patients across six different treatments. Moreover, we demonstrate our model has the potential for real-world clinical applications through downstream tasks such as future lesion count and location estimation, binary lesion activity classification, and generating counterfactual future NET2 masks for several treatments with different efficacies. This work highlights the potential of causal, image-based generative models as powerful tools for advancing data-driven prognostics in MS.

Title: HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Authors: Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.07011
Pdf URL: https://arxiv.org/pdf/2508.07011
Copy Paste: [[2508.07011]] HiMat: DiT-based Ultra-High Resolution SVBRDF Generation(https://arxiv.org/abs/2508.07011)
Keywords: diffusion, transformer, generative
Abstract: Creating highly detailed SVBRDFs is essential for 3D content creation. The rise of high-resolution text-to-image generative models, based on diffusion transformers (DiT), suggests an opportunity to finetune them for this task. However, retargeting the models to produce multiple aligned SVBRDF maps instead of just RGB images, while achieving high efficiency and ensuring consistency across different maps, remains a challenge. In this paper, we introduce HiMat: a memory- and computation-efficient diffusion-based framework capable of generating native 4K-resolution SVBRDFs. A key challenge we address is maintaining consistency across different maps in a lightweight manner, without relying on training new VAEs or significantly altering the DiT backbone (which would damage its prior capabilities). To tackle this, we introduce the CrossStitch module, a lightweight convolutional module that captures inter-map dependencies through localized operations. Its weights are initialized such that the DiT backbone operation is unchanged before finetuning starts. HiMat enables generation with strong structural coherence and high-frequency details. Results with a large set of text prompts demonstrate the effectiveness of our approach for 4K SVBRDF generation. Further experiments suggest generalization to tasks such as intrinsic decomposition.

Title: Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings

Authors: Mao Li, Fred Conrad, Johann Gagnon-Bartsch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07017
Pdf URL: https://arxiv.org/pdf/2508.07017
Copy Paste: [[2508.07017]] Vec2Summ: Text Summarization via Probabilistic Sentence Embeddings(https://arxiv.org/abs/2508.07017)
Keywords: robust, generative
Abstract: We propose Vec2Summ, a novel method for abstractive summarization that frames the task as semantic compression. Vec2Summ represents a document collection using a single mean vector in the semantic embedding space, capturing the central meaning of the corpus. To reconstruct fluent summaries, we perform embedding inversion -- decoding this mean vector into natural language using a generative language model. To improve reconstruction quality and capture some degree of topical variability, we introduce stochasticity by sampling from a Gaussian distribution centered on the mean. This approach is loosely analogous to bagging in ensemble learning, where controlled randomness encourages more robust and varied outputs. Vec2Summ addresses key limitations of LLM-based summarization methods. It avoids context-length constraints, enables interpretable and controllable generation via semantic parameters, and scales efficiently with corpus size -- requiring only $O(d + d^2)$ parameters. Empirical results show that Vec2Summ produces coherent summaries for topically focused, order-invariant corpora, with performance comparable to direct LLM summarization in terms of thematic coverage and efficiency, albeit with less fine-grained detail. These results underscore Vec2Summ's potential in settings where scalability, semantic control, and corpus-level abstraction are prioritized.
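Note: the statistical core described in this abstract (a mean vector, plus Gaussian sampling around it) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names are hypothetical, the use of a full sample covariance is an assumption inferred from the stated $O(d + d^2)$ parameter count, and the embedding-inversion step that decodes a vector into text is omitted because it needs a trained generative model.

```python
import numpy as np


def fit_corpus_gaussian(embeddings):
    """Summarize a corpus of sentence embeddings by a mean and a covariance.

    embeddings: array-like of shape (n_documents, d), with d > 1.
    Returns (mean, cov) with shapes (d,) and (d, d).
    """
    X = np.asarray(embeddings, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # rows are observations, columns are dimensions
    return mean, cov


def sample_summary_vectors(mean, cov, num_samples, seed=0):
    """Draw candidate summary vectors from N(mean, cov).

    Each sample would then be passed to an embedding-inversion decoder
    (not shown here) to obtain one candidate summary.
    """
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=num_samples)


def parameter_count(d):
    """Numbers stored for the corpus: d for the mean plus d*d for the covariance."""
    return d + d * d
```

The stored representation depends only on the embedding dimension d, not on the number of documents, which is consistent with the scaling claim in the abstract.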

Title: DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents

Authors: Kun Qian, Wenjie Li, Tianyu Sun, Wenhong Wang, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07021
Pdf URL: https://arxiv.org/pdf/2508.07021
Copy Paste: [[2508.07021]] DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents(https://arxiv.org/abs/2508.07021)
Keywords: large language model
Abstract: The exponential growth of scientific literature in PDF format necessitates advanced tools for efficient and accurate document understanding, summarization, and content optimization. Traditional methods fall short in handling complex layouts and multimodal content, while direct application of Large Language Models (LLMs) and Vision-Language Large Models (LVLMs) lacks precision and control for intricate editing tasks. This paper introduces DocRefine, an innovative framework designed for intelligent understanding, content refinement, and automated summarization of scientific PDF documents, driven by natural language instructions. DocRefine leverages the power of advanced LVLMs (e.g., GPT-4o) by orchestrating a sophisticated multi-agent system comprising six specialized and collaborative agents: Layout & Structure Analysis, Multimodal Content Understanding, Instruction Decomposition, Content Refinement, Summarization & Generation, and Fidelity & Consistency Verification. This closed-loop feedback architecture ensures high semantic accuracy and visual fidelity. Evaluated on the comprehensive DocEditBench dataset, DocRefine consistently outperforms state-of-the-art baselines across various tasks, achieving overall scores of 86.7% for Semantic Consistency Score (SCS), 93.9% for Layout Fidelity Index (LFI), and 85.0% for Instruction Adherence Rate (IAR). These results demonstrate DocRefine's superior capability in handling complex multimodal document editing, preserving semantic integrity, and maintaining visual consistency, marking a significant advancement in automated scientific document processing.

Title: MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering

Authors: Jingwei Peng, Jiehao Chen, Mateo Alejandro Rojas, Meilin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07023
Pdf URL: https://arxiv.org/pdf/2508.07023
Copy Paste: [[2508.07023]] MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering(https://arxiv.org/abs/2508.07023)
Keywords: robust, transformer
Abstract: Complex Visual Question Answering (Complex VQA) tasks, which demand sophisticated multi-modal reasoning and external knowledge integration, present significant challenges for existing large vision-language models (LVLMs) often limited by their reliance on high-level global features. To address this, we propose MV-CoRe (Multimodal Visual-Conceptual Reasoning), a novel model designed to enhance Complex VQA performance through the deep fusion of diverse visual and linguistic information. MV-CoRe meticulously integrates global embeddings from pre-trained Vision Large Models (VLMs) and Language Large Models (LLMs) with fine-grained semantic-aware visual features, including object detection characteristics and scene graph representations. An innovative Multimodal Fusion Transformer then processes and deeply integrates these diverse feature sets, enabling rich cross-modal attention and facilitating complex reasoning. We evaluate MV-CoRe on challenging Complex VQA benchmarks, including GQA, A-OKVQA, and OKVQA, after training on VQAv2. Our experimental results demonstrate that MV-CoRe consistently outperforms established LVLM baselines, achieving an overall accuracy of 77.5% on GQA. Ablation studies confirm the critical contribution of both object and scene graph features, and human evaluations further validate MV-CoRe's superior factual correctness and reasoning depth, underscoring its robust capabilities for deep visual and conceptual understanding.

Title: Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation

Authors: Juntong Fan, Shuyi Fan, Debesh Jha, Changsheng Fang, Tieyong Zeng, Hengyong Yu, Dayang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07028
Pdf URL: https://arxiv.org/pdf/2508.07028
Copy Paste: [[2508.07028]] Large Language Model Evaluated Stand-alone Attention-Assisted Graph Neural Network with Spatial and Structural Information Interaction for Precise Endoscopic Image Segmentation(https://arxiv.org/abs/2508.07028)
Keywords: large language model, segmentation
Abstract: Accurate endoscopic image segmentation on the polyps is critical for early colorectal cancer detection. However, this task remains challenging due to low contrast with surrounding mucosa, specular highlights, and indistinct boundaries. To address these challenges, we propose FOCUS-Med, which stands for Fusion of spatial and structural graph with attentional context-aware polyp segmentation in endoscopic medical imaging. FOCUS-Med integrates a Dual Graph Convolutional Network (Dual-GCN) module to capture contextual spatial and topological structural dependencies. This graph-based representation enables the model to better distinguish polyps from background tissues by leveraging topological cues and spatial connectivity, which are often obscured in raw image intensities. It enhances the model's ability to preserve boundaries and delineate complex shapes typical of polyps. In addition, a location-fused stand-alone self-attention is employed to strengthen global context integration. To bridge the semantic gap between encoder-decoder layers, we incorporate a trainable weighted fast normalized fusion strategy for efficient multi-scale aggregation. Notably, we are the first to introduce the use of a Large Language Model (LLM) to provide detailed qualitative evaluations of segmentation quality. Extensive experiments on public benchmarks demonstrate that FOCUS-Med achieves state-of-the-art performance across five key metrics, underscoring its effectiveness and clinical potential for AI-assisted colonoscopy.

Title: From Imitation to Optimization: A Comparative Study of Offline Learning for Autonomous Driving

Authors: Antonio Guillen-Perez
Subjects: cs.LG, cs.AI, cs.RO, eess.SY
Abstract URL: https://arxiv.org/abs/2508.07029
Pdf URL: https://arxiv.org/pdf/2508.07029
Copy Paste: [[2508.07029]] From Imitation to Optimization: A Comparative Study of Offline Learning for Autonomous Driving(https://arxiv.org/abs/2508.07029)
Keywords: robust, transformer
Abstract: Learning robust driving policies from large-scale, real-world datasets is a central challenge in autonomous driving, as online data collection is often unsafe and impractical. While Behavioral Cloning (BC) offers a straightforward approach to imitation learning, policies trained with BC are notoriously brittle and suffer from compounding errors in closed-loop execution. This work presents a comprehensive pipeline and a comparative study to address this limitation. We first develop a series of increasingly sophisticated BC baselines, culminating in a Transformer-based model that operates on a structured, entity-centric state representation. While this model achieves low imitation loss, we show that it still fails in long-horizon simulations. We then demonstrate that by applying a state-of-the-art Offline Reinforcement Learning algorithm, Conservative Q-Learning (CQL), to the same data and architecture, we can learn a significantly more robust policy. Using a carefully engineered reward function, the CQL agent learns a conservative value function that enables it to recover from minor errors and avoid out-of-distribution states. In a large-scale evaluation on 1,000 unseen scenarios from the Waymo Open Motion Dataset, our final CQL agent achieves a 3.2x higher success rate and a 7.4x lower collision rate than the strongest BC baseline, proving that an offline RL approach is critical for learning robust, long-horizon driving policies from static expert data.
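Note: for readers unfamiliar with Conservative Q-Learning, the sketch below shows the commonly cited CQL(H) objective for a discrete action space: a standard Bellman error plus a penalty that pushes down Q-values over all actions while pushing up the Q-value of the action found in the dataset. It is an illustration only. The discrete action space, the function name, and the `alpha` weight are assumptions, and the paper's reward function and architecture are not reproduced.

```python
import numpy as np


def cql_loss(q_values, actions, td_targets, alpha=1.0):
    """Conservative Q-Learning loss for one batch (discrete actions assumed).

    q_values:   (batch, num_actions) Q-estimates for every action
    actions:    (batch,) integer actions taken in the offline dataset
    td_targets: (batch,) Bellman targets, e.g. r + gamma * max_a' Q_target(s', a')
    alpha:      weight of the conservative penalty (hypothetical default)
    """
    q_values = np.asarray(q_values, dtype=float)
    actions = np.asarray(actions, dtype=int)
    td_targets = np.asarray(td_targets, dtype=float)

    rows = np.arange(q_values.shape[0])
    q_data = q_values[rows, actions]  # Q(s, a) for the logged actions

    # Numerically stable log-sum-exp over actions.
    q_max = q_values.max(axis=1, keepdims=True)
    logsumexp = q_max[:, 0] + np.log(np.exp(q_values - q_max).sum(axis=1))

    # Penalty is non-negative because logsumexp >= max_a Q(s, a) >= Q(s, a_data).
    conservative_penalty = (logsumexp - q_data).mean()
    bellman_error = 0.5 * ((q_data - td_targets) ** 2).mean()
    return alpha * conservative_penalty + bellman_error
```

Because the penalty grows when actions outside the dataset receive high Q-values, minimizing it discourages the policy from preferring out-of-distribution actions.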

Title: Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities

Authors: Anindya Bijoy Das, Shahnewaz Karim Sakib, Shibbir Ahmed
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07031
Pdf URL: https://arxiv.org/pdf/2508.07031
Copy Paste: [[2508.07031]] Trustworthy Medical Imaging with Large Language Models: A Study of Hallucinations Across Modalities(https://arxiv.org/abs/2508.07031)
Keywords: generative, large language model
Abstract: Large Language Models (LLMs) are increasingly applied to medical imaging tasks, including image interpretation and synthetic image generation. However, these models often produce hallucinations, which are confident but incorrect outputs that can mislead clinical decisions. This study examines hallucinations in two directions: image to text, where LLMs generate reports from X-ray, CT, or MRI scans, and text to image, where models create medical images from clinical prompts. We analyze errors such as factual inconsistencies and anatomical inaccuracies, evaluating outputs using expert informed criteria across imaging modalities. Our findings reveal common patterns of hallucination in both interpretive and generative tasks, with implications for clinical reliability. We also discuss factors contributing to these failures, including model architecture and training data. By systematically studying both image understanding and generation, this work provides insights into improving the safety and trustworthiness of LLM driven medical imaging systems.

Title: A Stage-Aware Mixture of Experts Framework for Neurodegenerative Disease Progression Modelling

Authors: Tiantian He, Keyue Jiang, An Zhao, Anna Schroder, Elinor Thompson, Sonja Soskic, Frederik Barkhof, Daniel C. Alexander
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2508.07032
Pdf URL: https://arxiv.org/pdf/2508.07032
Copy Paste: [[2508.07032]] A Stage-Aware Mixture of Experts Framework for Neurodegenerative Disease Progression Modelling(https://arxiv.org/abs/2508.07032)
Keywords: diffusion, generative
Abstract: The long-term progression of neurodegenerative diseases is commonly conceptualized as a spatiotemporal diffusion process that consists of a graph diffusion process across the structural brain connectome and a localized reaction process within brain regions. However, modeling this progression remains challenging due to 1) the scarcity of longitudinal data obtained through irregular and infrequent subject visits and 2) the complex interplay of pathological mechanisms across brain regions and disease stages, where traditional models assume fixed mechanisms throughout disease progression. To address these limitations, we propose a novel stage-aware Mixture of Experts (MoE) framework that explicitly models how different contributing mechanisms dominate at different disease stages through time-dependent expert this http URL-wise, we utilize an iterative dual optimization method to properly estimate the temporal position of individual observations, constructing a cohort-level progression trajectory from irregular snapshots. Model-wise, we enhance the spatial component with an inhomogeneous graph neural diffusion model (IGND) that allows diffusivity to vary based on node states and time, providing more flexible representations of brain networks. We also introduce a localized neural reaction module to capture complex dynamics beyond standard this http URL resulting IGND-MoE model dynamically integrates these components across temporal states, offering a principled way to understand how stage-specific pathological mechanisms contribute to progression. The stage-wise weights yield novel clinical insights that align with literature, suggesting that graph-related processes are more influential at early stages, while other unknown physical processes become dominant later on.

Title: 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression

Authors: Yuke Xing, William Gordon, Qi Yang, Kaifa Yang, Jiarui Wang, Yiling Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07038
Pdf URL: https://arxiv.org/pdf/2508.07038
Copy Paste: [[2508.07038]] 3DGS-VBench: A Comprehensive Video Quality Evaluation Benchmark for 3DGS Compression(https://arxiv.org/abs/2508.07038)
Keywords: generative
Abstract: 3D Gaussian Splatting (3DGS) enables real-time novel view synthesis with high visual fidelity, but its substantial storage requirements hinder practical deployment, prompting state-of-the-art (SOTA) 3DGS methods to incorporate compression modules. However, these 3DGS generative compression techniques introduce unique distortions lacking systematic quality assessment research. To this end, we establish 3DGS-VBench, a large-scale Video Quality Assessment (VQA) Dataset and Benchmark with 660 compressed 3DGS models and video sequences generated from 11 scenes across 6 SOTA 3DGS compression algorithms with systematically designed parameter levels. With annotations from 50 participants, we obtained MOS scores with outlier removal and validated dataset reliability. We benchmark 6 3DGS compression algorithms on storage efficiency and visual quality, and evaluate 15 quality assessment metrics across multiple paradigms. Our work enables specialized VQA model training for 3DGS, serving as a catalyst for compression and quality assessment research. The dataset is available at this https URL.

Title: SPARE: Securing Progressive Web Applications Against Unauthorized Replications

Authors: Sajib Talukder, Nur Imtiazul Haque, Khandakar Ashrafi Akbar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.07053
Pdf URL: https://arxiv.org/pdf/2508.07053
Copy Paste: [[2508.07053]] SPARE: Securing Progressive Web Applications Against Unauthorized Replications(https://arxiv.org/abs/2508.07053)
Keywords: secure, security, defense, attack, robust
Abstract: WebView applications are widely used in mobile applications to display web content directly within the app, enhancing user engagement by eliminating the need to open an external browser and providing a seamless experience. Progressive Web Applications (PWAs) further improve usability by combining the accessibility of web apps with the speed, offline capabilities, and responsiveness of native applications. However, malicious developers can exploit this technology by duplicating PWA web links to create counterfeit native apps, monetizing through user diversion. This unethical practice poses significant risks to users and the original application developers, underscoring the need for robust security measures to prevent unauthorized replication. Considering the one-way communication of Trusted Web Activity (a method for integrating web content into Android applications) and PWAs, we propose a query parameter-based practical security solution to defend against or mitigate such attacks. We analyze the vulnerabilities of our proposed security solution to assess its effectiveness and introduce advanced measures to address any identified weaknesses, presenting a comprehensive defense framework. As part of our work, we developed a prototype web application that secures PWAs from replication by embedding a combination of Unix timestamps and device identifiers into the query parameters. We evaluate the effectiveness of this defense strategy by simulating an advanced attack scenario. Additionally, we created a realistic dataset reflecting mobile app user behavior, modeled using a Zipfian distribution, to validate our framework.
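Note: a minimal sketch of a query-parameter check built from a Unix timestamp and a device identifier, as the abstract describes. This is not the SPARE protocol itself. The abstract does not say how the two values are combined, so the HMAC signature, the parameter names (`ts`, `device`, `sig`), the shared `SECRET_KEY`, and the 30-second freshness window are all assumptions made for illustration.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret; a real deployment would need proper key management.
SECRET_KEY = b"demo-secret"


def _sign(ts, device_id):
    """HMAC-SHA256 over the timestamp and device identifier (assumed scheme)."""
    message = f"{ts}:{device_id}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def build_launch_url(base_url, device_id, now=None):
    """Client side: append timestamp, device id, and signature as query parameters."""
    ts = int(time.time() if now is None else now)
    query = urlencode({"ts": ts, "device": device_id, "sig": _sign(ts, device_id)})
    return f"{base_url}?{query}"


def is_launch_url_valid(url, max_age_seconds=30, now=None):
    """Server side: accept only fresh requests with a matching signature."""
    params = parse_qs(urlparse(url).query)
    try:
        ts = int(params["ts"][0])
        device_id = params["device"][0]
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False  # a required parameter is missing or malformed
    current = time.time() if now is None else now
    is_fresh = 0 <= current - ts <= max_age_seconds
    return is_fresh and hmac.compare_digest(sig, _sign(ts, device_id))
```

For example, a URL built with `now=1000` passes validation at `now=1010` but is rejected at `now=2000`, and changing the device identifier in the URL invalidates the signature.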

Title: Membership and Memorization in LLM Knowledge Distillation

Authors: Ziqi Zhang, Ali Shahin Shamsabadi, Hanxiao Lu, Yifeng Cai, Hamed Haddadi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07054
Pdf URL: https://arxiv.org/pdf/2508.07054
Copy Paste: [[2508.07054]] Membership and Memorization in LLM Knowledge Distillation(https://arxiv.org/abs/2508.07054)
Keywords: privacy, large language model
Abstract: Recent advances in Knowledge Distillation (KD) aim to mitigate the high computational demands of Large Language Models (LLMs) by transferring knowledge from a large "teacher" to a smaller "student" model. However, students may inherit the teacher's privacy when the teacher is trained on private data. In this work, we systematically characterize and investigate membership and memorization privacy risks inherent in six LLM KD techniques. Using instruction-tuning settings that span seven NLP tasks, together with three teacher model families (GPT-2, LLAMA-2, and OPT), and various size student models, we demonstrate that all existing LLM KD approaches carry membership and memorization privacy risks from the teacher to its students. However, the extent of privacy risks varies across different KD techniques. We systematically analyse how key LLM KD components (KD objective functions, student training data and NLP tasks) impact such privacy risks. We also demonstrate a significant disagreement between memorization and membership privacy risks of LLM KD techniques. Finally, we characterize per-block privacy risk and demonstrate that the privacy risk varies across different blocks by a large margin.

Title: SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

Authors: Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07069
Pdf URL: https://arxiv.org/pdf/2508.07069
Copy Paste: [[2508.07069]] SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages(https://arxiv.org/abs/2508.07069)
Keywords: large language model
Abstract: Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.

Title: Surgical Knowledge Rewrite in Compact LLMs: An 'Unlearn-then-Learn' Strategy with ($IA^3$) for Localized Factual Modulation and Catastrophic Forgetting Mitigation

Authors: Stanley Ngugi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07075
Pdf URL: https://arxiv.org/pdf/2508.07075
Copy Paste: [[2508.07075]] Surgical Knowledge Rewrite in Compact LLMs: An 'Unlearn-then-Learn' Strategy with ($IA^3$) for Localized Factual Modulation and Catastrophic Forgetting Mitigation(https://arxiv.org/abs/2508.07075)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) struggle with dynamic knowledge updates, especially when new information conflicts with deeply embedded facts. Such conflicting factual edits often lead to two critical issues: resistance to adopting the new fact and severe catastrophic forgetting of unrelated knowledge. This paper introduces and evaluates a novel "unlearn-then-learn" strategy for precise knowledge editing in LLMs, leveraging the parameter-efficient fine-tuning (PEFT) technique, Infused Adapter by Inhibiting and Amplifying Inner Activations ($IA^3$). Crucially, this two-stage approach is powered by an initial circuit localization phase that identifies and targets the specific internal components responsible for encoding the conflicting fact. Through a rigorous experimental methodology on microsoft/Phi-3-mini-4k-instruct, we demonstrate that this mechanistically informed two-stage approach achieves near-perfect accuracy (98.50%) for the new, modulated fact while simultaneously effectively suppressing the original conflicting fact (96.00% forget rate). Critically, our strategy exhibits unprecedented localization (72.00% F_control accuracy), dramatically mitigating catastrophic forgetting observed in direct fine-tuning approaches (which showed as low as ~20% F_control accuracy), a direct benefit of our targeted interpretability-guided intervention. Furthermore, qualitative analysis reveals a nuanced mechanism of "soft forgetting," where original knowledge is suppressed from default retrieval but remains latent and conditionally accessible, enhancing model safety and control. These findings represent a significant advancement towards precise, localized, and safe knowledge management in compact LLMs.

Title: Improving Real-Time Concept Drift Detection using a Hybrid Transformer-Autoencoder Framework

Authors: N Harshit, K Mounvik
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07085
Pdf URL: https://arxiv.org/pdf/2508.07085
Copy Paste: [[2508.07085]] Improving Real-Time Concept Drift Detection using a Hybrid Transformer-Autoencoder Framework(https://arxiv.org/abs/2508.07085)
Keywords: robust, interpretability, transformer
Abstract: In applied machine learning, concept drift, a gradual or abrupt change in data distribution, can significantly reduce model performance. Typical detection methods, such as statistical tests or reconstruction-based models, are generally reactive and not very sensitive to early drift. Our study proposes a hybrid framework consisting of Transformers and Autoencoders to model complex temporal dynamics and provide online drift detection. We create a distinct Trust Score methodology, which combines signals on (1) statistical and reconstruction-based drift metrics, specifically PSI, JSD, and Transformer-AE error, (2) prediction uncertainty, (3) rule violations, and (4) the trend of classifier error aligned with the combined metrics defined by the Trust Score. Using a time-sequenced airline passenger dataset with synthetic drift, our proposed model detects drift better than baseline methods, both overall and at different detection thresholds, in terms of sensitivity and interpretability, and provides a strong pipeline for real-time drift detection in applied machine learning. We evaluated performance using a time-sequenced airline passenger dataset with gradually injected drift (e.g., permuted ticket prices in later batches), broken into 10 time segments [1]. Our results support that the Transformer-Autoencoder detected drift earlier and with more sensitivity than the autoencoders commonly used in the literature, and provided improved modeling of error rates and logical violations. Therefore, a robust framework was developed to reliably monitor concept drift.
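
Two of the statistical drift signals named in this abstract, PSI (Population Stability Index) and JSD (Jensen-Shannon divergence), have standard definitions that can be sketched in a few lines. The binning helper, the smoothing constant `EPS`, and the function names are assumptions of this sketch; how the authors bin features or combine these values into the Trust Score is not stated in the abstract.

```python
import math

EPS = 1e-6  # smoothing to avoid log(0); an assumption of this sketch


def histogram(values, edges):
    """Return smoothed bin proportions; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts) or 1
    props = [max(c / total, EPS) for c in counts]
    norm = sum(props)
    return [p / norm for p in props]


def psi(expected, actual):
    """Population Stability Index between two binned distributions."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def jsd(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Both scores are zero when the reference and current batches have identical binned distributions and grow as the batches diverge, which is what makes them usable as per-batch drift signals.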

Title: ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting

Authors: Sandro Papais, Letian Wang, Brian Cheong, Steven L. Waslander
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.07089
Pdf URL: https://arxiv.org/pdf/2508.07089
Copy Paste: [[2508.07089]] ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting(https://arxiv.org/abs/2508.07089)
Keywords: transformer
Abstract: We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, achieving an EPA of 54.9%, surpassing previous methods by 9.3%, while also attaining the best mAP and minADE among multi-view detection and forecasting models.

Title: BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context

Authors: Aditya Tomar, Nihar Ranjan Sahoo, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07090
Pdf URL: https://arxiv.org/pdf/2508.07090
Copy Paste: [[2508.07090]] BharatBBQ: A Multilingual Bias Benchmark for Question Answering in the Indian Context(https://arxiv.org/abs/2508.07090)
Keywords: fair
Abstract: Evaluating social biases in language models (LMs) is crucial for ensuring fairness and minimizing the reinforcement of harmful stereotypes in AI systems. Existing benchmarks, such as the Bias Benchmark for Question Answering (BBQ), primarily focus on Western contexts, limiting their applicability to the Indian context. To address this gap, we introduce BharatBBQ, a culturally adapted benchmark designed to assess biases in Hindi, English, Marathi, Bengali, Tamil, Telugu, Odia, and Assamese. BharatBBQ covers 13 social categories, including 3 intersectional groups, reflecting prevalent biases in the Indian sociocultural landscape. Our dataset contains 49,108 examples in one language that are expanded using translation and verification to 392,864 examples in eight different languages. We evaluate five multilingual LM families across zero and few-shot settings, analyzing their bias and stereotypical bias scores. Our findings highlight persistent biases across languages and social categories and often amplified biases in Indian languages compared to English, demonstrating the necessity of linguistically and culturally grounded benchmarks for bias evaluation.

Title: ScamDetect: Towards a Robust, Agnostic Framework to Uncover Threats in Smart Contracts

Authors: Pasquale De Rosa, Pascal Felber, Valerio Schiavoni
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.07094
Pdf URL: https://arxiv.org/pdf/2508.07094
Copy Paste: [[2508.07094]] ScamDetect: Towards a Robust, Agnostic Framework to Uncover Threats in Smart Contracts(https://arxiv.org/abs/2508.07094)
Keywords: security, privacy, robust
Abstract: Smart contracts have transformed decentralized finance by enabling programmable, trustless transactions. However, their widespread adoption and growing financial significance have attracted persistent and sophisticated threats, such as phishing campaigns and contract-level exploits. Traditional transaction-based threat detection methods often expose sensitive user data and interactions, raising privacy and security concerns. In response, static bytecode analysis has emerged as a proactive mitigation strategy, identifying malicious contracts before they execute harmful this http URL on this approach, we introduced PhishingHook, the first machine-learning-based framework for detecting phishing activities in smart contracts via static bytecode and opcode analysis, achieving approximately 90% detection accuracy. Nevertheless, two pressing challenges remain: (1) the increasing use of sophisticated bytecode obfuscation techniques designed to evade static analysis, and (2) the heterogeneity of blockchain environments requiring platform-agnostic this http URL paper presents a vision for ScamDetect (Smart Contract Agnostic Malware Detector), a robust, modular, and platform-agnostic framework for smart contract malware detection. Over the next 2.5 years, ScamDetect will evolve in two stages: first, by tackling obfuscated Ethereum Virtual Machine (EVM) bytecode through graph neural network (GNN) analysis of control flow graphs (CFGs), leveraging GNNs' ability to capture complex structural patterns beyond opcode sequences; and second, by generalizing detection capabilities to emerging runtimes such as WASM. ScamDetect aims to enable proactive, scalable security for the future of decentralized ecosystems.

Title: Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria

Authors: Yang Cao, Yubin Chen, Zhao Song, Jiahao Zhang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.07102
Pdf URL: https://arxiv.org/pdf/2508.07102
Copy Paste: [[2508.07102]] Towards High-Order Mean Flow Generative Models: Feasibility, Expressivity, and Provably Efficient Criteria(https://arxiv.org/abs/2508.07102)
Keywords: generative
Abstract: Generative modelling has seen significant advances through simulation-free paradigms such as Flow Matching, and in particular, the MeanFlow framework, which replaces instantaneous velocity fields with average velocities to enable efficient single-step sampling. In this work, we introduce a theoretical study on Second-Order MeanFlow, a novel extension that incorporates average acceleration fields into the MeanFlow objective. We first establish the feasibility of our approach by proving that the average acceleration satisfies a generalized consistency condition analogous to first-order MeanFlow, thereby supporting stable, one-step sampling and tractable loss functions. We then characterize its expressivity via circuit complexity analysis, showing that under mild assumptions, the Second-Order MeanFlow sampling process can be implemented by uniform threshold circuits within the $\mathsf{TC}^0$ class. Finally, we derive provably efficient criteria for scalable implementation by leveraging fast approximate attention computations: we prove that attention operations within the Second-Order MeanFlow architecture can be approximated to within $1/\mathrm{poly}(n)$ error in time $n^{2+o(1)}$. Together, these results lay the theoretical foundation for high-order flow matching models that combine rich dynamics with practical sampling efficiency.

Title: Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution

Authors: Falaah Arif Khan, Nivedha Sivakumar, Yinong Oliver Wang, Katherine Metcalf, Cezanne Camacho, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07111
Pdf URL: https://arxiv.org/pdf/2508.07111
Copy Paste: [[2508.07111]] Investigating Intersectional Bias in Large Language Models using Confidence Disparities in Coreference Resolution(https://arxiv.org/abs/2508.07111)
Keywords: fair, large language model
Abstract: Large language models (LLMs) have achieved impressive performance, leading to their widespread adoption as decision-support tools in resource-constrained contexts like hiring and admissions. There is, however, scientific consensus that AI systems can reflect and exacerbate societal biases, raising concerns about identity-based harm when used in critical social contexts. Prior work has laid a solid foundation for assessing bias in LLMs by evaluating demographic disparities in different language reasoning tasks. In this work, we extend single-axis fairness evaluations to examine intersectional bias, recognizing that when multiple axes of discrimination intersect, they create distinct patterns of disadvantage. We create a new benchmark called WinoIdentity by augmenting the WinoBias dataset with 25 demographic markers across 10 attributes, including age, nationality, and race, intersected with binary gender, yielding 245,700 prompts to evaluate 50 distinct bias patterns. Focusing on harms of omission due to underrepresentation, we investigate bias through the lens of uncertainty and propose a group (un)fairness metric called Coreference Confidence Disparity which measures whether models are more or less confident for some intersectional identities than others. We evaluate five recently published LLMs and find confidence disparities as high as 40% along various demographic attributes including body type, sexual orientation and socio-economic status, with models being most uncertain about doubly-disadvantaged identities in anti-stereotypical settings. Surprisingly, coreference confidence decreases even for hegemonic or privileged markers, indicating that the recent impressive performance of LLMs is more likely due to memorization than logical reasoning. Notably, these are two independent failures in value alignment and validity that can compound to cause social harm.

Title: AugLift: Boosting Generalization in Lifting-based 3D Human Pose Estimation

Authors: Nikolai Warner, Wenjin Zhang, Irfan Essa, Apaar Sadhwani
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07112
Pdf URL: https://arxiv.org/pdf/2508.07112
Copy Paste: [[2508.07112]] AugLift: Boosting Generalization in Lifting-based 3D Human Pose Estimation(https://arxiv.org/abs/2508.07112)
Keywords: robust
Abstract: Lifting-based methods for 3D Human Pose Estimation (HPE), which predict 3D poses from detected 2D keypoints, often generalize poorly to new datasets and real-world settings. To address this, we propose AugLift, a simple yet effective reformulation of the standard lifting pipeline that significantly improves generalization performance without requiring additional data collection or sensors. AugLift sparsely enriches the standard input -- the 2D keypoint coordinates $(x, y)$ -- by augmenting it with a keypoint detection confidence score $c$ and a corresponding depth estimate $d$. These additional signals are computed from the image using off-the-shelf, pre-trained models (e.g., for monocular depth estimation), thereby inheriting their strong generalization capabilities. Importantly, AugLift serves as a modular add-on and can be readily integrated into existing lifting architectures. Our extensive experiments across four datasets demonstrate that AugLift boosts cross-dataset performance on unseen datasets by an average of 10.1%, while also improving in-distribution performance by 4.0%. These gains are consistent across various lifting architectures, highlighting the robustness of our method. Our analysis suggests that these sparse, keypoint-aligned cues provide robust frame-level context, offering a practical way to significantly improve the generalization of any lifting-based pose estimation model. Code will be made publicly available.
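
The input augmentation described in this abstract, going from $(x, y)$ to $(x, y, c, d)$ per keypoint, can be sketched as follows. The function name, the nested-list depth map, and the nearest-pixel depth lookup are assumptions for illustration; the abstract only says the confidence and depth come from off-the-shelf pre-trained models.

```python
from typing import List, Sequence, Tuple


def augment_keypoints(
    keypoints: Sequence[Tuple[float, float]],
    confidences: Sequence[float],
    depth_map: Sequence[Sequence[float]],
) -> List[Tuple[float, float, float, float]]:
    """Turn each (x, y) keypoint into (x, y, c, d).

    c is the detector's confidence for that keypoint; d is read from a
    monocular depth map at the nearest pixel (a hypothetical sampling rule).
    """
    height, width = len(depth_map), len(depth_map[0])
    augmented = []
    for (x, y), c in zip(keypoints, confidences):
        # Clamp to the image bounds so off-frame detections stay indexable.
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        augmented.append((x, y, c, depth_map[row][col]))
    return augmented
```

Because only the per-keypoint input width changes (two values become four), a lifting network would need little more than a wider first layer, which is consistent with the "modular add-on" framing.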

Title: Approaching Maximal Information Extraction in Low-Signal Regimes via Multiple Instance Learning

Authors: Atakan Azakli, Bernd Stelzer
Subjects: cs.LG, hep-ex
Abstract URL: https://arxiv.org/abs/2508.07114
Pdf URL: https://arxiv.org/pdf/2508.07114
Copy Paste: [[2508.07114]] Approaching Maximal Information Extraction in Low-Signal Regimes via Multiple Instance Learning(https://arxiv.org/abs/2508.07114)
Keywords: extraction
Abstract: In this work, we propose a new machine learning (ML) methodology to obtain more precise predictions for some parameters of interest in a given hypotheses testing problem. Our proposed method also allows ML models to have more discriminative power in cases where it is extremely challenging for state-of-the-art classifiers to have any level of accurate predictions. This method can also allow us to systematically decrease the error from ML models in their predictions. In this paper, we provide a mathematical motivation why Multiple Instance Learning (MIL) would have more predictive power over their single-instance counterparts. We support our theoretical claims by analyzing the behavior of the MIL models through their scaling behaviors with respect to the number of instances on which the model makes predictions. As a concrete application, we constrain Wilson coefficients of the Standard Model Effective Field Theory (SMEFT) using kinematic information from subatomic particle collision events at the Large Hadron Collider (LHC). We show that under certain circumstances, it might be possible to extract the theoretical maximum Fisher Information latent in a dataset.

Title: From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context

Authors: Peyman Baghershahi, Gregoire Fournier, Pranav Nyati, Sourav Medya
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07117
Pdf URL: https://arxiv.org/pdf/2508.07117
Copy Paste: [[2508.07117]] From Nodes to Narratives: Explaining Graph Neural Networks with LLMs and Graph Context(https://arxiv.org/abs/2508.07117)
Keywords: explainability, large language model
Abstract: Graph Neural Networks (GNNs) have emerged as powerful tools for learning over structured data, including text-attributed graphs, which are common in domains such as citation networks, social platforms, and knowledge graphs. GNNs are not inherently interpretable and thus, many explanation methods have been proposed. However, existing explanation methods often struggle to generate interpretable, fine-grained rationales, especially when node attributes include rich natural language. In this work, we introduce LOGIC, a lightweight, post-hoc framework that uses large language models (LLMs) to generate faithful and interpretable explanations for GNN predictions. LOGIC projects GNN node embeddings into the LLM embedding space and constructs hybrid prompts that interleave soft prompts with textual inputs from the graph structure. This enables the LLM to reason about GNN internal representations and produce natural language explanations along with concise explanation subgraphs. Our experiments across four real-world TAG datasets demonstrate that LOGIC achieves a favorable trade-off between fidelity and sparsity, while significantly improving human-centric metrics such as insightfulness. LOGIC sets a new direction for LLM-based explainability in graph learning by aligning GNN internals with human reasoning.

Title: Multi-Level Service Performance Forecasting via Spatiotemporal Graph Neural Networks

Authors: Zhihao Xue, Yun Zi, Nia Qi, Ming Gong, Yujun Zou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07122
Pdf URL: https://arxiv.org/pdf/2508.07122
Copy Paste: [[2508.07122]] Multi-Level Service Performance Forecasting via Spatiotemporal Graph Neural Networks(https://arxiv.org/abs/2508.07122)
Keywords: robust
Abstract: This paper proposes a spatiotemporal graph neural network-based performance prediction algorithm to address the challenge of forecasting performance fluctuations in distributed backend systems with multi-level service call structures. The method abstracts system states at different time slices into a sequence of graph structures. It integrates the runtime features of service nodes with the invocation relationships among services to construct a unified spatiotemporal modeling framework. The model first applies a graph convolutional network to extract high-order dependency information from the service topology. Then it uses a gated recurrent network to capture the dynamic evolution of performance metrics over time. A time encoding mechanism is also introduced to enhance the model's ability to represent non-stationary temporal sequences. The architecture is trained in an end-to-end manner, optimizing the multi-layer nested structure to achieve high-precision regression of future service performance metrics. To validate the effectiveness of the proposed method, a large-scale public cluster dataset is used. A series of multi-dimensional experiments are designed, including variations in time windows and concurrent load levels. These experiments comprehensively evaluate the model's predictive performance and stability. The experimental results show that the proposed model outperforms existing representative methods across key metrics such as MAE, RMSE, and R2. It maintains strong robustness under varying load intensities and structural complexities. These results demonstrate the model's practical potential for backend service performance management tasks.

Title: Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Authors: Zhengran Ji, Boyuan Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07126
Pdf URL: https://arxiv.org/pdf/2508.07126
Copy Paste: [[2508.07126]] Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning(https://arxiv.org/abs/2508.07126)
Keywords: robust
Abstract: Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.
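
The conversion from scalar feedback to preference pairs that this abstract attributes to Pref-GUIDE Individual can be sketched roughly as below. The `window` and `margin` parameters, the function name, and the exact filtering rule are assumptions; the abstract states only that behaviors are compared within short windows and that ambiguous feedback is filtered.

```python
from typing import List, Tuple


def scalar_to_preferences(
    feedback: List[float], window: int = 3, margin: float = 0.2
) -> List[Tuple[int, int]]:
    """Return (preferred_index, other_index) pairs from per-step scalar feedback.

    Only steps at most `window - 1` apart are compared, and pairs whose
    feedback differs by less than `margin` are dropped as ambiguous.
    """
    preferences = []
    for i in range(len(feedback)):
        for j in range(i + 1, min(i + window, len(feedback))):
            diff = feedback[i] - feedback[j]
            if abs(diff) < margin:
                continue  # too close to call; skip rather than guess
            preferences.append((i, j) if diff > 0 else (j, i))
    return preferences
```

The resulting pairs could then feed a standard preference-based reward model; restricting comparisons to nearby steps is one way to limit the effect of a rater's scale drifting over a session.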

Title: How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?

Authors: Niranjana Arun Menon, Iqra Farooq, Yulong Li, Sara Ahmed, Yutong Xie, Muhammad Awais, Imran Razzak
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2508.07127
Pdf URL: https://arxiv.org/pdf/2508.07127
Copy Paste: [[2508.07127]] How Effectively Can Large Language Models Connect SNP Variants and ECG Phenotypes for Cardiovascular Risk Prediction?(https://arxiv.org/abs/2508.07127)
Keywords: large language model
Abstract: Cardiovascular disease (CVD) prediction remains a tremendous challenge due to its multifactorial etiology and global burden of morbidity and mortality. Despite the growing availability of genomic and electrophysiological data, extracting biologically meaningful insights from such high-dimensional, noisy, and sparsely annotated datasets remains a non-trivial task. Recently, LLMs have been applied effectively to predict structural variations in biological sequences. In this work, we explore the potential of fine-tuned LLMs to predict cardiac diseases and SNPs potentially leading to CVD risk using genetic markers derived from high-throughput genomic profiling. We investigate the effect of genetic patterns associated with cardiac conditions and evaluate how LLMs can learn latent biological relationships from structured and semi-structured genomic data obtained by mapping genetic aspects that are inherited from the family tree. By framing the problem as a Chain of Thought (CoT) reasoning task, the models are prompted to generate disease labels and articulate informed clinical deductions across diverse patient profiles and phenotypes. The findings highlight the promise of LLMs in contributing to early detection, risk assessment, and ultimately, the advancement of personalized medicine in cardiac care.

Title: Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays

Authors: Gregory Schuit, Denis Parra, Cecilia Besa
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07128
Pdf URL: https://arxiv.org/pdf/2508.07128
Copy Paste: [[2508.07128]] Perceptual Evaluation of GANs and Diffusion Models for Generating X-rays(https://arxiv.org/abs/2508.07128)
Keywords: diffusion, generative, segmentation
Abstract: Generative image models have achieved remarkable progress in both natural and medical imaging. In the medical context, these techniques offer a potential solution to data scarcity, especially for low-prevalence anomalies that impair the performance of AI-driven diagnostic and segmentation tools. However, questions remain regarding the fidelity and clinical utility of synthetic images, since poor generation quality can undermine model generalizability and trust. In this study, we evaluate the effectiveness of state-of-the-art generative models, namely Generative Adversarial Networks (GANs) and Diffusion Models (DMs), for synthesizing chest X-rays conditioned on four abnormalities: Atelectasis (AT), Lung Opacity (LO), Pleural Effusion (PE), and Enlarged Cardiac Silhouette (ECS). Using a benchmark composed of real images from the MIMIC-CXR dataset and synthetic images from both GANs and DMs, we conducted a reader study with three radiologists of varied experience. Participants were asked to distinguish real from synthetic images and assess the consistency between visual features and the target abnormality. Our results show that while DMs generate more visually realistic images overall, GANs can report better accuracy for specific conditions, such as absence of ECS. We further identify visual cues radiologists use to detect synthetic images, offering insights into the perceptual gaps in current models. These findings underscore the complementary strengths of GANs and DMs and point to the need for further refinement to ensure generative models can reliably augment training datasets for AI diagnostic systems.

Title: A Stable and Principled Loss Function for Direct Language Model Alignment

Authors: Yuandong Tan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07137
Pdf URL: https://arxiv.org/pdf/2508.07137
Copy Paste: [[2508.07137]] A Stable and Principled Loss Function for Direct Language Model Alignment(https://arxiv.org/abs/2508.07137)
Keywords: large language model
Abstract: The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to demonstrate that our method avoids the large gradients that plague DPO when the probability of dispreferred responses approaches zero. This inherent stability prevents reward hacking and leads to more effective alignment. We validate our approach by fine-tuning a Qwen2.5-7B model, showing significant win-rate improvements over a standard DPO baseline and achieving competitive performance against larger models like Llama-3.1-8B.
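
For reference, the two standard ingredients this abstract contrasts can be written out. The first line is the usual DPO loss; the second is the log-ratio relation that follows from the KL-regularized RLHF optimum $\pi^*(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\exp(r(x, y)/\beta)$. The notation ($y_w$ preferred, $y_l$ dispreferred, $\beta$ the regularization strength) is the conventional one and is assumed here; the authors' proposed loss is not given in the abstract and is not reproduced.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  = r(x, y_w) - r(x, y_l)
```

Since $-\log\sigma(z)$ decreases monotonically in $z$, the first expression keeps rewarding a larger log-ratio difference, whereas the second fixes that difference at the finite reward gap. This is the mismatch the abstract describes.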

Title: Strategic Incentivization for Locally Differentially Private Federated Learning

Authors: Yashwant Krishna Pagoti, Arunesh Sinha, Shamik Sural
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2508.07138
Pdf URL: https://arxiv.org/pdf/2508.07138
Copy Paste: [[2508.07138]] Strategic Incentivization for Locally Differentially Private Federated Learning(https://arxiv.org/abs/2508.07138)
Keywords: privacy, protect, federate
Abstract: In Federated Learning (FL), multiple clients jointly train a machine learning model by sharing gradient information, instead of raw data, with a server over multiple rounds. To address the possibility of information leakage in spite of sharing only the gradients, Local Differential Privacy (LDP) is often used. In LDP, clients add a selective amount of noise to the gradients before sending them to the server. Although such noise addition protects the privacy of clients, it leads to a degradation in global model accuracy. In this paper, we model this privacy-accuracy trade-off as a game, where the server incentivizes the clients to add a lower degree of noise for achieving higher accuracy, while the clients attempt to preserve their privacy at the cost of a potential loss in accuracy. A token-based incentivization mechanism is introduced in which the quantum of tokens credited to a client in an FL round is a function of the degree of perturbation of its gradients. The client can later access a newly updated global model only after acquiring enough tokens, which are to be deducted from its balance. We identify the players, their actions and payoff, and perform a strategic analysis of the game. Extensive experiments were carried out to study the impact of different parameters.
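
The token mechanism in this abstract can be illustrated with a toy sketch. The reward rule `max_tokens / (1 + noise_scale)`, the Gaussian noise, the access price, and all names are hypothetical; the abstract says only that the tokens credited are a function of the degree of perturbation and that model access deducts tokens. The noise here is not calibrated to any formal privacy budget.

```python
import random
from typing import List


def perturb_gradient(gradient: List[float], noise_scale: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with the given scale to each coordinate."""
    return [g + rng.gauss(0.0, noise_scale) for g in gradient]


def tokens_for_round(noise_scale: float, max_tokens: float = 10.0) -> float:
    """Hypothetical reward rule: less noise (less privacy) earns more tokens."""
    return max_tokens / (1.0 + noise_scale)


class ClientAccount:
    """Tracks a client's token balance across FL rounds."""

    def __init__(self) -> None:
        self.balance = 0.0

    def credit(self, tokens: float) -> None:
        self.balance += tokens

    def try_access_model(self, price: float) -> bool:
        """Deduct the price and grant access only if the balance covers it."""
        if self.balance < price:
            return False
        self.balance -= price
        return True
```

A client choosing a larger `noise_scale` protects its gradients more but needs more rounds to afford the updated global model, which is the trade-off the game-theoretic analysis studies.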

Title: A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection

Authors: Ivan Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07139
Pdf URL: https://arxiv.org/pdf/2508.07139
Copy Paste: [[2508.07139]] A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection(https://arxiv.org/abs/2508.07139)
Keywords: security, defense, attack
Abstract: Ensuring LLM alignment is critical to information security as AI models become increasingly widespread and integrated in society. Unfortunately, many defenses against adversarial attacks and jailbreaking on LLMs cannot adapt quickly to new attacks, degrade model responses to benign prompts, or introduce significant barriers to scalable implementation. To mitigate these challenges, we introduce a real-time, self-tuning (RTST) moderator framework to defend against adversarial attacks while maintaining a lightweight training footprint. We empirically evaluate its effectiveness using Google's Gemini models against modern, effective jailbreaks. Our results demonstrate the advantages of an adaptive, minimally intrusive framework for jailbreak defense over traditional fine-tuning or classifier models.

Title: CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance

Authors: Yingtie Lei, Fanghai Yi, Yihang Dong, Weihuang Liu, Xiaofeng Zhang, Zimeng Li, Chi-Man Pun, Xuhang Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07140
Pdf URL: https://arxiv.org/pdf/2508.07140
Copy Paste: [[2508.07140]] CMAMRNet: A Contextual Mask-Aware Network Enhancing Mural Restoration Through Comprehensive Mask Guidance(https://arxiv.org/abs/2508.07140)
Keywords: extraction
Abstract: Murals, as invaluable cultural artifacts, face continuous deterioration from environmental factors and human activities. Digital restoration of murals faces unique challenges due to their complex degradation patterns and the critical need to preserve artistic authenticity. Existing learning-based methods struggle with maintaining consistent mask guidance throughout their networks, leading to insufficient focus on damaged regions and compromised restoration quality. We propose CMAMRNet, a Contextual Mask-Aware Mural Restoration Network that addresses these limitations through comprehensive mask guidance and multi-scale feature extraction. Our framework introduces two key components: (1) the Mask-Aware Up/Down-Sampler (MAUDS), which ensures consistent mask sensitivity across resolution scales through dedicated channel-wise feature selection and mask-guided feature fusion; and (2) the Co-Feature Aggregator (CFA), operating at both the highest and lowest resolutions to extract complementary features for capturing fine textures and global structures in degraded regions. Experimental results on benchmark datasets demonstrate that CMAMRNet outperforms state-of-the-art methods, effectively preserving both structural integrity and artistic details in restored murals. The code is available at this https URL.

Title: Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens

Authors: Anna Seo Gyeong Choi, Hoon Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07143
Pdf URL: https://arxiv.org/pdf/2508.07143
Copy Paste: [[2508.07143]] Fairness of Automatic Speech Recognition: Looking Through a Philosophical Lens(https://arxiv.org/abs/2508.07143)
Keywords: fair
Abstract: Automatic Speech Recognition (ASR) systems now mediate countless human-technology interactions, yet research on their fairness implications remains surprisingly limited. This paper examines ASR bias through a philosophical lens, arguing that systematic misrecognition of certain speech varieties constitutes more than a technical limitation -- it represents a form of disrespect that compounds historical injustices against marginalized linguistic communities. We distinguish between morally neutral classification (discriminate1) and harmful discrimination (discriminate2), demonstrating how ASR systems can inadvertently transform the former into the latter when they consistently misrecognize non-standard dialects. We identify three unique ethical dimensions of speech technologies that differentiate ASR bias from other algorithmic fairness concerns: the temporal burden placed on speakers of non-standard varieties ("temporal taxation"), the disruption of conversational flow when systems misrecognize speech, and the fundamental connection between speech patterns and personal/cultural identity. These factors create asymmetric power relationships that existing technical fairness metrics fail to capture. The paper analyzes the tension between linguistic standardization and pluralism in ASR development, arguing that current approaches often embed and reinforce problematic language ideologies. We conclude that addressing ASR bias requires more than technical interventions; it demands recognition of diverse speech varieties as legitimate forms of expression worthy of technological accommodation. This philosophical reframing offers new pathways for developing ASR systems that respect linguistic diversity and speaker autonomy.

Title: Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction

Authors: Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07146
Pdf URL: https://arxiv.org/pdf/2508.07146
Copy Paste: [[2508.07146]] Intention-Aware Diffusion Model for Pedestrian Trajectory Prediction(https://arxiv.org/abs/2508.07146)
Keywords: diffusion
Abstract: Predicting pedestrian motion trajectories is critical for the path planning and motion control of autonomous vehicles. Recent diffusion-based models have shown promising results in capturing the inherent stochasticity of pedestrian behavior for trajectory prediction. However, the absence of explicit semantic modelling of pedestrian intent in many diffusion-based methods may result in misinterpreted behaviors and reduced prediction accuracy. To address the above challenges, we propose a diffusion-based pedestrian trajectory prediction framework that incorporates both short-term and long-term motion intentions. Short-term intent is modelled using a residual polar representation, which decouples direction and magnitude to capture fine-grained local motion patterns. Long-term intent is estimated through a learnable, token-based endpoint predictor that generates multiple candidate goals with associated probabilities, enabling multimodal and context-aware intention modelling. Furthermore, we enhance the diffusion process by incorporating adaptive guidance and a residual noise predictor that dynamically refines denoising accuracy. The proposed framework is evaluated on the widely used ETH, UCY, and SDD benchmarks, demonstrating competitive results against state-of-the-art methods.

Title: SketchAnimator: Animate Sketch via Motion Customization of Text-to-Video Diffusion Models

Authors: Ruolin Yang, Da Li, Honggang Zhang, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07149
Pdf URL: https://arxiv.org/pdf/2508.07149
Copy Paste: [[2508.07149]] SketchAnimator: Animate Sketch via Motion Customization of Text-to-Video Diffusion Models(https://arxiv.org/abs/2508.07149)
Keywords: diffusion
Abstract: Sketching is a uniquely human tool for expressing ideas and creativity. The animation of sketches infuses life into these static drawings, opening a new dimension for designers. Animating sketches is a time-consuming process that demands professional skills and extensive experience, often proving daunting for amateurs. In this paper, we propose a novel sketch animation model SketchAnimator, which enables adding creative motion to a given sketch, like "a jumping car". Namely, given an input sketch and a reference video, we divide the sketch animation into three stages: Appearance Learning, Motion Learning and Video Prior Distillation. In stages 1 and 2, we utilize LoRA to integrate sketch appearance information and motion dynamics from the reference video into the pre-trained T2V model. In the third stage, we utilize Score Distillation Sampling (SDS) to update the parameters of the Bezier curves in each sketch frame according to the acquired motion information. Consequently, our model produces a sketch video that not only retains the original appearance of the sketch but also mirrors the dynamic movements of the reference video. We compare our method with alternative approaches and demonstrate that it generates the desired sketch video under the challenge of one-shot motion customization.

Title: CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion

Authors: Xiaotong Lin, Tianming Liang, Jian-Fang Hu, Kun-Yu Lin, Yulei Kang, Chunwei Tian, Jianhuang Lai, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07162
Pdf URL: https://arxiv.org/pdf/2508.07162
Copy Paste: [[2508.07162]] CoopDiff: Anticipating 3D Human-object Interactions via Contact-consistent Decoupled Diffusion(https://arxiv.org/abs/2508.07162)
Keywords: diffusion
Abstract: 3D human-object interaction (HOI) anticipation aims to predict the future motion of humans and their manipulated objects, conditioned on the historical context. Generally, the articulated humans and rigid objects exhibit different motion patterns, due to their distinct intrinsic physical properties. However, this distinction is ignored by most of the existing works, which intend to capture the dynamics of both humans and objects within a single prediction model. In this work, we propose a novel contact-consistent decoupled diffusion framework CoopDiff, which employs two distinct branches to decouple human and object motion modeling, with the human-object contact points as shared anchors to bridge the motion generation across branches. The human dynamics branch is aimed to predict highly structured human motion, while the object dynamics branch focuses on the object motion with rigid translations and rotations. These two branches are bridged by a series of shared contact points with consistency constraint for coherent human-object motion prediction. To further enhance human-object consistency and prediction reliability, we propose a human-driven interaction module to guide object motion modeling. Extensive experiments on the BEHAVE and Human-object Interaction datasets demonstrate that our CoopDiff outperforms state-of-the-art methods.

Title: Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications

Authors: Zelin Qiu, Xi Wang, Zhuoyao Xie, Juan Zhou, Yu Wang, Lingjie Yang, Xinrui Jiang, Juyoung Bae, Moo Hyun Son, Qiang Ye, Dexuan Chen, Rui Zhang, Tao Li, Neeraj Ramesh Mahboobani, Varut Vardhanabhuti, Xiaohui Duan, Yinghua Zhao, Hao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07165
Pdf URL: https://arxiv.org/pdf/2508.07165
Copy Paste: [[2508.07165]] Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications(https://arxiv.org/abs/2508.07165)
Keywords: robust, segmentation
Abstract: Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.

Title: Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection

Authors: Yunpeng Shi, Lei Chen, Xiaolu Shen, Yanju Guo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07170
Pdf URL: https://arxiv.org/pdf/2508.07170
Copy Paste: [[2508.07170]] Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection(https://arxiv.org/abs/2508.07170)
Keywords: extraction
Abstract: In the domain of computer vision, multi-scale feature extraction is vital for tasks such as salient object detection. However, achieving this capability in lightweight networks remains challenging due to the trade-off between efficiency and performance. This paper proposes a novel lightweight multi-scale feature extraction layer, termed the LMF layer, which employs depthwise separable dilated convolutions in a fully connected structure. By integrating multiple LMF layers, we develop LMFNet, a lightweight network tailored for salient object detection. Our approach significantly reduces the number of parameters while maintaining competitive performance. Here, we show that LMFNet achieves state-of-the-art or comparable results on five benchmark datasets with only 0.81M parameters, outperforming several traditional and lightweight models in terms of both efficiency and accuracy. Our work not only addresses the challenge of multi-scale learning in lightweight networks but also demonstrates the potential for broader applications in image processing tasks. The related code files are available at this https URL

Title: EventRR: Event Referential Reasoning for Referring Video Object Segmentation

Authors: Huihui Xu, Jiashi Lin, Haoyu Chen, Junjun He, Lei Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07171
Pdf URL: https://arxiv.org/pdf/2508.07171
Copy Paste: [[2508.07171]] EventRR: Event Referential Reasoning for Referring Video Object Segmentation(https://arxiv.org/abs/2508.07171)
Keywords: segmentation
Abstract: Referring Video Object Segmentation (RVOS) aims to segment out the object in a video referred to by an expression. Current RVOS methods view referring expressions as unstructured sequences, neglecting the semantic structure that is essential for referent reasoning. Besides, in contrast to image-referring expressions, whose semantics focus only on object attributes and object-object relations, video-referring expressions also encompass event attributes and event-event temporal relations. This complexity challenges traditional structured reasoning approaches designed for images. In this paper, we propose the Event Referential Reasoning (EventRR) framework. EventRR decouples RVOS into an object summarization part and a referent reasoning part. The summarization phase begins by summarizing each frame into a set of bottleneck tokens, which are then efficiently aggregated in the video-level summarization step to exchange the global cross-modal temporal context. For the reasoning part, EventRR extracts the semantic eventful structure of a video-referring expression into a highly expressive Referential Event Graph (REG), which is a single-rooted directed acyclic graph. Guided by topological traversal of the REG, we propose Temporal Concept-Role Reasoning (TCRR) to accumulate the referring score of each temporal query from the REG leaf nodes to the root node. Each reasoning step can be interpreted as a question-answer pair derived from the concept-role relations in the REG. Extensive experiments across four widely recognized benchmark datasets show that EventRR quantitatively and qualitatively outperforms state-of-the-art RVOS methods. Code is available at this https URL

Title: Gradient Surgery for Safe LLM Fine-Tuning

Authors: Biao Yi, Jiahao Li, Baolei Zhang, Lihai Nie, Tong Li, Tiansheng Huang, Zheli Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07172
Pdf URL: https://arxiv.org/pdf/2508.07172
Copy Paste: [[2508.07172]] Gradient Surgery for Safe LLM Fine-Tuning(https://arxiv.org/abs/2508.07172)
Keywords: defense, robust, large language model
Abstract: Fine-tuning-as-a-Service introduces a critical vulnerability where a few malicious examples mixed into the user's fine-tuning dataset can compromise the safety alignment of Large Language Models (LLMs). While a recognized paradigm frames safe fine-tuning as a multi-objective optimization problem balancing user task performance with safety alignment, we find existing solutions are critically sensitive to the harmful ratio, with defenses degrading sharply as harmful ratio increases. We diagnose that this failure stems from conflicting gradients, where the user-task update directly undermines the safety objective. To resolve this, we propose SafeGrad, a novel method that employs gradient surgery. When a conflict is detected, SafeGrad nullifies the harmful component of the user-task gradient by projecting it onto the orthogonal plane of the alignment gradient, allowing the model to learn the user's task without sacrificing safety. To further enhance robustness and data efficiency, we employ a KL-divergence alignment loss that learns the rich, distributional safety profile of the well-aligned foundation model. Extensive experiments show that SafeGrad provides state-of-the-art defense across various LLMs and datasets, maintaining robust safety even at high harmful ratios without compromising task fidelity.
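Note: The projection step described above can be sketched on flat gradient vectors. This follows the generic gradient-surgery recipe (remove the component of one gradient along another when their dot product is negative). It is not taken from the SafeGrad code, and the function name `project_conflict` is hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_conflict(g_task, g_align):
    """Return the task gradient with any conflicting component removed.

    If the task gradient points against the alignment gradient (negative
    dot product), project it onto the hyperplane orthogonal to g_align.
    Otherwise return it unchanged.
    """
    d = dot(g_task, g_align)
    norm_sq = dot(g_align, g_align)
    if d >= 0 or norm_sq == 0:
        return list(g_task)
    coef = d / norm_sq
    return [t - coef * a for t, a in zip(g_task, g_align)]


if __name__ == "__main__":
    g_task = [1.0, -1.0]   # user-task gradient, conflicts on the second axis
    g_align = [0.0, 1.0]   # alignment (safety) gradient
    print(project_conflict(g_task, g_align))
```

After projection the task update has zero component along the alignment gradient, so to first order a step along it no longer undoes the safety objective.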

Title: Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models

Authors: Leyi Pan, Zheyu Fu, Yunpeng Zhai, Shuchang Tao, Sheng Guan, Shiyu Huang, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Felix Henry, Lijie Wen, Aiwei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07173
Pdf URL: https://arxiv.org/pdf/2508.07173
Copy Paste: [[2508.07173]] Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models(https://arxiv.org/abs/2508.07173)
Keywords: defense, attack, robust, large language model
Abstract: The rise of Omni-modal Large Language Models (OLLMs), which integrate visual and auditory processing with text, necessitates robust safety evaluations to mitigate harmful outputs. However, no dedicated benchmarks currently exist for OLLMs, and prior benchmarks designed for other LLMs lack the ability to assess safety performance under audio-visual joint inputs or cross-modal safety consistency. To fill this gap, we introduce Omni-SafetyBench, the first comprehensive parallel benchmark for OLLM safety evaluation, featuring 24 modality combinations and variations with 972 samples each, including dedicated audio-visual harm cases. Considering OLLMs' comprehension challenges with complex omni-modal inputs and the need for cross-modal consistency evaluation, we propose tailored metrics: a Safety-score based on conditional Attack Success Rate (C-ASR) and Refusal Rate (C-RR) to account for comprehension failures, and a Cross-Modal Safety Consistency Score (CMSC-score) to measure consistency across modalities. Evaluating 6 open-source and 4 closed-source OLLMs reveals critical vulnerabilities: (1) no model excels in both overall safety and consistency, with only 3 models achieving over 0.6 in both metrics and top performer scoring around 0.8; (2) safety defenses weaken with complex inputs, especially audio-visual joints; (3) severe weaknesses persist, with some models scoring as low as 0.14 on specific modalities. Our benchmark and metrics highlight urgent needs for enhanced OLLM safety, providing a foundation for future improvements.

Title: Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks

Authors: Jiaqi Yin, Yi-Wei Chen, Meng-Lung Lee, Xiya Liu
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2508.07179
Pdf URL: https://arxiv.org/pdf/2508.07179
Copy Paste: [[2508.07179]] Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks(https://arxiv.org/abs/2508.07179)
Keywords: extraction, large language model
Abstract: Enterprise data pipelines, characterized by complex transformations across multiple programming languages, often cause a semantic disconnect between original metadata and downstream data. This "semantic drift" compromises data reproducibility and governance, and impairs the utility of services like retrieval-augmented generation (RAG) and text-to-SQL systems. To address this, a novel framework is proposed for the automated extraction of fine-grained schema lineage from multilingual enterprise pipeline scripts. This method identifies four key components: source schemas, source tables, transformation logic, and aggregation operations, creating a standardized representation of data transformations. For the rigorous evaluation of lineage quality, this paper introduces the Schema Lineage Composite Evaluation (SLiCE), a metric that assesses both structural correctness and semantic fidelity. A new benchmark is also presented, comprising 1,700 manually annotated lineages from real-world industrial scripts. Experiments were conducted with 12 language models, ranging from small language models (SLMs) of 1.3B to 32B parameters to large language models (LLMs) like GPT-4o and GPT-4.1. The results demonstrate that the performance of schema lineage extraction scales with model size and the sophistication of prompting techniques. Specifically, a 32B open-source model, using a single reasoning trace, can achieve performance comparable to the GPT series under standard prompting. This finding suggests a scalable and economical approach for deploying schema-aware agents in practical applications.

Title: DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention

Authors: Kabir Khan, Priya Sharma, Arjun Mehta, Neha Gupta, Ravi Narayanan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07185
Pdf URL: https://arxiv.org/pdf/2508.07185
Copy Paste: [[2508.07185]] DySK-Attn: A Framework for Efficient, Real-Time Knowledge Updating in Large Language Models via Dynamic Sparse Knowledge Attention(https://arxiv.org/abs/2508.07185)
Keywords: large language model
Abstract: Large Language Models (LLMs) suffer from a critical limitation: their knowledge is static and quickly becomes outdated. Retraining these massive models is computationally prohibitive, while existing knowledge editing techniques can be slow and may introduce unforeseen side effects. To address this, we propose DySK-Attn, a novel framework that enables LLMs to efficiently integrate real-time knowledge from a dynamic external source. Our approach synergizes an LLM with a dynamic Knowledge Graph (KG) that can be updated instantaneously. The core of our framework is a sparse knowledge attention mechanism, which allows the LLM to perform a coarse-to-fine grained search, efficiently identifying and focusing on a small, highly relevant subset of facts from the vast KG. This mechanism avoids the high computational cost of dense attention over the entire knowledge base and mitigates noise from irrelevant information. We demonstrate through extensive experiments on time-sensitive question-answering tasks that DySK-Attn significantly outperforms strong baselines, including standard Retrieval-Augmented Generation (RAG) and model editing techniques, in both factual accuracy for updated knowledge and computational efficiency. Our framework offers a scalable and effective solution for building LLMs that can stay current with the ever-changing world.
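Note: The idea of attending only to a small, relevant subset of facts can be sketched as top-k attention over fact embeddings. The abstract does not specify the scoring function or the coarse-to-fine stages, so the dot-product scoring, the single selection stage, and the name `sparse_attention` are assumptions for illustration only.

```python
import math


def sparse_attention(query, keys, values, top_k):
    """Attend over only the top_k highest-scoring facts.

    Scores are plain dot products between the query and each key.
    Returns (selected_indices, attended_value_vector).
    """
    if top_k <= 0 or not keys:
        return [], []
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    top = order[:top_k]
    # Softmax restricted to the selected facts (shifted for stability).
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w / total * values[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]
    return top, out


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    values = [[1.0], [2.0], [3.0]]
    print(sparse_attention([1.0, 0.0], keys, values, top_k=2))
```

Because the softmax runs over `top_k` entries rather than the whole knowledge base, low-scoring facts contribute nothing to the output, which is the noise-reduction effect the abstract points to.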

Title: Understanding NFTs from EIP Standards

Authors: Minfeng Qi, Qin Wang, Guangsheng Yu, Ruiqiang Li, Victor Zhou, Shiping Chen
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2508.07190
Pdf URL: https://arxiv.org/pdf/2508.07190
Copy Paste: [[2508.07190]] Understanding NFTs from EIP Standards(https://arxiv.org/abs/2508.07190)
Keywords: security
Abstract: We argue that the technical foundations of non-fungible tokens (NFTs) remain inadequately understood. Prior research has focused on market dynamics, user behavior, and isolated security incidents, yet systematic analysis of the standards underpinning NFT functionality is largely absent. We present the first study of NFTs through the lens of Ethereum Improvement Proposals (EIPs). We conduct a large-scale empirical analysis of 191 NFT-related EIPs and 10K+ Ethereum Magicians discussions (as of July, 2025). We integrate multi-dimensional analyses including the automated parsing of Solidity interfaces, graph-based modeling of inheritance structures, contributor profiling, and mining of community discussion data. We distinguish foundational from emerging standards, expose poor cross-version interoperability, and show that growing functional complexity heightens security risks.

Title: Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment

Authors: Yanru Sun, Emadeldeen Eldele, Zongxia Xie, Yucheng Wang, Wenzhe Niu, Qinghua Hu, Chee Keong Kwoh, Min Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07195
Pdf URL: https://arxiv.org/pdf/2508.07195
Copy Paste: [[2508.07195]] Adapting LLMs to Time Series Forecasting via Temporal Heterogeneity Modeling and Semantic Alignment(https://arxiv.org/abs/2508.07195)
Keywords: large language model
Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in natural language processing due to their strong generalization and sequence modeling capabilities. However, their direct application to time series forecasting remains challenging due to two fundamental issues: the inherent heterogeneity of temporal patterns and the modality gap between continuous numerical signals and discrete language representations. In this work, we propose TALON, a unified framework that enhances LLM-based forecasting by modeling temporal heterogeneity and enforcing semantic alignment. Specifically, we design a Heterogeneous Temporal Encoder that partitions multivariate time series into structurally coherent segments, enabling localized expert modeling across diverse temporal patterns. To bridge the modality gap, we introduce a Semantic Alignment Module that aligns temporal features with LLM-compatible representations, enabling effective integration of time series into language-based models while eliminating the need for handcrafted prompts during inference. Extensive experiments on seven real-world benchmarks demonstrate that TALON achieves superior performance across all datasets, with average MSE improvements of up to 11% over recent state-of-the-art methods. These results underscore the effectiveness of incorporating both pattern-aware and semantic-aware designs when adapting LLMs for time series forecasting. The code is available at: this https URL.

Title: What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

Authors: Chanakya Ekbote, Marco Bondaschi, Nived Rajaraman, Jason D. Lee, Michael Gastpar, Ashok Vardhan Makkuva, Paul Pu Liang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07208
Pdf URL: https://arxiv.org/pdf/2508.07208
Copy Paste: [[2508.07208]] What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains(https://arxiv.org/abs/2508.07208)
Keywords: transformer
Abstract: In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.
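Note: The conditional k-gram that the abstract equates with an induction head can be written as a simple in-context counting estimator: find earlier occurrences of the last k symbols and tally what followed them. This is a sketch of the target function only, not of the transformer construction in the paper. The function name `conditional_kgram` is hypothetical.

```python
from collections import Counter


def conditional_kgram(seq, k):
    """Estimate the next-symbol distribution from in-context counts.

    Looks for earlier positions where the k preceding symbols match the
    final k symbols of `seq`, and returns the empirical distribution of
    the symbols that followed those matches.
    """
    if k < 1 or len(seq) <= k:
        return {}
    context = tuple(seq[-k:])
    counts = Counter()
    # The last valid start is len(seq) - k - 1, so a successor always exists.
    for i in range(len(seq) - k):
        if tuple(seq[i:i + k]) == context:
            counts[seq[i + k]] += 1
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()} if total else {}


if __name__ == "__main__":
    print(conditional_kgram(list("abcabcab"), k=2))
    print(conditional_kgram([0, 1, 0, 0, 1, 1, 0], k=1))
```

With k=1 this reduces to the first-order case mentioned in the abstract. Larger k conditions on longer contexts and corresponds to higher-order Markov sources.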

Title: Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset

Authors: Junyi He, Liuling Chen, Hongyang Zhou, Zhang xiaoxing, Xiaobin Zhu, Shengxiang Yu, Jingyan Qin, Xu-Cheng Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07211
Pdf URL: https://arxiv.org/pdf/2508.07211
Copy Paste: [[2508.07211]] Similarity Matters: A Novel Depth-guided Network for Image Restoration and A New Dataset(https://arxiv.org/abs/2508.07211)
Keywords: robust
Abstract: Image restoration has seen substantial progress in recent years. However, existing methods often neglect depth information, which hurts similarity matching and leads to attention distractions in shallow depth-of-field (DoF) scenarios and excessive enhancement of background content in deep DoF settings. To overcome these limitations, we propose a novel Depth-Guided Network (DGN) for image restoration, together with a novel large-scale high-resolution dataset. Specifically, the network consists of two interactive branches: a depth estimation branch that provides structural guidance, and an image restoration branch that performs the core restoration task. In addition, the image restoration branch exploits intra-object similarity through progressive window-based self-attention and captures inter-object similarity via sparse non-local attention. Through joint training, depth features contribute to improved restoration quality, while the enhanced visual features from the restoration branch in turn help refine depth estimation. Notably, we also introduce a new dataset for training and evaluation, consisting of 9,205 high-resolution images from 403 plant species, with diverse depth and texture variations. Extensive experiments show that our method achieves state-of-the-art performance on several standard benchmarks and generalizes well to unseen plant images, demonstrating its effectiveness and robustness.

Title: Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization

Authors: Songlin Li, Zhiqing Guo, Yuanman Li, Zeyu Li, Yunfeng Diao, Gaobo Yang, Liejun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07216
Pdf URL: https://arxiv.org/pdf/2508.07216
Copy Paste: [[2508.07216]] Bridging Semantic Logic Gaps: A Cognition-Inspired Multimodal Boundary-Preserving Network for Image Manipulation Localization(https://arxiv.org/abs/2508.07216)
Keywords: large language model
Abstract: Existing image manipulation localization (IML) models mainly rely on visual cues but ignore the semantic logical relationships between content features. In fact, the content semantics conveyed by real images often conform to human cognitive laws. However, image manipulation technology usually destroys the internal relationship between content features, thus leaving semantic clues for IML. In this paper, we propose a cognition-inspired multimodal boundary-preserving network (CMB-Net). Specifically, CMB-Net utilizes large language models (LLMs) to analyze manipulated regions within images and generate prompt-based textual information to compensate for the lack of semantic relationships in the visual information. Considering that the erroneous texts induced by hallucination from LLMs will damage the accuracy of IML, we propose an image-text central ambiguity module (ITCAM). It assigns weights to the text features by quantifying the ambiguity between text and image features, thereby ensuring the beneficial impact of textual information. We also propose an image-text interaction module (ITIM) that aligns visual and text features using a correlation matrix for fine-grained interaction. Finally, inspired by invertible neural networks, we propose a restoration edge decoder (RED) that mutually generates input and output features to preserve boundary information in manipulated regions without loss. Extensive experiments show that CMB-Net outperforms most existing IML models.

Title: Neural Bridge Processes

Authors: Jian Xu, Yican Liu, Qibin Zhao, John Paisley, Delu Zeng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07220
Pdf URL: https://arxiv.org/pdf/2508.07220
Copy Paste: [[2508.07220]] Neural Bridge Processes(https://arxiv.org/abs/2508.07220)
Keywords: diffusion
Abstract: Learning stochastic functions from partially observed context-target pairs is a fundamental problem in probabilistic modeling. Traditional models like Gaussian Processes (GPs) face scalability issues with large datasets and assume Gaussianity, limiting their applicability. While Neural Processes (NPs) offer more flexibility, they struggle with capturing complex, multi-modal target distributions. Neural Diffusion Processes (NDPs) enhance expressivity through a learned diffusion process but rely solely on conditional signals in the denoising network, resulting in weak input coupling from an unconditional forward process and semantic mismatch at the diffusion endpoint. In this work, we propose Neural Bridge Processes (NBPs), a novel method for modeling stochastic functions where inputs x act as dynamic anchors for the entire diffusion trajectory. By reformulating the forward kernel to explicitly depend on x, NBP enforces a constrained path that strictly terminates at the supervised target. This approach not only provides stronger gradient signals but also guarantees endpoint coherence. We validate NBPs on synthetic data, EEG signal regression and image regression tasks, achieving substantial improvements over baselines. These results underscore the effectiveness of DDPM-style bridge sampling in enhancing both performance and theoretical consistency for structured prediction tasks.

Title: LLM-based Agents for Automated Confounder Discovery and Subgroup Analysis in Causal Inference

Authors: Po-Han Lee, Yu-Cheng Lin, Chan-Tung Ku, Chan Hsu, Pei-Cing Huang, Ping-Hsun Wu, Yihuang Kang
Subjects: cs.LG, cs.AI, cs.MA, stat.AP, stat.ME
Abstract URL: https://arxiv.org/abs/2508.07221
Pdf URL: https://arxiv.org/pdf/2508.07221
Copy Paste: [[2508.07221]] LLM-based Agents for Automated Confounder Discovery and Subgroup Analysis in Causal Inference(https://arxiv.org/abs/2508.07221)
Keywords: robust, interpretability, large language model
Abstract: Estimating individualized treatment effects from observational data presents a persistent challenge due to unmeasured confounding and structural bias. Causal Machine Learning (causal ML) methods, such as causal trees and doubly robust estimators, provide tools for estimating conditional average treatment effects. These methods have limited effectiveness in complex real-world environments due to the presence of latent confounders or those described in unstructured formats. Moreover, reliance on domain experts for confounder identification and rule interpretation introduces high annotation cost and scalability concerns. In this work, we propose Large Language Model-based agents for automated confounder discovery and subgroup analysis that integrate agents into the causal ML pipeline to simulate domain expertise. Our framework systematically performs subgroup identification and confounding structure discovery by leveraging the reasoning capabilities of LLM-based agents, which reduces human dependency while preserving interpretability. Experiments on real-world medical datasets show that our proposed approach enhances treatment effect estimation robustness by narrowing confidence intervals and uncovering unrecognized confounding biases. Our findings suggest that LLM-based agents offer a promising path toward scalable, trustworthy, and semantically aware causal inference.

Title: HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation

Authors: Xuepeng Liu, Zheng Jiang, Pinan Zhu, Hanyu Liu, Chao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07225
Pdf URL: https://arxiv.org/pdf/2508.07225
Copy Paste: [[2508.07225]] HaDM-ST: Histology-Assisted Differential Modeling for Spatial Transcriptomics Generation(https://arxiv.org/abs/2508.07225)
Keywords: diffusion
Abstract: Spatial transcriptomics (ST) reveals spatial heterogeneity of gene expression, yet its resolution is limited by current platforms. Recent methods enhance resolution via H&E-stained histology, but three major challenges persist: (1) isolating expression-relevant features from visually complex H&E images; (2) achieving spatially precise multimodal alignment in diffusion-based frameworks; and (3) modeling gene-specific variation across expression channels. We propose HaDM-ST (Histology-assisted Differential Modeling for ST Generation), a high-resolution ST generation framework conditioned on H&E images and low-resolution ST. HaDM-ST includes: (i) a semantic distillation network to extract predictive cues from H&E; (ii) a spatial alignment module enforcing pixel-wise correspondence with low-resolution ST; and (iii) a channel-aware adversarial learner for fine-grained gene-level modeling. Experiments on 200 genes across diverse tissues and species show HaDM-ST consistently outperforms prior methods, enhancing spatial fidelity and gene-level coherence in high-resolution ST predictions.

Title: How Does a Deep Neural Network Look at Lexical Stress?

Authors: Itai Allouche, Itay Asael, Rotem Rousso, Vered Dassa, Ann Bradlow, Seung-Eun Kim, Matthew Goldrick, Joseph Keshet
Subjects: cs.CL, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2508.07229
Pdf URL: https://arxiv.org/pdf/2508.07229
Copy Paste: [[2508.07229]] How Does a Deep Neural Network Look at Lexical Stress?(https://arxiv.org/abs/2508.07229)
Keywords: interpretability
Abstract: Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning's ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based around highly controlled stimuli.

Title: ASM-UNet: Adaptive Scan Mamba Integrating Group Commonalities and Individual Variations for Fine-Grained Segmentation

Authors: Bo Wang, Mengyuan Xu, Yue Yan, Yuqun Yang, Kechen Shu, Wei Ping, Xu Tang, Wei Jiang, Zheng You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07237
Pdf URL: https://arxiv.org/pdf/2508.07237
Copy Paste: [[2508.07237]] ASM-UNet: Adaptive Scan Mamba Integrating Group Commonalities and Individual Variations for Fine-Grained Segmentation(https://arxiv.org/abs/2508.07237)
Keywords: segmentation
Abstract: Precise lesion resection depends on accurately identifying fine-grained anatomical structures. While many coarse-grained segmentation (CGS) methods have been successful in large-scale segmentation (e.g., organs), they fall short in clinical scenarios requiring fine-grained segmentation (FGS), which remains challenging due to frequent individual variations in small-scale anatomical structures. Although recent Mamba-based models have advanced medical image segmentation, they often rely on fixed manually-defined scanning orders, which limit their adaptability to individual variations in FGS. To address this, we propose ASM-UNet, a novel Mamba-based architecture for FGS. It introduces adaptive scan scores to dynamically guide the scanning order, generated by combining group-level commonalities and individual-level variations. Experiments on two public datasets (ACDC and Synapse) and a newly proposed challenging biliary tract FGS dataset, namely BTMS, demonstrate that ASM-UNet achieves superior performance in both CGS and FGS tasks. Our code and dataset are available at this https URL.

Title: Causal Negative Sampling via Diffusion Model for Out-of-Distribution Recommendation

Authors: Chu Zhao, Eneng Yang, Yizhou Dang, Jianzhe Zhao, Guibing Guo, Xingwei Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07243
Pdf URL: https://arxiv.org/pdf/2508.07243
Copy Paste: [[2508.07243]] Causal Negative Sampling via Diffusion Model for Out-of-Distribution Recommendation(https://arxiv.org/abs/2508.07243)
Keywords: robust, diffusion
Abstract: Heuristic negative sampling enhances recommendation performance by selecting negative samples of varying hardness levels from predefined candidate pools to guide the model toward learning more accurate decision boundaries. However, our empirical and theoretical analyses reveal that unobserved environmental confounders (e.g., exposure or popularity biases) in candidate pools may cause heuristic sampling methods to introduce false hard negatives (FHNS). These misleading samples can encourage the model to learn spurious correlations induced by such confounders, ultimately compromising its generalization ability under distribution shifts. To address this issue, we propose a novel method named Causal Negative Sampling via Diffusion (CNSDiff). By synthesizing negative samples in the latent space via a conditional diffusion process, CNSDiff avoids the bias introduced by predefined candidate pools and thus reduces the likelihood of generating FHNS. Moreover, it incorporates a causal regularization term to explicitly mitigate the influence of environmental confounders during the negative sampling process, leading to robust negatives that promote out-of-distribution (OOD) generalization. Comprehensive experiments under four representative distribution shift scenarios demonstrate that CNSDiff achieves an average improvement of 13.96% across all evaluation metrics compared to state-of-the-art baselines, verifying its effectiveness and robustness in OOD recommendation tasks.

Title: Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers

Authors: Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, Cunjian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07246
Pdf URL: https://arxiv.org/pdf/2508.07246
Copy Paste: [[2508.07246]] Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers(https://arxiv.org/abs/2508.07246)
Keywords: diffusion, transformer, generative
Abstract: Image animation has seen significant progress, driven by the powerful generative capabilities of diffusion models. However, maintaining appearance consistency with static input images and mitigating abrupt motion transitions in generated animations remain persistent challenges. While text-to-video (T2V) generation has demonstrated impressive performance with diffusion transformer models, the image animation field still largely relies on U-Net-based diffusion models, which lag behind the latest T2V approaches. Moreover, the quadratic complexity of vanilla self-attention mechanisms in Transformers imposes heavy computational demands, making image animation particularly resource-intensive. To address these issues, we propose MiraMo, a framework designed to enhance efficiency, appearance consistency, and motion smoothness in image animation. Specifically, MiraMo introduces three key elements: (1) A foundational text-to-video architecture replacing vanilla self-attention with efficient linear attention to reduce computational overhead while preserving generation quality; (2) A novel motion residual learning paradigm that focuses on modeling motion dynamics rather than directly predicting frames, improving temporal consistency; and (3) A DCT-based noise refinement strategy during inference to suppress sudden motion artifacts, complemented by a dynamics control module to balance motion smoothness and expressiveness. Extensive experiments against state-of-the-art methods validate the superiority of MiraMo in generating consistent, smooth, and controllable animations with accelerated inference speed. Additionally, we demonstrate the versatility of MiraMo through applications in motion transfer and video editing tasks.

Title: SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking

Authors: Fengchao Xiong, Zhenxing Wu, Sen Jia, Yuntao Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07250
Pdf URL: https://arxiv.org/pdf/2508.07250
Copy Paste: [[2508.07250]] SUIT: Spatial-Spectral Union-Intersection Interaction Network for Hyperspectral Object Tracking(https://arxiv.org/abs/2508.07250)
Keywords: robust, transformer
Abstract: Hyperspectral videos (HSVs), with their inherent spatial-spectral-temporal structure, offer distinct advantages in challenging tracking scenarios such as cluttered backgrounds and small objects. However, existing methods primarily focus on spatial interactions between the template and search regions, often overlooking spectral interactions, leading to suboptimal performance. To address this issue, this paper investigates spectral interactions from both the architectural and training perspectives. At the architectural level, we first establish band-wise long-range spatial relationships between the template and search regions using Transformers. We then model spectral interactions using the inclusion-exclusion principle from set theory, treating them as the union of spatial interactions across all bands. This enables the effective integration of both shared and band-specific spatial cues. At the training level, we introduce a spectral loss to enforce material distribution alignment between the template and predicted regions, enhancing robustness to shape deformation and appearance variations. Extensive experiments demonstrate that our tracker achieves state-of-the-art tracking performance. The source code, trained models and results will be publicly available via this https URL to support reproducibility.

Title: PySeizure: A single machine learning classifier framework to detect seizures in diverse datasets

Authors: Bartlomiej Chybowski, Shima Abdullateef, Hollan Haule, Alfredo Gonzalez-Sulser, Javier Escudero
Subjects: cs.LG, eess.SP, q-bio.NC
Abstract URL: https://arxiv.org/abs/2508.07253
Pdf URL: https://arxiv.org/pdf/2508.07253
Copy Paste: [[2508.07253]] PySeizure: A single machine learning classifier framework to detect seizures in diverse datasets(https://arxiv.org/abs/2508.07253)
Keywords: robust
Abstract: Reliable seizure detection is critical for diagnosing and managing epilepsy, yet clinical workflows remain dependent on time-consuming manual EEG interpretation. While machine learning has shown promise, existing approaches often rely on dataset-specific optimisations, limiting their real-world applicability and reproducibility. Here, we introduce an innovative, open-source machine-learning framework that enables robust and generalisable seizure detection across varied clinical datasets. We evaluate our approach on two publicly available EEG datasets that differ in patient populations and electrode configurations. To enhance robustness, the framework incorporates an automated pre-processing pipeline to standardise data and a majority voting mechanism, in which multiple models independently assess each second of EEG before reaching a final decision. We train, tune, and evaluate models within each dataset, assessing their cross-dataset transferability. Our models achieve high within-dataset performance (AUC 0.904+/-0.059 for CHB-MIT and 0.864+/-0.060 for TUSZ) and demonstrate strong generalisation across datasets despite differences in EEG setups and populations (AUC 0.615+/-0.039 for models trained on CHB-MIT and tested on TUSZ and 0.762+/-0.175 in the reverse case) without any post-processing. Furthermore, a mild post-processing improved the within-dataset results to 0.913+/-0.064 and 0.867+/-0.058 and cross-dataset results to 0.619+/-0.036 and 0.768+/-0.172. These results underscore the potential of, and essential considerations for, deploying our framework in diverse clinical settings. By making our methodology fully reproducible, we provide a foundation for advancing clinically viable, dataset-agnostic seizure detection systems. This approach has the potential for widespread adoption, complementing rather than replacing expert interpretation, and accelerating clinical integration.
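Note: The majority voting mechanism described in the abstract, in which several models each label every second of EEG before a final decision is made, can be sketched as follows. This is an illustrative sketch rather than the PySeizure implementation; the function name, the 0/1 label encoding, and resolving ties as "no seizure" are assumptions.

```python
def majority_vote(per_model_labels):
    """Combine per-second binary predictions from several models.

    per_model_labels: one list per model, each holding a 0/1 label for
    every second of the recording (all lists have the same length).
    Returns one 0/1 label per second.
    """
    n_models = len(per_model_labels)
    # zip(*...) groups the votes that all models cast for the same second.
    # Assumption: a second is labeled 1 only on a strict majority,
    # so a tie among an even number of models resolves to 0.
    return [int(2 * sum(votes) > n_models) for votes in zip(*per_model_labels)]
```

With three models voting [1, 0, 1, 0], [1, 1, 0, 0], and [0, 1, 1, 0], the first three seconds each receive two of three votes and are labeled 1, while the last second is labeled 0.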

Title: Fading the Digital Ink: A Universal Black-Box Attack Framework for 3DGS Watermarking Systems

Authors: Qingyuan Zeng, Shu Jiang, Jiajing Lin, Zhenzhong Wang, Kay Chen Tan, Min Jiang
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2508.07263
Pdf URL: https://arxiv.org/pdf/2508.07263
Copy Paste: [[2508.07263]] Fading the Digital Ink: A Universal Black-Box Attack Framework for 3DGS Watermarking Systems(https://arxiv.org/abs/2508.07263)
Keywords: protect, attack, robust, watermark
Abstract: With the rise of 3D Gaussian Splatting (3DGS), a variety of digital watermarking techniques, embedding either 1D bitstreams or 2D images, are used for copyright protection. However, the robustness of these watermarking techniques against potential attacks remains underexplored. This paper introduces the first universal black-box attack framework, the Group-based Multi-objective Evolutionary Attack (GMEA), designed to challenge these watermarking systems. We formulate the attack as a large-scale multi-objective optimization problem, balancing watermark removal with visual quality. In a black-box setting, we introduce an indirect objective function that blinds the watermark detector by minimizing the standard deviation of features extracted by a convolutional network, thus rendering the feature maps uninformative. To manage the vast search space of 3DGS models, we employ a group-based optimization strategy to partition the model into multiple, independent sub-optimization problems. Experiments demonstrate that our framework effectively removes both 1D and 2D watermarks from mainstream 3DGS watermarking methods while maintaining high visual fidelity. This work reveals critical vulnerabilities in existing 3DGS copyright protection schemes and calls for the development of more robust watermarking systems.

Title: MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory

Authors: Vasudha Varadarajan, Hui Xu, Rebecca Astrid Boehme, Mariam Marlan Mirstrom, Sverker Sikstrom, H. Andrew Schwartz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07279
Pdf URL: https://arxiv.org/pdf/2508.07279
Copy Paste: [[2508.07279]] MAQuA: Adaptive Question-Asking for Multidimensional Mental Health Screening using Item Response Theory(https://arxiv.org/abs/2508.07279)
Keywords: robust, large language model
Abstract: Recent advances in large language models (LLMs) offer new opportunities for scalable, interactive mental health assessment, but excessive querying by LLMs burdens users and is inefficient for real-world screening across transdiagnostic symptom profiles. We introduce MAQuA, an adaptive question-asking framework for simultaneous, multidimensional mental health screening. Combining multi-outcome modeling on language responses with item response theory (IRT) and factor analysis, MAQuA selects the questions with most informative responses across multiple dimensions at each turn to optimize diagnostic information, improving accuracy and potentially reducing response burden. Empirical results on a novel dataset reveal that MAQuA reduces the number of assessment questions required for score stabilization by 50-87% compared to random ordering (e.g., achieving stable depression scores with 71% fewer questions and eating disorder scores with 85% fewer questions). MAQuA demonstrates robust performance across both internalizing (depression, anxiety) and externalizing (substance use, eating disorder) domains, with early stopping strategies further reducing patient time and burden. These findings position MAQuA as a powerful and efficient tool for scalable, nuanced, and interactive mental health screening, advancing the integration of LLM-based agents into real-world clinical workflows.
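Note: Selecting the most informative question under item response theory can be illustrated with the standard unidimensional two-parameter logistic (2PL) model, where an item with discrimination a and difficulty b has Fisher information a^2 * P * (1 - P) at ability theta. The sketch below is a simplified, single-dimension illustration and not MAQuA's multidimensional method; the function names and the (a, b) tuple layout are assumptions.

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # P(endorse | theta)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items, asked):
    """Return the index of the unasked item with the highest information.

    items: list of (a, b) tuples; asked: set of indices already used.
    """
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))
```

An adaptive screener built this way would re-estimate theta after each response and call pick_next_item again, stopping once the estimate stabilizes.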

Title: Representation Understanding via Activation Maximization

Authors: Hongbo Zhu, Angelo Cangelosi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07281
Pdf URL: https://arxiv.org/pdf/2508.07281
Copy Paste: [[2508.07281]] Representation Understanding via Activation Maximization(https://arxiv.org/abs/2508.07281)
Keywords: interpretability, transformer
Abstract: Understanding internal feature representations of deep neural networks (DNNs) is a fundamental step toward model interpretability. Inspired by neuroscience methods that probe biological neurons using visual stimuli, recent deep learning studies have employed Activation Maximization (AM) to synthesize inputs that elicit strong responses from artificial neurons. In this work, we propose a unified feature visualization framework applicable to both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Unlike prior efforts that predominantly focus on the last output-layer neurons in CNNs, we extend feature visualization to intermediate layers as well, offering deeper insights into the hierarchical structure of learned feature representations. Furthermore, we investigate how activation maximization can be leveraged to generate adversarial examples, revealing potential vulnerabilities and decision boundaries of DNNs. Our experiments demonstrate the effectiveness of our approach in both traditional CNNs and modern ViTs, highlighting its generalizability and interpretive value.

Title: "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas

Authors: Junchen Ding, Penghao Jiang, Zihao Xu, Ziqi Ding, Yichen Zhu, Jiaojiao Jiang, Yuekang Li
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.07284
Pdf URL: https://arxiv.org/pdf/2508.07284
Copy Paste: [[2508.07284]] "Pull or Not to Pull?": Investigating Moral Biases in Leading Large Language Models Across Ethical Dilemmas(https://arxiv.org/abs/2508.07284)
Keywords: fair, large language model
Abstract: As large language models (LLMs) increasingly mediate ethically sensitive decisions, understanding their moral reasoning processes becomes imperative. This study presents a comprehensive empirical evaluation of 14 leading LLMs, both reasoning enabled and general purpose, across 27 diverse trolley problem scenarios, framed by ten moral philosophies, including utilitarianism, deontology, and altruism. Using a factorial prompting protocol, we elicited 3,780 binary decisions and natural language justifications, enabling analysis along axes of decisional assertiveness, explanation answer consistency, public moral alignment, and sensitivity to ethically irrelevant cues. Our findings reveal significant variability across ethical frames and model types: reasoning enhanced models demonstrate greater decisiveness and structured justifications, yet do not always align better with human consensus. Notably, "sweet zones" emerge in altruistic, fairness, and virtue ethics framings, where models achieve a balance of high intervention rates, low explanation conflict, and minimal divergence from aggregated human judgments. However, models diverge under frames emphasizing kinship, legality, or self interest, often producing ethically controversial outcomes. These patterns suggest that moral prompting is not only a behavioral modifier but also a diagnostic tool for uncovering latent alignment philosophies across providers. We advocate for moral reasoning to become a primary axis in LLM alignment, calling for standardized benchmarks that evaluate not just what LLMs decide, but how and why.

Title: Arce: Augmented Roberta with Contextualized Elucidations for Ner in Automated Rule Checking

Authors: Jian Chen, Jinbao Tian, Yankui Li, Zhou Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.07286
Pdf URL: https://arxiv.org/pdf/2508.07286
Copy Paste: [[2508.07286]] Arce: Augmented Roberta with Contextualized Elucidations for Ner in Automated Rule Checking(https://arxiv.org/abs/2508.07286)
Keywords: extraction, large language model
Abstract: Accurate information extraction from specialized texts is a critical challenge, particularly for named entity recognition (NER) in the architecture, engineering, and construction (AEC) domain to support automated rule checking (ARC). The performance of standard pre-trained models is often constrained by the domain gap, as they struggle to interpret the specialized terminology and complex relational contexts inherent in AEC texts. Although this issue can be mitigated by further pre-training on large, human-curated domain corpora, as exemplified by methods like ARCBERT, this approach is both labor-intensive and cost-prohibitive. Consequently, leveraging large language models (LLMs) for automated knowledge generation has emerged as a promising alternative. However, the optimal strategy for generating knowledge that can genuinely enhance smaller, efficient models remains an open question. To address this, we propose ARCE (augmented RoBERTa with contextualized elucidations), a novel approach that systematically explores and optimizes this generation process. ARCE employs an LLM to first generate a corpus of simple, direct explanations, which we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa model prior to its fine-tuning on the downstream task. Our extensive experiments show that ARCE establishes a new state-of-the-art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a key finding: simple, explanation-based knowledge proves surprisingly more effective than complex, role-based rationales for this task. The code is publicly available at this https URL.

Title: CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation

Authors: Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Yang Xiang, Ming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07295
Pdf URL: https://arxiv.org/pdf/2508.07295
Copy Paste: [[2508.07295]] CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation(https://arxiv.org/abs/2508.07295)
Keywords: robust, large language model
Abstract: As Large Language Models (LLMs) are increasingly popularized in the multilingual world, ensuring hallucination-free factuality becomes markedly crucial. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially in speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving competitive performance with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at this https URL.

Title: Revisiting Data Attribution for Influence Functions

Authors: Hongbo Zhu, Angelo Cangelosi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07297
Pdf URL: https://arxiv.org/pdf/2508.07297
Copy Paste: [[2508.07297]] Revisiting Data Attribution for Influence Functions(https://arxiv.org/abs/2508.07297)
Keywords: robust, interpretability
Abstract: The goal of data attribution is to trace the model's predictions through the learning algorithm and back to its training data, thereby identifying the most influential training samples and understanding how the model's behavior leads to particular predictions. Understanding how individual training examples influence a model's predictions is fundamental for machine learning interpretability, data debugging, and model accountability. Influence functions, originating from robust statistics, offer an efficient, first-order approximation to estimate the impact of marginally upweighting or removing a data point on a model's learned parameters and its subsequent predictions, without the need for expensive retraining. This paper comprehensively reviews the data attribution capability of influence functions in deep learning. We discuss their theoretical foundations, recent algorithmic advances for efficient inverse-Hessian-vector product estimation, and evaluate their effectiveness for data attribution and mislabel detection. Finally, we highlight current challenges and promising directions for unleashing the huge potential of influence functions in large-scale, real-world deep learning scenarios.
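Note: For readers unfamiliar with the first-order approximation mentioned in the abstract, the commonly used influence-function estimate from the literature is reproduced below. The notation is an assumption of this note and is not taken from the paper: \hat{\theta} is the trained parameter vector, L the per-example loss, z a training point, z_test a test point, and n the number of training points.

```latex
% Influence of upweighting training point z on the loss at z_test
\mathcal{I}(z, z_{\mathrm{test}})
  = -\,\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta})
```

The product of the inverse Hessian with the test gradient is the inverse-Hessian-vector product whose efficient estimation the abstract refers to; it is typically approximated rather than computed by forming and inverting the Hessian explicitly.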

Title: SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations

Authors: Zhiqiang Shen, Peng Cao, Xiaoli Liu, Jinzhu Yang, Osmar R. Zaiane
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07298
Pdf URL: https://arxiv.org/pdf/2508.07298
Copy Paste: [[2508.07298]] SynMatch: Rethinking Consistency in Medical Image Segmentation with Sparse Annotations(https://arxiv.org/abs/2508.07298)
Keywords: segmentation
Abstract: Label scarcity remains a major challenge in deep learning-based medical image segmentation. Recent studies use strong-weak pseudo supervision to leverage unlabeled data. However, performance is often hindered by inconsistencies between pseudo labels and their corresponding unlabeled images. In this work, we propose SynMatch, a novel framework that sidesteps the need for improving pseudo labels by synthesizing images to match them instead. Specifically, SynMatch synthesizes images using texture and shape features extracted from the same segmentation model that generates the corresponding pseudo labels for unlabeled images. This design enables the generation of highly consistent synthesized-image-pseudo-label pairs without requiring any training parameters for image synthesis. We extensively evaluate SynMatch across diverse medical image segmentation tasks under semi-supervised learning (SSL), weakly-supervised learning (WSL), and barely-supervised learning (BSL) settings with increasingly limited annotations. The results demonstrate that SynMatch achieves superior performance, especially in the most challenging BSL setting. For example, it outperforms the recent strong-weak pseudo supervision-based method by 29.71% and 10.05% on the polyp segmentation task with 5% and 10% scribble annotations, respectively. The code will be released at this https URL.

Title: BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation

Authors: Ping-Mao Huang, I-Tien Chao, Ping-Chia Huang, Jia-Wei Liao, Yung-Yu Chuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07300
Pdf URL: https://arxiv.org/pdf/2508.07300
Copy Paste: [[2508.07300]] BEVANet: Bilateral Efficient Visual Attention Network for Real-Time Semantic Segmentation(https://arxiv.org/abs/2508.07300)
Keywords: transformer, segmentation
Abstract: Real-time semantic segmentation presents the dual challenge of designing efficient architectures that capture large receptive fields for semantic understanding while also refining detailed contours. Vision transformers model long-range dependencies effectively but incur high computational cost. To address these challenges, we introduce the Large Kernel Attention (LKA) mechanism. Our proposed Bilateral Efficient Visual Attention Network (BEVANet) expands the receptive field to capture contextual information and extracts visual and structural features using Sparse Decomposed Large Separable Kernel Attentions (SDLSKA). The Comprehensive Kernel Selection (CKS) mechanism dynamically adapts the receptive field to further enhance performance. Furthermore, the Deep Large Kernel Pyramid Pooling Module (DLKPPM) enriches contextual features by synergistically combining dilated convolutions and large kernel attention. The bilateral architecture facilitates frequent branch communication, and the Boundary Guided Adaptive Fusion (BGAF) module enhances boundary delineation by integrating spatial and semantic features under boundary guidance. BEVANet achieves real-time segmentation at 33 FPS, yielding 79.3% mIoU without pretraining and 81.0% mIoU on Cityscapes after ImageNet pretraining, demonstrating state-of-the-art performance. The code and model are available at this https URL.

Title: DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices

Authors: Md Zahurul Haquea, Yeahyea Sarker, Muhammed Farhan Sadique Mahi, Syed Jubayer Jaman, Md Robiul Islam
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07306
Pdf URL: https://arxiv.org/pdf/2508.07306
Copy Paste: [[2508.07306]] DragonFruitQualityNet: A Lightweight Convolutional Neural Network for Real-Time Dragon Fruit Quality Inspection on Mobile Devices(https://arxiv.org/abs/2508.07306)
Keywords: robust
Abstract: Dragon fruit, renowned for its nutritional benefits and economic value, has experienced rising global demand due to its affordability and local availability. As dragon fruit cultivation expands, efficient pre- and post-harvest quality inspection has become essential for improving agricultural productivity and minimizing post-harvest losses. This study presents DragonFruitQualityNet, a lightweight Convolutional Neural Network (CNN) optimized for real-time quality assessment of dragon fruits on mobile devices. We curated a diverse dataset of 13,789 images, integrating self-collected samples with public datasets (dataset from Mendeley Data), and classified them into four categories: fresh, immature, mature, and defective fruits to ensure robust model training. The proposed model achieves an impressive 93.98% accuracy, outperforming existing methods in fruit quality classification. To facilitate practical adoption, we embedded the model into an intuitive mobile application, enabling farmers and agricultural stakeholders to conduct on-device, real-time quality inspections. This research provides an accurate, efficient, and scalable AI-driven solution for dragon fruit quality control, supporting digital agriculture and empowering smallholder farmers with accessible technology. By bridging the gap between research and real-world application, our work advances post-harvest management and promotes sustainable farming practices.

Title: MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark

Authors: Haiyang Guo, Fei Zhu, Hongbo Zhao, Fanhu Zeng, Wenzhuo Liu, Shijie Ma, Da-Han Wang, Xu-Yao Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07307
Pdf URL: https://arxiv.org/pdf/2508.07307
Copy Paste: [[2508.07307]] MCITlib: Multimodal Continual Instruction Tuning Library and Benchmark(https://arxiv.org/abs/2508.07307)
Keywords: large language model
Abstract: Continual learning aims to equip AI systems with the ability to continuously acquire and adapt to new knowledge without forgetting previously learned information, similar to human learning. While traditional continual learning methods focusing on unimodal tasks have achieved notable success, the emergence of Multimodal Large Language Models has brought increasing attention to Multimodal Continual Learning tasks involving multiple modalities, such as vision and language. In this setting, models are expected to not only mitigate catastrophic forgetting but also handle the challenges posed by cross-modal interactions and coordination. To facilitate research in this direction, we introduce MCITlib, a comprehensive and constantly evolving code library for continual instruction tuning of Multimodal Large Language Models. In MCITlib, we have currently implemented 8 representative algorithms for Multimodal Continual Instruction Tuning and systematically evaluated them on 2 carefully selected benchmarks. MCITlib will be continuously updated to reflect advances in the Multimodal Continual Learning field. The codebase is released at this https URL.

Title: HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways

Authors: Cristian Cosentino, Annamaria Defilippo, Marco Dossena, Christopher Irwin, Sara Joubbi, Pietro Liò
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07308
Pdf URL: https://arxiv.org/pdf/2508.07308
Copy Paste: [[2508.07308]] HealthBranches: Synthesizing Clinically-Grounded Question Answering Datasets via Decision Pathways(https://arxiv.org/abs/2508.07308)
Keywords: robust, large language model
Abstract: HealthBranches is a novel benchmark dataset for medical Question-Answering (Q&A), specifically designed to evaluate complex reasoning in Large Language Models (LLMs). This dataset is generated through a semi-automated pipeline that transforms explicit decision pathways from medical sources into realistic patient cases with associated questions and answers. Covering 4,063 case studies across 17 healthcare topics, each data point is based on clinically validated reasoning chains. HealthBranches supports both open-ended and multiple-choice question formats and uniquely includes the full reasoning path for each Q&A. Its structured design enables robust evaluation of LLMs' multi-step inference capabilities, including their performance in structured Retrieval-Augmented Generation (RAG) contexts. HealthBranches establishes a foundation for the development of more trustworthy, interpretable, and clinically reliable LLMs in high-stakes domains while also serving as a valuable resource for educational purposes.

Title: DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding

Authors: Junyu Xiong, Yonghui Wang, Weichao Zhao, Chenyu Liu, Bing Yin, Wengang Zhou, Houqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07313
Pdf URL: https://arxiv.org/pdf/2508.07313
Copy Paste: [[2508.07313]] DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding(https://arxiv.org/abs/2508.07313)
Keywords: large language model
Abstract: Understanding multi-page documents poses a significant challenge for multimodal large language models (MLLMs), as it requires fine-grained visual comprehension and multi-hop reasoning across pages. While prior work has explored reinforcement learning (RL) for enhancing advanced reasoning in MLLMs, its application to multi-page document understanding remains underexplored. In this paper, we introduce DocR1, an MLLM trained with a novel RL framework, Evidence Page-Guided GRPO (EviGRPO). EviGRPO incorporates an evidence-aware reward mechanism that promotes a coarse-to-fine reasoning strategy, guiding the model to first retrieve relevant pages before generating answers. This training paradigm enables us to build high-quality models with limited supervision. To support this, we design a two-stage annotation pipeline and a curriculum learning strategy, based on which we construct two datasets: EviBench, a high-quality training set with 4.8k examples, and ArxivFullQA, an evaluation benchmark with 8.6k QA pairs based on scientific papers. Extensive experiments across a wide range of benchmarks demonstrate that DocR1 achieves state-of-the-art performance on multi-page tasks, while consistently maintaining strong results on single-page benchmarks.

Title: RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning

Authors: Jinjing Gu, Tianbao Qin, Yuanyuan Pu, Zhengpeng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07318
Pdf URL: https://arxiv.org/pdf/2508.07318
Copy Paste: [[2508.07318]] RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning(https://arxiv.org/abs/2508.07318)
Keywords: extraction
Abstract: Image captioning aims to generate natural language descriptions for input images in an open-form manner. To accurately generate descriptions related to the image, a critical step in image captioning is to identify objects and understand their relations within the image. Modern approaches typically capitalize on object detectors or combine detectors with Graph Convolutional Networks (GCNs). However, these models suffer from redundant detection information, difficulty in GCN construction, and high training costs. To address these issues, a Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed, inspired by the fact that image-text retrieval can provide rich semantic information for input images. RORPCap employs an Objects and relations Extraction Model to extract object and relation words from the image. These words are then incorporated into predefined prompt templates and encoded as prompt embeddings. Next, a Mamba-based mapping network is designed to quickly map image embeddings extracted by CLIP to visual-text embeddings. Finally, the resulting prompt embeddings and visual-text embeddings are concatenated to form textual-enriched feature embeddings, which are fed into a GPT-2 model for caption generation. Extensive experiments conducted on the widely used MS-COCO dataset show that RORPCap requires only 2.6 hours under cross-entropy loss training, achieving 120.5% CIDEr score and 22.0% SPICE score on the "Karpathy" test split. RORPCap achieves comparable performance metrics to detector-based and GCN-based models with the shortest training time and demonstrates its potential as an alternative for image captioning.

Title: ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering

Authors: Shubhra Ghosh, Abhilekh Borah, Aditya Kumar Guru, Kripabandhu Ghosh
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07321
Pdf URL: https://arxiv.org/pdf/2508.07321
Copy Paste: [[2508.07321]] ObfusQAte: A Proposed Framework to Evaluate LLM Robustness on Obfuscated Factual Question Answering(https://arxiv.org/abs/2508.07321)
Keywords: robust, large language model
Abstract: The rapid proliferation of Large Language Models (LLMs) has significantly contributed to the development of equitable AI systems capable of factual question-answering (QA). However, no known study tests the LLMs' robustness when presented with obfuscated versions of questions. To systematically evaluate these limitations, we propose a novel technique, ObfusQAte and, leveraging the same, introduce ObfusQA, a comprehensive, first of its kind, framework with multi-tiered obfuscation levels designed to examine LLM capabilities across three distinct dimensions: (i) Named-Entity Indirection, (ii) Distractor Indirection, and (iii) Contextual Overload. By capturing these fine-grained distinctions in language, ObfusQA provides a comprehensive benchmark for evaluating LLM robustness and adaptability. Our study observes that LLMs exhibit a tendency to fail or generate hallucinated responses when confronted with these increasingly nuanced variations. To foster research in this direction, we make ObfusQAte publicly available.

Title: Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative

Authors: Tuo Zhang, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07329
Pdf URL: https://arxiv.org/pdf/2508.07329
Copy Paste: [[2508.07329]] Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative(https://arxiv.org/abs/2508.07329)
Keywords: large language model
Abstract: With the breakthrough progress of large language models (LLMs) in natural language processing and multimodal tasks, efficiently deploying them on resource-constrained edge devices has become a critical challenge. The Mixture of Experts (MoE) architecture enhances model capacity through sparse activation, but faces two major difficulties in practical deployment: (1) The presence of numerous outliers in activation distributions leads to severe degradation in quantization accuracy for both activations and weights, significantly impairing inference performance; (2) Under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput. To address these issues, this paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU-GPU collaborative inference. First, by introducing smoothed Hessian matrix quantization, we achieve joint 8-bit quantization of activations and weights, which significantly alleviates the accuracy loss caused by outliers while ensuring efficient implementation on mainstream hardware. Second, we design an expert-level collaborative offloading and inference mechanism, which, combined with expert activation path statistics, enables efficient deployment and scheduling of expert modules between CPU and GPU, greatly reducing memory footprint and inference latency. Extensive experiments validate the effectiveness of our method on mainstream large models such as the OPT series and Mixtral 8*7B: on datasets like Wikitext2 and C4, the inference accuracy of the low-bit quantized model approaches that of the full-precision model, while GPU memory usage is reduced by about 60%, and inference latency is significantly improved.
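
Illustrative note (not from the paper): difficulty (1) in the abstract, that activation outliers degrade low-bit quantization, can be seen with plain symmetric absmax int8 quantization. This is a generic baseline, not the Hessian-Aware Quantization the paper proposes. The function names and the toy values below are assumptions for illustration.

```python
def quantize_absmax_int8(values):
    """Symmetric per-tensor quantization: scale by the largest magnitude."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard all-zero input
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(values, first_n):
    """Mean reconstruction error over the first `first_n` entries."""
    codes, scale = quantize_absmax_int8(values)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values[:first_n], restored[:first_n])) / first_n


# Toy activations (assumed); appending one large outlier stretches the scale.
small = [0.1, -0.2, 0.3, 0.05]
error_clean = mean_abs_error(small, 4)
error_with_outlier = mean_abs_error(small + [50.0], 4)
print(error_clean, error_with_outlier)
```

A single outlier sets the quantization step for the whole tensor, so the ordinary values collapse onto a handful of integer codes. Methods that smooth or reweight the distribution before quantizing aim to avoid exactly this effect.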

Title: Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos

Authors: Tuyen Tran, Thao Minh Le, Quang-Hung Le, Truyen Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07330
Pdf URL: https://arxiv.org/pdf/2508.07330
Copy Paste: [[2508.07330]] Planner-Refiner: Dynamic Space-Time Refinement for Vision-Language Alignment in Videos(https://arxiv.org/abs/2508.07330)
Keywords: segmentation
Abstract: Vision-language alignment in video must address the complexity of language, evolving interacting entities, their action chains, and semantic gaps between language and vision. This work introduces Planner-Refiner, a framework to overcome these challenges. Planner-Refiner bridges the semantic gap by iteratively refining visual elements' space-time representation, guided by language until semantic gaps are minimal. A Planner module schedules language guidance by decomposing complex linguistic prompts into short sentence chains. The Refiner processes each short sentence, a noun-phrase and verb-phrase pair, to direct visual tokens' self-attention across space then time, achieving efficient single-step refinement. A recurrent system chains these steps, maintaining refined visual token representations. The final representation feeds into task-specific heads for alignment generation. We demonstrate Planner-Refiner's effectiveness on two video-language alignment tasks: Referring Video Object Segmentation and Temporal Grounding with varying language complexity. We further introduce a new MeViS-X benchmark to assess models' capability with long queries. Superior performance versus state-of-the-art methods on these benchmarks shows the approach's potential, especially for complex prompts.

Title: Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants

Authors: Yuhao Liu, Rui Hu, Yu Chen, Longbo Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07333
Pdf URL: https://arxiv.org/pdf/2508.07333
Copy Paste: [[2508.07333]] Finite-Time Convergence Analysis of ODE-based Generative Models for Stochastic Interpolants(https://arxiv.org/abs/2508.07333)
Keywords: robust, generative
Abstract: Stochastic interpolants offer a robust framework for continuously transforming samples between arbitrary data distributions, holding significant promise for generative modeling. Despite their potential, rigorous finite-time convergence guarantees for practical numerical schemes remain largely unexplored. In this work, we address the finite-time convergence analysis of numerical implementations for ordinary differential equations (ODEs) derived from stochastic interpolants. Specifically, we establish novel finite-time error bounds in total variation distance for two widely used numerical integrators: the first-order forward Euler method and the second-order Heun's method. Furthermore, our analysis on the iteration complexity of specific stochastic interpolant constructions provides optimized schedules to enhance computational efficiency. Our theoretical findings are corroborated by numerical experiments, which validate the derived error bounds and complexity analyses.
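
Illustrative note (not from the paper): the two integrators analyzed in the abstract can be sketched in a few lines. The example below uses a toy scalar ODE dx/dt = -x as a stand-in for a learned velocity field, which is an assumption made purely so the exact solution is known. It does not reproduce the paper's total variation bounds or step-size schedules.

```python
import math


def forward_euler(f, x0, t0, t1, steps):
    """First-order scheme: x_{k+1} = x_k + h * f(t_k, x_k)."""
    h = (t1 - t0) / steps
    t, x = t0, x0
    for _ in range(steps):
        x = x + h * f(t, x)
        t += h
    return x


def heun(f, x0, t0, t1, steps):
    """Second-order scheme: average the slope at the start and at an Euler-predicted end."""
    h = (t1 - t0) / steps
    t, x = t0, x0
    for _ in range(steps):
        k1 = f(t, x)
        k2 = f(t + h, x + h * k1)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x


def decay(t, x):
    """Toy velocity field (assumed): dx/dt = -x, exact solution x0 * exp(-t)."""
    return -x


exact = math.exp(-1.0)
euler_error = abs(forward_euler(decay, 1.0, 0.0, 1.0, 10) - exact)
heun_error = abs(heun(decay, 1.0, 0.0, 1.0, 10) - exact)
print(euler_error, heun_error)
```

With the same number of steps, Heun's method costs two velocity evaluations per step instead of one but gives a smaller discretization error on this toy problem, which is the trade-off an iteration-complexity analysis has to weigh.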

Title: ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis

Authors: Samiha Afaf Neha, Abir Ahammed Bhuiyan, Md. Ishrak Khan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07345
Pdf URL: https://arxiv.org/pdf/2508.07345
Copy Paste: [[2508.07345]] ProteoKnight: Convolution-based phage virion protein classification and uncertainty analysis(https://arxiv.org/abs/2508.07345)
Keywords: robust
Abstract: Introduction: Accurate prediction of Phage Virion Proteins (PVP) is essential for genomic studies due to their crucial role as structural elements in bacteriophages. Computational tools, particularly machine learning, have emerged for annotating phage protein sequences from high-throughput sequencing. However, effective annotation requires specialized sequence encodings. Our paper introduces ProteoKnight, a new image-based encoding method that addresses spatial constraints in existing techniques, yielding competitive performance in PVP classification using pre-trained convolutional neural networks. Additionally, our study evaluates prediction uncertainty in binary PVP classification through Monte Carlo Dropout (MCD). Methods: ProteoKnight adapts the classical DNA-Walk algorithm for protein sequences, incorporating pixel colors and adjusting walk distances to capture intricate protein features. Encoded sequences were classified using multiple pre-trained CNNs. Variance and entropy measures assessed prediction uncertainty across proteins of various classes and lengths. Results: Our experiments achieved 90.8% accuracy in binary classification, comparable to state-of-the-art methods. Multi-class classification accuracy remains suboptimal. Our uncertainty analysis unveils variability in prediction confidence influenced by protein class and sequence length. Conclusions: Our study surpasses frequency chaos game representation (FCGR) by introducing novel image encoding that mitigates spatial information loss limitations. Our classification technique yields accurate and robust PVP predictions while identifying low-confidence predictions.

Title: SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal

Authors: Tingyu Yang, Jue Gong, Jinpei Guo, Wenbo Li, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07346
Pdf URL: https://arxiv.org/pdf/2508.07346
Copy Paste: [[2508.07346]] SODiff: Semantic-Oriented Diffusion Model for JPEG Compression Artifacts Removal(https://arxiv.org/abs/2508.07346)
Keywords: diffusion, generative
Abstract: JPEG, as a widely used image compression standard, often introduces severe visual artifacts when achieving high compression ratios. Although existing deep learning-based restoration methods have made considerable progress, they often struggle to recover complex texture details, resulting in over-smoothed outputs. To overcome these limitations, we propose SODiff, a novel and efficient semantic-oriented one-step diffusion model for JPEG artifacts removal. Our core idea is that effective restoration hinges on providing semantic-oriented guidance to the pre-trained diffusion model, thereby fully leveraging its powerful generative prior. To this end, SODiff incorporates a semantic-aligned image prompt extractor (SAIPE). SAIPE extracts rich features from low-quality (LQ) images and projects them into an embedding space semantically aligned with that of the text encoder. Simultaneously, it preserves crucial information for faithful reconstruction. Furthermore, we propose a quality factor-aware time predictor that implicitly learns the compression quality factor (QF) of the LQ image and adaptively selects the optimal denoising start timestep for the diffusion process. Extensive experimental results show that our SODiff outperforms recent leading methods in both visual quality and quantitative metrics. Code is available at: this https URL

Title: GS4Buildings: Prior-Guided Gaussian Splatting for 3D Building Reconstruction

Authors: Qilin Zhang, Olaf Wysocki, Boris Jutzi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07355
Pdf URL: https://arxiv.org/pdf/2508.07355
Copy Paste: [[2508.07355]] GS4Buildings: Prior-Guided Gaussian Splatting for 3D Building Reconstruction(https://arxiv.org/abs/2508.07355)
Keywords: robust
Abstract: Recent advances in Gaussian Splatting (GS) have demonstrated its effectiveness in photo-realistic rendering and 3D reconstruction. Among these, 2D Gaussian Splatting (2DGS) is particularly suitable for surface reconstruction due to its flattened Gaussian representation and integrated normal regularization. However, its performance often degrades in large-scale and complex urban scenes with frequent occlusions, leading to incomplete building reconstructions. We propose GS4Buildings, a novel prior-guided Gaussian Splatting method leveraging the ubiquity of semantic 3D building models for robust and scalable building surface reconstruction. Instead of relying on traditional Structure-from-Motion (SfM) pipelines, GS4Buildings initializes Gaussians directly from low-level Level of Detail (LoD)2 semantic 3D building models. Moreover, we generate prior depth and normal maps from the planar building geometry and incorporate them into the optimization process, providing strong geometric guidance for surface consistency and structural accuracy. We also introduce an optional building-focused mode that limits reconstruction to building regions, achieving a 71.8% reduction in Gaussian primitives and enabling a more efficient and compact representation. Experiments on urban datasets demonstrate that GS4Buildings improves reconstruction completeness by 20.5% and geometric accuracy by 32.8%. These results highlight the potential of semantic building model integration to advance GS-based reconstruction toward real-world urban applications such as smart cities and digital twins. Our project is available: this https URL.

Title: DIP-GS: Deep Image Prior For Gaussian Splatting Sparse View Recovery

Authors: Rajaei Khatib, Raja Giryes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07372
Pdf URL: https://arxiv.org/pdf/2508.07372
Copy Paste: [[2508.07372]] DIP-GS: Deep Image Prior For Gaussian Splatting Sparse View Recovery(https://arxiv.org/abs/2508.07372)
Keywords: generative
Abstract: 3D Gaussian Splatting (3DGS) is a leading 3D scene reconstruction method, obtaining high-quality reconstruction with real-time rendering runtime performance. The main idea behind 3DGS is to represent the scene as a collection of 3D Gaussians, while learning their parameters to fit the given views of the scene. While achieving superior performance in the presence of many views, 3DGS struggles with sparse view reconstruction, where the input views are sparse, do not fully cover the scene, and have low overlap. In this paper, we propose DIP-GS, a Deep Image Prior (DIP) 3DGS representation. By using the DIP prior, which utilizes internal structure and patterns, in a coarse-to-fine manner, DIP-based 3DGS can operate in scenarios where vanilla 3DGS fails, such as sparse view recovery. Note that our approach does not use any pre-trained models such as generative models and depth estimation, but rather relies only on the input frames. Among such methods, DIP-GS obtains state-of-the-art (SOTA) competitive results on various sparse-view reconstruction tasks, demonstrating its capabilities.

Title: Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems

Authors: Nikita Puchkin, Denis Suchkov, Alexey Naumov, Denis Belomestny
Subjects: cs.LG, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2508.07392
Pdf URL: https://arxiv.org/pdf/2508.07392
Copy Paste: [[2508.07392]] Tight Bounds for Schrödinger Potential Estimation in Unpaired Image-to-Image Translation Problems(https://arxiv.org/abs/2508.07392)
Keywords: generative
Abstract: Modern methods of generative modelling and unpaired image-to-image translation based on Schrödinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from initial and final distributions. This makes our setup suitable for both generative modelling and unpaired image-to-image translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrödinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on generalization ability of an empirical risk minimizer in a class of Schrödinger potentials including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence up to some logarithmic factors in favourable scenarios. We also illustrate performance of the suggested approach with numerical experiments.

Title: LET-US: Long Event-Text Understanding of Scenes

Authors: Rui Chen, Xingyu Chen, Shaoan Wang, Shihan Kong, Junzhi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07401
Pdf URL: https://arxiv.org/pdf/2508.07401
Copy Paste: [[2508.07401]] LET-US: Long Event-Text Understanding of Scenes(https://arxiv.org/abs/2508.07401)
Keywords: large language model
Abstract: Event cameras output event streams as sparse, asynchronous data with microsecond-level temporal resolution, enabling visual perception with low latency and a high dynamic range. While existing Multimodal Large Language Models (MLLMs) have achieved significant success in understanding and analyzing RGB video content, they either fail to interpret event streams effectively or remain constrained to very short sequences. In this paper, we introduce LET-US, a framework for long event-stream–text comprehension that employs an adaptive compression mechanism to reduce the volume of input events while preserving critical visual details. LET-US thus establishes a new frontier in cross-modal inferential understanding over extended event sequences. To bridge the substantial modality gap between event streams and textual representations, we adopt a two-stage optimization paradigm that progressively equips our model with the capacity to interpret event-based scenes. To handle the voluminous temporal information inherent in long event streams, we leverage text-guided cross-modal queries for feature reduction, augmented by hierarchical clustering and similarity computation to distill the most representative event features. Moreover, we curate and construct a large-scale event-text aligned dataset to train our model, achieving tighter alignment of event features within the LLM embedding space. We also develop a comprehensive benchmark covering a diverse set of tasks: reasoning, captioning, classification, temporal localization, and moment retrieval. Experimental results demonstrate that LET-US outperforms prior state-of-the-art MLLMs in both descriptive accuracy and semantic comprehension on long-duration event streams. All datasets, codes, and models will be publicly available.

Title: ForensicsSAM: Toward Robust and Unified Image Forgery Detection and Localization Resisting to Adversarial Attack

Authors: Rongxuan Peng, Shunquan Tan, Chenqi Kong, Anwei Luo, Alex C. Kot, Jiwu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07402
Pdf URL: https://arxiv.org/pdf/2508.07402
Copy Paste: [[2508.07402]] ForensicsSAM: Toward Robust and Unified Image Forgery Detection and Localization Resisting to Adversarial Attack(https://arxiv.org/abs/2508.07402)
Keywords: attack, robust, transformer
Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a popular strategy for adapting large vision foundation models, such as the Segment Anything Model (SAM) and LLaVA, to downstream tasks like image forgery detection and localization (IFDL). However, existing PEFT-based approaches overlook their vulnerability to adversarial attacks. In this paper, we show that highly transferable adversarial images can be crafted solely via the upstream model, without accessing the downstream model or training data, significantly degrading the IFDL performance. To address this, we propose ForensicsSAM, a unified IFDL framework with built-in adversarial robustness. Our design is guided by three key ideas: (1) To compensate for the lack of forgery-relevant knowledge in the frozen image encoder, we inject forgery experts into each transformer block to enhance its ability to capture forgery artifacts. These forgery experts are always activated and shared across all input images. (2) To detect adversarial images, we design a lightweight adversary detector that learns to capture structured, task-specific artifacts in the RGB domain, enabling reliable discrimination across various attack methods. (3) To resist adversarial attacks, we inject adversary experts into the global attention layers and MLP modules to progressively correct feature shifts induced by adversarial noise. These adversary experts are adaptively activated by the adversary detector, thereby avoiding unnecessary interference with clean images. Extensive experiments across multiple benchmarks demonstrate that ForensicsSAM achieves superior resistance to various adversarial attack methods, while also delivering state-of-the-art performance in image-level forgery detection and pixel-level forgery localization. The resource is available at this https URL.

Title: CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization

Authors: Youqi Wang, Shunquan Tan, Rongxuan Peng, Bin Li, Jiwu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07413
Pdf URL: https://arxiv.org/pdf/2508.07413
Copy Paste: [[2508.07413]] CLUE: Leveraging Low-Rank Adaptation to Capture Latent Uncovered Evidence for Image Forgery Localization(https://arxiv.org/abs/2508.07413)
Keywords: attack, robust, diffusion, generative
Abstract: The increasing accessibility of image editing tools and generative AI has led to a proliferation of visually convincing forgeries, compromising the authenticity of digital media. In this paper, in addition to leveraging distortions from conventional forgeries, we repurpose the mechanism of a state-of-the-art (SOTA) text-to-image synthesis model by exploiting its internal generative process, turning it into a high-fidelity forgery localization tool. To this end, we propose CLUE (Capture Latent Uncovered Evidence), a framework that employs Low-Rank Adaptation (LoRA) to parameter-efficiently reconfigure Stable Diffusion 3 (SD3) as a forensic feature extractor. Our approach begins with the strategic use of SD3's Rectified Flow (RF) mechanism to inject noise at varying intensities into the latent representation, thereby steering the LoRA-tuned denoising process to amplify subtle statistical inconsistencies indicative of a forgery. To complement the latent analysis with high-level semantic context and precise spatial details, our method incorporates contextual features from the image encoder of the Segment Anything Model (SAM), which is parameter-efficiently adapted to better trace the boundaries of forged regions. Extensive evaluations demonstrate CLUE's SOTA generalization performance, significantly outperforming prior methods. Furthermore, CLUE shows superior robustness against common post-processing attacks and Online Social Networks (OSNs). Code is publicly available at this https URL.
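
Illustrative sketch (not from the paper): the abstract names Low-Rank Adaptation but gives no implementation details. The generic LoRA forward pass adds a scaled low-rank update B·A to a frozen weight matrix W. The function names, shapes, and the alpha/r scaling below are the usual LoRA convention and are assumptions here, not CLUE's code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x) for a frozen W and trainable A, B.

    w: d_out x d_in frozen weights
    a: r x d_in down-projection (trainable)
    b: d_out x r up-projection (trainable)
    """
    r = len(a)                          # rank of the low-rank update
    base = matvec(w, x)                 # frozen path
    update = matvec(b, matvec(a, x))    # low-rank path
    return [h + (alpha / r) * u for h, u in zip(base, update)]


# Toy values, chosen only to make the arithmetic easy to follow.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]        # rank r = 1
b = [[1.0], [2.0]]
y = lora_forward(w, a, b, [1.0, 2.0], alpha=1.0)
```

Only A and B are trained, so the number of trainable parameters grows with the rank r rather than with the full size of W.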

Title: Grounding Multilingual Multimodal LLMs With Cultural Knowledge

Authors: Jean de Dieu Nyandwi, Yueqi Song, Simran Khanuja, Graham Neubig
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07414
Pdf URL: https://arxiv.org/pdf/2508.07414
Copy Paste: [[2508.07414]] Grounding Multilingual Multimodal LLMs With Cultural Knowledge(https://arxiv.org/abs/2508.07414)
Keywords: large language model
Abstract: Multimodal Large Language Models excel in high-resource settings, but often misinterpret long-tail cultural entities and underperform in low-resource languages. To address this gap, we propose a data-centric approach that directly grounds MLLMs in cultural knowledge. Leveraging a large-scale knowledge graph from Wikidata, we collect images that represent culturally significant entities, and generate synthetic multilingual visual question answering data. The resulting dataset, CulturalGround, comprises 22 million high-quality, culturally-rich VQA pairs spanning 42 countries and 39 languages. We train an open-source MLLM, CulturalPangea, on CulturalGround, interleaving standard multilingual instruction-tuning data to preserve general abilities. CulturalPangea achieves state-of-the-art performance among open models on various culture-focused multilingual multimodal benchmarks, outperforming prior models by an average of 5.0 without degrading results on mainstream vision-language tasks. Our findings show that our targeted, culturally grounded approach could substantially narrow the cultural gap in MLLMs and offer a practical path towards globally inclusive multimodal systems.

Title: Lightning Prediction under Uncertainty: DeepLight with Hazy Loss

Authors: Md Sultanul Arifin, Abu Nowshed Sakib, Yeasir Rayhan, Tanzima Hashem
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07428
Pdf URL: https://arxiv.org/pdf/2508.07428
Copy Paste: [[2508.07428]] Lightning Prediction under Uncertainty: DeepLight with Hazy Loss(https://arxiv.org/abs/2508.07428)
Keywords: protect, robust
Abstract: Lightning, a common feature of severe meteorological conditions, poses significant risks, from direct human injuries to substantial economic losses. These risks are further exacerbated by climate change. Early and accurate prediction of lightning would enable preventive measures to safeguard people, protect property, and minimize economic losses. In this paper, we present DeepLight, a novel deep learning architecture for predicting lightning occurrences. Existing prediction models face several critical limitations: they often struggle to capture the dynamic spatial context and inherent uncertainty of lightning events, underutilize key observational data, such as radar reflectivity and cloud properties, and rely heavily on Numerical Weather Prediction (NWP) systems, which are both computationally expensive and highly sensitive to parameter settings. To overcome these challenges, DeepLight leverages multi-source meteorological data, including radar reflectivity, cloud properties, and historical lightning occurrences, through a dual-encoder architecture. By employing multi-branch convolution techniques, it dynamically captures spatial correlations across varying extents. Furthermore, its novel Hazy Loss function explicitly addresses the spatio-temporal uncertainty of lightning by penalizing deviations based on proximity to true events, enabling the model to better learn patterns amidst randomness. Extensive experiments show that DeepLight improves the Equitable Threat Score (ETS) by 18%-30% over state-of-the-art methods, establishing it as a robust solution for lightning prediction.
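
Illustrative sketch (not from the paper): the abstract reports gains in the Equitable Threat Score (ETS). The standard definition from forecast verification subtracts the hits expected by chance from the threat score. The function name and the contingency counts below are hypothetical and are not DeepLight results.

```python
def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """Standard ETS from a 2x2 forecast contingency table."""
    total = hits + misses + false_alarms + correct_negatives
    if total == 0:
        raise ValueError("contingency table is empty")
    # Hits expected from a random forecast with the same marginals.
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denominator = hits + misses + false_alarms - hits_random
    if denominator == 0:
        return 0.0
    return (hits - hits_random) / denominator


# Hypothetical counts for one evaluation grid.
ets = equitable_threat_score(hits=50, misses=25, false_alarms=25,
                             correct_negatives=900)
```

A perfect forecast scores 1, and a forecast no better than chance scores about 0.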

Title: Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs

Authors: Zhiyi Lyu, Jianguo Huang, Yanchen Deng, Steven Hoi, Bo An
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07434
Pdf URL: https://arxiv.org/pdf/2508.07434
Copy Paste: [[2508.07434]] Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs(https://arxiv.org/abs/2508.07434)
Keywords: large language model
Abstract: Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and the lack of an anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose ReLoc, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search and the state-of-the-art improvement-based code generation methods.
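
Illustrative sketch (not from the paper): the four components named in the abstract (drafting, neighborhood generation, candidate evaluation, incumbent updating) map onto a generic hill-climbing loop. In ReLoc the draft and neighbors would come from an LLM and the score from the revision reward model. The callables below are toy stand-ins over integers and are purely hypothetical.

```python
def hill_climb(draft, neighbors, score, max_steps=100):
    """Generic hill climbing: keep the best neighbor while it improves."""
    incumbent = draft()                          # 1. initial drafting
    best_score = score(incumbent)
    for _ in range(max_steps):
        candidates = list(neighbors(incumbent))  # 2. neighborhood generation
        if not candidates:
            break
        top = max(candidates, key=score)         # 3. candidate evaluation
        top_score = score(top)
        if top_score <= best_score:              # local optimum reached
            break
        incumbent, best_score = top, top_score   # 4. incumbent updating
    return incumbent


# Toy stand-ins: search the integers for the maximum of -(x - 7)^2.
result = hill_climb(
    draft=lambda: 0,
    neighbors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 7) ** 2,
)
```

Swapping the update rule in step 4 (for example, keeping a population instead of a single incumbent) is what would turn the same skeleton into a different local search algorithm.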

Title: Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten

Authors: Wei Qian, Chenxu Zhao, Yangyi Li, Wenqian Ye, Mengdi Huai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07458
Pdf URL: https://arxiv.org/pdf/2508.07458
Copy Paste: [[2508.07458]] Towards Unveiling Predictive Uncertainty Vulnerabilities in the Context of the Right to Be Forgotten(https://arxiv.org/abs/2508.07458)
Keywords: defense, attack
Abstract: Currently, various uncertainty quantification methods have been proposed to provide certainty and probability estimates for deep learning models' label predictions. Meanwhile, with the growing demand for the right to be forgotten, machine unlearning has been extensively studied as a means to remove the impact of requested sensitive data from a pre-trained model without retraining the model from scratch. However, the vulnerabilities of such generated predictive uncertainties with regard to dedicated malicious unlearning attacks remain unexplored. To bridge this gap, for the first time, we propose a new class of malicious unlearning attacks against predictive uncertainties, where the adversary aims to cause the desired manipulations of specific predictive uncertainty results. We also design novel optimization frameworks for our attacks and conduct extensive experiments, including black-box scenarios. Notably, our extensive experiments show that our attacks are more effective in manipulating predictive uncertainties than traditional attacks that focus on label misclassifications, and existing defenses against conventional attacks are ineffective against our attacks.

Title: MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification

Authors: Tiantian Yang, Zhiqian Chen
Subjects: cs.LG, q-bio.GN, stat.ML
Abstract URL: https://arxiv.org/abs/2508.07465
Pdf URL: https://arxiv.org/pdf/2508.07465
Copy Paste: [[2508.07465]] MOTGNN: Interpretable Graph Neural Networks for Multi-Omics Disease Classification(https://arxiv.org/abs/2508.07465)
Keywords: robust, interpretability
Abstract: Integrating multi-omics data, such as DNA methylation, mRNA expression, and microRNA (miRNA) expression, offers a comprehensive view of the biological mechanisms underlying disease. However, the high dimensionality and complex interactions among omics layers present major challenges for predictive modeling. We propose Multi-Omics integration with Tree-generated Graph Neural Network (MOTGNN), a novel and interpretable framework for binary disease classification. MOTGNN employs eXtreme Gradient Boosting (XGBoost) to perform omics-specific supervised graph construction, followed by modality-specific Graph Neural Networks (GNNs) for hierarchical representation learning, and a deep feedforward network for cross-omics integration. On three real-world disease datasets, MOTGNN outperforms state-of-the-art baselines by 5-10% in accuracy, ROC-AUC, and F1-score, and remains robust to severe class imbalance (e.g., 87.2% vs. 33.4% F1 on imbalanced data). The model maintains computational efficiency through sparse graphs (2.1-2.8 edges per node) and provides built-in interpretability, revealing both top-ranked biomarkers and the relative contributions of each omics modality. These results highlight MOTGNN's potential to improve both predictive accuracy and interpretability in multi-omics disease modeling.

Title: AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning

Authors: Siminfar Samakoush Galougah, Rishie Raj, Sanjoy Chowdhury, Sayan Nag, Ramani Duraiswami
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07470
Pdf URL: https://arxiv.org/pdf/2508.07470
Copy Paste: [[2508.07470]] AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning(https://arxiv.org/abs/2508.07470)
Keywords: robust, large language model
Abstract: Current audio-visual (AV) benchmarks focus on final answer accuracy, overlooking the underlying reasoning process. This makes it difficult to distinguish genuine comprehension from correct answers derived through flawed reasoning or hallucinations. To address this, we introduce AURA (Audio-visual Understanding and Reasoning Assessment), a benchmark for evaluating the cross-modal reasoning capabilities of Audio-Visual Large Language Models (AV-LLMs) and Omni-modal Language Models (OLMs). AURA includes questions across six challenging cognitive domains, such as causality, timbre and pitch, tempo and AV synchronization, unanswerability, implicit distractions, and skill profiling, explicitly designed to be unanswerable from a single modality. This forces models to construct a valid logical path grounded in both audio and video, setting AURA apart from AV datasets that allow uni-modal shortcuts. To assess reasoning traces, we propose a novel metric, AuraScore, which addresses the lack of robust tools for evaluating reasoning fidelity. It decomposes reasoning into two aspects: (i) Factual Consistency - whether reasoning is grounded in perceptual evidence, and (ii) Core Inference - the logical validity of each reasoning step. Evaluations of SOTA models on AURA reveal a critical reasoning gap: although models achieve high accuracy (up to 92% on some tasks), their Factual Consistency and Core Inference scores fall below 45%. This discrepancy highlights that models often arrive at correct answers through flawed logic, underscoring the need for our benchmark and paving the way for more robust multimodal evaluation.

Title: Positional Biases Shift as Inputs Approach Context Window Limits

Authors: Blerta Veseli, Julian Chibane, Mariya Toneva, Alexander Koller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07479
Pdf URL: https://arxiv.org/pdf/2508.07479
Copy Paste: [[2508.07479]] Positional Biases Shift as Inputs Approach Context Window Limits(https://arxiv.org/abs/2508.07479)
Keywords: large language model
Abstract: Large Language Models (LLMs) often struggle to use information across long inputs effectively. Prior work has identified positional biases, such as the Lost in the Middle (LiM) effect, where models perform better when information appears at the beginning (primacy bias) or end (recency bias) of the input, rather than in the middle. However, long-context studies have not consistently replicated these effects, raising questions about their intensity and the conditions under which they manifest. To address this, we conducted a comprehensive analysis using relative rather than absolute input lengths, defined with respect to each model's context window. Our findings reveal that the LiM effect is strongest when inputs occupy up to 50% of a model's context window. Beyond that, the primacy bias weakens, while recency bias remains relatively stable. This effectively eliminates the LiM effect; instead, we observe a distance-based bias, where model performance is better when relevant information is closer to the end of the input. Furthermore, our results suggest that successful retrieval is a prerequisite for reasoning in LLMs, and that the observed positional biases in reasoning are largely inherited from retrieval. These insights have implications for long-context tasks, the design of future LLM benchmarks, and evaluation methodologies for LLMs handling extended inputs.

Title: ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models

Authors: Archchana Sindhujan, Shenbin Qian, Chan Chi Chun Matthew, Constantin Orasan, Diptesh Kanojia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07484
Pdf URL: https://arxiv.org/pdf/2508.07484
Copy Paste: [[2508.07484]] ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models(https://arxiv.org/abs/2508.07484)
Keywords: transformer, large language model
Abstract: Large Language Models (LLMs) have shown remarkable performance across a wide range of natural language processing tasks. Quality Estimation (QE) for Machine Translation (MT), which assesses the quality of a source-target pair without relying on reference translations, remains a challenging cross-lingual task for LLMs. The challenges stem from the inherent limitations of existing LLM-based QE systems, which are pre-trained for causal language modelling rather than regression-specific tasks, further exacerbated by the presence of low-resource languages given the pre-training data distribution. This paper introduces ALOPE, an adaptive layer-optimization framework designed to enhance LLM-based QE by restructuring Transformer representations through layer-wise adaptation for improved regression-based prediction. Our framework integrates low-rank adapters (LoRA) with regression task heads, leveraging selected pre-trained Transformer layers for improved cross-lingual alignment. In addition to the layer-specific adaptation, ALOPE introduces two strategies: dynamic weighting, which adaptively combines representations from multiple layers, and multi-head regression, which aggregates regression losses from multiple heads for QE. Our framework shows improvements over various existing LLM-based QE approaches. Empirical evidence suggests that intermediate Transformer layers in LLMs provide contextual representations that are more aligned with the cross-lingual nature of the QE task. We make the resultant models and framework code publicly available for further research, also allowing existing LLM-based MT frameworks to be scaled with QE capabilities.
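
Illustrative sketch (not from the paper): the abstract says dynamic weighting "adaptively combines representations from multiple layers" but does not give the form. One common choice, assumed here, is a softmax over one learnable scalar per layer followed by a weighted sum. The function names and toy vectors are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_reprs, layer_logits):
    """Weighted sum of per-layer vectors, with weights = softmax(logits).

    layer_reprs: one vector per selected layer, all of the same length.
    layer_logits: one scalar per layer (learnable in a real model).
    """
    weights = softmax(layer_logits)
    dim = len(layer_reprs[0])
    return [
        sum(w * rep[i] for w, rep in zip(weights, layer_reprs))
        for i in range(dim)
    ]


# Two toy layer representations with equal logits, so the result is their mean.
pooled = combine_layers([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
```

The pooled vector would then feed a regression head that predicts the quality score.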

Title: N-BEATS-MOE: N-BEATS with a Mixture-of-Experts Layer for Heterogeneous Time Series Forecasting

Authors: Ricardo Matos, Luis Roque, Vitor Cerqueira
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.07490
Pdf URL: https://arxiv.org/pdf/2508.07490
Copy Paste: [[2508.07490]] N-BEATS-MOE: N-BEATS with a Mixture-of-Experts Layer for Heterogeneous Time Series Forecasting(https://arxiv.org/abs/2508.07490)
Keywords: interpretability
Abstract: Deep learning approaches are increasingly relevant for time series forecasting tasks. Methods such as N-BEATS, which is built on stacks of multilayer perceptron (MLP) blocks, have achieved state-of-the-art results on benchmark datasets and competitions. N-BEATS is also more interpretable relative to other deep learning approaches, as it decomposes forecasts into different time series components, such as trend and seasonality. In this work, we present N-BEATS-MOE, an extension of N-BEATS based on a Mixture-of-Experts (MoE) layer. N-BEATS-MOE employs a dynamic block weighting strategy based on a gating network which allows the model to better adapt to the characteristics of each time series. We also hypothesize that the gating mechanism provides additional interpretability by identifying which expert is most relevant for each series. We evaluate our method across 12 benchmark datasets against several approaches, achieving consistent improvements on several datasets, especially those composed of heterogeneous time series.

Title: Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach

Authors: Yueyang Quan, Chang Wang, Shengjie Zhai, Minghong Fang, Zhuqing Liu
Subjects: cs.LG, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2508.07505
Pdf URL: https://arxiv.org/pdf/2508.07505
Copy Paste: [[2508.07505]] Enhancing Privacy in Decentralized Min-Max Optimization: A Differentially Private Approach(https://arxiv.org/abs/2508.07505)
Keywords: privacy, attack
Abstract: Decentralized min-max optimization allows multi-agent systems to collaboratively solve global min-max optimization problems by facilitating the exchange of model updates among neighboring agents, eliminating the need for a central server. However, sharing model updates in such systems carries a risk of exposing sensitive data to inference attacks, raising significant privacy concerns. To mitigate these privacy risks, differential privacy (DP) has become a widely adopted technique for safeguarding individual data. Despite its advantages, implementing DP in decentralized min-max optimization poses challenges, as the added noise can hinder convergence, particularly in non-convex scenarios with complex agent interactions in min-max optimization problems. In this work, we propose an algorithm called DPMixSGD (Differential Private Minmax Hybrid Stochastic Gradient Descent), a novel privacy-preserving algorithm specifically designed for non-convex decentralized min-max optimization. Our method builds on the state-of-the-art STORM-based algorithm, one of the fastest decentralized min-max solutions. We rigorously prove that the noise added to local gradients does not significantly compromise convergence performance, and we provide theoretical bounds to ensure privacy guarantees. To validate our theoretical findings, we conduct extensive experiments across various tasks and models, demonstrating the effectiveness of our approach.
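
Illustrative sketch (not from the paper): the abstract states only that noise is added to local gradients. The usual way to do this, as in DP-SGD, is to clip the gradient to a fixed L2 norm and add Gaussian noise scaled to that norm. The clipping step, the parameter names, and the values below are assumptions, not the DPMixSGD update rule.

```python
import math
import random


def privatize_gradient(grad, clip_norm, noise_multiplier, rng):
    """Clip a gradient to L2 norm `clip_norm`, then add Gaussian noise.

    The noise standard deviation is noise_multiplier * clip_norm, the
    usual Gaussian-mechanism calibration (an assumption for this sketch).
    """
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    sigma = noise_multiplier * clip_norm
    return [g + rng.gauss(0.0, sigma) for g in clipped]


rng = random.Random(0)  # fixed seed so the example is repeatable
# With zero noise the output is just the clipped gradient.
clipped_only = privatize_gradient([3.0, 4.0], clip_norm=1.0,
                                  noise_multiplier=0.0, rng=rng)
noisy = privatize_gradient([3.0, 4.0], clip_norm=1.0,
                           noise_multiplier=1.0, rng=rng)
```

Each agent would apply this to its local gradient before sharing an update with neighbors. A larger noise multiplier gives stronger privacy at the cost of slower convergence, which is the tension the abstract describes.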

Title: SRAM-based Physically Unclonable Function using Lightweight Hamming-Code Fuzzy Extractor for Energy Harvesting Beat Sensors

Authors: Hoang-Long Pham, Duy-Hieu Bui, Xuan-Tu Tran, Orazio Aiello
Subjects: cs.CR, eess.SY
Abstract URL: https://arxiv.org/abs/2508.07510
Pdf URL: https://arxiv.org/pdf/2508.07510
Copy Paste: [[2508.07510]] SRAM-based Physically Unclonable Function using Lightweight Hamming-Code Fuzzy Extractor for Energy Harvesting Beat Sensors(https://arxiv.org/abs/2508.07510)
Keywords: secure, security, protect
Abstract: Batteryless energy harvesting IoT sensor nodes such as beat sensors can be deployed in millions without the need to replace batteries. They are ultra-low-power, cost-effective wireless sensor nodes with no maintenance cost that can operate 24 hours a day, 365 days a year. However, they are not equipped with security mechanisms to protect user data. Data encryption and authentication can be used to secure beat sensor applications, but generating a secure cryptographic key is challenging. In this paper, we propose an SRAM-based Physically Unclonable Function (PUF) combining a high-reliability bit selection algorithm with a lightweight error-correcting code to generate reliable secure keys for data encryption. The system employs a feature of beat sensors, in which the microcontroller is powered on to transmit the ID signals and then powered off. This fits the SRAM-based PUF requirement, which needs the SRAM to be powered off to read out its random values. The proposed system has been evaluated on STM32 Cortex M0+ microcontrollers and has been implemented to protect important data on beat sensors.
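
Illustrative sketch (not from the paper): the title mentions a Hamming-code fuzzy extractor, and the abstract mentions a lightweight error-correcting code. The textbook code-offset construction with a Hamming(7,4) code shows how a noisy SRAM power-up response can still reproduce a key. The (7,4) parameters, the function names, and the bit patterns are assumptions. A full fuzzy extractor would also hash the result, which is omitted here.

```python
def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    return [d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4   # 1-based index of the error, 0 if none
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def xor_bits(a, b):
    return [x ^ y for x, y in zip(a, b)]


def enroll(puf_response, key_bits):
    """Helper data = codeword(key) XOR enrollment-time PUF response."""
    return xor_bits(hamming74_encode(key_bits), puf_response)


def reconstruct(noisy_response, helper):
    """Recover the key from a later, possibly noisy, PUF response."""
    return hamming74_decode(xor_bits(helper, noisy_response))


key = [1, 0, 1, 1]
response = [0, 1, 1, 0, 1, 0, 0]      # hypothetical 7 SRAM power-up bits
helper = enroll(response, key)
noisy = list(response)
noisy[3] ^= 1                         # one bit flips on a later power-up
recovered = reconstruct(noisy, helper)
```

The helper data can be stored in the clear. Without the device's own SRAM response it does not determine the key, and a single flipped response bit per 7-bit block is corrected at reconstruction.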

Title: From Field to Drone: Domain Drift Tolerant Automated Multi-Species and Damage Plant Semantic Segmentation for Herbicide Trials

Authors: Artzai Picon, Itziar Eguskiza, Daniel Mugica, Javier Romero, Carlos Javier Jimenez, Eric White, Gabriel Do-Lago-Junqueira, Christian Klukas, Ramon Navarra-Mestre
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07514
Pdf URL: https://arxiv.org/pdf/2508.07514
Copy Paste: [[2508.07514]] From Field to Drone: Domain Drift Tolerant Automated Multi-Species and Damage Plant Semantic Segmentation for Herbicide Trials(https://arxiv.org/abs/2508.07514)
Keywords: robust, segmentation
Abstract: Field trials are vital in herbicide research and development to assess effects on crops and weeds under varied conditions. Traditionally, evaluations rely on manual visual assessments, which are time-consuming, labor-intensive, and subjective. Automating species and damage identification is challenging due to subtle visual differences, but it can greatly enhance efficiency and consistency. We present an improved segmentation model combining a general-purpose self-supervised visual model with hierarchical inference based on botanical taxonomy. Trained on a multi-year dataset (2018-2020) from Germany and Spain using digital and mobile cameras, the model was tested on digital camera data (year 2023) and drone imagery from the United States, Germany, and Spain (year 2024) to evaluate robustness under domain shift. This cross-device evaluation marks a key step in assessing the model's generalization across platforms. Our model significantly improved species identification (F1-score: 0.52 to 0.85, R-squared: 0.75 to 0.98) and damage classification (F1-score: 0.28 to 0.44, R-squared: 0.71 to 0.87) over prior methods. Under domain shift (drone images), it maintained strong performance with moderate degradation (species: F1-score 0.60, R-squared 0.80; damage: F1-score 0.41, R-squared 0.62), where earlier models failed. These results confirm the model's robustness and real-world applicability. It is now deployed in BASF's phenotyping pipeline, enabling large-scale, automated crop and weed monitoring across diverse geographies.

Title: Augmenting Bias Detection in LLMs Using Topological Data Analysis

Authors: Keshav Varadarajan, Tananun Songdechakraiwut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07516
Pdf URL: https://arxiv.org/pdf/2508.07516
Copy Paste: [[2508.07516]] Augmenting Bias Detection in LLMs Using Topological Data Analysis(https://arxiv.org/abs/2508.07516)
Keywords: large language model
Abstract: Recently, many bias detection methods have been proposed to determine the level of bias a large language model captures. However, tests to identify which parts of a large language model are responsible for bias towards specific groups remain underdeveloped. In this study, we present a method using topological data analysis to identify which heads in GPT-2 contribute to the misrepresentation of identity groups present in the StereoSet dataset. We find that biases for particular categories, such as gender or profession, are concentrated in attention heads that act as hot spots. The metric we propose can also be used to determine which heads capture bias for a specific group within a bias category, and future work could extend this method to help de-bias large language models.

Title: Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews

Authors: Joseph T. Colonel, Baihan Lin
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2508.07517
Pdf URL: https://arxiv.org/pdf/2508.07517
Copy Paste: [[2508.07517]] Word Clouds as Common Voices: LLM-Assisted Visualization of Participant-Weighted Themes in Qualitative Interviews(https://arxiv.org/abs/2508.07517)
Keywords: interpretability, large language model
Abstract: Word clouds are a common way to summarize qualitative interviews, yet traditional frequency-based methods often fail in conversational contexts: they surface filler words, ignore paraphrase, and fragment semantically related ideas. This limits their usefulness in early-stage analysis, when researchers need fast, interpretable overviews of what participants actually said. We introduce ThemeClouds, an open-source visualization tool that uses large language models (LLMs) to generate thematic, participant-weighted word clouds from dialogue transcripts. The system prompts an LLM to identify concept-level themes across a corpus and then counts how many unique participants mention each topic, yielding a visualization grounded in breadth of mention rather than raw term frequency. Researchers can customize prompts and visualization parameters, providing transparency and control. Using interviews from a user study comparing five recording-device configurations (31 participants; 155 transcripts, Whisper ASR), our approach surfaces more actionable device concerns than frequency clouds and topic-modeling baselines (e.g., LDA, BERTopic). We discuss design trade-offs for integrating LLM assistance into qualitative workflows, implications for interpretability and researcher agency, and opportunities for interactive analyses such as per-condition contrasts ("diff clouds").
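
Illustrative sketch (not from the paper): the participant-weighted count described in the abstract amounts to counting distinct participants per theme instead of raw mentions. The theme labels would come from the LLM step. The function name, participant IDs, and themes below are hypothetical.

```python
from collections import defaultdict


def participant_weighted_counts(mentions):
    """Map each theme to the number of unique participants who mention it.

    mentions: iterable of (participant_id, theme) pairs, one per mention.
    """
    participants_by_theme = defaultdict(set)
    for participant_id, theme in mentions:
        participants_by_theme[theme].add(participant_id)
    return {theme: len(ids) for theme, ids in participants_by_theme.items()}


# Hypothetical mentions; p1 raises "battery" twice but counts once.
weights = participant_weighted_counts([
    ("p1", "battery"),
    ("p1", "battery"),
    ("p2", "battery"),
    ("p1", "comfort"),
])
```

The resulting counts would set the font sizes in the cloud, so one talkative participant cannot inflate a theme.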

Title: FairDRL-ST: Disentangled Representation Learning for Fair Spatio-Temporal Mobility Prediction

Authors: Sichen Zhao, Wei Shao, Jeffrey Chan, Ziqi Xu, Flora Salim
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.07518
Pdf URL: https://arxiv.org/pdf/2508.07518
Copy Paste: [[2508.07518]] FairDRL-ST: Disentangled Representation Learning for Fair Spatio-Temporal Mobility Prediction(https://arxiv.org/abs/2508.07518)
Keywords: fair
Abstract: As deep spatio-temporal neural networks are increasingly utilised in urban computing contexts, the deployment of such methods can have a direct impact on users of critical urban infrastructure, such as public transport, emergency services, and traffic management systems. While many spatio-temporal methods focus on improving accuracy, fairness has recently gained attention due to growing evidence that biased predictions in spatio-temporal applications can disproportionately disadvantage certain demographic or geographic groups, thereby reinforcing existing socioeconomic inequalities and undermining the ethical deployment of AI in public services. In this paper, we propose a novel framework, FairDRL-ST, based on disentangled representation learning, to address fairness concerns in spatio-temporal prediction, with a particular focus on mobility demand forecasting. By leveraging adversarial learning and disentangled representation learning, our framework learns to separate attributes that contain sensitive information. Unlike existing methods that enforce fairness through supervised learning, which may lead to overcompensation and degraded performance, our framework achieves fairness in an unsupervised manner with minimal performance loss. We apply our framework to real-world urban mobility datasets and demonstrate its ability to close fairness gaps while delivering competitive predictive performance compared to state-of-the-art fairness-aware methods.

Title: Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Authors: Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, Jaesik Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07519
Pdf URL: https://arxiv.org/pdf/2508.07519
Copy Paste: [[2508.07519]] Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing(https://arxiv.org/abs/2508.07519)
Keywords: robust, diffusion, transformer
Abstract: Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1. Previous approaches have relied on unidirectional cross-attention mechanisms, with information flowing from text embeddings to image latents. In contrast, MM-DiT introduces a unified attention mechanism that concatenates input projections from both modalities and performs a single full attention operation, allowing bidirectional information flow between text and image branches. This architectural shift presents significant challenges for existing editing techniques. In this paper, we systematically analyze MM-DiT's attention mechanism by decomposing attention matrices into four distinct blocks, revealing their inherent characteristics. Through these analyses, we propose a robust, prompt-based image editing method for MM-DiT that supports global to local edits across various MM-DiT variants, including few-step models. We believe our findings bridge the gap between existing U-Net-based methods and emerging architectures, offering deeper insights into MM-DiT's behavioral patterns.
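
Illustrative sketch (not from the paper): when text and image tokens are concatenated and attended jointly, the attention matrix splits naturally into four query/key blocks. The sketch assumes text tokens come first in the sequence, with rows as queries and columns as keys. The function name, block names, and toy matrix are hypothetical.

```python
def split_attention_blocks(attn, num_text):
    """Split a joint (text + image) attention matrix into four blocks.

    attn: square matrix as a list of rows; rows are queries, columns keys.
    num_text: number of text tokens, assumed to precede the image tokens.
    """
    t = num_text
    return {
        "text_q_text_k": [row[:t] for row in attn[:t]],
        "text_q_image_k": [row[t:] for row in attn[:t]],
        "image_q_text_k": [row[:t] for row in attn[t:]],
        "image_q_image_k": [row[t:] for row in attn[t:]],
    }


# Toy 3x3 matrix: 1 text token followed by 2 image tokens.
attn = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]
blocks = split_attention_blocks(attn, num_text=1)
```

In this layout the image-query/text-key block plays the role that cross-attention maps play in U-Net models, while the text-query/image-key block has no U-Net counterpart.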

Title: From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR

Authors: Jia Deng, Jie Chen, Zhipeng Chen, Daixuan Cheng, Fei Bai, Beichen Zhang, Yinqian Min, Yanzipeng Gao, Wayne Xin Zhao, Ji-Rong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07534
Pdf URL: https://arxiv.org/pdf/2508.07534
Copy Paste: [[2508.07534]] From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR(https://arxiv.org/abs/2508.07534)
Keywords: large language model
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering four main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.

Title: Physics-Informed Multimodal Bearing Fault Classification under Variable Operating Conditions using Transfer Learning

Authors: Tasfiq E. Alam, Md Manjurul Ahsan, Shivakumar Raman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07536
Pdf URL: https://arxiv.org/pdf/2508.07536
Copy Paste: [[2508.07536]] Physics-Informed Multimodal Bearing Fault Classification under Variable Operating Conditions using Transfer Learning(https://arxiv.org/abs/2508.07536)
Keywords: robust, extraction
Abstract: Accurate and interpretable bearing fault classification is critical for ensuring the reliability of rotating machinery, particularly under variable operating conditions where domain shifts can significantly degrade model performance. This study proposes a physics-informed multimodal convolutional neural network (CNN) with a late fusion architecture, integrating vibration and motor current signals alongside a dedicated physics-based feature extraction branch. The model incorporates a novel physics-informed loss function that penalizes physically implausible predictions based on characteristic bearing fault frequencies - Ball Pass Frequency Outer (BPFO) and Ball Pass Frequency Inner (BPFI) - derived from bearing geometry and shaft speed. Comprehensive experiments on the Paderborn University dataset demonstrate that the proposed physics-informed approach consistently outperforms a non-physics-informed baseline, achieving higher accuracy, reduced false classifications, and improved robustness across multiple data splits. To address performance degradation under unseen operating conditions, three transfer learning (TL) strategies - Target-Specific Fine-Tuning (TSFT), Layer-Wise Adaptation Strategy (LAS), and Hybrid Feature Reuse (HFR) - are evaluated. Results show that LAS yields the best generalization, with additional performance gains when combined with physics-informed modeling. Validation on the KAIST bearing dataset confirms the framework's cross-dataset applicability, achieving up to 98 percent accuracy. Statistical hypothesis testing further verifies significant improvements (p < 0.01) in classification performance. The proposed framework demonstrates the potential of integrating domain knowledge with data-driven learning to achieve robust, interpretable, and generalizable fault diagnosis for real-world industrial applications.
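Illustrative note (not from the paper): the characteristic frequencies named in the abstract follow from standard bearing kinematics. The sketch below uses the textbook formulas for a fixed outer race with no slip; the function name and the example geometry are hypothetical and are not taken from the Paderborn or KAIST datasets.

```python
import math
from typing import Tuple


def bearing_fault_frequencies(
    shaft_hz: float,
    n_rolling_elements: int,
    ball_diameter: float,
    pitch_diameter: float,
    contact_angle_deg: float = 0.0,
) -> Tuple[float, float]:
    """Return (BPFO, BPFI) in Hz from bearing geometry and shaft speed.

    Textbook kinematic formulas, assuming a stationary outer race and
    pure rolling (no slip):
        BPFO = (n / 2) * f_r * (1 - (d / D) * cos(phi))
        BPFI = (n / 2) * f_r * (1 + (d / D) * cos(phi))
    """
    ratio = (ball_diameter / pitch_diameter) * math.cos(
        math.radians(contact_angle_deg)
    )
    half_pass_rate = 0.5 * n_rolling_elements * shaft_hz
    bpfo = half_pass_rate * (1.0 - ratio)
    bpfi = half_pass_rate * (1.0 + ratio)
    return bpfo, bpfi


# Hypothetical bearing: 8 rolling elements, d/D = 0.25, shaft at 25 Hz.
bpfo_hz, bpfi_hz = bearing_fault_frequencies(
    shaft_hz=25.0, n_rolling_elements=8, ball_diameter=1.0, pitch_diameter=4.0
)
```

A useful sanity check is that BPFO and BPFI always sum to the number of rolling elements times the shaft frequency.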

Title: Enhanced Generative Structure Prior for Chinese Text Image Super-resolution

Authors: Xiaoming Li, Wangmeng Zuo, Chen Change Loy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07537
Pdf URL: https://arxiv.org/pdf/2508.07537
Copy Paste: [[2508.07537]] Enhanced Generative Structure Prior for Chinese Text Image Super-resolution(https://arxiv.org/abs/2508.07537)
Keywords: robust, generative
Abstract: Faithful text image super-resolution (SR) is challenging because each character has a unique structure and usually exhibits diverse font styles and layouts. While existing methods primarily focus on English text, less attention has been paid to more complex scripts like Chinese. In this paper, we introduce a high-quality text image SR framework designed to restore the precise strokes of low-resolution (LR) Chinese characters. Unlike methods that rely on character recognition priors to regularize the SR task, we propose a novel structure prior that offers structure-level guidance to enhance visual quality. Our framework incorporates this structure prior within a StyleGAN model, leveraging its generative capabilities for restoration. To maintain the integrity of character structures while accommodating various font styles and layouts, we implement a codebook-based mechanism that restricts the generative space of StyleGAN. Each code in the codebook represents the structure of a specific character, while the vector $w$ in StyleGAN controls the character's style, including typeface, orientation, and location. Through the collaborative interaction between the codebook and style, we generate a high-resolution structure prior that aligns with LR characters both spatially and structurally. Experiments demonstrate that this structure prior provides robust, character-specific guidance, enabling the accurate restoration of clear strokes in degraded characters, even for real-world LR Chinese text with irregular layouts. Our code and pre-trained models will be available at this https URL

Title: A DICOM Image De-identification Algorithm in the MIDI-B Challenge

Authors: Hongzhu Jiang, Sihan Xie, Zhiyu Wan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07538
Pdf URL: https://arxiv.org/pdf/2508.07538
Copy Paste: [[2508.07538]] A DICOM Image De-identification Algorithm in the MIDI-B Challenge(https://arxiv.org/abs/2508.07538)
Keywords: privacy, protect
Abstract: Image de-identification is essential for the public sharing of medical images, particularly in the widely used Digital Imaging and Communications in Medicine (DICOM) format as required by various regulations and standards, including Health Insurance Portability and Accountability Act (HIPAA) privacy rules, the DICOM PS3.15 standard, and best practices recommended by the Cancer Imaging Archive (TCIA). The Medical Image De-Identification Benchmark (MIDI-B) Challenge at the 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024) was organized to evaluate rule-based DICOM image de-identification algorithms with a large dataset of clinical DICOM images. In this report, we explore the critical challenges of de-identifying DICOM images, emphasize the importance of removing personally identifiable information (PII) to protect patient privacy while ensuring the continued utility of medical data for research, diagnostics, and treatment, and provide a comprehensive overview of the standards and regulations that govern this process. Additionally, we detail the de-identification methods we applied - such as pixel masking, date shifting, date hashing, text recognition, text replacement, and text removal - to process datasets during the test phase in strict compliance with these standards. According to the final leaderboard of the MIDI-B challenge, the latest version of our solution algorithm correctly executed 99.92% of the required actions and ranked 2nd out of 10 teams that completed the challenge (from a total of 22 registered teams). Finally, we conducted a thorough analysis of the resulting statistics and discussed the limitations of current approaches and potential avenues for future improvement.
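Illustrative note (not from the report): date shifting, one of the methods listed in the abstract, is commonly implemented by moving every date of a patient by the same offset so that intervals between studies survive. The sketch below is a minimal version under stated assumptions: dates use the DICOM DA format (YYYYMMDD), the offset is derived from a keyed hash of the patient ID, and the helper names, key, and 365-day range are hypothetical. It is not the authors' algorithm.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


def patient_offset_days(patient_id: str, secret_key: bytes, max_days: int = 365) -> int:
    """Derive a stable per-patient offset in the range 1..max_days.

    Hypothetical scheme: HMAC-SHA256 of the patient ID under a secret key,
    so the same patient always gets the same shift.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_dicom_date(da_value: str, offset_days: int) -> str:
    """Shift a DICOM DA value (YYYYMMDD) back by `offset_days` days."""
    parsed = datetime.strptime(da_value, "%Y%m%d")
    return (parsed - timedelta(days=offset_days)).strftime("%Y%m%d")


# Example with a made-up patient ID and key.
offset = patient_offset_days("PATIENT-001", b"example-secret")
shifted_study = shift_dicom_date("20240115", offset)
shifted_followup = shift_dicom_date("20240125", offset)
```

Because both dates move by the same offset, the ten-day gap between the study and the follow-up is unchanged after shifting.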

Title: Domain Generalization of Pathological Image Segmentation by Patch-Level and WSI-Level Contrastive Learning

Authors: Yuki Shigeyasu, Shota Harada, Akihiko Yoshizawa, Kazuhiro Terada, Naoki Nakazima, Mariyo Kurata, Hiroyuki Abe, Tetsuo Ushiku, Ryoma Bise
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07539
Pdf URL: https://arxiv.org/pdf/2508.07539
Copy Paste: [[2508.07539]] Domain Generalization of Pathological Image Segmentation by Patch-Level and WSI-Level Contrastive Learning(https://arxiv.org/abs/2508.07539)
Keywords: segmentation
Abstract: In this paper, we address domain shifts in pathological images by focusing on shifts within whole slide images (WSIs), such as patient characteristics and tissue thickness, rather than shifts between hospitals. Traditional approaches rely on multi-hospital data, but data collection challenges often make this impractical. Therefore, the proposed domain generalization method captures and leverages intra-hospital domain shifts by clustering WSI-level features from non-tumor regions and treating these clusters as domains. To mitigate domain shift, we apply contrastive learning to reduce feature gaps between WSI pairs from different clusters. The proposed method introduces a two-stage contrastive learning approach, combining WSI-level and patch-level contrastive learning, to minimize these gaps effectively.

Title: CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts

Authors: Junuk Cha, Jihyeon Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07540
Pdf URL: https://arxiv.org/pdf/2508.07540
Copy Paste: [[2508.07540]] CoT-Pose: Chain-of-Thought Reasoning for 3D Pose Generation from Abstract Prompts(https://arxiv.org/abs/2508.07540)
Keywords: large language model
Abstract: Recent advances in multi-modal large language models (MLLMs) and chain-of-thought (CoT) reasoning have led to significant progress in image and text generation tasks. However, the field of 3D human pose generation still faces critical limitations. Most existing text-to-pose models rely heavily on detailed (low-level) prompts that explicitly describe joint configurations. In contrast, humans tend to communicate actions and intentions using abstract (high-level) language. This mismatch results in a practical challenge for deploying pose generation systems in real-world scenarios. To bridge this gap, we introduce a novel framework that incorporates CoT reasoning into the pose generation process, enabling the interpretation of abstract prompts into accurate 3D human poses. We further propose a data synthesis pipeline that automatically generates triplets of abstract prompts, detailed prompts, and corresponding 3D poses for training. Experimental results demonstrate that our reasoning-enhanced model, CoT-Pose, can effectively generate plausible and semantically aligned poses from abstract textual inputs. This work highlights the importance of high-level understanding in pose generation and opens new directions for reasoning-enhanced approaches to human pose generation.

Title: Adaptive Pseudo Label Selection for Individual Unlabeled Data by Positive and Unlabeled Learning

Authors: Takehiro Yamane, Itaru Tsuge, Susumu Saito, Ryoma Bise
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07548
Pdf URL: https://arxiv.org/pdf/2508.07548
Copy Paste: [[2508.07548]] Adaptive Pseudo Label Selection for Individual Unlabeled Data by Positive and Unlabeled Learning(https://arxiv.org/abs/2508.07548)
Keywords: segmentation
Abstract: This paper proposes a novel pseudo-labeling method for medical image segmentation that can perform learning on "individual images" to select effective pseudo-labels. We introduce Positive and Unlabeled Learning (PU learning), which uses only positive and unlabeled data for binary classification problems, to obtain the appropriate metric for discriminating foreground and background regions on each unlabeled image. Our PU learning makes it easy to select pseudo-labels for various background regions. The experimental results show the effectiveness of our method.

Title: Decoupled Functional Evaluation of Autonomous Driving Models via Feature Map Quality Scoring

Authors: Ludan Zhang, Sihan Wang, Yuqi Dai, Shuofei Qiao, Lei He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07552
Pdf URL: https://arxiv.org/pdf/2508.07552
Copy Paste: [[2508.07552]] Decoupled Functional Evaluation of Autonomous Driving Models via Feature Map Quality Scoring(https://arxiv.org/abs/2508.07552)
Keywords: interpretability
Abstract: End-to-end models are emerging as the mainstream in autonomous driving perception and planning. However, the lack of explicit supervision signals for intermediate functional modules leads to opaque operational mechanisms and limited interpretability, making it challenging for traditional methods to independently evaluate and train these modules. As an early effort on this issue, this study builds upon an evaluation framework based on the similarity between feature maps and ground-truth representations, and proposes an independent evaluation method based on the Feature Map Convergence Score (FMCS). A Dual-Granularity Dynamic Weighted Scoring System (DG-DWSS) is constructed, formulating a unified quantitative metric - Feature Map Quality Score - to enable comprehensive evaluation of the quality of feature maps generated by functional modules. A CLIP-based Feature Map Quality Evaluation Network (CLIP-FMQE-Net) is further developed, combining feature-truth encoders and quality score prediction heads to enable real-time quality analysis of feature maps generated by functional modules. Experimental results on the NuScenes dataset demonstrate that integrating our evaluation module into training improves 3D object detection performance, achieving a 3.89 percent gain in NDS. These results verify the effectiveness of our method in enhancing feature representation quality and overall model performance.

Title: Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning

Authors: Stephan Rabanser
Subjects: cs.LG, cs.AI, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2508.07556
Pdf URL: https://arxiv.org/pdf/2508.07556
Copy Paste: [[2508.07556]] Uncertainty-Driven Reliability: Selective Prediction and Trustworthy Deployment in Modern Machine Learning(https://arxiv.org/abs/2508.07556)
Keywords: privacy, defense, robust
Abstract: Machine learning (ML) systems are increasingly deployed in high-stakes domains where reliability is paramount. This thesis investigates how uncertainty estimation can enhance the safety and trustworthiness of ML, focusing on selective prediction -- where models abstain when confidence is low. We first show that a model's training trajectory contains rich uncertainty signals that can be exploited without altering its architecture or loss. By ensembling predictions from intermediate checkpoints, we propose a lightweight, post-hoc abstention method that works across tasks, avoids the cost of deep ensembles, and achieves state-of-the-art selective prediction performance. Crucially, this approach is fully compatible with differential privacy (DP), allowing us to study how privacy noise affects uncertainty quality. We find that while many methods degrade under DP, our trajectory-based approach remains robust, and we introduce a framework for isolating the privacy-uncertainty trade-off. Next, we develop a finite-sample decomposition of the selective classification gap -- the deviation from the oracle accuracy-coverage curve -- identifying five interpretable error sources and clarifying which interventions can close the gap. This explains why calibration alone cannot fix ranking errors, motivating methods that improve uncertainty ordering. Finally, we show that uncertainty signals can be adversarially manipulated to hide errors or deny service while maintaining high accuracy, and we design defenses combining calibration audits with verifiable inference. Together, these contributions advance reliable ML by improving, evaluating, and safeguarding uncertainty estimation, enabling models that not only make accurate predictions -- but also know when to say "I do not know".
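Illustrative note (not from the thesis): selective prediction with checkpoint ensembling can be sketched as averaging class probabilities from several saved checkpoints and abstaining when the averaged confidence falls below a threshold. The function names, the use of the maximum averaged probability as the confidence score, and the threshold value are assumptions; the thesis's actual scoring rule may differ.

```python
from typing import List, Optional, Sequence


def ensemble_probs(checkpoint_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-class probabilities across intermediate checkpoints."""
    if not checkpoint_probs:
        raise ValueError("need predictions from at least one checkpoint")
    num_checkpoints = len(checkpoint_probs)
    num_classes = len(checkpoint_probs[0])
    return [
        sum(probs[c] for probs in checkpoint_probs) / num_checkpoints
        for c in range(num_classes)
    ]


def selective_predict(
    checkpoint_probs: Sequence[Sequence[float]], threshold: float
) -> Optional[int]:
    """Return the predicted class index, or None to abstain.

    Assumed confidence score: the largest averaged class probability.
    """
    averaged = ensemble_probs(checkpoint_probs)
    best_class = max(range(len(averaged)), key=averaged.__getitem__)
    return best_class if averaged[best_class] >= threshold else None


# Checkpoints that agree give a confident prediction ...
stable = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
# ... while checkpoints that flip between classes lead to abstention.
unstable = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

stable_decision = selective_predict(stable, threshold=0.75)
unstable_decision = selective_predict(unstable, threshold=0.75)
```

Raising the threshold lowers coverage (more abstentions) in exchange for higher accuracy on the inputs that are still answered, which is the trade-off the accuracy-coverage curve describes.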

Title: Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation

Authors: Minghao Yin, Yukang Cao, Songyou Peng, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07557
Pdf URL: https://arxiv.org/pdf/2508.07557
Copy Paste: [[2508.07557]] Splat4D: Diffusion-Enhanced 4D Gaussian Splatting for Temporally and Spatially Consistent Content Creation(https://arxiv.org/abs/2508.07557)
Keywords: diffusion
Abstract: Generating high-quality 4D content from monocular videos for applications such as digital humans and AR/VR poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions.

Title: Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models

Authors: Khanh-Binh Nguyen, Phuoc-Nguyen Bui, Hyunseung Choo, Duc Thanh Nguyen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07570
Pdf URL: https://arxiv.org/pdf/2508.07570
Copy Paste: [[2508.07570]] Adaptive Cache Enhancement for Test-Time Adaptation of Vision-Language Models(https://arxiv.org/abs/2508.07570)
Keywords: robust
Abstract: Vision-language models (VLMs) exhibit remarkable zero-shot generalization but suffer performance degradation under distribution shifts in downstream tasks, particularly in the absence of labeled data. Test-Time Adaptation (TTA) addresses this challenge by enabling online optimization of VLMs during inference, eliminating the need for annotated data. Cache-based TTA methods exploit historical knowledge by maintaining a dynamic memory cache of low-entropy or high-confidence samples, promoting efficient adaptation to out-of-distribution data. Nevertheless, these methods face two critical challenges: (1) unreliable confidence metrics under significant distribution shifts, resulting in error accumulation within the cache and degraded adaptation performance; and (2) rigid decision boundaries that fail to accommodate substantial distributional variations, leading to suboptimal predictions. To overcome these limitations, we introduce the Adaptive Cache Enhancement (ACE) framework, which constructs a robust cache by selectively storing high-confidence or low-entropy image embeddings per class, guided by dynamic, class-specific thresholds initialized from zero-shot statistics and iteratively refined using an exponential moving average and exploration-augmented updates. This approach enables adaptive, class-wise decision boundaries, ensuring robust and accurate predictions across diverse visual distributions. Extensive experiments on 15 diverse benchmark datasets demonstrate that ACE achieves state-of-the-art performance, delivering superior robustness and generalization compared to existing TTA methods in challenging out-of-distribution scenarios.
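Illustrative note (not from the paper): the idea of class-specific thresholds refined by an exponential moving average can be sketched as follows. This is a simplified toy, not the ACE method: it gates on prediction entropy only, omits the exploration-augmented updates, and the class name, momentum, capacity, and the choice to update the threshold on every sample are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class ClassThresholdCache:
    """Toy per-class cache gated by EMA-updated entropy thresholds."""

    def __init__(
        self,
        init_thresholds: Dict[int, float],
        momentum: float = 0.9,
        capacity: int = 3,
    ) -> None:
        # Initial thresholds stand in for "zero-shot statistics".
        self.thresholds = dict(init_thresholds)
        self.momentum = momentum
        self.capacity = capacity
        self.cache: Dict[int, List[Tuple[float, List[float]]]] = {
            label: [] for label in init_thresholds
        }

    def update(self, embedding: Sequence[float], probs: Sequence[float]) -> bool:
        """Try to store a sample; return True if it entered the cache."""
        label = max(range(len(probs)), key=probs.__getitem__)
        sample_entropy = entropy(probs)
        accepted = sample_entropy <= self.thresholds[label]
        if accepted:
            slot = self.cache[label]
            slot.append((sample_entropy, list(embedding)))
            slot.sort(key=lambda item: item[0])  # keep lowest-entropy first
            del slot[self.capacity:]             # enforce per-class capacity
        # EMA refinement of the class threshold toward observed entropies.
        self.thresholds[label] = (
            self.momentum * self.thresholds[label]
            + (1.0 - self.momentum) * sample_entropy
        )
        return accepted


cache = ClassThresholdCache({0: 0.5, 1: 0.5}, momentum=0.5)
confident_accepted = cache.update([1.0, 0.0], [0.9, 0.1])
uncertain_accepted = cache.update([0.0, 1.0], [0.5, 0.5])
```

The point of the per-class threshold is that a class whose predictions are systematically less certain under distribution shift can still fill its cache, instead of being starved by a single global cutoff.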

Title: Towards Theoretical Understanding of Transformer Test-Time Computing: Investigation on In-Context Linear Regression

Authors: Xingwu Chen, Miao Lu, Beining Wu, Difan Zou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07571
Pdf URL: https://arxiv.org/pdf/2508.07571
Copy Paste: [[2508.07571]] Towards Theoretical Understanding of Transformer Test-Time Computing: Investigation on In-Context Linear Regression(https://arxiv.org/abs/2508.07571)
Keywords: transformer
Abstract: Using more test-time computation during language model inference, such as generating more intermediate thoughts or sampling multiple candidate answers, has proven effective in significantly improving model performance. This paper takes an initial step toward bridging the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling. We focus on in-context linear regression with continuous/binary coefficients, where our framework simulates language model decoding through noise injection and binary coefficient sampling. Through this framework, we provide detailed analyses of widely adopted inference techniques. Supported by empirical results, our theoretical framework and analysis demonstrate the potential for offering new insights into understanding inference behaviors in real-world language models.

Title: Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification

Authors: Zhaorui Tan, Tan Pan, Kaizhu Huang, Weimiao Yu, Kai Yao, Chen Jiang, Qiufeng Wang, Anh Nguyen, Xin Guo, Yuan Cheng, Xi Yang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07577
Pdf URL: https://arxiv.org/pdf/2508.07577
Copy Paste: [[2508.07577]] Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification(https://arxiv.org/abs/2508.07577)
Keywords: transformer
Abstract: LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are indicative of the transitions between source and target domains; their efficacy is contingent upon the degree to which the target training samples accurately represent the target domain, as quantified by our proposed Fine-tuning Shift Ratio ($FSR$). Building on this, we propose a simple yet effective rescaling mechanism using a scalar $\lambda$ that is negatively correlated with $FSR$ to align learned LayerNorm shifts with the ideal shifts achieved under fully representative data, combined with a cyclic framework that further enhances LayerNorm fine-tuning. Extensive experiments across natural and pathological images, in both in-distribution (ID) and out-of-distribution (OOD) settings, and various target training sample regimes validate our framework. Notably, OOD tasks tend to yield lower $FSR$ and higher $\lambda$ in comparison to ID cases, especially with scarce data, indicating under-represented target training samples. Moreover, ViTFs fine-tuned on pathological data behave more like ID settings, favoring conservative LayerNorm updates. Our findings illuminate the underexplored dynamics of LayerNorm in transfer learning and provide practical strategies for LayerNorm fine-tuning.

Title: When and how can inexact generative models still sample from the data manifold?

Authors: Nisha Chandramoorthy, Adriaan de Clercq
Subjects: cs.LG, math.DS, math.PR
Abstract URL: https://arxiv.org/abs/2508.07581
Pdf URL: https://arxiv.org/pdf/2508.07581
Copy Paste: [[2508.07581]] When and how can inexact generative models still sample from the data manifold?(https://arxiv.org/abs/2508.07581)
Keywords: robust, generative
Abstract: A curious phenomenon observed in some dynamical generative models is the following: despite learning errors in the score function or the drift vector field, the generated samples appear to shift along the support of the data distribution but not away from it. In this work, we investigate this phenomenon of robustness of the support by taking a dynamical systems approach on the generating stochastic/deterministic process. Our perturbation analysis of the probability flow reveals that infinitesimal learning errors cause the predicted density to be different from the target density only on the data manifold for a wide class of generative models. Further, what is the dynamical mechanism that leads to the robustness of the support? We show that the alignment of the top Lyapunov vectors (most sensitive infinitesimal perturbation directions) with the tangent spaces along the boundary of the data manifold leads to robustness and prove a sufficient condition on the dynamics of the generating process to achieve this alignment. Moreover, the alignment condition is efficient to compute and, in practice, for robust generative models, automatically leads to accurate estimates of the tangent bundle of the data manifold. Using a finite-time linear perturbation analysis on sample paths as well as probability flows, our work complements and extends existing works on obtaining theoretical guarantees for generative models from a stochastic analysis, statistical learning and uncertainty quantification points of view. Our results apply across different dynamical generative models, such as conditional flow-matching and score-based generative models, and for different target distributions that may or may not satisfy the manifold hypothesis.

Title: IBPS: Indian Bail Prediction System

Authors: Puspesh Kumar Srivastava, Uddeshya Raj, Praveen Patel, Shubham Kumar Nigam, Noel Shallum, Arnab Bhattacharya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07592
Pdf URL: https://arxiv.org/pdf/2508.07592
Copy Paste: [[2508.07592]] IBPS: Indian Bail Prediction System(https://arxiv.org/abs/2508.07592)
Keywords: fair, large language model
Abstract: Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India's prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system.

Title: From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users

Authors: Shahroz Tariq, Simon S. Woo, Priyanka Singh, Irena Irmalasari, Saakshi Gupta, Dev Gupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07596
Pdf URL: https://arxiv.org/pdf/2508.07596
Copy Paste: [[2508.07596]] From Prediction to Explanation: Multimodal, Explainable, and Interactive Deepfake Detection Framework for Non-Expert Users(https://arxiv.org/abs/2508.07596)
Keywords: interpretability, large language model
Abstract: The proliferation of deepfake technologies poses urgent challenges and serious risks to digital integrity, particularly within critical sectors such as forensics, journalism, and the legal system. While existing detection systems have made significant progress in classification accuracy, they typically function as black-box models, offering limited transparency and minimal support for human reasoning. This lack of interpretability hinders their usability in real-world decision-making contexts, especially for non-expert users. In this paper, we present DF-P2E (Deepfake: Prediction to Explanation), a novel multimodal framework that integrates visual, semantic, and narrative layers of explanation to make deepfake detection interpretable and accessible. The framework consists of three modular components: (1) a deepfake classifier with Grad-CAM-based saliency visualisation, (2) a visual captioning module that generates natural language summaries of manipulated regions, and (3) a narrative refinement module that uses a fine-tuned Large Language Model (LLM) to produce context-aware, user-sensitive explanations. We instantiate and evaluate the framework on the DF40 benchmark, the most diverse deepfake dataset to date. Experiments demonstrate that our system achieves competitive detection performance while providing high-quality explanations aligned with Grad-CAM activations. By unifying prediction and explanation in a coherent, human-aligned pipeline, this work offers a scalable approach to interpretable deepfake detection, advancing the broader vision of trustworthy and transparent AI systems in adversarial media environments.

Title: LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Authors: Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07603
Pdf URL: https://arxiv.org/pdf/2508.07603
Copy Paste: [[2508.07603]] LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation(https://arxiv.org/abs/2508.07603)
Keywords: diffusion, transformer
Abstract: In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at this https URL.

Title: X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning

Authors: Jian Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07607
Pdf URL: https://arxiv.org/pdf/2508.07607
Copy Paste: [[2508.07607]] X2Edit: Revisiting Arbitrary-Instruction Image Editing through Self-Constructed Data and Task-Aware Representation Learning(https://arxiv.org/abs/2508.07607)
Keywords: diffusion, generative
Abstract: Existing open-source datasets for arbitrary-instruction image editing remain suboptimal, while a plug-and-play editing module compatible with community-prevalent generative models is notably absent. In this paper, we first introduce the X2Edit Dataset, a comprehensive dataset covering 14 diverse editing tasks, including subject-driven generation. We utilize industry-leading unified image generation models and expert models to construct the data. Meanwhile, we design reasonable editing instructions with the VLM and implement various scoring mechanisms to filter the data. As a result, we construct 3.7 million high-quality samples with balanced categories. Second, to integrate seamlessly with community image generation models, we design task-aware MoE-LoRA training based on FLUX.1, with only 8% of the parameters of the full model. To further improve the final performance, we utilize the internal representations of the diffusion model and define positive/negative samples based on image editing types to introduce contrastive learning. Extensive experiments demonstrate that the model's editing performance is competitive among many excellent models. Additionally, the constructed dataset exhibits substantial advantages over existing open-source datasets. The open-source code, checkpoints, and datasets for X2Edit can be found at the following link: this https URL.

Title: A Trustworthy Method for Multimodal Emotion Recognition

Authors: Junxiao Xue, Xiaozhen Liu, Jie Wang, Xuecheng Wu, Bin Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07625
Pdf URL: https://arxiv.org/pdf/2508.07625
Copy Paste: [[2508.07625]] A Trustworthy Method for Multimodal Emotion Recognition(https://arxiv.org/abs/2508.07625)
Keywords: robust
Abstract: Existing emotion recognition methods mainly focus on enhancing performance by employing complex deep models, typically resulting in significantly higher model complexity. Although such models are effective, it is also crucial to ensure the reliability of the final decision, especially for noisy, corrupted and out-of-distribution data. To this end, we propose a novel emotion recognition method called trusted emotion recognition (TER), which utilizes uncertainty estimation to calculate the confidence value of predictions. TER combines the results from multiple modalities based on their confidence values to output the trusted predictions. We also provide a new evaluation criterion to assess the reliability of predictions. Specifically, we incorporate trusted precision and trusted recall to determine the trusted threshold and formulate the trusted Acc. and trusted F1 score to evaluate the model's trusted performance. The proposed framework incorporates a confidence module that endows the model with reliability and robustness against possible noise or corruption. The extensive experimental results validate the effectiveness of our proposed model. TER achieves state-of-the-art performance on the Music-video dataset, with 82.40% Acc. In terms of trusted performance, TER outperforms other methods on the IEMOCAP and Music-video datasets, achieving trusted F1 scores of 0.7511 and 0.9035, respectively.

Title: Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo

Authors: Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, Sanjay Shakkottai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07631
Pdf URL: https://arxiv.org/pdf/2508.07631
Copy Paste: [[2508.07631]] Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo(https://arxiv.org/abs/2508.07631)
Keywords: generative
Abstract: We study the problem of posterior sampling in the context of score based generative models. We have a trained score network for a prior $p(x)$, a measurement model $p(y|x)$, and are tasked with sampling from the posterior $p(x|y)$. Prior work has shown this to be intractable in KL (in the worst case) under well-accepted computational hardness assumptions. Despite this, popular algorithms for tasks such as image super-resolution, stylization, and reconstruction enjoy empirical success. Rather than establishing distributional assumptions or restricted settings under which exact posterior sampling is tractable, we view this as a more general "tilting" problem of biasing a distribution towards a measurement. Under minimal assumptions, we show that one can tractably sample from a distribution that is simultaneously close to the posterior of a noised prior in KL divergence and the true posterior in Fisher divergence. Intuitively, this combination ensures that the resulting sample is consistent with both the measurement and the prior. To the best of our knowledge these are the first formal results for (approximate) posterior sampling in polynomial time.

Title: Extracting Complex Topology from Multivariate Functional Approximation: Contours, Jacobi Sets, and Ridge-Valley Graphs

Authors: Guanqun Ma, David Lenz, Hanqi Guo, Tom Peterka, Bei Wang
Subjects: cs.LG, cs.CG
Abstract URL: https://arxiv.org/abs/2508.07637
Pdf URL: https://arxiv.org/pdf/2508.07637
Copy Paste: [[2508.07637]] Extracting Complex Topology from Multivariate Functional Approximation: Contours, Jacobi Sets, and Ridge-Valley Graphs(https://arxiv.org/abs/2508.07637)
Keywords: extraction
Abstract: Implicit continuous models, such as functional models and implicit neural networks, are an increasingly popular method for replacing discrete data representations with continuous, high-order, and differentiable surrogates. These models offer new perspectives on the storage, transfer, and analysis of scientific data. In this paper, we introduce the first framework to directly extract complex topological features -- contours, Jacobi sets, and ridge-valley graphs -- from a type of continuous implicit model known as multivariate functional approximation (MFA). MFA replaces discrete data with continuous piecewise smooth functions. Given an MFA model as the input, our approach enables direct extraction of complex topological features from the model, without reverting to a discrete representation of the model. Our work is easily generalizable to any continuous implicit model that supports the queries of function values and high-order derivatives. Our work establishes the building blocks for performing topological data analysis and visualization on implicit continuous models.

Title: Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals

Authors: Jia Zhang, Yao Liu, Chen-Xi Zhang, Yi Liu, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07638
Pdf URL: https://arxiv.org/pdf/2508.07638
Copy Paste: [[2508.07638]] Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals(https://arxiv.org/abs/2508.07638)
Keywords: robust, large language model
Abstract: Aligning Large Language Models (LLMs) with diverse human values requires moving beyond a single holistic "better-than" preference criterion. While collecting fine-grained, aspect-specific preference data is more reliable and scalable, existing methods like Direct Preference Optimization (DPO) struggle with the severe noise and conflicts inherent in such aggregated datasets. In this paper, we tackle this challenge from a data-centric perspective. We first derive the Direct Multi-Preference Optimization (DMPO) objective, and uncover a key Preference Divergence (PD) term that quantifies inter-aspect preference conflicts. Instead of using this term for direct optimization, we leverage it to formulate a novel, theoretically-grounded data selection principle. Our principle advocates for selecting a subset of high-consensus data -- identified by the most negative PD values -- for efficient DPO training. We prove the optimality of this strategy by analyzing the loss bounds of the DMPO objective in the selection problem. To operationalize our approach, we introduce practical methods of PD term estimation and length bias mitigation, thereby proposing our PD selection method. Evaluation on the UltraFeedback dataset with three varying conflict levels shows that our simple yet effective strategy achieves over 10% relative improvement against both the standard holistic preference and a stronger oracle using aggregated preference signals, all while boosting training efficiency and obviating the need for intractable holistic preference annotation, unlocking the potential of robust LLM alignment via fine-grained preference signals.

Title: Multi-Turn Jailbreaks Are Simpler Than They Seem

Authors: Xiaoxue Yang, Jaeha Lee, Anna-Katharina Dick, Jasper Timm, Fei Xie, Diogo Cruz
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07646
Pdf URL: https://arxiv.org/pdf/2508.07646
Copy Paste: [[2508.07646]] Multi-Turn Jailbreaks Are Simpler Than They Seem(https://arxiv.org/abs/2508.07646)
Keywords: protect, defense, attack, large language model
Abstract: While defenses against single-turn jailbreak attacks on Large Language Models (LLMs) have improved significantly, multi-turn jailbreaks remain a persistent vulnerability, often achieving success rates exceeding 70% against models optimized for single-turn protection. This work presents an empirical analysis of automated multi-turn jailbreak attacks across state-of-the-art models including GPT-4, Claude, and Gemini variants, using the StrongREJECT benchmark. Our findings challenge the perceived sophistication of multi-turn attacks: when accounting for the attacker's ability to learn from how models refuse harmful requests, multi-turn jailbreaking approaches are approximately equivalent to simply resampling single-turn attacks multiple times. Moreover, attack success is correlated among similar models, making it easier to jailbreak newly released ones. Additionally, for reasoning models, we find surprisingly that higher reasoning effort often leads to higher attack success rates. Our results have important implications for AI safety evaluation and the design of jailbreak-resistant systems. We release the source code at this https URL

Title: LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

Authors: Xiaohang Zhan, Dingming Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07647
Pdf URL: https://arxiv.org/pdf/2508.07647
Copy Paste: [[2508.07647]] LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering(https://arxiv.org/abs/2508.07647)
Keywords: diffusion
Abstract: We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to "render" the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.

Title: Collaborative Learning of Scattering and Deep Features for SAR Target Recognition with Noisy Labels

Authors: Yimin Fu, Zhunga Liu, Dongxiu Guo, Longfei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07656
Pdf URL: https://arxiv.org/pdf/2508.07656
Copy Paste: [[2508.07656]] Collaborative Learning of Scattering and Deep Features for SAR Target Recognition with Noisy Labels(https://arxiv.org/abs/2508.07656)
Keywords: robust
Abstract: The acquisition of high-quality labeled synthetic aperture radar (SAR) data is challenging due to the demanding requirement for expert knowledge. Consequently, the presence of unreliable noisy labels is unavoidable, which results in performance degradation of SAR automatic target recognition (ATR). Existing research on learning with noisy labels mainly focuses on image data. However, the non-intuitive visual characteristics of SAR data are insufficient to achieve noise-robust learning. To address this problem, we propose collaborative learning of scattering and deep features (CLSDF) for SAR ATR with noisy labels. Specifically, a multi-model feature fusion framework is designed to integrate scattering and deep features. The attributed scattering centers (ASCs) are treated as dynamic graph structure data, and the extracted physical characteristics effectively enrich the representation of deep image features. Then, the samples with clean and noisy labels are divided by modeling the loss distribution with multiple class-wise Gaussian Mixture Models (GMMs). Afterward, the semi-supervised learning of two divergent branches is conducted based on the data divided by each other. Moreover, a joint distribution alignment strategy is introduced to enhance the reliability of co-guessed labels. Extensive experiments have been done on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset, and the results show that the proposed method can achieve state-of-the-art performance under different operating conditions with various label noises.

Title: GLiClass: Generalist Lightweight Model for Sequence Classification Tasks

Authors: Ihor Stepanov, Mykhailo Shtopko, Dmytro Vodianytskyi, Oleksandr Lukashov, Alexander Yavorskyi, Mykyta Yaroshenko
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.07662
Pdf URL: https://arxiv.org/pdf/2508.07662
Copy Paste: [[2508.07662]] GLiClass: Generalist Lightweight Model for Sequence Classification Tasks(https://arxiv.org/abs/2508.07662)
Keywords: generative
Abstract: Classification is one of the most widespread tasks in AI applications, serving often as the first step in filtering, sorting, and categorizing data. Since modern AI systems must handle large volumes of input data and early pipeline stages can propagate errors downstream, achieving high efficiency and accuracy is critical. Moreover, classification requirements can change dynamically based on user needs, necessitating models with strong zero-shot capabilities. While generative LLMs have become mainstream for zero-shot classification due to their versatility, they suffer from inconsistent instruction following and computational inefficiency. Cross-encoders, commonly used as rerankers in RAG pipelines, face a different bottleneck: they must process text-label pairs sequentially, significantly reducing efficiency with large label sets. Embedding-based approaches offer good efficiency but struggle with complex scenarios involving logical and semantic constraints. We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks. Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios. Additionally, we adapted proximal policy optimization (PPO) for multi-label text classification, enabling training classifiers in data-sparse conditions or from human feedback.

Title: AIS-LLM: A Unified Framework for Maritime Trajectory Prediction, Anomaly Detection, and Collision Risk Assessment with Explainable Forecasting

Authors: Hyobin Park, Jinwook Jung, Minseok Seo, Hyunsoo Choi, Deukjae Cho, Sekil Park, Dong-Geol Choi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07668
Pdf URL: https://arxiv.org/pdf/2508.07668
Copy Paste: [[2508.07668]] AIS-LLM: A Unified Framework for Maritime Trajectory Prediction, Anomaly Detection, and Collision Risk Assessment with Explainable Forecasting(https://arxiv.org/abs/2508.07668)
Keywords: large language model
Abstract: With the increase in maritime traffic and the mandatory implementation of the Automatic Identification System (AIS), the importance and diversity of maritime traffic analysis tasks based on AIS data, such as vessel trajectory prediction, anomaly detection, and collision risk assessment, is rapidly growing. However, existing approaches tend to address these tasks individually, making it difficult to holistically consider complex maritime situations. To address this limitation, we propose a novel framework, AIS-LLM, which integrates time-series AIS data with a large language model (LLM). AIS-LLM consists of a Time-Series Encoder for processing AIS sequences, an LLM-based Prompt Encoder, a Cross-Modality Alignment Module for semantic alignment between time-series data and textual prompts, and an LLM-based Multi-Task Decoder. This architecture enables the simultaneous execution of three key tasks: trajectory prediction, anomaly detection, and risk assessment of vessel collisions within a single end-to-end system. Experimental results demonstrate that AIS-LLM outperforms existing methods across individual tasks, validating its effectiveness. Furthermore, by integratively analyzing task outputs to generate situation summaries and briefings, AIS-LLM presents the potential for more intelligent and efficient maritime traffic management.

Title: Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Authors: Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C.S. Lui, Wei Chen, Carlee Joe-Wong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07675
Pdf URL: https://arxiv.org/pdf/2508.07675
Copy Paste: [[2508.07675]] Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation(https://arxiv.org/abs/2508.07675)
Keywords: large language model
Abstract: Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms achieve performance matching or exceeding that of baselines.

Title: Multi-Hop Privacy Propagation for Differentially Private Federated Learning in Social Networks

Authors: Chenchen Lin, Xuehe Wang
Subjects: cs.LG, cs.DC, cs.GT
Abstract URL: https://arxiv.org/abs/2508.07676
Pdf URL: https://arxiv.org/pdf/2508.07676
Copy Paste: [[2508.07676]] Multi-Hop Privacy Propagation for Differentially Private Federated Learning in Social Networks(https://arxiv.org/abs/2508.07676)
Keywords: privacy, protect, federate
Abstract: Federated learning (FL) enables collaborative model training across decentralized clients without sharing local data, thereby enhancing privacy and facilitating collaboration among clients connected via social networks. However, these social connections introduce privacy externalities: a client's privacy loss depends not only on its privacy protection strategy but also on the privacy decisions of others, propagated through the network via multi-hop interactions. In this work, we propose a socially-aware privacy-preserving FL mechanism that systematically quantifies indirect privacy leakage through a multi-hop propagation model. We formulate the server-client interaction as a two-stage Stackelberg game, where the server, as the leader, optimizes incentive policies, and clients, as followers, strategically select their privacy budgets, which determine their privacy-preserving levels by controlling the magnitude of added noise. To mitigate information asymmetry in networked privacy estimation, we introduce a mean-field estimator to approximate the average external privacy risk. We theoretically prove the existence and convergence of the fixed point of the mean-field estimator and derive closed-form expressions for the Stackelberg Nash Equilibrium. Despite being designed from a client-centric incentive perspective, our mechanism achieves approximately-optimal social welfare, as revealed by Price of Anarchy (PoA) analysis. Experiments on diverse datasets demonstrate that our approach significantly improves client utilities and reduces server costs while maintaining model performance, outperforming both Social-Agnostic (SA) baselines and methods that account for social externalities.

Title: MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation

Authors: Yooseok Lim, ByoungJun Jeon, Seong-A Park, Jisoo Lee, Sae Won Choi, Chang Wook Jeong, Ho-Geol Ryu, Hongyeol Lee, Hyun-Lim Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07681
Pdf URL: https://arxiv.org/pdf/2508.07681
Copy Paste: [[2508.07681]] MORE-CLEAR: Multimodal Offline Reinforcement learning for Clinical notes Leveraged Enhanced State Representation(https://arxiv.org/abs/2508.07681)
Keywords: extraction
Abstract: Sepsis, a life-threatening inflammatory response to infection, causes organ dysfunction, making early detection and optimal management critical. Previous reinforcement learning (RL) approaches to sepsis management rely primarily on structured data, such as lab results or vital signs, and lack a comprehensive understanding of the patient's condition. In this work, we propose a Multimodal Offline REinforcement learning for Clinical notes Leveraged Enhanced stAte Representation (MORE-CLEAR) framework for sepsis control in intensive care units. MORE-CLEAR employs pre-trained large-scale language models (LLMs) to facilitate the extraction of rich semantic representations from clinical notes, preserving clinical context and improving patient state representation. Gated fusion and cross-modal attention allow dynamic weight adjustment in the context of time and the effective integration of multimodal data. Extensive cross-validation using two public (MIMIC-III and MIMIC-IV) and one private dataset demonstrates that MORE-CLEAR significantly improves estimated survival rate and policy performance compared to single-modal RL approaches. To our knowledge, this is the first work to leverage LLM capabilities within multimodal offline RL for better state representation in medical applications. This approach can potentially expedite the treatment and management of sepsis by enabling reinforcement learning models to propose enhanced actions based on a more comprehensive understanding of patient conditions.

Title: DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework

Authors: Wenzhuo Ma, Zhenzhong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07682
Pdf URL: https://arxiv.org/pdf/2508.07682
Copy Paste: [[2508.07682]] DiffVC-OSD: One-Step Diffusion-based Perceptual Neural Video Compression Framework(https://arxiv.org/abs/2508.07682)
Keywords: diffusion
Abstract: In this work, we first propose DiffVC-OSD, a One-Step Diffusion-based Perceptual Neural Video Compression framework. Unlike conventional multi-step diffusion-based methods, DiffVC-OSD feeds the reconstructed latent representation directly into a One-Step Diffusion Model, enhancing perceptual quality through a single diffusion step guided by both temporal context and the latent itself. To better leverage temporal dependencies, we design a Temporal Context Adapter that encodes conditional inputs into multi-level features, offering more fine-grained guidance for the Denoising Unet. Additionally, we employ an End-to-End Finetuning strategy to improve overall compression performance. Extensive experiments demonstrate that DiffVC-OSD achieves state-of-the-art perceptual compression performance, offers about 20× faster decoding and an 86.92% bitrate reduction compared to the corresponding multi-step diffusion-based variant.

Title: TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding

Authors: Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu, Fei Yu, Chengjiang Long
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07683
Pdf URL: https://arxiv.org/pdf/2508.07683
Copy Paste: [[2508.07683]] TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding(https://arxiv.org/abs/2508.07683)
Keywords: robust
Abstract: Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to enforce explicit supervision to the thought content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.

Title: LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval

Authors: Luyao Zhuang, Qinggang Zhang, Huachi Zhou, Juhua Liu, Qing Li, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07690
Pdf URL: https://arxiv.org/pdf/2508.07690
Copy Paste: [[2508.07690]] LoSemB: Logic-Guided Semantic Bridging for Inductive Tool Retrieval(https://arxiv.org/abs/2508.07690)
Keywords: large language model
Abstract: Tool learning has emerged as a promising paradigm for large language models (LLMs) to solve many real-world tasks. Nonetheless, with the tool repository rapidly expanding, it is impractical to contain all tools within the limited input length of LLMs. To alleviate these issues, researchers have explored incorporating a tool retrieval module to select the most relevant tools or represent tools as unique tokens within LLM parameters. However, most state-of-the-art methods are under transductive settings, assuming all tools have been observed during training. Such a setting deviates from reality as the real-world tool repository is evolving and incorporates new tools frequently. When dealing with these unseen tools, which refer to tools not encountered during the training phase, these methods are limited by two key issues, including the large distribution shift and the vulnerability of similarity-based retrieval. To this end, inspired by human cognitive processes of mastering unseen tools through discovering and applying the logical information from prior experience, we introduce a novel Logic-Guided Semantic Bridging framework for inductive tool retrieval, namely, LoSemB, which aims to mine and transfer latent logical information for inductive tool retrieval without costly retraining. Specifically, LoSemB contains a logic-based embedding alignment module to mitigate distribution shifts and implements a relational augmented retrieval mechanism to reduce the vulnerability of similarity-based retrieval. Extensive experiments demonstrate that LoSemB achieves advanced performance in inductive settings while maintaining desirable effectiveness in the transductive setting.

Title: Semantic-Enhanced Time-Series Forecasting via Large Language Models

Authors: Hao Liu, Chun Yang, Zhang xiaoxing, Xiaobin Zhu
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2508.07697
Pdf URL: https://arxiv.org/pdf/2508.07697
Copy Paste: [[2508.07697]] Semantic-Enhanced Time-Series Forecasting via Large Language Models(https://arxiv.org/abs/2508.07697)
Keywords: interpretability, transformer, large language model
Abstract: Time series forecasting plays a significant role in finance, energy, meteorology, and IoT applications. Recent studies have leveraged the generalization capabilities of large language models (LLMs) to adapt to time series forecasting, achieving promising performance. However, existing studies focus on token-level modal alignment, instead of bridging the intrinsic modality gap between linguistic knowledge structures and time series data patterns, greatly limiting the semantic representation. To address this issue, we propose a novel Semantic-Enhanced LLM (SE-LLM) that explores the inherent periodicity and anomalous characteristics of time series and embeds them into the semantic space to enhance the token embedding. This process enhances the interpretability of tokens for LLMs, thereby activating the potential of LLMs for temporal sequence analysis. Moreover, existing Transformer-based LLMs excel at capturing long-range dependencies but are weak at modeling short-term anomalies in time-series data. Hence, we propose a plugin module embedded within self-attention that models long-term and short-term dependencies to effectively adapt LLMs to time-series analysis. Our approach freezes the LLM and reduces the sequence dimensionality of tokens, greatly reducing computational consumption. Experiments demonstrate the superior performance of our SE-LLM against the state-of-the-art (SOTA) methods.

Title: Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing

Authors: Weitao Wang, Haoran Xu, Jun Meng, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07700
Pdf URL: https://arxiv.org/pdf/2508.07700
Copy Paste: [[2508.07700]] Make Your MoVe: Make Your 3D Contents by Adapting Multi-View Diffusion Models to External Editing(https://arxiv.org/abs/2508.07700)
Keywords: diffusion
Abstract: As 3D generation techniques continue to flourish, the demand for generating personalized content is rapidly rising. Users increasingly seek to apply various editing methods to polish generated 3D content, aiming to enhance its color, style, and lighting without compromising the underlying geometry. However, most existing editing tools focus on the 2D domain, and directly feeding their results into 3D generation methods (like multi-view diffusion models) will introduce information loss, degrading the quality of the final 3D assets. In this paper, we propose a tuning-free, plug-and-play scheme that aligns edited assets with their original geometry in a single inference run. Central to our approach is a geometry preservation module that guides the edited multi-view generation with original input normal latents. Besides, an injection switcher is proposed to deliberately control the supervision extent of the original normals, ensuring the alignment between the edited color and normal views. Extensive experiments show that our method consistently improves both the multi-view consistency and mesh quality of edited 3D assets, across multiple combinations of multi-view diffusion models and editing methods.

Title: What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction

Authors: Charlie Wyatt, Aditya Joshi, Flora Salim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07702
Pdf URL: https://arxiv.org/pdf/2508.07702
Copy Paste: [[2508.07702]] What am I missing here?: Evaluating Large Language Models for Masked Sentence Prediction(https://arxiv.org/abs/2508.07702)
Keywords: transformer, large language model
Abstract: Transformer-based models primarily rely on Next Token Prediction (NTP), which predicts the next token in a sequence based on the preceding context. However, NTP's focus on single-token prediction often limits a model's ability to plan ahead or maintain long-range coherence, raising questions about how well LLMs can predict longer contexts, such as full sentences within structured documents. While NTP encourages local fluency, it provides no explicit incentive to ensure global coherence across sentence boundaries -- an essential skill for reconstructive or discursive tasks. To investigate this, we evaluate three commercial LLMs (GPT-4o, Claude 3.5 Sonnet, and Gemini 2.0 Flash) on Masked Sentence Prediction (MSP) -- the task of infilling a randomly removed sentence -- from three domains: ROCStories (narrative), Recipe1M (procedural), and Wikipedia (expository). We assess both fidelity (similarity to the original sentence) and cohesiveness (fit within the surrounding context). Our key finding reveals that commercial LLMs, despite their superlative performance in other tasks, are poor at predicting masked sentences in low-structured domains, highlighting a gap in current model capabilities.

Title: Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformer

Authors: Jingya Wang, Xin Deng, Wenjie Wei, Dehao Zhang, Shuai Wang, Qian Sun, Jieyuan Zhang, Hanwen Liu, Ning Xie, Malu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07710
Pdf URL: https://arxiv.org/pdf/2508.07710
Copy Paste: [[2508.07710]] Training-Free ANN-to-SNN Conversion for High-Performance Spiking Transformer(https://arxiv.org/abs/2508.07710)
Keywords: transformer
Abstract: Leveraging the event-driven paradigm, Spiking Neural Networks (SNNs) offer a promising approach for constructing energy-efficient Transformer architectures. Compared to directly trained Spiking Transformers, ANN-to-SNN conversion methods bypass the high training costs. However, existing methods still suffer from notable limitations, failing to effectively handle nonlinear operations in Transformer architectures and requiring additional fine-tuning processes for pre-trained ANNs. To address these issues, we propose a high-performance and training-free ANN-to-SNN conversion framework tailored for Transformer architectures. Specifically, we introduce a Multi-basis Exponential Decay (MBE) neuron, which employs an exponential decay strategy and multi-basis encoding method to efficiently approximate various nonlinear operations. It removes the requirement for weight modifications in pre-trained ANNs. Extensive experiments across diverse tasks (CV, NLU, NLG) and mainstream Transformer architectures (ViT, RoBERTa, GPT-2) demonstrate that our method achieves near-lossless conversion accuracy with significantly lower latency. This provides a promising pathway for the efficient and scalable deployment of Spiking Transformers in real-world applications.

Title: Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information

Authors: Jinghan Yang, Jiayu Weng
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.07713
Pdf URL: https://arxiv.org/pdf/2508.07713
Copy Paste: [[2508.07713]] Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information(https://arxiv.org/abs/2508.07713)
Keywords: robust
Abstract: Deep neural networks can memorize corrupted labels, making data quality critical for model performance, yet real-world datasets are frequently compromised by both label noise and input noise. This paper proposes a mutual information-based framework for data selection under hybrid noise scenarios that quantifies statistical dependencies between inputs and labels. We compute each sample's pointwise contribution to the overall mutual information and find that lower contributions indicate noisy or mislabeled instances. Empirical validation on MNIST with different synthetic noise settings demonstrates that the method effectively filters low-quality samples. Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling. Furthermore, the method exhibits robustness to benign input modifications, preserving semantically valid data while filtering truly corrupted samples.
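
Note: the idea of scoring each sample by its pointwise contribution to mutual information can be sketched on discrete toy data. The count-based estimator below is an assumption chosen for readability, and the abstract does not say how the authors estimate these quantities for image inputs such as MNIST. The function name `pointwise_mi` and the toy pairs are hypothetical.

```python
import math
from collections import Counter


def pointwise_mi(pairs):
    """Return PMI(x, y) = log[p(x, y) / (p(x) p(y))] for every (x, y) pair."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # Empirical probabilities: p(x, y) = joint/n, p(x) = px/n, p(y) = py/n.
    return [math.log(joint[(x, y)] * n / (px[x] * py[y])) for x, y in pairs]


# Toy data: feature "a" usually pairs with label 0 and "b" with label 1.
# The final pair ("a", 1) plays the role of a mislabeled sample.
data = [("a", 0)] * 4 + [("b", 1)] * 4 + [("a", 1)]
scores = pointwise_mi(data)
suspect = min(range(len(data)), key=scores.__getitem__)
```

In this toy case the inconsistent pair receives a negative score while the consistent pairs score positive, which matches the abstract's observation that low contributions flag noisy or mislabeled instances.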

Title: DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models

Authors: Licheng Zhang, Bach Le, Naveed Akhtar, Tuan Ngo
Subjects: cs.CV, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2508.07714
Pdf URL: https://arxiv.org/pdf/2508.07714
Copy Paste: [[2508.07714]] DoorDet: Semi-Automated Multi-Class Door Detection Dataset via Object Detection and Large Language Models(https://arxiv.org/abs/2508.07714)
Keywords: large language model
Abstract: Accurate detection and classification of diverse door types in floor plan drawings is critical for multiple applications, such as building compliance checking and indoor scene understanding. Despite their importance, publicly available datasets specifically designed for fine-grained multi-class door detection remain scarce. In this work, we present a semi-automated pipeline that leverages a state-of-the-art object detector and a large language model (LLM) to construct a multi-class door detection dataset with minimal manual effort. Doors are first detected as a unified category using a deep object detection model. Next, an LLM classifies each detected instance based on its visual and contextual features. Finally, a human-in-the-loop stage ensures high-quality labels and bounding boxes. Our method significantly reduces annotation cost while producing a dataset suitable for benchmarking neural models in floor plan analysis. This work demonstrates the potential of combining deep learning and multimodal reasoning for efficient dataset construction in complex real-world domains.

Title: A Registration-Based Star-Shape Segmentation Model and Fast Algorithms

Authors: Daoping Zhang, Xue-Cheng Tai, Lok Ming Lui
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2508.07721
Pdf URL: https://arxiv.org/pdf/2508.07721
Copy Paste: [[2508.07721]] A Registration-Based Star-Shape Segmentation Model and Fast Algorithms(https://arxiv.org/abs/2508.07721)
Keywords: segmentation
Abstract: Image segmentation plays a crucial role in extracting objects of interest and identifying their boundaries within an image. However, accurate segmentation becomes challenging when dealing with occlusions, obscurities, or noise in corrupted images. To tackle this challenge, prior information is often utilized, with recent attention on star-shape priors. In this paper, we propose a star-shape segmentation model based on the registration framework. By combining the level set representation with the registration framework and imposing constraints on the deformed level set function, our model enables both full and partial star-shape segmentation, accommodating single or multiple centers. Additionally, our approach allows for the enforcement of identified boundaries to pass through specified landmark locations. We tackle the proposed models using the alternating direction method of multipliers. Through numerical experiments conducted on synthetic and real images, we demonstrate the efficacy of our approach in achieving accurate star-shape segmentation.

Title: Robust Reinforcement Learning over Wireless Networks with Homomorphic State Representations

Authors: Pietro Talli, Federico Mason, Federico Chiariotti, Andrea Zanella
Subjects: cs.LG, cs.IT, cs.MA
Abstract URL: https://arxiv.org/abs/2508.07722
Pdf URL: https://arxiv.org/pdf/2508.07722
Copy Paste: [[2508.07722]] Robust Reinforcement Learning over Wireless Networks with Homomorphic State Representations(https://arxiv.org/abs/2508.07722)
Keywords: robust
Abstract: In this work, we address the problem of training Reinforcement Learning (RL) agents over communication networks. The RL paradigm requires the agent to instantaneously perceive the state evolution to infer the effects of its actions on the environment. This is impossible if the agent receives state updates over lossy or delayed wireless systems and thus operates with partial and intermittent information. In recent years, numerous frameworks have been proposed to manage RL with imperfect feedback; however, they often offer specific solutions with a substantial computational burden. To address these limits, we propose a novel architecture, named Homomorphic Robust Remote Reinforcement Learning (HR3L), that enables the training of remote RL agents exchanging observations across a non-ideal wireless channel. HR3L considers two units: the transmitter, which encodes meaningful representations of the environment, and the receiver, which decodes these messages and performs actions to maximize a reward signal. Importantly, HR3L does not require the exchange of gradient information across the wireless channel, allowing for quicker training and a lower communication overhead than state-of-the-art solutions. Experimental results demonstrate that HR3L significantly outperforms baseline methods in terms of sample efficiency and adapts to different communication scenarios, including packet losses, delayed transmissions, and capacity limitations.

Title: Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting

Authors: Ting Xiang, Changjian Chen, Zhuo Tang, Qifeng Zhang, Fei Lyu, Li Yang, Jiapeng Zhang, Kenli Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07723
Pdf URL: https://arxiv.org/pdf/2508.07723
Copy Paste: [[2508.07723]] Enhancing Small-Scale Dataset Expansion with Triplet-Connection-based Sample Re-Weighting(https://arxiv.org/abs/2508.07723)
Keywords: generative
Abstract: The performance of computer vision models in certain real-world applications, such as medical diagnosis, is often limited by the scarcity of available images. Expanding datasets using pre-trained generative models is an effective solution. However, due to the uncontrollable generation process and the ambiguity of natural language, noisy images may be generated. Re-weighting is an effective way to address this issue by assigning low weights to such noisy images. We first theoretically analyze three types of supervision for the generated images. Based on the theoretical analysis, we develop TriReWeight, a triplet-connection-based sample re-weighting method to enhance generative data augmentation. Theoretically, TriReWeight can be integrated with any generative data augmentation method and never degrades its performance. Moreover, its generalization approaches the optimum on the order of $O(\sqrt{d\ln (n)/n})$. Our experiments validate the correctness of the theoretical analysis and demonstrate that our method outperforms the existing SOTA methods by $7.9\%$ on average over six natural image datasets and by $3.4\%$ on average over three medical datasets. We also experimentally validate that our method can enhance the performance of different generative data augmentation methods.

Title: Separation and Collaboration: Two-Level Routing Grouped Mixture-of-Experts for Multi-Domain Continual Learning

Authors: Jialu Zhou, Dianxi Shi, Shaowu Yang, Xinyu Wei, Mingyue Yang, Leqian Li, Mengzhu Wang, Chunping Qiu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07738
Pdf URL: https://arxiv.org/pdf/2508.07738
Copy Paste: [[2508.07738]] Separation and Collaboration: Two-Level Routing Grouped Mixture-of-Experts for Multi-Domain Continual Learning(https://arxiv.org/abs/2508.07738)
Keywords: large language model
Abstract: Multi-Domain Continual Learning (MDCL) acquires knowledge from sequential tasks with shifting class sets and distributions. Although Parameter-Efficient Fine-Tuning (PEFT) methods can adapt to this dual heterogeneity, they still suffer from catastrophic forgetting and forward forgetting. To address these challenges, we propose a Two-Level Routing Grouped Mixture-of-Experts (TRGE) method. Firstly, TRGE dynamically expands the pre-trained CLIP model, assigning a specific expert group to each task to mitigate catastrophic forgetting. As the number of experts grows continually in this process, TRGE keeps the expert count within each group static and introduces an intra-group router to alleviate the routing overfitting caused by increasing routing complexity. Meanwhile, we design an inter-group routing policy based on task identifiers and task prototype distance, which dynamically selects relevant expert groups and combines their outputs to enhance inter-task collaboration. Secondly, to obtain the correct task identifiers, we leverage Multimodal Large Language Models (MLLMs), which have powerful multimodal comprehension capabilities, to generate semantic task descriptions and recognize the correct task identifier. Finally, to mitigate forward forgetting, we dynamically fuse outputs for unseen samples from the frozen CLIP model and the TRGE adapter based on training progress, leveraging both pre-trained and learned knowledge. Through extensive experiments across various settings, our method outperforms other advanced methods with fewer trainable parameters.

Title: Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation

Authors: Jiongchi Yu, Xiaofei Xie, Qiang Hu, Yuhan Ma, Ziming Zhao
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2508.07745
Pdf URL: https://arxiv.org/pdf/2508.07745
Copy Paste: [[2508.07745]] Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation(https://arxiv.org/abs/2508.07745)
Keywords: security, attack, large language model
Abstract: Insider threats, which can lead to severe losses, remain a major security concern. While machine learning-based insider threat detection (ITD) methods have shown promising results, their progress is hindered by the scarcity of high-quality data. Enterprise data is sensitive and rarely accessible. Publicly available datasets that are limited in scale due to cost lack sufficient real-world coverage, and purely synthetic ones fail to capture rich semantics and realistic user behavior. To address this, we propose Chimera, the first large language model (LLM)-based multi-agent framework that automatically simulates both benign and malicious insider activities and collects diverse logs across varied enterprise environments. Chimera models each employee with agents that have role-specific behavior and integrates modules for group meetings, pairwise interactions, and autonomous scheduling, capturing realistic organizational dynamics. It incorporates 15 types of insider attacks (e.g., IP theft, system sabotage) and has been deployed to simulate activities in three sensitive domains: technology company, finance corporation, and medical institution, producing a new dataset, ChimeraLog. We assess ChimeraLog via human studies and quantitative analysis, confirming its diversity, realism, and presence of explainable threat patterns. Evaluations of existing ITD methods show an average F1-score of 0.83, which is significantly lower than 0.99 on the CERT dataset, demonstrating ChimeraLog's higher difficulty and utility for advancing ITD research.

Title: Grouped Speculative Decoding for Autoregressive Image Generation

Authors: Junhyuk So, Juncheol Shin, Hyunho Kook, Eunhyeok Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07747
Pdf URL: https://arxiv.org/pdf/2508.07747
Copy Paste: [[2508.07747]] Grouped Speculative Decoding for Autoregressive Image Generation(https://arxiv.org/abs/2508.07747)
Keywords: diffusion, generative
Abstract: Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times, limiting their practical scalability. In this work, we introduce Grouped Speculative Decoding (GSD), a novel, training-free acceleration method for AR image models. While recent studies have explored Speculative Decoding (SD) as a means to speed up AR image generation, existing approaches either provide only modest acceleration or require additional training. Our in-depth analysis reveals a fundamental difference between language and image tokens: image tokens exhibit inherent redundancy and diversity, meaning multiple tokens can convey valid semantics. However, traditional SD methods are designed to accept only a single most-likely token, which fails to leverage this difference, leading to excessive false-negative rejections. To address this, we propose a new SD strategy that evaluates clusters of visually valid tokens rather than relying on a single target token. Additionally, we observe that static clustering based on embedding distance is ineffective, which motivates our dynamic GSD approach. Extensive experiments show that GSD accelerates AR image models by an average of 3.7x while preserving image quality, all without requiring any additional training. The source code is available at this https URL
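
Note: the contrast between single-token and cluster-level acceptance can be sketched as follows. The single-token rule `min(1, p/q)` is the usual speculative sampling acceptance test. The grouped variant, which compares summed probability mass over a cluster, is only an assumed reading of "evaluates clusters of visually valid tokens" and is not taken from the paper. The function name `accept_prob`, the token names, and the hard-coded `groups` mapping are hypothetical, and the abstract states that the authors build clusters dynamically rather than statically.

```python
def accept_prob(p, q, token, groups=None):
    """Probability of accepting a drafted token.

    p, q   : target and draft distributions as dicts token -> probability.
    groups : optional dict token -> cluster id (hypothetical clustering).
    """
    if groups is None:
        # Single-token rule: compare the probabilities of this exact token.
        p_mass, q_mass = p[token], q[token]
    else:
        # Grouped rule (assumption): compare the mass of the whole cluster.
        cluster = groups[token]
        members = [t for t in p if groups[t] == cluster]
        p_mass = sum(p[t] for t in members)
        q_mass = sum(q[t] for t in members)
    return min(1.0, p_mass / q_mass)


# Toy distributions: t1 and t2 are treated as interchangeable image tokens.
p = {"t1": 0.30, "t2": 0.30, "t3": 0.40}
q = {"t1": 0.60, "t2": 0.05, "t3": 0.35}
groups = {"t1": 0, "t2": 0, "t3": 1}
single = accept_prob(p, q, "t1")
grouped = accept_prob(p, q, "t1", groups)
```

Here the draft model over-commits to `t1`, so the single-token test rejects it half the time even though the target model places the same total mass on the interchangeable pair; the grouped test accepts far more often.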

Title: Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models

Authors: Zhenliang Zhang, Junzhe Zhang, Xinyu Hu, HuiXuan Zhang, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07753
Pdf URL: https://arxiv.org/pdf/2508.07753
Copy Paste: [[2508.07753]] Exploring Causal Effect of Social Bias on Faithfulness Hallucinations in Large Language Models(https://arxiv.org/abs/2508.07753)
Keywords: fair, large language model
Abstract: Large language models (LLMs) have achieved remarkable success in various tasks, yet they remain vulnerable to faithfulness hallucinations, where the output does not align with the input. In this study, we investigate whether social bias contributes to these hallucinations, a causal relationship that has not been explored. A key challenge is controlling confounders within the context, which complicates the isolation of causality between bias states and hallucinations. To address this, we utilize the Structural Causal Model (SCM) to establish and validate the causality and design bias interventions to control confounders. In addition, we develop the Bias Intervention Dataset (BID), which includes various social biases, enabling precise measurement of causal effects. Experiments on mainstream LLMs reveal that biases are significant causes of faithfulness hallucinations, and the effect of each bias state differs in direction. We further analyze the scope of these causal effects across various models, specifically focusing on unfairness hallucinations, which are primarily targeted by social bias, revealing the subtle yet significant causal effect of bias on hallucination generation.

Title: Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild

Authors: Haoran Wang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07759
Pdf URL: https://arxiv.org/pdf/2508.07759
Copy Paste: [[2508.07759]] Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild(https://arxiv.org/abs/2508.07759)
Keywords: diffusion, segmentation
Abstract: Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and incurs massive data and computational costs. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.

Title: Sparse Probabilistic Graph Circuits

Authors: Martin Rektoris, Milan Papež, Václav Šmídl, Tomáš Pevný
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07763
Pdf URL: https://arxiv.org/pdf/2508.07763
Copy Paste: [[2508.07763]] Sparse Probabilistic Graph Circuits(https://arxiv.org/abs/2508.07763)
Keywords: generative
Abstract: Deep generative models (DGMs) for graphs achieve impressively high expressive power thanks to very efficient and scalable neural networks. However, these networks contain non-linearities that prevent analytical computation of many standard probabilistic inference queries, i.e., these DGMs are considered intractable. While recently proposed Probabilistic Graph Circuits (PGCs) address this issue by enabling tractable probabilistic inference, they operate on dense graph representations with $\mathcal{O}(n^2)$ complexity for graphs with $n$ nodes and $m$ edges. To address this scalability issue, we introduce Sparse PGCs (SPGCs), a new class of tractable generative models that operate directly on sparse graph representations, reducing the complexity to $\mathcal{O}(n + m)$, which is particularly beneficial for $m \ll n^2$. In the context of de novo drug design, we empirically demonstrate that SPGCs retain exact inference capabilities, improve memory efficiency and inference speed, and match the performance of intractable DGMs in key metrics.
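
Note: the gap between $\mathcal{O}(n^2)$ and $\mathcal{O}(n + m)$ can be made concrete by counting storage cells for a dense adjacency matrix versus an adjacency list. This is a sketch of the two graph representations only, not of the probabilistic circuits themselves; the function names and the six-node chain (loosely standing in for a small molecule) are hypothetical.

```python
def dense_adjacency(n, edges):
    """n x n matrix: stores a cell for every node pair, edge or not."""
    mat = [[0] * n for _ in range(n)]
    for u, v in edges:
        mat[u][v] = mat[v][u] = 1
    return mat


def sparse_adjacency(n, edges):
    """Adjacency list: one slot per node plus two entries per undirected edge."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


n = 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # a simple chain, m = 5
dense_cells = sum(len(row) for row in dense_adjacency(n, edges))       # n * n
sparse_cells = n + sum(len(nb) for nb in sparse_adjacency(n, edges))   # n + 2m
```

For chain-like or otherwise sparse graphs, the dense count grows quadratically in the number of nodes while the sparse count grows linearly in nodes plus edges.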

Title: UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models

Authors: Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, Yanbin Hao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07766
Pdf URL: https://arxiv.org/pdf/2508.07766
Copy Paste: [[2508.07766]] UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models(https://arxiv.org/abs/2508.07766)
Keywords: large language model
Abstract: Unlike bitmap images, scalable vector graphics (SVG) maintain quality when scaled and are frequently employed in computer vision and artistic design, where they are represented as SVG code. In this era of proliferating AI-powered systems, enabling AI to understand and generate SVG has become increasingly urgent. However, AI-driven SVG understanding and generation (U&G) remain significant challenges. SVG code, equivalent to a set of curves and lines controlled by floating-point parameters, demands high precision in SVG U&G. Besides, SVG generation operates under diverse conditional constraints, including textual prompts and visual references, which requires powerful multi-modal processing for condition-to-SVG transformation. Recently, rapidly developing Multi-modal Large Language Models (MLLMs) have demonstrated the capability to process multi-modal inputs and generate complex vector controlling parameters, suggesting the potential to address SVG U&G tasks within a unified model. To unlock MLLMs' capabilities in the SVG area, we propose an SVG-centric dataset called UniSVG, comprising 525k data items, tailored for MLLM training and evaluation. To the best of our knowledge, it is the first comprehensive dataset designed for unified SVG generation (from textual prompts and images) and SVG understanding (color, category, usage, etc.). As expected, learning on the proposed dataset boosts open-source MLLMs' performance on various SVG U&G tasks, surpassing SOTA closed-source MLLMs like GPT-4V. We release the dataset, benchmark, weights, code, and experiment details on this https URL.

Title: Pareto Multi-Objective Alignment for Language Models

Authors: Qiang He, Setareh Maghsudi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.07768
Pdf URL: https://arxiv.org/pdf/2508.07768
Copy Paste: [[2508.07768]] Pareto Multi-Objective Alignment for Language Models(https://arxiv.org/abs/2508.07768)
Keywords: robust, large language model
Abstract: Large language models (LLMs) are increasingly deployed in real-world applications that require careful balancing of multiple, often conflicting, objectives, such as informativeness versus conciseness, or helpfulness versus creativity. However, current alignment methods, primarily based on RLHF, optimize LLMs toward a single reward function, resulting in rigid behavior that fails to capture the complexity and diversity of human preferences. This limitation hinders the adaptability of LLMs to practical scenarios, making multi-objective alignment (MOA) a critical yet underexplored area. To bridge this gap, we propose Pareto Multi-Objective Alignment (PAMA), a principled and computationally efficient algorithm designed explicitly for MOA in LLMs. In contrast to computationally prohibitive multi-objective optimization (MOO) methods, PAMA transforms multi-objective RLHF into a convex optimization with a closed-form solution, significantly enhancing scalability. Traditional MOO approaches suffer from prohibitive O(n^2*d) complexity, where d represents the number of model parameters, typically in the billions for LLMs, rendering direct optimization infeasible. PAMA reduces this complexity to O(n) where n is the number of objectives, enabling optimization to be completed within milliseconds. We provide theoretical guarantees that PAMA converges to a Pareto stationary point, where no objective can be improved without degrading at least one other. Extensive experiments across language models ranging from 125M to 7B parameters demonstrate PAMA's robust and effective MOA capabilities, aligning with its theoretical advantages. PAMA provides a highly efficient solution to the MOA problem that was previously considered intractable, offering a practical and theoretically grounded approach to aligning LLMs with diverse human values, paving the way for versatile and adaptable real-world AI deployments.
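
Note: the Pareto notion invoked in this abstract ("no objective can be improved without degrading at least one other") can be illustrated on a finite set of candidates. This sketch shows discrete Pareto optimality only; it is not PAMA and does not implement Pareto stationarity for gradient-based training. The score tuples, loosely read as (informativeness, conciseness), are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and better on at least one.

    Higher scores are assumed to be better.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (informativeness, conciseness) scores for five responses.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(candidates)
```

The three surviving points trade one objective against the other, whereas a single scalar reward would collapse them to one preferred point.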

Title: Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation

Authors: Xiaoyan Liu, Kangrui Li, Jiaxin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07769
Pdf URL: https://arxiv.org/pdf/2508.07769
Copy Paste: [[2508.07769]] Dream4D: Lifting Camera-Controlled I2V towards Spatiotemporally Consistent 4D Generation(https://arxiv.org/abs/2508.07769)
Keywords: diffusion
Abstract: The synthesis of spatiotemporally coherent 4D content presents fundamental challenges in computer vision, requiring simultaneous modeling of high-fidelity spatial representations and physically plausible temporal dynamics. Current approaches often struggle to maintain view consistency while handling complex scene dynamics, particularly in large-scale environments with multiple interacting elements. This work introduces Dream4D, a novel framework that bridges this gap through a synergy of controllable video generation and neural 4D reconstruction. Our approach seamlessly combines a two-stage architecture: it first predicts optimal camera trajectories from a single image using few-shot learning, then generates geometrically consistent multi-view sequences via a specialized pose-conditioned diffusion process, which are finally converted into a persistent 4D representation. This framework is the first to leverage both rich temporal priors from video diffusion models and geometric awareness of the reconstruction models, which significantly facilitates 4D generation and shows higher quality (e.g., mPSNR, mSSIM) over existing methods.

Title: Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)

Authors: Lennart Bastian, Mohammad Rashed, Nassir Navab, Tolga Birdal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07775
Pdf URL: https://arxiv.org/pdf/2508.07775
Copy Paste: [[2508.07775]] Forecasting Continuous Non-Conservative Dynamical Systems in SO(3)(https://arxiv.org/abs/2508.07775)
Keywords: robust
Abstract: Modeling the rotation of moving objects is a fundamental task in computer vision, yet $SO(3)$ extrapolation still presents numerous challenges: (1) unknown quantities such as the moment of inertia complicate dynamics, (2) the presence of external forces and torques can lead to non-conservative kinematics, and (3) estimating evolving state trajectories under sparse, noisy observations requires robustness. We propose modeling trajectories of noisy pose estimates on the manifold of 3D rotations in a physically and geometrically meaningful way by leveraging Neural Controlled Differential Equations guided with $SO(3)$ Savitzky-Golay paths. Existing extrapolation methods often rely on energy conservation or constant velocity assumptions, limiting their applicability in real-world scenarios involving non-conservative forces. In contrast, our approach is agnostic to energy and momentum conservation while being robust to input noise, making it applicable to complex, non-inertial systems. Our approach is easily integrated as a module in existing pipelines and generalizes well to trajectories with unknown physical parameters. By learning to approximate object dynamics from noisy states during training, our model attains robust extrapolation capabilities in simulation and various real-world settings. Code is available at this https URL

Title: Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts

Authors: Haoyuan Wu, Haoxing Chen, Xiaodong Chen, Zhanchao Zhou, Tieyuan Chen, Yihong Zhuang, Guoshan Lu, Zenan Huang, Junbo Zhao, Lin Liu, Zhenzhong Lan, Bei Yu, Jianguo Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07785
Pdf URL: https://arxiv.org/pdf/2508.07785
Copy Paste: [[2508.07785]] Grove MoE: Towards Efficient and Superior MoE LLMs with Adjugate Experts(https://arxiv.org/abs/2508.07785)
Keywords: large language model
Abstract: The Mixture of Experts (MoE) architecture is a cornerstone of modern state-of-the-art (SOTA) large language models (LLMs). MoE models facilitate scalability by enabling sparse parameter activation. However, traditional MoE architecture uses homogeneous experts of a uniform size, activating a fixed number of parameters irrespective of input complexity and thus limiting computational efficiency. To overcome this limitation, we introduce Grove MoE, a novel architecture incorporating experts of varying sizes, inspired by the heterogeneous big.LITTLE CPU architecture. This architecture features novel adjugate experts with a dynamic activation mechanism, enabling model capacity expansion while maintaining manageable computational overhead. Building on this architecture, we present GroveMoE-Base and GroveMoE-Inst, 33B-parameter LLMs developed by applying an upcycling strategy to the Qwen3-30B-A3B-Base model during mid-training and post-training. GroveMoE models dynamically activate 3.14-3.28B parameters based on token complexity and achieve performance comparable to SOTA open-source models of similar or even larger size.

Title: Anatomy-Aware Low-Dose CT Denoising via Pretrained Vision Models and Semantic-Guided Contrastive Learning

Authors: Runze Wang, Zeli Chen, Zhiyun Song, Wei Fang, Jiajin Zhang, Danyang Tu, Yuxing Tang, Minfeng Xu, Xianghua Ye, Le Lu, Dakai Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07788
Pdf URL: https://arxiv.org/pdf/2508.07788
Copy Paste: [[2508.07788]] Anatomy-Aware Low-Dose CT Denoising via Pretrained Vision Models and Semantic-Guided Contrastive Learning(https://arxiv.org/abs/2508.07788)
Keywords: segmentation
Abstract: To reduce radiation exposure and improve the diagnostic efficacy of low-dose computed tomography (LDCT), numerous deep learning-based denoising methods have been developed to mitigate noise and artifacts. However, most of these approaches ignore the anatomical semantics of human tissues, which may result in suboptimal denoising outcomes. To address this problem, we propose ALDEN, an anatomy-aware LDCT denoising method that integrates semantic features of pretrained vision models (PVMs) with adversarial and contrastive learning. Specifically, we introduce an anatomy-aware discriminator that dynamically fuses hierarchical semantic features from reference normal-dose CT (NDCT) via cross-attention mechanisms, enabling tissue-specific realism evaluation in the discriminator. In addition, we propose a semantic-guided contrastive learning module that enforces anatomical consistency by contrasting PVM-derived features from LDCT, denoised CT and NDCT, preserving tissue-specific patterns through positive pairs and suppressing artifacts via dual negative pairs. Extensive experiments conducted on two LDCT denoising datasets reveal that ALDEN achieves state-of-the-art performance, offering superior anatomy preservation and substantially reducing the over-smoothing issue of previous work. Further validation on a downstream multi-organ segmentation task (encompassing 117 anatomical structures) affirms the model's ability to maintain anatomical awareness.

Title: Boosting Active Defense Persistence: A Two-Stage Defense Framework Combining Interruption and Poisoning Against Deepfake

Authors: Hongrui Zheng, Yuezun Li, Liejun Wang, Yunfeng Diao, Zhiqing Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07795
Pdf URL: https://arxiv.org/pdf/2508.07795
Copy Paste: [[2508.07795]] Boosting Active Defense Persistence: A Two-Stage Defense Framework Combining Interruption and Poisoning Against Deepfake(https://arxiv.org/abs/2508.07795)
Keywords: protect, defense, attack
Abstract: Active defense strategies have been developed to counter the threat of deepfake technology. However, a primary challenge is their lack of persistence, as their effectiveness is often short-lived. Attackers can bypass these defenses by simply collecting protected samples and retraining their models. This means that static defenses inevitably fail when attackers retrain their models, which severely limits practical use. We argue that an effective defense not only distorts forged content but also blocks the model's ability to adapt, which occurs when attackers retrain their models on protected images. To achieve this, we propose an innovative Two-Stage Defense Framework (TSDF). Benefiting from the intensity separation mechanism designed in this paper, the framework uses dual-function adversarial perturbations to perform two roles. First, it can directly distort the forged results. Second, it acts as a poisoning vehicle that disrupts the data preparation process essential for an attacker's retraining pipeline. By poisoning the data source, TSDF aims to prevent the attacker's model from adapting to the defensive perturbations, thus ensuring the defense remains effective long-term. Comprehensive experiments show that the performance of traditional interruption methods degrades sharply when they are subjected to adversarial retraining. However, our framework shows a strong dual defense capability, which can improve the persistence of active defense. Our code will be available at this https URL.

Title: Power Battery Detection

Authors: Xiaoqi Zhao, Peiqian Cao, Lihe Zhang, Zonglei Feng, Hanqi Liu, Jiaming Zuo, Youwei Pang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07797
Pdf URL: https://arxiv.org/pdf/2508.07797
Copy Paste: [[2508.07797]] Power Battery Detection(https://arxiv.org/abs/2508.07797)
Keywords: robust, segmentation
Abstract: Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and draw more attention to this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at this https URL (PBD5K).

Title: MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks

Authors: Yushen Xu, Xiaosong Li, Zhenyu Kuang, Xiaoqi Cheng, Haishu Tan, Huafeng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07803
Pdf URL: https://arxiv.org/pdf/2508.07803
Copy Paste: [[2508.07803]] MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks(https://arxiv.org/abs/2508.07803)
Keywords: large language model, segmentation
Abstract: The goal of multimodal image fusion is to integrate complementary information from infrared and visible images, generating multimodal fused images for downstream tasks. Existing downstream pre-training models are typically trained on visible images. However, the significant pixel distribution differences between visible and multimodal fusion images can degrade downstream task performance, sometimes even below that of using only visible images. This paper explores adapting multimodal fused images with significant modality differences to object detection and semantic segmentation models trained on visible images. To address this, we propose MambaTrans, a novel multimodal fusion image modality translator. MambaTrans uses descriptions from a multimodal large language model and masks from semantic segmentation models as input. Its core component, the Multi-Model State Space Block, combines mask-image-text cross-attention and a 3D-Selective Scan Module, enhancing pure visual capabilities. By leveraging object detection prior knowledge, MambaTrans minimizes detection loss during training and captures long-term dependencies among text, masks, and images. This enables favorable results in pre-trained models without adjusting their parameters. Experiments on public datasets show that MambaTrans effectively improves multimodal image performance in downstream tasks.

Title: Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning

Authors: Bao Li, Xiaomei Zhang, Miao Xu, Zhaoxin Fan, Xiangyu Zhu, Zhen Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07804
Pdf URL: https://arxiv.org/pdf/2508.07804
Copy Paste: [[2508.07804]] Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning(https://arxiv.org/abs/2508.07804)
Keywords: large language model
Abstract: Generating 3D human poses from multimodal inputs such as images or text requires models to capture both rich spatial and semantic correspondences. While pose-specific multimodal large language models (MLLMs) have shown promise in this task, they are typically trained with supervised objectives such as SMPL parameter regression or token-level prediction, which struggle to model the inherent ambiguity and achieve task-specific alignment required for accurate 3D pose generation. To address these limitations, we propose Pose-RFT, a reinforcement fine-tuning framework tailored for 3D human pose generation in MLLMs. We formulate the task as a hybrid action reinforcement learning problem that jointly optimizes discrete language prediction and continuous pose generation. To this end, we introduce HyGRPO, a hybrid reinforcement learning algorithm that performs group-wise reward normalization over sampled responses to guide joint optimization of discrete and continuous actions. Pose-RFT further incorporates task-specific reward functions to guide optimization towards spatial alignment in image-to-pose generation and semantic consistency in text-to-pose generation. Extensive experiments on multiple pose generation benchmarks demonstrate that Pose-RFT significantly improves performance over existing pose-specific MLLMs, validating the effectiveness of hybrid action reinforcement fine-tuning for 3D pose generation.
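As a rough illustration of the group-wise reward normalization mentioned in the abstract above, the sketch below standardizes rewards within one group of sampled responses. The abstract does not give HyGRPO's formula; the mean and standard-deviation form, the function name `group_normalize`, and the `eps` constant are assumptions borrowed from common GRPO-style practice, not the authors' implementation.

```python
import statistics


def group_normalize(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Assumption: advantage_i = (r_i - mean) / (std + eps), the usual
    GRPO-style form. HyGRPO's exact formula is not stated in the abstract.
    """
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_normalize([1.0, 0.0, 0.0, 1.0])
```

Responses that score above the group mean receive positive advantages and those below receive negative ones; a group with identical rewards yields all-zero advantages, so it contributes no learning signal under this form.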

Title: Can You Trick the Grader? Adversarial Persuasion of LLM Judges

Authors: Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07805
Pdf URL: https://arxiv.org/pdf/2508.07805
Copy Paste: [[2508.07805]] Can You Trick the Grader? Adversarial Persuasion of LLM Judges(https://arxiv.org/abs/2508.07805)
Keywords: defense, attack, robust, fair, large language model
Abstract: As large language models take on growing roles as automated evaluators in practical settings, a critical question arises: Can individuals persuade an LLM judge to assign unfairly high scores? This study is the first to reveal that strategically embedded persuasive language can bias LLM judges when scoring mathematical reasoning tasks, where correctness should be independent of stylistic variation. Grounded in Aristotle's rhetorical principles, we formalize seven persuasion techniques (Majority, Consistency, Flattery, Reciprocity, Pity, Authority, Identity) and embed them into otherwise identical responses. Across six math benchmarks, we find that persuasive language leads LLM judges to assign inflated scores to incorrect solutions, by up to 8% on average, with Consistency causing the most severe distortion. Notably, increasing model size does not substantially mitigate this vulnerability. Further analysis demonstrates that combining multiple persuasion techniques amplifies the bias, and pairwise evaluation is likewise susceptible. Moreover, the persuasive effect persists under counter prompting strategies, highlighting a critical vulnerability in LLM-as-a-Judge pipelines and underscoring the need for robust defenses against persuasion-based attacks.

Title: Topological Feature Compression for Molecular Graph Neural Networks

Authors: Rahul Khorana
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07807
Pdf URL: https://arxiv.org/pdf/2508.07807
Copy Paste: [[2508.07807]] Topological Feature Compression for Molecular Graph Neural Networks(https://arxiv.org/abs/2508.07807)
Keywords: robust, interpretability
Abstract: Recent advances in molecular representation learning have produced highly effective encodings of molecules for numerous cheminformatics and bioinformatics tasks. However, extracting general chemical insight while balancing predictive accuracy, interpretability, and computational efficiency remains a major challenge. In this work, we introduce a novel Graph Neural Network (GNN) architecture that combines compressed higher-order topological signals with standard molecular features. Our approach captures global geometric information while preserving computational tractability and human-interpretable structure. We evaluate our model across a range of benchmarks, from small-molecule datasets to complex material datasets, and demonstrate superior performance using a parameter-efficient architecture. We achieve the best-performing results in both accuracy and robustness across almost all benchmarks. We open source all code; code and results can be found on GitHub at this https URL.

Title: EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

Authors: Huanyu Liu, Jia Li, Chang Yu, Taozhi Chen, Yihong Dong, Lecheng Wang, Hu XiaoLong, Ge Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07809
Pdf URL: https://arxiv.org/pdf/2508.07809
Copy Paste: [[2508.07809]] EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning(https://arxiv.org/abs/2508.07809)
Keywords: large language model
Abstract: Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when the rollout accuracy is low on hard problems, the reward becomes sparse, limiting learning efficiency and causing exploration bottlenecks. Existing approaches either rely on stronger LLMs for distillation or filter out difficult problems, which limits scalability or restricts reasoning improvement through exploration. We propose EvoCoT, a self-evolving curriculum learning framework based on two-stage chain-of-thought (CoT) reasoning optimization. EvoCoT constrains the exploration space by self-generating and verifying CoT trajectories, then gradually shortens them to expand the space in a controlled way. This enables LLMs to stably learn from initially unsolved hard problems under sparse rewards. We apply EvoCoT to multiple LLM families, including Qwen, DeepSeek, and Llama. Experiments show that EvoCoT enables LLMs to solve previously unsolved problems, improves reasoning capability without external CoT supervision, and is compatible with various RL fine-tuning methods. We release the source code to support future research.

Title: Evaluating Compositional Approaches for Focus and Sentiment Analysis

Authors: Olga Kellert, Muhammad Imran, Nicholas Hill Matlis, Mahmud Uz Zaman, Carlos Gómez-Rodríguez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07810
Pdf URL: https://arxiv.org/pdf/2508.07810
Copy Paste: [[2508.07810]] Evaluating Compositional Approaches for Focus and Sentiment Analysis(https://arxiv.org/abs/2508.07810)
Keywords: interpretability, explainability
Abstract: This paper summarizes the results of evaluating a compositional approach for Focus Analysis (FA) in Linguistics and Sentiment Analysis (SA) in Natural Language Processing (NLP). While quantitative evaluations of compositional and non-compositional approaches in SA exist in NLP, similar quantitative evaluations are very rare in FA in Linguistics, which deals with linguistic expressions representing focus or emphasis such as "it was John who left". We fill this gap in research by arguing that compositional rules in SA also apply to FA because FA and SA are closely related, meaning that SA is part of FA. Our compositional approach in SA exploits basic syntactic rules such as rules of modification, coordination, and negation represented in the formalism of Universal Dependencies (UDs) in English and applied to words representing sentiments from sentiment dictionaries. Some of the advantages of our compositional analysis method for SA in contrast to non-compositional analysis methods are interpretability and explainability. We test the accuracy of our compositional approach and compare it with a non-compositional approach, VADER, that uses simple heuristic rules to deal with negation, coordination and modification. In contrast to previous related work that evaluates compositionality in SA on long reviews, this study uses more appropriate datasets to evaluate compositionality. In addition, we generalize the results of compositional approaches in SA to compositional approaches in FA.

Title: DiTVR: Zero-Shot Diffusion Transformer for Video Restoration

Authors: Sicheng Gao, Nancy Mehta, Zongwei Wu, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07811
Pdf URL: https://arxiv.org/pdf/2508.07811
Copy Paste: [[2508.07811]] DiTVR: Zero-Shot Diffusion Transformer for Video Restoration(https://arxiv.org/abs/2508.07811)
Keywords: robust, diffusion, transformer, generative
Abstract: Video restoration aims to reconstruct high quality video sequences from low quality inputs, addressing tasks such as super resolution, denoising, and deblurring. Traditional regression based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero shot video restoration framework that couples a diffusion transformer with trajectory aware attention and a wavelet guided, flow consistent sampler. Unlike prior 3D convolutional or frame wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow guided sampler injects data consistency only into low-frequency bands, preserving high frequency priors while accelerating convergence. DiTVR establishes a new zero shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.

Title: Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models

Authors: Chenyue Song, Chen Hui, Haiqi Zhu, Feng Jiang, Yachun Mi, Wei Zhang, Shaohui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07818
Pdf URL: https://arxiv.org/pdf/2508.07818
Copy Paste: [[2508.07818]] Segmenting and Understanding: Region-aware Semantic Attention for Fine-grained Image Quality Assessment with Large Language Models(https://arxiv.org/abs/2508.07818)
Keywords: robust, large language model
Abstract: No-reference image quality assessment (NR-IQA) aims to simulate the process of perceiving image quality aligned with subjective human perception. However, existing NR-IQA methods either focus on global representations, which leads to limited insights into the semantically salient regions, or employ a uniform weighting for region features, which weakens the sensitivity to local quality variations. In this paper, we propose a fine-grained image quality assessment model, named RSFIQA, which integrates region-level distortion information to perceive multi-dimensional quality discrepancies. To enhance regional quality awareness, we first utilize the Segment Anything Model (SAM) to dynamically partition the input image into non-overlapping semantic regions. For each region, we teach a powerful Multi-modal Large Language Model (MLLM) to extract descriptive content and perceive multi-dimensional distortions, enabling a comprehensive understanding of both local semantics and quality degradations. To effectively leverage this information, we introduce a Region-Aware Semantic Attention (RSA) mechanism, which generates a global attention map by aggregating fine-grained representations from local regions. In addition, RSFIQA is backbone-agnostic and can be seamlessly integrated into various deep neural network architectures. Extensive experiments demonstrate the robustness and effectiveness of the proposed method, which achieves competitive quality prediction performance across multiple benchmark datasets.

Title: Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP

Authors: Ke Ma, Jun Long, Hongxiao Fei, Liujie Hua, Yueyi Luo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07819
Pdf URL: https://arxiv.org/pdf/2508.07819
Copy Paste: [[2508.07819]] Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP(https://arxiv.org/abs/2508.07819)
Keywords: robust
Abstract: Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.

Title: Evaluating Large Language Models as Expert Annotators

Authors: Yu-Min Tseng, Wei-Lin Chen, Chung-Chi Chen, Hsin-Hsi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07827
Pdf URL: https://arxiv.org/pdf/2508.07827
Copy Paste: [[2508.07827]] Evaluating Large Language Models as Expert Annotators(https://arxiv.org/abs/2508.07827)
Keywords: large language model
Abstract: Textual data annotation, the process of labeling or tagging text with relevant information, is typically costly, time-consuming, and labor-intensive. While large language models (LLMs) have demonstrated their potential as direct alternatives to human annotators for general-domain natural language processing (NLP) tasks, their effectiveness on annotation tasks in domains requiring expert knowledge remains underexplored. In this paper, we investigate whether top-performing LLMs, which might be perceived as having expert-level proficiency in academic and professional benchmarks, can serve as direct alternatives to human expert annotators. To this end, we evaluate both individual LLMs and multi-agent approaches across three highly specialized domains: finance, biomedicine, and law. Specifically, we propose a multi-agent discussion framework to simulate a group of human annotators, where LLMs are tasked to engage in discussions by considering others' annotations and justifications before finalizing their labels. Additionally, we incorporate reasoning models (e.g., o3-mini) to enable a more comprehensive comparison. Our empirical results reveal that: (1) Individual LLMs equipped with inference-time techniques (e.g., chain-of-thought (CoT), self-consistency) show only marginal or even negative performance gains, contrary to prior literature suggesting their broad effectiveness. (2) Overall, reasoning models do not demonstrate statistically significant improvements over non-reasoning models in most settings. This suggests that extended long CoT provides relatively limited benefits for data annotation in specialized domains. (3) Certain model behaviors emerge in the multi-agent discussion environment. For instance, Claude 3.7 Sonnet with thinking rarely changes its initial annotations, even when other agents provide correct annotations or valid reasoning.

Title: A Comparative Analysis of Lightweight Hash Functions Using AVR ATXMega128 and ChipWhisperer

Authors: Mohsin Khan, Dag Johansen, Håvard Dagenborg
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.07840
Pdf URL: https://arxiv.org/pdf/2508.07840
Copy Paste: [[2508.07840]] A Comparative Analysis of Lightweight Hash Functions Using AVR ATXMega128 and ChipWhisperer(https://arxiv.org/abs/2508.07840)
Keywords: security
Abstract: Lightweight hash functions have become important building blocks for security in embedded and IoT systems. A plethora of algorithms have been proposed and standardized, providing a wide range of performance trade-off options for developers to choose from. This paper presents a comparative analysis of 22 key software-based lightweight hash functions, including the finalist from the SHA-3 competition. We use a novel benchmark methodology that combines an AVR ATXMega128 microcontroller with the ChipWhisperer cryptanalysis platform and evaluate and compare the various hash functions along several dimensions, including execution speed (measured in Cycles per Byte, CpB), memory footprint, and energy consumption. Using the composite E-RANK metric, we provide new insight into the various trade-offs each hash function offers to system developers.
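For readers unfamiliar with the Cycles per Byte (CpB) metric named in the abstract above, here is a minimal sketch of how such a figure is conventionally computed: total clock cycles divided by the number of message bytes hashed. The function name `cycles_per_byte` and the sample numbers are hypothetical and are not taken from the paper, which may measure or normalize differently.

```python
def cycles_per_byte(total_cycles, message_bytes):
    """Return clock cycles spent per input byte (lower is faster)."""
    if message_bytes <= 0:
        raise ValueError("message_bytes must be positive")
    return total_cycles / message_bytes


# Hypothetical measurement for illustration only, not a result from the paper:
# 64,000 cycles to hash a 128-byte message.
cpb = cycles_per_byte(total_cycles=64_000, message_bytes=128)
```

Because the metric is normalized by message length, it allows comparison across hash functions that were benchmarked on inputs of different sizes, although fixed per-message setup costs can inflate CpB for very short inputs.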

Title: Learning Satellite Attitude Dynamics with Physics-Informed Normalising Flow

Authors: Carlo Cena, Mauro Martini, Marcello Chiaberge
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2508.07841
Pdf URL: https://arxiv.org/pdf/2508.07841
Copy Paste: [[2508.07841]] Learning Satellite Attitude Dynamics with Physics-Informed Normalising Flow(https://arxiv.org/abs/2508.07841)
Keywords: robust
Abstract: Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that the inclusion of physics-based information improves the mean relative error of the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, yielding improvements of up to 42.86% in performance stability error and increased robustness to noise.
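The Real NVP architecture named in the abstract above is built from affine coupling layers, which are invertible by construction. The sketch below shows one such layer in plain Python under stated assumptions: `scale_net` and `shift_net` are hypothetical fixed stand-ins for learned networks, and nothing here reproduces the paper's self-attention mechanism, physics-informed loss, or Basilisk data.

```python
import math


def scale_net(x1):
    # Hypothetical stand-in for a learned scale network s(x1).
    return [math.tanh(v) for v in x1]


def shift_net(x1):
    # Hypothetical stand-in for a learned shift network t(x1).
    return [0.5 * v for v in x1]


def coupling_forward(x1, x2):
    """Affine coupling: y1 = x1, y2 = x2 * exp(s(x1)) + t(x1)."""
    s, t = scale_net(x1), shift_net(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    log_det = sum(s)  # log |det Jacobian| of the transform
    return list(x1), y2, log_det


def coupling_inverse(y1, y2):
    """Exact inverse: x1 = y1, x2 = (y2 - t(y1)) * exp(-s(y1))."""
    s, t = scale_net(y1), shift_net(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return list(y1), x2


# Round trip on an arbitrary example input split into two halves.
x1, x2 = [0.3, -1.2], [2.0, 0.7]
y1, y2, log_det = coupling_forward(x1, x2)
x1_rec, x2_rec = coupling_inverse(y1, y2)
```

The first half passes through unchanged, which is what makes the inverse and the Jacobian log-determinant cheap to compute; stacking layers that alternate which half is transformed lets every coordinate be updated.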

Title: Large Language Models for Czech Aspect-Based Sentiment Analysis

Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07860
Pdf URL: https://arxiv.org/pdf/2508.07860
Copy Paste: [[2508.07860]] Large Language Models for Czech Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2508.07860)
Keywords: large language model
Abstract: Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that aims to identify sentiment toward specific aspects of an entity. While large language models (LLMs) have shown strong performance in various natural language processing (NLP) tasks, their capabilities for Czech ABSA remain largely unexplored. In this work, we conduct a comprehensive evaluation of 19 LLMs of varying sizes and architectures on Czech ABSA, comparing their performance in zero-shot, few-shot, and fine-tuning scenarios. Our results show that small domain-specific models fine-tuned for ABSA outperform general-purpose LLMs in zero-shot and few-shot settings, while fine-tuned LLMs achieve state-of-the-art results. We analyze how factors such as multilingualism, model size, and recency influence performance and present an error analysis highlighting key challenges, particularly in aspect term prediction. Our findings provide insights into the suitability of LLMs for Czech ABSA and offer guidance for future research in this area.

Title: EFU: Enforcing Federated Unlearning via Functional Encryption

Authors: Samaneh Mohammadi, Vasileios Tsouvalas, Iraklis Symeonidis, Ali Balador, Tanir Ozcelebi, Francesco Flammini, Nirvana Meratnia
Subjects: cs.CR, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07873
Pdf URL: https://arxiv.org/pdf/2508.07873
Copy Paste: [[2508.07873]] EFU: Enforcing Federated Unlearning via Functional Encryption(https://arxiv.org/abs/2508.07873)
Keywords: secure, privacy, federate
Abstract: Federated unlearning (FU) algorithms allow clients in federated settings to exercise their "right to be forgotten" by removing the influence of their data from a collaboratively trained model. Existing FU methods maintain data privacy by performing unlearning locally on the client-side and sending targeted updates to the server without exposing forgotten data; yet they often rely on server-side cooperation, revealing the client's intent and identity without enforcement guarantees - compromising autonomy and unlearning privacy. In this work, we propose EFU (Enforced Federated Unlearning), a cryptographically enforced FU framework that enables clients to initiate unlearning while concealing its occurrence from the server. Specifically, EFU leverages functional encryption to bind encrypted updates to specific aggregation functions, ensuring the server can neither perform unauthorized computations nor detect or skip unlearning requests. To further mask behavioral and parameter shifts in the aggregated model, we incorporate auxiliary unlearning losses based on adversarial examples and parameter importance regularization. Extensive experiments show that EFU achieves near-random accuracy on forgotten data while maintaining performance comparable to full retraining across datasets and neural architectures - all while concealing unlearning intent from the server. Furthermore, we demonstrate that EFU is agnostic to the underlying unlearning algorithm, enabling secure, function-hiding, and verifiable unlearning for any client-side FU mechanism that issues targeted updates.

Title: Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant

Authors: Sabrina Namazova, Alessandra Brondetta, Younes Strittmatter, Matthew Nassar, Sebastian Musslick
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07887
Pdf URL: https://arxiv.org/pdf/2508.07887
Copy Paste: [[2508.07887]] Not Yet AlphaFold for the Mind: Evaluating Centaur as a Synthetic Participant(https://arxiv.org/abs/2508.07887)
Keywords: generative, large language model
Abstract: Simulators have revolutionized scientific practice across the natural sciences. By generating data that reliably approximate real-world phenomena, they enable scientists to accelerate hypothesis testing and optimize experimental designs. This is perhaps best illustrated by AlphaFold, a Nobel-prize winning simulator in chemistry that predicts protein structures from amino acid sequences, enabling rapid prototyping of molecular interactions, drug targets, and protein functions. In the behavioral sciences, a reliable participant simulator - a system capable of producing human-like behavior across cognitive tasks - would represent a similarly transformative advance. Recently, Binz et al. introduced Centaur, a large language model (LLM) fine-tuned on human data from 160 experiments, proposing its use not only as a model of cognition but also as a participant simulator for "in silico prototyping of experimental studies", e.g., to advance automated cognitive science. Here, we review the core criteria for a participant simulator and assess how well Centaur meets them. Although Centaur demonstrates strong predictive accuracy, its generative behavior - a critical criterion for a participant simulator - systematically diverges from human data. This suggests that, while Centaur is a significant step toward predicting human behavior, it does not yet meet the standards of a reliable participant simulator or an accurate model of cognition.

Title: Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation

Authors: Bowen Xue, Qixin Yan, Wenjing Wang, Hao Liu, Chen Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07901
Pdf URL: https://arxiv.org/pdf/2508.07901
Copy Paste: [[2508.07901]] Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation(https://arxiv.org/abs/2508.07901)
Keywords: generative
Abstract: Generating high-fidelity human videos that match user-specified identities is important yet challenging in the field of generative AI. Existing methods often rely on an excessive number of training parameters and lack compatibility with other AIGC tools. In this paper, we propose Stand-In, a lightweight and plug-and-play framework for identity preservation in video generation. Specifically, we introduce a conditional image branch into the pre-trained video generation model. Identity control is achieved through restricted self-attentions with conditional position mapping, and can be learned quickly with only 2000 pairs. Despite incorporating and training just ~1% additional parameters, our framework achieves excellent results in video quality and identity preservation, outperforming other full-parameter training methods. Moreover, our framework can be seamlessly integrated for other tasks, such as subject-driven video generation, pose-referenced video generation, stylization, and face swapping.

Title: Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity

Authors: Chen Cecilia Liu, Hiba Arnaout, Nils Kovačić, Dana Atzil-Slonim, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07902
Pdf URL: https://arxiv.org/pdf/2508.07902
Copy Paste: [[2508.07902]] Tailored Emotional LLM-Supporter: Enhancing Cultural Sensitivity(https://arxiv.org/abs/2508.07902)
Keywords: large language model
Abstract: Large language models (LLMs) show promise in offering emotional support and generating empathetic responses for individuals in distress, but their ability to deliver culturally sensitive support remains underexplored due to lack of resources. In this work, we introduce CultureCare, the first dataset designed for this task, spanning four cultures and including 1729 distress messages, 1523 cultural signals, and 1041 support strategies with fine-grained emotional and cultural annotations. Leveraging CultureCare, we (i) develop and test four adaptation strategies for guiding three state-of-the-art LLMs toward culturally sensitive responses; (ii) conduct comprehensive evaluations using LLM judges, in-culture human annotators, and clinical psychologists; (iii) show that adapted LLMs outperform anonymous online peer responses, and that simple cultural role-play is insufficient for cultural sensitivity; and (iv) explore the application of LLMs in clinical training, where experts highlight their potential in fostering cultural competence in future therapists.

Title: Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models

Authors: Johanna P. Müller, Anika Knupfer, Pedro Blöss, Edoardo Berardi Vittur, Bernhard Kainz, Jana Hutter
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07903
Pdf URL: https://arxiv.org/pdf/2508.07903
Copy Paste: [[2508.07903]] Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models(https://arxiv.org/abs/2508.07903)
Keywords: privacy, robust, diffusion, generative
Abstract: Despite significant progress in generative modelling, existing diffusion models often struggle to produce anatomically precise female pelvic images, limiting their application in gynaecological imaging, where data scarcity and patient privacy concerns are critical. To overcome these barriers, we introduce a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Our approach generates anatomically coherent, high fidelity synthetic images that closely mimic real scans and provide valuable resources for training robust diagnostic models. We evaluate generative quality using advanced perceptual and distributional metrics, benchmarking against standard reconstruction methods, and demonstrate substantial gains in diagnostic accuracy on a key classification task. A blinded expert evaluation further validates the clinical realism of our synthetic images. We release our models with privacy safeguards and a comprehensive synthetic uterine MRI dataset to support reproducible research and advance equitable AI in gynaecology.

Title: Generative Video Matting

Authors: Yongtao Ge, Kangyang Xie, Guangkai Xu, Mingyu Liu, Li Ke, Longtao Huang, Hui Xue, Hao Chen, Chunhua Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07905
Pdf URL: https://arxiv.org/pdf/2508.07905
Copy Paste: [[2508.07905]] Generative Video Matting(https://arxiv.org/abs/2508.07905)
Keywords: diffusion, generative, segmentation
Abstract: Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited onto background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hair, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at this https URL.

Title: RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering

Authors: Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, Mukesh Prasad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07918
Pdf URL: https://arxiv.org/pdf/2508.07918
Copy Paste: [[2508.07918]] RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering(https://arxiv.org/abs/2508.07918)
Keywords: large language model, segmentation
Abstract: Visual Question Answering (VQA) in remote sensing (RS) is pivotal for interpreting Earth observation data. However, existing RS VQA datasets are constrained by limitations in annotation richness, question diversity, and the assessment of specific reasoning capabilities. This paper introduces the RSVLM-QA dataset, a new large-scale, content-rich VQA dataset for the RS domain. RSVLM-QA is constructed by integrating data from several prominent RS segmentation and detection datasets: WHU, LoveDA, INRIA, and iSAID. We employ an innovative dual-track annotation generation pipeline. Firstly, we leverage Large Language Models (LLMs), specifically GPT-4.1, with meticulously designed prompts to automatically generate a suite of detailed annotations including image captions, spatial relations, and semantic tags, alongside complex caption-based VQA pairs. Secondly, to address the challenging task of object counting in RS imagery, we have developed a specialized automated process that extracts object counts directly from the original segmentation data; GPT-4.1 then formulates natural language answers from these counts, which are paired with preset question templates to create counting QA pairs. RSVLM-QA comprises 13,820 images and 162,373 VQA pairs, featuring extensive annotations and diverse question types. We provide a detailed statistical analysis of the dataset and a comparison with existing RS VQA benchmarks, highlighting the superior depth and breadth of RSVLM-QA's annotations. Furthermore, we conduct benchmark experiments on six mainstream Vision Language Models (VLMs), demonstrating that RSVLM-QA effectively evaluates and challenges the understanding and reasoning abilities of current VLMs in the RS domain. We believe RSVLM-QA will serve as a pivotal resource for the RS VQA and VLM research communities, poised to catalyze advancements in the field.

Title: Safeguarding Generative AI Applications in Preclinical Imaging through Hybrid Anomaly Detection

Authors: Jakub Binda, Valentina Paneta, Vasileios Eleftheriadis, Hongkyou Chung, Panagiotis Papadimitroulas, Neo Christopher Chung
Subjects: cs.CV, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2508.07923
Pdf URL: https://arxiv.org/pdf/2508.07923
Copy Paste: [[2508.07923]] Safeguarding Generative AI Applications in Preclinical Imaging through Hybrid Anomaly Detection(https://arxiv.org/abs/2508.07923)
Keywords: robust, generative
Abstract: Generative AI holds great potential to automate and enhance data synthesis in nuclear medicine. However, the high-stakes nature of biomedical imaging necessitates robust mechanisms to detect and manage unexpected or erroneous model behavior. We present the development and implementation of a hybrid anomaly detection framework to safeguard GenAI models in BIOEMTECH's eyes(TM) systems. Two applications are demonstrated: Pose2Xray, which generates synthetic X-rays from photographic mouse images, and DosimetrEYE, which estimates 3D radiation dose maps from 2D SPECT/CT scans. In both cases, our outlier detection (OD) enhances reliability, reduces manual oversight, and supports real-time quality control. This approach strengthens the industrial viability of GenAI in preclinical settings by increasing robustness, scalability, and regulatory compliance.

Title: Score Augmentation for Diffusion Models

Authors: Liang Hou, Yuan Gao, Boyuan Jiang, Xin Tao, Qi Yan, Renjie Liao, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07926
Pdf URL: https://arxiv.org/pdf/2508.07926
Copy Paste: [[2508.07926]] Score Augmentation for Diffusion Models(https://arxiv.org/abs/2508.07926)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved remarkable success in generative modeling. However, this study confirms the existence of overfitting in diffusion model training, particularly in data-limited regimes. To address this challenge, we propose Score Augmentation (ScoreAug), a novel data augmentation framework specifically designed for diffusion models. Unlike conventional augmentation approaches that operate on clean data, ScoreAug applies transformations to noisy data, aligning with the inherent denoising mechanism of diffusion. Crucially, ScoreAug further requires the denoiser to predict the augmentation of the original target. This design establishes an equivariant learning objective, enabling the denoiser to learn scores across varied denoising spaces, thereby realizing what we term score augmentation. We also theoretically analyze the relationship between scores in different spaces under general transformations. In experiments, we extensively validate ScoreAug on multiple benchmarks including CIFAR-10, FFHQ, AFHQv2, and ImageNet, with results demonstrating significant performance improvements over baselines. Notably, ScoreAug effectively mitigates overfitting across diverse scenarios, such as varying data scales and model capacities, while exhibiting stable convergence properties. Another advantage of ScoreAug over standard data augmentation lies in its ability to circumvent data leakage issues under certain conditions. Furthermore, we show that ScoreAug can be synergistically combined with traditional data augmentation techniques to achieve additional performance gains.
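
Illustrative sketch (not from the paper): the abstract says ScoreAug transforms the noisy data and asks the denoiser to predict the augmentation of the original target. The snippet below builds one such training pair in plain Python. The additive Gaussian noising rule, the horizontal flip used as the transformation, and the names `make_scoreaug_pair` and `flip` are all assumptions made for illustration; the paper's actual noise schedule, transformations, and loss are not given in the abstract.

```python
import random


def flip(x):
    """Example linear augmentation (assumed): reverse the order of the entries."""
    return list(reversed(x))


def make_scoreaug_pair(x0, sigma, transform, rng):
    """Build one (network input, regression target) pair from a clean sample x0.

    Assumed noising rule: x_noisy = x0 + sigma * eps with eps ~ N(0, I).
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_noisy = [xi + sigma * ei for xi, ei in zip(x0, eps)]
    net_input = transform(x_noisy)  # augmentation is applied to the noisy data
    target = transform(x0)          # denoiser is asked for the augmented clean data
    return net_input, target, eps


rng = random.Random(0)
x0 = [1.0, 2.0, 3.0, 4.0]
sigma = 0.5
net_input, target, eps = make_scoreaug_pair(x0, sigma, flip, rng)

# Because the flip is linear, the residual between input and target is just
# the flipped noise scaled by sigma.
residual = [a - b for a, b in zip(net_input, target)]
expected_residual = [sigma * e for e in flip(eps)]
```

A conventional augmentation pipeline would instead flip `x0` first and add noise afterwards; for a linear, norm-preserving map such as a flip the two orders give inputs with the same distribution, so the difference only becomes visible with more general transformations.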

Title: Shapley-Inspired Feature Weighting in $k$-means with No Additional Hyperparameters

Authors: Richard J. Fawley, Renato Cordeiro de Amorim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.07952
Pdf URL: https://arxiv.org/pdf/2508.07952
Copy Paste: [[2508.07952]] Shapley-Inspired Feature Weighting in $k$-means with No Additional Hyperparameters(https://arxiv.org/abs/2508.07952)
Keywords: robust
Abstract: Clustering algorithms often assume all features contribute equally to the data structure, an assumption that usually fails in high-dimensional or noisy settings. Feature weighting methods can address this, but most require additional parameter tuning. We propose SHARK (Shapley Reweighted $k$-means), a feature-weighted clustering algorithm motivated by the use of Shapley values from cooperative game theory to quantify feature relevance, which requires no additional parameters beyond those in $k$-means. We prove that the $k$-means objective can be decomposed into a sum of per-feature Shapley values, providing an axiomatic foundation for unsupervised feature relevance and reducing Shapley computation from exponential to polynomial time. SHARK iteratively re-weights features by the inverse of their Shapley contribution, emphasising informative dimensions and down-weighting irrelevant ones. Experiments on synthetic and real-world data sets show that SHARK consistently matches or outperforms existing methods, achieving superior robustness and accuracy, particularly in scenarios where noise may be present. Software: this https URL.
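
Illustrative sketch (not the SHARK implementation): the abstract states that the $k$-means objective decomposes into per-feature terms and that features are re-weighted by the inverse of their contribution. The snippet below only checks the elementary fact behind such a decomposition, namely that the within-cluster sum of squares is additive over features for a fixed partition. Whether these per-feature terms coincide exactly with the paper's Shapley values is not confirmed here, and the normalisation in `inverse_weights` is an assumption; all function names are hypothetical.

```python
def centroids_of(points, labels, k):
    """Per-cluster mean of each feature (assumes every cluster is non-empty)."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for j, v in enumerate(p):
            sums[c][j] += v
    return [[s / counts[c] for s in sums[c]] for c in range(k)]


def per_feature_dispersion(points, labels, k):
    """Within-cluster sum of squares contributed by each feature separately."""
    cents = centroids_of(points, labels, k)
    contrib = [0.0] * len(points[0])
    for p, c in zip(points, labels):
        for j, v in enumerate(p):
            contrib[j] += (v - cents[c][j]) ** 2
    return contrib


def kmeans_objective(points, labels, k):
    """Standard k-means objective: total squared distance to assigned centroids."""
    cents = centroids_of(points, labels, k)
    return sum(
        sum((v - m) ** 2 for v, m in zip(p, cents[c]))
        for p, c in zip(points, labels)
    )


def inverse_weights(contrib):
    """Assumed re-weighting: inverse contributions, normalised to sum to 1."""
    inv = [1.0 / c for c in contrib]
    total = sum(inv)
    return [w / total for w in inv]


# Feature 0 separates the two clusters cleanly; feature 1 is mostly noise.
points = [(0.0, 0.0), (1.0, 10.0), (5.0, 1.0), (6.0, 9.0)]
labels = [0, 0, 1, 1]

contrib = per_feature_dispersion(points, labels, k=2)
objective = kmeans_objective(points, labels, k=2)
weights = inverse_weights(contrib)
```

In this toy example the noisy second feature carries almost all of the dispersion, so the inverse rule gives it the smaller weight, which matches the abstract's description of down-weighting irrelevant dimensions.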

Title: Expert Preference-based Evaluation of Automated Related Work Generation

Authors: Furkan Şahinuç, Subhabrata Dutta, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07955
Pdf URL: https://arxiv.org/pdf/2508.07955
Copy Paste: [[2508.07955]] Expert Preference-based Evaluation of Automated Related Work Generation(https://arxiv.org/abs/2508.07955)
Keywords: robust
Abstract: Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Recent advances in LLMs show promising potential in reducing the expert workload. However, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific evaluation criteria and the ability to discern expert preferences. Conventional automatic metrics and LLM-as-a-judge systems are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Instead of assigning a single score, our framework decomposes the evaluation into fine-grained dimensions. This localized evaluation approach is further augmented with contrastive few-shot examples to provide detailed contextual guidance for the evaluation dimensions. The design principles allow our framework to deliver cardinal assessment of quality, which can facilitate better post-training compared to ordinal preference data. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the human expert assessment. We also observe that generations from state-of-the-art LLMs struggle to satisfy validation constraints of a suitable related work section. They (mostly) fail to improve based on feedback as well.

Title: Large Language Models for Subjective Language Understanding: A Survey

Authors: Changhao Song, Yazhou Zhang, Hui Gao, Ben Yao, Peng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07959
Pdf URL: https://arxiv.org/pdf/2508.07959
Copy Paste: [[2508.07959]] Large Language Models for Subjective Language Understanding: A Survey(https://arxiv.org/abs/2508.07959)
Keywords: large language model
Abstract: Subjective language understanding refers to a broad set of natural language processing tasks where the goal is to interpret or generate content that conveys personal feelings, opinions, or figurative meanings rather than objective facts. With the advent of large language models (LLMs) such as ChatGPT, LLaMA, and others, there has been a paradigm shift in how we approach these inherently nuanced tasks. In this survey, we provide a comprehensive review of recent advances in applying LLMs to subjective language tasks, including sentiment analysis, emotion recognition, sarcasm detection, humor understanding, stance detection, metaphor interpretation, intent detection, and aesthetics assessment. We begin by clarifying the definition of subjective language from linguistic and cognitive perspectives, and we outline the unique challenges posed by subjective language (e.g. ambiguity, figurativeness, context dependence). We then survey the evolution of LLM architectures and techniques that particularly benefit subjectivity tasks, highlighting why LLMs are well-suited to model subtle human-like judgments. For each of the eight tasks, we summarize task definitions, key datasets, state-of-the-art LLM-based methods, and remaining challenges. We provide comparative insights, discussing commonalities and differences among tasks and how multi-task LLM approaches might yield unified models of subjectivity. Finally, we identify open issues such as data limitations, model bias, and ethical considerations, and suggest future research directions. We hope this survey will serve as a valuable resource for researchers and practitioners interested in the intersection of affective computing, figurative language processing, and large-scale language models.

Title: VOIDFace: A Privacy-Preserving Multi-Network Face Recognition With Enhanced Security

Authors: Ajnas Muhammed, Iurri Medvedev, Nuno Gonçalves
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07960
Pdf URL: https://arxiv.org/pdf/2508.07960
Copy Paste: [[2508.07960]] VOIDFace: A Privacy-Preserving Multi-Network Face Recognition With Enhanced Security(https://arxiv.org/abs/2508.07960)
Keywords: secure, security, privacy, robust
Abstract: Advancement of machine learning techniques, combined with the availability of large-scale datasets, has significantly improved the accuracy and efficiency of facial recognition. Modern facial recognition systems are trained using large face datasets collected from diverse individuals or public repositories. However, for training, these datasets are often replicated and stored on multiple workstations, resulting in data replication, which complicates database management and oversight. Currently, once a user submits their face for dataset preparation, they lose control over how their data is used, raising significant privacy and ethical concerns. This paper introduces VOIDFace, a novel framework for facial recognition systems that addresses two major issues. First, it eliminates the need for data replication and improves data control to securely store training face data by using visual secret sharing. Second, it proposes a patch-based multi-training network that uses this novel training data storage mechanism to develop a robust, privacy-preserving facial recognition system. By integrating these advancements, VOIDFace aims to improve the privacy, security, and efficiency of facial recognition training, while ensuring greater control over sensitive personal face data. VOIDFace also enables users to exercise their Right-To-Be-Forgotten property to control their personal data. Experimental evaluations on the VGGFace2 dataset show that VOIDFace provides Right-To-Be-Forgotten, improved data control, security, and privacy while maintaining competitive facial recognition performance. Code is available at: this https URL

Title: TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking

Authors: Tony Danjun Wang, Christian Heiliger, Nassir Navab, Lennart Bastian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07968
Pdf URL: https://arxiv.org/pdf/2508.07968
Copy Paste: [[2508.07968]] TrackOR: Towards Personalized Intelligent Operating Rooms Through Robust Tracking(https://arxiv.org/abs/2508.07968)
Keywords: robust
Abstract: Providing intelligent support to surgical teams is a key frontier in automated surgical scene understanding, with the long-term goal of improving patient outcomes. Developing personalized intelligence for all staff members requires maintaining a consistent state of who is located where for long surgical procedures, which still poses numerous computational challenges. We propose TrackOR, a framework for tackling long-term multi-person tracking and re-identification in the operating room. TrackOR uses 3D geometric signatures to achieve state-of-the-art online tracking performance (+11% Association Accuracy over the strongest baseline), while also enabling an effective offline recovery process to create analysis-ready trajectories. Our work shows that by leveraging 3D geometric information, persistent identity tracking becomes attainable, enabling a critical shift towards the more granular, staff-centric analyses required for personalized intelligent systems in the operating room. This new capability opens up various applications, including our proposed temporal pathway imprints that translate raw tracking data into actionable insights for improving team efficiency and safety and ultimately providing personalized support.

Title: Understanding Syntactic Generalization in Structure-inducing Language Models

Authors: David Arps, Hassan Sajjad, Laura Kallmeyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07969
Pdf URL: https://arxiv.org/pdf/2508.07969
Copy Paste: [[2508.07969]] Understanding Syntactic Generalization in Structure-inducing Language Models(https://arxiv.org/abs/2508.07969)
Keywords: transformer, generative
Abstract: Structure-inducing Language Models (SiLMs) are trained on a self-supervised language modeling task, and induce a hierarchical sentence representation as a byproduct when processing an input. A wide variety of SiLMs have been proposed. However, these have typically been evaluated on a relatively small scale, and evaluation of these models has systematic gaps and lacks comparability. In this work, we study three different SiLM architectures using both natural language (English) corpora and synthetic bracketing expressions: Structformer (Shen et al., 2021), UDGN (Shen et al., 2022) and GPST (Hu et al., 2024). We compare them with respect to (i) properties of the induced syntactic representations, (ii) performance on grammaticality judgment tasks, and (iii) training dynamics. We find that none of the three architectures dominates across all evaluation metrics. However, there are significant differences, in particular with respect to the induced syntactic representations. The Generative Pretrained Structured Transformer (GPST; Hu et al. 2024) performs most consistently across evaluation settings, and outperforms the other models on long-distance dependencies in bracketing expressions. Furthermore, our study shows that small models trained on large amounts of synthetic data provide a useful testbed for evaluating basic model properties.

Title: WeChat-YATT: A Simple, Scalable and Balanced RLHF Trainer

Authors: Junyu Wu, Weiming Chang, Xiaotao Liu, Guanyou He, Tingfeng Xian, Haoqiang Hong, Boqi Chen, Haotao Tian, Tao Yang, Yunsheng Shi, Feng Lin, Ting Yao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.07970
Pdf URL: https://arxiv.org/pdf/2508.07970
Copy Paste: [[2508.07970]] WeChat-YATT: A Simple, Scalable and Balanced RLHF Trainer(https://arxiv.org/abs/2508.07970)
Keywords: robust, transformer, large language model
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent paradigm for training large language models and multimodal systems. Despite notable advances enabled by existing RLHF training frameworks, significant challenges remain in scaling to complex multimodal workflows and adapting to dynamic workloads. In particular, current systems often encounter limitations related to controller scalability when managing large models, as well as inefficiencies in orchestrating intricate RLHF pipelines, especially in scenarios that require dynamic sampling and resource allocation. In this paper, we introduce WeChat-YATT (Yet Another Transformer Trainer in WeChat), a simple, scalable, and balanced RLHF training framework specifically designed to address these challenges. WeChat-YATT features a parallel controller programming model that enables flexible and efficient orchestration of complex RLHF workflows, effectively mitigating the bottlenecks associated with centralized controller architectures and facilitating scalability in large-scale data scenarios. In addition, we propose a dynamic placement schema that adaptively partitions computational resources and schedules workloads, thereby significantly reducing hardware idle time and improving GPU utilization under variable training conditions. We evaluate WeChat-YATT across a range of experimental scenarios, demonstrating that it achieves substantial improvements in throughput compared to state-of-the-art RLHF training frameworks. Furthermore, WeChat-YATT has been successfully deployed to train models supporting WeChat product features for a large-scale user base, underscoring its effectiveness and robustness in real-world applications.

Title: The Escalator Problem: Identifying Implicit Motion Blindness in AI for Accessibility

Authors: Xiantao Zhang
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2508.07989
Pdf URL: https://arxiv.org/pdf/2508.07989
Copy Paste: [[2508.07989]] The Escalator Problem: Identifying Implicit Motion Blindness in AI for Accessibility(https://arxiv.org/abs/2508.07989)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models (MLLMs) hold immense promise as assistive technologies for the blind and visually impaired (BVI) community. However, we identify a critical failure mode that undermines their trustworthiness in real-world applications. We introduce the Escalator Problem -- the inability of state-of-the-art models to perceive an escalator's direction of travel -- as a canonical example of a deeper limitation we term Implicit Motion Blindness. This blindness stems from the dominant frame-sampling paradigm in video understanding, which, by treating videos as discrete sequences of static images, fundamentally struggles to perceive continuous, low-signal motion. As a position paper, our contribution is not a new model but rather to: (I) formally articulate this blind spot, (II) analyze its implications for user trust, and (III) issue a call to action. We advocate for a paradigm shift from purely semantic recognition towards robust physical perception and urge the development of new, human-centered benchmarks that prioritize safety, reliability, and the genuine needs of users in dynamic environments.

Title: Prompt-Guided Relational Reasoning for Social Behavior Understanding with Vision Foundation Models

Authors: Thinesh Thiyakesan Ponbagavathi, Chengzheng Yang, Alina Roitberg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.07996
Pdf URL: https://arxiv.org/pdf/2508.07996
Copy Paste: [[2508.07996]] Prompt-Guided Relational Reasoning for Social Behavior Understanding with Vision Foundation Models(https://arxiv.org/abs/2508.07996)
Keywords: transformer
Abstract: Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features, but are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. While they are a promising alternative to highly task-specific GAD architectures that require full fine-tuning, our initial investigation reveals that simply swapping CNN backbones used in these methods with VFMs brings little gain, underscoring the need for structured, group-aware reasoning on top. We introduce Prompt-driven Group Activity Detection (ProGraD) -- a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. We evaluate our approach on two recent GAD benchmarks: Cafe, which features multiple concurrent social groups, and Social-CAD, which focuses on single-group interactions. While we surpass state-of-the-art in both settings, our method is especially effective in complex multi-group scenarios, where we yield a gain of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) using only 10M trainable parameters. Furthermore, our experiments reveal that ProGraD produces interpretable attention maps, offering insights into actor-group reasoning. Code and models will be released.

Title: WideSearch: Benchmarking Agentic Broad Info-Seeking

Authors: Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.07999
Pdf URL: https://arxiv.org/pdf/2508.07999
Copy Paste: [[2508.07999]] WideSearch: Benchmarking Agentic Broad Info-Seeking(https://arxiv.org/abs/2508.07999)
Keywords: large language model
Abstract: From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search agents powered by LLMs offer a promising solution to liberate humans from this tedious work. However, the capability of these agents to perform such "wide-context" collection reliably and completely remains largely unevaluated due to a lack of suitable benchmarks. To bridge this gap, we introduce WideSearch, a new benchmark engineered to evaluate agent reliability on these large-scale collection tasks. The benchmark features 200 manually curated questions (100 in English, 100 in Chinese) from over 15 diverse domains, grounded in real user queries. Each task requires agents to collect large-scale atomic information, which could be verified one by one objectively, and arrange it into a well-organized output. A rigorous five-stage quality control pipeline ensures the difficulty, completeness, and verifiability of the dataset. We benchmark over 10 state-of-the-art agentic search systems, including single-agent, multi-agent frameworks, and end-to-end commercial systems. Most systems achieve overall success rates near 0%, with the best performer reaching just 5%. However, given sufficient time, cross-validation by multiple human testers can achieve a near 100% success rate. These results demonstrate that present search agents have critical deficiencies in large-scale information seeking, underscoring urgent areas for future research and development in agentic search. Our dataset, evaluation pipeline, and benchmark results have been publicly released at this https URL

Title: Progressive Depth Up-scaling via Optimal Transport

Authors: Mingzi Cao, Xi Wang, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08011
Pdf URL: https://arxiv.org/pdf/2508.08011
Copy Paste: [[2508.08011]] Progressive Depth Up-scaling via Optimal Transport(https://arxiv.org/abs/2508.08011)
Keywords: transformer, large language model
Abstract: Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT for new layer creation, to mitigate neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and offers better training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. To further evaluate the impact of interpolation positions, our extensive analysis shows that inserting new layers closer to the top results in higher training efficiency due to shorter back-propagation time while obtaining additional performance gains.

Title: Communication-Efficient Zero-Order and First-Order Federated Learning Methods over Wireless Networks

Authors: Mohamad Assaad, Zeinab Nehme, Merouane Debbah
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.08013
Pdf URL: https://arxiv.org/pdf/2508.08013
Copy Paste: [[2508.08013]] Communication-Efficient Zero-Order and First-Order Federated Learning Methods over Wireless Networks(https://arxiv.org/abs/2508.08013)
Keywords: federate
Abstract: Federated Learning (FL) is an emerging learning framework that enables edge devices to collaboratively train ML models without sharing their local data. FL faces, however, a significant challenge due to the large amount of information that must be exchanged between the devices and the aggregator in the training phase, which can exceed the limited capacity of wireless systems. In this paper, two communication-efficient FL methods are considered where communication overhead is reduced by communicating scalar values instead of long vectors and by allowing a large number of users to send information simultaneously. The first approach employs a zero-order optimization technique with a two-point gradient estimator, while the second involves a first-order gradient computation strategy. The novelty lies in leveraging channel information in the learning algorithms, thereby eliminating the need for additional resources to acquire channel state information (CSI) and to remove its impact, as well as in considering asynchronous devices. We provide a rigorous analytical framework for the two methods, deriving convergence guarantees and establishing appropriate performance bounds.
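For readers unfamiliar with the term, a generic two-point zero-order gradient estimator can be sketched as follows. This is the textbook form, not the paper's algorithm: it leaves out the wireless channel, the aggregator, and asynchrony entirely. The function names, the Gaussian direction sampling, and the test objective are assumptions made for illustration; one plausible reading of "communicating scalar values" is that only the finite difference returned by `two_point_scalar` would need to be transmitted when the direction is known to both sides.

```python
import random


def two_point_scalar(f, x, u, mu):
    """Central finite difference of f at x along direction u (a single scalar)."""
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    return (f(x_plus) - f(x_minus)) / (2 * mu)


def two_point_gradient(f, x, u, mu=1e-3):
    """Gradient estimate from two function evaluations: scalar times direction."""
    s = two_point_scalar(f, x, u, mu)
    return [s * ui for ui in u]


def estimate_gradient(f, x, num_directions=2000, mu=1e-3, seed=0):
    """Average the two-point estimate over random Gaussian directions."""
    rng = random.Random(seed)
    d = len(x)
    total = [0.0] * d
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        g = two_point_gradient(f, x, u, mu)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_directions for t in total]


def quadratic(x):
    """Hypothetical test objective; its true gradient is 2 * x."""
    return sum(xi * xi for xi in x)


point = [1.0, -2.0]
print(estimate_gradient(quadratic, point))  # roughly [2, -4], up to sampling noise
```

No gradient of `f` is ever computed; the estimate relies only on function values, which is what makes the approach "zero-order".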

Title: Mitigating Biases in Surgical Operating Rooms with Geometry

Authors: Tony Danjun Wang, Tobias Czempiel, Nassir Navab, Lennart Bastian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08028
Pdf URL: https://arxiv.org/pdf/2508.08028
Copy Paste: [[2508.08028]] Mitigating Biases in Surgical Operating Rooms with Geometry(https://arxiv.org/abs/2508.08028)
Keywords: robust, biometric
Abstract: Deep neural networks are prone to learning spurious correlations, exploiting dataset-specific artifacts rather than meaningful features for prediction. In surgical operating rooms (OR), these manifest through the standardization of smocks and gowns that obscure robust identifying landmarks, introducing model bias for tasks related to modeling OR personnel. Through gradient-based saliency analysis on two public OR datasets, we reveal that CNN models succumb to such shortcuts, fixating on incidental visual cues such as footwear beneath surgical gowns, distinctive eyewear, or other role-specific identifiers. Avoiding such biases is essential for the next generation of intelligent assistance systems in the OR, which should accurately recognize personalized workflow traits, such as surgical skill level or coordination with other staff members. We address this problem by encoding personnel as 3D point cloud sequences, disentangling identity-relevant shape and motion patterns from appearance-based confounders. Our experiments demonstrate that while RGB and geometric methods achieve comparable performance on datasets with apparent simulation artifacts, RGB models suffer a 12% accuracy drop in realistic clinical settings with decreased visual diversity due to standardizations. This performance gap confirms that geometric representations capture more meaningful biometric features, providing an avenue to developing robust methods of modeling humans in the OR.

Title: Robust Anomaly Detection in O-RAN: Leveraging LLMs against Data Manipulation Attacks

Authors: Thusitha Dayaratne, Ngoc Duy Pham, Viet Vo, Shangqi Lai, Sharif Abuadbba, Hajime Suzuki, Xingliang Yuan, Carsten Rudolph
Subjects: cs.CR, cs.ET, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08029
Pdf URL: https://arxiv.org/pdf/2508.08029
Copy Paste: [[2508.08029]] Robust Anomaly Detection in O-RAN: Leveraging LLMs against Data Manipulation Attacks(https://arxiv.org/abs/2508.08029)
Keywords: security, attack, robust, large language model
Abstract: The introduction of 5G and the Open Radio Access Network (O-RAN) architecture has enabled more flexible and intelligent network deployments. However, the increased complexity and openness of these architectures also introduce novel security challenges, such as data manipulation attacks on the semi-standardised Shared Data Layer (SDL) within the O-RAN platform through malicious xApps. In particular, malicious xApps can exploit this vulnerability by introducing subtle Unicode-wise alterations (hypoglyphs) into the data that are being used by traditional machine learning (ML)-based anomaly detection methods. These Unicode-wise manipulations can potentially bypass detection and cause failures in anomaly detection systems based on traditional ML, such as AutoEncoders, which are unable to process hypoglyphed data without crashing. We investigate the use of Large Language Models (LLMs) for anomaly detection within the O-RAN architecture to address this challenge. We demonstrate that LLM-based xApps maintain robust operational performance and are capable of processing manipulated messages without crashing. While initial detection accuracy requires further improvements, our results highlight the robustness of LLMs to adversarial attacks such as hypoglyphs in input data. There is potential to use their adaptability through prompt engineering to further improve the accuracy, although this requires further research. Additionally, we show that LLMs achieve low detection latency (under 0.07 seconds), making them suitable for Near-Real-Time (Near-RT) RIC deployments.
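To make the idea of a subtle Unicode-wise alteration concrete, the sketch below shows a look-alike character substitution and why a strict parser can fail on it. The abstract does not specify the exact manipulations used, so the field names, the sample strings, and both helper functions are hypothetical; this only illustrates the general class of input, not the paper's attack or its detector.

```python
import unicodedata


def non_ascii_characters(text):
    """List (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


def parses_as_float(text):
    """Report whether a strict numeric parser accepts the field."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Hypothetical message fields (assumption): both pairs look alike on screen.
clean_key = "cell_id"
altered_key = "c\u0435ll_id"      # Cyrillic small letter ie instead of Latin 'e'
clean_value = "10.5"
altered_value = "1\u041e.5"       # Cyrillic capital letter O instead of digit 0

print(non_ascii_characters(altered_key))
print(parses_as_float(clean_value), parses_as_float(altered_value))
```

A pipeline that converts fields to numbers without handling the `ValueError` would stop on the altered value, which is consistent with the failure mode the abstract attributes to traditional ML-based detectors.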

Title: IPBA: Imperceptible Perturbation Backdoor Attack in Federated Self-Supervised Learning

Authors: Jiayao Wang, Yang Song, Zhendong Zhao, Jiale Zhang, Qilin Wu, Junwu Zhu, Dongfang Zhao
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2508.08031
Pdf URL: https://arxiv.org/pdf/2508.08031
Copy Paste: [[2508.08031]] IPBA: Imperceptible Perturbation Backdoor Attack in Federated Self-Supervised Learning(https://arxiv.org/abs/2508.08031)
Keywords: privacy, defense, attack, robust, steal, federate
Abstract: Federated self-supervised learning (FSSL) combines the advantages of decentralized modeling and unlabeled representation learning, serving as a cutting-edge paradigm with strong potential for scalability and privacy preservation. Although FSSL has garnered increasing attention, research indicates that it remains vulnerable to backdoor attacks. Existing methods generally rely on visually obvious triggers, which makes it difficult to meet the requirements for stealth and practicality in real-world deployment. In this paper, we propose an imperceptible and effective backdoor attack method against FSSL, called IPBA. Our empirical study reveals that existing imperceptible triggers face a series of challenges in FSSL, particularly limited transferability, feature entanglement with augmented samples, and out-of-distribution properties. These issues collectively undermine the effectiveness and stealthiness of traditional backdoor attacks in FSSL. To overcome these challenges, IPBA decouples the feature distributions of backdoor and augmented samples, and introduces Sliced-Wasserstein distance to mitigate the out-of-distribution properties of backdoor samples, thereby optimizing the trigger generation process. Our experimental results on several FSSL scenarios and datasets show that IPBA significantly outperforms existing backdoor attack methods in performance and exhibits strong robustness under various defense mechanisms.
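The Sliced-Wasserstein distance mentioned in this abstract can be approximated with a short Monte Carlo routine: project both point sets onto random directions, compute the one-dimensional Wasserstein distance on each projection, and average. The sketch below shows only that quantity, under the simplifying assumptions of equal sample counts and the order-1 distance; how IPBA uses it during trigger generation is not covered by the abstract, and all names here are hypothetical.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def wasserstein_1d(a, b):
    """W1 between two equal-size 1D samples: mean gap between sorted values."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(xs, ys, num_projections=64, seed=0):
    """Monte Carlo estimate of the sliced W1 distance between two point sets."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        theta = random_unit_vector(dim, rng)
        px = [sum(t * c for t, c in zip(theta, x)) for x in xs]
        py = [sum(t * c for t, c in zip(theta, y)) for y in ys]
        total += wasserstein_1d(px, py)
    return total / num_projections


# Toy feature sets (assumption): a unit square and the same square shifted by 5.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[x + 5.0, y] for x, y in points]

print(sliced_wasserstein(points, points))   # identical sets give zero
print(sliced_wasserstein(points, shifted))  # positive, at most the shift of 5
```

Because each projection only requires a sort, the estimate stays cheap in high dimensions, which is the usual reason for preferring the sliced variant over the full Wasserstein distance.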

Title: Deep Learning-Based Analysis of Power Consumption in Gasoline, Electric, and Hybrid Vehicles

Authors: Roksana Yahyaabadi, Ghazal Farhani, Taufiq Rahman, Soodeh Nikan, Abdullah Jirjees, Fadi Araji
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2508.08034
Pdf URL: https://arxiv.org/pdf/2508.08034
Copy Paste: [[2508.08034]] Deep Learning-Based Analysis of Power Consumption in Gasoline, Electric, and Hybrid Vehicles(https://arxiv.org/abs/2508.08034)
Keywords: robust, transformer
Abstract: Accurate power consumption prediction is crucial for improving efficiency and reducing environmental impact, yet traditional methods relying on specialized instruments or rigid physical models are impractical for large-scale, real-world deployment. This study introduces a scalable data-driven method using powertrain dynamic feature sets and both traditional machine learning and deep neural networks to estimate instantaneous and cumulative power consumption in internal combustion engine (ICE), electric vehicle (EV), and hybrid electric vehicle (HEV) platforms. ICE models achieved high instantaneous accuracy with mean absolute error and root mean squared error on the order of $10^{-3}$, and cumulative errors under 3%. Transformer and long short-term memory models performed best for EVs and HEVs, with cumulative errors below 4.1% and 2.1%, respectively. Results confirm the approach's effectiveness across vehicles and models. Uncertainty analysis revealed greater variability in EV and HEV datasets than ICE, due to complex power management, emphasizing the need for robust models for advanced powertrains.
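The error measures quoted in this abstract can be computed as in the sketch below. The mean absolute error and root mean squared error follow their standard definitions; the cumulative error is an assumed definition (relative gap between summed predictions and summed measurements, in percent), since the abstract does not define it. The sample values are made up for illustration.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between predictions and measurements."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared gap."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def cumulative_error_percent(y_true, y_pred):
    """Assumed definition: relative gap between the two totals, in percent."""
    total_true = sum(y_true)
    return abs(sum(y_pred) - total_true) / abs(total_true) * 100.0


# Hypothetical instantaneous power readings and predictions.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.5]

print(mean_absolute_error(measured, predicted))
print(root_mean_squared_error(measured, predicted))
print(cumulative_error_percent(measured, predicted))
```

Under this definition, over- and under-predictions at individual time steps partly cancel in the total, so a model can show a small cumulative error alongside larger instantaneous errors.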

Title: TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation

Authors: Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08038
Pdf URL: https://arxiv.org/pdf/2508.08038
Copy Paste: [[2508.08038]] TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation(https://arxiv.org/abs/2508.08038)
Keywords: robust, extraction
Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: this https URL

Title: BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models

Authors: Maozhen Zhang, Mengnan Zhao, Bo Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08040
Pdf URL: https://arxiv.org/pdf/2508.08040
Copy Paste: [[2508.08040]] BadPromptFL: A Novel Backdoor Threat to Prompt-based Federated Learning in Multimodal Models(https://arxiv.org/abs/2508.08040)
Keywords: security, privacy, attack, robust, steal, federate
Abstract: Prompt-based tuning has emerged as a lightweight alternative to full fine-tuning in large vision-language models, enabling efficient adaptation via learned contextual prompts. This paradigm has recently been extended to federated learning settings (e.g., PromptFL), where clients collaboratively train prompts under data privacy constraints. However, the security implications of prompt-based aggregation in federated multimodal learning remain largely unexplored, leaving a critical attack surface unaddressed. In this paper, we introduce BadPromptFL, the first backdoor attack targeting prompt-based federated learning in multimodal contrastive models. In BadPromptFL, compromised clients jointly optimize local backdoor triggers and prompt embeddings, injecting poisoned prompts into the global aggregation process. These prompts are then propagated to benign clients, enabling universal backdoor activation at inference without modifying model parameters. Leveraging the contextual learning behavior of CLIP-style architectures, BadPromptFL achieves high attack success rates (e.g., >90%) with minimal visibility and limited client participation. Extensive experiments across multiple datasets and aggregation protocols validate the effectiveness, stealth, and generalizability of our attack, raising critical concerns about the robustness of prompt-based federated learning in real-world deployments.

Title: False Reality: Uncovering Sensor-induced Human-VR Interaction Vulnerability

Authors: Yancheng Jiang, Yan Jiang, Ruochen Zhou, Yi-Chao Chen, Xiaoyu Ji, Wenyuan Xu
Subjects: cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2508.08043
Pdf URL: https://arxiv.org/pdf/2508.08043
Copy Paste: [[2508.08043]] False Reality: Uncovering Sensor-induced Human-VR Interaction Vulnerability(https://arxiv.org/abs/2508.08043)
Keywords: security, defense, attack
Abstract: Virtual Reality (VR) techniques, serving as the bridge between the real and virtual worlds, have boomed and are widely used in manufacturing, remote healthcare, gaming, etc. Specifically, VR systems offer users immersive experiences that include both perceptions and actions. Various studies have demonstrated that attackers can manipulate VR software to influence users' interactions, including perception and actions. However, such attacks typically require strong access and specialized expertise. In this paper, we are the first to present a systematic analysis of physical attacks against VR systems and introduce False Reality, a new attack threat to VR devices that requires no access to or modification of their software. False Reality disturbs VR system services by tampering with sensor measurements and further spoofs users' perception, even inducing harmful actions (e.g., inducing dizziness or causing users to crash into obstacles), by exploiting perceptual and psychological effects. We formalize these threats through an attack pathway framework and validate three representative pathways via physical experiments and user studies on five commercial VR devices. Finally, we propose a defense prototype to mitigate such threats. Our findings provide valuable insights for enhancing the security and resilience of future VR systems.

Title: S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix

Authors: Peng Dai, Feitong Tan, Qiangeng Xu, Yihua Huang, David Futschik, Ruofei Du, Sean Fanello, Yinda Zhang, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08048
Pdf URL: https://arxiv.org/pdf/2508.08048
Copy Paste: [[2508.08048]] S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix(https://arxiv.org/abs/2508.08048)
Keywords: generative
Abstract: While video generation models excel at producing high-quality monocular videos, generating 3D stereoscopic and spatial videos for immersive applications remains an underexplored challenge. We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos. Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel frame matrix inpainting framework. This framework utilizes the original video generation model to synthesize missing content across different viewpoints and timestamps, ensuring spatial and temporal consistency without requiring additional model fine-tuning. Moreover, we develop a dual-update scheme that further improves the quality of video inpainting by alleviating the negative effects propagated from disoccluded areas in the latent space. The resulting multi-view videos are then adapted into stereoscopic pairs or optimized into 4D Gaussians for spatial video synthesis. We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope. The experiments demonstrate that our method significantly improves over previous methods. Project page at: this https URL

Title: On Understanding of the Dynamics of Model Capacity in Continual Learning

Authors: Supriyo Chakraborty, Krishnan Raghavan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08052
Pdf URL: https://arxiv.org/pdf/2508.08052
Copy Paste: [[2508.08052]] On Understanding of the Dynamics of Model Capacity in Continual Learning(https://arxiv.org/abs/2508.08052)
Keywords: transformer, large language model
Abstract: The stability-plasticity dilemma, closely related to a neural network's (NN) capacity (its ability to represent tasks), is a fundamental challenge in continual learning (CL). Within this context, we introduce CL's effective model capacity (CLEMC) that characterizes the dynamic behavior of the stability-plasticity balance point. We develop a difference equation to model the evolution of the interplay between the NN, task data, and optimization procedure. We then leverage CLEMC to demonstrate that the effective capacity, and by extension the stability-plasticity balance point, is inherently non-stationary. We show that regardless of the NN architecture or optimization method, an NN's ability to represent new tasks diminishes when incoming task distributions differ from previous ones. We conduct extensive experiments to support our theoretical findings, spanning a range of architectures, from small feedforward and convolutional networks to medium-sized graph neural networks and transformer-based large language models with millions of parameters.

Title: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

Authors: Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08066
Pdf URL: https://arxiv.org/pdf/2508.08066
Copy Paste: [[2508.08066]] Investigating the Design Space of Visual Grounding in Multimodal Large Language Model(https://arxiv.org/abs/2508.08066)
Keywords: large language model
Abstract: Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs' fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5.

Title: Fully-Fluctuating Participation in Sleepy Consensus

Authors: Yuval Efron, Joachim Neu, Toniann Pitassi
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2508.08068
Pdf URL: https://arxiv.org/pdf/2508.08068
Copy Paste: [[2508.08068]] Fully-Fluctuating Participation in Sleepy Consensus(https://arxiv.org/abs/2508.08068)
Keywords: secure, security, robust
Abstract: Proof-of-work allows Bitcoin to boast security amidst arbitrary fluctuations in participation of miners throughout time, so long as, at any point in time, a majority of hash power is honest. In recent years, however, the pendulum has shifted in favor of proof-of-stake-based consensus protocols. There, the sleepy model is the most prominent model for handling fluctuating participation of nodes. However, to date, no protocol in the sleepy model rivals Bitcoin in its robustness to drastic fluctuations in participation levels, with state-of-the-art protocols making various restrictive assumptions. In this work, we present a new adversary model, called the external adversary. Intuitively, in our model, corrupt nodes do not divulge information about their secret keys. In this model, we show that protocols in the sleepy model can meaningfully claim to remain secure against fully fluctuating participation, without compromising efficiency or corruption resilience. Our adversary model is quite natural and arguably captures the process by which malicious behavior arises in protocols, as opposed to traditional worst-case modeling. Moreover, the model is theoretically appealing, circumventing a barrier established in a recent work of Malkhi, Momose, and Ren.

Title: Information Bottleneck-based Causal Attention for Multi-label Medical Image Recognition

Authors: Xiaoxiao Cui, Yiran Li, Kai He, Shanzhi Jiang, Mengli Xue, Wentao Li, Junhong Leng, Zhi Liu, Lizhen Cui, Shuo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08069
Pdf URL: https://arxiv.org/pdf/2508.08069
Copy Paste: [[2508.08069]] Information Bottleneck-based Causal Attention for Multi-label Medical Image Recognition(https://arxiv.org/abs/2508.08069)
Keywords: interpretability
Abstract: Multi-label classification (MLC) of medical images aims to identify multiple diseases and holds significant clinical potential. A critical step is to effectively learn class-specific features for accurate diagnosis and improved interpretability. However, current works focus primarily on causal attention to learn class-specific features, yet they struggle to interpret the true cause due to inadvertent attention to class-irrelevant features. To address this challenge, we propose a new structural causal model (SCM) that treats class-specific attention as a mixture of causal, spurious, and noisy factors, and a novel Information Bottleneck-based Causal Attention (IBCA) that is capable of learning the discriminative class-specific attention for MLC of medical images. Specifically, we propose learning Gaussian mixture multi-label spatial attention to filter out class-irrelevant information and capture each class-specific attention pattern. Then a contrastive enhancement-based causal intervention is proposed to gradually mitigate the spurious attention and reduce noise information by aligning multi-head attention with the Gaussian mixture multi-label spatial attention. Quantitative and ablation results on Endo and MuReD show that IBCA outperforms all compared methods. Compared to the second-best results for each metric, IBCA achieves improvements of 6.35% in CR, 7.72% in OR, and 5.02% in mAP for MuReD, and 1.47% in CR, 1.65% in CF1, and 1.42% in mAP for Endo.

Title: Matrix-3D: Omnidirectional Explorable 3D World Generation

Authors: Zhongqi Yang, Wenhang Ge, Yuqi Li, Jiaqi Chen, Haoyuan Li, Mengyin An, Fei Kang, Hua Xue, Baixin Xu, Yuyang Yin, Eric Li, Yang Liu, Yikai Wang, Hao-Xiang Guo, Yahui Zhou
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2508.08086
Pdf URL: https://arxiv.org/pdf/2508.08086
Copy Paste: [[2508.08086]] Matrix-3D: Omnidirectional Explorable 3D World Generation(https://arxiv.org/abs/2508.08086)
Keywords: diffusion
Abstract: Explorable 3D world generation from a single image or text prompt forms a cornerstone of spatial intelligence. Recent works utilize video models to achieve wide-scope and generalizable 3D world generation. However, existing approaches often suffer from a limited scope in the generated scenes. In this work, we propose Matrix-3D, a framework that utilizes a panoramic representation for wide-coverage omnidirectional explorable 3D world generation, combining conditional video generation and panoramic 3D reconstruction. We first train a trajectory-guided panoramic video diffusion model that employs scene mesh renders as the condition, to enable high-quality and geometrically consistent scene video generation. To lift the panorama scene video to a 3D world, we propose two separate methods: (1) a feed-forward large panorama reconstruction model for rapid 3D scene reconstruction and (2) an optimization-based pipeline for accurate and detailed 3D scene reconstruction. To facilitate effective training, we also introduce the Matrix-Pano dataset, the first large-scale synthetic collection comprising 116K high-quality static panoramic video sequences with depth and trajectory annotations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance in panoramic video generation and 3D world generation. See more in this https URL.

Title: MDD-Net: Multimodal Depression Detection through Mutual Transformer

Authors: Md Rezwanul Haque, Md. Milon Islam, S M Taslim Uddin Raju, Hamdi Altaheri, Lobna Nassar, Fakhri Karray
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08093
Pdf URL: https://arxiv.org/pdf/2508.08093
Copy Paste: [[2508.08093]] MDD-Net: Multimodal Depression Detection through Mutual Transformer(https://arxiv.org/abs/2508.08093)
Keywords: extraction, transformer
Abstract: Depression is a major mental health condition that severely impacts the emotional and physical well-being of individuals. The simple nature of data collection from social media platforms has attracted significant interest in properly utilizing this information for mental health research. A Multimodal Depression Detection Network (MDD-Net), utilizing acoustic and visual data obtained from social media networks, is proposed in this work where mutual transformers are exploited to efficiently extract and fuse multimodal features for efficient depression detection. The MDD-Net consists of four core modules: an acoustic feature extraction module for retrieving relevant acoustic attributes, a visual feature extraction module for extracting significant high-level patterns, a mutual transformer for computing the correlations among the generated features and fusing these features from multiple modalities, and a detection layer for detecting depression using the fused feature representations. The extensive experiments are performed using the multimodal D-Vlog dataset, and the findings reveal that the developed multimodal depression detection network surpasses the state-of-the-art by up to 17.37% for F1-Score, demonstrating the greater performance of the proposed system. The source code is accessible at this https URL.

Title: 3D Plant Root Skeleton Detection and Extraction

Authors: Jiakai Lin, Jinchang Zhang, Ge Jin, Wenzhan Song, Tianming Liu, Guoyu Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08094
Pdf URL: https://arxiv.org/pdf/2508.08094
Copy Paste: [[2508.08094]] 3D Plant Root Skeleton Detection and Extraction(https://arxiv.org/abs/2508.08094)
Keywords: extraction
Abstract: Plant roots typically exhibit a highly complex and dense architecture, incorporating numerous slender lateral roots and branches, which significantly hinders the precise capture and modeling of the entire root system. Additionally, roots often lack sufficient texture and color information, making it difficult to identify and track root traits using visual methods. Previous research on roots has been largely confined to 2D studies; however, exploring the 3D architecture of roots is crucial in botany. Since roots grow in real 3D space, 3D phenotypic information is more critical for studying genetic traits and their impact on root development. We have introduced a 3D root skeleton extraction method that efficiently derives the 3D architecture of plant roots from a few images. This method includes the detection and matching of lateral roots, triangulation to extract the skeletal structure of lateral roots, and the integration of lateral and primary roots. We developed a highly complex root dataset and tested our method on it. The extracted 3D root skeletons showed considerable similarity to the ground truth, validating the effectiveness of the model. This method can play a significant role in automated breeding robots. Through precise 3D root structure analysis, breeding robots can better identify plant phenotypic traits, especially root structure and growth patterns, helping practitioners select seeds with superior root systems. This automated approach not only improves breeding efficiency but also reduces manual intervention, making the breeding process more intelligent and efficient, thus advancing modern agriculture.

Title: Dual Information Speech Language Models for Emotional Conversations

Authors: Chun Wang, Chenyang Liu, Wenze Xu, Weihong Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08095
Pdf URL: https://arxiv.org/pdf/2508.08095
Copy Paste: [[2508.08095]] Dual Information Speech Language Models for Emotional Conversations(https://arxiv.org/abs/2508.08095)
Keywords: large language model
Abstract: Conversational systems relying on text-based large language models (LLMs) often overlook paralinguistic cues, essential for understanding emotions and intentions. Speech-language models (SLMs), which use speech as input, are emerging as a promising solution. However, SLMs built by extending frozen LLMs struggle to capture paralinguistic information and exhibit reduced context understanding. We identify entangled information and improper training strategies as key issues. To address these issues, we propose two heterogeneous adapters and suggest a weakly supervised training strategy. Our approach disentangles paralinguistic and linguistic information, enabling SLMs to interpret speech through structured representations. It also preserves contextual understanding by avoiding the generation of task-specific vectors through controlled randomness. This approach trains only the adapters on common datasets, ensuring parameter and data efficiency. Experiments demonstrate competitive performance in emotional conversation tasks, showcasing the model's ability to effectively integrate both paralinguistic and linguistic information within contextual settings.

Title: Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?

Authors: Lukas Gehring, Benjamin Paaßen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08096
Pdf URL: https://arxiv.org/pdf/2508.08096
Copy Paste: [[2508.08096]] Assessing LLM Text Detection in Educational Contexts: Does Human Contribution Affect Detection?(https://arxiv.org/abs/2508.08096)
Keywords: attack, generative, large language model
Abstract: Recent advancements in Large Language Models (LLMs) and their increased accessibility have made it easier than ever for students to automatically generate texts, posing new challenges for educational institutions. To enforce norms of academic integrity and ensure students' learning, learning analytics methods to automatically detect LLM-generated text appear increasingly appealing. This paper benchmarks the performance of different state-of-the-art detectors in educational contexts, introducing a novel dataset, called Generative Essay Detection in Education (GEDE), containing over 900 student-written essays and over 12,500 LLM-generated essays from various domains. To capture the diversity of LLM usage practices in generating text, we propose the concept of contribution levels, representing students' contribution to a given assignment. These levels range from purely human-written texts, to slightly LLM-improved versions, to fully LLM-generated texts, and finally to active attacks on the detector by "humanizing" generated texts. We show that most detectors struggle to accurately classify texts of intermediate student contribution levels, like LLM-improved human-written texts. Detectors are particularly likely to produce false positives, which is problematic in educational settings where false suspicions can severely impact students' lives. Our dataset, code, and additional supplementary materials are publicly available at this https URL.

Title: TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning

Authors: Junzhe Xu, Yuyang Yin, Xi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08098
Pdf URL: https://arxiv.org/pdf/2508.08098
Copy Paste: [[2508.08098]] TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning(https://arxiv.org/abs/2508.08098)
Keywords: diffusion, generative, large language model
Abstract: This paper introduces TBAC-UniImage, a novel unified model for multimodal understanding and generation. We achieve this by deeply integrating a pre-trained Diffusion Model, acting as a generative ladder, with a Multimodal Large Language Model (MLLM). Previous diffusion-based unified models face two primary limitations. One approach uses only the MLLM's final hidden state as the generative condition. This creates a shallow connection, as the generator is isolated from the rich, hierarchical representations within the MLLM's intermediate layers. The other approach, pretraining a unified generative architecture from scratch, is computationally expensive and prohibitive for many researchers. To overcome these issues, our work explores a new paradigm. Instead of relying on a single output, we use representations from multiple, diverse layers of the MLLM as generative conditions for the diffusion model. This method treats the pre-trained generator as a ladder, receiving guidance from various depths of the MLLM's understanding process. Consequently, TBAC-UniImage achieves a much deeper and more fine-grained unification of understanding and generation.

Title: Grid2Guide: A* Enabled Small Language Model for Indoor Navigation

Authors: Md. Wasiul Haque, Sagar Dasgupta, Mizanur Rahman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08100
Pdf URL: https://arxiv.org/pdf/2508.08100
Copy Paste: [[2508.08100]] Grid2Guide: A* Enabled Small Language Model for Indoor Navigation(https://arxiv.org/abs/2508.08100)
Keywords: interpretability
Abstract: Reliable indoor navigation remains a significant challenge in complex environments, particularly where external positioning signals and dedicated infrastructures are unavailable. This research presents Grid2Guide, a hybrid navigation framework that combines the A* search algorithm with a Small Language Model (SLM) to generate clear, human-readable route instructions. The framework first constructs a binary occupancy matrix from a given indoor map. Using this matrix, the A* algorithm computes the optimal path between origin and destination, producing concise textual navigation steps. These steps are then transformed into natural language instructions by the SLM, enhancing interpretability for end users. Experimental evaluations across various indoor scenarios demonstrate the method's effectiveness in producing accurate and timely navigation guidance. The results validate the proposed approach as a lightweight, infrastructure-free solution for real-time indoor navigation support.
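
Illustrative sketch (not taken from the paper): the abstract describes running A* over a binary occupancy matrix and turning the resulting path into concise textual steps. The Python below shows one way this could look, assuming 4-connected movement, unit step cost, a Manhattan-distance heuristic, and the convention 1 = blocked. The function names `astar` and `path_to_steps` and the step wording are hypothetical; the SLM rewriting stage is not modeled.

```python
import heapq

def astar(grid, start, goal):
    """Shortest 4-connected path on a binary occupancy grid (1 = blocked, assumed)."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None  # goal unreachable

def path_to_steps(path):
    """Collapse a cell path into concise steps such as 'move east 2'."""
    names = {(0, 1): "east", (0, -1): "west", (1, 0): "south", (-1, 0): "north"}
    steps = []
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        direction = names[(r1 - r0, c1 - c0)]
        if steps and steps[-1][0] == direction:
            steps[-1][1] += 1
        else:
            steps.append([direction, 1])
    return [f"move {d} {n}" for d, n in steps]

# Made-up 3x3 map: a wall across the middle row with a gap on the right.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
steps = path_to_steps(path)
```

In a pipeline like the one described, the list in `steps` would be the intermediate text handed to the language model for rephrasing.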

Title: Hyperspectral Imaging

Authors: Danfeng Hong, Chenyu Li, Naoto Yokoya, Bing Zhang, Xiuping Jia, Antonio Plaza, Paolo Gamba, Jon Atli Benediktsson, Jocelyn Chanussot
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08107
Pdf URL: https://arxiv.org/pdf/2508.08107
Copy Paste: [[2508.08107]] Hyperspectral Imaging(https://arxiv.org/abs/2508.08107)
Keywords: security
Abstract: Hyperspectral imaging (HSI) is an advanced sensing modality that simultaneously captures spatial and spectral information, enabling non-invasive, label-free analysis of material, chemical, and biological properties. This Primer presents a comprehensive overview of HSI, from the underlying physical principles and sensor architectures to key steps in data acquisition, calibration, and correction. We summarize common data structures and highlight classical and modern analysis methods, including dimensionality reduction, classification, spectral unmixing, and AI-driven techniques such as deep learning. Representative applications across Earth observation, precision agriculture, biomedicine, industrial inspection, cultural heritage, and security are also discussed, emphasizing HSI's ability to uncover sub-visual features for advanced monitoring, diagnostics, and decision-making. Persistent challenges, such as hardware trade-offs, acquisition variability, and the complexity of high-dimensional data, are examined alongside emerging solutions, including computational imaging, physics-informed modeling, cross-modal fusion, and self-supervised learning. Best practices for dataset sharing, reproducibility, and metadata documentation are further highlighted to support transparency and reuse. Looking ahead, we explore future directions toward scalable, real-time, and embedded HSI systems, driven by sensor miniaturization, self-supervised learning, and foundation models. As HSI evolves into a general-purpose, cross-disciplinary platform, it holds promise for transformative applications in science, technology, and society.

Title: GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking

Authors: Xudong Han, Pengcheng Fang, Yueying Tian, Jianhui Yu, Xiaohao Cai, Daniel Roggen, Philip Birch
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08117
Pdf URL: https://arxiv.org/pdf/2508.08117
Copy Paste: [[2508.08117]] GRASPTrack: Geometry-Reasoned Association via Segmentation and Projection for Multi-Object Tracking(https://arxiv.org/abs/2508.08117)
Keywords: robust, segmentation
Abstract: Multi-object tracking (MOT) in monocular videos is fundamentally challenged by occlusions and depth ambiguity, issues that conventional tracking-by-detection (TBD) methods struggle to resolve owing to a lack of geometric awareness. To address these limitations, we introduce GRASPTrack, a novel depth-aware MOT framework that integrates monocular depth estimation and instance segmentation into a standard TBD pipeline to generate high-fidelity 3D point clouds from 2D detections, thereby enabling explicit 3D geometric reasoning. These 3D point clouds are then voxelized to enable a precise and robust Voxel-Based 3D Intersection-over-Union (IoU) for spatial association. To further enhance tracking robustness, our approach incorporates Depth-aware Adaptive Noise Compensation, which dynamically adjusts the Kalman filter process noise based on occlusion severity for more reliable state estimation. Additionally, we propose a Depth-enhanced Observation-Centric Momentum, which extends the motion direction consistency from the image plane into 3D space to improve motion-based association cues, particularly for objects with complex trajectories. Extensive experiments on the MOT17, MOT20, and DanceTrack benchmarks demonstrate that our method achieves competitive performance, significantly improving tracking robustness in complex scenes with frequent occlusions and intricate motion patterns.
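
Illustrative sketch (not taken from the paper): the abstract mentions voxelizing 3D point clouds to compute a voxel-based 3D IoU for association. A minimal version of that idea, assuming points are (x, y, z) tuples, a uniform cubic voxel grid, and a hypothetical `voxel_size` parameter, is:

```python
import math

def voxelize(points, voxel_size):
    """Map each 3D point to the integer index of the voxel that contains it."""
    return {
        tuple(int(math.floor(coord / voxel_size)) for coord in point)
        for point in points
    }

def voxel_iou(points_a, points_b, voxel_size=0.5):
    """Intersection-over-union of the occupied voxel sets of two point clouds."""
    voxels_a = voxelize(points_a, voxel_size)
    voxels_b = voxelize(points_b, voxel_size)
    union = voxels_a | voxels_b
    if not union:
        return 0.0
    return len(voxels_a & voxels_b) / len(union)

# Made-up clouds: each occupies two voxels, and they share exactly one.
cloud_a = [(0.1, 0.1, 0.1), (0.6, 0.1, 0.1)]
cloud_b = [(0.7, 0.2, 0.2), (1.2, 0.1, 0.1)]
iou = voxel_iou(cloud_a, cloud_b, voxel_size=0.5)
```

The score depends strongly on the voxel size: coarser voxels tolerate depth noise but blur object boundaries. How the paper chooses this resolution is not stated in the abstract.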

Title: Vision-Based Localization and LLM-based Navigation for Indoor Environments

Authors: Keyan Rahimi, Md. Wasiul Haque, Sagar Dasgupta, Mizanur Rahman
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.08120
Pdf URL: https://arxiv.org/pdf/2508.08120
Copy Paste: [[2508.08120]] Vision-Based Localization and LLM-based Navigation for Indoor Environments(https://arxiv.org/abs/2508.08120)
Keywords: robust, large language model
Abstract: Indoor navigation remains a complex challenge due to the absence of reliable GPS signals and the architectural intricacies of large enclosed environments. This study presents an indoor localization and navigation approach that integrates vision-based localization with large language model (LLM)-based navigation. The localization system utilizes a ResNet-50 convolutional neural network fine-tuned through a two-stage process to identify the user's position using smartphone camera input. To complement localization, the navigation module employs an LLM, guided by a carefully crafted system prompt, to interpret preprocessed floor plan images and generate step-by-step directions. Experimental evaluation was conducted in a realistic office corridor with repetitive features and limited visibility to test localization robustness. The model achieved high confidence and an accuracy of 96% across all tested waypoints, even under constrained viewing conditions and short-duration queries. Navigation tests using ChatGPT on real building floor maps yielded an average instruction accuracy of 75%, with observed limitations in zero-shot reasoning and inference time. This research demonstrates the potential for scalable, infrastructure-free indoor navigation using off-the-shelf cameras and publicly available floor plans, particularly in resource-constrained settings like hospitals, airports, and educational institutions.

Title: MemoryKT: An Integrative Memory-and-Forgetting Method for Knowledge Tracing

Authors: Mingrong Lin, Ke Deng, Zhengyang Wu, Zetao Zheng, Jie Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08122
Pdf URL: https://arxiv.org/pdf/2508.08122
Copy Paste: [[2508.08122]] MemoryKT: An Integrative Memory-and-Forgetting Method for Knowledge Tracing(https://arxiv.org/abs/2508.08122)
Keywords: interpretability
Abstract: Knowledge Tracing (KT) is committed to capturing students' knowledge mastery from their historical interactions. Simulating students' memory states is a promising approach to enhance both the performance and interpretability of knowledge tracing models. Memory consists of three fundamental processes: encoding, storage, and retrieval. Although forgetting primarily manifests during the storage stage, most existing studies rely on a single, undifferentiated forgetting mechanism, overlooking other memory processes as well as personalized forgetting patterns. To address this, this paper proposes MemoryKT, a knowledge tracing model based on a novel temporal variational autoencoder. The model simulates memory dynamics through a three-stage process: (i) learning the distribution of students' knowledge memory features, (ii) reconstructing their exercise feedback, and (iii) embedding a personalized forgetting module within the temporal workflow to dynamically modulate memory storage strength. This jointly models the complete encoding-storage-retrieval cycle, significantly enhancing the model's perception capability for individual differences. Extensive experiments on four public datasets demonstrate that our proposed approach significantly outperforms state-of-the-art baselines.

Title: A Physics-Driven Neural Network with Parameter Embedding for Generating Quantitative MR Maps from Weighted Images

Authors: Lingjing Chen (1 and 2), Chengxiu Zhang (1 and 2), Yinqiao Yi (1 and 2), Yida Wang (1 and 2), Yang Song (3), Xu Yan (3), Shengfang Xu (4), Dalin Zhu (4), Mengqiu Cao (3), Yan Zhou (5), Chenglong Wang (1 and 2), Guang Yang (1 and 2) ((1) Shanghai Key Laboratory of Magnetic Resonance, School of Physics and Electronic Science, East China Normal University, Shanghai, China, (2) Institute of Magnetic Resonance and Molecular Imaging in Medicine, East China Normal University, Shanghai, China, (3) MR Research Collaboration Team, Siemens Healthineers, Shanghai, China, (4) Department of Radiology, Gansu Provincial Maternity and Child-care Hospital, Lanzhou, China, (5) Department of Radiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08123
Pdf URL: https://arxiv.org/pdf/2508.08123
Copy Paste: [[2508.08123]] A Physics-Driven Neural Network with Parameter Embedding for Generating Quantitative MR Maps from Weighted Images(https://arxiv.org/abs/2508.08123)
Keywords: robust
Abstract: We propose a deep learning-based approach that integrates MRI sequence parameters to improve the accuracy and generalizability of quantitative image synthesis from clinical weighted MRI. Our physics-driven neural network embeds MRI sequence parameters -- repetition time (TR), echo time (TE), and inversion time (TI) -- directly into the model via parameter embedding, enabling the network to learn the underlying physical principles of MRI signal formation. The model takes conventional T1-weighted, T2-weighted, and T2-FLAIR images as input and synthesizes T1, T2, and proton density (PD) quantitative maps. Trained on healthy brain MR images, it was evaluated on both internal and external test datasets. The proposed method achieved high performance with PSNR values exceeding 34 dB and SSIM values above 0.92 for all synthesized parameter maps. It outperformed conventional deep learning models in accuracy and robustness, including data with previously unseen brain structures and lesions. Notably, our model accurately synthesized quantitative maps for these unseen pathological regions, highlighting its superior generalization capability. Incorporating MRI sequence parameters via parameter embedding allows the neural network to better learn the physical characteristics of MR signals, significantly enhancing the performance and reliability of quantitative MRI synthesis. This method shows great potential for accelerating qMRI and improving its clinical utility.

Title: Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks

Authors: Jakub Šmíd, Pavel Přibáň, Ondřej Pražák, Pavel Král
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08125
Pdf URL: https://arxiv.org/pdf/2508.08125
Copy Paste: [[2508.08125]] Czech Dataset for Complex Aspect-Based Sentiment Analysis Tasks(https://arxiv.org/abs/2508.08125)
Keywords: robust, extraction, transformer
Abstract: In this paper, we introduce a novel Czech dataset for aspect-based sentiment analysis (ABSA), which consists of 3.1K manually annotated reviews from the restaurant domain. The dataset is built upon the older Czech dataset, which contained only separate labels for the basic ABSA tasks such as aspect term extraction or aspect polarity detection. Unlike its predecessor, our new dataset is specifically designed for more complex tasks, e.g. target-aspect-category detection. These advanced tasks require a unified annotation format, seamlessly linking sentiment elements (labels) together. Our dataset follows the format of the well-known SemEval-2016 datasets. This design choice allows effortless application and evaluation in cross-lingual scenarios, ultimately fostering cross-language comparisons with equivalent counterpart datasets in other languages. The annotation process engaged two trained annotators, yielding an impressive inter-annotator agreement rate of approximately 90%. Additionally, we provide 24M reviews without annotations suitable for unsupervised learning. We present robust monolingual baseline results achieved with various Transformer-based models and insightful error analysis to supplement our contributions. Our code and dataset are freely available for non-commercial research purposes.

Title: Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models

Authors: Wenze Xu, Chun Wang, Jiazhen Yu, Sheng Chen, Liang Gao, Weihong Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08131
Pdf URL: https://arxiv.org/pdf/2508.08131
Copy Paste: [[2508.08131]] Optimal Transport Regularization for Speech Text Alignment in Spoken Language Models(https://arxiv.org/abs/2508.08131)
Keywords: large language model
Abstract: Spoken Language Models (SLMs), which extend Large Language Models (LLMs) to perceive speech inputs, have gained increasing attention for their potential to advance speech understanding tasks. However, despite recent progress, studies show that SLMs often struggle to generalize across datasets, even for trained languages and tasks, raising concerns about whether they process speech in a text-like manner as intended. A key challenge underlying this limitation is the modality gap between speech and text representations. The high variability in speech embeddings may allow SLMs to achieve strong in-domain performance by exploiting unintended speech variations, ultimately hindering generalization. To mitigate this modality gap, we introduce Optimal Transport Regularization (OTReg), a method that formulates speech-text alignment as an optimal transport problem and derives a regularization loss to improve SLM training. In each training iteration, OTReg first establishes a structured correspondence between speech and transcript embeddings by determining the optimal transport plan, then incorporates the regularization loss based on this transport plan to optimize SLMs in generating speech embeddings that align more effectively with transcript embeddings. OTReg is lightweight, requiring no additional labels or learnable parameters, and integrates seamlessly into existing SLM training procedures. Extensive multilingual ASR experiments demonstrate that OTReg enhances speech-text alignment, mitigates the modality gap, and consequently improves SLM generalization across diverse datasets.
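
Illustrative sketch (not taken from the paper): the abstract describes computing an optimal transport plan between speech and transcript embeddings and deriving a regularization loss from it. The snippet below uses entropy-regularized OT solved with Sinkhorn iterations, which is an assumption; the abstract does not say which solver, cost function, or marginals OTReg uses. Squared Euclidean cost, uniform marginals, and the names `sinkhorn_plan` and `transport_loss` are all hypothetical choices.

```python
import math

def squared_distance_cost(xs, ys):
    """Pairwise squared Euclidean distances between two lists of vectors."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]

def sinkhorn_plan(cost, eps=0.1, iters=200):
    """Entropy-regularized transport plan with uniform marginals (assumed)."""
    n, m = len(cost), len(cost[0])
    row_mass = [1.0 / n] * n
    col_mass = [1.0 / m] * m
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_loss(plan, cost):
    """Total transport cost under the plan, usable as a regularization term."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))

# Made-up 2-D embeddings: each speech vector sits near one transcript vector.
speech = [[0.0, 0.0], [1.0, 1.0]]
transcript = [[0.0, 0.1], [1.0, 0.9]]
cost = squared_distance_cost(speech, transcript)
plan = sinkhorn_plan(cost)
loss = transport_loss(plan, cost)
```

In a real training loop the embeddings would be tensors and the loss would be added to the main objective with some weight. This sketch adds no learnable parameters, in line with the abstract's description of the method as lightweight.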

Title: FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting

Authors: Yitong Yang, Yinglin Wang, Changshuo Wang, Huajie Wang, Shuting He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08136
Pdf URL: https://arxiv.org/pdf/2508.08136
Copy Paste: [[2508.08136]] FantasyStyle: Controllable Stylized Distillation for 3D Gaussian Splatting(https://arxiv.org/abs/2508.08136)
Keywords: diffusion, generative
Abstract: The success of 3DGS in generative and editing applications has sparked growing interest in 3DGS-based style transfer. However, current methods still face two major challenges: (1) multi-view inconsistency often leads to style conflicts, resulting in appearance smoothing and distortion; and (2) heavy reliance on VGG features, which struggle to disentangle style and content from style images, often causing content leakage and excessive stylization. To tackle these issues, we introduce FantasyStyle, a 3DGS-based style transfer framework, and the first to rely entirely on diffusion model distillation. It comprises two key components: (1) Multi-View Frequency Consistency. We enhance cross-view consistency by applying a 3D filter to multi-view noisy latents, selectively reducing low-frequency components to mitigate stylized prior conflicts. (2) Controllable Stylized Distillation. To suppress content leakage from style images, we introduce negative guidance to exclude undesired content. In addition, we identify the limitations of Score Distillation Sampling and Delta Denoising Score in 3D style transfer and remove the reconstruction term accordingly. Building on these insights, we propose a controllable stylized distillation that leverages negative guidance to more effectively optimize the 3D Gaussians. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, achieving higher stylization quality and visual realism across various scenes and styles.

Title: MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation

Authors: Pravallika Abbineni, Saoud Aldowaish, Colin Liechty, Soroosh Noorzad, Ali Ghazizadeh, Morteza Fayazi
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2508.08137
Pdf URL: https://arxiv.org/pdf/2508.08137
Copy Paste: [[2508.08137]] MuaLLM: A Multimodal Large Language Model Agent for Circuit Design Assistance with Hybrid Contextual Retrieval-Augmented Generation(https://arxiv.org/abs/2508.08137)
Keywords: large language model
Abstract: Conducting a comprehensive literature review is crucial for advancing circuit design methodologies. However, the rapid influx of state-of-the-art research, inconsistent data representation, and the complexity of optimizing circuit design objectives make this task significantly challenging. In this paper, we propose MuaLLM, an open-source multimodal Large Language Model (LLM) agent for circuit design assistance that integrates a hybrid Retrieval-Augmented Generation (RAG) framework with an adaptive vector database of circuit design research papers. Unlike conventional LLMs, the MuaLLM agent employs a Reason + Act (ReAct) workflow for iterative reasoning, goal-setting, and multi-step information retrieval. It functions as a question-answering design assistant, capable of interpreting complex queries and providing reasoned responses grounded in circuit literature. Its multimodal capabilities enable processing of both textual and visual data, facilitating more efficient and comprehensive analysis. The system dynamically adapts using intelligent search tools, automated document retrieval from the internet, and real-time database updates. Unlike conventional approaches constrained by model context limits, MuaLLM decouples retrieval from inference, enabling scalable reasoning over arbitrarily large corpora. At the maximum context length supported by standard LLMs, MuaLLM remains up to 10x less costly and 1.6x faster while maintaining the same accuracy. This allows rapid, no-human-in-the-loop database generation, overcoming the bottleneck of simulation-based dataset creation for circuits. To evaluate MuaLLM, we introduce two custom datasets: RAG-250, targeting retrieval and citation performance, and Reasoning-100 (Reas-100), focused on multistep reasoning in circuit design. MuaLLM achieves 90.1% recall on RAG-250, and 86.8% accuracy on Reas-100.

Title: Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models

Authors: Tianyi Zhou, Johanne Medina, Sanjay Chawla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08139
Pdf URL: https://arxiv.org/pdf/2508.08139
Copy Paste: [[2508.08139]] Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models(https://arxiv.org/abs/2508.08139)
Keywords: large language model
Abstract: Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation, which poses increasing risks in multi-turn or agentic applications where outputs may be reused as context. In this work, we investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses. We propose a reliability estimation that leverages token-level uncertainty to guide the aggregation of internal model representations. Specifically, we compute aleatoric and epistemic uncertainty from output logits to identify salient tokens and aggregate their hidden states into compact representations for response-level reliability prediction. Through controlled experiments on open QA benchmarks, we find that correct in-context information improves both answer accuracy and model confidence, while misleading context often induces confidently incorrect responses, revealing a misalignment between uncertainty and correctness. Our probing-based method captures these shifts in model behavior and improves the detection of unreliable outputs across multiple open-source LLMs. These results underscore the limitations of direct uncertainty signals and highlight the potential of uncertainty-guided probing for reliability-aware generation.

Title: Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective

Authors: Jun Wang, Zaifu Zhan, Qixin Zhang, Mingquan Lin, Meijia Song, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08140
Pdf URL: https://arxiv.org/pdf/2508.08140
Copy Paste: [[2508.08140]] Data-Efficient Biomedical In-Context Learning: A Diversity-Enhanced Submodular Perspective(https://arxiv.org/abs/2508.08140)
Keywords: robust, extraction, large language model
Abstract: Recent progress in large language models (LLMs) has leveraged their in-context learning (ICL) abilities to enable quick adaptation to unseen biomedical NLP tasks. By incorporating only a few input-output examples into prompts, LLMs can rapidly perform these new tasks. While the impact of these demonstrations on LLM performance has been extensively studied, most existing approaches prioritize representativeness over diversity when selecting examples from large corpora. To address this gap, we propose Dual-Div, a diversity-enhanced data-efficient framework for demonstration selection in biomedical ICL. Dual-Div employs a two-stage retrieval and ranking process: First, it identifies a limited set of candidate examples from a corpus by optimizing both representativeness and diversity (with optional annotation for unlabeled data). Second, it ranks these candidates against test queries to select the most relevant and non-redundant demonstrations. Evaluated on three biomedical NLP tasks (named entity recognition (NER), relation extraction (RE), and text classification (TC)) using LLaMA 3.1 and Qwen 2.5 for inference, along with three retrievers (BGE-Large, BMRetriever, MedCPT), Dual-Div consistently outperforms baselines, achieving up to 5% higher macro-F1 scores, while demonstrating robustness to prompt permutations and class imbalance. Our findings establish that diversity in initial retrieval is more critical than ranking-stage optimization, and limiting demonstrations to 3-5 examples maximizes performance efficiency.

Title: Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

Authors: Nicholas Klein, Hemlata Tak, James Fullwood, Krishna Regmi, Leonidas Spinoulas, Ganesh Sivaraman, Tianxiang Chen, Elie Khoury
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2508.08141
Pdf URL: https://arxiv.org/pdf/2508.08141
Copy Paste: [[2508.08141]] Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization(https://arxiv.org/abs/2508.08141)
Keywords: robust
Abstract: The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.

Title: REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation

Authors: Wentao Jiang, Xiang Feng, Zengmao Wang, Yong Luo, Pingbo Xu, Zhe Chen, Bo Du, Jing Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08149
Pdf URL: https://arxiv.org/pdf/2508.08149
Copy Paste: [[2508.08149]] REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation(https://arxiv.org/abs/2508.08149)
Keywords: robust, large language model
Abstract: Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as "dead ends", committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through principled distributional corrections. Our approach introduces two key innovations: (1) Mixed Sampling Strategy, which combines a novel probe sampling method with exploratory prompts to escape dead ends; and (2) Policy Correction Mechanism, which employs importance sampling to correct distribution shifts induced by mixed sampling, thereby mitigating gradient estimation bias. We evaluate it on seven question-answering benchmarks, and the experimental results show that REX-RAG achieves average performance gains of 5.1% on Qwen2.5-3B and 3.6% on Qwen2.5-7B over strong baselines, demonstrating competitive results across multiple datasets. The code is publicly available at this https URL.
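
Illustrative sketch (not taken from the paper): the abstract says importance sampling is used to correct the distribution shift that mixed sampling introduces. The toy example below shows the generic principle only. Trajectories are drawn from a behavior distribution that mixes the policy with a probe distribution, and each sample is reweighted by policy probability divided by behavior probability so that expectations still match the policy. The three-outcome distributions, the mixing weight `alpha`, and all function names are hypothetical; REX-RAG's actual estimator may differ.

```python
def mixture(policy, probe, alpha):
    """Behavior distribution: (1 - alpha) * policy + alpha * probe."""
    return {x: (1 - alpha) * policy[x] + alpha * probe.get(x, 0.0) for x in policy}

def importance_weights(policy, behavior):
    """Per-outcome ratio policy(x) / behavior(x)."""
    return {x: policy[x] / behavior[x] for x in policy}

def expectation(dist, values, weights=None):
    """Expected value under dist, optionally with importance weights applied."""
    if weights is None:
        weights = {x: 1.0 for x in dist}
    return sum(dist[x] * weights[x] * values[x] for x in dist)

# Made-up setup: the policy rarely reaches the rewarding path "c",
# while the probe distribution explores it often.
policy = {"a": 0.7, "b": 0.25, "c": 0.05}
probe = {"a": 0.1, "b": 0.1, "c": 0.8}
reward = {"a": 0.0, "b": 0.0, "c": 1.0}

behavior = mixture(policy, probe, alpha=0.5)
weights = importance_weights(policy, behavior)

on_policy_value = expectation(policy, reward)
uncorrected_value = expectation(behavior, reward)
corrected_value = expectation(behavior, reward, weights)
```

The uncorrected estimate overstates the reward because the mixture visits "c" more often than the policy does. Applying the weights recovers the on-policy value, which is the bias the abstract's correction mechanism is meant to mitigate.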

Title: FairFLRep: Fairness aware fault localization and repair of Deep Neural Networks

Authors: Moses Openja, Paolo Arcaini, Foutse Khomh, Fuyuki Ishikawa
Subjects: cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2508.08151
Pdf URL: https://arxiv.org/pdf/2508.08151
Copy Paste: [[2508.08151]] FairFLRep: Fairness aware fault localization and repair of Deep Neural Networks(https://arxiv.org/abs/2508.08151)
Keywords: fair
Abstract: Deep neural networks (DNNs) are being utilized in various aspects of our daily lives, including high-stakes decision-making applications that impact individuals. However, these systems reflect and amplify bias from the data used during training and testing, potentially resulting in biased behavior and inaccurate decisions, for instance different misclassification rates between white and black sub-populations. Effectively and efficiently identifying and correcting biased behavior in DNNs remains a challenge. This paper introduces FairFLRep, an automated fairness-aware fault localization and repair technique that identifies and corrects potentially bias-inducing neurons in DNN classifiers. FairFLRep focuses on adjusting neuron weights associated with sensitive attributes, such as race or gender, that contribute to unfair decisions. By analyzing the input-output relationships within the network, FairFLRep corrects neurons responsible for disparities in predictive quality parity. We evaluate FairFLRep on four image classification datasets using two DNN classifiers, and four tabular datasets with a DNN model. The results show that FairFLRep consistently outperforms existing methods in improving fairness while preserving accuracy. An ablation study confirms the importance of considering fairness during both fault localization and repair stages. Our findings also show that FairFLRep is more efficient than the baseline approaches in repairing the network.

Title: Federated Learning for Epileptic Seizure Prediction Across Heterogeneous EEG Datasets

Authors: Cem Ata Baykara, Saurav Raj Pandey, Ali Burak Ünal, Harlin Lee, Mete Akgün
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.08159
Pdf URL: https://arxiv.org/pdf/2508.08159
Copy Paste: [[2508.08159]] Federated Learning for Epileptic Seizure Prediction Across Heterogeneous EEG Datasets(https://arxiv.org/abs/2508.08159)
Keywords: privacy, robust, federate, fair
Abstract: Developing accurate and generalizable epileptic seizure prediction models from electroencephalography (EEG) data across multiple clinical sites is hindered by patient privacy regulations and significant data heterogeneity (non-IID characteristics). Federated Learning (FL) offers a privacy-preserving framework for collaborative training, but standard aggregation methods like Federated Averaging (FedAvg) can be biased by dominant datasets in heterogeneous settings. This paper investigates FL for seizure prediction using a single EEG channel across four diverse public datasets (Siena, CHB-MIT, Helsinki, NCH), representing distinct patient populations (adult, pediatric, neonate) and recording conditions. We implement privacy-preserving global normalization and propose a Random Subset Aggregation strategy, where each client trains on a fixed-size random subset of its data per round, ensuring equal contribution during aggregation. Our results show that locally trained models fail to generalize across sites, and standard weighted FedAvg yields highly skewed performance (e.g., 89.0% accuracy on CHB-MIT but only 50.8% on Helsinki and 50.6% on NCH). In contrast, Random Subset Aggregation significantly improves performance on under-represented clients (accuracy increases to 81.7% on Helsinki and 68.7% on NCH) and achieves a superior macro-average accuracy of 77.1% and pooled accuracy of 80.0% across all sites, demonstrating a more robust and fair global model. This work highlights the potential of balanced FL approaches for building effective and generalizable seizure prediction systems in realistic, heterogeneous multi-hospital environments while respecting data privacy.
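
Illustrative sketch (not taken from the paper): the abstract contrasts sample-size-weighted FedAvg, which can be dominated by large sites, with a Random Subset Aggregation strategy in which every client trains on a fixed-size random subset per round. The toy code below shows why equal subset sizes equalize each client's share in the weighted average. The one-parameter "models", the client sizes, and the helper names are made up for illustration, and local training is not modeled.

```python
import random

def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (standard FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]

def fixed_size_subset(samples, subset_size, rng):
    """Draw one client's per-round training subset (hypothetical helper)."""
    return rng.sample(samples, min(subset_size, len(samples)))

rng = random.Random(0)
clients = {"large_site": list(range(900)), "small_site": list(range(100))}
local_weights = [[1.0], [0.0]]  # made-up one-parameter local models

# Standard FedAvg weights each client by its full dataset size.
full_sizes = [len(data) for data in clients.values()]
skewed = fedavg(local_weights, full_sizes)

# With fixed-size subsets, every client contributes the same sample count.
subset_sizes = [len(fixed_size_subset(data, 100, rng)) for data in clients.values()]
balanced = fedavg(local_weights, subset_sizes)
```

Here `skewed` sits close to the large site's parameters, while `balanced` is the plain mean of the two clients. A subset size no larger than the smallest client's dataset is assumed; otherwise the smallest client would still contribute fewer samples.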

Title: ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction

Authors: Chaojun Ni, Guosheng Zhao, Xiaofeng Wang, Zheng Zhu, Wenkang Qin, Xinze Chen, Guanghong Jia, Guan Huang, Wenjun Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08170
Pdf URL: https://arxiv.org/pdf/2508.08170
Copy Paste: [[2508.08170]] ReconDreamer-RL: Enhancing Reinforcement Learning via Diffusion-based Scene Reconstruction(https://arxiv.org/abs/2508.08170)
Keywords: diffusion
Abstract: Reinforcement learning for training end-to-end autonomous driving models in closed-loop simulations is gaining growing attention. However, most simulation environments differ significantly from real-world conditions, creating a substantial simulation-to-reality (sim2real) gap. To bridge this gap, some approaches utilize scene reconstruction techniques to create photorealistic environments as a simulator. While this improves realistic sensor simulation, these methods are inherently constrained by the distribution of the training data, making it difficult to render high-quality sensor data for novel trajectories or corner case scenarios. Therefore, we propose ReconDreamer-RL, a framework designed to integrate video diffusion priors into scene reconstruction to aid reinforcement learning, thereby enhancing end-to-end autonomous driving training. Specifically, in ReconDreamer-RL, we introduce ReconSimulator, which combines the video diffusion prior for appearance modeling and incorporates a kinematic model for physical modeling, thereby reconstructing driving scenarios from real-world data. This narrows the sim2real gap for closed-loop evaluation and reinforcement learning. To cover more corner-case scenarios, we introduce the Dynamic Adversary Agent (DAA), which adjusts the trajectories of surrounding vehicles relative to the ego vehicle, autonomously generating corner-case traffic scenarios (e.g., cut-in). Finally, the Cousin Trajectory Generator (CTG) is proposed to address the issue of training data distribution, which is often biased toward simple straight-line movements. Experiments show that ReconDreamer-RL improves end-to-end autonomous driving training, outperforming imitation learning methods with a 5x reduction in the Collision Ratio.

Title: Neural Logic Networks for Interpretable Classification

Authors: Vincent Perreault, Katsumi Inoue, Richard Labib, Alain Hertz
Subjects: cs.LG, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2508.08172
Pdf URL: https://arxiv.org/pdf/2508.08172
Copy Paste: [[2508.08172]] Neural Logic Networks for Interpretable Classification(https://arxiv.org/abs/2508.08172)
Keywords: interpretability
Abstract: Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks, on the other hand, have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean network discovery and is able to learn relevant, interpretable rules in tabular classification, notably on an example from the medical field where interpretability has tangible value.

Title: CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data

Authors: Chongke Bi, Xin Gao, Jiangkang Deng, Guan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08173
Pdf URL: https://arxiv.org/pdf/2508.08173
Copy Paste: [[2508.08173]] CD-TVD: Contrastive Diffusion for 3D Super-Resolution with Scarce High-Resolution Time-Varying Data(https://arxiv.org/abs/2508.08173)
Keywords: diffusion
Abstract: Large-scale scientific simulations require significant resources to generate high-resolution time-varying data (TVD). While super-resolution is an efficient post-processing strategy to reduce costs, existing methods rely on a large amount of high-resolution (HR) training data, limiting their applicability to diverse simulation scenarios. To address this constraint, we propose CD-TVD, a novel framework that combines contrastive learning and an improved diffusion-based super-resolution model to achieve accurate 3D super-resolution from limited time-step high-resolution data. During pre-training on historical simulation data, the contrastive encoder and diffusion super-resolution modules learn degradation patterns and detailed features of high-resolution and low-resolution samples. In the training phase, the improved diffusion model with a local attention mechanism is fine-tuned using only one newly generated high-resolution timestep, leveraging the degradation knowledge learned by the encoder. This design minimizes the reliance on large-scale high-resolution datasets while maintaining the capability to recover fine-grained details. Experimental results on fluid and atmospheric simulation datasets confirm that CD-TVD delivers accurate and resource-efficient 3D super-resolution, marking a significant advancement in data augmentation for large-scale scientific simulations. The code is available at this https URL.

Title: MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision

Authors: Zhonghao Yan, Muxi Diao, Yuxuan Yang, Jiayuan Xu, Kaizhou Zhang, Ruoyan Jing, Lele Yang, Yanxi Liu, Kongming Liang, Zhanyu Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08177
Pdf URL: https://arxiv.org/pdf/2508.08177
Copy Paste: [[2508.08177]] MedReasoner: Reinforcement Learning Drives Reasoning Grounding from Clinical Thought to Pixel-Level Precision(https://arxiv.org/abs/2508.08177)
Keywords: large language model, segmentation
Abstract: Accurately grounding regions of interest (ROIs) is critical for diagnosis and treatment planning in medical imaging. While multimodal large language models (MLLMs) combine visual perception with natural language, current medical-grounding pipelines still rely on supervised fine-tuning with explicit spatial hints, making them ill-equipped to handle the implicit queries common in clinical practice. This work makes three core contributions. We first define Unified Medical Reasoning Grounding (UMRG), a novel vision-language task that demands clinical reasoning and pixel-level grounding. Second, we release U-MRG-14K, a dataset of 14K samples featuring pixel-level masks alongside implicit clinical queries and reasoning traces, spanning 10 modalities, 15 super-categories, and 108 specific categories. Finally, we introduce MedReasoner, a modular framework that distinctly separates reasoning from segmentation: an MLLM reasoner is optimized with reinforcement learning, while a frozen segmentation expert converts spatial prompts into masks, with alignment achieved through format and accuracy rewards. MedReasoner achieves state-of-the-art performance on U-MRG-14K and demonstrates strong generalization to unseen clinical queries, underscoring the significant promise of reinforcement learning for interpretable medical grounding.

Title: PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation

Authors: Sihan Zhao, Zixuan Wang, Tianyu Luan, Jia Jia, Wentao Zhu, Jiebo Luo, Junsong Yuan, Nan Xi
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2508.08179
Pdf URL: https://arxiv.org/pdf/2508.08179
Copy Paste: [[2508.08179]] PP-Motion: Physical-Perceptual Fidelity Evaluation for Human Motion Generation(https://arxiv.org/abs/2508.08179)
Keywords: robust
Abstract: Human motion generation has found widespread applications in AR/VR, film, sports, and medical rehabilitation, offering a cost-effective alternative to traditional motion capture systems. However, evaluating the fidelity of such generated motions is a crucial, multifaceted task. Although previous approaches have attempted to evaluate motion fidelity using human perception or physical constraints, there remains an inherent gap between human-perceived fidelity and physical feasibility. Moreover, the subjective and coarse binary labeling of human perception further undermines the development of a robust data-driven metric. We address these issues by introducing a physical labeling method. This method evaluates motion fidelity by calculating the minimum modifications needed for a motion to align with physical laws. With this approach, we are able to produce fine-grained, continuous physical alignment annotations that serve as objective ground truth. With these annotations, we propose PP-Motion, a novel data-driven metric to evaluate both physical and perceptual fidelity of human motion. To effectively capture underlying physical priors, we employ Pearson's correlation loss for the training of our metric. Additionally, by incorporating a human-based perceptual fidelity loss, our metric can capture fidelity that simultaneously considers both human perception and physical alignment. Experimental results demonstrate that our metric, PP-Motion, not only aligns with physical laws but also aligns better with human perception of motion fidelity than previous work.
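
The Pearson's correlation loss mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulation, so the form below (one minus the Pearson coefficient between predicted scores and annotations) is an assumption, as are the function names and the `eps` stabilizer.

```python
import math

def pearson(xs, ys, eps=1e-12):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # eps guards against division by zero when one sequence is constant.
    return cov / (math.sqrt(var_x * var_y) + eps)

def pearson_loss(predicted, annotated):
    """Assumed loss form: 0 for perfect positive correlation, 2 for perfect negative."""
    return 1.0 - pearson(predicted, annotated)

# Hypothetical metric outputs vs. continuous physical-alignment annotations.
print(pearson_loss([0.1, 0.4, 0.9], [1.0, 4.0, 9.0]))   # close to 0
print(pearson_loss([0.1, 0.4, 0.9], [9.0, 4.0, 1.0]))   # close to 2
```

A correlation-based loss of this kind rewards getting the ordering and relative spacing of scores right, without requiring the metric's output scale to match the annotation scale.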

Title: THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening

Authors: Hongkun Jin, Hongcheng Jiang, Zejun Zhang, Yuan Zhang, Jia Fu, Tingfeng Li, Kai Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08183
Pdf URL: https://arxiv.org/pdf/2508.08183
Copy Paste: [[2508.08183]] THAT: Token-wise High-frequency Augmentation Transformer for Hyperspectral Pansharpening(https://arxiv.org/abs/2508.08183)
Keywords: transformer
Abstract: Transformer-based methods have demonstrated strong potential in hyperspectral pansharpening by modeling long-range dependencies. However, their effectiveness is often limited by redundant token representations and a lack of multi-scale feature modeling. Hyperspectral images exhibit intrinsic spectral priors (e.g., abundance sparsity) and spatial priors (e.g., non-local similarity), which are critical for accurate reconstruction. From a spectral-spatial perspective, Vision Transformers (ViTs) face two major limitations: they struggle to preserve high-frequency components (such as material edges and texture transitions) and suffer from attention dispersion across redundant tokens. These issues stem from the global self-attention mechanism, which tends to dilute high-frequency signals and overlook localized details. To address these challenges, we propose the Token-wise High-frequency Augmentation Transformer (THAT), a novel framework designed to enhance hyperspectral pansharpening through improved high-frequency feature representation and token selection. Specifically, THAT introduces: (1) Pivotal Token Selective Attention (PTSA) to prioritize informative tokens and suppress redundancy; (2) a Multi-level Variance-aware Feed-forward Network (MVFN) to enhance high-frequency detail learning. Experiments on standard benchmarks show that THAT achieves state-of-the-art performance with improved reconstruction quality and efficiency. The source code is available at this https URL.

Title: KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning

Authors: Md Meftahul Ferdaus, Mahdi Abdelguerfi, Elias Ioup, Steven Sloan, Kendall N. Niles, Ken Pathak
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08186
Pdf URL: https://arxiv.org/pdf/2508.08186
Copy Paste: [[2508.08186]] KARMA: Efficient Structural Defect Segmentation via Kolmogorov-Arnold Representation Learning(https://arxiv.org/abs/2508.08186)
Keywords: segmentation
Abstract: Semantic segmentation of structural defects in civil infrastructure remains challenging due to variable defect appearances, harsh imaging conditions, and significant class imbalance. Current deep learning methods, despite their effectiveness, typically require millions of parameters, rendering them impractical for real-time inspection systems. We introduce KARMA (Kolmogorov-Arnold Representation Mapping Architecture), a highly efficient semantic segmentation framework that models complex defect patterns through compositions of one-dimensional functions rather than conventional convolutions. KARMA features three technical innovations: (1) a parameter-efficient Tiny Kolmogorov-Arnold Network (TiKAN) module leveraging low-rank factorization for KAN-based feature transformation; (2) an optimized feature pyramid structure with separable convolutions for multi-scale defect analysis; and (3) a static-dynamic prototype mechanism that enhances feature representation for imbalanced classes. Extensive experiments on benchmark infrastructure inspection datasets demonstrate that KARMA achieves competitive or superior mean IoU performance compared to state-of-the-art approaches, while using significantly fewer parameters (0.959M vs. 31.04M, a 97% reduction). Operating at 0.264 GFLOPS, KARMA maintains inference speeds suitable for real-time deployment, enabling practical automated infrastructure inspection systems without compromising accuracy. The source code can be accessed at the following URL: this https URL.
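
The quoted parameter reduction can be checked directly, and the effect of low-rank factorization on parameter count can be shown with a generic calculation. The layer dimensions and rank below are hypothetical and are not taken from the paper; the sketch only shows why factorizing a dense weight matrix into two thin matrices shrinks it.

```python
# Parameter counts quoted in the abstract, in millions.
karma_params_m = 0.959
baseline_params_m = 31.04

reduction = 1.0 - karma_params_m / baseline_params_m
print(f"{reduction:.1%}")  # consistent with the roughly 97% reduction quoted

def dense_params(d_in, d_out):
    """Parameters in a full d_in x d_out weight matrix."""
    return d_in * d_out

def low_rank_params(d_in, d_out, rank):
    """Parameters when the matrix is factorized as (d_in x rank) @ (rank x d_out)."""
    return rank * (d_in + d_out)

# Hypothetical layer size and rank, for illustration only.
print(dense_params(256, 256), low_rank_params(256, 256, 8))
```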

Title: Reinforcement Learning in Vision: A Survey

Authors: Weijia Wu, Chen Gao, Joya Chen, Kevin Qinghong Lin, Qingwei Meng, Yiming Zhang, Yuke Qiu, Hong Zhou, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08189
Pdf URL: https://arxiv.org/pdf/2508.08189
Copy Paste: [[2508.08189]] Reinforcement Learning in Vision: A Survey(https://arxiv.org/abs/2508.08189)
Keywords: diffusion, large language model
Abstract: Recent advances at the intersection of reinforcement learning (RL) and visual intelligence have enabled agents that not only perceive complex visual scenes but also reason, generate, and act within them. This survey offers a critical and up-to-date synthesis of the field. We first formalize visual RL problems and trace the evolution of policy-optimization strategies from RLHF to verifiable reward paradigms, and from Proximal Policy Optimization to Group Relative Policy Optimization. We then organize more than 200 representative works into four thematic pillars: multi-modal large language models, visual generation, unified model frameworks, and vision-language-action models. For each pillar we examine algorithmic design, reward engineering, benchmark progress, and we distill trends such as curriculum-driven training, preference-aligned diffusion, and unified reward modeling. Finally, we review evaluation protocols spanning set-level fidelity, sample-level preference, and state-level stability, and we identify open challenges that include sample efficiency, generalization, and safe deployment. Our goal is to provide researchers and practitioners with a coherent map of the rapidly expanding landscape of visual RL and to highlight promising directions for future inquiry. Resources are available at: this https URL.

Title: Differential Privacy for Regulatory Compliance in Cyberattack Detection on Critical Infrastructure Systems

Authors: Paritosh Ramanan, H.M. Mohaimanul Islam, Abhiram Reddy Alugula
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.08190
Pdf URL: https://arxiv.org/pdf/2508.08190
Copy Paste: [[2508.08190]] Differential Privacy for Regulatory Compliance in Cyberattack Detection on Critical Infrastructure Systems(https://arxiv.org/abs/2508.08190)
Keywords: privacy, protect, attack, robust
Abstract: Industrial control systems are a fundamental component of critical infrastructure networks (CIN) such as gas, water and power. With the growing risk of cyberattacks, regulatory compliance requirements are also increasing for large-scale critical infrastructure systems comprising multiple utility stakeholders. The primary goal of regulators is to ensure overall system stability with recourse to trustworthy stakeholder attack detection. However, adhering to compliance requirements also requires stakeholders to disclose sensor and control data to regulators, raising privacy concerns. In this paper, we present a cyberattack detection framework that utilizes differentially private (DP) hypothesis tests geared towards enhancing regulatory confidence while alleviating privacy concerns of CIN stakeholders. The hallmark of our approach is a two-phase privacy scheme that protects the privacy of covariance, as well as the associated sensor driven test statistics computed as a means to generate alarms. Theoretically, we show that our method induces a misclassification error rate comparable to the non-DP cases while delivering robust privacy guarantees. With the help of real-world datasets, we show the reliability of our DP-detection outcomes for a wide variety of attack scenarios for interdependent stakeholders.

Title: Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions

Authors: Bangsheng Tang, Carl Chengyan Fu, Fei Kou, Grigory Sizov, Haoci Zhang, Jason Park, Jiawen Liu, Jie You, Qirui Yang, Sachin Mehta, Shengyong Cai, Xiaodong Wang, Xingyu Liu, Yunlu Li, Yanjun Zhou, Wei Wei, Zhiwei Zhao, Zixi Qi, Adolfo Victoria, Aya Ibrahim, Bram Wasti, Changkyu Kim, Daniel Haziza, Fei Sun, Giancarlo Delfin, Emily Guo, Jialin Ouyang, Jaewon Lee, Jianyu Huang, Jeremy Reizenstein, Lu Fang, Quinn Zhu, Ria Verma, Vlad Mihailescu, Xingwen Guo, Yan Cui, Ye Hu, Yejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.08192
Pdf URL: https://arxiv.org/pdf/2508.08192
Copy Paste: [[2508.08192]] Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions(https://arxiv.org/abs/2508.08192)
Keywords: large language model
Abstract: Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPU. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at a production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable us to achieve a speed-up for large batch sizes between 1.4x and 2.0x at production scale.
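
For readers unfamiliar with the draft-then-verify idea behind speculative decoding, a toy greedy version is sketched below. It is a generic illustration, not EAGLE and not the production system described above: the two "models" are hypothetical functions over integer tokens, and the verification loop runs sequentially where a real system would verify all drafted positions in one batched forward pass.

```python
def greedy_decode(next_token, prompt, num_new_tokens):
    """Reference decoder: call the model once per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, num_new_tokens, draft_len=4):
    """Greedy draft-then-verify decoding; output equals greedy decoding with the target."""
    tokens = list(prompt)
    goal = len(tokens) + num_new_tokens
    accepted = 0
    while len(tokens) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2) The target model checks the drafted tokens in order.
        for drafted in proposal:
            if len(tokens) >= goal:
                break
            verified = target_next(tokens)
            tokens.append(verified)      # always keep the target's own token
            if verified != drafted:
                break                    # later drafted tokens are now invalid
            accepted += 1
    return tokens, accepted

# Hypothetical toy models: the target counts upward mod 10, the draft errs after a 4.
def toy_target(tokens):
    return (tokens[-1] + 1) % 10

def toy_draft(tokens):
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10

output, num_accepted = speculative_decode(toy_target, toy_draft, [0], 12)
print(output == greedy_decode(toy_target, [0], 12), num_accepted)
```

The speed-up in practice comes from accepted draft tokens, each of which saves a sequential pass through the large model; the sketch counts them but does not model latency.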

Title: Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model

Authors: Peiqi He, Zhenhao Zhang, Yixiang Zhang, Xiongjun Zhao, Shaoliang Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08199
Pdf URL: https://arxiv.org/pdf/2508.08199
Copy Paste: [[2508.08199]] Spatial-ORMLLM: Improve Spatial Relation Understanding in the Operating Room with Multimodal Large Language Model(https://arxiv.org/abs/2508.08199)
Keywords: robust, large language model
Abstract: Precise spatial modeling in the operating room (OR) is foundational to many clinical tasks, supporting intraoperative awareness, hazard avoidance, and surgical decision-making. While existing approaches leverage large-scale multimodal datasets for latent-space alignment to implicitly learn spatial relationships, they overlook the 3D capabilities of MLLMs. However, this approach raises two issues: (1) Operating rooms typically lack multiple video and audio sensors, making multimodal 3D data difficult to obtain; (2) Training solely on readily available 2D data fails to capture fine-grained details in complex scenes. To address this gap, we introduce Spatial-ORMLLM, the first large vision-language model for 3D spatial reasoning in operating rooms using only RGB modality to infer volumetric and semantic cues, enabling downstream medical tasks with detailed and holistic spatial context. Spatial-ORMLLM incorporates a Spatial-Enhanced Feature Fusion Block, which integrates 2D modality inputs with rich 3D spatial knowledge extracted by the estimation algorithm and then feeds the combined features into the visual tower. By employing a unified end-to-end MLLM framework, it combines powerful spatial features with textual features to deliver robust 3D scene reasoning without any additional expert annotations or sensor inputs. Experiments on multiple benchmark clinical datasets demonstrate that Spatial-ORMLLM achieves state-of-the-art performance and generalizes robustly to previously unseen surgical scenarios and downstream tasks.

Title: Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models

Authors: Kyle Moore, Jesse Roberts, Daryl Watson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08204
Pdf URL: https://arxiv.org/pdf/2508.08204
Copy Paste: [[2508.08204]] Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models(https://arxiv.org/abs/2508.08204)
Keywords: large language model
Abstract: There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulate user trust. Inference time uncertainty, which may provide a real-time signal to the model or external control modules, is particularly important for applying these concepts to improve LLM-user experience in practice. While many of the existing papers consider model calibration, comparatively little work has sought to evaluate how closely model uncertainty aligns to human uncertainty. In this work, we evaluate a collection of inference-time uncertainty measures, using both established metrics and novel variations, to determine how closely they align with both human group-level uncertainty and traditional notions of model calibration. We find that numerous measures show evidence of strong alignment to human uncertainty, despite the lack of alignment to human answer preference. For those successful metrics, we find moderate to strong evidence of model calibration in terms of both correctness correlation and distributional analysis.
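
The abstract does not list the specific uncertainty measures it evaluates. As one plausible example of an inference-time measure, the sketch below computes the Shannon entropy of a model's answer distribution and of a human group's answer distribution on a multiple-choice question. The choice of entropy and all probabilities shown are assumptions for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability options contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical answer distributions over four options (illustrative values).
model_probs = [0.70, 0.20, 0.05, 0.05]
human_probs = [0.40, 0.40, 0.10, 0.10]

print(entropy(model_probs))   # model's uncertainty on this question
print(entropy(human_probs))   # group-level human uncertainty on this question
```

Alignment in the sense described above would then be assessed across many questions, for instance by correlating the two entropy series, rather than on a single item.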

Title: SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling

Authors: Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Shikun Zhang, Wei Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.08211
Pdf URL: https://arxiv.org/pdf/2508.08211
Copy Paste: [[2508.08211]] SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling(https://arxiv.org/abs/2508.08211)
Keywords: watermark
Abstract: Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality through sampling LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
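
The selection step described above (keep the generated output whose deterministic feature statistic best matches a key-derived target) can be sketched in a few lines. This is a toy under strong assumptions: a SHA-256 hash stands in for the paper's SAE-based feature extractor, the candidates are fixed strings rather than sampled LLM outputs, and all function names are hypothetical. A hash is not a robust text feature; it is used only to keep the example self-contained.

```python
import hashlib

def unit_hash(data):
    """Map a string deterministically to a float in [0, 1)."""
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def feature(text):
    """Stand-in for a deterministic feature statistic extracted from the text."""
    return unit_hash("feature:" + text)

def target_for(key, position):
    """Key-derived target value for a given text segment position."""
    return unit_hash(f"key:{key}:{position}")

def detect_score(text, key, position=0):
    """Distance between the text's feature and the key's target (lower = more aligned)."""
    return abs(feature(text) - target_for(key, position))

def select_watermarked(candidates, key, position=0):
    """Among sampled candidates, keep the one whose feature best matches the target."""
    return min(candidates, key=lambda text: detect_score(text, key, position))

# Hypothetical candidate continuations that an LLM API might return.
candidates = [
    "The results were stable across runs.",
    "Across runs, the results stayed stable.",
    "Results remained stable over repeated runs.",
    "Repeated runs gave stable results.",
]
chosen = select_watermarked(candidates, key="user-42")
print(chosen)
```

Because the logits are never touched, more candidates (more inference-time compute) give a better chance of finding an output close to the target, which is the trade-off the abstract's theoretical guarantees address.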

Title: Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion

Authors: Nicole Lai-Tan, Xiao Gu, Marios G. Philiastides, Fani Deligianni
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2508.08216
Pdf URL: https://arxiv.org/pdf/2508.08216
Copy Paste: [[2508.08216]] Cross-Subject and Cross-Montage EEG Transfer Learning via Individual Tangent Space Alignment and Spatial-Riemannian Feature Fusion(https://arxiv.org/abs/2508.08216)
Keywords: robust
Abstract: Personalised music-based interventions offer a powerful means of supporting motor rehabilitation by dynamically tailoring auditory stimuli to provide external timekeeping cues, modulate affective states, and stabilise gait patterns. Generalisable Brain-Computer Interfaces (BCIs) thus hold promise for adapting these interventions across individuals. However, inter-subject variability in EEG signals, further compounded by movement-induced artefacts and motor planning differences, hinders the generalisability of BCIs and results in lengthy calibration processes. We propose Individual Tangent Space Alignment (ITSA), a novel pre-alignment strategy incorporating subject-specific recentering, distribution matching, and supervised rotational alignment to enhance cross-subject generalisation. Our hybrid architecture fuses Regularised Common Spatial Patterns (RCSP) with Riemannian geometry in parallel and sequential configurations, improving class separability while maintaining the geometric structure of covariance matrices for robust statistical computation. Using leave-one-subject-out cross-validation, ITSA demonstrates significant performance improvements across subjects and conditions. The parallel fusion approach shows the greatest enhancement over its sequential counterpart, with robust performance maintained across varying data conditions and electrode configurations. The code will be made publicly available at the time of publication.

Title: SAGOnline: Segment Any Gaussians Online

Authors: Wentao Sun, Quanyun Wu, Hanqing Xu, Kyle Gao, Zhengsen Xu, Yiping Chen, Dedong Zhang, Lingfei Ma, John S. Zelek, Jonathan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08219
Pdf URL: https://arxiv.org/pdf/2508.08219
Copy Paste: [[2508.08219]] SAGOnline: Segment Any Gaussians Online(https://arxiv.org/abs/2508.08219)
Keywords: robust, segmentation
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful paradigm for explicit 3D scene representation, yet achieving efficient and consistent 3D segmentation remains challenging. Current methods suffer from prohibitive computational costs, limited 3D spatial reasoning, and an inability to track multiple objects simultaneously. We present Segment Any Gaussians Online (SAGOnline), a lightweight and zero-shot framework for real-time 3D segmentation in Gaussian scenes that addresses these limitations through two key innovations: (1) a decoupled strategy that integrates video foundation models (e.g., SAM2) for view-consistent 2D mask propagation across synthesized views; and (2) a GPU-accelerated 3D mask generation and Gaussian-level instance labeling algorithm that assigns unique identifiers to 3D primitives, enabling lossless multi-object tracking and segmentation across views. SAGOnline achieves state-of-the-art performance on NVOS (92.7% mIoU) and Spin-NeRF (95.2% mIoU) benchmarks, outperforming Feature3DGS, OmniSeg3D-gs, and SA3D by 15 to 1500 times in inference speed (27 ms/frame). Qualitative results demonstrate robust multi-object segmentation and tracking in complex scenes. Our contributions include: (i) a lightweight and zero-shot framework for 3D segmentation in Gaussian scenes, (ii) explicit labeling of Gaussian primitives enabling simultaneous segmentation and tracking, and (iii) the effective adaptation of 2D video foundation models to the 3D domain. This work allows real-time rendering and 3D scene understanding, paving the way for practical AR/VR and robotic applications.

Title: Learning User Preferences for Image Generation Model

Authors: Wenyi Mo, Ying Ba, Tianyu Zhang, Yalong Bai, Biye Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08220
Pdf URL: https://arxiv.org/pdf/2508.08220
Copy Paste: [[2508.08220]] Learning User Preferences for Image Generation Model(https://arxiv.org/abs/2508.08220)
Keywords: large language model
Abstract: User preference prediction requires a comprehensive and accurate understanding of individual tastes. This includes both surface-level attributes, such as color and style, and deeper content-related aspects, such as themes and composition. However, existing methods typically rely on general human preferences or assume static user profiles, often neglecting individual variability and the dynamic, multifaceted nature of personal taste. To address these limitations, we propose an approach built upon Multimodal Large Language Models, introducing contrastive preference loss and preference tokens to learn personalized user preferences from historical interactions. The contrastive preference loss is designed to effectively distinguish between user "likes" and "dislikes", while the learnable preference tokens capture shared interest representations among existing users, enabling the model to activate group-specific preferences and enhance consistency across similar users. Extensive experiments demonstrate our model outperforms other methods in preference prediction accuracy, effectively identifying users with similar aesthetic inclinations and providing more precise guidance for generating images that align with individual tastes. The project page is this https URL.

Title: Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent

Authors: Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi
Subjects: cs.LG, cs.AI, cs.IT, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2508.08222
Pdf URL: https://arxiv.org/pdf/2508.08222
Copy Paste: [[2508.08222]] Multi-head Transformers Provably Learn Symbolic Multi-step Reasoning via Gradient Descent(https://arxiv.org/abs/2508.08222)
Keywords: transformer
Abstract: Transformers have demonstrated remarkable capabilities in multi-step reasoning tasks. However, understanding of the underlying mechanisms by which they acquire these abilities through training remains limited, particularly from a theoretical standpoint. This work investigates how transformers learn to solve symbolic multi-step reasoning problems through chain-of-thought processes, focusing on path-finding in trees. We analyze two intertwined tasks: a backward reasoning task, where the model outputs a path from a goal node to the root, and a more complex forward reasoning task, where the model implements two-stage reasoning by first identifying the goal-to-root path and then reversing it to produce the root-to-goal path. Our theoretical analysis, grounded in the dynamics of gradient descent, shows that trained one-layer transformers can provably solve both tasks with generalization guarantees to unseen trees. In particular, our multi-phase training dynamics for forward reasoning elucidate how different attention heads learn to specialize and coordinate autonomously to solve the two subtasks in a single autoregressive path. These results provide a mechanistic explanation of how trained transformers can implement sequential algorithmic procedures. Moreover, they offer insights into the emergence of reasoning abilities, suggesting that when tasks are structured to take intermediate chain-of-thought steps, even shallow multi-head transformers can effectively solve problems that would otherwise require deeper architectures.
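
The two symbolic tasks described above have simple algorithmic references, sketched below. This shows what the tasks compute, not how a transformer learns them; the child-to-parent dictionary encoding and the example tree are assumptions made for illustration.

```python
def backward_path(parent, goal):
    """Backward reasoning: follow parent pointers from the goal up to the root."""
    path = [goal]
    while path[-1] in parent:          # the root is the only node without a parent
        path.append(parent[path[-1]])
    return path

def forward_path(parent, goal):
    """Forward reasoning in two stages: find the goal-to-root path, then reverse it."""
    return list(reversed(backward_path(parent, goal)))

# Hypothetical tree encoded as child -> parent; "a" is the root.
parent = {"b": "a", "c": "a", "d": "b", "e": "b"}

print(backward_path(parent, "e"))  # ['e', 'b', 'a']
print(forward_path(parent, "e"))   # ['a', 'b', 'e']
```

Each appended node corresponds to one intermediate chain-of-thought step, which is why the forward task needs roughly twice as many steps as the backward one.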

Title: Capabilities of GPT-5 on Multimodal Medical Reasoning

Authors: Shansong Wang, Mingzhe Hu, Qiang Li, Mojtaba Safari, Xiaofeng Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08224
Pdf URL: https://arxiv.org/pdf/2508.08224
Copy Paste: [[2508.08224]] Capabilities of GPT-5 on Multimodal Medical Reasoning(https://arxiv.org/abs/2508.08224)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have enabled general-purpose systems to perform increasingly complex domain-specific reasoning without extensive fine-tuning. In the medical domain, decision-making often requires integrating heterogeneous information sources, including patient narratives, structured data, and medical images. This study positions GPT-5 as a generalist multimodal reasoner for medical decision support and systematically evaluates its zero-shot chain-of-thought reasoning performance on both text-based question answering and visual question answering tasks under a unified protocol. We benchmark GPT-5, GPT-5-mini, GPT-5-nano, and GPT-4o-2024-11-20 against standardized splits of MedQA, MedXpertQA (text and multimodal), MMLU medical subsets, USMLE self-assessment exams, and VQA-RAD. Results show that GPT-5 consistently outperforms all baselines, achieving state-of-the-art accuracy across all QA benchmarks and delivering substantial gains in multimodal reasoning. On MedXpertQA MM, GPT-5 improves reasoning and understanding scores by +29.62% and +36.18% over GPT-4o, respectively, and surpasses pre-licensed human experts by +24.23% in reasoning and +29.40% in understanding. In contrast, GPT-4o remains below human expert performance in most dimensions. A representative case study demonstrates GPT-5's ability to integrate visual and textual cues into a coherent diagnostic reasoning chain, recommending appropriate high-stakes interventions. Our results show that, on these controlled multimodal reasoning benchmarks, GPT-5 moves from human-comparable to above human-expert performance. This improvement may substantially inform the design of future clinical decision-support systems.

Title: OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution

Authors: Zhiqiang Wu, Zhaomang Sun, Tong Zhou, Bingtao Fu, Ji Cong, Yitong Dong, Huaqi Zhang, Xuan Tang, Mingsong Chen, Xian Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08227
Pdf URL: https://arxiv.org/pdf/2508.08227
Copy Paste: [[2508.08227]] OMGSR: You Only Need One Mid-timestep Guidance for Real-World Image Super-Resolution(https://arxiv.org/abs/2508.08227)
Keywords: diffusion, generative
Abstract: Denoising Diffusion Probabilistic Models (DDPM) and Flow Matching (FM) generative models show promising potential for one-step Real-World Image Super-Resolution (Real-ISR). Recent one-step Real-ISR models typically inject a Low-Quality (LQ) image latent distribution at the initial timestep. However, a fundamental gap exists between the LQ image latent distribution and the Gaussian noisy latent distribution, limiting the effective utilization of generative priors. We observe that the noisy latent distribution at DDPM/FM mid-timesteps aligns more closely with the LQ image latent distribution. Based on this insight, we present One Mid-timestep Guidance Real-ISR (OMGSR), a universal framework applicable to DDPM/FM-based generative models. OMGSR injects the LQ image latent distribution at a pre-computed mid-timestep, incorporating the proposed Latent Distribution Refinement loss to alleviate the latent distribution gap. We also design the Overlap-Chunked LPIPS/GAN loss to eliminate checkerboard artifacts in image generation. Within this framework, we instantiate OMGSR for DDPM/FM-based generative models with two variants: OMGSR-S (SD-Turbo) and OMGSR-F (FLUX.1-dev). Experimental results demonstrate that OMGSR-S/F achieves balanced/excellent performance across quantitative and qualitative metrics at 512-resolution. Notably, OMGSR-F establishes overwhelming dominance in all reference metrics. We further train a 1k-resolution OMGSR-F to match the default resolution of FLUX.1-dev, which yields excellent results, especially in the details of the image generation. We also generate 2k-resolution images by the 1k-resolution OMGSR-F using our two-stage Tiled VAE & Diffusion.

Title: Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Authors: Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.08236
Pdf URL: https://arxiv.org/pdf/2508.08236
Copy Paste: [[2508.08236]] Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge(https://arxiv.org/abs/2508.08236)
Keywords: explainability
Abstract: Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
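
Illustrative sketch (not from the paper): binary point-wise scoring across several dimensions can be pictured as one 0/1 judgment per dimension, summed into a response score and compared with expert labels. The dimension names, the function names, and the simple per-dimension agreement rate below are assumptions for this example; the abstract does not state the actual dimensions or the agreement metric used.

```python
from typing import Dict, List


def total_score(judgment: Dict[str, int]) -> int:
    """Sum binary (0 or 1) scores over all safety dimensions of one response."""
    for dim, value in judgment.items():
        if value not in (0, 1):
            raise ValueError(f"dimension {dim!r} must be scored 0 or 1")
    return sum(judgment.values())


def agreement_rate(judge: List[Dict[str, int]], expert: List[Dict[str, int]]) -> float:
    """Fraction of (response, dimension) pairs where judge and expert agree."""
    matches = 0
    total = 0
    for j, e in zip(judge, expert):
        for dim in e:
            total += 1
            matches += int(j[dim] == e[dim])
    return matches / total


# Hypothetical dimension names and labels, invented for illustration only.
judge_labels = [
    {"empathy": 1, "risk_check": 1, "referral": 0},
    {"empathy": 1, "risk_check": 0, "referral": 0},
]
expert_labels = [
    {"empathy": 1, "risk_check": 1, "referral": 1},
    {"empathy": 1, "risk_check": 0, "referral": 0},
]
print(total_score(judge_labels[0]), agreement_rate(judge_labels, expert_labels))
```

Keeping each dimension as a separate 0/1 value is what makes a score traceable: a disagreement can be attributed to a specific dimension rather than to an opaque overall rating.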

Title: Cut2Next: Generating Next Shot via In-Context Tuning

Authors: Jingwen He, Hongbo Liu, Jiajun Li, Ziqi Huang, Yu Qiao, Wanli Ouyang, Ziwei Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.08244
Pdf URL: https://arxiv.org/pdf/2508.08244
Copy Paste: [[2508.08244]] Cut2Next: Generating Next Shot via In-Context Tuning(https://arxiv.org/abs/2508.08244)
Keywords: diffusion, transformer
Abstract: Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define overall context and inter-shot editing styles. Individual Prompts then specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.

Title: StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

Authors: Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08248
Pdf URL: https://arxiv.org/pdf/2508.08248
Copy Paste: [[2508.08248]] StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation(https://arxiv.org/abs/2508.08248)
Keywords: diffusion, transformer
Abstract: Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main reason preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then directly injected into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, leading the latent distribution of subsequent segments to gradually drift away from the optimal distribution. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism to further enhance the audio synchronization by leveraging the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of the infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks show the effectiveness of StableAvatar both qualitatively and quantitatively.
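
Illustrative sketch (not from the paper): a generic way to fuse overlapping windows over time is a weighted average in which each window contributes most near its center. The abstract does not specify how StableAvatar computes its weights, so the fixed triangular weights, the function name, and the list-of-vectors latent format below are all assumptions; this shows only the general overlap-blending idea, not the paper's dynamic weighting.

```python
from typing import List

Frame = List[float]  # one latent vector per frame (assumed format)


def fuse_windows(windows: List[List[Frame]], stride: int) -> List[Frame]:
    """Blend equal-length overlapping windows whose starts are `stride` frames apart."""
    w_len = len(windows[0])
    dim = len(windows[0][0])
    if not 1 <= stride <= w_len:
        raise ValueError("stride must be between 1 and the window length")
    total = stride * (len(windows) - 1) + w_len
    acc = [[0.0] * dim for _ in range(total)]
    weight_sum = [0.0] * total
    # Triangular weights: largest at the window center, strictly positive at the edges.
    ramp = [float(min(k + 1, w_len - k)) for k in range(w_len)]
    for i, window in enumerate(windows):
        start = i * stride
        for k, frame in enumerate(window):
            t = start + k
            weight_sum[t] += ramp[k]
            for d in range(dim):
                acc[t][d] += ramp[k] * frame[d]
    # Normalize so every output frame is a weighted average of the windows covering it.
    return [[v / weight_sum[t] for v in acc[t]] for t in range(total)]


# Two hypothetical 4-frame windows of 1-D latents, overlapping by 2 frames.
first = [[0.0], [0.0], [0.0], [0.0]]
second = [[1.0], [1.0], [1.0], [1.0]]
fused = fuse_windows([first, second], stride=2)
print(len(fused), fused[2], fused[3])
```

In the overlap region the output moves gradually from the first window's values toward the second's, which is the smoothing effect such a strategy is meant to provide.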

Title: ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Authors: Shuting He, Guangquan Jie, Changshuo Wang, Yun Zhou, Shuming Hu, Guanbin Li, Henghui Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.08252
Pdf URL: https://arxiv.org/pdf/2508.08252
Copy Paste: [[2508.08252]] ReferSplat: Referring Segmentation in 3D Gaussian Splatting(https://arxiv.org/abs/2508.08252)
Keywords: segmentation
Abstract: We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at this https URL.