2025-12-23

Title: Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models

Authors: Hongji Li, Junchi yao, Manjiang Yu, Priyanka Singh, Xue Li, Di Wang, Lijie Hu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.17911
Pdf URL: https://arxiv.org/pdf/2512.17911
Copy Paste: [[2512.17911]] Towards Reasoning-Preserving Unlearning in Multimodal Large Language Models(https://arxiv.org/abs/2512.17911)
Keywords: large language model
Abstract: Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.

Title: Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning

Authors: Lihui Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17912
Pdf URL: https://arxiv.org/pdf/2512.17912
Copy Paste: [[2512.17912]] Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning(https://arxiv.org/abs/2512.17912)
Keywords: large language model
Abstract: ChatGPT said: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.

Title: Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression

Authors: Boris Kriuk, Logic Ng
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2512.17914
Pdf URL: https://arxiv.org/pdf/2512.17914
Copy Paste: [[2512.17914]] Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression(https://arxiv.org/abs/2512.17914)
Keywords: robust, extraction, large language model
Abstract: Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.

Title: Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models

Authors: Minh Tri LÊ, Ali Ait-Bachir
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17916
Pdf URL: https://arxiv.org/pdf/2512.17916
Copy Paste: [[2512.17916]] Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models(https://arxiv.org/abs/2512.17916)
Keywords: transformer
Abstract: Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen's kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.

Title: KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction

Authors: Aomufei Yuan, Zhiming Wang, Ruijie Miao, Dayu Wang, Yuxuan Tian, Zihan Wang, Yebo Peng, Yuhan Wu, Bairen Yi, Xin Liu, Tong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17917
Pdf URL: https://arxiv.org/pdf/2512.17917
Copy Paste: [[2512.17917]] KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction(https://arxiv.org/abs/2512.17917)
Keywords: large language model
Abstract: As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging "less important" tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model's information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of KV Cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy ~2% accuracy loss) using merely 25% of KV Cache budget.

Title: Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

Authors: Rahul Baxi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17920
Pdf URL: https://arxiv.org/pdf/2512.17920
Copy Paste: [[2512.17920]] Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression(https://arxiv.org/abs/2512.17920)
Keywords: large language model
Abstract: Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' \k{appa}=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.

Title: A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes

Authors: Yuncheng Lu, Yucen Shi, Aobo Li, Zehao Li, Junying Li, Bo Wang, Tony Tae-Hyoung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.17939
Pdf URL: https://arxiv.org/pdf/2512.17939
Copy Paste: [[2512.17939]] A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes(https://arxiv.org/abs/2512.17939)
Keywords: robust
Abstract: We present an energy-efficient anti-UAV system that integrates frame-based and event-driven object tracking to enable reliable detection of small and fast-moving drones. The system reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. A Fast Object Tracking Unit improves robustness for high-speed targets through adaptive thresholding and trajectory-based classification. The neural processing unit supports both grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture, reducing redundant neural computations by more than 97 percent. Implemented in 40 nm CMOS technology, the 2 mm^2 chip achieves 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, and reaches 98.2 percent recognition accuracy on public UAV datasets across 50 to 400 m ranges and 5 to 80 pixels per second speeds. The results demonstrate state-of-the-art end-to-end energy efficiency for anti-UAV systems.

Title: NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction

Authors: Karthik Prabhakar
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17943
Pdf URL: https://arxiv.org/pdf/2512.17943
Copy Paste: [[2512.17943]] NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction(https://arxiv.org/abs/2512.17943)
Keywords: interpretability, explainability
Abstract: Nystagmus patients with photosensitivity face significant daily challenges due to involuntary eye movements exacerbated by environmental brightness conditions. Current assistive solutions are limited to symptomatic treatments without predictive personalization. This paper proposes NystagmusNet, an AI-driven system that predicts high-risk visual environments and recommends real-time visual adaptations. Using a dual-branch convolutional neural network trained on synthetic and augmented datasets, the system estimates a photosensitivity risk score based on environmental brightness and eye movement variance. The model achieves 75% validation accuracy on synthetic data. Explainability techniques including SHAP and GradCAM are integrated to highlight environmental risk zones, improving clinical trust and model interpretability. The system includes a rule-based recommendation engine for adaptive filter suggestions. Future directions include deployment via smart glasses and reinforcement learning for personalized recommendations.

Title: What's the Price of Monotonicity? A Multi-Dataset Benchmark of Monotone-Constrained Gradient Boosting for Credit PD

Authors: Petr Koklev
Subjects: cs.LG, q-fin.RM, q-fin.ST
Abstract URL: https://arxiv.org/abs/2512.17945
Pdf URL: https://arxiv.org/pdf/2512.17945
Copy Paste: [[2512.17945]] What's the Price of Monotonicity? A Multi-Dataset Benchmark of Monotone-Constrained Gradient Boosting for Credit PD(https://arxiv.org/abs/2512.17945)
Keywords: interpretability
Abstract: Financial institutions face a trade-off between predictive accuracy and interpretability when deploying machine learning models for credit risk. Monotonicity constraints align model behavior with domain knowledge, but their performance cost - the price of monotonicity - is not well quantified. This paper benchmarks monotone-constrained versus unconstrained gradient boosting models for credit probability of default across five public datasets and three libraries. We define the Price of Monotonicity (PoM) as the relative change in standard performance metrics when moving from unconstrained to constrained models, estimated via paired comparisons with bootstrap uncertainty. In our experiments, PoM in AUC ranges from essentially zero to about 2.9 percent: constraints are almost costless on large datasets (typically less than 0.2 percent, often indistinguishable from zero) and most costly on smaller datasets with extensive constraint coverage (around 2-3 percent). Thus, appropriately specified monotonicity constraints can often deliver interpretability with small accuracy losses, particularly in large-scale credit portfolios.

Title: SuperFlow: Training Flow Matching Models with RL on the Fly

Authors: Kaijie Chen, Zhiyang Xu, Ying Shen, Zihao Lin, Yuguang Yao, Lifu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.17951
Pdf URL: https://arxiv.org/pdf/2512.17951
Copy Paste: [[2512.17951]] SuperFlow: Training Flow Matching Models with RL on the Fly(https://arxiv.org/abs/2512.17951)
Keywords: generative
Abstract: Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.

Title: Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition

Authors: Ellie Zhou, Jihoon Chung, Olga Russakovsky
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17953
Pdf URL: https://arxiv.org/pdf/2512.17953
Copy Paste: [[2512.17953]] Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition(https://arxiv.org/abs/2512.17953)
Keywords: large language model
Abstract: Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLM) and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.

Title: SCS-SupCon: Sigmoid-based Common and Style Supervised Contrastive Learning with Adaptive Decision Boundaries

Authors: Bin Wang, Fadi Dornaika
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.17954
Pdf URL: https://arxiv.org/pdf/2512.17954
Copy Paste: [[2512.17954]] SCS-SupCon: Sigmoid-based Common and Style Supervised Contrastive Learning with Adaptive Decision Boundaries(https://arxiv.org/abs/2512.17954)
Keywords: robust, transformer
Abstract: Image classification is hindered by subtle inter-class differences and substantial intra-class variations, which limit the effectiveness of existing contrastive learning methods. Supervised contrastive approaches based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, thereby reducing discriminative power in fine-grained recognition tasks. To address these limitations, we propose Sigmoid-based Common and Style Supervised Contrastive Learning (SCS-SupCon). Our framework introduces a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters to enable adaptive decision boundaries. This formulation emphasizes hard negatives, mitigates negative-sample dilution, and more effectively exploits supervision. In addition, an explicit style-distance constraint further disentangles style and content representations, leading to more robust feature learning. Comprehensive experiments on six benchmark datasets, including CUB200-2011 and Stanford Dogs, demonstrate that SCS-SupCon achieves state-of-the-art performance across both CNN and Transformer backbones. On CIFAR-100 with ResNet-50, SCS-SupCon improves top-1 accuracy over SupCon by approximately 3.9 percentage points and over CS-SupCon by approximately 1.7 points under five-fold cross-validation. On fine-grained datasets, it outperforms CS-SupCon by 0.4--3.0 points. Extensive ablation studies and statistical analyses further confirm the robustness and generalization of the proposed framework, with Friedman tests and Nemenyi post-hoc evaluations validating the stability of the observed improvements.

Title: A Modular Framework for Single-View 3D Reconstruction of Indoor Environments

Authors: Yuxiao Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.17955
Pdf URL: https://arxiv.org/pdf/2512.17955
Copy Paste: [[2512.17955]] A Modular Framework for Single-View 3D Reconstruction of Indoor Environments(https://arxiv.org/abs/2512.17955)
Keywords: diffusion
Abstract: We propose a modular framework for single-view indoor scene 3D reconstruction, where several core modules are powered by diffusion techniques. Traditional approaches for this task often struggle with the complex instance shapes and occlusions inherent in indoor environments. They frequently overshoot by attempting to predict 3D shapes directly from incomplete 2D images, which results in limited reconstruction quality. We aim to overcome this limitation by splitting the process into two steps: first, we employ diffusion-based techniques to predict the complete views of the room background and occluded indoor instances, then transform them into 3D. Our modular framework makes contributions to this field through the following components: an amodal completion module for restoring the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth estimation technique that balances overall geometric accuracy with fine detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to ensure precise placement of instances within the scene. This approach effectively reconstructs both foreground instances and the room background from a single image. Extensive experiments on the 3D-Front dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in terms of both visual quality and reconstruction accuracy. The framework holds promising potential for applications in interior design, real estate, and augmented reality.

Title: Convolutional-neural-operator-based transfer learning for solving PDEs

Authors: Peng Fan, Guofei Pang
Subjects: cs.LG, cs.AI, math-ph
Abstract URL: https://arxiv.org/abs/2512.17969
Pdf URL: https://arxiv.org/pdf/2512.17969
Copy Paste: [[2512.17969]] Convolutional-neural-operator-based transfer learning for solving PDEs(https://arxiv.org/abs/2512.17969)
Keywords: diffusion, transformer
Abstract: Convolutional neural operator is a CNN-based architecture recently proposed to enforce structure-preserving continuous-discrete equivalence and enable the genuine, alias-free learning of solution operators of PDEs. This neural operator was demonstrated to outperform for certain cases some baseline models such as DeepONet, Fourier neural operator, and Galerkin transformer in terms of surrogate accuracy. The convolutional neural operator, however, seems not to be validated for few-shot learning. We extend the model to few-shot learning scenarios by first pre-training a convolutional neural operator using a source dataset and then adjusting the parameters of the trained neural operator using only a small target dataset. We investigate three strategies for adjusting the parameters of a trained neural operator, including fine-tuning, low-rank adaption, and neuron linear transformation, and find that the neuron linear transformation strategy enjoys the highest surrogate accuracy in solving PDEs such as Kuramoto-Sivashinsky equation, Brusselator diffusion-reaction system, and Navier-Stokes equations.

Title: Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models

Authors: Irina Seregina, Philippe Lalanda, German Vega
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17983
Pdf URL: https://arxiv.org/pdf/2512.17983
Copy Paste: [[2512.17983]] Parameter-Efficient Fine-Tuning for HAR: Integrating LoRA and QLoRA into Transformer Models(https://arxiv.org/abs/2512.17983)
Keywords: robust, transformer
Abstract: Human Activity Recognition is a foundational task in pervasive computing. While recent advances in self-supervised learning and transformer-based architectures have significantly improved HAR performance, adapting large pretrained models to new domains remains a practical challenge due to limited computational resources on target devices. This papers investigates parameter-efficient fine-tuning techniques, specifically Low-Rank Adaptation (LoRA) and Quantized LoRA, as scalable alternatives to full model fine-tuning for HAR. We propose an adaptation framework built upon a Masked Autoencoder backbone and evaluate its performance under a Leave-One-Dataset-Out validation protocol across five open HAR datasets. Our experiments demonstrate that both LoRA and QLoRA can match the recognition performance of full fine-tuning while significantly reducing the number of trainable parameters, memory usage, and training time. Further analyses reveal that LoRA maintains robust performance even under limited supervision and that the adapter rank provides a controllable trade-off between accuracy and efficiency. QLoRA extends these benefits by reducing the memory footprint of frozen weights through quantization, with minimal impact on classification quality.

Title: A Hybrid Inductive-Transductive Network for Traffic Flow Imputation on Unsampled Locations

Authors: Mohammadmahdi Rahimiasl, Ynte Vanderhoydonc, Siegfried Mercelis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.17984
Pdf URL: https://arxiv.org/pdf/2512.17984
Copy Paste: [[2512.17984]] A Hybrid Inductive-Transductive Network for Traffic Flow Imputation on Unsampled Locations(https://arxiv.org/abs/2512.17984)
Keywords: diffusion, transformer
Abstract: Accurately imputing traffic flow at unsensed locations is difficult: loop detectors provide precise but sparse measurements, speed from probe vehicles is widely available yet only weakly correlated with flow, and nearby links often exhibit strong heterophily in the scale of traffic flow (e.g., ramps vs. mainline), which breaks standard GNN assumptions. We propose HINT, a Hybrid INductive-Transductive Network, and an INDU-TRANSDUCTIVE training strategy that treats speed as a transductive, network-wide signal while learning flow inductively to generalize to unseen locations. HINT couples (i) an inductive spatial transformer that learns similarity-driven, long-range interactions from node features with (ii) a diffusion GCN conditioned by FiLM on rich static context (OSM-derived attributes and traffic simulation), and (iii) a node-wise calibration layer that corrects scale biases per segment. Training uses masked reconstruction with epoch-wise node sampling, hard-node mining to emphasize difficult sensors, and noise injection on visible flows to prevent identity mapping, while graph structure is built from driving distances. Across three real-world datasets, MOW (Antwerp, Belgium), UTD19-Torino, and UTD19-Essen, HINT consistently surpasses state-of-the-art inductive baselines. Relative to KITS, HINT reduces MAE on MOW by $\approx42$% with basic simulation and $\approx50$% with calibrated simulation; on Torino by $\approx22$%, and on Essen by $\approx12$%. Even without simulation, HINT remains superior on MOW and Torino, while simulation is crucial on Essen. These results show that combining inductive flow imputation with transductive speed, traffic simulations and external geospatial improves accuracy for the task described above.

Title: MoE-TransMov: A Transformer-based Model for Next POI Prediction in Familiar & Unfamiliar Movements

Authors: Ruichen Tan, Jiawei Xue, Kota Tsubouchi, Takahiro Yabe, Satish V. Ukkusuri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.17985
Pdf URL: https://arxiv.org/pdf/2512.17985
Copy Paste: [[2512.17985]] MoE-TransMov: A Transformer-based Model for Next POI Prediction in Familiar & Unfamiliar Movements(https://arxiv.org/abs/2512.17985)
Keywords: transformer
Abstract: Accurate prediction of the next point of interest (POI) within human mobility trajectories is essential for location-based services, as it enables more timely and personalized recommendations. In particular, with the rise of these approaches, studies have shown that users exhibit different POI choices in their familiar and unfamiliar areas, highlighting the importance of incorporating user familiarity into predictive models. However, existing methods often fail to distinguish between the movements of users in familiar and unfamiliar regions. To address this, we propose MoE-TransMov, a Transformer-based model with a Transformer model with a Mixture-of-Experts (MoE) architecture designed to use one framework to capture distinct mobility patterns across different moving contexts without requiring separate training for certain data. Using user-check-in data, we classify movements into familiar and unfamiliar categories and develop a specialized expert network to improve prediction accuracy. Our approach integrates self-attention mechanisms and adaptive gating networks to dynamically select the most relevant expert models for different mobility contexts. Experiments on two real-world datasets, including the widely used but small open-source Foursquare NYC dataset and the large-scale Kyoto dataset collected with LY Corporation (Yahoo Japan Corporation), show that MoE-TransMov outperforms state-of-the-art baselines with notable improvements in Top-1, Top-5, Top-10 accuracy, and mean reciprocal rank (MRR). Given the results, we find that by using this approach, we can efficiently improve mobility predictions under different moving contexts, thereby enhancing the personalization of recommendation systems and advancing various urban applications.

Title: FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability

Authors: S M Ruhul Kabir Howlader, Xiao Chen, Yifei Xie, Lu Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.17986
Pdf URL: https://arxiv.org/pdf/2512.17986
Copy Paste: [[2512.17986]] FedOAED: Federated On-Device Autoencoder Denoiser for Heterogeneous Data under Limited Client Availability(https://arxiv.org/abs/2512.17986)
Keywords: protect, federate
Abstract: Over the last few decades, machine learning (ML) and deep learning (DL) solutions have demonstrated their potential across many applications by leveraging large amounts of high-quality data. However, strict data-sharing regulations such as the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) have prevented many data-driven applications from being realised. Federated Learning (FL), in which raw data never leaves local devices, has shown promise in overcoming these limitations. Although FL has grown rapidly in recent years, it still struggles with heterogeneity, which produces gradient noise, client-drift, and increased variance from partial client participation. In this paper, we propose FedOAED, a novel federated learning algorithm designed to mitigate client-drift arising from multiple local training updates and the variance induced by partial client participation. FedOAED incorporates an on-device autoencoder denoiser on the client side to mitigate client-drift and variance resulting from heterogeneous data under limited client availability. Experiments on multiple vision datasets under Non-IID settings demonstrate that FedOAED consistently outperforms state-of-the-art baselines.

Title: Enhancing Tea Leaf Disease Recognition with Attention Mechanisms and Grad-CAM Visualization

Authors: Omar Faruq Shikdar, Fahad Ahammed, B. M. Shahria Alam, Golam Kibria, Tawhidur Rahman, Nishat Tasnim Niloy
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.17987
Pdf URL: https://arxiv.org/pdf/2512.17987
Copy Paste: [[2512.17987]] Enhancing Tea Leaf Disease Recognition with Attention Mechanisms and Grad-CAM Visualization(https://arxiv.org/abs/2512.17987)
Keywords: interpretability
Abstract: Tea is among the most widely consumed drinks globally. Tea production is a key industry for many countries. One of the main challenges in tea harvesting is tea leaf diseases. If the spread of tea leaf diseases is not stopped in time, it can lead to massive economic losses for farmers. Therefore, it is crucial to identify tea leaf diseases as soon as possible. Manually identifying tea leaf disease is an ineffective and time-consuming method, without any guarantee of success. Automating this process will improve both the efficiency and the success rate of identifying tea leaf diseases. The purpose of this study is to create an automated system that can classify different kinds of tea leaf diseases, allowing farmers to take action to minimize the damage. A novel dataset was developed specifically for this study. The dataset contains 5278 images across seven classes. The dataset was pre-processed prior to training the model. We deployed three pretrained models: DenseNet, Inception, and EfficientNet. EfficientNet was used only in the ensemble model. We utilized two different attention modules to improve model performance. The ensemble model achieved the highest accuracy of 85.68%. Explainable AI was introduced for better model interpretability.

Title: Name That Part: 3D Part Segmentation and Naming

Authors: Soumava Paul, Prakhar Kaushik, Ankit Vaidya, Anand Bhattad, Alan Yuille
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18003
Pdf URL: https://arxiv.org/pdf/2512.18003
Copy Paste: [[2512.18003]] Name That Part: 3D Part Segmentation and Naming(https://arxiv.org/abs/2512.18003)
Keywords: robust, segmentation
Abstract: We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Text-alignment loss ensures partlets share embedding space with text, enabling a theoretically open-vocabulary matching setup, given sufficient data. Our efficient and novel, one-shot, 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. As our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification, we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We also show examples from our newly created Tex-Parts dataset. We also introduce 2 novel metrics appropriate for the named 3D part segmentation task.

Title: Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models

Authors: Shubham Kumar Nigam, Parjanya Aditya Shukla, Noel Shallum, Arnab Bhattacharya
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18004
Pdf URL: https://arxiv.org/pdf/2512.18004
Copy Paste: [[2512.18004]] Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models(https://arxiv.org/abs/2512.18004)
Keywords: robust, large language model
Abstract: Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India's district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.

Title: Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models

Authors: Wei Qian, Chenxu Zhao, Yangyi Li, Mengdi Huai
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.18035
Pdf URL: https://arxiv.org/pdf/2512.18035
Copy Paste: [[2512.18035]] Towards Benchmarking Privacy Vulnerabilities in Selective Forgetting with Large Language Models(https://arxiv.org/abs/2512.18035)
Keywords: privacy, protect, attack, fair, large language model
Abstract: The rapid advancements in artificial intelligence (AI) have primarily focused on the process of learning from data to acquire knowledgeable learning systems. As these systems are increasingly deployed in critical areas, ensuring their privacy and alignment with human values is paramount. Recently, selective forgetting (also known as machine unlearning) has shown promise for privacy and data removal tasks, and has emerged as a transformative paradigm shift in the field of AI. It refers to the ability of a model to selectively erase the influence of previously seen data, which is especially important for compliance with modern data protection regulations and for aligning models with human values. Despite its promise, selective forgetting raises significant privacy concerns, especially when the data involved come from sensitive domains. While new unlearning-induced privacy attacks are continuously proposed, each is shown to outperform its predecessors using different experimental settings, which can lead to overly optimistic and potentially unfair assessments that may disproportionately favor one particular attack over the others. In this work, we present the first comprehensive benchmark for evaluating privacy vulnerabilities in selective forgetting. We extensively investigate privacy vulnerabilities of machine unlearning techniques and benchmark privacy leakage across a wide range of victim data, state-of-the-art unlearning privacy attacks, unlearning methods, and model architectures. We systematically evaluate and identify critical factors related to unlearning-induced privacy leakage. With our novel insights, we aim to provide a standardized tool for practitioners seeking to deploy customized unlearning applications with faithful privacy assessments.

Title: Securing Agentic AI Systems -- A Multilayer Security Framework

Authors: Sunil Arora, John Hastings
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2512.18043
Pdf URL: https://arxiv.org/pdf/2512.18043
Copy Paste: [[2512.18043]] Securing Agentic AI Systems -- A Multilayer Security Framework(https://arxiv.org/abs/2512.18043)
Keywords: secure, security, defense
Abstract: Securing Agentic Artificial Intelligence (AI) systems requires addressing the complex cyber risks introduced by autonomous, decision-making, and adaptive behaviors. Agentic AI systems are increasingly deployed across industries, organizations, and critical sectors such as cybersecurity, finance, and healthcare. However, their autonomy introduces unique security challenges, including unauthorized actions, adversarial manipulation, and dynamic environmental interactions. Existing AI security frameworks do not adequately address these challenges or the unique nuances of agentic AI. This research develops a lifecycle-aware security framework specifically designed for agentic AI systems using the Design Science Research (DSR) methodology. The paper introduces MAAIS, an agentic security framework, and the agentic AI CIAA (Confidentiality, Integrity, Availability, and Accountability) concept. MAAIS integrates multiple defense layers to maintain CIAA across the AI lifecycle. Framework validation is conducted by mapping with the established MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) AI tactics. The study contributes a structured, standardized, and framework-based approach for the secure deployment and governance of agentic AI in enterprise environments. This framework is intended for enterprise CISOs, security, AI platform, and engineering teams and offers a detailed step-by-step approach to securing agentic AI workloads.

Title: YolovN-CBi: A Lightweight and Efficient Architecture for Real-Time Detection of Small UAVs

Authors: Ami Pandat, Punna Rajasekhar, Gopika Vinod, Rohit Shukla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18046
Pdf URL: https://arxiv.org/pdf/2512.18046
Copy Paste: [[2512.18046]] YolovN-CBi: A Lightweight and Efficient Architecture for Real-Time Detection of Small UAVs(https://arxiv.org/abs/2512.18046)
Keywords: defense
Abstract: Unmanned Aerial Vehicles, commonly known as, drones pose increasing risks in civilian and defense settings, demanding accurate and real-time drone detection systems. However, detecting drones is challenging because of their small size, rapid movement, and low visual contrast. A modified architecture of YolovN called the YolovN-CBi is proposed that incorporates the Convolutional Block Attention Module (CBAM) and the Bidirectional Feature Pyramid Network (BiFPN) to improve sensitivity to small object detections. A curated training dataset consisting of 28K images is created with various flying objects and a local test dataset is collected with 2500 images consisting of very small drone objects. The proposed architecture is evaluated on four benchmark datasets, along with the local test dataset. The baseline Yolov5 and the proposed Yolov5-CBi architecture outperform newer Yolo versions, including Yolov8 and Yolov12, in the speed-accuracy trade-off for small object detection. Four other variants of the proposed CBi architecture are also proposed and evaluated, which vary in the placement and usage of CBAM and BiFPN. These variants are further distilled using knowledge distillation techniques for edge deployment, using a Yolov5m-CBi teacher and a Yolov5n-CBi student. The distilled model achieved a mA@P0.5:0.9 of 0.6573, representing a 6.51% improvement over the teacher's score of 0.6171, highlighting the effectiveness of the distillation process. The distilled model is 82.9% faster than the baseline model, making it more suitable for real-time drone detection. These findings highlight the effectiveness of the proposed CBi architecture, together with the distilled lightweight models in advancing efficient and accurate real-time detection of small UAVs.

Title: FOODER: Real-time Facial Authentication and Expression Recognition

Authors: Sabri Mustafa Kahya, Muhammet Sami Yavuz, Boran Hamdi Sivrikaya, Eckehard Steinbach
Subjects: cs.CV, cs.AI, cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2512.18057
Pdf URL: https://arxiv.org/pdf/2512.18057
Copy Paste: [[2512.18057]] FOODER: Real-time Facial Authentication and Expression Recognition(https://arxiv.org/abs/2512.18057)
Keywords: privacy, robust, transformer
Abstract: Out-of-distribution (OOD) detection is essential for the safe deployment of neural networks, as it enables the identification of samples outside the training domain. We present FOODER, a real-time, privacy-preserving radar-based framework that integrates OOD-based facial authentication with facial expression recognition. FOODER operates using low-cost frequency-modulated continuous-wave (FMCW) radar and exploits both range-Doppler and micro range-Doppler representations. The authentication module employs a multi-encoder multi-decoder architecture with Body Part (BP) and Intermediate Linear Encoder-Decoder (ILED) components to classify a single enrolled individual as in-distribution while detecting all other faces as OOD. Upon successful authentication, an expression recognition module is activated. Concatenated radar representations are processed by a ResNet block to distinguish between dynamic and static facial expressions. Based on this categorization, two specialized MobileViT networks are used to classify dynamic expressions (smile, shock) and static expressions (neutral, anger). This hierarchical design enables robust facial authentication and fine-grained expression recognition while preserving user privacy by relying exclusively on radar data. Experiments conducted on a dataset collected with a 60 GHz short-range FMCW radar demonstrate that FOODER achieves an AUROC of 94.13% and an FPR95 of 18.12% for authentication, along with an average expression recognition accuracy of 94.70%. FOODER outperforms state-of-the-art OOD detection methods and several transformer-based architectures while operating efficiently in real time.

Title: FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

Authors: Ekta Balkrishna Gavas, Sudipta Banerjee, Chinmay Hegde, Nasir Memon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18073
Pdf URL: https://arxiv.org/pdf/2512.18073
Copy Paste: [[2512.18073]] FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis(https://arxiv.org/abs/2512.18073)
Keywords: biometric, explainability, large language model
Abstract: Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding are yet unexplored. In this work, we design a comprehensive benchmark, \textsc{FPBench} that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance, explainability and share our insights into the challenges and limitations. We establish \textsc{FPBench} as the first comprehensive benchmark for fingerprint domain understanding with MLLMs paving the path for foundation models for fingerprints.

Title: Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation

Authors: Shreshth Rajan, Raymond Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18082
Pdf URL: https://arxiv.org/pdf/2512.18082
Copy Paste: [[2512.18082]] Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation(https://arxiv.org/abs/2512.18082)
Keywords: robust, segmentation
Abstract: Semantic segmentation of outdoor street scenes plays a key role in applications such as autonomous driving, mobile robotics, and assistive technology for visually-impaired pedestrians. For these applications, accurately distinguishing between key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians is essential for maintaining safety and minimizing risks. Semantic segmentation must be robust to different environments, lighting and weather conditions, and sensor noise, while being performed in real-time. We propose a region-level, uncertainty-gated retrieval mechanism that improves segmentation accuracy and calibration under domain shift. Our best method achieves an 11.3% increase in mean intersection-over-union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for always-on baseline.

Title: Microstructure-based Variational Neural Networks for Robust Uncertainty Quantification in Materials Digital Twins

Authors: Andreas E. Robertson, Samuel B. Inman, Ashley T. Lenau, Ricardo A. Lebensohn, Dongil Shin, Brad L. Boyce, Remi M. Dingreville
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2512.18104
Pdf URL: https://arxiv.org/pdf/2512.18104
Copy Paste: [[2512.18104]] Microstructure-based Variational Neural Networks for Robust Uncertainty Quantification in Materials Digital Twins(https://arxiv.org/abs/2512.18104)
Keywords: robust
Abstract: Aleatoric uncertainties - irremovable variability in microstructure morphology, constituent behavior, and processing conditions - pose a major challenge to developing uncertainty-robust digital twins. We introduce the Variational Deep Material Network (VDMN), a physics-informed surrogate model that enables efficient and probabilistic forward and inverse predictions of material behavior. The VDMN captures microstructure-induced variability by embedding variational distributions within its hierarchical, mechanistic architecture. Using an analytic propagation scheme based on Taylor-series expansion and automatic differentiation, the VDMN efficiently propagates uncertainty through the network during training and prediction. We demonstrate its capabilities in two digital-twin-driven applications: (1) as an uncertainty-aware materials digital twin, it predicts and experimentally validates the nonlinear mechanical variability in additively manufactured polymer composites; and (2) as an inverse calibration engine, it disentangles and quantitatively identifies overlapping sources of uncertainty in constituent properties. Together, these results establish the VDMN as a foundation for uncertainty-robust materials digital twins.

Title: Learning Generalizable Neural Operators for Inverse Problems

Authors: Adam J. Thorpe, Stepan Tretiakov, Dibakar Roy Sarkar, Krishna Kumar, Ufuk Topcu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18120
Pdf URL: https://arxiv.org/pdf/2512.18120
Copy Paste: [[2512.18120]] Learning Generalizable Neural Operators for Inverse Problems(https://arxiv.org/abs/2512.18120)
Keywords: robust
Abstract: Inverse problems challenge existing neural operator architectures because ill-posed inverse maps violate continuity, uniqueness, and stability assumptions. We introduce B2B${}^{-1}$, an inverse basis-to-basis neural operator framework that addresses this limitation. Our key innovation is to decouple function representation from the inverse map. We learn neural basis functions for the input and output spaces, then train inverse models that operate on the resulting coefficient space. This structure allows us to learn deterministic, invertible, and probabilistic models within a single framework, and to choose models based on the degree of ill-posedness. We evaluate our approach on six inverse PDE benchmarks, including two novel datasets, and compare against existing invertible neural operator baselines. We learn probabilistic models that capture uncertainty and input variability, and remain robust to measurement noise due to implicit denoising in the coefficient calculation. Our results show consistent re-simulation performance across varying levels of ill-posedness. By separating representation from inversion, our framework enables scalable surrogate models for inverse problems that generalize across instances, domains, and degrees of ill-posedness.

Title: TraCeR: Transformer-Based Competing Risk Analysis with Longitudinal Covariates

Authors: Maxmillan Ries, Sohan Seth
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.18129
Pdf URL: https://arxiv.org/pdf/2512.18129
Copy Paste: [[2512.18129]] TraCeR: Transformer-Based Competing Risk Analysis with Longitudinal Covariates(https://arxiv.org/abs/2512.18129)
Keywords: transformer
Abstract: Survival analysis is a critical tool for modeling time-to-event data. Recent deep learning-based models have reduced various modeling assumptions including proportional hazard and linearity. However, a persistent challenge remains in incorporating longitudinal covariates, with prior work largely focusing on cross-sectional features, and in assessing calibration of these models, with research primarily focusing on discrimination during evaluation. We introduce TraCeR, a transformer-based survival analysis framework for incorporating longitudinal covariates. Based on a factorized self-attention architecture, TraCeR estimates the hazard function from a sequence of measurements, naturally capturing temporal covariate interactions without assumptions about the underlying data-generating process. The framework is inherently designed to handle censored data and competing events. Experiments on multiple real-world datasets demonstrate that TraCeR achieves substantial and statistically significant performance improvements over state-of-the-art methods. Furthermore, our evaluation extends beyond discrimination metrics and assesses model calibration, addressing a key oversight in literature.

Title: PermuteV: A Performant Side-channel-Resistant RISC-V Core Securing Edge AI Inference

Authors: Nuntipat Narkthong, Xiaolin Xu
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2512.18132
Pdf URL: https://arxiv.org/pdf/2512.18132
Copy Paste: [[2512.18132]] PermuteV: A Performant Side-channel-Resistant RISC-V Core Securing Edge AI Inference(https://arxiv.org/abs/2512.18132)
Keywords: secure, security, privacy, defense, attack
Abstract: Edge AI inference is becoming prevalent thanks to the emergence of small yet high-performance microprocessors. This shift from cloud to edge processing brings several benefits in terms of energy savings, improved latency, and increased privacy. On the downside, bringing computation to the edge makes them more vulnerable to physical side-channel attacks (SCA), which aim to extract the confidentiality of neural network models, e.g., architecture and weight. To address this growing threat, we propose PermuteV, a performant side-channel resistant RISC-V core designed to secure neural network inference. PermuteV employs a hardware-accelerated defense mechanism that randomly permutes the execution order of loop iterations, thereby obfuscating the electromagnetic (EM) signature associated with sensitive operations. We implement PermuteV on FPGA and perform evaluations in terms of side-channel security, hardware area, and runtime overhead. The experimental results demonstrate that PermuteV can effectively defend against EM SCA with minimal area and runtime overhead.

Title: Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection

Authors: Jie Yang, Rui Zhang, Ziyang Cheng, Dawei Cheng, Guang Yang, Bo Wang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2512.18133
Pdf URL: https://arxiv.org/pdf/2512.18133
Copy Paste: [[2512.18133]] Grad: Guided Relation Diffusion Generation for Graph Augmentation in Graph Fraud Detection(https://arxiv.org/abs/2512.18133)
Keywords: security, protect, diffusion
Abstract: Nowadays, Graph Fraud Detection (GFD) in financial scenarios has become an urgent research topic to protect online payment security. However, as organized crime groups are becoming more professional in real-world scenarios, fraudsters are employing more sophisticated camouflage strategies. Specifically, fraudsters disguise themselves by mimicking the behavioral data collected by platforms, ensuring that their key characteristics are consistent with those of benign users to a high degree, which we call Adaptive Camouflage. Consequently, this narrows the differences in behavioral traits between them and benign users within the platform's database, thereby making current GFD models lose efficiency. To address this problem, we propose a relation diffusion-based graph augmentation model Grad. In detail, Grad leverages a supervised graph contrastive learning module to enhance the fraud-benign difference and employs a guided relation diffusion generator to generate auxiliary homophilic relations from scratch. Based on these, weak fraudulent signals would be enhanced during the aggregation process, thus being obvious enough to be captured. Extensive experiments have been conducted on two real-world datasets provided by WeChat Pay, one of the largest online payment platforms with billions of users, and three public datasets. The results show that our proposed model Grad outperforms SOTA methods in both various scenarios, achieving at most 11.10% and 43.95% increases in AUC and AP, respectively. Our code is released at this https URL and this https URL.

Title: Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction

Authors: Taewon Yang, Jason Hu, Jeffrey A. Fessler, Liyue Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18161
Pdf URL: https://arxiv.org/pdf/2512.18161
Copy Paste: [[2512.18161]] Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction(https://arxiv.org/abs/2512.18161)
Keywords: diffusion, generative
Abstract: Diffusion models learn strong image priors that can be leveraged to solve inverse problems like medical image reconstruction. However, for real-world applications such as 3D Computed Tomography (CT) imaging, directly training diffusion models on 3D data presents significant challenges due to the high computational demands of extensive GPU resources and large-scale datasets. Existing works mostly reuse 2D diffusion priors to address 3D inverse problems, but fail to fully realize and leverage the generative capacity of diffusion models for high-dimensional data. In this study, we propose a novel 3D patch-based diffusion model that can learn a fully 3D diffusion prior from limited data, enabling scalable generation of high-resolution 3D images. Our core idea is to learn the prior of 3D patches to achieve scalable efficiency, while coupling local and global information to guarantee high-quality 3D image generation, by modeling the joint distribution of position-aware 3D local patches and downsampled 3D volume as global context. Our approach not only enables high-quality 3D generation, but also offers an unprecedentedly efficient and accurate solution to high-resolution 3D inverse problems. Experiments on 3D CT reconstruction across multiple datasets show that our method outperforms state-of-the-art methods in both performance and efficiency, notably achieving high-resolution 3D reconstruction of $512 \times 512 \times 256$ ($\sim$20 mins).

Title: Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation

Authors: Lena Libon, Meghana Bhange, Rushabh Solanki, Elliot Creager, Ulrich Aïvodji
Subjects: cs.LG, cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2512.18174
Pdf URL: https://arxiv.org/pdf/2512.18174
Copy Paste: [[2512.18174]] Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation(https://arxiv.org/abs/2512.18174)
Keywords: privacy
Abstract: The current era of AI development places a heavy emphasis on training large models on increasingly scaled-up datasets. This paradigm has catalyzed entirely new product categories, such as LLM chatbots, while also raising concerns about data privacy and consumer choice. In this paper, we consider questions of data portability and user autonomy in the context of LLMs that "reason" using chain-of-thought (CoT) traces, computing intermediate text artifacts from user input before producing a final output. We first interpret recent data privacy and portability law to argue that these intermediate computations qualify as users' personal data. Then, building on the existing framework of Conscious Data Contribution, we show how communities who receive low utility from an available model can aggregate and distill their shared knowledge into an alternate model better aligned with their goals. We verify this approach empirically and investigate the effects of community diversity, reasoning granularity, and community size on distillation performance.

Title: Atlas is Your Perfect Context: One-Shot Customization for Generalizable Foundational Medical Image Segmentation

Authors: Ziyu Zhang, Yi Yu, Simeng Zhu, Ahmed Aly, Yunhe Gao, Ning Gu, Yuan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18176
Pdf URL: https://arxiv.org/pdf/2512.18176
Copy Paste: [[2512.18176]] Atlas is Your Perfect Context: One-Shot Customization for Generalizable Foundational Medical Image Segmentation(https://arxiv.org/abs/2512.18176)
Keywords: segmentation
Abstract: Accurate medical image segmentation is essential for clinical diagnosis and treatment planning. While recent interactive foundation models (e.g., nnInteractive) enhance generalization through large-scale multimodal pretraining, they still depend on precise prompts and often perform below expectations in contexts that are underrepresented in their training data. We present AtlasSegFM, an atlas-guided framework that customizes available foundation models to clinical contexts with a single annotated example. The core innovations are: 1) a pipeline that provides context-aware prompts for foundation models via registration between a context atlas and query images, and 2) a test-time adapter to fuse predictions from both atlas registration and the foundation model. Extensive experiments across public and in-house datasets spanning multiple modalities and organs demonstrate that AtlasSegFM consistently improves segmentation, particularly for small, delicate structures. AtlasSegFM provides a lightweight, deployable solution one-shot customization of foundation models in real-world clinical workflows. The code will be made publicly available.

Title: FairExpand: Individual Fairness on Graphs with Partial Similarity Information

Authors: Rebecca Salganik, Yibin Wang, Guillaume Salha-Galvan, Jian Kang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18180
Pdf URL: https://arxiv.org/pdf/2512.18180
Copy Paste: [[2512.18180]] FairExpand: Individual Fairness on Graphs with Partial Similarity Information(https://arxiv.org/abs/2512.18180)
Keywords: fair
Abstract: Individual fairness, which requires that similar individuals should be treated similarly by algorithmic systems, has become a central principle in fair machine learning. Individual fairness has garnered traction in graph representation learning due to its practical importance in high-stakes Web areas such as user modeling, recommender systems, and search. However, existing methods assume the existence of predefined similarity information over all node pairs, an often unrealistic requirement that prevents their operationalization in practice. In this paper, we assume the similarity information is only available for a limited subset of node pairs and introduce FairExpand, a flexible framework that promotes individual fairness in this more realistic partial information scenario. FairExpand follows a two-step pipeline that alternates between refining node representations using a backbone model (e.g., a graph neural network) and gradually propagating similarity information, which allows fairness enforcement to effectively expand to the entire graph. Extensive experiments show that FairExpand consistently enhances individual fairness while preserving performance, making it a practical solution for enabling graph-based individual fairness in real-world applications with partial similarity information.

Title: MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Authors: Kaixing Yang, Jiashu Zhu, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jiahong Wu, Xiangxiang Chu, Hongyan Liu, Jun He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18181
Pdf URL: https://arxiv.org/pdf/2512.18181
Copy Paste: [[2512.18181]] MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation(https://arxiv.org/abs/2512.18181)
Keywords: diffusion, transformer
Abstract: With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: this https URL

Title: Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching

Authors: Junho Lee, Kwanseok Kim, Joonseok Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18184
Pdf URL: https://arxiv.org/pdf/2512.18184
Copy Paste: [[2512.18184]] Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching(https://arxiv.org/abs/2512.18184)
Keywords: robust, generative
Abstract: Flow matching has emerged as a powerful generative modeling approach with flexible choices of source distribution. While Gaussian distributions are commonly used, the potential for better alternatives in high-dimensional data generation remains largely unexplored. In this paper, we propose a novel 2D simulation that captures high-dimensional geometric properties in an interpretable 2D setting, enabling us to analyze the learning dynamics of flow matching during training. Based on this analysis, we derive several key insights about flow matching behavior: (1) density approximation can paradoxically degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian's omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. Building on these insights, we propose a practical framework that combines norm-aligned training with directionally-pruned sampling. This approach maintains the robust omnidirectional supervision essential for stable flow learning, while eliminating initializations in data-sparse regions during inference. Importantly, our pruning strategy can be applied to any flow matching model trained with a Gaussian source, providing immediate performance gains without the need for retraining. Empirical evaluations demonstrate consistent improvements in both generation quality and sampling efficiency. Our findings provide practical insights and guidelines for source distribution design and introduce a readily applicable technique for improving existing flow matching models. Our code is available at this https URL.

Title: ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection

Authors: Janghyun Baek, Mincheol Chang, Seokha Moon, Seung Joon Lee, Jinkyu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18187
Pdf URL: https://arxiv.org/pdf/2512.18187
Copy Paste: [[2512.18187]] ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection(https://arxiv.org/abs/2512.18187)
Keywords: robust
Abstract: Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies,such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.

Title: Multi-Part Object Representations via Graph Structures and Co-Part Discovery

Authors: Alex Foo, Wynne Hsu, Mong Li Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18192
Pdf URL: https://arxiv.org/pdf/2512.18192
Copy Paste: [[2512.18192]] Multi-Part Object Representations via Graph Structures and Co-Part Discovery(https://arxiv.org/abs/2512.18192)
Keywords: robust
Abstract: Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fail to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages on explicit graph representations for parts and present a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as the accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.

Title: PROVEX: Enhancing SOC Analyst Trust with Explainable Provenance-Based IDS

Authors: Devang Dhanuka, Nidhi Rastogi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18199
Pdf URL: https://arxiv.org/pdf/2512.18199
Copy Paste: [[2512.18199]] PROVEX: Enhancing SOC Analyst Trust with Explainable Provenance-Based IDS(https://arxiv.org/abs/2512.18199)
Keywords: security, attack
Abstract: Modern intrusion detection systems (IDS) leverage graph neural networks (GNNs) to detect malicious activity in system provenance data, but their decisions often remain a black box to analysts. This paper presents a comprehensive XAI framework designed to bridge the trust gap in Security Operations Centers (SOCs) by making graph-based detection transparent. We implement this framework on top of KAIROS, a state-of-the-art temporal graph-based IDS, though our design is applicable to any temporal graph-based detector with minimal adaptation. The complete codebase is available at this https URL. We augment the detection pipeline with post-hoc explanations that highlight why an alert was triggered, identifying key causal subgraphs and events. We adapt three GNN explanation methods - GraphMask, GNNExplainer, and a variational temporal GNN explainer (VA-TGExplainer) - to the temporal provenance context. These tools output human-interpretable representations of anomalous behavior, including important edges and uncertainty estimates. Our contributions focus on the practical integration of these explainers, addressing challenges in memory management and reproducibility. We demonstrate our framework on the DARPA CADETS Engagement 3 dataset and show that it produces concise window-level explanations for detected attacks. Our evaluation reveals that the explainers preserve the TGNN's decisions with high fidelity, surfacing critical edges such as malicious file interactions and anomalous netflows. The average explanation overhead is 3-5 seconds per event. By providing insight into the model's reasoning, our framework aims to improve analyst trust and triage speed.

Title: FedWiLoc: Federated Learning for Privacy-Preserving WiFi Indoor Localization

Authors: Kanishka Roy, Tahsin Fuad Hasan, Chenfeng Wu, Eshwar Vangala, Roshan Ayyalasomayajula
Subjects: cs.CR, eess.SP, eess.SY
Abstract URL: https://arxiv.org/abs/2512.18207
Pdf URL: https://arxiv.org/pdf/2512.18207
Copy Paste: [[2512.18207]] FedWiLoc: Federated Learning for Privacy-Preserving WiFi Indoor Localization(https://arxiv.org/abs/2512.18207)
Keywords: privacy, protect, attack, robust, federate
Abstract: Current data-driven Wi-Fi-based indoor localization systems face three critical challenges: protecting user privacy, achieving accurate predictions in dynamic multipath environments, and generalizing across different deployments. Traditional Wi-Fi localization systems often compromise user privacy, particularly when facing compromised access points (APs) or man-in-the-middle attacks. As IoT devices proliferate in indoor environments, developing solutions that deliver accurate localization while robustly protecting privacy has become imperative. We introduce FedWiLoc, a privacy-preserving indoor localization system that addresses these challenges through three key innovations. First, FedWiLoc employs a split architecture where APs process Channel State Information (CSI) locally and transmit only privacy-preserving embedding vectors to user devices, preventing raw CSI exposure. Second, during training, FedWiLoc uses federated learning to collaboratively train the model across APs without centralizing sensitive user data. Third, we introduce a geometric loss function that jointly optimizes angle-of-arrival predictions and location estimates, enforcing geometric consistency to improve accuracy in challenging multipath conditions. Extensive evaluation across six diverse indoor environments spanning over 2,000 sq. ft. demonstrates that FedWiLoc outperforms state-of-the-art methods by up to 61.9% in median localization error while maintaining strong privacy guarantees throughout both training and inference.

Title: Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Authors: Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, Dong Yu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.18215
Pdf URL: https://arxiv.org/pdf/2512.18215
Copy Paste: [[2512.18215]] Stable and Efficient Single-Rollout RL for Multimodal Reasoning(https://arxiv.org/abs/2512.18215)
Keywords: large language model
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.

Title: GeoSense-AI: Fast Location Inference from Crisis Microblogs

Authors: Deepit Sapru
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2512.18225
Pdf URL: https://arxiv.org/pdf/2512.18225
Copy Paste: [[2512.18225]] GeoSense-AI: Fast Location Inference from Crisis Microblogs(https://arxiv.org/abs/2512.18225)
Keywords: robust, extraction, segmentation
Abstract: This paper presents an applied AI pipeline for realtime geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head to head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality ingest, inference, and visualization--surfacing locational signals at scale for floods, outbreaks, and other fastmoving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geo-tag reliance.

Title: Multifaceted Exploration of Spatial Openness in Rental Housing: A Big Data Analysis in Tokyo's 23 Wards

Authors: Takuya OKi, Yuan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18226
Pdf URL: https://arxiv.org/pdf/2512.18226
Copy Paste: [[2512.18226]] Multifaceted Exploration of Spatial Openness in Rental Housing: A Big Data Analysis in Tokyo's 23 Wards(https://arxiv.org/abs/2512.18226)
Keywords: segmentation
Abstract: Understanding spatial openness is vital for improving residential quality and design; however, studies often treat its influencing factors separately. This study developed a quantitative framework to evaluate the spatial openness in housing from two- (2D) and three- (3D) dimensional perspectives. Using data from 4,004 rental units in Tokyo's 23 wards, we examined the temporal and spatial variations in openness and its relationship with rent and housing attributes. 2D openness was computed via planar visibility using visibility graph analysis (VGA) from floor plans, whereas 3D openness was derived from interior images analysed using Mask2Former, a semantic segmentation model that identifies walls, ceilings, floors, and windows. The results showed an increase in living room visibility and a 1990s peak in overall openness. Spatial analyses revealed partial correlations among openness, rent, and building characteristics, reflecting urban redevelopment trends. Although the 2D and 3D openness indicators were not directly correlated, higher openness tended to correspond to higher rent. The impression scores predicted by the existing models were only weakly related to openness, suggesting that the interior design and furniture more strongly shape perceived space. This study offers a new multidimensional data-driven framework for quantifying residential spatial openness and linking it with urban and market dynamics.

Title: Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction

Authors: Shahram Najam Syed, Yitian Hu, Yuchao Yao
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.18237
Pdf URL: https://arxiv.org/pdf/2512.18237
Copy Paste: [[2512.18237]] Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction(https://arxiv.org/abs/2512.18237)
Keywords: transformer
Abstract: Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space--leveraging learned pyramidal descriptors instead of brittle keypoints--to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences--up to 18x lower than BARF and 2x lower than NoPe-NeRF--while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.

Title: SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality

Authors: Pan Ben Wong, Chengli Wu, Hanyue Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18241
Pdf URL: https://arxiv.org/pdf/2512.18241
Copy Paste: [[2512.18241]] SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality(https://arxiv.org/abs/2512.18241)
Keywords: diffusion, transformer
Abstract: Real-time Video Frame Interpolation (VFI) has long been dominated by flow-based methods like RIFE, which offer high throughput but often fail in complicated scenarios involving large motion and occlusion. Conversely, recent diffusion-based approaches (e.g., Consec. BB) achieve state-of-the-art perceptual quality but suffer from prohibitive latency, rendering them impractical for real-time applications. To bridge this gap, we propose Semantic-Guided RIFE (SG-RIFE). Instead of training from scratch, we introduce a parameter-efficient fine-tuning strategy that augments a pre-trained RIFE backbone with semantic priors from a frozen DINOv3 Vision Transformer. We propose a Split-Fidelity Aware Projection Module (Split-FAPM) to compress and refine high-dimensional features, and a Deformable Semantic Fusion (DSF) module to align these semantic priors with pixel-level motion fields. Experiments on SNU-FILM demonstrate that semantic injection provides a decisive boost in perceptual fidelity. SG-RIFE outperforms diffusion-based LDMVFI in FID/LPIPS and achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster, proving that semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time.

Title: Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation

Authors: Zehao Liu, Xi Lin
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18244
Pdf URL: https://arxiv.org/pdf/2512.18244
Copy Paste: [[2512.18244]] Breaking Minds, Breaking Systems: Jailbreaking Large Language Models via Human-like Psychological Manipulation(https://arxiv.org/abs/2512.18244)
Keywords: security, protect, defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) have gained considerable popularity and protected by increasingly sophisticated safety mechanisms. However, jailbreak attacks continue to pose a critical security threat by inducing models to generate policy-violating behaviors. Current paradigms focus on input-level anomalies, overlooking that the model's internal psychometric state can be systematically manipulated. To address this, we introduce Psychological Jailbreak, a new jailbreak attack paradigm that exposes a stateful psychological attack surface in LLMs, where attackers exploit the manipulation of a model's psychological state across interactions. Building on this insight, we propose Human-like Psychological Manipulation (HPM), a black-box jailbreak method that dynamically profiles a target model's latent psychological vulnerabilities and synthesizes tailored multi-turn attack strategies. By leveraging the model's optimization for anthropomorphic consistency, HPM creates a psychological pressure where social compliance overrides safety constraints. To systematically measure psychological safety, we construct an evaluation framework incorporating psychometric datasets and the Policy Corruption Score (PCS). Benchmarking against various models (e.g., GPT-4o, DeepSeek-V3, Gemini-2-Flash), HPM achieves a mean Attack Success Rate (ASR) of 88.1%, outperforming state-of-the-art attack baselines. Our experiments demonstrate robust penetration against advanced defenses, including adversarial prompt optimization (e.g., RPO) and cognitive interventions (e.g., Self-Reminder). Ultimately, PCS analysis confirms HPM induces safety breakdown to satisfy manipulated contexts. Our work advocates for a fundamental paradigm shift from static content filtering to psychological safety, prioritizing the development of psychological defense mechanisms against deep cognitive manipulation.

Title: Spectral Discrepancy and Cross-modal Semantic Consistency Learning for Object Detection in Hyperspectral Image

Authors: Xiao He, Chang Tang, Xinwang Liu, Wei Zhang, Zhimin Gao, Chuankun Li, Shaohua Qiu, Jiangfeng Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18245
Pdf URL: https://arxiv.org/pdf/2512.18245
Copy Paste: [[2512.18245]] Spectral Discrepancy and Cross-modal Semantic Consistency Learning for Object Detection in Hyperspectral Image(https://arxiv.org/abs/2512.18245)
Keywords: extraction
Abstract: Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences in similar substances. However, object detection in hyperspectral images faces significant challenges in intra- and inter-class similarity due to the spatial differences in hyperspectral inter-bands and unavoidable interferences, e.g., sensor noises and illumination. To alleviate the hyperspectral inter-bands inconsistencies and redundancy, we propose a novel network termed \textbf{S}pectral \textbf{D}iscrepancy and \textbf{C}ross-\textbf{M}odal semantic consistency learning (SDCM), which facilitates the extraction of consistent information across a wide range of hyperspectral bands while utilizing the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral dimension representations. On the other hand, we incorporate a spectral gated generator (SGG) into the framework that filters out the redundant data inherent in hyperspectral information based on the importance of the bands. Then, we design the spectral discrepancy aware (SDA) module to enrich the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that our proposed method achieves state-of-the-art performance when compared with other ones.

Title: Loom: Diffusion-Transformer for Interleaved Generation

Authors: Mingcheng Ye, Jiaming Liu, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18254
Pdf URL: https://arxiv.org/pdf/2512.18254
Copy Paste: [[2512.18254]] Loom: Diffusion-Transformer for Interleaved Generation(https://arxiv.org/abs/2512.18254)
Keywords: diffusion, transformer
Abstract: Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.

Title: Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks

Authors: Yucheng Fan, Jiawei Chen, Yu Tian, Zhaoxia Yin
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2512.18264
Pdf URL: https://arxiv.org/pdf/2512.18264
Copy Paste: [[2512.18264]] Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks(https://arxiv.org/abs/2512.18264)
Keywords: privacy, protect, attack, fair
Abstract: As vision-language models (VLMs) become widely adopted, VLM-based attribute inference attacks have emerged as a serious privacy concern, enabling adversaries to infer private attributes from images shared on social media. This escalating threat calls for dedicated protection methods to safeguard user privacy. However, existing methods often degrade the visual quality of images or interfere with vision-based functions on social media, thereby failing to achieve a desirable balance between privacy protection and user experience. To address this challenge, we propose a novel protection method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. While our method is conceptually effective, fair comparisons between methods remain challenging due to the lack of publicly available evaluation datasets. To fill this gap, we introduce VPI-COCO, a publicly available benchmark comprising 522 images with hierarchically structured privacy questions and corresponding non-private counterparts, enabling fine-grained and joint evaluation of protection methods in terms of privacy preservation and user experience. Building upon this benchmark, experiments on multiple VLMs demonstrate that our method effectively reduces PAR below 25%, keeps NPAR above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased privacy questions, demonstrating its strong practical applicability for real-world VLM deployments.

Title: LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform

Authors: Lizhi Ma, Yi-Xiang Hu, Yuke Wang, Yifang Zhao, Yihui Ren, Jian-Xiang Liao, Feng Wu, Xiang-Yang Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18266
Pdf URL: https://arxiv.org/pdf/2512.18266
Copy Paste: [[2512.18266]] LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform(https://arxiv.org/abs/2512.18266)
Keywords: robust
Abstract: With the rapid advancements in big data technologies, the Databricks platform has become a cornerstone for enterprises and research institutions, offering high computational efficiency and a robust ecosystem. However, managing the escalating operational costs associated with job execution remains a critical challenge. Existing solutions rely on static configurations or reactive adjustments, which fail to adapt to the dynamic nature of workloads. To address this, we introduce LeJOT, an intelligent job cost orchestration framework that leverages machine learning for execution time prediction and a solver-based optimization model for real-time resource allocation. Unlike conventional scheduling techniques, LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs while ensuring performance requirements are met. Experimental results on real-world Databricks workloads demonstrate that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe, outperforming traditional static allocation strategies. Our approach provides a scalable and adaptive solution for cost-efficient job scheduling in Data Lakehouse environments.

Title: FedSUM Family: Efficient Federated Learning Methods under Arbitrary Client Participation

Authors: Runze You, Shi Pu
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2512.18275
Pdf URL: https://arxiv.org/pdf/2512.18275
Copy Paste: [[2512.18275]] FedSUM Family: Efficient Federated Learning Methods under Arbitrary Client Participation(https://arxiv.org/abs/2512.18275)
Keywords: federate
Abstract: Federated Learning (FL) methods are often designed for specific client participation patterns, limiting their applicability in practical deployments. We introduce the FedSUM family of algorithms, which supports arbitrary client participation without additional assumptions on data heterogeneity. Our framework models participation variability with two delay metrics, the maximum delay $\tau_{\max}$ and the average delay $\tau_{\text{avg}}$. The FedSUM family comprises three variants: FedSUM-B (basic version), FedSUM (standard version), and FedSUM-CR (communication-reduced version). We provide unified convergence guarantees demonstrating the effectiveness of our approach across diverse participation patterns, thereby broadening the applicability of FL in real-world scenarios.

Title: UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations

Authors: Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Yiming Ma, Guangming Xiong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18279
Pdf URL: https://arxiv.org/pdf/2512.18279
Copy Paste: [[2512.18279]] UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations(https://arxiv.org/abs/2512.18279)
Keywords: robust
Abstract: Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to arbitrary modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle the data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-model and inter-modal features from any modality combinations. To fully exploit the network's generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at this https URL.

Title: AL-GNN: Privacy-Preserving and Replay-Free Continual Graph Learning via Analytic Learning

Authors: Xuling Zhang, Jindong Li, Yifei Zhang, Menglin Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18295
Pdf URL: https://arxiv.org/pdf/2512.18295
Copy Paste: [[2512.18295]] AL-GNN: Privacy-Preserving and Replay-Free Continual Graph Learning via Analytic Learning(https://arxiv.org/abs/2512.18295)
Keywords: privacy
Abstract: Continual graph learning (CGL) aims to enable graph neural networks to incrementally learn from a stream of graph structured data without forgetting previously acquired knowledge. Existing methods particularly those based on experience replay typically store and revisit past graph data to mitigate catastrophic forgetting. However, these approaches pose significant limitations, including privacy concerns, inefficiency. In this work, we propose AL GNN, a novel framework for continual graph learning that eliminates the need for backpropagation and replay buffers. Instead, AL GNN leverages principles from analytic learning theory to formulate learning as a recursive least squares optimization process. It maintains and updates model knowledge analytically through closed form classifier updates and a regularized feature autocorrelation matrix. This design enables efficient one pass training for each task, and inherently preserves data privacy by avoiding historical sample storage. Extensive experiments on multiple dynamic graph classification benchmarks demonstrate that AL GNN achieves competitive or superior performance compared to existing methods. For instance, it improves average performance by 10% on CoraFull and reduces forgetting by over 30% on Reddit, while also reducing training time by nearly 50% due to its backpropagation free design.

Title: InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning

Authors: Tanjim Taharat Aurpa, Md Shoaib Ahmed, Md Mahbubur Rahman, Md. Golam Moazzam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18301
Pdf URL: https://arxiv.org/pdf/2512.18301
Copy Paste: [[2512.18301]] InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning(https://arxiv.org/abs/2512.18301)
Keywords: transformer
Abstract: People use search engines for various topics and items, from daily essentials to more aspirational and specialized objects. Therefore, search engines have taken over as peoples preferred resource. The How To prefix has become familiar and widely used in various search styles to find solutions to particular problems. This search allows people to find sequential instructions by providing detailed guidelines to accomplish specific tasks. Categorizing instructional text is also essential for task-oriented learning and creating knowledge bases. This study uses the How To articles to determine the multi-label instruction category. We have brought this work with a dataset comprising 11,121 observations from wikiHow, where each record has multiple categories. To find out the multi-label category meticulously, we employ some transformer-based deep neural architectures, such as Generalized Autoregressive Pretraining for Language Understanding (XLNet), Bidirectional Encoder Representation from Transformers (BERT), etc. In our multi-label instruction classification process, we have reckoned our proposed architectures using accuracy and macro f1-score as the performance metrics. This thorough evaluation showed us much about our strategys strengths and drawbacks. Specifically, our implementation of the XLNet architecture has demonstrated unprecedented performance, achieving an accuracy of 97.30% and micro and macro average scores of 89.02% and 93%, a noteworthy accomplishment in multi-label classification. This high level of accuracy and macro average score is a testament to the effectiveness of the XLNet architecture in our proposed InstructNet approach. By employing a multi-level strategy in our evaluation process, we have gained a more comprehensive knowledge of the effectiveness of our proposed architectures and identified areas for forthcoming improvement and refinement.

Title: MORPHEUS: A Multidimensional Framework for Modeling, Measuring, and Mitigating Human Factors in Cybersecurity

Authors: Giuseppe Desolda, Francesco Greco, Rosa Lanzilotti, Cesare Tucci
Subjects: cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2512.18303
Pdf URL: https://arxiv.org/pdf/2512.18303
Copy Paste: [[2512.18303]] MORPHEUS: A Multidimensional Framework for Modeling, Measuring, and Mitigating Human Factors in Cybersecurity(https://arxiv.org/abs/2512.18303)
Keywords: security
Abstract: Current cybersecurity research increasingly acknowledges the human factor, yet remains fragmented, often treating user vulnerabilities as isolated and static traits. This paper introduces MORPHEUS, a holistic framework that operationalizes human-centric security as a dynamic and interconnected system. Grounded in the Cognition-Affect-Behavior (CAB) model and Attribution Theory, MORPHEUS consolidates 50 human factors influencing susceptibility to major cyberthreats, including phishing, malware, password management, and misconfigurations. Beyond factor identification, the framework systematically maps 295 documented interactions, revealing how cognitive, emotional, behavioral, and socio-organizational processes jointly shape security outcomes, and distills them into twelve recurring interaction mechanisms. MORPHEUS further links theory to practice through an inventory of 99 validated psychometric instruments, enabling empirical assessment and targeted intervention. We illustrate the framework's applicability through concrete operational scenarios, spanning risk diagnosis, training, and interface design. Overall, MORPHEUS provides a rigorous yet actionable foundation for advancing human-centered cybersecurity research and practice.

Title: Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings

Authors: Harsh Rathva, Ojas Srivastava, Pruthwik Mishra
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18309
Pdf URL: https://arxiv.org/pdf/2512.18309
Copy Paste: [[2512.18309]] Embedded Safety-Aligned Intelligence via Differentiable Internal Alignment Embeddings(https://arxiv.org/abs/2512.18309)
Keywords: fair, diffusion
Abstract: We introduce Embedded Safety-Aligned Intelligence (ESAI), a theoretical framework for multi-agent reinforcement learning that embeds alignment constraints directly into agents internal representations using differentiable internal alignment embeddings. Unlike external reward shaping or post-hoc safety constraints, internal alignment embeddings are learned latent variables that predict externalized harm through counterfactual reasoning and modulate policy updates toward harm reduction through attention and graph-based propagation. The ESAI framework integrates four mechanisms: differentiable counterfactual alignment penalties computed from soft reference distributions, alignment-weighted perceptual attention, Hebbian associative memory supporting temporal credit assignment, and similarity-weighted graph diffusion with bias mitigation controls. We analyze stability conditions for bounded internal embeddings under Lipschitz continuity and spectral constraints, discuss computational complexity, and examine theoretical properties including contraction behavior and fairness-performance tradeoffs. This work positions ESAI as a conceptual contribution to differentiable alignment mechanisms in multi-agent systems. We identify open theoretical questions regarding convergence guarantees, embedding dimensionality, and extension to high-dimensional environments. Empirical evaluation is left to future work.

Title: MatE: Material Extraction from Single-Image via Geometric Prior

Authors: Zeyu Zhang, Wei Zhai, Jian Yang, Yang Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18312
Pdf URL: https://arxiv.org/pdf/2512.18312
Copy Paste: [[2512.18312]] MatE: Material Extraction from Single-Image via Geometric Prior(https://arxiv.org/abs/2512.18312)
Keywords: robust, extraction, diffusion
Abstract: The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectify residual distortions from the coarse result and translate it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world image.

Title: MatSpray: Fusing 2D Material World Knowledge on 3D Geometry

Authors: Philipp Langsteiner, Jan-Niklas Dihlmann, Hendrik P.A. Lensch
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.18314
Pdf URL: https://arxiv.org/pdf/2512.18314
Copy Paste: [[2512.18314]] MatSpray: Fusing 2D Material World Knowledge on 3D Geometry(https://arxiv.org/abs/2512.18314)
Keywords: diffusion
Abstract: Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.

Title: Trustworthy and Explainable Deep Reinforcement Learning for Safe and Energy-Efficient Process Control: A Use Case in Industrial Compressed Air Systems

Authors: Vincent Bezold, Patrick Wagner, Jakob Hofmann, Marco Huber, Alexander Sauer
Subjects: cs.LG, cs.AI, eess.SY
Abstract URL: https://arxiv.org/abs/2512.18317
Pdf URL: https://arxiv.org/pdf/2512.18317
Copy Paste: [[2512.18317]] Trustworthy and Explainable Deep Reinforcement Learning for Safe and Energy-Efficient Process Control: A Use Case in Industrial Compressed Air Systems(https://arxiv.org/abs/2512.18317)
Keywords: explainability
Abstract: This paper presents a trustworthy reinforcement learning approach for the control of industrial compressed air systems. We develop a framework that enables safe and energy-efficient operation under realistic boundary conditions and introduce a multi-level explainability pipeline combining input perturbation tests, gradient-based sensitivity analysis, and SHAP (SHapley Additive exPlanations) feature attribution. An empirical evaluation across multiple compressor configurations shows that the learned policy is physically plausible, anticipates future demand, and consistently respects system boundaries. Compared to the installed industrial controller, the proposed approach reduces unnecessary overpressure and achieves energy savings of approximately 4\,\% without relying on explicit physics models. The results further indicate that system pressure and forecast information dominate policy decisions, while compressor-level inputs play a secondary role. Overall, the combination of efficiency gains, predictive behavior, and transparent validation supports the trustworthy deployment of reinforcement learning in industrial energy systems.

Title: LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation

Authors: Guo Chen, Junjie Huang, Huaijin Xie, Fei Sun, Tao Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18329
Pdf URL: https://arxiv.org/pdf/2512.18329
Copy Paste: [[2512.18329]] LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation(https://arxiv.org/abs/2512.18329)
Keywords: large language model
Abstract: Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG significantly reduce the average 98% output tokens overhead and 58.6% inferencing time while improving 8B non-reasoning model's F1 performance ranging from 6.2% to 22.5% to surpass the performance of 32B reasoning model in RAG, offering a practical and efficient path forward for RAG systems.

Title: A two-stream network with global-local feature fusion for bone age assessment

Authors: Qiong Lou, Han Yang, Fang Lu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18331
Pdf URL: https://arxiv.org/pdf/2512.18331
Copy Paste: [[2512.18331]] A two-stream network with global-local feature fusion for bone age assessment(https://arxiv.org/abs/2512.18331)
Keywords: extraction, transformer
Abstract: Bone Age Assessment (BAA) is a widely used clinical technique that can accurately reflect an individual's growth and development level, as well as maturity. In recent years, although deep learning has advanced the field of bone age assessment, existing methods face challenges in efficiently balancing global features and local skeletal details. This study aims to develop an automated bone age assessment system based on a two-stream deep learning architecture to achieve higher accuracy in bone age assessment. We propose the BoNet+ model incorporating global and local feature extraction channels. A Transformer module is introduced into the global feature extraction channel to enhance the ability in extracting global features through multi-head self-attention mechanism. A RFAConv module is incorporated into the local feature extraction channel to generate adaptive attention maps within multiscale receptive fields, enhancing local feature extraction capabilities. Global and local features are concatenated along the channel dimension and optimized by an Inception-V3 network. The proposed method has been validated on the Radiological Society of North America (RSNA) and Radiological Hand Pose Estimation (RHPE) test datasets, achieving mean absolute errors (MAEs) of 3.81 and 5.65 months, respectively. These results are comparable to the state-of-the-art. The BoNet+ model reduces the clinical workload and achieves automatic, high-precision, and more objective bone age assessment.

Title: Towards Efficient Agents: A Co-Design of Inference Architecture and System

Authors: Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu, Hanting Chen, Wangze Zhang, Chuansai Zhou, Yiming Li, Chen Chen, Xing Li, Zhiyuan Yang, Xiaosong Li, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18337
Pdf URL: https://arxiv.org/pdf/2512.18337
Copy Paste: [[2512.18337]] Towards Efficient Agents: A Co-Design of Inference Architecture and System(https://arxiv.org/abs/2512.18337)
Keywords: large language model
Abstract: The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory to achieve low-overhead inference acceleration; and AgentCompress, a semantic compression mechanism that asynchronously distills and reorganizes agent memory without disrupting ongoing reasoning. Together, these modules form a Self-Evolution Engine capable of sustaining efficiency and cognitive stability throughout long-horizon reasoning tasks. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with preserved accuracy. These results underscore that optimizing for agentic task completion-rather than merely per-token throughput-is the key to building scalable, efficient, and self-improving intelligent systems.

Title: Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration

Authors: Wonseok Choi, Hyunah Yu, Jongmin Kim, Hyesung Ji, Jaiyoung Park, Jung Ho Ahn
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2512.18345
Pdf URL: https://arxiv.org/pdf/2512.18345
Copy Paste: [[2512.18345]] Theodosian: A Deep Dive into Memory-Hierarchy-Centric FHE Acceleration(https://arxiv.org/abs/2512.18345)
Keywords: secure, privacy
Abstract: Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, due to its high compute and memory demands, extensive acceleration research has been pursued across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. We focus on on-chip cache behavior, and show that the dominant kernels remain bound by memory bandwidth despite a high-bandwidth L2 cache, exposing a persistent memory wall. We further discover that the overall CKKS pipeline throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Our approach delivers consistent speedups across various CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers to 15.2ms with Theodosian, and further to 12.8ms with additional algorithmic optimizations, establishing new state-of-the-art GPU performance to the best of our knowledge.

Title: LLM-based Few-Shot Early Rumor Detection with Imitation Agent

Authors: Fengzhu Zeng, Qian Shao, Ling Cheng, Wei Gao, Shih-Fen Cheng, Jing Ma, Cheng Niu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18352
Pdf URL: https://arxiv.org/pdf/2512.18352
Copy Paste: [[2512.18352]] LLM-based Few-Shot Early Rumor Detection with Imitation Agent(https://arxiv.org/abs/2512.18352)
Keywords: large language model
Abstract: Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for \textit{early time point determination}, while the LLM serves as a powerful \textit{rumor detector}. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.

Title: DACE For Railway Acronym Disambiguation

Authors: El Mokhtar Hribach, Oussama Mechhour, Mohammed Elmonstaser, Yassine El Boudouri, Othmane Kabal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18357
Pdf URL: https://arxiv.org/pdf/2512.18357
Copy Paste: [[2512.18357]] DACE For Railway Acronym Disambiguation(https://arxiv.org/abs/2512.18357)
Keywords: secure, large language model
Abstract: Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine'26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.

Title: SRS-Stories: Vocabulary-constrained multilingual story generation for language learning

Authors: Wiktor Kamzela, Mateusz Lango, Ondrej Dusek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18362
Pdf URL: https://arxiv.org/pdf/2512.18362
Copy Paste: [[2512.18362]] SRS-Stories: Vocabulary-constrained multilingual story generation for language learning(https://arxiv.org/abs/2512.18362)
Keywords: large language model
Abstract: In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical, coherent, and provide better examples of word usage than texts generated by the standard constrained beam search approach

Title: Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance

Authors: Badr Moufad, Navid Bagheri Shouraki, Alain Oliviero Durmus, Thomas Hirtz, Eric Moulines, Jimmy Olsson, Yazid Janati
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18365
Pdf URL: https://arxiv.org/pdf/2512.18365
Copy Paste: [[2512.18365]] Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance(https://arxiv.org/abs/2512.18365)
Keywords: diffusion
Abstract: Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure however requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields simple and efficient to sample Gaussian posterior transitions, sidestepping the backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost. Code is available at this https URL.

Title: Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale

Authors: Ansh Nagwekar
Subjects: cs.LG, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2512.18373
Pdf URL: https://arxiv.org/pdf/2512.18373
Copy Paste: [[2512.18373]] Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale(https://arxiv.org/abs/2512.18373)
Keywords: interpretability
Abstract: Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models, order-of-magnitude reductions in training time, and improved interpretability into how networks learn. While stochastic gradient descent (SGD) and its variants have become the de facto standard for training deep networks, their success in these over-parameterized regimes often appears more empirical than principled. This thesis investigates this apparent paradox by tracing the evolution of optimization algorithms from classical first-order methods to modern higher-order techniques, revealing how principled algorithmic design can demystify the training process. Starting from first principles with SGD and adaptive gradient methods, the analysis progressively uncovers the limitations of these conventional approaches when confronted with anisotropy that is representative of real-world data. These breakdowns motivate the exploration of sophisticated alternatives rooted in curvature information: second-order approximation techniques, layer-wise preconditioning, adaptive learning rates, and more. Next, the interplay between these optimization algorithms and the broader neural network training toolkit, which includes prior and recent developments such as maximal update parametrization, learning rate schedules, and exponential moving averages, emerges as equally essential to empirical success. To bridge the gap between theoretical understanding and practical deployment, this paper offers practical prescriptions and implementation strategies for integrating these methods into modern deep learning workflows.

Title: RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion

Authors: Wenhao Hu, Haonan Zhou, Zesheng Li, Liu Liu, Jiacheng Dong, Zhizhong Su, Gaoang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18386
Pdf URL: https://arxiv.org/pdf/2512.18386
Copy Paste: [[2512.18386]] RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion(https://arxiv.org/abs/2512.18386)
Keywords: diffusion
Abstract: Recent advances in 3D scene representations have enabled high-fidelity novel view synthesis, yet adapting to discrete scene changes and constructing interactive 3D environments remain open challenges in vision and robotics. Existing approaches focus solely on updating a single scene without supporting novel-state synthesis. Others rely on diffusion-based object-background decoupling that works on one state at a time and cannot fuse information across multiple observations. To address these limitations, we introduce RecurGS, a recurrent fusion framework that incrementally integrates discrete Gaussian scene states into a single evolving representation capable of interaction. RecurGS detects object-level changes across consecutive states, aligns their geometric motion using semantic correspondence and Lie-algebra based SE(3) refinement, and performs recurrent updates that preserve historical structures through replay supervision. A voxelized, visibility-aware fusion module selectively incorporates newly observed regions while keeping stable areas fixed, mitigating catastrophic forgetting and enabling efficient long-horizon updates. RecurGS supports object-level manipulation, synthesizes novel scene states without requiring additional scans, and maintains photorealistic fidelity across evolving environments. Extensive experiments across synthetic and real-world datasets demonstrate that our framework delivers high-quality reconstructions with substantially improved update efficiency, providing a scalable step toward continuously interactive Gaussian worlds.

Title: AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

Authors: Mark Kashirskiy, Artiom Lipinski, Ilya Makarov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18399
Pdf URL: https://arxiv.org/pdf/2512.18399
Copy Paste: [[2512.18399]] AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3(https://arxiv.org/abs/2512.18399)
Keywords: transformer, large language model
Abstract: Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.

Title: Automated Mosaic Tesserae Segmentation via Deep Learning Techniques

Authors: Charilaos Kapelonis, Marios Antonakakis, Konstantinos Politof, Aristomenis Antoniadis, Michalis Zervakis
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18406
Pdf URL: https://arxiv.org/pdf/2512.18406
Copy Paste: [[2512.18406]] Automated Mosaic Tesserae Segmentation via Deep Learning Techniques(https://arxiv.org/abs/2512.18406)
Keywords: segmentation
Abstract: Art is widely recognized as a reflection of civilization and mosaics represent an important part of cultural heritage. Mosaics are an ancient art form created by arranging small pieces, called tesserae, on a surface using adhesive. Due to their age and fragility, they are prone to damage, highlighting the need for digital preservation. This paper addresses the problem of digitizing mosaics by segmenting the tesserae to separate them from the background within the broader field of Image Segmentation in Computer Vision. We propose a method leveraging Segment Anything Model 2 (SAM 2) by Meta AI, a foundation model that outperforms most conventional segmentation models, to automatically segment mosaics. Due to the limited open datasets in the field, we also create an annotated dataset of mosaic images to fine-tune and evaluate the model. Quantitative evaluation on our testing dataset shows notable improvements compared to the baseline SAM 2 model, with Intersection over Union increasing from 89.00% to 91.02% and Recall from 92.12% to 95.89%. Additionally, on a benchmark proposed by a prior approach, our model achieves an F-measure 3% higher than previous methods and reduces the error in the absolute difference between predicted and actual tesserae from 0.20 to just 0.02. The notable performance of the fine-tuned SAM 2 model together with the newly annotated dataset can pave the way for real-time segmentation of mosaic images.

Title: MoE Pathfinder: Trajectory-driven Expert Pruning

Authors: Xican Yang, Yuanhe Tian, Yan Song
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18425
Pdf URL: https://arxiv.org/pdf/2512.18425
Copy Paste: [[2512.18425]] MoE Pathfinder: Trajectory-driven Expert Pruning(https://arxiv.org/abs/2512.18425)
Keywords: large language model
Abstract: Mixture-of-experts (MoE) architectures used in large language models (LLMs) achieve state-of-the-art performance across diverse tasks yet face practical challenges such as deployment complexity and low activation efficiency. Expert pruning has thus emerged as a promising solution to reduce computational overhead and simplify the deployment of MoE models. However, existing expert pruning approaches conventionally rely on local importance metrics and often apply uniform layer-wise pruning, leveraging only partial evaluation signals and overlooking the heterogeneous contributions of experts across layers. To address these limitations, we propose an expert pruning approach based on the trajectory of activated experts across layers, which treats MoE as a weighted computation graph and casts expert selection as a global optimal path planning problem. Within this framework, we integrate complementary importance signals from reconstruction error, routing probabilities, and activation strength at the trajectory level, which naturally yields non-uniform expert retention across layers. Experiments show that our approach achieves superior pruning performance on nearly all tasks compared with most existing approaches.

Title: Federated Learning Based Decentralized Adaptive Intelligent Transmission Protocol for Privacy Preserving 6G Networks

Authors: Ansar Ahmed
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18432
Pdf URL: https://arxiv.org/pdf/2512.18432
Copy Paste: [[2512.18432]] Federated Learning Based Decentralized Adaptive Intelligent Transmission Protocol for Privacy Preserving 6G Networks(https://arxiv.org/abs/2512.18432)
Keywords: secure, privacy, robust, federate
Abstract: The move to 6th Generation (6G) wireless networks creates new issues with privacy, scalability, and adaptability. The data-intensive nature of 6G is not handled well by older, centralized network models. A shift toward more secure and decentralized systems is therefore required. A new framework called the Federated Learning-based Decentralized Adaptive Intelligent Transmission Protocol (AITP) is proposed to meet these challenges. The AITP uses the distributed learning of Federated Learning (FL) within a decentralized system. Transmission parameters can be adjusted intelligently in real time. User privacy is maintained by keeping raw data on local edge devices. The protocol's performance was evaluated with mathematical modeling and detailed simulations. It was shown to be superior to traditional non-adaptive and centralized AI methods across several key metrics. These included latency, network throughput, energy efficiency, and robustness. The AITP is presented as a foundational technology for future 6G networks that supports a user-centric, privacy-first design. This study is a step forward for privacy-preserving research in 6G.

Title: MeniMV: A Multi-view Benchmark for Meniscus Injury Severity Grading

Authors: Shurui Xu, Siqi Yang, Jiapin Ren, Zhong Cao, Hongwei Yang, Mengzhen Fan, Yuyu Sun, Shuyan Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18437
Pdf URL: https://arxiv.org/pdf/2512.18437
Copy Paste: [[2512.18437]] MeniMV: A Multi-view Benchmark for Meniscus Injury Severity Grading(https://arxiv.org/abs/2512.18437)
Keywords: transformer
Abstract: Precise grading of meniscal horn tears is critical in knee injury diagnosis but remains underexplored in automated MRI analysis. Existing methods often rely on coarse study-level labels or binary classification, lacking localization and severity information. In this paper, we introduce MeniMV, a multi-view benchmark dataset specifically designed for horn-specific meniscus injury grading. MeniMV comprises 3,000 annotated knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images. Each exam is meticulously annotated with four-tier (grade 0-3) severity labels for both anterior and posterior meniscal horns, verified by chief orthopedic physicians. Notably, MeniMV offers more than double the pathology-labeled data volume of prior datasets while uniquely capturing the dual-view diagnostic context essential in clinical practice. To demonstrate the utility of MeniMV, we benchmark multiple state-of-the-art CNN and Transformer-based models. Our extensive experiments establish strong baselines and highlight challenges in severity grading, providing a valuable foundation for future research in automated musculoskeletal imaging.

Title: An Agentic AI Framework for Training General Practitioner Student Skills

Authors: Victor De Marez, Jens Van Nooten, Luna De Bruyne, Walter Daelemans
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18440
Pdf URL: https://arxiv.org/pdf/2512.18440
Copy Paste: [[2512.18440]] An Agentic AI Framework for Training General Practitioner Student Skills(https://arxiv.org/abs/2512.18440)
Keywords: large language model
Abstract: Advancements in large language models offer strong potential for enhancing virtual simulated patients (VSPs) in medical education by providing scalable alternatives to resource-intensive traditional methods. However, current VSPs often struggle with medical accuracy, consistent roleplaying, scenario generation for VSP use, and educationally structured feedback. We introduce an agentic framework for training general practitioner student skills that unifies (i) configurable, evidence-based vignette generation, (ii) controlled persona-driven patient dialogue with optional retrieval grounding, and (iii) standards-based assessment and feedback for both communication and clinical reasoning. We instantiate the framework in an interactive spoken consultation setting and evaluate it with medical students ($\mathbf{N{=}14}$). Participants reported realistic and vignette-faithful dialogue, appropriate difficulty calibration, a stable personality signal, and highly useful example-rich feedback, alongside excellent overall usability. These results support agentic separation of scenario control, interaction control, and standards-based assessment as a practical pattern for building dependable and pedagogically valuable VSP training tools.

Title: On the Universality of Transformer Architectures; How Much Attention Is Enough?

Authors: Amirreza Abbasi, Mohsen Hooshmand
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18445
Pdf URL: https://arxiv.org/pdf/2512.18445
Copy Paste: [[2512.18445]] On the Universality of Transformer Architectures; How Much Attention Is Enough?(https://arxiv.org/abs/2512.18445)
Keywords: robust, transformer, large language model
Abstract: Transformers are crucial across many AI fields, such as large language models, computer vision, and reinforcement learning. This prominence stems from the architecture's perceived universality and scalability compared to alternatives. This work examines the problem of universality in Transformers, reviews recent progress, including architectural refinements such as structural minimality and approximation rates, and surveys state-of-the-art advances that inform both theoretical and practical understanding. Our aim is to clarify what is currently known about Transformers expressiveness, separate robust guarantees from fragile ones, and identify key directions for future theoretical research.

Title: Object-Centric Framework for Video Moment Retrieval

Authors: Zongyao Li, Yongkang Wong, Satoshi Yamazaki, Jianquan Liu, Mohan Kankanhalli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18448
Pdf URL: https://arxiv.org/pdf/2512.18448
Copy Paste: [[2512.18448]] Object-Centric Framework for Video Moment Retrieval(https://arxiv.org/abs/2512.18448)
Keywords: transformer
Abstract: Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.

Title: Secret mixtures of experts inside your LLM

Authors: Enric Boix-Adsera
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.18452
Pdf URL: https://arxiv.org/pdf/2512.18452
Copy Paste: [[2512.18452]] Secret mixtures of experts inside your LLM(https://arxiv.org/abs/2512.18452)
Keywords: transformer
Abstract: Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLM models by hypothesizing that these layers secretly approximately perform a sparse computation -- namely, that they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs, and demonstrate that the activation distribution matters -- these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shine light on a general principle at play in MLP layers inside LLMs, and give an explanation for the effectiveness of modern MoE-based transformers. Additionally, our experimental explorations suggest new directions for more efficient MoE architecture design based on low-rank routers.

Title: Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs

Authors: David Graber, Victor Armegioiu, Rebecca Buller, Siddhartha Mishra
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.18454
Pdf URL: https://arxiv.org/pdf/2512.18454
Copy Paste: [[2512.18454]] Out-of-Distribution Detection in Molecular Complexes via Diffusion Models for Irregular Graphs(https://arxiv.org/abs/2512.18454)
Keywords: robust, diffusion
Abstract: Predictive machine learning models generally excel on in-distribution data, but their performance degrades on out-of-distribution (OOD) inputs. Reliable deployment therefore requires robust OOD detection, yet this is particularly challenging for irregular 3D graphs that combine continuous geometry with categorical identities and are unordered by construction. Here, we present a probabilistic OOD detection framework for complex 3D graph data built on a diffusion model that learns a density of the training distribution in a fully unsupervised manner. A key ingredient we introduce is a unified continuous diffusion over both 3D coordinates and discrete features: categorical identities are embedded in a continuous space and trained with cross-entropy, while the corresponding diffusion score is obtained analytically via posterior-mean interpolation from predicted class probabilities. This yields a single self-consistent probability-flow ODE (PF-ODE) that produces per-sample log-likelihoods, providing a principled typicality score for distribution shift. We validate the approach on protein-ligand complexes and construct strict OOD datasets by withholding entire protein families from training. PF-ODE likelihoods identify held-out families as OOD and correlate strongly with prediction errors of an independent binding-affinity model (GEMS), enabling a priori reliability estimates on new complexes. Beyond scalar likelihoods, we show that multi-scale PF-ODE trajectory statistics - including path tortuosity, flow stiffness, and vector-field instability - provide complementary OOD information. Modeling the joint distribution of these trajectory features yields a practical, high-sensitivity detector that improves separation over likelihood-only baselines, offering a label-free OOD quantification workflow for geometric deep learning.

Title: Plasticine: A Traceable Diffusion Model for Medical Image Translation

Authors: Tianyang Zhanng, Xinxing Cheng, Jun Cheng, Shaoming Zheng, He Zhao, Huazhu Fu, Alejandro F Frangi, Jiang Liu, Jinming Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18455
Pdf URL: https://arxiv.org/pdf/2512.18455
Copy Paste: [[2512.18455]] Plasticine: A Traceable Diffusion Model for Medical Image Translation(https://arxiv.org/abs/2512.18455)
Keywords: interpretability, diffusion
Abstract: Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.

Title: SoK: Understanding (New) Security Issues Across AI4Code Use Cases

Authors: Qilong Wu, Taoran Li, Tianyang Zhou, Varun Chandrasekaran
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18456
Pdf URL: https://arxiv.org/pdf/2512.18456
Copy Paste: [[2512.18456]] SoK: Understanding (New) Security Issues Across AI4Code Use Cases(https://arxiv.org/abs/2512.18456)
Keywords: secure, security, attack, robust
Abstract: AI-for-Code (AI4Code) systems are reshaping software engineering, with tools like GitHub Copilot accelerating code generation, translation, and vulnerability detection. Alongside these advances, however, security risks remain pervasive: insecure outputs, biased benchmarks, and susceptibility to adversarial manipulation undermine their reliability. This SoK surveys the landscape of AI4Code security across three core applications, identifying recurring gaps: benchmark dominance by Python and toy problems, lack of standardized security datasets, data leakage in evaluation, and fragile adversarial robustness. A comparative study of six state-of-the-art models illustrates these challenges: insecure patterns persist in code generation, vulnerability detection is brittle to semantic-preserving attacks, fine-tuning often misaligns security objectives, and code translation yields uneven security benefits. From this analysis, we distill three forward paths: embedding secure-by-default practices in code generation, building robust and comprehensive detection benchmarks, and leveraging translation as a route to security-enhanced languages. We call for a shift toward security-first AI4Code, where vulnerability mitigation and robustness are embedded throughout the development life cycle.

Title: Self-organizing maps for water quality assessment in reservoirs and lakes: A systematic literature review

Authors: Oraib Almegdadi, João Marcelino, Sarah Fakhreddine, João Manso, Nuno C. Marques
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18466
Pdf URL: https://arxiv.org/pdf/2512.18466
Copy Paste: [[2512.18466]] Self-organizing maps for water quality assessment in reservoirs and lakes: A systematic literature review(https://arxiv.org/abs/2512.18466)
Keywords: security
Abstract: Sustainable water quality underpins ecological balance and water security. Assessing and managing lakes and reservoirs is difficult due to data sparsity, heterogeneity, and nonlinear relationships among parameters. This review examines how Self-Organizing Map (SOM), an unsupervised AI technique, is applied to water quality assessment. It synthesizes research on parameter selection, spatial and temporal sampling strategies, and clustering approaches. Emphasis is placed on how SOM handles multidimensional data and uncovers hidden patterns to support effective water management. The growing availability of environmental data from in-situ sensors, remote sensing imagery, IoT technologies, and historical records has significantly expanded analytical opportunities in environmental monitoring. SOM has proven effective in analysing complex datasets, particularly when labelled data are limited or unavailable. It enables high-dimensional data visualization, facilitates the detection of hidden ecological patterns, and identifies critical correlations among diverse water quality indicators. This review highlights SOMs versatility in ecological assessments, trophic state classification, algal bloom monitoring, and catchment area impact evaluations. The findings offer comprehensive insights into existing methodologies, supporting future research and practical applications aimed at improving the monitoring and sustainable management of lake and reservoir ecosystems.

Title: APC-GNN++: An Adaptive Patient-Centric GNN with Context-Aware Attention and Mini-Graph Explainability for Diabetes Classification

Authors: Khaled Berkani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18473
Pdf URL: https://arxiv.org/pdf/2512.18473
Copy Paste: [[2512.18473]] APC-GNN++: An Adaptive Patient-Centric GNN with Context-Aware Attention and Mini-Graph Explainability for Diabetes Classification(https://arxiv.org/abs/2512.18473)
Keywords: explainability
Abstract: We propose APC-GNN++, an adaptive patient-centric Graph Neural Network for diabetes classification. Our model integrates context-aware edge attention, confidence-guided blending of node features and graph representations, and neighborhood consistency regularization to better capture clinically meaningful relationships between patients. To handle unseen patients, we introduce a mini-graph approach that leverages the nearest neighbors of the new patient, enabling real-time explainable predictions without retraining the global model. We evaluate APC-GNN++ on a real-world diabetes dataset collected from a regional hospital in Algeria and show that it outperforms traditional machine learning models (MLP, Random Forest, XGBoost) and a vanilla GCN, achieving higher test accuracy and macro F1- score. The analysis of node-level confidence scores further reveals how the model balances self-information and graph-based evidence across different patient groups, providing interpretable patient-centric insights. The system is also embedded in a Tkinter-based graphical user interface (GUI) for interactive use by healthcare professionals .

Title: Research on a hybrid LSTM-CNN-Attention model for text-based web content classification

Authors: Mykola Kuz, Ihor Lazarovych, Mykola Kozlenko, Mykola Pikuliak, Andrii Kvasniuk
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18475
Pdf URL: https://arxiv.org/pdf/2512.18475
Copy Paste: [[2512.18475]] Research on a hybrid LSTM-CNN-Attention model for text-based web content classification(https://arxiv.org/abs/2512.18475)
Keywords: robust, extraction, transformer
Abstract: This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.

Title: QLink: Quantum-Safe Bridge Architecture for Blockchain Interoperability

Authors: Joao Vitor Barros Da Silva, Arsh Gupta, Madhusudan Singh Irish Singh
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.18488
Pdf URL: https://arxiv.org/pdf/2512.18488
Copy Paste: [[2512.18488]] QLink: Quantum-Safe Bridge Architecture for Blockchain Interoperability(https://arxiv.org/abs/2512.18488)
Keywords: secure, security
Abstract: Secure interoperability across heterogeneous blockchains remains one of the most pressing challenges in Web3 with existing bridge protocols vulnerable to both classical exploits and emerging quantum threats. This paper introduces QLink a quantum-safe Layer 3 interoperability protocol that integrates postquantum cryptography (PQC) quantum key distribution (QKD) and hardware security modules (HSMs) into a unified validator architecture. To our knowledge, QLink is the first interoperability framework to combine these mechanisms to secure validator communication proof aggregation and key management. Validators exchange encryption keys through QKD channels, achieving information-theoretic security against interception, while cross-chain proofs are generated and aggregated with NIST-standardized PQC algorithms. Private keys remain sealed inside HSM enclaves mitigating the risk of theft or leakage. Deployed as a dedicated Layer 3 protocol QLink operates independently of Layer 1 and Layer 2 chains providing a scalable decentralized foundation for secure cross-chain messaging and asset transfer. Experimental evaluation using network simulations demonstrates that validator communication overhead remains sub-second while security guarantees extend beyond current bridge architectures to resist both classical and quantum adversaries. By addressing today vulnerabilities and anticipating future quantum threats QLink establishes a practical and future-proof pathway for blockchain interoperability.

Title: Enhancing Decision-Making in Windows PE Malware Classification During Dataset Shifts with Uncertainty Estimation

Authors: Rahul Yumlembam, Biju Issac, Seibu Mary Jacob
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18495
Pdf URL: https://arxiv.org/pdf/2512.18495
Copy Paste: [[2512.18495]] Enhancing Decision-Making in Windows PE Malware Classification During Dataset Shifts with Uncertainty Estimation(https://arxiv.org/abs/2512.18495)
Keywords: security, robust
Abstract: Artificial intelligence techniques have achieved strong performance in classifying Windows Portable Executable (PE) malware, but their reliability often degrades under dataset shifts, leading to misclassifications with severe security consequences. To address this, we enhance an existing LightGBM (LGBM) malware detector by integrating Neural Networks (NN), PriorNet, and Neural Network Ensembles, evaluated across three benchmark datasets: EMBER, BODMAS, and UCSB. The UCSB dataset, composed mainly of packed malware, introduces a substantial distributional shift relative to EMBER and BODMAS, making it a challenging testbed for robustness. We study uncertainty-aware decision strategies, including probability thresholding, PriorNet, ensemble-derived estimates, and Inductive Conformal Evaluation (ICE). Our main contribution is the use of ensemble-based uncertainty estimates as Non-Conformity Measures within ICE, combined with a novel threshold optimisation method. On the UCSB dataset, where the shift is most severe, the state-of-the-art probability-based ICE (SOTA) yields an incorrect acceptance rate (IA%) of 22.8%. In contrast, our method reduces this to 16% a relative reduction of about 30% while maintaining competitive correct acceptance rates (CA%). These results demonstrate that integrating ensemble-based uncertainty with conformal prediction provides a more reliable safeguard against misclassifications under extreme dataset shifts, particularly in the presence of packed malware, thereby offering practical benefits for real-world security operations.

Title: Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models

Authors: Xiaoyang Guo, Keze Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18496
Pdf URL: https://arxiv.org/pdf/2512.18496
Copy Paste: [[2512.18496]] Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models(https://arxiv.org/abs/2512.18496)
Keywords: robust
Abstract: In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational and memory costs. VoCo-LLaMA alleviates this issue by compressing visual patch tokens into a few VoCo tokens, reducing computational overhead while preserving strong cross-modal alignment. Nevertheless, such approaches typically adopt a fixed compression rate, limiting their ability to adapt to varying levels of visual complexity. To address this limitation, we propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. This predictor dynamically selects an optimal compression rate by quantifying an image's visual complexity using statistical cues from the vision encoder, such as patch token entropy and attention map variance. Furthermore, we introduce a joint loss function that integrates rate regularization with complexity alignment. This enables the model to balance inference efficiency with representational capacity, particularly in challenging scenarios. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks, highlighting the potential of adaptive visual compression for creating more efficient and robust VLMs.

Title: PlantDiseaseNet-RT50: A Fine-tuned ResNet50 Architecture for High-Accuracy Plant Disease Detection Beyond Standard CNNs

Authors: Santwana Sagnika, Manav Malhotra, Ishtaj Kaur Deol, Soumyajit Roy, Swarnav Kumar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18500
Pdf URL: https://arxiv.org/pdf/2512.18500
Copy Paste: [[2512.18500]] PlantDiseaseNet-RT50: A Fine-tuned ResNet50 Architecture for High-Accuracy Plant Disease Detection Beyond Standard CNNs(https://arxiv.org/abs/2512.18500)
Keywords: security
Abstract: Plant diseases pose a significant threat to agricultural productivity and global food security, accounting for 70-80% of crop losses worldwide. Traditional detection methods rely heavily on expert visual inspection, which is time-consuming, labour-intensive, and often impractical for large-scale farming operations. In this paper, we present PlantDiseaseNet-RT50, a novel fine-tuned deep learning architecture based on ResNet50 for automated plant disease detection. Our model features strategically unfrozen layers, a custom classification head with regularization mechanisms, and dynamic learning rate scheduling through cosine decay. Using a comprehensive dataset of distinct plant disease categories across multiple crop species, PlantDiseaseNet-RT50 achieves exceptional performance with approximately 98% accuracy, precision, and recall. Our architectural modifications and optimization protocol demonstrate how targeted fine-tuning can transform a standard pretrained model into a specialized agricultural diagnostic tool. We provide a detailed account of our methodology, including the systematic unfreezing of terminal layers, implementation of batch normalization and dropout regularization and application of advanced training techniques. PlantDiseaseNet-RT50 represents a significant advancement in AI-driven agricultural tools, offering a computationally efficient solution for rapid and accurate plant disease diagnosis that can be readily implemented in practical farming contexts to support timely interventions and reduce crop losses.

Title: NASTaR: NovaSAR Automated Ship Target Recognition Dataset

Authors: Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18503
Pdf URL: https://arxiv.org/pdf/2512.18503
Copy Paste: [[2512.18503]] NASTaR: NovaSAR Automated Ship Target Recognition Dataset(https://arxiv.org/abs/2512.18503)
Keywords: robust
Abstract: Synthetic Aperture Radar (SAR) offers a unique capability for all-weather, space-based maritime activity monitoring by capturing and imaging strong reflections from ships at sea. A well-defined challenge in this domain is ship type classification. Due to the high diversity and complexity of ship types, accurate recognition is difficult and typically requires specialized deep learning models. These models, however, depend on large, high-quality ground-truth datasets to achieve robust performance and generalization. Furthermore, the growing variety of SAR satellites operating at different frequencies and spatial resolutions has amplified the need for more annotated datasets to enhance model accuracy. To address this, we present the NovaSAR Automated Ship Target Recognition (NASTaR) dataset. This dataset comprises of 3415 ship patches extracted from NovaSAR S-band imagery, with labels matched to AIS data. It includes distinctive features such as 23 unique classes, inshore/offshore separation, and an auxiliary wake dataset for patches where ship wakes are visible. We validated the dataset applicability across prominent ship-type classification scenarios using benchmark deep learning models. Results demonstrate over 60% accuracy for classifying four major ship types, over 70% for a three-class scenario, more than 75% for distinguishing cargo from tanker ships, and over 87% for identifying fishing vessels. The NASTaR dataset is available at https://https://doi.org/10.5523/bris, while relevant codes for benchmarking and analysis are available at this https URL.

Title: Teaching and Critiquing Conceptualization and Operationalization in NLP

Authors: Vagrant Gautam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18505
Pdf URL: https://arxiv.org/pdf/2512.18505
Copy Paste: [[2512.18505]] Teaching and Critiquing Conceptualization and Operationalization in NLP(https://arxiv.org/abs/2512.18505)
Keywords: interpretability
Abstract: NLP researchers regularly invoke abstract concepts like "interpretability," "bias," "reasoning," and "stereotypes," without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: Datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what should they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.

Title: Feature-Enhanced Graph Neural Networks for Classification of Synthetic Graph Generative Models: A Benchmarking Study

Authors: Janek Dyer, Jagdeep Ahluwalia, Javad Zarrin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18524
Pdf URL: https://arxiv.org/pdf/2512.18524
Copy Paste: [[2512.18524]] Feature-Enhanced Graph Neural Networks for Classification of Synthetic Graph Generative Models: A Benchmarking Study(https://arxiv.org/abs/2512.18524)
Keywords: generative
Abstract: The ability to discriminate between generative graph models is critical to understanding complex structural patterns in both synthetic graphs and the real-world structures that they emulate. While Graph Neural Networks (GNNs) have seen increasing use to great effect in graph classification tasks, few studies explore their integration with interpretable graph theoretic features. This paper investigates the classification of synthetic graph families using a hybrid approach that combines GNNs with engineered graph-theoretic features. We generate a large and structurally diverse synthetic dataset comprising graphs from five representative generative families, Erdos-Renyi, Watts-Strogatz, Barab'asi-Albert, Holme-Kim, and Stochastic Block Model. These graphs range in size up to 1x10^4 nodes, containing up to 1.1x10^5 edges. A comprehensive range of node and graph level features is extracted for each graph and pruned using a Random Forest based feature selection pipeline. The features are integrated into six GNN architectures: GCN, GAT, GATv2, GIN, GraphSAGE and GTN. Each architecture is optimised for hyperparameter selection using Optuna. Finally, models were compared against a baseline Support Vector Machine (SVM) trained solely on the handcrafted features. Our evaluation demonstrates that GraphSAGE and GTN achieve the highest classification performance, with 98.5% accuracy, and strong class separation evidenced by t-SNE and UMAP visualisations. GCN and GIN also performed well, while GAT-based models lagged due to limitations in their ability to capture global structures. The SVM baseline confirmed the importance of the message passing functionality for performance gains and meaningful class separation.

Title: Detection of AI Generated Images Using Combined Uncertainty Measures and Particle Swarm Optimised Rejection Mechanism

Authors: Rahul Yumlembam, Biju Issac, Nauman Aslam, Eaby Kollonoor Babu, Josh Collyer, Fraser Kennedy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18527
Pdf URL: https://arxiv.org/pdf/2512.18527
Copy Paste: [[2512.18527]] Detection of AI Generated Images Using Combined Uncertainty Measures and Particle Swarm Optimised Rejection Mechanism(https://arxiv.org/abs/2512.18527)
Keywords: attack, robust, diffusion
Abstract: As AI-generated images become increasingly photorealistic, distinguishing them from natural images poses a growing challenge. This paper presents a robust detection framework that leverages multiple uncertainty measures to decide whether to trust or reject a model's predictions. We focus on three complementary techniques: Fisher Information, which captures the sensitivity of model parameters to input variations; entropy-based uncertainty from Monte Carlo Dropout, which reflects predictive variability; and predictive variance from a Deep Kernel Learning framework using a Gaussian Process classifier. To integrate these diverse uncertainty signals, Particle Swarm Optimisation is used to learn optimal weightings and determine an adaptive rejection threshold. The model is trained on Stable Diffusion-generated images and evaluated on GLIDE, VQDM, Midjourney, BigGAN, and StyleGAN3, each introducing significant distribution shifts. While standard metrics such as prediction probability and Fisher-based measures perform well in distribution, their effectiveness degrades under shift. In contrast, the Combined Uncertainty measure consistently achieves an incorrect rejection rate of approximately 70 percent on unseen generators, successfully filtering most misclassified AI samples. Although the system occasionally rejects correct predictions from newer generators, this conservative behaviour is acceptable, as rejected samples can support retraining. The framework maintains high acceptance of accurate predictions for natural images and in-domain AI data. Under adversarial attacks using FGSM and PGD, the Combined Uncertainty method rejects around 61 percent of successful attacks, while GP-based uncertainty alone achieves up to 80 percent. Overall, the results demonstrate that multi-source uncertainty fusion provides a resilient and adaptive solution for AI-generated image detection.

Title: WoundNet-Ensemble: A Novel IoMT System Integrating Self-Supervised Deep Learning and Multi-Model Fusion for Automated, High-Accuracy Wound Classification and Healing Progression Monitoring

Authors: Moses Kiprono
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18528
Pdf URL: https://arxiv.org/pdf/2512.18528
Copy Paste: [[2512.18528]] WoundNet-Ensemble: A Novel IoMT System Integrating Self-Supervised Deep Learning and Multi-Model Fusion for Automated, High-Accuracy Wound Classification and Healing Progression Monitoring(https://arxiv.org/abs/2512.18528)
Keywords: robust, transformer
Abstract: Chronic wounds, including diabetic foot ulcers which affect up to one-third of people with diabetes, impose a substantial clinical and economic burden, with U.S. healthcare costs exceeding 25 billion dollars annually. Current wound assessment remains predominantly subjective, leading to inconsistent classification and delayed interventions. We present WoundNet-Ensemble, an Internet of Medical Things system leveraging a novel ensemble of three complementary deep learning architectures: ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer, for automated classification of six clinically distinct wound types. Our system achieves 99.90 percent ensemble accuracy on a comprehensive dataset of 5,175 wound images spanning diabetic foot ulcers, pressure ulcers, venous ulcers, thermal burns, pilonidal sinus wounds, and fungating malignant tumors. The weighted fusion strategy demonstrates a 3.7 percent improvement over previous state-of-the-art methods. Furthermore, we implement a longitudinal wound healing tracker that computes healing rates, severity scores, and generates clinical alerts. This work demonstrates a robust, accurate, and clinically deployable tool for modernizing wound care through artificial intelligence, addressing critical needs in telemedicine and remote patient monitoring. The implementation and trained models will be made publicly available to support reproducibility.

Title: Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset

Authors: S Mahmudul Hasan, Shaily Roy, Akib Jawad Nafis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18533
Pdf URL: https://arxiv.org/pdf/2512.18533
Copy Paste: [[2512.18533]] Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset(https://arxiv.org/abs/2512.18533)
Keywords: transformer
Abstract: The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.

Title: SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models

Authors: Scott Thornton
Subjects: cs.CR, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18542
Pdf URL: https://arxiv.org/pdf/2512.18542
Copy Paste: [[2512.18542]] SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models(https://arxiv.org/abs/2512.18542)
Keywords: secure, security, defense, attack
Abstract: AI assistants produce vulnerable code in 45% of security-relevant scenarios, introducing flaws into production systems at scale. Yet existing secure coding datasets fall short. They lack incident grounding, don't provide the scale modern training requires, and miss the operational security context developers need for production deployments. We present SecureCode v2.0, a production-grade dataset of 1,215 security-focused coding examples that passed structural validation and expert security review. Every example ties to actual documented security incidents with CVE references, provides vulnerable and secure implementations, demonstrates concrete attacks, and includes defense-in-depth operational guidance. The dataset covers 11 vulnerability categories (complete OWASP Top 10:2025 plus AI/ML Security Threats) across 11 languages (Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, and YAML for infrastructure-as-code). Our quality assurance framework ensures complete incident grounding. Each example includes SIEM integration strategies, infrastructure hardening recommendations (Docker, AppArmor, WAF configurations), and testing approaches using language-appropriate frameworks. The dataset uses a 4-turn conversational structure mirroring actual developer-AI interactions, escalating from basic implementations to advanced security considerations and defense-in-depth guidance. Our contributions: (1) 1,215 rigorously validated examples split into 989 training, 122 validation, and 104 test sets, (2) an automated validation framework ensuring dataset consistency, (3) a 4-turn conversational structure capturing realistic security workflows, (4) comprehensive operational security guidance with SIEM integration strategies, (5) complete language-specific implementation fidelity, and (6) open-source release of data, validation tools, and benchmarking protocols.

Title: LLMs on Drugs: Language Models Are Few-Shot Consumers

Authors: Alexander Doudkin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18546
Pdf URL: https://arxiv.org/pdf/2512.18546
Copy Paste: [[2512.18546]] LLMs on Drugs: Language Models Are Few-Shot Consumers(https://arxiv.org/abs/2512.18546)
Keywords: large language model
Abstract: Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level "drug" interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts -- LSD, cocaine, alcohol, and cannabis -- are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated "Answer: " template. Persona text therefore behaves like a "few-shot consumable" that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at this https URL.

Title: Enhancing Medical Large Vision-Language Models via Alignment Distillation

Authors: Aofei Chang, Ting Wang, Fenglong Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18554
Pdf URL: https://arxiv.org/pdf/2512.18554
Copy Paste: [[2512.18554]] Enhancing Medical Large Vision-Language Models via Alignment Distillation(https://arxiv.org/abs/2512.18554)
Keywords: interpretability
Abstract: Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.

Title: Proof of Authenticity of General IoT Information with Tamper-Evident Sensors and Blockchain

Authors: Kenji Saito
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2512.18560
Pdf URL: https://arxiv.org/pdf/2512.18560
Copy Paste: [[2512.18560]] Proof of Authenticity of General IoT Information with Tamper-Evident Sensors and Blockchain(https://arxiv.org/abs/2512.18560)
Keywords: secure
Abstract: Sensor data in IoT (Internet of Things) systems is vulnerable to tampering or falsification when transmitted through untrusted services. This is critical because such data increasingly underpins real-world decisions in domains such as logistics, healthcare, and other critical infrastructure. We propose a general method for secure sensor-data logging in which tamper-evident devices periodically sign readouts, link data using redundant hash chains, and submit cryptographic evidence to a blockchain-based service via Merkle trees to ensure verifiability even under data loss. Our approach enables reliable and cost-effective validation of sensor data across diverse IoT systems, including disaster response and other humanitarian applications, without relying on the integrity of intermediate systems.

Title: OpenView: Empowering MLLMs with Out-of-view VQA

Authors: Qixiang Chen, Cheng Zhang, Chi-Wing Fu, Jingwen Ye, Jianfei Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18563
Pdf URL: https://arxiv.org/pdf/2512.18563
Copy Paste: [[2512.18563]] OpenView: Empowering MLLMs with Out-of-view VQA(https://arxiv.org/abs/2512.18563)
Keywords: large language model
Abstract: Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well, mainly on reasoning in-view contents within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline to massively generate multi-choice VQA by leveraging panoramic imagery to enable context-rich and spatial-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs upon supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that despite having a large gap from human performance in OOV VQA answer selection, upon empowered by OpenView, multiple MLLMs can consistently boost their performance, uplifted from 48.6% to 64.1% on average. Code, benchmark, and data will be available at this https URL.

Title: Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment

Authors: Ruiqi Chen (1), Giacomo Vedovati (2), Todd Braver (3), ShiNung Ching (2) ((1) Division of Biology and Biomedical Sciences, Washington University in St. Louis, (2) Department of Electrical and Systems Engineering, Washington University in St. Louis, (3) Department of Psychological and Brain Sciences, Washington University in St. Louis)
Subjects: cs.LG, eess.SY, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.18566
Pdf URL: https://arxiv.org/pdf/2512.18566
Copy Paste: [[2512.18566]] Comparing Dynamical Models Through Diffeomorphic Vector Field Alignment(https://arxiv.org/abs/2512.18566)
Keywords: generative
Abstract: Dynamical systems models such as recurrent neural networks (RNNs) are increasingly popular in theoretical neuroscience for hypothesis-generation and data analysis. Evaluating the dynamics in such models is key to understanding their learned generative mechanisms. However, such evaluation is impeded by two major challenges: First, comparison of learned dynamics across models is difficult because there is no enforced equivalence of their coordinate systems. Second, identification of mechanistically important low-dimensional motifs (e.g., limit sets) is intractable in high-dimensional nonlinear models such as RNNs. Here, we propose a comprehensive framework to address these two issues, termed Diffeomorphic vector field alignment FOR learned Models (DFORM). DFORM learns a nonlinear coordinate transformation between the state spaces of two dynamical systems, which aligns their trajectories in a maximally one-to-one manner. In so doing, DFORM enables an assessment of whether two models exhibit topological equivalence, i.e., similar mechanisms despite differences in coordinate systems. A byproduct of this method is a means to locate dynamical motifs on low-dimensional manifolds embedded within higher-dimensional systems. We verified DFORM's ability to identify linear and nonlinear coordinate transformations using canonical topologically equivalent systems, RNNs, and systems related by nonlinear flows. DFORM was also shown to provide a quantification of similarity between topologically distinct systems. We then demonstrated that DFORM can locate important dynamical motifs including invariant manifolds and saddle limit sets within high-dimensional models. Finally, using a set of RNN models trained on human functional MRI (fMRI) recordings, we illustrated that DFORM can identify limit cycles from high-dimensional data-driven models, which agreed well with prior numerical analysis.

Title: Placenta Accreta Spectrum Detection Using an MRI-based Hybrid CNN-Transformer Model

Authors: Sumaiya Ali, Areej Alhothali, Ohoud Alzamzami, Sameera Albasri, Ahmed Abduljabbar, Muhammad Alwazzan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18573
Pdf URL: https://arxiv.org/pdf/2512.18573
Copy Paste: [[2512.18573]] Placenta Accreta Spectrum Detection Using an MRI-based Hybrid CNN-Transformer Model(https://arxiv.org/abs/2512.18573)
Keywords: robust, transformer
Abstract: Placenta Accreta Spectrum (PAS) is a serious obstetric condition that can be challenging to diagnose with Magnetic Resonance Imaging (MRI) due to variability in radiologists' interpretations. To overcome this challenge, a hybrid 3D deep learning model for automated PAS detection from volumetric MRI scans is proposed in this study. The model integrates a 3D DenseNet121 to capture local features and a 3D Vision Transformer (ViT) to model global spatial context. It was developed and evaluated on a retrospective dataset of 1,133 MRI volumes. Multiple 3D deep learning architectures were also evaluated for comparison. On an independent test set, the DenseNet121-ViT model achieved the highest performance with a five-run average accuracy of 84.3%. These results highlight the strength of hybrid CNN-Transformer models as a computer-aided diagnosis tool. The model's performance demonstrates a clear potential to assist radiologists by providing a robust decision support to improve diagnostic consistency across interpretations, and ultimately enhance the accuracy and timeliness of PAS diagnosis.

Title: SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models

Authors: Pengcheng Li, Qiang Fang, Tong Zhao, Yixing Lan, Xin Xu
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.18583
Pdf URL: https://arxiv.org/pdf/2512.18583
Copy Paste: [[2512.18583]] SD2AIL: Adversarial Imitation Learning from Synthetic Demonstrations via Diffusion Models(https://arxiv.org/abs/2512.18583)
Keywords: robust, diffusion
Abstract: Adversarial Imitation Learning (AIL) is a dominant framework in imitation learning that infers rewards from expert demonstrations to guide policy optimization. Although providing more expert demonstrations typically leads to improved performance and greater stability, collecting such demonstrations can be challenging in certain scenarios. Inspired by the success of diffusion models in data generation, we propose SD2AIL, which utilizes synthetic demonstrations via diffusion models. We first employ a diffusion model in the discriminator to generate synthetic demonstrations as pseudo-expert data that augment the expert demonstrations. To selectively replay the most valuable demonstrations from the large pool of (pseudo-) expert demonstrations, we further introduce a prioritized expert demonstration replay strategy (PEDR). The experimental results on simulation tasks demonstrate the effectiveness and robustness of our method. In particular, in the Hopper task, our method achieves an average return of 3441, surpassing the state-of-the-art method by 89. Our code will be available at this https URL.

Title: DNA-HHE: Dual-mode Near-network Accelerator for Hybrid Homomorphic Encryption on the Edge

Authors: Yifan Zhao, Xinglong Yu, Yi Sun, Honglin Kuang, Jun Han
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2512.18589
Pdf URL: https://arxiv.org/pdf/2512.18589
Copy Paste: [[2512.18589]] DNA-HHE: Dual-mode Near-network Accelerator for Hybrid Homomorphic Encryption on the Edge(https://arxiv.org/abs/2512.18589)
Keywords: privacy
Abstract: Fully homomorphic encryption (FHE) schemes like RNS-CKKS enable privacy-preserving outsourced computation (PPOC) but suffer from high computational latency and ciphertext expansion, especially on the resource-constrained edge side. Hybrid Homomorphic Encryption (HHE) mitigates these issues on the edge side by replacing HE with lightweight symmetric encryption for plaintext encryption, such as the Rubato cipher for the HHE variant of RNS-CKKS, yet it introduces transciphering overhead on the cloud. The respective strengths and limitations of FHE and HHE call for a dual-mode HHE solution with flexible algorithm switching ability. This paper presents DNA-HHE, the first dual-mode HHE accelerator with near-network coupling for edge devices. DNA-HHE supports both edge-side RNS-CKKS and Rubato within a unified architecture driven by flexible custom instructions. To realize a compact implementation for the edge side, we propose a DSP-efficient modular reduction design, a compact multi-field-adaptive butterfly unit, and parallel scheduling schemes of Rubato with a high degree of resource sharing. DNA-HHE is designed with network protocol packaging and transmission capacities and directly coupled to the network interface controller, achieving reduced overall latency of edge-side PPOC by 1.09$\times$ to 1.56$\times$. Our evaluations on the ASIC and FPGA platforms demonstrate that DNA-HHE outperforms the state-of-the-art single-mode designs in both edge-side RNS-CKKS and symmetric cipher with better computation latency and area efficiency, while offering dual-mode functionality.

Title: From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation

Authors: Amit Barman, Atanu Mandal, Sudip Kumar Naskar
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2512.18593
Pdf URL: https://arxiv.org/pdf/2512.18593
Copy Paste: [[2512.18593]] From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation(https://arxiv.org/abs/2512.18593)
Keywords: transformer
Abstract: In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with 2 complementary strategies, fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.

Title: Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows

Authors: Runze Mao, Rui Zhang, Xuan Bai, Tianhao Wu, Teng Zhang, Zhenyi Chen, Minqi Lin, Bocheng Zeng, Yangchen Xu, Yingxuan Xiang, Haoze Zhang, Shubham Goswami, Pierre A. Dawe, Yifan Xu, Zhenhua An, Mengtao Yan, Xiaoyi Lu, Yi Wang, Rongbo Bai, Haobu Gao, Xiaohang Fang, Han Li, Hao Sun, Zhi X. Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18595
Pdf URL: https://arxiv.org/pdf/2512.18595
Copy Paste: [[2512.18595]] Benchmarking neural surrogates on realistic spatiotemporal multiphysics flows(https://arxiv.org/abs/2512.18595)
Keywords: robust, transformer
Abstract: Predicting multiphysics dynamics is computationally expensive and challenging due to the severe coupling of multi-scale, heterogeneous physical processes. While neural surrogates promise a paradigm shift, the field currently suffers from an "illusion of mastery", as repeatedly emphasized in top-tier commentaries: existing evaluations overly rely on simplified, low-dimensional proxies, which fail to expose the models' inherent fragility in realistic regimes. To bridge this critical gap, we present REALM (REalistic AI Learning for Multiphysics), a rigorous benchmarking framework designed to test neural surrogates on challenging, application-driven reactive flows. REALM features 11 high-fidelity datasets spanning from canonical multiphysics problems to complex propulsion and fire safety scenarios, alongside a standardized end-to-end training and evaluation protocol that incorporates multiphysics-aware preprocessing and a robust rollout strategy. Using this framework, we systematically benchmark over a dozen representative surrogate model families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and identify three robust trends: (i) a scaling barrier governed jointly by dimensionality, stiffness, and mesh irregularity, leading to rapidly growing rollout errors; (ii) performance primarily controlled by architectural inductive biases rather than parameter count; and (iii) a persistent gap between nominal accuracy metrics and physically trustworthy behavior, where models with high correlations still miss key transient structures and integral quantities. Taken together, REALM exposes the limits of current neural surrogates on realistic multiphysics flows and offers a rigorous testbed to drive the development of next-generation physics-aware architectures.

Title: Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach

Authors: Zhe Li, Kun Cheng, Hanyue Mo, Jintao Lu, Ziwen Kuang, Jianwen Ye, Lixu Xu, Xinya Meng, Jiahui Zhao, Shengda Ji, Shuyuan Liu, Mengyu Wang
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.18597
Pdf URL: https://arxiv.org/pdf/2512.18597
Copy Paste: [[2512.18597]] Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach(https://arxiv.org/abs/2512.18597)
Keywords: robust, extraction
Abstract: A vision-based trajectory analysis solution is proposed to address the "zero-speed braking" issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm utilizes the NVIDIA Jetson AGX Xavier platform to process sequential video frames from a blind spot camera, employing self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE)-enhanced Scale-Invariant Feature Transform (SIFT) feature extraction and K-Nearest Neighbors (KNN)-Random Sample Consensus (RANSAC) matching. This allows for precise classification of the vehicle's motion state (static, vibration, moving). Key innovations include 1) multiframe trajectory displacement statistics (5-frame sliding window), 2) a dual-threshold state decision matrix, and 3) OBD-II driven dynamic Region of Interest (ROI) configuration. The system effectively suppresses environmental interference and false detection of dynamic objects, directly addressing the challenge of low-speed false activation in commercial vehicle safety systems. Evaluation in a real-world dataset (32,454 video segments from 1,852 vehicles) demonstrates an F1-score of 99.96% for static detection, 97.78% for moving state recognition, and a processing delay of 14.2 milliseconds (resolution 704x576). The deployment on-site shows an 89% reduction in false braking events, a 100% success rate in emergency braking, and a fault rate below 5%.

Title: SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback

Authors: Jianglin Lu, Yuanwei Wu, Ziyi Zhao, Hongcheng Wang, Felix Jimenez, Abrar Majeedi, Yun Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18599
Pdf URL: https://arxiv.org/pdf/2512.18599
Copy Paste: [[2512.18599]] SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback(https://arxiv.org/abs/2512.18599)
Keywords: large language model
Abstract: Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns an lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluator and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plans without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.

Title: The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation

Authors: Huiqi Deng, Qihan Ren, Zhuofan Chen, Zhenyuan Cui, Wen Shen, Peng Zhang, Hongbin Pei, Quanshi Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18607
Pdf URL: https://arxiv.org/pdf/2512.18607
Copy Paste: [[2512.18607]] The Interaction Bottleneck of Deep Neural Networks: Discovery, Proof, and Modulation(https://arxiv.org/abs/2512.18607)
Keywords: robust
Abstract: Understanding what kinds of cooperative structures deep neural networks (DNNs) can represent remains a fundamental yet insufficiently understood problem. In this work, we treat interactions as the fundamental units of such structure and investigate a largely unexplored question: how DNNs encode interactions under different levels of contextual complexity, and how these microscopic interaction patterns shape macroscopic representation capacity. To quantify this complexity, we use multi-order interactions [57], where each order reflects the amount of contextual information required to evaluate the joint interaction utility of a variable pair. This formulation enables a stratified analysis of cooperative patterns learned by DNNs. Building on this formulation, we develop a comprehensive study of interaction structure in DNNs. (i) We empirically discover a universal interaction bottleneck: across architectures and tasks, DNNs easily learn low-order and high-order interactions but consistently under-represent mid-order ones. (ii) We theoretically explain this bottleneck by proving that mid-order interactions incur the highest contextual variability, yielding large gradient variance and making them intrinsically difficult to learn. (iii) We further modulate the bottleneck by introducing losses that steer models toward emphasizing interactions of selected orders. Finally, we connect microscopic interaction structures with macroscopic representational behavior: low-order-emphasized models exhibit stronger generalization and robustness, whereas high-order-emphasized models demonstrate greater structural modeling and fitting capability. Together, these results uncover an inherent representational bias in modern DNNs and establish interaction order as a powerful lens for interpreting and guiding deep representations.

Title: A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts

Authors: Prabigya Acharya, Liza Shrestha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18608
Pdf URL: https://arxiv.org/pdf/2512.18608
Copy Paste: [[2512.18608]] A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts(https://arxiv.org/abs/2512.18608)
Keywords: privacy, robust, large language model
Abstract: Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.

Title: Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments

Authors: Saeideh Yousefzadeh, Hamidreza Pourreza
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18613
Pdf URL: https://arxiv.org/pdf/2512.18613
Copy Paste: [[2512.18613]] Text2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments(https://arxiv.org/abs/2512.18613)
Keywords: robust
Abstract: Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison, and -- critically -- produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.

Title: LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction

Authors: Jensen Zhang, Ningyuan Liu, Yijia Fan, Zihao Huang, Qinglin Zeng, Kaitong Cai, Jian Wang, Keze Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18623
Pdf URL: https://arxiv.org/pdf/2512.18623
Copy Paste: [[2512.18623]] LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction(https://arxiv.org/abs/2512.18623)
Keywords: large language model
Abstract: Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting. We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification. Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.

Title: Multi-user Pufferfish Privacy

Authors: Ni Ding, Songpei Lu, Wenjing Yang, Zijian Zhang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.18632
Pdf URL: https://arxiv.org/pdf/2512.18632
Copy Paste: [[2512.18632]] Multi-user Pufferfish Privacy(https://arxiv.org/abs/2512.18632)
Keywords: privacy
Abstract: This paper studies how to achieve individual indistinguishability by pufferfish privacy in aggregated query to a multi-user system. It is assumed that each user reports realization of a random variable. We study how to calibrate Laplace noise, added to the query answer, to attain pufferfish privacy when user changes his/her reported data value, leaves the system and is replaced by another use with different randomness. Sufficient conditions are derived for all scenarios for attaining statistical indistinguishability on four sets of secret pairs. They are derived using the existing Kantorovich method (Wasserstain metric of order $1$). These results can be applied to attain indistinguishability when a certain class of users is added or removed from a tabular data. It is revealed that attaining indifference in individual's data is conditioned on the statistics of this user only. For binary (Bernoulli distributed) random variables, the derived sufficient conditions can be further relaxed to reduce the noise and improve data utility.

Title: From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers

Authors: Ryotaro Kawata, Yujin Song, Alberto Bietti, Naoki Nishikawa, Taiji Suzuki, Samuel Vaiter, Denny Wu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.18634
Pdf URL: https://arxiv.org/pdf/2512.18634
Copy Paste: [[2512.18634]] From Shortcut to Induction Head: How Data Diversity Shapes Algorithm Selection in Transformers(https://arxiv.org/abs/2512.18634)
Keywords: robust, transformer
Abstract: Transformers can implement both generalizable algorithms (e.g., induction heads) and simple positional shortcuts (e.g., memorizing fixed output positions). In this work, we study how the choice of pretraining data distribution steers a shallow transformer toward one behavior or the other. Focusing on a minimal trigger-output prediction task -- copying the token immediately following a special trigger upon its second occurrence -- we present a rigorous analysis of gradient-based training of a single-layer transformer. In both the infinite and finite sample regimes, we prove a transition in the learned mechanism: if input sequences exhibit sufficient diversity, measured by a low ``max-sum'' ratio of trigger-to-trigger distances, the trained model implements an induction head and generalizes to unseen contexts; by contrast, when this ratio is large, the model resorts to a positional shortcut and fails to generalize out-of-distribution (OOD). We also reveal a trade-off between the pretraining context length and OOD generalization, and derive the optimal pretraining distribution that minimizes computational cost per sample. Finally, we validate our theoretical predictions with controlled synthetic experiments, demonstrating that broadening context distributions robustly induces induction heads and enables OOD generalization. Our results shed light on the algorithmic biases of pretrained transformers and offer conceptual guidelines for data-driven control of their learned behaviors.

Title: Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers

Authors: Xiyue Bai, Ronghao Yu, Jia Xiu, Pengfei Zhou, Jie Xia, Peng Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18635
Pdf URL: https://arxiv.org/pdf/2512.18635
Copy Paste: [[2512.18635]] Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers(https://arxiv.org/abs/2512.18635)
Keywords: diffusion, transformer
Abstract: Generating or editing images directly from Neural signals has immense potential at the intersection of neuroscience, vision, and Brain-computer interaction. In this paper, We present Uni-Neur2Img, a unified framework for neural signal-driven image generation and editing. The framework introduces a parameter-efficient LoRA-based neural signal injection module that independently processes each conditioning signal as a pluggable component, facilitating flexible multi-modal conditioning without altering base model parameters. Additionally, we employ a causal attention mechanism accommodate the long-sequence modeling demands of conditional generation tasks. Existing neural-driven generation research predominantly focuses on textual modalities as conditions or intermediate representations, resulting in limited exploration of visual modalities as direct conditioning signals. To bridge this research gap, we introduce the EEG-Style dataset. We conduct comprehensive evaluations across public benchmarks and self-collected neural signal datasets: (1) EEG-driven image generation on the public CVPR40 dataset; (2) neural signal-guided image editing on the public Loongx dataset for semantic-aware local modifications; and (3) EEG-driven style transfer on our self-collected EEG-Style dataset. Extensive experimental results demonstrate significant improvements in generation fidelity, editing consistency, and style transfer quality while maintaining low computational overhead and strong scalability to additional modalities. Thus, Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation.

Title: Volley Revolver: A Novel Matrix-Encoding Method for Privacy-Preserving Deep Learning (Inference++)

Authors: John Chiang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.18646
Pdf URL: https://arxiv.org/pdf/2512.18646
Copy Paste: [[2512.18646]] Volley Revolver: A Novel Matrix-Encoding Method for Privacy-Preserving Deep Learning (Inference++)(https://arxiv.org/abs/2512.18646)
Keywords: secure, privacy
Abstract: Privacy-preserving inference of convolutional neural networks (CNNs) using homomorphic encryption has emerged as a promising approach for enabling secure machine learning in untrusted environments. In our previous work, we introduced a matrix-encoding strategy that allows convolution and matrix multiplication to be efficiently evaluated over encrypted data, enabling practical CNN inference without revealing either the input data or the model parameters. The core idea behind this strategy is to construct a three-dimensional representation within ciphertexts that preserves the intrinsic spatial structure of both input image data and model weights, rather than flattening them into conventional two-dimensional encodings. However, this approach can operate efficiently $only$ when the number of available plaintext slots within a ciphertext is sufficient to accommodate an entire input image, which becomes a critical bottleneck when processing high-resolution images. In this paper, we address this fundamental limitation by proposing an improved encoding and computation framework that removes the requirement that a single encrypted ciphertext must fully contain one input image. Our method reformulates the data layout and homomorphic operations to partition high-resolution inputs across multiple ciphertexts while preserving the algebraic structure required for efficient convolution and matrix multiplication. As a result, our approach enables privacy-preserving CNN inference to scale naturally beyond the slot-capacity constraints of prior methods, making homomorphic evaluation of CNNs practical for higher-resolution and more complex datasets.

Title: Adversarial Robustness in Zero-Shot Learning:An Empirical Study on Class and Concept-Level Vulnerabilities

Authors: Zhiyuan Peng, Zihan Ye, Shreyank N Gowda, Yuping Yan, Haotian Xu, Ling Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18651
Pdf URL: https://arxiv.org/pdf/2512.18651
Copy Paste: [[2512.18651]] Adversarial Robustness in Zero-Shot Learning:An Empirical Study on Class and Concept-Level Vulnerabilities(https://arxiv.org/abs/2512.18651)
Keywords: attack, robust, interpretability
Abstract: Zero-shot Learning (ZSL) aims to enable image classifiers to recognize images from unseen classes that were not included during training. Unlike traditional supervised classification, ZSL typically relies on learning a mapping from visual features to predefined, human-understandable class concepts. While ZSL models promise to improve generalization and interpretability, their robustness under systematic input perturbations remain unclear. In this study, we present an empirical analysis about the robustness of existing ZSL methods at both classlevel and concept-level. Specifically, we successfully disrupted their class prediction by the well-known non-target class attack (clsA). However, in the Generalized Zero-shot Learning (GZSL) setting, we observe that the success of clsA is only at the original best-calibrated point. After the attack, the optimal bestcalibration point shifts, and ZSL models maintain relatively strong performance at other calibration points, indicating that clsA results in a spurious attack success in the GZSL. To address this, we propose the Class-Bias Enhanced Attack (CBEA), which completely eliminates GZSL accuracy across all calibrated points by enhancing the gap between seen and unseen class this http URL, at concept-level attack, we introduce two novel attack modes: Class-Preserving Concept Attack (CPconA) and NonClass-Preserving Concept Attack (NCPconA). Our extensive experiments evaluate three typical ZSL models across various architectures from the past three years and reveal that ZSL models are vulnerable not only to the traditional class attack but also to concept-based attacks. These attacks allow malicious actors to easily manipulate class predictions by erasing or introducing concepts. Our findings highlight a significant performance gap between existing approaches, emphasizing the need for improved adversarial robustness in current ZSL models.

Title: Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital

Authors: Pierre Colombo, Malik Boudiaf, Allyn Sweet, Michael Desa, Hongxi Wang, Kevin Candra, Syméon del Marmol
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18658
Pdf URL: https://arxiv.org/pdf/2512.18658
Copy Paste: [[2512.18658]] Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital(https://arxiv.org/abs/2512.18658)
Keywords: security
Abstract: Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation-and more broadly as a foundation for applied legal intelligence.

Title: PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval

Authors: Pengxiang Ouyang, Qing Ma, Zheng Wang, Cong Bai
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2512.18660
Pdf URL: https://arxiv.org/pdf/2512.18660
Copy Paste: [[2512.18660]] PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval(https://arxiv.org/abs/2512.18660)
Keywords: robust
Abstract: Remote sensing (RS) image-text retrieval faces significant challenges in real-world datasets due to the presence of Pseudo-Matched Pairs (PMPs), semantically mismatched or weakly aligned image-text pairs, which hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets, i.e., RSICD, RSITMD, and RS5M, demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval tasks.

Title: SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse

Authors: Yiming Sun, Mi Zhang, Feifei Li, Geng Hong, Min Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18671
Pdf URL: https://arxiv.org/pdf/2512.18671
Copy Paste: [[2512.18671]] SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse(https://arxiv.org/abs/2512.18671)
Keywords: large language model
Abstract: Despite Video Large Language Models having rapidly advanced in recent years, perceptual hallucinations pose a substantial safety risk, which severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.

Title: AsyncDiff: Asynchronous Timestep Conditioning for Enhanced Text-to-Image Diffusion Inference

Authors: Longhuan Xu, Feng Yin, Cunjian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18675
Pdf URL: https://arxiv.org/pdf/2512.18675
Copy Paste: [[2512.18675]] AsyncDiff: Asynchronous Timestep Conditioning for Enhanced Text-to-Image Diffusion Inference(https://arxiv.org/abs/2512.18675)
Keywords: diffusion
Abstract: Text-to-image diffusion inference typically follows synchronized schedules, where the numerical integrator advances the latent state to the same timestep at which the denoiser is conditioned. We propose an asynchronous inference mechanism that decouples these two, allowing the denoiser to be conditioned at a different, learned timestep while keeping image update schedule unchanged. A lightweight timestep prediction module (TPM), trained with Group Relative Policy Optimization (GRPO), selects a more feasible conditioning timestep based on the current state, effectively choosing a desired noise level to control image detail and textural richness. At deployment, a scaling hyper-parameter can be used to interpolate between the original and de-synchronized timesteps, enabling conservative or aggressive adjustments. To keep the study computationally affordable, we cap the inference at 15 steps for SD3.5 and 10 steps for Flux. Evaluated on Stable Diffusion 3.5 Medium and Flux.1-dev across MS-COCO 2014 and T2I-CompBench datasets, our method optimizes a composite reward that averages Image Reward, HPSv2, CLIP Score and Pick Score, and shows consistent improvement.

Title: brat: Aligned Multi-View Embeddings for Brain MRI Analysis

Authors: Maxime Kayser, Maksim Gridnev, Wanting Wang, Max Bain, Aneesh Rangnekar, Avijit Chatterjee, Aleksandr Petrov, Harini Veeraraghavan, Nathaniel C. Swinburne
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.18679
Pdf URL: https://arxiv.org/pdf/2512.18679
Copy Paste: [[2512.18679]] brat: Aligned Multi-View Embeddings for Brain MRI Analysis(https://arxiv.org/abs/2512.18679)
Keywords: transformer
Abstract: We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.

Title: Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design

Authors: Yuchen Li, Handing Wang, Bing Xue, Mengjie Zhang, Yaochu Jin
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2512.18682
Pdf URL: https://arxiv.org/pdf/2512.18682
Copy Paste: [[2512.18682]] Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design(https://arxiv.org/abs/2512.18682)
Keywords: large language model
Abstract: In the high-cost simulation-driven design domain, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. This process is time-consuming and heavily reliant on expert knowledge. While large language models (LLMs) offer potential for automating this task, existing approaches either suffer from poor formalization that fails to accurately align with the design intent or rely on solver feedback for data filtering, which is unavailable due to the high simulation costs. To address this challenge, we propose APF, a framework for solver-independent, automated problem formulation via LLMs designed to automatically convert engineers' natural language requirements into executable optimization models. The core of this framework is an innovative pipeline for automatically generating high-quality data, which overcomes the difficulty of constructing suitable fine-tuning datasets in the absence of high-cost solver feedback with the help of data generation and test instance annotation. The generated high-quality dataset is used to perform supervised fine-tuning on LLMs, significantly enhancing their ability to generate accurate and executable optimization problem formulations. Experimental results on antenna design demonstrate that APF significantly outperforms the existing methods in both the accuracy of requirement formalization and the quality of resulting radiation efficiency curves in meeting the design goals.

Title: A Study of Finetuning Video Transformers for Multi-view Geometry Tasks

Authors: Huimin Wu, Kwang-Ting Cheng, Stephen Lin, Zhirong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18684
Pdf URL: https://arxiv.org/pdf/2512.18684
Copy Paste: [[2512.18684]] A Study of Finetuning Video Transformers for Multi-view Geometry Tasks(https://arxiv.org/abs/2512.18684)
Keywords: transformer
Abstract: This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to stateof-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79, 1.88, and F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.

Title: Fusion of Multiscale Features Via Centralized Sparse-attention Network for EEG Decoding

Authors: Xiangrui Cai, Shaocheng Ma, Lei Cao, Jie Li, Tianyu Liu, Yilin Dong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18689
Pdf URL: https://arxiv.org/pdf/2512.18689
Copy Paste: [[2512.18689]] Fusion of Multiscale Features Via Centralized Sparse-attention Network for EEG Decoding(https://arxiv.org/abs/2512.18689)
Keywords: robust, extraction, interpretability
Abstract: Electroencephalography (EEG) signal decoding is a key technology that translates brain activity into executable commands, laying the foundation for direct brain-machine interfacing and intelligent interaction. To address the inherent spatiotemporal heterogeneity of EEG signals, this paper proposes a multi-branch parallel architecture, where each temporal scale is equipped with an independent spatial feature extraction module. To further enhance multi-branch feature fusion, we propose a Fusion of Multiscale Features via Centralized Sparse-attention Network (EEG-CSANet), a centralized sparse-attention network. It employs a main-auxiliary branch architecture, where the main branch models core spatiotemporal patterns via multiscale self-attention, and the auxiliary branch facilitates efficient local interactions through sparse cross-attention. Experimental results show that EEG-CSANet achieves state-of-the-art (SOTA) performance across five public datasets (BCIC-IV-2A, BCIC-IV-2B, HGD, SEED, and SEED-VIG), with accuracies of 88.54%, 91.09%, 99.43%, 96.03%, and 90.56%, respectively. Such performance demonstrates its strong adaptability and robustness across various EEG decoding tasks. Moreover, extensive ablation studies are conducted to enhance the interpretability of EEG-CSANet. In the future, we hope that EEG-CSANet could serve as a promising baseline model in the field of EEG signal decoding. The source code is publicly available at: this https URL

Title: EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images

Authors: Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, Munchurl Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18692
Pdf URL: https://arxiv.org/pdf/2512.18692
Copy Paste: [[2512.18692]] EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images(https://arxiv.org/abs/2512.18692)
Keywords: robust
Abstract: Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT) where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF) stage where our model learns rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks.

Title: Generating Risky Samples with Conformity Constraints via Diffusion Models

Authors: Han Yu, Hao Zou, Xingxuan Zhang, Zhengyi Wang, Yue He, Kehan Li, Peng Cui
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18722
Pdf URL: https://arxiv.org/pdf/2512.18722
Copy Paste: [[2512.18722]] Generating Risky Samples with Conformity Constraints via Diffusion Models(https://arxiv.org/abs/2512.18722)
Keywords: diffusion
Abstract: Although neural networks achieve promising performance in many tasks, they may still fail when encountering some examples and bring about risks to applications. To discover risky samples, previous literature attempts to search for patterns of risky samples within existing datasets or inject perturbation into them. Yet in this way the diversity of risky samples is limited by the coverage of existing datasets. To overcome this limitation, recent works adopt diffusion models to produce new risky samples beyond the coverage of existing datasets. However, these methods struggle in the conformity between generated samples and expected categories, which could introduce label noise and severely limit their effectiveness in applications. To address this issue, we propose RiskyDiff that incorporates the embeddings of both texts and images as implicit constraints of category conformity. We also design a conformity score to further explicitly strengthen the category conformity, as well as introduce the mechanisms of embedding screening and risky gradient guidance to boost the risk of generated samples. Extensive experiments reveal that RiskyDiff greatly outperforms existing methods in terms of the degree of risk, generation quality, and conformity with conditioned categories. We also empirically show the generalization ability of the models can be enhanced by augmenting training data with generated samples of high conformity.

Title: A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models

Authors: Zhiquan Tan, Yinrong Hong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18730
Pdf URL: https://arxiv.org/pdf/2512.18730
Copy Paste: [[2512.18730]] A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models(https://arxiv.org/abs/2512.18730)
Keywords: large language model
Abstract: Large language models (LLMs) trained via KL-regularized reinforcement learning demonstrate strong instruction following, self-correction, and reasoning abilities. Yet their theoretical underpinnings remain limited. We exploit the closed-form energy-based model (EBM) structure of the optimal KL-regularized policy to provide a unified variational analysis of LLMs. For instruction-tuned models, under natural assumptions on reward potentials and pretraining symmetry, we prove that the transition kernel satisfies detailed balance with respect to a scalar potential encoding response quality. This yields monotonic KL convergence to a high-quality stationary distribution, bounded hitting times to superior states, and exponential mixing governed by the spectral gap. For reasoning models trained with verifiable rewards (RLVR), we show the objective is equivalent to expected KL minimization toward an optimal reasoning distribution, with the suboptimality gap reducing to the Bernoulli KL between target and current accuracies along the natural gradient flow. This helps explain empirical entropy-accuracy trade-offs.

Title: Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection

Authors: Junjun Pan, Yixin Liu, Rui Miao, Kaize Ding, Yu Zheng, Quoc Viet Hung Nguyen, Alan Wee-Chung Liew, Shirui Pan
Subjects: cs.CR, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2512.18733
Pdf URL: https://arxiv.org/pdf/2512.18733
Copy Paste: [[2512.18733]] Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection(https://arxiv.org/abs/2512.18733)
Keywords: security, defense, attack, robust, interpretability, large language model
Abstract: Large language model (LLM)-based multi-agent systems (MAS) have shown strong capabilities in solving complex tasks. As MAS become increasingly autonomous in various safety-critical tasks, detecting malicious agents has become a critical security concern. Although existing graph anomaly detection (GAD)-based defenses can identify anomalous agents, they mainly rely on coarse sentence-level information and overlook fine-grained lexical cues, leading to suboptimal performance. Moreover, the lack of interpretability in these methods limits their reliability and real-world applicability. To address these limitations, we propose XG-Guard, an explainable and fine-grained safeguarding framework for detecting malicious agents in MAS. To incorporate both coarse and fine-grained textual information for anomalous agent identification, we utilize a bi-level agent encoder to jointly model the sentence- and token-level representations of each agent. A theme-based anomaly detector further captures the evolving discussion focus in MAS dialogues, while a bi-level score fusion mechanism quantifies token-level contributions for explanation. Extensive experiments across diverse MAS topologies and attack scenarios demonstrate robust detection performance and strong interpretability of XG-Guard.

Title: Is Your Conditional Diffusion Model Actually Denoising?

Authors: Daniel Pfrommer, Zehao Dou, Christopher Scarvelis, Max Simchowitz, Ali Jadbabaie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18736
Pdf URL: https://arxiv.org/pdf/2512.18736
Copy Paste: [[2512.18736]] Is Your Conditional Diffusion Model Actually Denoising?(https://arxiv.org/abs/2512.18736)
Keywords: diffusion, generative
Abstract: We study the inductive biases of diffusion models with a conditioning-variable, which have seen widespread application as both text-conditioned generative image models and observation-conditioned continuous control policies. We observe that when these models are queried conditionally, their generations consistently deviate from the idealized "denoising" process upon which diffusion models are formulated, inducing disagreement between popular sampling algorithms (e.g. DDPM, DDIM). We introduce Schedule Deviation, a rigorous measure which captures the rate of deviation from a standard denoising process, and provide a methodology to compute it. Crucially, we demonstrate that the deviation from an idealized denoising process occurs irrespective of the model capacity or amount of training data. We posit that this phenomenon occurs due to the difficulty of bridging distinct denoising flows across different parts of the conditioning space and show theoretically how such a phenomenon can arise through an inductive bias towards smoothness.

Title: Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Authors: Tianrui Zhu, Shiyi Zhang, Zhirui Sun, Jingqi Tian, Yansong Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18741
Pdf URL: https://arxiv.org/pdf/2512.18741
Copy Paste: [[2512.18741]] Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation(https://arxiv.org/abs/2512.18741)
Keywords: diffusion
Abstract: Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose \textbf{Memorize-and-Generate (MAG)}, a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce \textbf{MAG-Bench} to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.

Title: MemEvolve: Meta-Evolution of Agent Memory Systems

Authors: Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, Shuicheng Yan
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2512.18746
Pdf URL: https://arxiv.org/pdf/2512.18746
Copy Paste: [[2512.18746]] MemEvolve: Meta-Evolution of Agent Memory Systems(https://arxiv.org/abs/2512.18746)
Keywords: fair, large language model
Abstract: Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06\%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.

Title: IPCV: Information-Preserving Compression for MLLM Visual Encoders

Authors: Yuan Chen, Zichen Wen, Yuzhou Wu, Xuyang Liu, Shuang Chen, Junpeng Ma, Weijia Li, Conghui He, Linfeng Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18747
Pdf URL: https://arxiv.org/pdf/2512.18747
Copy Paste: [[2512.18747]] IPCV: Information-Preserving Compression for MLLM Visual Encoders(https://arxiv.org/abs/2512.18747)
Keywords: transformer, large language model
Abstract: Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by numerous visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT's overhead, while conventional ViT token pruning, without language guidance, risks discarding textually critical visual cues and introduces feature distortions amplified by the ViT's bidirectional attention. To meet these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR) that temporarily reconstructs pruned tokens to participate in attention with minimal overhead, then fully restores them before passing to the LLM. Besides, we introduce Attention Stabilization (AS) to further alleviate the negative influence from token pruning by approximating the K/V of pruned tokens. It can be directly applied to previous LLM-side token pruning methods to enhance their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at this https URL.

Title: Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos

Authors: Xiaoyang Li, Wenzhu Yang, Kanglin Wang, Tiebiao Wang, Qingsong Fei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18750
Pdf URL: https://arxiv.org/pdf/2512.18750
Copy Paste: [[2512.18750]] Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos(https://arxiv.org/abs/2512.18750)
Keywords: robust, extraction
Abstract: Action recognition is a critical task in video understanding, requiring the comprehensive capture of spatio-temporal cues across various scales. However, existing methods often overlook the multi-granularity nature of actions. To address this limitation, we introduce the Context-Aware Network (CAN). CAN consists of two core modules: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM). MTCM effectively extracts temporal cues at multiple scales, capturing both fast-changing motion details and overall action flow. GSCM, on the other hand, extracts spatial cues at different scales by grouping feature maps and applying specialized extraction methods to each group. Experiments conducted on five benchmark datasets (Something-Something V1 and V2, Diving48, Kinetics-400, and UCF101) demonstrate the effectiveness of CAN. Our approach achieves competitive performance, outperforming most mainstream methods, with accuracies of 50.4% on Something-Something V1, 63.9% on Something-Something V2, 88.4% on Diving48, 74.9% on Kinetics-400, and 86.9% on UCF101. These results highlight the importance of capturing multi-scale spatio-temporal cues for robust action recognition.

Title: ISADM: An Integrated STRIDE, ATT&CK, and D3FEND Model for Threat Modeling Against Real-world Adversaries

Authors: Khondokar Fida Hasan, Hasibul Hossain Shajeeb, Chathura Abeydeera, Benjamin Turnbull, Matthew Warren
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.18751
Pdf URL: https://arxiv.org/pdf/2512.18751
Copy Paste: [[2512.18751]] ISADM: An Integrated STRIDE, ATT&CK, and D3FEND Model for Threat Modeling Against Real-world Adversaries(https://arxiv.org/abs/2512.18751)
Keywords: security, defense, attack
Abstract: FinTechs increasing connectivity, rapid innovation, and reliance on global digital infrastructures present significant cybersecurity challenges. Traditional cybersecurity frameworks often struggle to identify and prioritize sector-specific vulnerabilities or adapt to evolving adversary tactics, particularly in highly targeted sectors such as FinTech. To address these gaps, we propose ISADM (Integrated STRIDE-ATTACK-D3FEND Threat Model), a novel hybrid methodology applied to FinTech security that integrates STRIDE's asset-centric threat classification with MITRE ATTACK's catalog of real-world adversary behaviors and D3FEND's structured knowledge of countermeasures. ISADM employs a frequency-based scoring mechanism to quantify the prevalence of adversarial Tactics, Techniques, and Procedures (TTPs), enabling a proactive, score-driven risk assessment and prioritization framework. This proactive approach contributes to shifting organizations from reactive defense strategies toward the strategic fortification of critical assets. We validate ISADM through industry-relevant case study analyses, demonstrating how the approach replicates actual attack patterns and strengthens proactive threat modeling, guiding risk prioritization and resource allocation to the most critical vulnerabilities. Overall, ISADM offers a comprehensive hybrid threat modeling methodology that bridges asset-centric and adversary-centric analysis, providing FinTech systems with stronger defenses. The emphasis on real-world validation highlights its practical significance in enhancing the sector's cybersecurity posture through a frequency-informed, impact-aware prioritization scheme that combines empirical attacker data with contextual risk analysis.

Title: MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation

Authors: Guohui Zhang, Hu Yu, Xiaoxiao Ma, Yaning Pan, Hang Xu, Feng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18766
Pdf URL: https://arxiv.org/pdf/2512.18766
Copy Paste: [[2512.18766]] MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation(https://arxiv.org/abs/2512.18766)
Keywords: generative
Abstract: Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core factor is that policy optimization requires accounting for the probability likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas natively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.

Title: In-Context Audio Control of Video Diffusion Transformers

Authors: Wenze Liu, Weicai Ye, Minghong Cai, Quande Liu, Xintao Wang, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18772
Pdf URL: https://arxiv.org/pdf/2512.18772
Copy Paste: [[2512.18772]] In-Context Audio Control of Video Diffusion Transformers(https://arxiv.org/abs/2512.18772)
Keywords: diffusion, transformer
Abstract: Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.

Title: Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers

Authors: Fanis Mathioulakis, Gorjan Radevski, Tinne Tuytelaars
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.18784
Pdf URL: https://arxiv.org/pdf/2512.18784
Copy Paste: [[2512.18784]] Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers(https://arxiv.org/abs/2512.18784)
Keywords: transformer
Abstract: We introduce Eff-GRot, an approach for efficient and generalizable rotation estimation from RGB images. Given a query image and a set of reference images with known orientations, our method directly predicts the object's rotation in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a comparison in the latent space, jointly processing rotation-aware representations from multiple references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Experimental results show that Eff-GRot offers a promising direction toward more efficient rotation estimation, particularly in latency-sensitive applications.

Title: Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation

Authors: Guangtao Lyu, Chenghao Xu, Qi Liu, Jiexi Yan, Muli Yang, Fen Fang, Cheng Deng
Subjects: cs.CV, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2512.18804
Pdf URL: https://arxiv.org/pdf/2512.18804
Copy Paste: [[2512.18804]] Tempo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation(https://arxiv.org/abs/2512.18804)
Keywords: diffusion
Abstract: Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.

Title: FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation

Authors: Ziyuan Tao, Chuanzhi Xu, Sandaru Jayawardana, Wei Bao, Kanchana Thilakarathna, Teng Joon Lim
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2512.18809
Pdf URL: https://arxiv.org/pdf/2512.18809
Copy Paste: [[2512.18809]] FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation(https://arxiv.org/abs/2512.18809)
Keywords: secure, privacy, protect, defense, federate
Abstract: The rapid growth of short-form video platforms increases the need for privacy-preserving moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. Experiments on RWF-2000 with 40 clients achieve 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, while reducing communication cost by $28.3\times$ compared to full-model federated learning. The code is available at: {this https URL}

Title: EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer

Authors: Yuxiao Yang, Hualian Sheng, Sijia Cai, Jing Lin, Jiahao Wang, Bing Deng, Junzhe Lu, Haoqian Wang, Jieping Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18814
Pdf URL: https://arxiv.org/pdf/2512.18814
Copy Paste: [[2512.18814]] EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer(https://arxiv.org/abs/2512.18814)
Keywords: diffusion, transformer
Abstract: Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Syncronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.

Title: Controllable Probabilistic Forecasting with Stochastic Decomposition Layers

Authors: John S. Schreck, William E. Chapman, Charlie Becker, David John Gagne II, Dhamma Kimpara, Nihanth Cherukuru, Judith Berner, Kirsten J. Mayer, Negin Sobhani
Subjects: cs.LG, cs.AI, physics.ao-ph, physics.geo-ph
Abstract URL: https://arxiv.org/abs/2512.18815
Pdf URL: https://arxiv.org/pdf/2512.18815
Copy Paste: [[2512.18815]] Controllable Probabilistic Forecasting with Stochastic Decomposition Layers(https://arxiv.org/abs/2512.18815)
Keywords: interpretability, diffusion
Abstract: AI weather prediction ensembles with latent noise injection and optimized with the continuous ranked probability score (CRPS) have produced both accurate and well-calibrated predictions with far less computational cost compared with diffusion-based methods. However, current CRPS ensemble approaches vary in their training strategies and noise injection mechanisms, with most injecting noise globally throughout the network via conditional normalization. This structure increases training expense and limits the physical interpretability of the stochastic perturbations. We introduce Stochastic Decomposition Layers (SDL) for converting deterministic machine learning weather models into probabilistic ensemble systems. Adapted from StyleGAN's hierarchical noise injection, SDL applies learned perturbations at three decoder scales through latent-driven modulation, per-pixel noise, and channel scaling. When applied to WXFormer via transfer learning, SDL requires less than 2\% of the computational cost needed to train the baseline model. Each ensemble member is generated from a compact latent tensor (5 MB), enabling perfect reproducibility and post-inference spread adjustment through latent rescaling. Evaluation on 2022 ERA5 reanalysis shows ensembles with spread-skill ratios approaching unity and rank histograms that progressively flatten toward uniformity through medium-range forecasts, achieving calibration competitive with operational IFS-ENS. Multi-scale experiments reveal hierarchical uncertainty: coarse layers modulate synoptic patterns while fine layers control mesoscale variability. The explicit latent parameterization provides interpretable uncertainty quantification for operational forecasting and climate applications.

Title: From Word to World: Can Large Language Models be Implicit Text-based World Models?

Authors: Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18832
Pdf URL: https://arxiv.org/pdf/2512.18832
Copy Paste: [[2512.18832]] From Word to World: Can Large Language Models be Implicit Text-based World Models?(https://arxiv.org/abs/2512.18832)
Keywords: robust, large language model
Abstract: Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundry on when world modeling effectively supports agent learning.

Title: Generative Modeling through Spectral Analysis of Koopman Operator

Authors: Yuanchao Xu, Fengyi Li, Masahiro Fujisawa, Youssef Marzouk, Isao Ishikawa
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2512.18837
Pdf URL: https://arxiv.org/pdf/2512.18837
Copy Paste: [[2512.18837]] Generative Modeling through Spectral Analysis of Koopman Operator(https://arxiv.org/abs/2512.18837)
Keywords: generative
Abstract: We propose Koopman Spectral Wasserstein Gradient Descent (KSWGD), a generative modeling framework that combines operator-theoretic spectral analysis with optimal transport. The novel insight is that the spectral structure required for accelerated Wasserstein gradient descent can be directly estimated from trajectory data via Koopman operator approximation which can eliminate the need for explicit knowledge of the target potential or neural network training. We provide rigorous convergence analysis and establish connection to Feynman-Kac theory that clarifies the method's probabilistic foundation. Experiments across diverse settings, including compact manifold sampling, metastable multi-well systems, image generation, and high dimensional stochastic partial differential equation, demonstrate that KSWGD consistently achieves faster convergence than other existing methods while maintaining high sample quality.

Title: MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models

Authors: Tung Duong Ta, Tim Oates
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18841
Pdf URL: https://arxiv.org/pdf/2512.18841
Copy Paste: [[2512.18841]] MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models(https://arxiv.org/abs/2512.18841)
Keywords: large language model
Abstract: Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate our MDToC's effectiveness, with GPT-4-Turbo achieving 58.1\% on CHAMP, 86.6\% on MATH, and 85\% on Game-of-24 - outperforming GoT by 5\%, 5.4\%, and 4\% on all these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6\% over ToT and 6.2\% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.

Title: Brain-Gen: Towards Interpreting Neural Signals for Stimulus Reconstruction Using Transformers and Latent Diffusion Models

Authors: Hasib Aslam, Muhammad Talal Faiz, Muhammad Imran Malik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18843
Pdf URL: https://arxiv.org/pdf/2512.18843
Copy Paste: [[2512.18843]] Brain-Gen: Towards Interpreting Neural Signals for Stimulus Reconstruction Using Transformers and Latent Diffusion Models(https://arxiv.org/abs/2512.18843)
Keywords: interpretability, diffusion, transformer
Abstract: Advances in neuroscience and artificial intelligence have enabled preliminary decoding of brain activity. However, despite the progress, the interpretability of neural representations remains limited. A significant challenge arises from the intrinsic properties of electroencephalography (EEG) signals, including high noise levels, spatial diffusion, and pronounced temporal variability. To interpret the neural mechanism underlying thoughts, we propose a transformers-based framework to extract spatial-temporal representations associated with observed visual stimuli from EEG recordings. These features are subsequently incorporated into the attention mechanisms of Latent Diffusion Models (LDMs) to facilitate the reconstruction of visual stimuli from brain activity. The quantitative evaluations on publicly available benchmark datasets demonstrate that the proposed method excels at modeling the semantic structures from EEG signals; achieving up to 6.5% increase in latent space clustering accuracy and 11.8% increase in zero shot generalization across unseen classes while having comparable Inception Score and Fréchet Inception Distance with existing baselines. Our work marks a significant step towards generalizable semantic interpretation of the EEG signals.

Title: VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference

Authors: Sicheng Song, Yanjie Zhang, Zixin Chen, Huamin Qu, Changbo Wang, Chenhui Li
Subjects: cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2512.18853
Pdf URL: https://arxiv.org/pdf/2512.18853
Copy Paste: [[2512.18853]] VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference(https://arxiv.org/abs/2512.18853)
Keywords: protect, attack, watermark, large language model
Abstract: The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker's intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.

Title: Toward Human-Centered AI-Assisted Terminology Work

Authors: Antonio San Martin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18859
Pdf URL: https://arxiv.org/pdf/2512.18859
Copy Paste: [[2512.18859]] Toward Human-Centered AI-Assisted Terminology Work(https://arxiv.org/abs/2512.18859)
Keywords: diffusion, generative
Abstract: The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist's capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.

Title: Cross-modal Counterfactual Explanations: Uncovering Decision Factors and Dataset Biases in Subjective Classification

Authors: Alina Elena Baia, Andrea Cavallaro
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2512.18864
Pdf URL: https://arxiv.org/pdf/2512.18864
Copy Paste: [[2512.18864]] Cross-modal Counterfactual Explanations: Uncovering Decision Factors and Dataset Biases in Subjective Classification(https://arxiv.org/abs/2512.18864)
Keywords: privacy, fair, interpretability
Abstract: Concept-driven counterfactuals explain decisions of classifiers by altering the model predictions through semantic changes. In this paper, we present a novel approach that leverages cross-modal decompositionality and image-specific concepts to create counterfactual scenarios expressed in natural language. We apply the proposed interpretability framework, termed Decompose and Explain (DeX), to the challenging domain of image privacy decisions, which are contextual and subjective. This application enables the quantification of the differential contributions of key scene elements to the model prediction. We identify relevant decision factors via a multi-criterion selection mechanism that considers both image similarity for minimal perturbations and decision confidence to prioritize impactful changes. This approach evaluates and compares diverse explanations, and assesses the interdependency and mutual influence among explanatory properties. By leveraging image-specific concepts, DeX generates image-grounded, sparse explanations, yielding significant improvements over the state of the art. Importantly, DeX operates as a training-free framework, offering high flexibility. Results show that DeX not only uncovers the principal contributing factors influencing subjective decisions, but also identifies underlying dataset biases allowing for targeted mitigation strategies to improve fairness.

Title: CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis

Authors: Kaidi Liang, Ke Li, Xianbiao Hu, Ruwen Qin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18878
Pdf URL: https://arxiv.org/pdf/2512.18878
Copy Paste: [[2512.18878]] CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis(https://arxiv.org/abs/2512.18878)
Keywords: large language model
Abstract: Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176\% improvement in crash localization, and a 40\% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at this https URL.

Title: Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

Authors: Ming Li, Han Chen, Yunze Xiao, Jian Chen, Hong Jiao, Tianyi Zhou
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2512.18880
Pdf URL: https://arxiv.org/pdf/2512.18880
Copy Paste: [[2512.18880]] Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction(https://arxiv.org/abs/2512.18880)
Keywords: large language model
Abstract: Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.

Title: Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations

Authors: Shaomu Tan, Ryosuke Mitani, Ritvik Choudhary, Qiyu Wu, Toshiyuki Sekiya, Christof Monz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.18906
Pdf URL: https://arxiv.org/pdf/2512.18906
Copy Paste: [[2512.18906]] Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations(https://arxiv.org/abs/2512.18906)
Keywords: robust, generative
Abstract: Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.

Title: Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models

Authors: Mohamad Zamini, Diksha Shukla
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18910
Pdf URL: https://arxiv.org/pdf/2512.18910
Copy Paste: [[2512.18910]] Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models(https://arxiv.org/abs/2512.18910)
Keywords: transformer, large language model
Abstract: Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.

Title: Merging of Kolmogorov-Arnold networks trained on disjoint datasets

Authors: Andrew Polar, Michael Poluektov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18921
Pdf URL: https://arxiv.org/pdf/2512.18921
Copy Paste: [[2512.18921]] Merging of Kolmogorov-Arnold networks trained on disjoint datasets(https://arxiv.org/abs/2512.18921)
Keywords: federate
Abstract: Training on disjoint datasets can serve two primary goals: accelerating data processing and enabling federated learning. It has already been established that Kolmogorov-Arnold networks (KANs) are particularly well suited for federated learning and can be merged through simple parameter averaging. While the federated learning literature has mostly focused on achieving training convergence across distributed nodes, the present paper specifically targets acceleration of the training, which depends critically on the choice of an optimisation method and the type of the basis functions. To the best knowledge of the authors, the fastest currently-available combination is the Newton-Kaczmarz method and the piecewise-linear basis functions. Here, it is shown that training on disjoint datasets (or disjoint subsets of the training dataset) can further improve the performance. Experimental comparisons are provided, and all corresponding codes are publicly available.

Title: The Ensemble Schr{ö}dinger Bridge filter for Nonlinear Data Assimilation

Authors: Feng Bao, Hui Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18928
Pdf URL: https://arxiv.org/pdf/2512.18928
Copy Paste: [[2512.18928]] The Ensemble Schr{ö}dinger Bridge filter for Nonlinear Data Assimilation(https://arxiv.org/abs/2512.18928)
Keywords: diffusion, generative
Abstract: This work puts forward a novel nonlinear optimal filter namely the Ensemble Schr{ö}dinger Bridge nonlinear filter. The proposed filter finds marriage of the standard prediction procedure and the diffusion generative modeling for the analysis procedure to realize one filtering step. The designed approach finds no structural model error, and it is derivative free, training free and highly parallizable. Experimental results show that the designed algorithm performs well given highly nonlinear dynamics in (mildly) high dimension up to 40 or above under a chaotic environment. It also shows better performance than classical methods such as the ensemble Kalman filter and the Particle filter in numerous tests given different level of nonlinearity. Future work will focus on extending the proposed approach to practical meteorological applications and establishing a rigorous convergence analysis.

Title: LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer

Authors: Raina Panda, Daniel Fein, Arpita Singhal, Mark Fiore, Maneesh Agrawala, Matyas Bohacek
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.18930
Pdf URL: https://arxiv.org/pdf/2512.18930
Copy Paste: [[2512.18930]] LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer(https://arxiv.org/abs/2512.18930)
Keywords: generative
Abstract: Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.

Title: DPSR: Differentially Private Sparse Reconstruction via Multi-Stage Denoising for Recommender Systems

Authors: Sarwan Ali
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.18932
Pdf URL: https://arxiv.org/pdf/2512.18932
Copy Paste: [[2512.18932]] DPSR: Differentially Private Sparse Reconstruction via Multi-Stage Denoising for Recommender Systems(https://arxiv.org/abs/2512.18932)
Keywords: privacy, protect
Abstract: Differential privacy (DP) has emerged as the gold standard for protecting user data in recommender systems, but existing privacy-preserving mechanisms face a fundamental challenge: the privacy-utility tradeoff inevitably degrades recommendation quality as privacy budgets tighten. We introduce DPSR (Differentially Private Sparse Reconstruction), a novel three-stage denoising framework that fundamentally addresses this limitation by exploiting the inherent structure of rating matrices -- sparsity, low-rank properties, and collaborative patterns. DPSR consists of three synergistic stages: (1) \textit{information-theoretic noise calibration} that adaptively reduces noise for high-information ratings, (2) \textit{collaborative filtering-based denoising} that leverages item-item similarities to remove privacy noise, and (3) \textit{low-rank matrix completion} that exploits latent structure for signal recovery. Critically, all denoising operations occur \textit{after} noise injection, preserving differential privacy through the post-processing immunity theorem while removing both privacy-induced and inherent data noise. Through extensive experiments on synthetic datasets with controlled ground truth, we demonstrate that DPSR achieves 5.57\% to 9.23\% RMSE improvement over state-of-the-art Laplace and Gaussian mechanisms across privacy budgets ranging from $\varepsilon=0.1$ to $\varepsilon=10.0$ (all improvements statistically significant with $p < 0.05$, most $p < 0.001$). Remarkably, at $\varepsilon=1.0$, DPSR achieves RMSE of 0.9823, \textit{outperforming even the non-private baseline} (1.0983), demonstrating that our denoising pipeline acts as an effective regularizer that removes data noise in addition to privacy noise.

Title: Point What You Mean: Visually Grounded Instruction Policy

Authors: Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma, Di Zhang, Yingdong Hu, Guang Chen, Junyuan Xie, Junliang Guo, Junqiao Zhao, Yang Gao
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.18933
Pdf URL: https://arxiv.org/pdf/2512.18933
Copy Paste: [[2512.18933]] Point What You Mean: Visually Grounded Instruction Policy(https://arxiv.org/abs/2512.18933)
Keywords: robust
Abstract: Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.

Title: When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models

Authors: Michael S. Zhang, Rishi A. Ruia, Arnav Kewalram, Saathvik Dharmapuram, Utkarsh Sharma, Kevin Zhu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18934
Pdf URL: https://arxiv.org/pdf/2512.18934
Copy Paste: [[2512.18934]] When Less is More: 8-bit Quantization Improves Continual Learning in Large Language Models(https://arxiv.org/abs/2512.18934)
Keywords: large language model
Abstract: Catastrophic forgetting poses a fundamental challenge in continual learning, particularly when models are quantized for deployment efficiency. We systematically investigate the interplay between quantization precision (FP16, INT8, INT4) and replay buffer strategies in large language models, revealing unexpected dynamics. While FP16 achieves superior initial task performance (74.44% on NLU), we observe a striking inversion on subsequent tasks: quantized models outperform FP16 by 8-15% on final task forward accuracy, with INT4 achieving nearly double FP16's performance on Code generation (40% vs 20%). Critically, even minimal replay buffers (0.1%) dramatically improve retention - increasing NLU retention after Math training from 45% to 65% across all precision levels - with INT8 consistently achieving the optimal balance between learning plasticity and knowledge retention. We hypothesize that quantization-induced noise acts as implicit regularization, preventing the overfitting to new task gradients that plagues high-precision models. These findings challenge the conventional wisdom that higher precision is always preferable, suggesting instead that INT8 quantization offers both computational efficiency and superior continual learning dynamics. Our results provide practical guidelines for deploying compressed models in continual learning scenarios: small replay buffers (1-2%) suffice for NLU tasks, while Math and Code benefit from moderate buffers (5-10%), with quantized models requiring less replay than FP16 to achieve comparable retention. Code is available at this https URL.

Title: FASTRIC: Prompt Specification Language for Verifiable LLM Interactions

Authors: Wen-Long Jin
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2512.18940
Pdf URL: https://arxiv.org/pdf/2512.18940
Copy Paste: [[2512.18940]] FASTRIC: Prompt Specification Language for Verifiable LLM Interactions(https://arxiv.org/abs/2512.18940)
Keywords: large language model
Abstract: Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, enabling conformance verification through execution trace analysis. The LLM serves as intelligent execution agent: interpreting designer-encoded FSMs to execute specified behavioral roles. Unlike symbolic specification languages requiring parsers and compilers, FASTRIC leverages LLMs as unified infrastructure-simultaneously parser, interpreter, runtime environment, and development assistant. FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) structuring multi-turn interactions. Specification formality-ranging from implicit descriptions that frontier models infer to explicit step-by-step instructions for weaker models-serves as a design parameter. We introduce procedural conformance as verification metric measuring execution adherence to FSM specifications. Testing a 3-state kindergarten tutoring FSM across four formality levels and three model scales (14.7B, 685B, 1T+ parameters) reveals optimal specification formality is a function of model capacity. DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) shows no stable optimum with high variance (SD=0.16-0.36). These findings reveal model-specific formality ranges-"Goldilocks zones"-where specifications provide sufficient structure without over-constraint, establishing Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees.

Title: Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement

Authors: Saman Forouzandeh, Wei Peng, Parham Moradi, Xinghuo Yu, Mahdi Jalili
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18950
Pdf URL: https://arxiv.org/pdf/2512.18950
Copy Paste: [[2512.18950]] Learning Hierarchical Procedural Memory for LLM Agents through Bayesian Selection and Contrastive Refinement(https://arxiv.org/abs/2512.18950)
Keywords: large language model
Abstract: We present MACLA, a framework that decouples reasoning from learning by maintaining a frozen large language model while performing all adaptation in an external hierarchical procedural memory. MACLA extracts reusable procedures from trajectories, tracks reliability via Bayesian posteriors, selects actions through expected-utility scoring, and refines procedures by contrasting successes and failures. Across four benchmarks (ALFWorld, WebShop, TravelPlanner, InterCodeSQL), MACLA achieves 78.1 percent average performance, outperforming all baselines. On ALFWorld unseen tasks, MACLA reaches 90.3 percent with 3.1 percent positive generalization. The system constructs memory in 56 seconds, 2800 times faster than the state-of-the-art LLM parameter-training baseline, compressing 2851 trajectories into 187 procedures. Experimental results demonstrate that structured external memory with Bayesian selection and contrastive refinement enables sample-efficient, interpretable, and continually improving agents without LLM parameter updates.

Title: Symmetrization of 3D Generative Models

Authors: Nicolas Caytuiro, Ivan Sipiran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18953
Pdf URL: https://arxiv.org/pdf/2512.18953
Copy Paste: [[2512.18953]] Symmetrization of 3D Generative Models(https://arxiv.org/abs/2512.18953)
Keywords: generative
Abstract: We propose a novel data-centric approach to promote symmetry in 3D generative models by modifying the training data rather than the model architecture. Our method begins with an analysis of reflectional symmetry in both real-world 3D shapes and samples generated by state-of-the-art models. We hypothesize that training a generative model exclusively on half-objects, obtained by reflecting one half of the shapes along the x=0 plane, enables the model to learn a rich distribution of partial geometries which, when reflected during generation, yield complete shapes that are both visually plausible and geometrically symmetric. To test this, we construct a new dataset of half-objects from three ShapeNet classes (Airplane, Car, and Chair) and train two generative models. Experiments demonstrate that the generated shapes are symmetrical and consistent, compared with the generated objects from the original model and the original dataset objects.

Title: VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion

Authors: Zaidao Han, Risa Higashita, Jiang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18954
Pdf URL: https://arxiv.org/pdf/2512.18954
Copy Paste: [[2512.18954]] VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion(https://arxiv.org/abs/2512.18954)
Keywords: extraction, segmentation
Abstract: Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.

Title: Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation

Authors: Debamita Ghosh, George K. Atia, Yue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18957
Pdf URL: https://arxiv.org/pdf/2512.18957
Copy Paste: [[2512.18957]] Scaling Online Distributionally Robust Reinforcement Learning: Sample-Efficient Guarantees with General Function Approximation(https://arxiv.org/abs/2512.18957)
Keywords: robust, generative
Abstract: The deployment of reinforcement learning (RL) agents in real-world applications is often hindered by performance degradation caused by mismatches between training and deployment environments. Distributionally robust RL (DR-RL) addresses this issue by optimizing worst-case performance over an uncertainty set of transition dynamics. However, existing work typically relies on substantial prior knowledge-such as access to a generative model or a large offline dataset-and largely focuses on tabular methods that do not scale to complex domains. We overcome these limitations by proposing an online DR-RL algorithm with general function approximation that learns an optimal robust policy purely through interaction with the environment, without requiring prior models or offline data, enabling deployment in high-dimensional tasks. We further provide a theoretical analysis establishing a near-optimal sublinear regret bound under a total variation uncertainty set, demonstrating the sample efficiency and effectiveness of our method.

Title: DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation

Authors: Guandong Li, Yijun Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18964
Pdf URL: https://arxiv.org/pdf/2512.18964
Copy Paste: [[2512.18964]] DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation(https://arxiv.org/abs/2512.18964)
Keywords: robust, diffusion
Abstract: Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to ``Semantic-Visual Dissonance,'' where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural ``sticker-like'' effect. We propose **DVI (Disentangled Visual-Identity)**, a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a **Parameter-Free Feature Modulation** mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's ``visual soul'' without training. Furthermore, a **Dynamic Temporal Granularity Scheduler** aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.

Title: Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling

Authors: Sutashu Tomonaga, Kenji Doya, Noboru Murata
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.18965
Pdf URL: https://arxiv.org/pdf/2512.18965
Copy Paste: [[2512.18965]] Lag Operator SSMs: A Geometric Framework for Structured State Space Modeling(https://arxiv.org/abs/2512.18965)
Keywords: robust
Abstract: Structured State Space Models (SSMs), which are at the heart of the recently popular Mamba architecture, are powerful tools for sequence modeling. However, their theoretical foundation relies on a complex, multi-stage process of continuous-time modeling and subsequent discretization, which can obscure intuition. We introduce a direct, first-principles framework for constructing discrete-time SSMs that is both flexible and modular. Our approach is based on a novel lag operator, which geometrically derives the discrete-time recurrence by measuring how the system's basis functions "slide" and change from one timestep to the next. The resulting state matrices are computed via a single inner product involving this operator, offering a modular design space for creating novel SSMs by flexibly combining different basis functions and time-warping schemes. To validate our approach, we demonstrate that a specific instance exactly recovers the recurrence of the influential HiPPO model. Numerical simulations confirm our derivation, providing new theoretical tools for designing flexible and robust sequence models.

Title: Total Curvature Regularization and its_Minimization for Surface and Image Smoothing

Authors: Tianle Lu, Ke Chen, Yuping Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.18968
Pdf URL: https://arxiv.org/pdf/2512.18968
Copy Paste: [[2512.18968]] Total Curvature Regularization and its_Minimization for Surface and Image Smoothing(https://arxiv.org/abs/2512.18968)
Keywords: robust
Abstract: We introduce a novel formulation for curvature regularization by penalizing normal curvatures from multiple directions. This total normal curvature regularization is capable of producing solutions with sharp edges and precise isotropic properties. To tackle the resulting high-order nonlinear optimization problem, we reformulate it as the task of finding the steady-state solution of a time-dependent partial differential equation (PDE) system. Time discretization is achieved through operator splitting, where each subproblem at the fractional steps either has a closed-form solution or can be efficiently solved using advanced algorithms. Our method circumvents the need for complex parameter tuning and demonstrates robustness to parameter choices. The efficiency and effectiveness of our approach have been rigorously validated in the context of surface and image smoothing problems.

Title: R-GenIMA: Integrating Neuroimaging and Genetics with Interpretable Multimodal AI for Alzheimer's Disease Progression

Authors: Kun Zhao, Siyuan Dai, Yingying Zhang, Guodong Liu, Pengfei Gu, Chenghua Lin, Paul M. Thompson, Alex Leow, Heng Huang, Lifang He, Liang Zhan, Haoteng Tang (for the Alzheimer's Disease Neuroimaging Initiative (ADNI) Project)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18986
Pdf URL: https://arxiv.org/pdf/2512.18986
Copy Paste: [[2512.18986]] R-GenIMA: Integrating Neuroimaging and Genetics with Interpretable Multimodal AI for Alzheimer's Disease Progression(https://arxiv.org/abs/2512.18986)
Keywords: transformer, large language model
Abstract: Early detection of Alzheimer's disease (AD) requires models capable of integrating macro-scale neuroanatomical alterations with micro-scale genetic susceptibility, yet existing multimodal approaches struggle to align these heterogeneous signals. We introduce R-GenIMA, an interpretable multimodal large language model that couples a novel ROI-wise vision transformer with genetic prompting to jointly model structural MRI and single nucleotide polymorphisms (SNPs) variations. By representing each anatomically parcellated brain region as a visual token and encoding SNP profiles as structured text, the framework enables cross-modal attention that links regional atrophy patterns to underlying genetic factors. Applied to the ADNI cohort, R-GenIMA achieves state-of-the-art performance in four-way classification across normal cognition (NC), subjective memory concerns (SMC), mild cognitive impairment (MCI), and AD. Beyond predictive accuracy, the model yields biologically meaningful explanations by identifying stage-specific brain regions and gene signatures, as well as coherent ROI-Gene association patterns across the disease continuum. Attention-based attribution revealed genes consistently enriched for established GWAS-supported AD risk loci, including APOE, BIN1, CLU, and RBFOX1. Stage-resolved neuroanatomical signatures identified shared vulnerability hubs across disease stages alongside stage-specific patterns: striatal involvement in subjective decline, frontotemporal engagement during prodromal impairment, and consolidated multimodal network disruption in AD. These results demonstrate that interpretable multimodal AI can synthesize imaging and genetics to reveal mechanistic insights, providing a foundation for clinically deployable tools that enable earlier risk stratification and inform precision therapeutic strategies in Alzheimer's disease.

Title: ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation

Authors: Gyeongrok Oh, Youngdong Jang, Jonghyun Choi, Suk-Ju Kang, Guang Lin, Sangpil Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18991
Pdf URL: https://arxiv.org/pdf/2512.18991
Copy Paste: [[2512.18991]] ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation(https://arxiv.org/abs/2512.18991)
Keywords: robust, segmentation
Abstract: Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types-static, dynamic, and missing-offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.

Title: Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework

Authors: Jinyan Liu, Zikang Chen, Qinchuan Wang, Tan Xie, Heming Zheng, Xudong Lv
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.18999
Pdf URL: https://arxiv.org/pdf/2512.18999
Copy Paste: [[2512.18999]] Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework(https://arxiv.org/abs/2512.18999)
Keywords: extraction, large language model
Abstract: When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.

Title: Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models

Authors: Tongyuan Miao, Gary Huang, Kai Jun Han, Annie Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19004
Pdf URL: https://arxiv.org/pdf/2512.19004
Copy Paste: [[2512.19004]] Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models(https://arxiv.org/abs/2512.19004)
Keywords: diffusion, generative, large language model
Abstract: Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35\% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.

Title: Quantum-Resistant Cryptographic Models for Next-Gen Cybersecurity

Authors: Navin Chhibber, Amber Rastogi, Ankur Mahida, Vatsal Gupta, Piyush Ranjan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.19005
Pdf URL: https://arxiv.org/pdf/2512.19005
Copy Paste: [[2512.19005]] Quantum-Resistant Cryptographic Models for Next-Gen Cybersecurity(https://arxiv.org/abs/2512.19005)
Keywords: secure, security, protect, attack, robust
Abstract: Another threat is the development of large quantum computers, which have a high likelihood of breaking the high popular security protocols because it can use both Shor and Grover algorithms. In order to fix this looming threat, quantum-resistant cryptographic systems, otherwise known as post-quantum cryptography (PQC), are being formulated to protect cybersecurity systems of the future. The current paper presents the state of the art in designing, realizing, and testing the security of robust quantum-resistant algorithms, paying attention to lattice-based, code-based, multivariate polynomial and hash-based cryptography. We discuss their resistance to classical and quantum attackers, distributed system scalability properties, and their deployment in practice (secure communications, blockchain, cloud computing infrastructures). Also, we study a hybrid cryptographic model that integrates the classical efficient cryptography scheme and a quantum-resilient cryptographic scheme to achieve a backward-compatible solution and simultaneously improving the forward security properties. With the experimental findings, it is evident that performance with reasonable computational footprint of the proposed framework succeeds to install amplified security fortitude which successfully harbours prolific cybersecurity systems of the future.

Title: The 6th International Verification of Neural Networks Competition (VNN-COMP 2025): Summary and Results

Authors: Konstantin Kaulen, Tobias Ladner, Stanley Bak, Christopher Brix, Hai Duong, Thomas Flinkow, Taylor T. Johnson, Lukas Koller, Edoardo Manino, ThanhVu H Nguyen, Haoze Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19007
Pdf URL: https://arxiv.org/pdf/2512.19007
Copy Paste: [[2512.19007]] The 6th International Verification of Neural Networks Competition (VNN-COMP 2025): Summary and Results(https://arxiv.org/abs/2512.19007)
Keywords: fair
Abstract: This report summarizes the 6th International Verification of Neural Networks Competition (VNN-COMP 2025), held as a part of the 8th International Symposium on AI Verification (SAIV), that was collocated with the 37th International Conference on Computer-Aided Verification (CAV). VNN-COMP is held annually to facilitate the fair and objective comparison of state-of-the-art neural network verification tools, encourage the standardization of tool interfaces, and bring together the neural network verification community. To this end, standardized formats for networks (ONNX) and specification (VNN-LIB) were defined, tools were evaluated on equal-cost hardware (using an automatic evaluation pipeline based on AWS instances), and tool parameters were chosen by the participants before the final test sets were made public. In the 2025 iteration, 8 teams participated on a diverse set of 16 regular and 9 extended benchmarks. This report summarizes the rules, benchmarks, participating tools, results, and lessons learned from this iteration of this competition.

Title: Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline

Authors: Akshaj Prashanth Rao, Advait Singh, Saumya Kumaar Saksena, Dhruv Kumar
Subjects: cs.CR, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19011
Pdf URL: https://arxiv.org/pdf/2512.19011
Copy Paste: [[2512.19011]] Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline(https://arxiv.org/abs/2512.19011)
Keywords: secure, security, defense, attack, robust, large language model
Abstract: Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline. Its core component is a semantic filter based on text normalization, TF-IDF representations, and a Linear SVM classifier. Despite its simplicity, this module achieves 93.4% accuracy and 96.5% specificity on held-out data, substantially reducing attack throughput while incurring negligible computational overhead. Building on this efficient foundation, the full pipeline integrates complementary detection and mitigation mechanisms that operate at successive stages, providing strong robustness with minimal latency. In comparative experiments, our SVM-based configuration improves overall accuracy from 35.1% to 93.4% while reducing average time to completion from approximately 450s to 47s, yielding over 10 times lower latency than ShieldGemma. These results demonstrate that the proposed design simultaneously advances defensive precision and efficiency, addressing a core limitation of current model-based moderators. Evaluation across a curated corpus of over 30,000 labeled prompts, including benign, jailbreak, and application-layer injections, confirms that staged, resource-efficient defenses can robustly secure modern LLM-driven applications.

Title: DREAM: Dynamic Red-teaming across Environments for AI Models

Authors: Liming Lu, Xiang Gu, Junyu Huang, Jiawei Du, Yunhuai Liu, Yongbin Zhou, Shuchao Pang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.19016
Pdf URL: https://arxiv.org/pdf/2512.19016
Copy Paste: [[2512.19016]] DREAM: Dynamic Red-teaming across Environments for AI Models(https://arxiv.org/abs/2512.19016)
Keywords: defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) are increasingly used in agentic systems, where their interactions with diverse tools and environments create complex, multi-stage safety challenges. However, existing benchmarks mostly rely on static, single-turn assessments that miss vulnerabilities from adaptive, long-chain attacks. To fill this gap, we introduce DREAM, a framework for systematic evaluation of LLM agents against dynamic, multi-stage attacks. At its core, DREAM uses a Cross-Environment Adversarial Knowledge Graph (CE-AKG) to maintain stateful, cross-domain understanding of vulnerabilities. This graph guides a Contextualized Guided Policy Search (C-GPS) algorithm that dynamically constructs attack chains from a knowledge base of 1,986 atomic actions across 349 distinct digital environments. Our evaluation of 12 leading LLM agents reveals a critical vulnerability: these attack chains succeed in over 70% of cases for most models, showing the power of stateful, cross-environment exploits. Through analysis of these failures, we identify two key weaknesses in current agents: contextual fragility, where safety behaviors fail to transfer across environments, and an inability to track long-term malicious intent. Our findings also show that traditional safety measures, such as initial defense prompts, are largely ineffective against attacks that build context over multiple interactions. To advance agent safety research, we release DREAM as a tool for evaluating vulnerabilities and developing more robust defenses.

Title: Optimizer Dynamics at the Edge of Stability with Differential Privacy

Authors: Ayana Hussain, Ricky Fang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19019
Pdf URL: https://arxiv.org/pdf/2512.19019
Copy Paste: [[2512.19019]] Optimizer Dynamics at the Edge of Stability with Differential Privacy(https://arxiv.org/abs/2512.19019)
Keywords: privacy
Abstract: Deep learning models can reveal sensitive information about individual training examples, and while differential privacy (DP) provides guarantees restricting such leakage, it also alters optimization dynamics in poorly understood ways. We study the training dynamics of neural networks under DP by comparing Gradient Descent (GD), and Adam to their privacy-preserving variants. Prior work shows that these optimizers exhibit distinct stability dynamics: full-batch methods train at the Edge of Stability (EoS), while mini-batch and adaptive methods exhibit analogous edge-of-stability behavior. At these regimes, the training loss and the sharpness--the maximum eigenvalue of the training loss Hessian--exhibit certain characteristic behavior. In DP training, per-example gradient clipping and Gaussian noise modify the update rule, and it is unclear whether these stability patterns persist. We analyze how clipping and noise change sharpness and loss evolution and show that while DP generally reduces the sharpness and can prevent optimizers from fully reaching the classical stability thresholds, patterns from EoS and analogous adaptive methods stability regimes persist, with the largest learning rates and largest privacy budgets approaching, and sometimes exceeding, these thresholds. These findings highlight the unpredictability introduced by DP in neural network optimization.

Title: CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization

Authors: Zelin Zhao, Xinyu Gong, Bangya Liu, Ziyang Song, Jun Zhang, Suhui Wu, Yongxin Chen, Hao Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19020
Pdf URL: https://arxiv.org/pdf/2512.19020
Copy Paste: [[2512.19020]] CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization(https://arxiv.org/abs/2512.19020)
Keywords: robust, diffusion
Abstract: Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at this https URL.

Title: VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation

Authors: Sihao Lin, Zerui Li, Xunyi Zhao, Gengze Zhou, Liuyi Wang, Rong Wei, Rui Tang, Juncheng Li, Hanqing Wang, Jiangmiao Pang, Anton van den Hengel, Jiajun Liu, Qi Wu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.19021
Pdf URL: https://arxiv.org/pdf/2512.19021
Copy Paste: [[2512.19021]] VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation(https://arxiv.org/abs/2512.19021)
Keywords: robust
Abstract: Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible and teleporting "ghost" agents that support full-kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.

Title: Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection

Authors: Haoze Li, Jie Zhang, Guoying Zhao, Stephen Lin, Shiguang Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19022
Pdf URL: https://arxiv.org/pdf/2512.19022
Copy Paste: [[2512.19022]] Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection(https://arxiv.org/abs/2512.19022)
Keywords: privacy, attack, robust
Abstract: Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose \textbf{SVLP-IL}, a VLP-based RF-IL framework that balances stability and plasticity via \textit{Multi-Aspect Prompting} (MAP) and \textit{Selective Elastic Weight Consolidation} (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.

Title: The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation

Authors: Hengrui Jia, Taoran Li, Jonas Guan, Varun Chandrasekaran
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19025
Pdf URL: https://arxiv.org/pdf/2512.19025
Copy Paste: [[2512.19025]] The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation(https://arxiv.org/abs/2512.19025)
Keywords: large language model
Abstract: Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning--motivated by copyright or safety--implicitly target not only verbatim content in $D_u$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluation and appear to have ``forgotten'' the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose \name, an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$\beta$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.

Title: Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation

Authors: Connor Kilrain, David Carlyn, Julia Chae, Sara Beery, Wei-Lun Chao, Jianyang Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19026
Pdf URL: https://arxiv.org/pdf/2512.19026
Copy Paste: [[2512.19026]] Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation(https://arxiv.org/abs/2512.19026)
Keywords: generative
Abstract: The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one's pet), we expect the generated image to retain precise details attached to the subject's identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities -- from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.

Title: Automatic Neuronal Activity Segmentation in Fast Four Dimensional Spatio-Temporal Fluorescence Imaging using Bayesian Approach

Authors: Ran Li, Pan Xiao, Kaushik Dutta, Youdong Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19032
Pdf URL: https://arxiv.org/pdf/2512.19032
Copy Paste: [[2512.19032]] Automatic Neuronal Activity Segmentation in Fast Four Dimensional Spatio-Temporal Fluorescence Imaging using Bayesian Approach(https://arxiv.org/abs/2512.19032)
Keywords: segmentation
Abstract: Fluorescence Microcopy Calcium Imaging is a fundamental tool to in-vivo record and analyze large scale neuronal activities simultaneously at a single cell resolution. Automatic and precise detection of behaviorally relevant neuron activity from the recordings is critical to study the mapping of brain activity in organisms. However a perpetual bottleneck to this problem is the manual segmentation which is time and labor intensive and lacks generalizability. To this end, we present a Bayesian Deep Learning Framework to detect neuronal activities in 4D spatio-temporal data obtained by light sheet microscopy. Our approach accounts for the use of temporal information by calculating pixel wise correlation maps and combines it with spatial information given by the mean summary image. The Bayesian framework not only produces probability segmentation maps but also models the uncertainty pertaining to active neuron detection. To evaluate the accuracy of our framework we implemented the test of reproducibility to assert the generalization of the network to detect neuron activity. The network achieved a mean Dice Score of 0.81 relative to the synthetic Ground Truth obtained by Otsu's method and a mean Dice Score of 0.79 between the first and second run for test of reproducibility. Our method successfully deployed can be used for rapid detection of active neuronal activities for behavioural studies.

Title: Elevating Intrusion Detection and Security Fortification in Intelligent Networks through Cutting-Edge Machine Learning Paradigms

Authors: Md Minhazul Islam Munna, Md Mahbubur Rahman, Jaroslav Frnda, Muhammad Shahid Anwar, Alpamis Kutlimuratov
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19037
Pdf URL: https://arxiv.org/pdf/2512.19037
Copy Paste: [[2512.19037]] Elevating Intrusion Detection and Security Fortification in Intelligent Networks through Cutting-Edge Machine Learning Paradigms(https://arxiv.org/abs/2512.19037)
Keywords: security, attack, robust, extraction
Abstract: The proliferation of IoT devices and their reliance on Wi-Fi networks have introduced significant security vulnerabilities, particularly the KRACK and Kr00k attacks, which exploit weaknesses in WPA2 encryption to intercept and manipulate sensitive data. Traditional IDS using classifiers face challenges such as model overfitting, incomplete feature extraction, and high false positive rates, limiting their effectiveness in real-world deployments. To address these challenges, this study proposes a robust multiclass machine learning based intrusion detection framework. The methodology integrates advanced feature selection techniques to identify critical attributes, mitigating redundancy and enhancing detection accuracy. Two distinct ML architectures are implemented: a baseline classifier pipeline and a stacked ensemble model combining noise injection, Principal Component Analysis (PCA), and meta learning to improve generalization and reduce false positives. Evaluated on the AWID3 data set, the proposed ensemble architecture achieves superior performance, with an accuracy of 98%, precision of 98%, recall of 98%, and a false positive rate of just 2%, outperforming existing state-of-the-art methods. This work demonstrates the efficacy of combining preprocessing strategies with ensemble learning to fortify network security against sophisticated Wi-Fi attacks, offering a scalable and reliable solution for IoT environments. Future directions include real-time deployment and adversarial resilience testing to further enhance the model's adaptability.

Title: WaTeRFlow: Watermark Temporal Robustness via Flow Consistency

Authors: Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang, Jongheon Jeong, Seungryong Kim, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19048
Pdf URL: https://arxiv.org/pdf/2512.19048
Copy Paste: [[2512.19048]] WaTeRFlow: Watermark Temporal Robustness via Flow Consistency(https://arxiv.org/abs/2512.19048)
Keywords: robust, watermark, diffusion, generative
Abstract: Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.

Title: Decoupled Generative Modeling for Human-Object Interaction Synthesis

Authors: Hwanhee Jung, Seunggwan Lee, Jeongyoon Yoon, SeungHyeon Kim, Giljoo Nam, Qixing Huang, Sangpil Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19049
Pdf URL: https://arxiv.org/pdf/2512.19049
Copy Paste: [[2512.19049]] Decoupled Generative Modeling for Human-Object Interaction Synthesis(https://arxiv.org/abs/2512.19049)
Keywords: generative
Abstract: Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.

Title: Efficient Personalization of Generative Models via Optimal Experimental Design

Authors: Guy Schacht, Ziyad Sheebaelhamd, Riccardo De Santi, Mojmír Mutný, Andreas Krause
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2512.19057
Pdf URL: https://arxiv.org/pdf/2512.19057
Copy Paste: [[2512.19057]] Efficient Personalization of Generative Models via Optimal Experimental Design(https://arxiv.org/abs/2512.19057)
Keywords: generative
Abstract: Preference learning from human feedback has the ability to align generative models with the needs of end-users. Human feedback is costly and time-consuming to obtain, which creates demand for data-efficient query selection methods. This work presents a novel approach that leverages optimal experimental design to ask humans the most informative preference queries, from which we can elucidate the latent reward function modeling user preferences efficiently. We formulate the problem of preference query selection as the one that maximizes the information about the underlying latent preference model. We show that this problem has a convex optimization formulation, and introduce a statistically and computationally efficient algorithm ED-PBRL that is supported by theoretical guarantees and can efficiently construct structured queries such as images or text. We empirically present the proposed framework by personalizing a text-to-image generative model to user-specific styles, showing that it requires less preference queries compared to random query selection.

Title: 6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Authors: Jihui Guo, Zongmin Zhang, Zhen Sun, Yuhao Yang, Jinlin Wu, Fu Zhang, Xinlei He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19058
Pdf URL: https://arxiv.org/pdf/2512.19058
Copy Paste: [[2512.19058]] 6DAttack: Backdoor Attacks in the 6DoF Pose Estimation(https://arxiv.org/abs/2512.19058)
Keywords: security, defense, attack, diffusion
Abstract: Deep learning advances have enabled accurate six-degree-of-freedom (6DoF) object pose estimation, widely used in robotics, AR/VR, and autonomous systems. However, backdoor attacks pose significant security risks. While most research focuses on 2D vision, 6DoF pose estimation remains largely unexplored. Unlike traditional backdoors that only change classes, 6DoF attacks must control continuous parameters like translation and rotation, rendering 2D methods inapplicable. We propose 6DAttack, a framework using 3D object triggers to induce controlled erroneous poses while maintaining normal behavior. Evaluations on PVNet, DenseFusion, and PoseDiffusion across LINEMOD, YCB-Video, and CO3D show high attack success rates (ASRs) without compromising clean performance. Backdoored models achieve up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. Furthermore, a representative defense remains ineffective. Our findings reveal a serious, underexplored threat to 6DoF pose estimation.

Title: Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation

Authors: Khanh Nguyen, Dasith de Silva Edirimuni, Ghulam Mubashar Hassan, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19088
Pdf URL: https://arxiv.org/pdf/2512.19088
Copy Paste: [[2512.19088]] Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation(https://arxiv.org/abs/2512.19088)
Keywords: segmentation
Abstract: Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at this https URL.

Title: Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges

Authors: Ariel Lubonja, Pedro R. A. S. Bassi, Wenxuan Li, Hualin Qiao, Randal Burns, Alan L. Yuille, Zongwei Zhou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19091
Pdf URL: https://arxiv.org/pdf/2512.19091
Copy Paste: [[2512.19091]] Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges(https://arxiv.org/abs/2512.19091)
Keywords: fair
Abstract: Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.

Title: A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs

Authors: Ziyan Zhang, Chao Wang, Zhuo Chen, Lei Chen, Chiyi Li, Kai Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19092
Pdf URL: https://arxiv.org/pdf/2512.19092
Copy Paste: [[2512.19092]] A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs(https://arxiv.org/abs/2512.19092)
Keywords: robust, large language model
Abstract: Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.

Title: Dual Model Deep Learning for Alzheimer Prognostication

Authors: Alireza Moayedikia, Sara Fin, Uffe Kock Wiil
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19099
Pdf URL: https://arxiv.org/pdf/2512.19099
Copy Paste: [[2512.19099]] Dual Model Deep Learning for Alzheimer Prognostication(https://arxiv.org/abs/2512.19099)
Keywords: robust
Abstract: Disease modifying therapies for Alzheimer's disease demand precise timing decisions, yet current predictive models require longitudinal observations and provide no uncertainty quantification, rendering them impractical at the critical first visit when treatment decisions must be made. We developed PROGRESS (PRognostic Generalization from REsting Static Signatures), a dual-model deep learning framework that transforms a single baseline cerebrospinal fluid biomarker assessment into actionable prognostic estimates without requiring prior clinical history. The framework addresses two complementary clinical questions: a probabilistic trajectory network predicts individualized cognitive decline with calibrated uncertainty bounds achieving near-nominal coverage, enabling honest prognostic communication; and a deep survival model estimates time to conversion from mild cognitive impairment to dementia. Using data from over 3,000 participants across 43 Alzheimer's Disease Research Centers in the National Alzheimer's Coordinating Center database, PROGRESS substantially outperforms Cox proportional hazards, Random Survival Forests, and gradient boosting methods for survival prediction. Risk stratification identifies patient groups with seven-fold differences in conversion rates, enabling clinically meaningful treatment prioritization. Leave-one-center-out validation demonstrates robust generalizability, with survival discrimination remaining strong across held-out sites despite heterogeneous measurement conditions spanning four decades of assay technologies. By combining superior survival prediction with trustworthy trajectory uncertainty quantification, PROGRESS bridges the gap between biomarker measurement and personalized clinical decision-making.

Title: Timely Parameter Updating in Over-the-Air Federated Learning

Authors: Jiaqi Zhu, Zhongyuan Zhao, Xiao Li, Ruihao Du, Shi Jin, Howard H.Yang
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2512.19103
Pdf URL: https://arxiv.org/pdf/2512.19103
Copy Paste: [[2512.19103]] Timely Parameter Updating in Over-the-Air Federated Learning(https://arxiv.org/abs/2512.19103)
Keywords: federate, fair
Abstract: Incorporating over-the-air computations (OAC) into the model training process of federated learning (FL) is an effective approach to alleviating the communication bottleneck in FL systems. Under OAC-FL, every client modulates its intermediate parameters, such as gradient, onto the same set of orthogonal waveforms and simultaneously transmits the radio signal to the edge server. By exploiting the superposition property of multiple-access channels, the edge server can obtain an automatically aggregated global gradient from the received signal. However, the limited number of orthogonal waveforms available in practical systems is fundamentally mismatched with the high dimensionality of modern deep learning models. To address this issue, we propose Freshness Freshness-mAgnItude awaRe top-k (FAIR-k), an algorithm that selects, in each communication round, the most impactful subset of gradients to be updated over the air. In essence, FAIR-k combines the complementary strengths of the Round-Robin and Top-k algorithms, striking a delicate balance between timeliness (freshness of parameter updates) and importance (gradient magnitude). Leveraging tools from Markov analysis, we characterize the distribution of parameter staleness under FAIR-k. Building on this, we establish the convergence rate of OAC-FL with FAIR-k, which discloses the joint effect of data heterogeneity, channel noise, and parameter staleness on the training efficiency. Notably, as opposed to conventional analyses that assume a universal Lipschitz constant across all the clients, our framework adopts a finer-grained model of the data heterogeneity. The analysis demonstrates that since FAIR-k promotes fresh (and fair) parameter updates, it not only accelerates convergence but also enhances communication efficiency by enabling an extended period of local training without significantly affecting overall training efficiency.

Title: HyperLoad: A Cross-Modality Enhanced Large Language Model-Based Framework for Green Data Center Cooling Load Prediction

Authors: Haoyu Jiang, Boan Qu, Junjie Zhu, Fanjie Zeng, Xiaojie Lin, Wei Zhong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19114
Pdf URL: https://arxiv.org/pdf/2512.19114
Copy Paste: [[2512.19114]] HyperLoad: A Cross-Modality Enhanced Large Language Model-Based Framework for Green Data Center Cooling Load Prediction(https://arxiv.org/abs/2512.19114)
Keywords: large language model
Abstract: The rapid growth of artificial intelligence is exponentially escalating computational demand, inflating data center energy use and carbon emissions, and spurring rapid deployment of green data centers to relieve resource and environmental stress. Achieving sub-minute orchestration of renewables, storage, and loads, while minimizing PUE and lifecycle carbon intensity, hinges on accurate load forecasting. However, existing methods struggle to address small-sample scenarios caused by cold start, load distortion, multi-source data fragmentation, and distribution shifts in green data centers. We introduce HyperLoad, a cross-modality framework that exploits pre-trained large language models (LLMs) to overcome data scarcity. In the Cross-Modality Knowledge Alignment phase, textual priors and time-series data are mapped to a common latent space, maximizing the utility of prior knowledge. In the Multi-Scale Feature Modeling phase, domain-aligned priors are injected through adaptive prefix-tuning, enabling rapid scenario adaptation, while an Enhanced Global Interaction Attention mechanism captures cross-device temporal dependencies. The public DCData dataset is released for benchmarking. Under both data sufficient and data scarce settings, HyperLoad consistently surpasses state-of-the-art (SOTA) baselines, demonstrating its practicality for sustainable green data center management.

Title: Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?

Authors: Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19115
Pdf URL: https://arxiv.org/pdf/2512.19115
Copy Paste: [[2512.19115]] Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?(https://arxiv.org/abs/2512.19115)
Keywords: interpretability, generative, large language model
Abstract: Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.

Title: Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?

Authors: Amar Lakel (MICA)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19117
Pdf URL: https://arxiv.org/pdf/2512.19117
Copy Paste: [[2512.19117]] Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?(https://arxiv.org/abs/2512.19117)
Keywords: generative, large language model
Abstract: This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ''Large Language Models'' (LLM) with that of ''Large Discourse Models'' (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the ''fascination/fear'' dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.

Title: ShadowBlock: Efficient Dynamic Anonymous Blocklisting and Its Cross-chain Application

Authors: Haotian Deng, Mengxuan Liu, Chuan Zhang, Wei Huang, Licheng Wang, Liehuang Zhu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.19124
Pdf URL: https://arxiv.org/pdf/2512.19124
Copy Paste: [[2512.19124]] ShadowBlock: Efficient Dynamic Anonymous Blocklisting and Its Cross-chain Application(https://arxiv.org/abs/2512.19124)
Keywords: privacy, protect
Abstract: Online harassment, incitement to violence, racist behavior, and other harmful content on social media can damage social harmony and even break the law. Traditional blocklisting technologies can block malicious users, but this comes at the expense of identity privacy. The anonymous blocklisting has emerged as an effective mechanism to restrict the abuse of freedom of speech while protecting user identity privacy. However, the state-of-the-art anonymous blocklisting schemes suffer from either poor dynamism or low efficiency. In this paper, we propose $\mathsf{ShadowBlock}$, an efficient dynamic anonymous blocklisting scheme. Specifically, we utilize the pseudorandom function and cryptographic accumulator to construct the public blocklisting, enabling users to prove they are not on the blocklisting in an anonymous manner. To improve verification efficiency, we design an aggregation zero-knowledge proof mechanism that converts multiple verification operations into a single one. In addition, we leverage the accumulator's property to achieve efficient updates of the blocklisting, i.e., the original proof can be reused with minimal updates rather than regenerating the entire proof. Experiments show that $\mathsf{ShadowBlock}$ has better dynamics and efficiency than the existing schemes. Finally, the discussion on applications indicates that $\mathsf{ShadowBlock}$ also holds significant value and has broad prospects in emerging fields such as cross-chain identity management.

Title: SAP: Syntactic Attention Pruning for Transformer-based Language Models

Authors: Tzu-Yun Lee, Ding-Yong Hong, Jan-Jan Wu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19125
Pdf URL: https://arxiv.org/pdf/2512.19125
Copy Paste: [[2512.19125]] SAP: Syntactic Attention Pruning for Transformer-based Language Models(https://arxiv.org/abs/2512.19125)
Keywords: robust, interpretability, transformer
Abstract: This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads of a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.

Title: AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards

Authors: Zihan Lin, Xiaohan Wang, Hexiong Yang, Jiajun Chai, Jie Cao, Guojun Yin, Wei Lin, Ran He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19126
Pdf URL: https://arxiv.org/pdf/2512.19126
Copy Paste: [[2512.19126]] AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards(https://arxiv.org/abs/2512.19126)
Keywords: large language model
Abstract: While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, natively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.

Title: QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

Authors: Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2512.19134
Pdf URL: https://arxiv.org/pdf/2512.19134
Copy Paste: [[2512.19134]] QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation(https://arxiv.org/abs/2512.19134)
Keywords: robust, large language model
Abstract: Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at this https URL.

Title: RP-CATE: Recurrent Perceptron-based Channel Attention Transformer Encoder for Industrial Hybrid Modeling

Authors: Haoran Yang, Yinan Zhang, Wenjie Zhang, Dongxia Wang, Peiyu Liu, Yuqi Ye, Kexin Chen, Wenhai Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19147
Pdf URL: https://arxiv.org/pdf/2512.19147
Copy Paste: [[2512.19147]] RP-CATE: Recurrent Perceptron-based Channel Attention Transformer Encoder for Industrial Hybrid Modeling(https://arxiv.org/abs/2512.19147)
Keywords: interpretability, transformer
Abstract: Nowadays, industrial hybrid modeling which integrates both mechanistic modeling and machine learning-based modeling techniques has attracted increasing interest from scholars due to its high accuracy, low computational cost, and satisfactory interpretability. Nevertheless, the existing industrial hybrid modeling methods still face two main limitations. First, current research has mainly focused on applying a single machine learning method to one specific task, failing to develop a comprehensive machine learning architecture suitable for modeling tasks, which limits their ability to effectively represent complex industrial scenarios. Second, industrial datasets often contain underlying associations (e.g., monotonicity or periodicity) that are not adequately exploited by current research, which can degrade model's predictive performance. To address these limitations, this paper proposes the Recurrent Perceptron-based Channel Attention Transformer Encoder (RP-CATE), with three distinctive characteristics: 1: We developed a novel architecture by replacing the self-attention mechanism with channel attention and incorporating our proposed Recurrent Perceptron (RP) Module into Transformer, achieving enhanced effectiveness for industrial modeling tasks compared to the original Transformer. 2: We proposed a new data type called Pseudo-Image Data (PID) tailored for channel attention requirements and developed a cyclic sliding window method for generating PID. 3: We introduced the concept of Pseudo-Sequential Data (PSD) and a method for converting industrial datasets into PSD, which enables the RP Module to capture the underlying associations within industrial dataset more effectively. An experiment aimed at hybrid modeling in chemical engineering was conducted by using RP-CATE and the experimental results demonstrate that RP-CATE achieves the best performance compared to other baseline models.

Title: Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments

Authors: Geraud Nangue Tasse, Matthew Riemer, Benjamin Rosman, Tim Klinger
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19154
Pdf URL: https://arxiv.org/pdf/2512.19154
Copy Paste: [[2512.19154]] Beyond Sliding Windows: Learning to Manage Memory in Non-Markovian Environments(https://arxiv.org/abs/2512.19154)
Keywords: transformer
Abstract: Recent success in developing increasingly general purpose agents based on sequence models has led to increased focus on the problem of deploying computationally limited agents within the vastly more complex real-world. A key challenge experienced in these more realistic domains is highly non-Markovian dependencies with respect to the agent's observations, which are less common in small controlled domains. The predominant approach for dealing with this in the literature is to stack together a window of the most recent observations (Frame Stacking), but this window size must grow with the degree of non-Markovian dependencies, which results in prohibitive computational and memory requirements for both action inference and learning. In this paper, we are motivated by the insight that in many environments that are highly non-Markovian with respect to time, the environment only causally depends on a relatively small number of observations over that time-scale. A natural direction would then be to consider meta-algorithms that maintain relatively small adaptive stacks of memories such that it is possible to express highly non-Markovian dependencies with respect to time while considering fewer observations at each step and thus experience substantial savings in both compute and memory requirements. Hence, we propose a meta-algorithm (Adaptive Stacking) for achieving exactly that with convergence guarantees and quantify the reduced computation and memory constraints for MLP, LSTM, and Transformer-based agents. Our experiments utilize popular memory tasks, which give us control over the degree of non-Markovian dependencies. This allows us to demonstrate that an appropriate meta-algorithm can learn the removal of memories not predictive of future rewards without excessive removal of important experiences. Code: this https URL

Title: OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions

Authors: Wendong Bu, Kaihang Pan, Yuze Lin, Jiacheng Li, Kai Shen, Wenqiao Zhang, Juncheng Li, Jun Xiao, Siliang Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19159
Pdf URL: https://arxiv.org/pdf/2512.19159
Copy Paste: [[2512.19159]] OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions(https://arxiv.org/abs/2512.19159)
Keywords: transformer, large language model
Abstract: Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: this https URL.

Title: JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation

Authors: Bingyang Kelvin Liu, Ziyu Patrick Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19171
Pdf URL: https://arxiv.org/pdf/2512.19171
Copy Paste: [[2512.19171]] JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation(https://arxiv.org/abs/2512.19171)
Keywords: robust, transformer, generative
Abstract: While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, latent space reasoning attempts for Transformer models like COCONUT do improve performance, but they ultimately rely on token-by-token generation, which still accumulates compounding error and relies on context information to gain reasoning insights. To address these limitations, we propose JEPA-Reasoner, a novel JEPA model enhanced with generative ability that reasons in latent space. We augment it with a separate action-taker model, Talker, to produce human-readable sentences. Our approach demonstrates that decoupling latent space reasoning and token generation enables JEPA-Reasoner to produce mixed latent vectors that might lay the foundation for multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.

Title: Practical Quantum-Classical Feature Fusion for complex data Classification

Authors: Azadeh Alavi, Fatemeh Kouchmeshki, Abdolrahman Alavi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19180
Pdf URL: https://arxiv.org/pdf/2512.19180
Copy Paste: [[2512.19180]] Practical Quantum-Classical Feature Fusion for complex data Classification(https://arxiv.org/abs/2512.19180)
Keywords: robust
Abstract: Hybrid quantum and classical learning aims to couple quantum feature maps with the robustness of classical neural networks, yet most architectures treat the quantum circuit as an isolated feature extractor and merge its measurements with classical representations by direct concatenation. This neglects that the quantum and classical branches constitute distinct computational modalities and limits reliable performance on complex, high dimensional tabular and semi structured data, including remote sensing, environmental monitoring, and medical diagnostics. We present a multimodal formulation of hybrid learning and propose a cross attention mid fusion architecture in which a classical representation queries quantum derived feature tokens through an attention block with residual connectivity. The quantum branch is kept within practical NISQ budgets and uses up to nine qubits. We evaluate on Wine, Breast Cancer, Forest CoverType, FashionMNIST, and SteelPlatesFaults, comparing a quantum only model, a classical baseline, residual hybrid models, and the proposed mid fusion model under a consistent protocol. Pure quantum and standard hybrid designs underperform due to measurement induced information loss, while cross attention mid fusion is consistently competitive and improves performance on the more complex datasets in most cases. These findings suggest that quantum derived information becomes most valuable when integrated through principled multimodal fusion rather than used in isolation or loosely appended to classical features.

Title: Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

Authors: Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19184
Pdf URL: https://arxiv.org/pdf/2512.19184
Copy Paste: [[2512.19184]] Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning(https://arxiv.org/abs/2512.19184)
Keywords: robust
Abstract: This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multitask learning with deep learning architectures, an area that has been relatively unexplored until recent developments.

Title: Evaluating MCC for Low-Frequency Cyberattack Detection in Imbalanced Intrusion Detection Data

Authors: Prameshwar Thiyagarajan, Chad A. Williams
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.19203
Pdf URL: https://arxiv.org/pdf/2512.19203
Copy Paste: [[2512.19203]] Evaluating MCC for Low-Frequency Cyberattack Detection in Imbalanced Intrusion Detection Data(https://arxiv.org/abs/2512.19203)
Keywords: attack
Abstract: In many real-world network environments, several types of cyberattacks occur at very low rates compared to benign traffic, making them difficult for intrusion detection systems (IDS) to detect reliably. This imbalance causes traditional evaluation metrics, such as accuracy, to often overstate model performance in these conditions, masking failures on minority attack classes that are most important in practice. In this paper, we evaluate a set of base and meta classifiers on low-traffic attacks in the CSE-CIC-IDS2018 dataset and compare their reliability in terms of accuracy and Matthews Correlation Coefficient (MCC). The results show that accuracy consistently inflates performance, while MCC provides a more accurate assessment of a classifier's performance across both majority and minority classes. Meta-classification methods, such as LogitBoost and AdaBoost, demonstrate more effective minority class detection when measured by MCC, revealing trends that accuracy fails to capture. These findings establish the need for imbalance-aware evaluation and make MCC a more trustworthy metric for IDS research involving low-traffic cyberattacks.

Title: MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning

Authors: Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19206
Pdf URL: https://arxiv.org/pdf/2512.19206
Copy Paste: [[2512.19206]] MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning(https://arxiv.org/abs/2512.19206)
Keywords: large language model
Abstract: Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization for value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.

Title: InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training

Authors: Zihao Luo, Shaohao Rui, Zhenyu Tang, Guotai Wang, Xiaosong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19213
Pdf URL: https://arxiv.org/pdf/2512.19213
Copy Paste: [[2512.19213]] InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training(https://arxiv.org/abs/2512.19213)
Keywords: privacy
Abstract: Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.

Title: HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry

Authors: Na Gao, Chenfei Ye, Yanwu Yang, Anqi Li, Zhengbo He, Li Liang, Zhiyuan Liu, Xingyu Hao, Ting Ma, Tengfei Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19214
Pdf URL: https://arxiv.org/pdf/2512.19214
Copy Paste: [[2512.19214]] HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry(https://arxiv.org/abs/2512.19214)
Keywords: generative
Abstract: Accurate characterization of hippocampal substructure is crucial for detecting subtle structural changes and identifying early neurodegenerative biomarkers. However, high inter-subject variability and complex folding pattern of human hippocampus hinder consistent cross-subject and longitudinal analysis. Most existing approaches rely on subject-specific modelling and lack a stable intrinsic coordinate system to accommodate anatomical variability, which limits their ability to establish reliable inter- and intra-individual correspondence. To address this, we propose HippMetric, a skeletal representation (s-rep)-based framework for hippocampal substructural morphometry and point-wise correspondence across individuals and scans. HippMetric builds on the Axis-Referenced Morphometric Model (ARMM) and employs a deformable skeletal coordinate system aligned with hippocampal anatomy and function, providing a biologically grounded reference for correspondence. Our framework comprises two core modules: a skeletal-based coordinate system that respects the hippocampus' conserved longitudinal lamellar architecture, in which functional units (lamellae) are stacked perpendicular to the long-axis, enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality and non-intersection to produce mathematically valid skeletal geometry. Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability compared to existing shape models.

Title: Towards Minimal Fine-Tuning of VLMs

Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19219
Pdf URL: https://arxiv.org/pdf/2512.19219
Copy Paste: [[2512.19219]] Towards Minimal Fine-Tuning of VLMs(https://arxiv.org/abs/2512.19219)
Keywords: transformer
Abstract: We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.

Title: Regression generation adversarial network based on dual data evaluation strategy for industrial application

Authors: Zesen Wang, Yonggang Li, Lijuan Lan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19232
Pdf URL: https://arxiv.org/pdf/2512.19232
Copy Paste: [[2512.19232]] Regression generation adversarial network based on dual data evaluation strategy for industrial application(https://arxiv.org/abs/2512.19232)
Keywords: generative
Abstract: Soft sensing infers hard-to-measure data through a large number of easily obtainable variables. However, in complex industrial scenarios, the issue of insufficient data volume persists, which diminishes the reliability of soft sensing. Generative Adversarial Networks (GAN) are one of the effective solutions for addressing insufficient samples. Nevertheless, traditional GAN fail to account for the mapping relationship between labels and features, which limits further performance improvement. Although some studies have proposed solutions, none have considered both performance and efficiency simultaneously. To address these problems, this paper proposes the multi-task learning-based regression GAN framework that integrates regression information into both the discriminator and generator, and implements a shallow sharing mechanism between the discriminator and regressor. This approach significantly enhances the quality of generated samples while improving the algorithm's operational efficiency. Moreover, considering the importance of training samples and generated samples, a dual data evaluation strategy is designed to make GAN generate more diverse samples, thereby increasing the generalization of subsequent modeling. The superiority of method is validated through four classic industrial soft sensing cases: wastewater treatment plants, surface water, $CO_2$ absorption towers, and industrial gas turbines.

Title: Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation

Authors: Anna-Maria Gueorguieva, Aylin Caliskan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19238
Pdf URL: https://arxiv.org/pdf/2512.19238
Copy Paste: [[2512.19238]] Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation(https://arxiv.org/abs/2512.19238)
Keywords: protect, large language model
Abstract: Large language models (LLMs) have been shown to exhibit social bias, however, bias towards non-protected stigmatized identities remain understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example deciding wether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g. Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.

Title: ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models

Authors: Mingxu Zhang, Dazhong Shen, Qi Zhang, Ying Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19240
Pdf URL: https://arxiv.org/pdf/2512.19240
Copy Paste: [[2512.19240]] ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models(https://arxiv.org/abs/2512.19240)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model's general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM's intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.

Title: VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis

Authors: Meng Chu, Senqiao Yang, Haoxuan Che, Suiyun Zhang, Xichen Zhang, Shaozuo Yu, Haokun Gui, Zhefan Rao, Dandan Tu, Rui Liu, Jiaya Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19243
Pdf URL: https://arxiv.org/pdf/2512.19243
Copy Paste: [[2512.19243]] VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis(https://arxiv.org/abs/2512.19243)
Keywords: generative
Abstract: Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.

Title: Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics

Authors: Do Minh Duc, Quan Xuan Truong, Nguyen Tat Dat, Nguyen Van Vinh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19247
Pdf URL: https://arxiv.org/pdf/2512.19247
Copy Paste: [[2512.19247]] Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics(https://arxiv.org/abs/2512.19247)
Keywords: large language model
Abstract: Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts - particularly those enhanced via Auto-CoT and RAG - improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.

Title: Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems

Authors: Prathamesh Devadiga
Subjects: cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2512.19250
Pdf URL: https://arxiv.org/pdf/2512.19250
Copy Paste: [[2512.19250]] Small Language Models as Compiler Experts: Auto-Parallelization for Heterogeneous Systems(https://arxiv.org/abs/2512.19250)
Keywords: robust
Abstract: Traditional auto-parallelizing compilers, reliant on rigid heuristics, struggle with the complexity of modern heterogeneous systems. This paper presents a comprehensive evaluation of small (approximately 1B parameter) language-model-driven compiler auto-parallelization. We evaluate three models: gemma3, llama3.2, and qwen2.5, using six reasoning strategies across 11 real-world kernels drawn from scientific computing, graph algorithms, and machine learning. Our system is benchmarked against strong compiler baselines, including LLVM Polly, TVM, and Triton. Across 376 total evaluations, the proposed approach achieves an average speedup of 6.81x and a peak performance of 43.25x on convolution operations. We analyze scalability, verify correctness using multiple sanitizers, and confirm robustness across diverse compilers and hardware platforms. Our results demonstrate that small, efficient language models can serve as powerful reasoning engines for complex compiler optimization tasks.

Title: Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation

Authors: Ivan DeAndres-Tame, Chengwei Ye, Ruben Tolosana, Ruben Vera-Rodriguez, Shiqi Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19275
Pdf URL: https://arxiv.org/pdf/2512.19275
Copy Paste: [[2512.19275]] Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation(https://arxiv.org/abs/2512.19275)
Keywords: biometric, generative
Abstract: Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.

Title: Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context

Authors: Kyungwon Cho, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19283
Pdf URL: https://arxiv.org/pdf/2512.19283
Copy Paste: [[2512.19283]] Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context(https://arxiv.org/abs/2512.19283)
Keywords: diffusion
Abstract: Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.

Title: GShield: Mitigating Poisoning Attacks in Federated Learning

Authors: Sameera K. M., Serena Nicolazzo, Antonino Nocera, Vinod P., Rafidha Rehiman K. A
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.19286
Pdf URL: https://arxiv.org/pdf/2512.19286
Copy Paste: [[2512.19286]] GShield: Mitigating Poisoning Attacks in Federated Learning(https://arxiv.org/abs/2512.19286)
Keywords: privacy, defense, attack, robust, federate
Abstract: Federated Learning (FL) has recently emerged as a revolutionary approach to collaborative training Machine Learning models. In particular, it enables decentralized model training while preserving data privacy, but its distributed nature makes it highly vulnerable to a severe attack known as Data Poisoning. In such scenarios, malicious clients inject manipulated data into the training process, thereby degrading global model performance or causing targeted misclassification. In this paper, we present a novel defense mechanism called GShield, designed to detect and mitigate malicious and low-quality updates, especially under non-independent and identically distributed (non-IID) data scenarios. GShield operates by learning the distribution of benign gradients through clustering and Gaussian modeling during an initial round, enabling it to establish a reliable baseline of trusted client behavior. With this benign profile, GShield selectively aggregates only those updates that align with the expected gradient patterns, effectively isolating adversarial clients and preserving the integrity of the global model. An extensive experimental campaign demonstrates that our proposed defense significantly improves model robustness compared to the state-of-the-art methods while maintaining a high accuracy of performance across both tabular and image datasets. Furthermore, GShield improves the accuracy of the targeted class by 43\% to 65\% after detecting malicious and low-quality clients.

Title: Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models

Authors: Linzhi Chen, Yang Sun, Hongru Wei, Yuqi Chen
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19297
Pdf URL: https://arxiv.org/pdf/2512.19297
Copy Paste: [[2512.19297]] Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models(https://arxiv.org/abs/2512.19297)
Keywords: security, defense, attack, robust, steal, large language model
Abstract: Low-Rank Adaptation (LoRA) has emerged as an efficient method for fine-tuning large language models (LLMs) and is widely adopted within the open-source community. However, the decentralized dissemination of LoRA adapters through platforms such as Hugging Face introduces novel security vulnerabilities: malicious adapters can be easily distributed and evade conventional oversight mechanisms. Despite these risks, backdoor attacks targeting LoRA-based fine-tuning remain relatively underexplored. Existing backdoor attack strategies are ill-suited to this setting, as they often rely on inaccessible training data, fail to account for the structural properties unique to LoRA, or suffer from high false trigger rates (FTR), thereby compromising their stealth. To address these challenges, we propose Causal-Guided Detoxify Backdoor Attack (CBA), a novel backdoor attack framework specifically designed for open-weight LoRA models. CBA operates without access to original training data and achieves high stealth through two key innovations: (1) a coverage-guided data generation pipeline that synthesizes task-aligned inputs via behavioral exploration, and (2) a causal-guided detoxification strategy that merges poisoned and clean adapters by preserving task-critical neurons. Unlike prior approaches, CBA enables post-training control over attack intensity through causal influence-based weight allocation, eliminating the need for repeated retraining. Evaluated across six LoRA models, CBA achieves high attack success rates while reducing FTR by 50-70\% compared to baseline methods. Furthermore, it demonstrates enhanced resistance to state-of-the-art backdoor defenses, highlighting its stealth and robustness.

Title: RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning

Authors: Jun Li, Zikun Chen, Haibo Chen, Shuo Chen, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19300
Pdf URL: https://arxiv.org/pdf/2512.19300
Copy Paste: [[2512.19300]] RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning(https://arxiv.org/abs/2512.19300)
Keywords: robust
Abstract: Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs-manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer's superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.

Title: Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing

Authors: Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19302
Pdf URL: https://arxiv.org/pdf/2512.19302
Copy Paste: [[2512.19302]] Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing(https://arxiv.org/abs/2512.19302)
Keywords: segmentation
Abstract: Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at this https URL.

Title: CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs

Authors: Javier Vela-Tambo, Jorge Gracia, Fernando Dominguez-Castro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19305
Pdf URL: https://arxiv.org/pdf/2512.19305
Copy Paste: [[2512.19305]] CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs(https://arxiv.org/abs/2512.19305)
Keywords: extraction, generative, large language model
Abstract: Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles on a large scale. To that end, we have developed CienaLLM, a modular framework based on schema-guided Generative Information Extraction. CienaLLM uses open-weight Large Language Models for zero-shot information extraction from news articles, and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. To systematically assess how the choice of LLM family, size, precision regime, and prompting strategy affect performance, we run a large factorial study in models, precisions, and prompt engineering techniques. An additional response parsing step nearly eliminates format errors while preserving accuracy; larger models deliver the strongest and most stable performance, while quantization offers substantial efficiency gains with modest accuracy trade-offs; and prompt strategies show heterogeneous, model-specific effects. CienaLLM matches or outperforms the supervised baseline in accuracy for extracting drought impacts from Spanish news, although at a higher inference cost. While evaluated in droughts, the schema-driven and model-agnostic design is suitable for adapting to related information extraction tasks (e.g., other hazards, sectors, or languages) by editing prompts and schemas rather than retraining. We release code, configurations, and schemas to support reproducible use.

Title: Time-Vertex Machine Learning for Optimal Sensor Placement in Temporal Graph Signals: Applications in Structural Health Monitoring

Authors: Keivan Faghih Niresi, Jun Qing, Mengjie Zhao, Olga Fink
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2512.19309
Pdf URL: https://arxiv.org/pdf/2512.19309
Copy Paste: [[2512.19309]] Time-Vertex Machine Learning for Optimal Sensor Placement in Temporal Graph Signals: Applications in Structural Health Monitoring(https://arxiv.org/abs/2512.19309)
Keywords: robust
Abstract: Structural Health Monitoring (SHM) plays a crucial role in maintaining the safety and resilience of infrastructure. As sensor networks grow in scale and complexity, identifying the most informative sensors becomes essential to reduce deployment costs without compromising monitoring quality. While Graph Signal Processing (GSP) has shown promise by leveraging spatial correlations among sensor nodes, conventional approaches often overlook the temporal dynamics of structural behavior. To overcome this limitation, we propose Time-Vertex Machine Learning (TVML), a novel framework that integrates GSP, time-domain analysis, and machine learning to enable interpretable and efficient sensor placement by identifying representative nodes that minimize redundancy while preserving critical information. We evaluate the proposed approach on two bridge datasets for damage detection and time-varying graph signal reconstruction tasks. The results demonstrate the effectiveness of our approach in enhancing SHM systems by providing a robust, adaptive, and efficient solution for sensor placement.

Title: MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

Authors: Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu, Jingdong Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19311
Pdf URL: https://arxiv.org/pdf/2512.19311
Copy Paste: [[2512.19311]] MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture(https://arxiv.org/abs/2512.19311)
Keywords: diffusion
Abstract: This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network for each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieve strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.

Title: Protecting Quantum Circuits Through Compiler-Resistant Obfuscation

Authors: Pradyun Parayil, Amal Raj, Vivek Balachandran
Subjects: cs.CR, quant-ph
Abstract URL: https://arxiv.org/abs/2512.19314
Pdf URL: https://arxiv.org/pdf/2512.19314
Copy Paste: [[2512.19314]] Protecting Quantum Circuits Through Compiler-Resistant Obfuscation(https://arxiv.org/abs/2512.19314)
Keywords: protect, defense
Abstract: Quantum circuit obfuscation is becoming increasingly important to prevent theft and reverse engineering of quantum algorithms. As quantum computing advances, the need to protect the intellectual property contained in quantum circuits continues to grow. Existing methods often provide limited defense against structural and statistical analysis or introduce considerable overhead. In this paper, we propose a novel quantum obfuscation method that uses randomized U3 transformations to conceal circuit structure while preserving functionality. We implement and assess our approach on QASM circuits using Qiskit AER, achieving over 93\% semantic accuracy with minimal runtime overhead. The method demonstrates strong resistance to reverse engineering and structural inference, making it a practical and effective approach for quantum software protection.

Title: Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations

Authors: Marica Muffoletto, Uxio Hermida, Charlène Mauger, Avan Suinesiaputra, Yiyang Xu, Richard Burns, Lisa Pankewitz, Andrew D McCulloch, Steffen E Petersen, Daniel Rueckert, Alistair A Young
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.19316
Pdf URL: https://arxiv.org/pdf/2512.19316
Copy Paste: [[2512.19316]] Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations(https://arxiv.org/abs/2512.19316)
Keywords: robust, segmentation
Abstract: Accurate reconstruction of cardiac anatomy from sparse clinical images remains a major challenge in patient-specific modeling. While neural implicit functions have previously been applied to this task, their application to mapping anatomical consistency across subjects has been limited. In this work, we introduce Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system, based on universal ventricular coordinates, that provides a common anatomical reference frame for the human heart. Our method predicts NIHCs directly from a limited number of 2D segmentations (sparse acquisition) and subsequently decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on a large dataset of 5,000 cardiac meshes, the model achieves high reconstruction accuracy on clinical contours, with mean Euclidean surface errors of 2.51$\pm$0.33 mm in a diseased cohort (n=4549) and 2.3$\pm$0.36 mm in a healthy cohort (n=5576). The NIHC representation enables anatomically coherent reconstruction even under severe slice sparsity and segmentation noise, faithfully recovering complex structures such as the valve planes. Compared with traditional pipelines, inference time is reduced from over 60 s to 5-15 s. These results demonstrate that NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data.

Title: Alternative positional encoding functions for neural transformers

Authors: Ezequiel Lopez-Rubio, Macoris Decena-Gimenez, Rafael Marcos Luque-Baena
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19323
Pdf URL: https://arxiv.org/pdf/2512.19323
Copy Paste: [[2512.19323]] Alternative positional encoding functions for neural transformers(https://arxiv.org/abs/2512.19323)
Keywords: transformer
Abstract: A key module in neural transformer-based deep architectures is positional encoding. This module enables a suitable way to encode positional information as input for transformer neural layers. This success has been rooted in the use of sinusoidal functions of various frequencies, in order to capture recurrent patterns of differing typical periods. In this work, an alternative set of periodic functions is proposed for positional encoding. These functions preserve some key properties of sinusoidal ones, while they depart from them in fundamental ways. Some tentative experiments are reported, where the original sinusoidal version is substantially outperformed. This strongly suggests that the alternative functions may have a wider use in other transformer architectures.

Title: DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis

Authors: Yueting Zhu, Yuehao Song, Shuai Zhang, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19331
Pdf URL: https://arxiv.org/pdf/2512.19331
Copy Paste: [[2512.19331]] DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis(https://arxiv.org/abs/2512.19331)
Keywords: robust, extraction
Abstract: Whole Slide Images (WSIs) are typically analyzed using multiple instance learning (MIL) methods. However, the scale and heterogeneity of WSIs generate highly redundant and dispersed information, making it difficult to identify and integrate discriminative signals. Existing MIL methods either fail to discard uninformative cues effectively or have limited ability to consolidate relevant features from multiple patches, which restricts their performance on large and heterogeneous WSIs. To address this issue, we propose DeltaMIL, a novel MIL framework that explicitly selects semantically relevant regions and integrates the discriminative information from WSIs. Our method leverages the gated delta rule to efficiently filter and integrate information through a block combining forgetting and memory mechanisms. The delta mechanism dynamically updates the memory by removing old values and inserting new ones according to their correlation with the current patch. The gating mechanism further enables rapid forgetting of irrelevant signals. Additionally, DeltaMIL integrates a complementary local pattern mixing mechanism to retain fine-grained pathological locality. Our design enhances the extraction of meaningful cues and suppresses redundant or noisy information, which improves the model's robustness and discriminative power. Experiments demonstrate that DeltaMIL achieves state-of-the-art performance. Specifically, for survival prediction, DeltaMIL improves performance by 3.69\% using ResNet-50 features and 2.36\% using UNI features. For slide-level classification, it increases accuracy by 3.09\% with ResNet-50 features and 3.75\% with UNI features. These results demonstrate the strong and consistent performance of DeltaMIL across diverse WSI tasks.

Title: GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis

Authors: Siyuan Mei, Yan Xia, Fuxin Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19336
Pdf URL: https://arxiv.org/pdf/2512.19336
Copy Paste: [[2512.19336]] GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis(https://arxiv.org/abs/2512.19336)
Keywords: generative, segmentation
Abstract: The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss and a combination of Dice loss and cross-entropy for multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of $5\times10^{-4}$ and $1\times10^{-3}$, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range $[-1024, 1000]$. Data augmentation involves random zooming within $(0.8, 1.3)$ (for MRI-to-CT only), fixed-size cropping to $32\times160\times192$ for MRI-to-CT and $32\times128\times128$ for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with $0.8$ overlap and average folding to reconstruct the full-size sCT, followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.

Title: ReasonCD: A Multimodal Reasoning Large Model for Implicit Change-of-Interest Semantic Mining

Authors: Zhenyang Huang, Xiao Yu, Yi Zhang, Decheng Wang, Hang Ruan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19354
Pdf URL: https://arxiv.org/pdf/2512.19354
Copy Paste: [[2512.19354]] ReasonCD: A Multimodal Reasoning Large Model for Implicit Change-of-Interest Semantic Mining(https://arxiv.org/abs/2512.19354)
Keywords: large language model
Abstract: Remote sensing image change detection is one of the fundamental tasks in remote sensing intelligent interpretation. Its core objective is to identify changes within change regions of interest (CRoI). Current multimodal large models encode rich human semantic knowledge, which is utilized for guidance in tasks such as remote sensing change detection. However, existing methods that use semantic guidance for detecting users' CRoI overly rely on explicit textual descriptions of CRoI, leading to the problem of near-complete performance failure when presented with implicit CRoI textual descriptions. This paper proposes a multimodal reasoning change detection model named ReasonCD, capable of mining users' implicit task intent. The model leverages the powerful reasoning capabilities of pre-trained large language models to mine users' implicit task intents and subsequently obtains different change detection results based on these intents. Experiments on public datasets demonstrate that the model achieves excellent change detection performance, with an F1 score of 92.1\% on the BCDD dataset. Furthermore, to validate its superior reasoning functionality, this paper annotates a subset of reasoning data based on the SECOND dataset. Experimental results show that the model not only excels at basic reasoning-based change detection tasks but can also explain the reasoning process to aid human decision-making.

Title: Interpretable Hybrid Deep Q-Learning Framework for IoT-Based Food Spoilage Prediction with Synthetic Data Generation and Hardware Validation

Authors: Isshaan Singh, Divyansh Chawla, Anshu Garg, Shivin Mangal, Pallavi Gupta, Khushi Agarwal, Nimrat Singh Khalsa, Nandan Patel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19361
Pdf URL: https://arxiv.org/pdf/2512.19361
Copy Paste: [[2512.19361]] Interpretable Hybrid Deep Q-Learning Framework for IoT-Based Food Spoilage Prediction with Synthetic Data Generation and Hardware Validation(https://arxiv.org/abs/2512.19361)
Keywords: robust, interpretability
Abstract: The need for an intelligent, real-time spoilage prediction system has become critical in modern IoT-driven food supply chains, where perishable goods are highly susceptible to environmental conditions. Existing methods often lack adaptability to dynamic conditions and fail to optimize decision making in real time. To address these challenges, we propose a hybrid reinforcement learning framework integrating Long Short-Term Memory (LSTM) and Recurrent Neural Networks (RNN) for enhanced spoilage prediction. This hybrid architecture captures temporal dependencies within sensor data, enabling robust and adaptive decision making. In alignment with interpretable artificial intelligence principles, a rule-based classifier environment is employed to provide transparent ground truth labeling of spoilage levels based on domain-specific thresholds. This structured design allows the agent to operate within clearly defined semantic boundaries, supporting traceable and interpretable decisions. Model behavior is monitored using interpretability-driven metrics, including spoilage accuracy, reward-to-step ratio, loss reduction rate, and exploration decay. These metrics provide both quantitative performance evaluation and insights into learning dynamics. A class-wise spoilage distribution visualization is used to analyze the agents decision profile and policy behavior. Extensive evaluations on simulated and real-time hardware data demonstrate that the LSTM and RNN based agent outperforms alternative reinforcement learning approaches in prediction accuracy and decision efficiency while maintaining interpretability. The results highlight the potential of hybrid deep reinforcement learning with integrated interpretability for scalable IoT-based food monitoring systems.

Title: From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples

Authors: Canran Xiao, Jiabao Dou, Zhiming Lin, Zong Ke, Liwei Hou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19363
Pdf URL: https://arxiv.org/pdf/2512.19363
Copy Paste: [[2512.19363]] From Points to Coalitions: Hierarchical Contrastive Shapley Values for Prioritizing Data Samples(https://arxiv.org/abs/2512.19363)
Keywords: fair
Abstract: How should we quantify the value of each training example when datasets are large, heterogeneous, and geometrically structured? Classical Data-Shapley answers in principle, but its O(n!) complexity and point-wise perspective are ill-suited to modern scales. We propose Hierarchical Contrastive Data Valuation (HCDV), a three-stage framework that (i) learns a contrastive, geometry-preserving representation, (ii) organizes the data into a balanced coarse-to-fine hierarchy of clusters, and (iii) assigns Shapley-style payoffs to coalitions via local Monte-Carlo games whose budgets are propagated downward. HCDV collapses the factorial burden to O(T sum_{l} K_{l}) = O(T K_max log n), rewards examples that sharpen decision boundaries, and regularizes outliers through curvature-based smoothness. We prove that HCDV approximately satisfies the four Shapley axioms with surplus loss O(eta log n), enjoys sub-Gaussian coalition deviation tilde O(1/sqrt{T}), and incurs at most k epsilon_infty regret for top-k selection. Experiments on four benchmarks--tabular, vision, streaming, and a 45M-sample CTR task--plus the OpenDataVal suite show that HCDV lifts accuracy by up to +5 pp, slashes valuation time by up to 100x, and directly supports tasks such as augmentation filtering, low-latency streaming updates, and fair marketplace payouts.

Title: Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization

Authors: Zhongwei Chen, Hai-Jun Rong, Zhao-Xu Yang, Guoqi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19365
Pdf URL: https://arxiv.org/pdf/2512.19365
Copy Paste: [[2512.19365]] Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization(https://arxiv.org/abs/2512.19365)
Keywords: transformer
Abstract: Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs), which benefit from spike-driven computation, inherently provide low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Meanwhile, the inherent sparsity of spike-driven computation for representation learning scenarios also results in loss of critical information and difficulties in learning long-range dependencies when aligning heterogeneous visual data sources. To address these, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone is adopted to extract coarse-grained features. To mitigate the loss of critical information, the spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs. Compared with advanced ANNs, it also achieves competitive this http URL code is available at this https URL

Title: HATS: High-Accuracy Triple-Set Watermarking for Large Language Models

Authors: Zhiqing Hu, Chenxu Zhao, Jiazhong Lu, Xiaolei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19378
Pdf URL: https://arxiv.org/pdf/2512.19378
Copy Paste: [[2512.19378]] HATS: High-Accuracy Triple-Set Watermarking for Large Language Models(https://arxiv.org/abs/2512.19378)
Keywords: watermark, large language model
Abstract: Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.

Title: OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation

Authors: Xueming Yan, Boyan Xu, Yaochu Jin, Lixian Xiao, Wenlong Ye, Runyang Cai, Zeqi Zheng, Jingfa Liu, Aimin Yang
Subjects: cs.LG, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2512.19379
Pdf URL: https://arxiv.org/pdf/2512.19379
Copy Paste: [[2512.19379]] OmniMER: Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation(https://arxiv.org/abs/2512.19379)
Keywords: extraction
Abstract: Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. this https URL

Title: Brain-Grounded Axes for Reading and Steering LLM States

Authors: Sandro Andric
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19399
Pdf URL: https://arxiv.org/pdf/2512.19399
Copy Paste: [[2512.19399]] Brain-Grounded Axes for Reading and Steering LLM States(https://arxiv.org/abs/2512.19399)
Keywords: robust, interpretability, large language model
Abstract: Interpretability methods for large language models (LLMs) typically derive directions from textual supervision, which can lack external grounding. We propose using human brain activity not as a training signal but as a coordinate system for reading and steering LLM states. Using the SMN4Lang MEG dataset, we construct a word-level brain atlas of phase-locking value (PLV) patterns and extract latent axes via ICA. We validate axes with independent lexica and NER-based labels (POS/log-frequency used as sanity checks), then train lightweight adapters that map LLM hidden states to these brain axes without fine-tuning the LLM. Steering along the resulting brain-derived directions yields a robust lexical (frequency-linked) axis in a mid TinyLlama layer, surviving perplexity-matched controls, and a brain-vs-text probe comparison shows larger log-frequency shifts (relative to the text probe) with lower perplexity for the brain axis. A function/content axis (axis 13) shows consistent steering in TinyLlama, Qwen2-0.5B, and GPT-2, with PPL-matched text-level corroboration. Layer-4 effects in TinyLlama are large but inconsistent, so we treat them as secondary (Appendix). Axis structure is stable when the atlas is rebuilt without GPT embedding-change features or with word2vec embeddings (|r|=0.64-0.95 across matched axes), reducing circularity concerns. Exploratory fMRI anchoring suggests potential alignment for embedding change and log frequency, but effects are sensitive to hemodynamic modeling assumptions and are treated as population-level evidence only. These results support a new interface: neurophysiology-grounded axes provide interpretable and controllable handles for LLM behavior.

Title: Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara

Authors: Yacouba Diarra, Panga Azazia Kamate, Nouhoum Souleymane Coulibaly, Michael Leventhal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19400
Pdf URL: https://arxiv.org/pdf/2512.19400
Copy Paste: [[2512.19400]] Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara(https://arxiv.org/abs/2512.19400)
Keywords: robust
Abstract: We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetuned Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47\% to 37.12\% on one and from 36.07\% to 32.33\% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.

Title: From Retrieval to Reasoning: A Framework for Cyber Threat Intelligence NER with Explicit and Adaptive Instructions

Authors: Jiaren Peng, Hongda Sun, Xuan Tian, Cheng Huang, Zeqing Li, Rui Yan
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2512.19414
Pdf URL: https://arxiv.org/pdf/2512.19414
Copy Paste: [[2512.19414]] From Retrieval to Reasoning: A Framework for Cyber Threat Intelligence NER with Explicit and Adaptive Instructions(https://arxiv.org/abs/2512.19414)
Keywords: large language model
Abstract: The automation of Cyber Threat Intelligence (CTI) relies heavily on Named Entity Recognition (NER) to extract critical entities from unstructured text. Currently, Large Language Models (LLMs) primarily address this task through retrieval-based In-Context Learning (ICL). This paper analyzes this mainstream paradigm, revealing a fundamental flaw: its success stems not from global semantic similarity but largely from the incidental overlap of entity types within retrieved examples. This exposes the limitations of relying on unreliable implicit induction. To address this, we propose TTPrompt, a framework shifting from implicit induction to explicit instruction. TTPrompt maps the core concepts of CTI's Tactics, Techniques, and Procedures (TTPs) into an instruction hierarchy: formulating task definitions as Tactics, guiding strategies as Techniques, and annotation guidelines as Procedures. Furthermore, to handle the adaptability challenge of static guidelines, we introduce Feedback-driven Instruction Refinement (FIR). FIR enables LLMs to self-refine guidelines by learning from errors on minimal labeled data, adapting to distinct annotation dialects. Experiments on five CTI NER benchmarks demonstrate that TTPrompt consistently surpasses retrieval-based baselines. Notably, with refinement on just 1% of training data, it rivals models fine-tuned on the full dataset. For instance, on LADDER, its Micro F1 of 71.96% approaches the fine-tuned baseline, and on the complex CTINexus, its Macro F1 exceeds the fine-tuned ACLM model by 10.91%.

Title: CodeSimpleQA: Scaling Factuality in Code Large Language Models

Authors: Jian Yang, Wei Zhang, Yizhi Li, Shawn Guo, Haowen Wang, Aishan Liu, Ge Zhang, Zili Wang, Zhoujun Li, Xianglong Liu, Weifeng Lv
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19424
Pdf URL: https://arxiv.org/pdf/2512.19424
Copy Paste: [[2512.19424]] CodeSimpleQA: Scaling Factuality in Code Large Language Models(https://arxiv.org/abs/2512.19424)
Keywords: large language model
Abstract: Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.

Title: Attention Is Not What You Need

Authors: Zhang Chong
Subjects: cs.LG, cs.AI, math.AG
Abstract URL: https://arxiv.org/abs/2512.19428
Pdf URL: https://arxiv.org/pdf/2512.19428
Copy Paste: [[2512.19428]] Attention Is Not What You Need(https://arxiv.org/abs/2512.19428)
Keywords: transformer
Abstract: We revisit a basic question in sequence modeling: is explicit self-attention actually necessary for strong performance and reasoning? We argue that standard multi-head attention is best seen as a form of tensor lifting: hidden vectors are mapped into a high-dimensional space of pairwise interactions, and learning proceeds by constraining this lifted tensor through gradient descent. This mechanism is extremely expressive but mathematically opaque, because after many layers it becomes very hard to describe the model with a small family of explicit invariants. To explore an alternative, we propose an attention-free architecture based on Grassmann flows. Instead of forming an L by L attention matrix, our Causal Grassmann layer (i) linearly reduces token states, (ii) encodes local token pairs as two-dimensional subspaces on a Grassmann manifold via Plucker coordinates, and (iii) fuses these geometric features back into the hidden states through gated mixing. Information therefore propagates by controlled deformations of low-rank subspaces over multi-scale local windows, so the core computation lives on a finite-dimensional manifold rather than in an unstructured tensor space. On the Wikitext-2 language modeling benchmark, purely Grassmann-based models with 13 to 18 million parameters achieve validation perplexities within about 10 to 15 percent of size-matched Transformers. On the SNLI natural language inference task, a Grassmann-Plucker head on top of DistilBERT slightly outperforms a Transformer head, with best validation and test accuracies of 0.8550 and 0.8538 compared to 0.8545 and 0.8511. We analyze the complexity of Grassmann mixing, show linear scaling in sequence length for fixed rank, and argue that such manifold-based designs offer a more structured route toward geometric and invariant-based interpretations of neural reasoning.

Title: MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments

Authors: Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19432
Pdf URL: https://arxiv.org/pdf/2512.19432
Copy Paste: [[2512.19432]] MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments(https://arxiv.org/abs/2512.19432)
Keywords: robust
Abstract: Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.

Title: dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models

Authors: Yi Xin, Siqi Luo, Qi Qin, Haoxing Chen, Kaiwen Zhu, Zhiwei Zhang, Yangfan He, Rongchao Zhang, Jinbin Bai, Shuo Cao, Bin Fu, Junjun He, Yihao Liu, Yuewen Cao, Xiaohong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19433
Pdf URL: https://arxiv.org/pdf/2512.19433
Copy Paste: [[2512.19433]] dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models(https://arxiv.org/abs/2512.19433)
Keywords: diffusion, generative, large language model
Abstract: Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: this https URL.

Title: MT-Mark: Rethinking Image Watermarking via Mutual-Teacher Collaboration with Adaptive Feature Modulation

Authors: Fei Ge, Ying Huang, Jie Liu, Guixuan Zhang, Zhi Zeng, Shuwu Zhang, Hu Guan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19438
Pdf URL: https://arxiv.org/pdf/2512.19438
Copy Paste: [[2512.19438]] MT-Mark: Rethinking Image Watermarking via Mutual-Teacher Collaboration with Adaptive Feature Modulation(https://arxiv.org/abs/2512.19438)
Keywords: robust, extraction, watermark
Abstract: Existing deep image watermarking methods follow a fixed embedding-distortion-extraction pipeline, where the embedder and extractor are weakly coupled through a final loss and optimized in isolation. This design lacks explicit collaboration, leaving no structured mechanism for the embedder to incorporate decoding-aware cues or for the extractor to guide embedding during training. To address this architectural limitation, we rethink deep image watermarking by reformulating embedding and extraction as explicitly collaborative components. To realize this reformulation, we introduce a Collaborative Interaction Mechanism (CIM) that establishes direct, bidirectional communication between the embedder and extractor, enabling a mutual-teacher training paradigm and coordinated optimization. Built upon this explicitly collaborative architecture, we further propose an Adaptive Feature Modulation Module (AFMM) to support effective interaction. AFMM enables content-aware feature regulation by decoupling modulation structure and strength, guiding watermark embedding toward stable image features while suppressing host interference during extraction. Under CIM, the AFMMs on both sides form a closed-loop collaboration that aligns embedding behavior with extraction objectives. This architecture-level redesign changes how robustness is learned in watermarking systems. Rather than relying on exhaustive distortion simulation, robustness emerges from coordinated representation learning between embedding and extraction. Experiments on real-world and AI-generated datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality, showing strong robustness and generalization.

Title: An Inverse Scattering Inspired Fourier Neural Operator for Time-Dependent PDE Learning

Authors: Rixin Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.19439
Pdf URL: https://arxiv.org/pdf/2512.19439
Copy Paste: [[2512.19439]] An Inverse Scattering Inspired Fourier Neural Operator for Time-Dependent PDE Learning(https://arxiv.org/abs/2512.19439)
Keywords: robust
Abstract: Learning accurate and stable time-advancement operators for nonlinear partial differential equations (PDEs) remains challenging, particularly for chaotic, stiff, and long-horizon dynamical systems. While neural operator methods such as the Fourier Neural Operator (FNO) and Koopman-inspired extensions achieve good short-term accuracy, their long-term stability is often limited by unconstrained latent representations and cumulative rollout errors. In this work, we introduce an inverse scattering inspired Fourier Neural Operator(IS-FNO), motivated by the reversibility and spectral evolution structure underlying the classical inverse scattering transform. The proposed architecture enforces a near-reversible pairing between lifting and projection maps through an explicitly invertible neural transformation, and models latent temporal evolution using exponential Fourier layers that naturally encode linear and nonlinear spectral dynamics. We systematically evaluate IS-FNO against baseline FNO and Koopman-based models on a range of benchmark PDEs, including the Michelson-Sivashinsky and Kuramoto-Sivashinsky equations (in one and two dimensions), as well as the integrable Korteweg-de Vries and Kadomtsev-Petviashvili equations. The results demonstrate that IS-FNO achieves lower short-term errors and substantially improved long-horizon stability in non-stiff regimes. For integrable systems, reduced IS-FNO variants that embed analytical scattering structure retain competitive long-term accuracy despite limited model capacity. Overall, this work shows that incorporating physical structure -- particularly reversibility and spectral evolution -- into neural operator design significantly enhances robustness and long-term predictive fidelity for nonlinear PDE dynamics.

Title: D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning

Authors: Evelyn Zhang, Fufu Yu, Aoqi Wu, Zichen Wen, Ke Yan, Shouhong Ding, Biqing Qi, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19443
Pdf URL: https://arxiv.org/pdf/2512.19443
Copy Paste: [[2512.19443]] D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning(https://arxiv.org/abs/2512.19443)
Keywords: secure, large language model
Abstract: Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2\% while retaining 99.2\% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7\% performance at a 90\% token reduction rate, marking a significant advancement with up to 63. 53\% improvement over existing methods.

Title: Sign Language Recognition using Parallel Bidirectional Reservoir Computing

Authors: Nitin Kumar Singh, Arie Rachmad Syulistyo, Yuichiro Tanaka, Hakaru Tamukoh
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.19451
Pdf URL: https://arxiv.org/pdf/2512.19451
Copy Paste: [[2512.19451]] Sign Language Recognition using Parallel Bidirectional Reservoir Computing(https://arxiv.org/abs/2512.19451)
Keywords: extraction
Abstract: Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.

Title: SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

Authors: Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr, Kobkrit Viriyayudhakorn, Nongnuch Ketui, Aslan B. Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19455
Pdf URL: https://arxiv.org/pdf/2512.19455
Copy Paste: [[2512.19455]] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation(https://arxiv.org/abs/2512.19455)
Keywords: robust, large language model
Abstract: Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, We present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

Title: Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations

Authors: Jinwei Chi, Ke Wang, Yu Chen, Xuanye Lin, Qiang Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19456
Pdf URL: https://arxiv.org/pdf/2512.19456
Copy Paste: [[2512.19456]] Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations(https://arxiv.org/abs/2512.19456)
Keywords: large language model
Abstract: Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs' activations in cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.

Title: Multi-Layer Confidence Scoring for Detection of Out-of-Distribution Samples, Adversarial Attacks, and In-Distribution Misclassifications

Authors: Lorenzo Capelli, Leandro de Souza Rosa, Gianluca Setti, Mauro Mangia, Riccardo Rovatti
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19472
Pdf URL: https://arxiv.org/pdf/2512.19472
Copy Paste: [[2512.19472]] Multi-Layer Confidence Scoring for Detection of Out-of-Distribution Samples, Adversarial Attacks, and In-Distribution Misclassifications(https://arxiv.org/abs/2512.19472)
Keywords: attack
Abstract: The recent explosive growth in Deep Neural Networks applications raises concerns about the black-box usage of such models, with limited trasparency and trustworthiness in high-stakes domains, which have been crystallized as regulatory requirements such as the European Union Artificial Intelligence Act. While models with embedded confidence metrics have been proposed, such approaches cannot be applied to already existing models without retraining, limiting their broad application. On the other hand, post-hoc methods, which evaluate pre-trained models, focus on solving problems related to improving the confidence in the model's predictions, and detecting Out-Of-Distribution or Adversarial Attacks samples as independent applications. To tackle the limited applicability of already existing methods, we introduce Multi-Layer Analysis for Confidence Scoring (MACS), a unified post-hoc framework that analyzes intermediate activations to produce classification-maps. From the classification-maps, we derive a score applicable for confidence estimation, detecting distributional shifts and adversarial attacks, unifying the three problems in a common framework, and achieving performances that surpass the state-of-the-art approaches in our experiments with the VGG16 and ViTb16 models with a fraction of their computational overhead.

Title: A Large-Language-Model Framework for Automated Humanitarian Situation Reporting

Authors: Ivan Decostanzi, Yelena Mejova, Kyriaki Kalimeri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19475
Pdf URL: https://arxiv.org/pdf/2512.19475
Copy Paste: [[2512.19475]] A Large-Language-Model Framework for Automated Humanitarian Situation Reporting(https://arxiv.org/abs/2512.19475)
Keywords: extraction, generative, large language model
Abstract: Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured and evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.

Title: Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation

Authors: Guoli Jia, Junyao Hu, Xinwei Long, Kai Tian, Kaiyan Zhang, KaiKai Zhao, Ning Ding, Bowen Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19479
Pdf URL: https://arxiv.org/pdf/2512.19479
Copy Paste: [[2512.19479]] Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation(https://arxiv.org/abs/2512.19479)
Keywords: diffusion
Abstract: Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, where emotions are approximated to semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion. MC-Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. Further, we improve the DPO optimization by a negative visual prompt, enhancing the model's sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multi-agents to simulate human subjectivity toward emotions, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.

Title: Lightweight Intrusion Detection in IoT via SHAP-Guided Feature Pruning and Knowledge-Distilled Kronecker Networks

Authors: Hafsa Benaddi, Mohammed Jouhari, Nouha Laamech, Anas Motii, Khalil Ibrahimi
Subjects: cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2512.19488
Pdf URL: https://arxiv.org/pdf/2512.19488
Copy Paste: [[2512.19488]] Lightweight Intrusion Detection in IoT via SHAP-Guided Feature Pruning and Knowledge-Distilled Kronecker Networks(https://arxiv.org/abs/2512.19488)
Keywords: explainability
Abstract: The widespread deployment of Internet of Things (IoT) devices requires intrusion detection systems (IDS) with high accuracy while operating under strict resource constraints. Conventional deep learning IDS are often too large and computationally intensive for edge deployment. We propose a lightweight IDS that combines SHAP-guided feature pruning with knowledge-distilled Kronecker networks. A high-capacity teacher model identifies the most relevant features through SHAP explanations, and a compressed student leverages Kronecker-structured layers to minimize parameters while preserving discriminative inputs. Knowledge distillation transfers softened decision boundaries from teacher to student, improving generalization under compression. Experiments on the TON\_IoT dataset show that the student is nearly three orders of magnitude smaller than the teacher yet sustains macro-F1 above 0.986 with millisecond-level inference latency. The results demonstrate that explainability-driven pruning and structured compression can jointly enable scalable, low-latency, and energy-efficient IDS for heterogeneous IoT environments.

Title: FusionNet: Physics-Aware Representation Learning for Multi-Spectral and Thermal Data via Trainable Signal-Processing Priors

Authors: Georgios Voulgaris
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19504
Pdf URL: https://arxiv.org/pdf/2512.19504
Copy Paste: [[2512.19504]] FusionNet: Physics-Aware Representation Learning for Multi-Spectral and Thermal Data via Trainable Signal-Processing Priors(https://arxiv.org/abs/2512.19504)
Keywords: robust
Abstract: Modern deep learning models operating on multi-modal visual signals often rely on inductive biases that are poorly aligned with the physical processes governing signal formation, leading to brittle performance under cross-spectral and real-world conditions. In particular, approaches that prioritise direct thermal cues struggle to capture indirect yet persistent environmental alterations induced by sustained heat emissions. This work introduces a physics-aware representation learning framework that leverages multi-spectral information to model stable signatures of long-term physical processes. Specifically, a geological Short Wave Infrared (SWIR) ratio sensitive to soil property changes is integrated with Thermal Infrared (TIR) data through an intermediate fusion architecture, instantiated as FusionNet. The proposed backbone embeds trainable differential signal-processing priors within convolutional layers, combines mixed pooling strategies, and employs wider receptive fields to enhance robustness across spectral modalities. Systematic ablations show that each architectural component contributes to performance gains, with DGCNN achieving 88.7% accuracy on the SWIR ratio and FusionNet reaching 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, highlighting the importance of modality-aware training for cross-spectral learning. Evaluated on real-world data, the results demonstrate that combining physics-aware feature selection with principled deep learning architectures yields robust and generalisable representations, illustrating how first-principles signal modelling can improve multi-spectral learning under challenging conditions.

Title: Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation

Authors: Ziyang Song, Zelin Zang, Zuyao Chen, Xusheng Liang, Dong Yi, Jinlin Wu, Hongbin Liu, Jiebo Luo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19512
Pdf URL: https://arxiv.org/pdf/2512.19512
Copy Paste: [[2512.19512]] Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation(https://arxiv.org/abs/2512.19512)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO's reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model's search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves a significant improvement across the two benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found in this https URL

Title: LacaDM: A Latent Causal Diffusion Model for Multiobjective Reinforcement Learning

Authors: Xueming Yan, Bo Yin, Yaochu Jin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19516
Pdf URL: https://arxiv.org/pdf/2512.19516
Copy Paste: [[2512.19516]] LacaDM: A Latent Causal Diffusion Model for Multiobjective Reinforcement Learning(https://arxiv.org/abs/2512.19516)
Keywords: diffusion
Abstract: Multiobjective reinforcement learning (MORL) poses significant challenges due to the inherent conflicts between objectives and the difficulty of adapting to dynamic environments. Traditional methods often struggle to generalize effectively, particularly in large and complex state-action spaces. To address these limitations, we introduce the Latent Causal Diffusion Model (LacaDM), a novel approach designed to enhance the adaptability of MORL in discrete and continuous environments. Unlike existing methods that primarily address conflicts between objectives, LacaDM learns latent temporal causal relationships between environmental states and policies, enabling efficient knowledge transfer across diverse MORL scenarios. By embedding these causal structures within a diffusion model-based framework, LacaDM achieves a balance between conflicting objectives while maintaining strong generalization capabilities in previously unseen environments. Empirical evaluations on various tasks from the MOGymnasium framework demonstrate that LacaDM consistently outperforms the state-of-art baselines in terms of hypervolume, sparsity, and expected utility maximization, showcasing its effectiveness in complex multiobjective tasks.

Title: A Convolutional Neural Deferred Shader for Physics Based Rendering

Authors: Zhuo He, Yingdong Ru, Qianying Liu, Paul Henderson, Nicolas Pugeault
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19522
Pdf URL: https://arxiv.org/pdf/2512.19522
Copy Paste: [[2512.19522]] A Convolutional Neural Deferred Shader for Physics Based Rendering(https://arxiv.org/abs/2512.19522)
Keywords: diffusion
Abstract: Recent advances in neural rendering have achieved impressive results on photorealistic shading and relighting, by using a multilayer perceptron (MLP) as a regression model to learn the rendering equation from a real-world dataset. Such methods show promise for photorealistically relighting real-world objects, which is difficult to classical rendering, as there is no easy-obtained material ground truth. However, significant challenges still remain the dense connections in MLPs result in a large number of parameters, which requires high computation resources, complicating the training, and reducing performance during rendering. Data driven approaches require large amounts of training data for generalization; unbalanced data might bias the model to ignore the unusual illumination conditions, e.g. dark scenes. This paper introduces pbnds+: a novel physics-based neural deferred shading pipeline utilizing convolution neural networks to decrease the parameters and improve the performance in shading and relighting tasks; Energy regularization is also proposed to restrict the model reflection during dark illumination. Extensive experiments demonstrate that our approach outperforms classical baselines, a state-of-the-art neural shading model, and a diffusion-based method.

Title: Initialization of a Polyharmonic Cascade, Launch and Testing

Authors: Yuriy N. Bakhvalov
Subjects: cs.LG, math.NA
Abstract URL: https://arxiv.org/abs/2512.19524
Pdf URL: https://arxiv.org/pdf/2512.19524
Copy Paste: [[2512.19524]] Initialization of a Polyharmonic Cascade, Launch and Testing(https://arxiv.org/abs/2512.19524)
Keywords: robust
Abstract: This paper concludes a series of studies on the polyharmonic cascade, a deep machine learning architecture theoretically derived from indifference principles and the theory of random functions. A universal initialization procedure is proposed, based on symmetric constellations in the form of hyperoctahedra with a central point. This initialization not only ensures stable training of cascades with tens and hundreds of layers (up to 500 layers without skip connections), but also radically simplifies the computations. Scalability and robustness are demonstrated on MNIST (98.3% without convolutions or augmentations), HIGGS (AUC approximately 0.885 on 11M examples), and Epsilon (AUC approximately 0.963 with 2000 features). All linear algebra is reduced to 2D operations and is efficiently executed on GPUs. A public repository and an archived snapshot are provided for full reproducibility.

Title: Multi-Modal Soccer Scene Analysis with Masked Pre-Training

Authors: Marc Peral, Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19528
Pdf URL: https://arxiv.org/pdf/2512.19528
Copy Paste: [[2512.19528]] Multi-Modal Soccer Scene Analysis with Masked Pre-Training(https://arxiv.org/abs/2512.19528)
Keywords: robust, transformer
Abstract: In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.

Title: Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement

Authors: Hongsheng Xing, Qiuxin Si
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19530
Pdf URL: https://arxiv.org/pdf/2512.19530
Copy Paste: [[2512.19530]] Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement(https://arxiv.org/abs/2512.19530)
Keywords: robust, large language model
Abstract: Predicting reaction outcomes across continuous solvent composition ranges remains a critical challenge in organic synthesis and process chemistry. Traditional machine learning approaches often treat solvent identity as a discrete categorical variable, which prevents systematic interpolation and extrapolation across the solvent space. This work introduces the \textbf{Catechol Benchmark}, a high-throughput transient flow chemistry dataset comprising 1,227 experimental yield measurements for the rearrangement of allyl-substituted catechol in 24 pure solvents and their binary mixtures, parameterized by continuous volume fractions ($\% B$). We evaluate various architectures under rigorous leave-one-solvent-out and leave-one-mixture-out protocols to test generalization to unseen chemical environments. Our results demonstrate that classical tabular methods (e.g., Gradient-Boosted Decision Trees) and large language model embeddings (e.g., Qwen-7B) struggle with quantitative precision, yielding Mean Squared Errors (MSE) of 0.099 and 0.129, respectively. In contrast, we propose a hybrid GNN-based architecture that integrates Graph Attention Networks (GATs) with Differential Reaction Fingerprints (DRFP) and learned mixture-aware solvent encodings. This approach achieves an \textbf{MSE of 0.0039} ($\pm$ 0.0003), representing a 60\% error reduction over competitive baselines and a $>25\times$ improvement over tabular ensembles. Ablation studies confirm that explicit molecular graph message-passing and continuous mixture encoding are essential for robust generalization. The complete dataset, evaluation protocols, and reference implementations are released to facilitate data-efficient reaction prediction and continuous solvent representation learning.

Title: Event Extraction in Large Language Model

Authors: Bobo Li, Xudong Han, Jiang Liu, Yuzhe Ding, Liqiang Jing, Zhaoqi Zhang, Jinheng Li, Xinya Du, Fei Li, Meishan Zhang, Min Zhang, Aixin Sun, Philip S. Yu, Hao Fei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19537
Pdf URL: https://arxiv.org/pdf/2512.19537
Copy Paste: [[2512.19537]] Event Extraction in Large Language Model(https://arxiv.org/abs/2512.19537)
Keywords: extraction, generative, large language model
Abstract: Large language models (LLMs) and multimodal LLMs are changing event extraction (EE): prompting and generation can often produce structured outputs in zero shot or few shot settings. Yet LLM based pipelines face deployment gaps, including hallucinations under weak constraints, fragile temporal and causal linking over long contexts and across documents, and limited long horizon knowledge management within a bounded context window. We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM centered solutions. Event schemas and slot constraints create interfaces for grounding and verification; event centric structures act as controlled intermediate representations for stepwise reasoning; event links support relation aware retrieval with graph based RAG; and event stores offer updatable episodic and agent memory beyond the context window. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule based and neural models to instruction driven and generative frameworks, and summarizing formulations, decoding strategies, architectures, representations, datasets, and evaluation. We also review cross lingual, low resource, and domain specific settings, and highlight open challenges and future directions for reliable event centric systems. Finally, we outline open challenges and future directions that are central to the LLM era, aiming to evolve EE from static extraction into a structurally reliable, agent ready perception and memory layer for open world systems.

Title: StoryMem: Multi-shot Long Video Storytelling with Memory

Authors: Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19539
Pdf URL: https://arxiv.org/pdf/2512.19539
Copy Paste: [[2512.19539]] StoryMem: Multi-shot Long Video Storytelling with Memory(https://arxiv.org/abs/2512.19539)
Keywords: diffusion
Abstract: Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.

Title: ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars

Authors: Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiyao Sun, Zixiang Zhou, Youliang Zhang, Zhengguang Zhou, Zhaoxin Fan, Hongyan Liu, Yuan Zhou, Qinglin Lu, Jun He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19546
Pdf URL: https://arxiv.org/pdf/2512.19546
Copy Paste: [[2512.19546]] ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars(https://arxiv.org/abs/2512.19546)
Keywords: robust
Abstract: Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process-early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.

Title: BabyFlow: 3D modeling of realistic and expressive infant faces

Authors: Antonia Alomar, Mireia Masias, Marius George Linguraru, Federico M. Sukno, Gemma Piella
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19560
Pdf URL: https://arxiv.org/pdf/2512.19560
Copy Paste: [[2512.19560]] BabyFlow: 3D modeling of realistic and expressive infant faces(https://arxiv.org/abs/2512.19560)
Keywords: diffusion, generative
Abstract: Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.

Title: Increasing the Thinking Budget is Not All You Need

Authors: Ignacio Iacobacci, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Riad Souissi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19585
Pdf URL: https://arxiv.org/pdf/2512.19585
Copy Paste: [[2512.19585]] Increasing the Thinking Budget is Not All You Need(https://arxiv.org/abs/2512.19585)
Keywords: large language model
Abstract: Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget, impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various configurations such as self-consistency, reflection, and others. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we discovered that simply increasing the thinking budget is not the most effective use of compute. More accurate responses can instead be achieved through alternative configurations, such as self-consistency and self-reflection.

Title: No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

Authors: Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19602
Pdf URL: https://arxiv.org/pdf/2512.19602
Copy Paste: [[2512.19602]] No Data? No Problem: Robust Vision-Tabular Learning with Missing Values(https://arxiv.org/abs/2512.19602)
Keywords: robust
Abstract: Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at this https URL.

Title: MapTrace: Scalable Data Generation for Route Tracing on Maps

Authors: Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19609
Pdf URL: https://arxiv.org/pdf/2512.19609
Copy Paste: [[2512.19609]] MapTrace: Scalable Data Generation for Route Tracing on Maps(https://arxiv.org/abs/2512.19609)
Keywords: robust, large language model
Abstract: While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that finetuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.

Title: MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery

Authors: Angelo Ortiz Tandazo, Manel Khentout, Youssef Benchekroun, Thomas Hueber, Emmanuel Dupoux
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2512.19612
Pdf URL: https://arxiv.org/pdf/2512.19612
Copy Paste: [[2512.19612]] MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery(https://arxiv.org/abs/2512.19612)
Keywords: robust
Abstract: This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.

Title: Exploring the features used for summary evaluation by Human and GPT

Authors: Zahra Sadeghi, Evangelos Milios, Frank Rudzicz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19620
Pdf URL: https://arxiv.org/pdf/2512.19620
Copy Paste: [[2512.19620]] Exploring the features used for summary evaluation by Human and GPT(https://arxiv.org/abs/2512.19620)
Keywords: transformer, generative, large language model
Abstract: Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLMs and Human responses, it is not yet well understood what properties or features are exploited by them when asked to evaluate based on a particular quality dimension, and there has not been much attention towards mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with Human and Generative Pre-trained Transformers (GPTs) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by Human can improve their judgment and conforming them better with human responses.

Title: Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment

Authors: Da Tan, Michael Beck, Christopher P. Bidinosti, Robert H. Gulden, Christopher J. Henry
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19632
Pdf URL: https://arxiv.org/pdf/2512.19632
Copy Paste: [[2512.19632]] Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment(https://arxiv.org/abs/2512.19632)
Keywords: diffusion, generative
Abstract: The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert preference aligned fine tuning. First, a Stable Diffusion model is fine tuned on captioned indoor and outdoor plant imagery to generate realistic, text conditioned images of canola and soybean. Evaluation using Inception Score, Frechet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference guided fine tuning framework trains a reward model on expert scores and applies reward weighted updates to produce more stable and expert aligned outputs. Together, these components demonstrate a practical pathway toward data efficient generative pipelines for agricultural AI.

Title: The Best of Both Worlds: Hybridizing Neural Operators and Solvers for Stable Long-Horizon Inference

Authors: Rajyasri Roy, Dibyajyoti Nayak, Somdatta Goswami
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2512.19643
Pdf URL: https://arxiv.org/pdf/2512.19643
Copy Paste: [[2512.19643]] The Best of Both Worlds: Hybridizing Neural Operators and Solvers for Stable Long-Horizon Inference(https://arxiv.org/abs/2512.19643)
Keywords: robust, data-free
Abstract: Numerical simulation of time-dependent partial differential equations (PDEs) is central to scientific and engineering applications, but high-fidelity solvers are often prohibitively expensive for long-horizon or time-critical settings. Neural operator (NO) surrogates offer fast inference across parametric and functional inputs; however, most autoregressive NO frameworks remain vulnerable to compounding errors, and ensemble-averaged metrics provide limited guarantees for individual inference trajectories. In practice, error accumulation can become unacceptable beyond the training horizon, and existing methods lack mechanisms for online monitoring or correction. To address this gap, we propose ANCHOR (Adaptive Numerical Correction for High-fidelity Operator Rollouts), an online, instance-aware hybrid inference framework for stable long-horizon prediction of nonlinear, time-dependent PDEs. ANCHOR treats a pretrained NO as the primary inference engine and adaptively couples it with a classical numerical solver using a physics-informed, residual-based error estimator. Inspired by adaptive time-stepping in numerical analysis, ANCHOR monitors an exponential moving average (EMA) of the normalized PDE residual to detect accumulating error and trigger corrective solver interventions without requiring access to ground-truth solutions. We show that the EMA-based estimator correlates strongly with the true relative L2 error, enabling data-free, instance-aware error control during inference. Evaluations on four canonical PDEs: 1D and 2D Burgers', 2D Allen-Cahn, and 3D heat conduction, demonstrate that ANCHOR reliably bounds long-horizon error growth, stabilizes extrapolative rollouts, and significantly improves robustness over standalone neural operators, while remaining substantially more efficient than high-fidelity numerical solvers.

Title: Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting

Authors: Filippos Ventirozos, Peter Appleby, Matthew Shardlow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19651
Pdf URL: https://arxiv.org/pdf/2512.19651
Copy Paste: [[2512.19651]] Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting(https://arxiv.org/abs/2512.19651)
Keywords: large language model
Abstract: Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.

Title: Over++: Generative Video Compositing for Layer Interaction Effects

Authors: Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19661
Pdf URL: https://arxiv.org/pdf/2512.19661
Copy Paste: [[2512.19661]] Over++: Generative Video Compositing for Layer Interaction Effects(https://arxiv.org/abs/2512.19661)
Keywords: generative
Abstract: In professional video compositing workflows, artists must manually create environmental interactions-such as shadows, reflections, dust, and splashes-between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.

Title: Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis

Authors: Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19663
Pdf URL: https://arxiv.org/pdf/2512.19663
Copy Paste: [[2512.19663]] Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis(https://arxiv.org/abs/2512.19663)
Keywords: robust, transformer
Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.

Title: Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Authors: Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.19673
Pdf URL: https://arxiv.org/pdf/2512.19673
Copy Paste: [[2512.19673]] Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies(https://arxiv.org/abs/2512.19673)
Keywords: transformer, large language model
Abstract: Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) LLama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrates the effectiveness of our method. Our code is available at this https URL.

Title: Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning

Authors: Mojtaba Safari, Shansong Wang, Vanessa L Wildman, Mingzhe Hu, Zach Eidex, Chih-Wei Chang, Erik H Middlebrooks, Richard L.J Qiu, Pretesh Patel, Ashesh B. Jania, Hui Mao, Zhen Tian, Xiaofeng Yang
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2512.19676
Pdf URL: https://arxiv.org/pdf/2512.19676
Copy Paste: [[2512.19676]] Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning(https://arxiv.org/abs/2512.19676)
Keywords: extraction, diffusion, transformer
Abstract: Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951+-0.021, PSNR=26.90+-1.41 dB, LPIPS=0.076+-0.022, GMSD=0.083+-0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770+-0.049, PSNR=27.15+-2.19 dB, LPIPS=0.190+-0.095, GMSD=0.087+-0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.

Title: WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion

Authors: Hanyang Kong, Xingyi Yang, Xiaoxu Zheng, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.19678
Pdf URL: https://arxiv.org/pdf/2512.19678
Copy Paste: [[2512.19678]] WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion(https://arxiv.org/abs/2512.19678)
Keywords: diffusion, generative
Abstract: Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: \href{this https URL}{this https URL}.

Title: GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Authors: Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.19682
Pdf URL: https://arxiv.org/pdf/2512.19682
Copy Paste: [[2512.19682]] GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators(https://arxiv.org/abs/2512.19682)
Keywords: generative, large language model
Abstract: Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a dataevolving: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent's ``zone of proximal development''. This process is guided by a simple but effective $\alpha$-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to \textbf{+40.3\%} over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3$\times$ less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.

Title: From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

Authors: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19683
Pdf URL: https://arxiv.org/pdf/2512.19683
Copy Paste: [[2512.19683]] From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs(https://arxiv.org/abs/2512.19683)
Keywords: robust, large language model
Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.

Title: Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models

Authors: Pablo Ruiz-Ponce, Sergio Escalera, José García-Rodríguez, Jiankang Deng, Rolandos Alexandros Potamias
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.19692
Pdf URL: https://arxiv.org/pdf/2512.19692
Copy Paste: [[2512.19692]] Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models(https://arxiv.org/abs/2512.19692)
Keywords: robust, diffusion
Abstract: Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.