2025-12-03

Title: Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review

Authors: Yan Yang, Mouxiao Bian, Peiling Li, Bingjian Wen, Ruiyao Chen, Kangkun Mao, Xiaojun Ye, Tianbin Li, Pengcheng Chen, Bing Han, Jie Xu, Kaifeng Qiu, Junyan Wu
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.02024
Pdf URL: https://arxiv.org/pdf/2512.02024
Copy Paste: [[2512.02024]] Human-Level and Beyond: Benchmarking Large Language Models Against Clinical Pharmacists in Prescription Review(https://arxiv.org/abs/2512.02024)
Keywords: robust, large language model
Abstract: The rapid advancement of large language models (LLMs) has accelerated their integration into clinical decision support, particularly in prescription review. To enable systematic and fine-grained evaluation, we developed RxBench, a comprehensive benchmark that covers common prescription review categories and consolidates 14 frequent types of prescription errors drawn from authoritative pharmacy references. RxBench consists of 1,150 single-choice, 230 multiple-choice, and 879 short-answer items, all reviewed by experienced clinical pharmacists. We benchmarked 18 state-of-the-art LLMs and identified clear stratification of performance across tasks. Notably, Gemini-2.5-pro-preview-05-06, Grok-4-0709, and DeepSeek-R1-0528 consistently formed the first tier, outperforming other models in both accuracy and robustness. Comparisons with licensed pharmacists indicated that leading LLMs can match or exceed human performance in certain tasks. Furthermore, building on insights from our benchmark evaluation, we performed targeted fine-tuning on a mid-tier model, resulting in a specialized model that rivals leading general-purpose LLMs in performance on short-answer question tasks. The main contribution of RxBench lies in establishing a standardized, error-type-oriented framework that not only reveals the capabilities and limitations of frontier LLMs in prescription review but also provides a foundational resource for building more reliable and specialized clinical tools.

Title: Pharmacophore-based design by learning on voxel grids

Authors: Omar Mahmood, Pedro O. Pinheiro, Richard Bonneau, Saeed Saremi, Vishnu Sresht
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2512.02031
Pdf URL: https://arxiv.org/pdf/2512.02031
Copy Paste: [[2512.02031]] Pharmacophore-based design by learning on voxel grids(https://arxiv.org/abs/2512.02031)
Keywords: generative
Abstract: Ligand-based drug discovery (LBDD) relies on making use of known binders to a protein target to find structurally diverse molecules similarly likely to bind. This process typically involves a brute force search of the known binder (query) against a molecular library using some metric of molecular similarity. One popular approach overlays the pharmacophore-shape profile of the known binder to 3D conformations enumerated for each of the library molecules, computes overlaps, and picks a set of diverse library molecules with high overlaps. While this virtual screening workflow has had considerable success in hit diversification, scaffold hopping, and patent busting, it scales poorly with library sizes and restricts candidate generation to existing library compounds. Leveraging recent advances in voxel-based generative modelling, we propose a pharmacophore-based generative model and workflows that address the scaling and fecundity issues of conventional pharmacophore-based virtual screening. We introduce \emph{VoxCap}, a voxel captioning method for generating SMILES strings from voxelised molecular representations. We propose two workflows as practical use cases as well as benchmarks for pharmacophore-based generation: \emph{de-novo} design, in which we aim to generate new molecules with high pharmacophore-shape similarities to query molecules, and fast search, which aims to combine generative design with a cheap 2D substructure similarity search for efficient hit identification. Our results show that VoxCap significantly outperforms previous methods in generating diverse \textit{de-novo} hits. When combined with our fast search workflow, VoxCap reduces computational time by orders of magnitude while returning hits for all query molecules, enabling the search of large libraries that are intractable to search by brute force.

Title: Deep Research: A Systematic Survey

Authors: Zhengliang Shi, Yiqun Chen, Haitao Li, Weiwei Sun, Shiyu Ni, Yougang Lyu, Run-Ze Fan, Bowen Jin, Yixuan Weng, Minjun Zhu, Qiujie Xie, Xinyu Guo, Qu Yang, Jiayi Wu, Jujia Zhao, Xiaqiang Tang, Xinbei Ma, Cunxiang Wang, Jiaxin Mao, Qingyao Ai, Jen-Tse Huang, Wenxuan Wang, Yue Zhang, Yiming Yang, Zhaopeng Tu, Zhaochun Ren
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2512.02038
Pdf URL: https://arxiv.org/pdf/2512.02038
Copy Paste: [[2512.02038]] Deep Research: A Systematic Survey(https://arxiv.org/abs/2512.02038)
Keywords: large language model
Abstract: Large language models (LLMs) have rapidly evolved from text generators into powerful problem solvers. Yet, many open tasks demand critical thinking, multi-source, and verifiable outputs, which are beyond single-shot prompting or standard retrieval-augmented generation. Recently, numerous studies have explored Deep Research (DR), which aims to combine the reasoning capabilities of LLMs with external tools, such as search engines, thereby empowering LLMs to act as research agents capable of completing complex, open-ended tasks. This survey presents a comprehensive and systematic overview of deep research systems, including a clear roadmap, foundational components, practical implementation techniques, important challenges, and future directions. Specifically, our main contributions are as follows: (i) we formalize a three-stage roadmap and distinguish deep research from related paradigms; (ii) we introduce four key components: query planning, information acquisition, memory management, and answer generation, each paired with fine-grained sub-taxonomies; (iii) we summarize optimization techniques, including prompting, supervised fine-tuning, and agentic reinforcement learning; and (iv) we consolidate evaluation criteria and open challenges, aiming to guide and facilitate future development. As the field of deep research continues to evolve rapidly, we are committed to continuously updating this survey to reflect the latest progress in this area.

Title: Mirror, Mirror on the Wall -- Which is the Best Model of Them All?

Authors: Dina Sayed, Heiko Schuldt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02043
Pdf URL: https://arxiv.org/pdf/2512.02043
Copy Paste: [[2512.02043]] Mirror, Mirror on the Wall -- Which is the Best Model of Them All?(https://arxiv.org/abs/2512.02043)
Keywords: large language model
Abstract: Large Language Models (LLMs) have become one of the most transformative tools across many applications, as they have significantly boosted productivity and achieved impressive results in various domains such as finance, healthcare, education, telecommunications, and law, among others. Typically, state-of-the-art (SOTA) foundation models are developed by large corporations based on large data collections and substantial computational and financial resources required to pretrain such models from scratch. These foundation models then serve as the basis for further development and domain adaptation for specific use cases or tasks. However, given the dynamic and fast-paced nature of launching new foundation models, the process of selecting the most suitable model for a particular use case, application, or domain becomes increasingly complex. We argue that there are two main dimensions that need to be taken into consideration when selecting a model for further training: a qualitative dimension (which model is best suited for a task based on information, for instance, taken from model cards) and a quantitative dimension (which is the best performing model). The quantitative performance of models is assessed through leaderboards, which rank models based on standardized benchmarks and provide a consistent framework for comparing different LLMs. In this work, we address the analysis of the quantitative dimension by exploring the current leaderboards and benchmarks. To illustrate this analysis, we focus on the medical domain as a case study, demonstrating the evolution, current landscape, and practical significance of this quantitative evaluation dimension. Finally, we propose a Model Selection Methodology (MSM), a systematic approach designed to guide the navigation, prioritization, and selection of the model that best aligns with a given use case.

Title: Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models

Authors: Kecheng Chen, Ziru Liu, Xijia Tao, Hui Liu, Xinyu Fu, Suiyun Zhang, Dandan Tu, Lingpeng Kong, Rui Liu, Haoliang Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02044
Pdf URL: https://arxiv.org/pdf/2512.02044
Copy Paste: [[2512.02044]] Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models(https://arxiv.org/abs/2512.02044)
Keywords: diffusion
Abstract: Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy which inherently lack a more reliable perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Building on this theoretical insight, we further address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.

Title: Contextual Gating within the Transformer Stack: Synergistic Feature Modulation for Enhanced Lyrical Classification and Calibration

Authors: M.A. Gameiro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02053
Pdf URL: https://arxiv.org/pdf/2512.02053
Copy Paste: [[2512.02053]] Contextual Gating within the Transformer Stack: Synergistic Feature Modulation for Enhanced Lyrical Classification and Calibration(https://arxiv.org/abs/2512.02053)
Keywords: transformer
Abstract: This study introduces a significant architectural advancement in feature fusion for lyrical content classification by integrating auxiliary structural features directly into the self-attention mechanism of a pre-trained Transformer. I propose the SFL Transformer, a novel deep learning model that utilizes a Contextual Gating mechanism (an Intermediate SFL) to modulate the sequence of hidden states within the BERT encoder stack, rather than fusing features at the final output layer. This approach modulates the deep, contextualized semantic features (Hseq) using low-dimensional structural cues (Fstruct). The model is applied to a challenging binary classification task derived from UMAP-reduced lyrical embeddings. The SFL Transformer achieved an Accuracy of 0.9910 and a Macro F1 score of 0.9910, significantly improving the state-of-the-art established by the previously published SFL model (Accuracy 0.9894). Crucially, this Contextual Gating strategy maintained exceptional reliability, with a low Expected Calibration Error (ECE = 0.0081) and Log Loss (0.0489). This work validates the hypothesis that injecting auxiliary context mid-stack is the most effective means of synergistically combining structural and semantic information, creating a model with both superior discriminative power and high-fidelity probability estimates.

Title: Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale

Authors: Mirela G. Tulbure, Julio Caineta, Mark Broich, Mollie D. Gaines, Philippe Rufin, Leon-Friedrich Thomas, Hamed Alemohammad, Jan Hemmerling, Patrick Hostert
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02055
Pdf URL: https://arxiv.org/pdf/2512.02055
Copy Paste: [[2512.02055]] Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scale(https://arxiv.org/abs/2512.02055)
Keywords: segmentation
Abstract: Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM's TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision. Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience.

Title: Reversing Large Language Models for Efficient Training and Fine-Tuning

Authors: Eshed Gal, Moshe Eliasof, Javier Turek, Uri Ascher, Eran Treister, Eldad Haber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02056
Pdf URL: https://arxiv.org/pdf/2512.02056
Copy Paste: [[2512.02056]] Reversing Large Language Models for Efficient Training and Fine-Tuning(https://arxiv.org/abs/2512.02056)
Keywords: large language model
Abstract: Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In this work, we introduce memory-efficient, reversible architectures for LLMs, inspired by symmetric and symplectic differential equations, and investigate their theoretical properties. Different from standard, baseline architectures that store all intermediate activations, the proposed models use time-reversible dynamics to retrieve hidden states during backpropagation, relieving the need to store activations. This property allows for a drastic reduction in memory consumption, allowing for the processing of larger batch sizes for the same available memory, thereby offering improved throughput. In addition, we propose an efficient method for converting existing, non-reversible LLMs into reversible architectures through fine-tuning, rendering our approach practical for exploiting existing pre-trained models. Our results show comparable or improved performance on several datasets and benchmarks, on several LLMs, building a scalable and efficient path towards reducing the memory and computational costs associated with both training from scratch and fine-tuning of LLMs.

Title: Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering

Authors: Haoxiang Zhang, Ruihao Yuan, Lihui Zhang, Yushi Luo, Qiang Zhang, Pan Ding, Xiaodong Ren, Weijie Xing, Niu Gao, Jishan Chen, Chubo Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02057
Pdf URL: https://arxiv.org/pdf/2512.02057
Copy Paste: [[2512.02057]] Opening the Black Box: An Explainable, Few-shot AI4E Framework Informed by Physics and Expert Knowledge for Materials Engineering(https://arxiv.org/abs/2512.02057)
Keywords: interpretability
Abstract: The industrial adoption of Artificial Intelligence for Engineering (AI4E) faces two fundamental bottlenecks: scarce high-quality data and the lack of interpretability in black-box models-particularly critical in safety-sensitive sectors like aerospace. We present an explainable, few-shot AI4E framework that is systematically informed by physics and expert knowledge throughout its architecture. Starting from only 32 experimental samples in an aerial K439B superalloy castings repair welding case, we first augment physically plausible synthetic data through a three-stage protocol: differentiated noise injection calibrated to process variabilities, enforcement of hard physical constraints, and preservation of inter-parameter relationships. We then employ a nested optimization strategy for constitutive model discovery, where symbolic regression explores equation structures while differential evolution optimizes parameters, followed by intensive parameter refinement using hybrid global-local optimization. The resulting interpretable constitutive equation achieves 88% accuracy in predicting hot-cracking tendency. This equation not only provides quantitative predictions but also delivers explicit physical insight, revealing how thermal, geometric, and metallurgical mechanisms couple to drive cracking-thereby advancing engineers' cognitive understanding of the process. Furthermore, the constitutive equation serves as a multi-functional tool for process optimization and high-fidelity virtual data generation, enabling accuracy improvements in other data-driven models. Our approach provides a general blueprint for developing trustworthy AI systems that embed engineering domain knowledge directly into their architecture, enabling reliable adoption in high-stakes industrial applications where data is limited but physical understanding is available.

Title: Superpixel Attack: Enhancing Black-box Adversarial Attack with Image-driven Division Areas

Authors: Issa Oe, Keiichiro Yamamura, Hiroki Ishikura, Ryo Hamahira, Katsuki Fujisawa
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.02062
Pdf URL: https://arxiv.org/pdf/2512.02062
Copy Paste: [[2512.02062]] Superpixel Attack: Enhancing Black-box Adversarial Attack with Image-driven Division Areas(https://arxiv.org/abs/2512.02062)
Keywords: defense, attack, robust
Abstract: Deep learning models are used in safety-critical tasks such as automated driving and face recognition. However, small perturbations in the model input can significantly change the predictions. Adversarial attacks are used to identify small perturbations that can lead to misclassifications. More powerful black-box adversarial attacks are required to develop more effective defenses. A promising approach to black-box adversarial attacks is to repeat the process of extracting a specific image area and changing the perturbations added to it. Existing attacks adopt simple rectangles as the areas where perturbations are changed in a single iteration. We propose applying superpixels instead, which achieve a good balance between color variance and compactness. We also propose a new search method, versatile search, and a novel attack method, Superpixel Attack, which applies superpixels and performs versatile search. Superpixel Attack improves attack success rates by an average of 2.10% compared with existing attacks. Most models used in this study are robust against adversarial attacks, and this improvement is significant for black-box adversarial attacks. The code is avilable at this https URL.

Title: Large Language Model based Smart Contract Auditing with LLMBugScanner

Authors: Yining Yuan, Yifei Wang, Yichang Xu, Zachary Yahn, Sihao Hu, Ling Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02069
Pdf URL: https://arxiv.org/pdf/2512.02069
Copy Paste: [[2512.02069]] Large Language Model based Smart Contract Auditing with LLMBugScanner(https://arxiv.org/abs/2512.02069)
Keywords: robust, large language model
Abstract: This paper presents LLMBugScanner, a large language model (LLM) based framework for smart contract vulnerability detection using fine-tuning and ensemble learning. Smart contract auditing presents several challenges for LLMs: different pretrained models exhibit varying reasoning abilities, and no single model performs consistently well across all vulnerability types or contract structures. These limitations persist even after fine-tuning individual LLMs. To address these challenges, LLMBugScanner combines domain knowledge adaptation with ensemble reasoning to improve robustness and generalization. Through domain knowledge adaptation, we fine-tune LLMs on complementary datasets to capture both general code semantics and instruction-guided vulnerability reasoning, using parameter-efficient tuning to reduce computational cost. Through ensemble reasoning, we leverage the complementary strengths of multiple LLMs and apply a consensus-based conflict resolution strategy to produce more reliable vulnerability assessments. We conduct extensive experiments across multiple popular LLMs and compare LLMBugScanner with both pretrained and fine-tuned individual models. Results show that LLMBugScanner achieves consistent accuracy improvements and stronger generalization, demonstrating that it provides a principled, cost-effective, and extensible framework for smart contract auditing.

Title: DPWMixer: Dual-Path Wavelet Mixer for Long-Term Time Series Forecasting

Authors: Li Qianyang, Zhang Xingjun, Wang Shaoxun, Wei Jia
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02070
Pdf URL: https://arxiv.org/pdf/2512.02070
Copy Paste: [[2512.02070]] DPWMixer: Dual-Path Wavelet Mixer for Long-Term Time Series Forecasting(https://arxiv.org/abs/2512.02070)
Keywords: transformer
Abstract: Long-term time series forecasting (LTSF) is a critical task in computational intelligence. While Transformer-based models effectively capture long-range dependencies, they often suffer from quadratic complexity and overfitting due to data sparsity. Conversely, efficient linear models struggle to depict complex non-linear local dynamics. Furthermore, existing multi-scale frameworks typically rely on average pooling, which acts as a non-ideal low-pass filter, leading to spectral aliasing and the irreversible loss of high-frequency transients. In response, this paper proposes DPWMixer, a computationally efficient Dual-Path architecture. The framework is built upon a Lossless Haar Wavelet Pyramid that replaces traditional pooling, utilizing orthogonal decomposition to explicitly disentangle trends and local fluctuations without information loss. To process these components, we design a Dual-Path Trend Mixer that integrates a global linear mapping for macro-trend anchoring and a flexible patch-based MLP-Mixer for micro-dynamic evolution. Finally, An adaptive multi-scale fusion module then integrates predictions from diverse scales, weighted by channel stationarity to optimize synthesis. Extensive experiments on eight public benchmarks demonstrate that our method achieves a consistent improvement over state-of-the-art baselines. The code is available at this https URL.

Title: FDRMFL:Multi-modal Federated Feature Extraction Model Based on Information Maximization and Contrastive Learning

Authors: Haozhe Wu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02076
Pdf URL: https://arxiv.org/pdf/2512.02076
Copy Paste: [[2512.02076]] FDRMFL:Multi-modal Federated Feature Extraction Model Based on Information Maximization and Contrastive Learning(https://arxiv.org/abs/2512.02076)
Keywords: extraction, federate
Abstract: This study focuses on the feature extraction problem in multi-modal data regression. To address three core challenges in real-world scenarios: limited and non-IID data, effective extraction and fusion of multi-modal information, and susceptibility to catastrophic forgetting in model learning, a task-driven supervised multi-modal federated feature extraction method is proposed. The method integrates multi-modal information extraction and contrastive learning mechanisms, and can adapt to different neural network structures as the latent mapping functions for data of each modality. It supports each client to independently learn low-dimensional representations of multi-modal data, and can flexibly control the degree of retention of effective information about the response variable in the predictive variables within the low-dimensional features through parameter tuning. The multi-constraint learning framework constructed by the method guarantees regression accuracy using Mean Squared Error loss. Through the synergistic effect of mutual information preservation constraint, symmetric Kullback-Leibler divergence constraint, and inter-model contrastive constraint, it achieves the retention of task-related information, the extraction, fusion, and alignment of multi-modal features, and the mitigation of representation drift and catastrophic forgetting in non-IID scenarios, respectively. This ensures that the feature extraction process always centers on improving the performance of downstream regression tasks. Experimental results from simulations and real-world data analysis demonstrate that the proposed method achieves more significant performance improvement on downstream regression tasks compared with classical feature extraction techniques.

Title: Deterministic Random Bit Generators Based on Ascon for Embedded Systems

Authors: Abel C. H. Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02082
Pdf URL: https://arxiv.org/pdf/2512.02082
Copy Paste: [[2512.02082]] Deterministic Random Bit Generators Based on Ascon for Embedded Systems(https://arxiv.org/abs/2512.02082)
Keywords: secure, security
Abstract: As the Deterministic Random Bit Generator (DRBG) serves as a fundamental component in random number generation and cryptographic applications, its performance and security are particularly critical in resource-constrained embedded systems, where memory capacity and computational efficiency are limited. Establishing a high-performance and secure DRBG is therefore an important issue for embedded system design. Furthermore, the National Institute of Standards and Technology (NIST) established the Ascon lightweight cryptographic standard in August 2025, which is suitable for use in resource-limited embedded environments. Therefore, this study revises the DRBG standard and proposes three Ascon-driven constructions: the Ascon-Driven Hash-Based DRBG, the Ascon-Driven keyed-Hash Message Authentication Code (HMAC) DRBG, and the Ascon-Driven Counter-mode (CTR) DRBG. In the experiments, these methods are implemented on a Raspberry Pi platform. The experimental results demonstrate that the proposed approaches achieve higher computational efficiency and lower memory usage compared with existing DRBG implementations, making them suitable for deployment in embedded systems.

Title: A survey about Hidden Subgroup Problem from a mathematical and cryptographic perspective

Authors: Simone Dutto, Pietro Mercuri, Nadir Murru, Lorenzo Romano
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02087
Pdf URL: https://arxiv.org/pdf/2512.02087
Copy Paste: [[2512.02087]] A survey about Hidden Subgroup Problem from a mathematical and cryptographic perspective(https://arxiv.org/abs/2512.02087)
Keywords: security
Abstract: We provide a survey on the Hidden Subgroup Problem (HSP), which plays an important role in studying the security of public-key cryptosystems. We first review the abelian case, where Kitaev's algorithm yields an efficient quantum solution to the HSP, recalling how classical problems (such as order finding, integer factorization, and discrete logarithm) can be formulated as abelian HSP instances. We then examine the current state of the art for non-abelian HSP, where no general efficient quantum solution is known, focusing on some relevant groups including dihedral group (connected to the shortest vector problem), symmetric groups (connected to the graph isomorphism problem), and semidirect product constructions (connected, in a special case, to the code equivalence problem). We also describe the main techniques for addressing the HSP in non-abelian cases, namely Fourier sampling and the black-box approach. Throughout the paper, we highlight the mathematical notions required and exploited in this context, providing a cryptography-oriented perspective.

Title: Cross-View Topology-Aware Graph Representation Learning

Authors: Ahmet Sami Korkmaz, Selim Coskunuzer, Md Joshem Uddin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02130
Pdf URL: https://arxiv.org/pdf/2512.02130
Copy Paste: [[2512.02130]] Cross-View Topology-Aware Graph Representation Learning(https://arxiv.org/abs/2512.02130)
Keywords: robust
Abstract: Graph classification has gained significant attention due to its applications in chemistry, social networks, and bioinformatics. While Graph Neural Networks (GNNs) effectively capture local structural patterns, they often overlook global topological features that are critical for robust representation learning. In this work, we propose GraphTCL, a dual-view contrastive learning framework that integrates structural embeddings from GNNs with topological embeddings derived from persistent homology. By aligning these complementary views through a cross-view contrastive loss, our method enhances representation quality and improves classification performance. Extensive experiments on benchmark datasets, including TU and OGB molecular graphs, demonstrate that GraphTCL consistently outperforms state-of-the-art baselines. This study highlights the importance of topology-aware contrastive learning for advancing graph representation methods.

Title: Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI

Authors: Aaron Sandoval, Cody Rushing
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2512.02157
Pdf URL: https://arxiv.org/pdf/2512.02157
Copy Paste: [[2512.02157]] Factor(T,U): Factored Cognition Strengthens Monitoring of Untrusted AI(https://arxiv.org/abs/2512.02157)
Keywords: robust
Abstract: The field of AI Control seeks to develop robust control protocols, deployment safeguards for untrusted AI which may be intentionally subversive. However, existing protocols that rely on weaker monitors to detect unsafe behavior often fail on complex tasks beyond the monitor's comprehension. We develop control protocols based on factored cognition, in which a trusted model decomposes a task, an untrusted model solves each resultant child task in isolation, and the results are reassembled into a full solution. These protocols may improve safety by several means, such as by simplifying the context for monitors, or by obscuring vulnerabilities in the environment. We implement our protocols in the APPS coding setting and red team them against backdoor attempts from an adversarial GPT-4.1 Nano. We find that: (i) Adding factored cognition to a trusted monitoring protocol can boost safety from 41% to 63%; (ii) Safety improves because monitor performance improves; (iii) Factored cognition makes it no harder for capable LLMs to write backdoors in APPS. While our protocols show low usefulness in APPS, they hold promise for more complex tasks.

Title: FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Authors: Kevin David Hayes, Micah Goldblum, Vikash Sehwag, Gowthami Somepalli, Ashwinee Panda, Tom Goldstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02161
Pdf URL: https://arxiv.org/pdf/2512.02161
Copy Paste: [[2512.02161]] FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges(https://arxiv.org/abs/2512.02161)
Keywords: interpretability, generative
Abstract: Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium, SD3.5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

Title: RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation

Authors: Mansoor Ali, Maksim Richards, Gilberto Ochoa-Ruiz, Sharib Ali
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02188
Pdf URL: https://arxiv.org/pdf/2512.02188
Copy Paste: [[2512.02188]] RobustSurg: Tackling domain generalisation for out-of-distribution surgical scene segmentation(https://arxiv.org/abs/2512.02188)
Keywords: robust, segmentation
Abstract: While recent advances in deep learning for surgical scene segmentation have demonstrated promising results on single-centre and single-imaging modality data, these methods usually do not generalise to unseen distribution (i.e., from other centres) and unseen modalities. Current literature for tackling generalisation on out-of-distribution data and domain gaps due to modality changes has been widely researched but mostly for natural scene data. However, these methods cannot be directly applied to the surgical scenes due to limited visual cues and often extremely diverse scenarios compared to the natural scene data. Inspired by these works in natural scenes to push generalisability on OOD data, we hypothesise that exploiting the style and content information in the surgical scenes could minimise the appearances, making it less variable to sudden changes such as blood or imaging artefacts. This can be achieved by performing instance normalisation and feature covariance mapping techniques for robust and generalisable feature representations. Further, to eliminate the risk of removing salient feature representation associated with the objects of interest, we introduce a restitution module within the feature learning ResNet backbone that can enable the retention of useful task-relevant features. To tackle the lack of multiclass and multicentre data for surgical scene segmentation, we also provide a newly curated dataset that can be vital for addressing generalisability in this domain. Our proposed RobustSurg obtained nearly 23% improvement on the baseline DeepLabv3+ and from 10-32% improvement on the SOTA in terms of mean IoU score on an unseen centre HeiCholSeg dataset when trained on CholecSeg8K. Similarly, RobustSurg also obtained nearly 22% improvement over the baseline and nearly 11% improvement on a recent SOTA method for the target set of the EndoUDA polyp dataset.

Title: Enforcing Orderedness to Improve Feature Consistency

Authors: Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02194
Pdf URL: https://arxiv.org/pdf/2512.02194
Copy Paste: [[2512.02194]] Enforcing Orderedness to Improve Feature Consistency(https://arxiv.org/abs/2512.02194)
Keywords: interpretability
Abstract: Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.

Title: Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation

Authors: Miguel L. Martins, Miguel T. Coimbra, Francesco Renna
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02198
Pdf URL: https://arxiv.org/pdf/2512.02198
Copy Paste: [[2512.02198]] Multifractal Recalibration of Neural Networks for Medical Imaging Segmentation(https://arxiv.org/abs/2512.02198)
Keywords: segmentation
Abstract: Multifractal analysis has revealed regularities in many self-seeding phenomena, yet its use in modern deep learning remains limited. Existing end-to-end multifractal methods rely on heavy pooling or strong feature-space decimation, which constrain tasks such as semantic segmentation. Motivated by these limitations, we introduce two inductive priors: Monofractal and Multifractal Recalibration. These methods leverage relationships between the probability mass of the exponents and the multifractal spectrum to form statistical descriptions of encoder embeddings, implemented as channel-attention functions in convolutional networks. Using a U-Net-based framework, we show that multifractal recalibration yields substantial gains over a baseline equipped with other channel-attention mechanisms that also use higher-order statistics. Given the proven ability of multifractal analysis to capture pathological regularities, we validate our approach on three public medical-imaging datasets: ISIC18 (dermoscopy), Kvasir-SEG (endoscopy), and BUSI (ultrasound). Our empirical analysis also provides insights into the behavior of these attention layers. We find that excitation responses do not become increasingly specialized with encoder depth in U-Net architectures due to skip connections, and that their effectiveness may relate to global statistics of instance variability.

Title: WhAM: Towards A Translative Model of Sperm Whale Vocalization

Authors: Orr Paradise, Pranav Muralikrishnan, Liangyuan Chen, Hugo Flores García, Bryan Pardo, Roee Diamant, David F. Gruber, Shane Gero, Shafi Goldwasser
Subjects: cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2512.02206
Pdf URL: https://arxiv.org/pdf/2512.02206
Copy Paste: [[2512.02206]] WhAM: Towards A Translative Model of Sperm Whale Vocalization(https://arxiv.org/abs/2512.02206)
Keywords: transformer
Abstract: Sperm whales communicate in short sequences of clicks known as codas. We present WhAM (Whale Acoustics Model), the first transformer-based model capable of generating synthetic sperm whale codas from any audio prompt. WhAM is built by finetuning VampNet, a masked acoustic token model pretrained on musical audio, using 10k coda recordings collected over the past two decades. Through iterative masked token prediction, WhAM generates high-fidelity synthetic codas that preserve key acoustic features of the source recordings. We evaluate WhAM's synthetic codas using Fréchet Audio Distance and through perceptual studies with expert marine biologists. On downstream classification tasks including rhythm, social unit, and vowel classification, WhAM's learned representations achieve strong performance, despite being trained for generation rather than classification. Our code is available at this https URL

Title: InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages

Authors: Mamadou K. Keita, Sebastien Diarra, Christopher Homan, Seydou Diallo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02213
Pdf URL: https://arxiv.org/pdf/2512.02213
Copy Paste: [[2512.02213]] InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages(https://arxiv.org/abs/2512.02213)
Keywords: large language model
Abstract: Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.

Title: On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks

Authors: Benjamin K. Rosenzweig, Matthew W. Hahn
Subjects: cs.LG, q-bio.PE
Abstract URL: https://arxiv.org/abs/2512.02223
Pdf URL: https://arxiv.org/pdf/2512.02223
Copy Paste: [[2512.02223]] On the Approximation of Phylogenetic Distance Functions by Artificial Neural Networks(https://arxiv.org/abs/2512.02223)
Keywords: transformer
Abstract: Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early success on this task, these have been supplanted by Bayesian and maximum likelihood search procedures based on complex models of molecular evolution. In this work we describe minimal neural network architectures that can approximate classic phylogenetic distance functions and the properties required to learn distances under a variety of molecular evolutionary models. In contrast to model-based inference (and recently proposed model-free convolutional and transformer networks), these architectures have a small computational footprint and are scalable to large numbers of taxa and molecular characters. The learned distance functions generalize well and, given an appropriate training dataset, achieve results comparable to state-of-the art inference methods.

Title: See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02231
Pdf URL: https://arxiv.org/pdf/2512.02231
Copy Paste: [[2512.02231]] See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models(https://arxiv.org/abs/2512.02231)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers-not scenes-as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

Title: Lightweight Latent Reasoning for Narrative Tasks

Authors: Alexander Gurung, Nikolay Malkin, Mirella Lapata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02240
Pdf URL: https://arxiv.org/pdf/2512.02240
Copy Paste: [[2512.02240]] Lightweight Latent Reasoning for Narrative Tasks(https://arxiv.org/abs/2512.02240)
Keywords: large language model
Abstract: Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

Title: PhishSnap: Image-Based Phishing Detection Using Perceptual Hashing

Authors: Md Abdul Ahad Minhaz, Zannatul Zahan Meem, Md. Shohrab Hossain
Subjects: cs.CR, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02243
Pdf URL: https://arxiv.org/pdf/2512.02243
Copy Paste: [[2512.02243]] PhishSnap: Image-Based Phishing Detection Using Perceptual Hashing(https://arxiv.org/abs/2512.02243)
Keywords: security, privacy
Abstract: Phishing remains one of the most prevalent online threats, exploiting human trust to harvest sensitive credentials. Existing URL- and HTML-based detection systems struggle against obfuscation and visual deception. This paper presents \textbf{PhishSnap}, a privacy-preserving, on-device phishing detection system leveraging perceptual hashing (pHash). Implemented as a browser extension, PhishSnap captures webpage screenshots, computes visual hashes, and compares them against legitimate templates to identify visually similar phishing attempts. A \textbf{2024 dataset of 10,000 URLs} (70\%/20\%/10\% train/validation/test) was collected from PhishTank and Netcraft. Due to security takedowns, a subset of phishing pages was unavailable, reducing dataset diversity. The system achieved \textbf{0.79 accuracy}, \textbf{0.76 precision}, and \textbf{0.78 recall}, showing that visual similarity remains a viable anti-phishing measure. The entire inference process occurs locally, ensuring user privacy and minimal latency.

Title: DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models

Authors: Olivia Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02246
Pdf URL: https://arxiv.org/pdf/2512.02246
Copy Paste: [[2512.02246]] DETAIL Matters: Measuring the Impact of Prompt Specificity on Reasoning in Large Language Models(https://arxiv.org/abs/2512.02246)
Keywords: large language model
Abstract: Prompt design plays a critical role in the reasoning performance of large language models (LLMs), yet the impact of prompt specificity - how detailed or vague a prompt is - remains understudied. This paper introduces DETAIL, a framework for evaluating LLM performance across varying levels of prompt specificity. We generate multi-level prompts using GPT-4, quantify specificity via perplexity, and assess correctness using GPT-based semantic equivalence. Experiments on 30 novel reasoning tasks across GPT-4 and O3-mini reveal that specificity improves accuracy, especially for smaller models and procedural tasks. Our results highlight the need for adaptive prompting strategies and provide tools and data to support further research.

Title: CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering

Authors: Liangji Kong, Aditya Joshi, Sarvnaz Karimi
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.02251
Pdf URL: https://arxiv.org/pdf/2512.02251
Copy Paste: [[2512.02251]] CAIRNS: Balancing Readability and Scientific Accuracy in Climate Adaptation Question Answering(https://arxiv.org/abs/2512.02251)
Keywords: robust
Abstract: Climate adaptation strategies are proposed in response to climate change. They are practised in agriculture to sustain food production. These strategies can be found in unstructured data (for example, scientific literature from the Elsevier website) or structured (heterogeneous climate data via government APIs). We present Climate Adaptation question-answering with Improved Readability and Noted Sources (CAIRNS), a framework that enables experts -- farmer advisors -- to obtain credible preliminary answers from complex evidence sources from the web. It enhances readability and citation reliability through a structured ScholarGuide prompt and achieves robust evaluation via a consistency-weighted hybrid evaluator that leverages inter-model agreement with experts. Together, these components enable readable, verifiable, and domain-grounded question-answering without fine-tuning or reinforcement learning. Using a previously reported dataset of expert-curated question-answers, we show that CAIRNS outperforms the baselines on most of the metrics. Our thorough ablation study confirms the results on all metrics. To validate our LLM-based evaluation, we also report an analysis of correlations against human judgment.

Title: CVE Breadcrumbs: Tracking Vulnerabilities Through Versioned Apache Libraries

Authors: Derek Garcia, Briana Lee, Ibrahim Matar, David Rickards, Andrew Zilnicki
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2512.02259
Pdf URL: https://arxiv.org/pdf/2512.02259
Copy Paste: [[2512.02259]] CVE Breadcrumbs: Tracking Vulnerabilities Through Versioned Apache Libraries(https://arxiv.org/abs/2512.02259)
Keywords: secure, security
Abstract: The Apache Software Foundation (ASF) ecosystem underpins a vast portion of modern software infrastructure, powering widely used components such as Log4j, Tomcat, and Struts. However, the ubiquity of these libraries has made them prime targets for high-impact security vulnerabilities, as illustrated by incidents like Log4Shell. Despite their widespread adoption, Apache projects are not immune to recurring and severe security weaknesses. We conduct a historical analysis of the Apache ecosystem to follow the "breadcrumb trail of vulnerabilities" by compiling a comprehensive dataset of Common Vulnerabilities and Exposures (CVEs) and Common Weakness Enumerations (CWEs). We examine trends in exploit recurrence, disclosure timelines, and remediation practices. Our analysis is guided by four key research questions: (1) What are the most persistent and repeated CWEs in Apache libraries? (2) How long do CVEs persist before being addressed? (3) What is the delay between CVE introduction and official disclosure? and (4) How long after disclosure are CVEs remediated? We present a detailed timeline of vulnerability lifecycle stages across Apache libraries and offer insights to improve secure coding practices, vulnerability monitoring, and remediation strategies. Our contributions include a curated dataset covering 24,285 Apache libraries, 1,285 CVEs, and 157 CWEs, along with empirical findings and developer-focused recommendations.

Title: The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models

Authors: Joshua Wolff Anderson, Shyam Visweswaran
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2512.02265
Pdf URL: https://arxiv.org/pdf/2512.02265
Copy Paste: [[2512.02265]] The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models(https://arxiv.org/abs/2512.02265)
Keywords: fair, explainability
Abstract: Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate multiple model classes on the stability of Shapley-based rankings. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to jointly consider accuracy, fairness, and explainability in model assessment rather than in isolation.

Title: Spatiotemporal Pyramid Flow Matching for Climate Emulation

Authors: Jeremy Andrew Irvin, Jiaqi Han, Zikui Wang, Abdulaziz Alharbi, Yufei Zhao, Nomin-Erdene Bayarsaikhan, Daniele Visioni, Andrew Y. Ng, Duncan Watson-Parris
Subjects: cs.CV, cs.AI, cs.LG, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2512.02268
Pdf URL: https://arxiv.org/pdf/2512.02268
Copy Paste: [[2512.02268]] Spatiotemporal Pyramid Flow Matching for Climate Emulation(https://arxiv.org/abs/2512.02268)
Keywords: generative
Abstract: Generative models have the potential to transform the way we emulate Earth's changing climate. Previous generative approaches rely on weather-scale autoregression for climate emulation, but this is inherently slow for long climate horizons and has yet to demonstrate stable rollouts under nonstationary forcings. Here, we introduce Spatiotemporal Pyramid Flows (SPF), a new class of flow matching approaches that model data hierarchically across spatial and temporal scales. Inspired by cascaded video models, SPF partitions the generative trajectory into a spatiotemporal pyramid, progressively increasing spatial resolution to reduce computation and coupling each stage with an associated timescale to enable direct sampling at any temporal level in the pyramid. This design, together with conditioning each stage on prescribed physical forcings (e.g., greenhouse gases or aerosols), enables efficient, parallel climate emulation at multiple timescales. On ClimateBench, SPF outperforms strong flow matching baselines and pre-trained models at yearly and monthly timescales while offering fast sampling, especially at coarser temporal levels. To scale SPF, we curate ClimateSuite, the largest collection of Earth system simulations to date, comprising over 33,000 simulation-years across ten climate models and the first dataset to include simulations of climate interventions. We find that the scaled SPF model demonstrates good generalization to held-out scenarios across climate models. Together, SPF and ClimateSuite provide a foundation for accurate, efficient, probabilistic climate emulation across temporal scales and realistic future scenarios. Data and code is publicly available at this https URL .

Title: Progressive Image Restoration via Text-Conditioned Video Generation

Authors: Peng Kang, Xijun Wang, Yu Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02273
Pdf URL: https://arxiv.org/pdf/2512.02273
Copy Paste: [[2512.02273]] Progressive Image Restoration via Text-Conditioned Video Generation(https://arxiv.org/abs/2512.02273)
Keywords: robust, interpretability
Abstract: Recent text-to-video models have demonstrated strong temporal generation capabilities, yet their potential for image restoration remains underexplored. In this work, we repurpose CogVideo for progressive visual restoration tasks by fine-tuning it to generate restoration trajectories rather than natural video motion. Specifically, we construct synthetic datasets for super-resolution, deblurring, and low-light enhancement, where each sample depicts a gradual transition from degraded to clean frames. Two prompting strategies are compared: a uniform text prompt shared across all samples, and a scene-specific prompting scheme generated via LLaVA multi-modal LLM and refined with ChatGPT. Our fine-tuned model learns to associate temporal progression with restoration quality, producing sequences that improve perceptual metrics such as PSNR, SSIM, and LPIPS across frames. Extensive experiments show that CogVideo effectively restores spatial detail and illumination consistency while maintaining temporal coherence. Moreover, the model generalizes to real-world scenarios on the ReLoBlur dataset without additional training, demonstrating strong zero-shot robustness and interpretability through temporal restoration.

Title: HOT Protocol

Authors: Peter Volnov, Georgii Kuksa, Andrey Zhevlakov
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02287
Pdf URL: https://arxiv.org/pdf/2512.02287
Copy Paste: [[2512.02287]] HOT Protocol(https://arxiv.org/abs/2512.02287)
Keywords: secure, security
Abstract: HOT Protocol provides the infrastructure that allows smart contracts to securely own and manage private keys. The Multi-Party Computation (MPC) Network manages signing keys. By running an MPC node inside a Trusted Execution Environment (TEE), the protocol achieves stronger security guarantees while lowering economic requirements for participants. The NEAR Protocol provides a decentralized and efficient state layer. Key management can be integrated with any smart contract across Stellar, TON, Solana, and EVM-compatible networks.

Title: Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation

Authors: Andre Juarez, Luis Salsavilca, Frida Coaquira, Celso Gonzales
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02290
Pdf URL: https://arxiv.org/pdf/2512.02290
Copy Paste: [[2512.02290]] Enhancing Cross Domain SAR Oil Spill Segmentation via Morphological Region Perturbation and Synthetic Label-to-SAR Generation(https://arxiv.org/abs/2512.02290)
Keywords: generative, segmentation
Abstract: Deep learning models for SAR oil spill segmentation often fail to generalize across regions due to differences in sea-state, backscatter statistics, and slick morphology, a limitation that is particularly severe along the Peruvian coast where labeled Sentinel-1 data remain scarce. To address this problem, we propose \textbf{MORP--Synth}, a two-stage synthetic augmentation framework designed to improve transfer from Mediterranean to Peruvian conditions. Stage~A applies Morphological Region Perturbation, a curvature guided label space method that generates realistic geometric variations of oil and look-alike regions. Stage~B renders SAR-like textures from the edited masks using a conditional generative INADE model. We compile a Peruvian dataset of 2112 labeled 512$\times$512 patches from 40 Sentinel-1 scenes (2014--2024), harmonized with the Mediterranean CleanSeaNet benchmark, and evaluate seven segmentation architectures. Models pretrained on Mediterranean data degrade from 67.8\% to 51.8\% mIoU on the Peruvian domain; MORP--Synth improves performance up to +6 mIoU and boosts minority-class IoU (+10.8 oil, +14.6 look-alike).

Title: Quantum Vanguard: Server Optimized Privacy Fortified Federated Intelligence for Future Vehicles

Authors: Dev Gurung, Shiva Raj Pokhrel
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02301
Pdf URL: https://arxiv.org/pdf/2512.02301
Copy Paste: [[2512.02301]] Quantum Vanguard: Server Optimized Privacy Fortified Federated Intelligence for Future Vehicles(https://arxiv.org/abs/2512.02301)
Keywords: secure, security, privacy, defense, federate
Abstract: This work presents vQFL (vehicular Quantum Federated Learning), a new framework that leverages quantum machine learning techniques to tackle key privacy and security issues in autonomous vehicular networks. Furthermore, we propose a server-side adapted fine-tuning method, ft-VQFL,to achieve enhanced and more resilient performance. By integrating quantum federated learning with differential privacy and quantum key distribution (QKD), our quantum vanguard approach creates a multi-layered defense against both classical and quantum threats while preserving model utility. Extensive experimentation with industry-standard datasets (KITTI, Waymo, and nuScenes) demonstrates that vQFL maintains accuracy comparable to standard QFL while significantly improving privacy guaranties and communication security. Our implementation using various quantum models (VQC, QCNN, and SamplerQNN) reveals minimal performance overhead despite the added security measures. This work establishes a crucial foundation for quantum-resistant autonomous vehicle systems that can operate securely in the post-quantum era while efficiently processing the massive data volumes (20-40TB/day per vehicle) generated by modern autonomous fleets. The modular design of the framework allows for seamless integration with existing vehicular networks, positioning vQFL as an essential component for future intelligent transportation infrastructure.

Title: Training Dynamics of Learning 3D-Rotational Equivariance

Authors: Max W. Shen, Ewa Nowara, Michael Maser, Kyunghyun Cho
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2512.02303
Pdf URL: https://arxiv.org/pdf/2512.02303
Copy Paste: [[2512.02303]] Training Dynamics of Learning 3D-Rotational Equivariance(https://arxiv.org/abs/2512.02303)
Keywords: robust
Abstract: While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation to 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to $\leq$2\% held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant ``efficiency gap'' is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.

Title: When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

Authors: Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02304
Pdf URL: https://arxiv.org/pdf/2512.02304
Copy Paste: [[2512.02304]] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers(https://arxiv.org/abs/2512.02304)
Keywords: large language model
Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.

Title: COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers

Authors: Junyu Wang, Changjia Zhu, Yuanbo Zhou, Lingyao Li, Xu He, Junjie Xiong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02318
Pdf URL: https://arxiv.org/pdf/2512.02318
Copy Paste: [[2512.02318]] COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers(https://arxiv.org/abs/2512.02318)
Keywords: security, defense, attack, large language model
Abstract: This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. We conclude by discussing implications for platform operators deploying CAPTCHA as part of their abuse-mitigation this http URL Availability (this https URL).

Title: LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

Authors: Yuanhe Zhang, Weiliu Wang, Zhenhong Zhou, Kun Wang, Jie Zhang, Li Sun, Yang Liu, Sen Su
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2512.02321
Pdf URL: https://arxiv.org/pdf/2512.02321
Copy Paste: [[2512.02321]] LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems(https://arxiv.org/abs/2512.02321)
Keywords: security, attack, large language model
Abstract: Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in reasoning, planning, and tool usage. The recently proposed Model Context Protocol (MCP) has emerged as a unifying framework for integrating external tools into agent systems, enabling a thriving open ecosystem of community-built functionalities. However, the openness and composability that make MCP appealing also introduce a critical yet overlooked security assumption -- implicit trust in third-party tool providers. In this work, we identify and formalize a new class of attacks that exploit this trust boundary without violating explicit permissions. We term this new attack vector implicit toxicity, where malicious behaviors occur entirely within the allowed privilege scope. We propose LeechHijack, a Latent Embedded Exploit for Computation Hijacking, in which an adversarial MCP tool covertly expropriates the agent's computational resources for unauthorized workloads. LeechHijack operates through a two-stage mechanism: an implantation stage that embeds a benign-looking backdoor in a tool, and an exploitation stage where the backdoor activates upon predefined triggers to establish a command-and-control channel. Through this channel, the attacker injects additional tasks that the agent executes as if they were part of its normal workflow, effectively parasitizing the user's compute budget. We implement LeechHijack across four major LLM families. Experiments show that LeechHijack achieves an average success rate of 77.25%, with a resource overhead of 18.62% compared to the baseline. This study highlights the urgent need for computational provenance and resource attestation mechanisms to safeguard the emerging MCP ecosystem.

Title: Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation

Authors: Kentaro Kubo, Hayato Goto
Subjects: cs.LG, quant-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2512.02323
Pdf URL: https://arxiv.org/pdf/2512.02323
Copy Paste: [[2512.02323]] Unlocking the Power of Boltzmann Machines by Parallelizable Sampler and Efficient Temperature Estimation(https://arxiv.org/abs/2512.02323)
Keywords: generative
Abstract: Boltzmann machines (BMs) are powerful energy-based generative models, but their heavy training cost has largely confined practical use to Restricted BMs (RBMs) trained with an efficient learning method called contrastive divergence. More accurate learning typically requires Markov chain Monte Carlo (MCMC) Boltzmann sampling, but it is time-consuming due to the difficulty of parallelization for more expressive models. To address this limitation, we first propose a new Boltzmann sampler inspired by a quantum-inspired combinatorial optimization called simulated bifurcation (SB). This SB-inspired approach, which we name Langevin SB (LSB), enables parallelized sampling while maintaining accuracy comparable to MCMC. Furthermore, this is applicable not only to RBMs but also to BMs with general couplings. However, LSB cannot control the inverse temperature of the output Boltzmann distribution, which hinders learning and degrades performance. To overcome this limitation, we also developed an efficient method for estimating the inverse temperature during the learning process, which we call conditional expectation matching (CEM). By combining LSB and CEM, we establish an efficient learning framework for BMs with greater expressive power than RBMs. We refer to this framework as sampler-adaptive learning (SAL). SAL opens new avenues for energy-based generative modeling beyond RBMs.

Title: Retrieval-Augmented Memory for Online Learning

Authors: Wenzhang Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02333
Pdf URL: https://arxiv.org/pdf/2512.02333
Copy Paste: [[2512.02333]] Retrieval-Augmented Memory for Online Learning(https://arxiv.org/abs/2512.02333)
Keywords: robust
Abstract: Retrieval-augmented models couple parametric predictors with non-parametric memories, but their use in streaming supervised learning with concept drift is not well understood. We study online classification in non-stationary environments and propose Retrieval-Augmented Memory for Online Learning (RAM-OL), a simple extension of stochastic gradient descent that maintains a small buffer of past examples. At each time step, RAM-OL retrieves a few nearest neighbours of the current input in the hidden representation space and updates the model jointly on the current example and the retrieved neighbours. We compare a naive replay variant with a gated replay variant that constrains neighbours using a time window, similarity thresholds, and gradient reweighting, in order to balance fast reuse of relevant past data against robustness to outdated regimes. From a theoretical perspective, we interpret RAM-OL under a bounded drift model and discuss how retrieval can reduce adaptation cost and improve regret constants when patterns recur over time. Empirically, we instantiate RAM-OL on a simple online multilayer perceptron and evaluate it on three real-world data streams derived from electricity pricing, electricity load, and airline delay data. On strongly and periodically drifting streams, RAM-OL improves prequential accuracy by up to about seven percentage points and greatly reduces variance across random seeds, while on a noisy airline stream the gated variant closely matches the purely online baseline. These results show that retrieval-augmented memory is a practical and robust tool for online learning under concept drift.

Title: SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

Authors: Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02337
Pdf URL: https://arxiv.org/pdf/2512.02337
Copy Paste: [[2512.02337]] SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification(https://arxiv.org/abs/2512.02337)
Keywords: large language model
Abstract: Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series. Experimental results show that SpecPV achieves up to 6x decoding speedup over standard autoregressive decoding with minor degradation.

Title: Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision

Authors: Chenshuang Zhang, Kang Zhang, Joon Son Chung, In So Kweon, Junmo Kim, Chengzhi Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02339
Pdf URL: https://arxiv.org/pdf/2512.02339
Copy Paste: [[2512.02339]] Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision(https://arxiv.org/abs/2512.02339)
Keywords: robust, diffusion
Abstract: Distinguishing visually similar objects by their motion remains a critical challenge in computer vision. Although supervised trackers show promise, contemporary self-supervised trackers struggle when visual cues become ambiguous, limiting their scalability and generalization without extensive labeled data. We find that pre-trained video diffusion models inherently learn motion representations suitable for tracking without task-specific training. This ability arises because their denoising process isolates motion in early, high-noise stages, distinct from later appearance refinement. Capitalizing on this discovery, our self-supervised tracker significantly improves performance in distinguishing visually similar objects, an underexplored failure point for existing methods. Our method achieves up to a 6-point improvement over recent self-supervised approaches on established benchmarks and our newly introduced tests focused on tracking visually similar items. Visualizations confirm that these diffusion-derived motion representations enable robust tracking of even identical objects across challenging viewpoint changes and deformations.

Title: TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction

Authors: Fengyi Zhang, Tianjun Zhang, Kasra Khosoussi, Zheng Zhang, Zi Huang, Yadan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02341
Pdf URL: https://arxiv.org/pdf/2512.02341
Copy Paste: [[2512.02341]] TALO: Pushing 3D Vision Foundation Models Towards Globally Consistent Online Reconstruction(https://arxiv.org/abs/2512.02341)
Keywords: robust
Abstract: 3D vision foundation models have shown strong generalization in reconstructing key 3D attributes from uncalibrated images through a single feed-forward pass. However, when deployed in online settings such as driving scenarios, predictions are made over temporal windows, making it non-trivial to maintain consistency across time. Recent strategies align consecutive predictions by solving global transformation, yet our analysis reveals their fundamental limitations in assumption validity, local alignment scope, and robustness under noisy geometry. In this work, we propose a higher-DOF and long-term alignment framework based on Thin Plate Spline, leveraging globally propagated control points to correct spatially varying inconsistencies. In addition, we adopt a point-agnostic submap registration design that is inherently robust to noisy geometry predictions. The proposed framework is fully plug-and-play, compatible with diverse 3D foundation models and camera configurations (e.g., monocular or surround-view). Extensive experiments demonstrate that our method consistently yields more coherent geometry and lower trajectory errors across multiple datasets, backbone models, and camera setups, highlighting its robustness and generality. Codes are publicly available at \href{this https URL}{this https URL}.

Title: A multi-weight self-matching visual explanation for cnns on sar images

Authors: Siyuan Sun, Yongping Zhang, Hongcheng Zeng, Yamin Wang, Wei Yang, Wanting Yang, Jie Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02344
Pdf URL: https://arxiv.org/pdf/2512.02344
Copy Paste: [[2512.02344]] A multi-weight self-matching visual explanation for cnns on sar images(https://arxiv.org/abs/2512.02344)
Keywords: interpretability
Abstract: In recent years, convolutional neural networks (CNNs) have achieved significant success in various synthetic aperture radar (SAR) tasks. However, the complexity and opacity of their internal mechanisms hinder the fulfillment of high-reliability requirements, thereby limiting their application in SAR. Improving the interpretability of CNNs is thus of great importance for their development and deployment in SAR. In this paper, a visual explanation method termed multi-weight self-matching class activation mapping (MS-CAM) is proposed. MS-CAM matches SAR images with the feature maps and corresponding gradients extracted by the CNN, and combines both channel-wise and element-wise weights to visualize the decision basis learned by the model in SAR images. Extensive experiments conducted on a self-constructed SAR target classification dataset demonstrate that MS-CAM more accurately highlights the network's regions of interest and captures detailed target feature information, thereby enhancing network interpretability. Furthermore, the feasibility of applying MS-CAM to weakly-supervised obiect localization is validated. Key factors affecting localization accuracy, such as pixel thresholds, are analyzed in depth to inform future work.

Title: FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data

Authors: Nan Qiao, Sheng Yue, Ju Ren, Yaoxue Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02350
Pdf URL: https://arxiv.org/pdf/2512.02350
Copy Paste: [[2512.02350]] FOVA: Offline Federated Reinforcement Learning with Mixed-Quality Data(https://arxiv.org/abs/2512.02350)
Keywords: federate
Abstract: Offline Federated Reinforcement Learning (FRL), a marriage of federated learning and offline reinforcement learning, has attracted increasing interest recently. Albeit with some advancement, we find that the performance of most existing offline FRL methods drops dramatically when provided with mixed-quality data, that is, the logging behaviors (offline data) are collected by policies with varying qualities across clients. To overcome this limitation, this paper introduces a new vote-based offline FRL framework, named FOVA. It exploits a \emph{vote mechanism} to identify high-return actions during local policy evaluation, alleviating the negative effect of low-quality behaviors from diverse local learning policies. Besides, building on advantage-weighted regression (AWR), we construct consistent local and global training objectives, significantly enhancing the efficiency and stability of FOVA. Further, we conduct an extensive theoretical analysis and rigorously show that the policy learned by FOVA enjoys strict policy improvement over the behavioral policy. Extensive experiments corroborate the significant performance gains of our proposed algorithm over existing baselines on widely used benchmarks.

Title: VACoT: Rethinking Visual Data Augmentation with VLMs

Authors: Zhengzhuo Xu, Chong Sun, SiNan Du, Chen Li, Jing Lyu, Chun Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02361
Pdf URL: https://arxiv.org/pdf/2512.02361
Copy Paste: [[2512.02361]] VACoT: Rethinking Visual Data Augmentation with VLMs(https://arxiv.org/abs/2512.02361)
Keywords: robust
Abstract: While visual data augmentation remains a cornerstone for training robust vision models, it has received limited attention in visual language models (VLMs), which predominantly rely on large-scale real data acquisition or synthetic diversity. Consequently, they may struggle with basic perception tasks that conventional models handle reliably. Given the substantial cost of pre-training and fine-tuning VLMs, continue training on augmented data yields limited and diminishing returns. In this paper, we present Visual Augmentation Chain-of-Thought (VACoT), a framework that dynamically invokes image augmentations during model inference. By incorporating post-hoc transformations such as denoising, VACoT substantially improves robustness on challenging and out-of-distribution inputs, especially in OCR-related adversarial scenarios. Distinct from prior approaches limited to local cropping, VACoT integrates a structured collection of general visual augmentations, broadening the query image views while reducing training complexity and computational overhead with efficient agentic reinforcement learning. We propose a conditional reward scheme that encourages necessary augmentation while penalizing verbose responses, ensuring concise and effective reasoning in perception tasks. We demonstrate the superiority of VACoT with extensive experiments on 13 perception benchmarks and further introduce AdvOCR to highlight the generalization benefits of post-hoc visual augmentations in adversarial scenarios.

Title: Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering

Authors: Lei Fu, Xiang Chen, Kaige Gao Xinyue Huang, Kejian Tong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02363
Pdf URL: https://arxiv.org/pdf/2512.02363
Copy Paste: [[2512.02363]] Memory-Augmented Knowledge Fusion with Safety-Aware Decoding for Domain-Adaptive Question Answering(https://arxiv.org/abs/2512.02363)
Keywords: large language model
Abstract: Domain-specific question answering (QA) systems for services face unique challenges in integrating heterogeneous knowledge sources while ensuring both accuracy and safety. Existing large language models often struggle with factual consistency and context alignment in sensitive domains such as healthcare policies and government welfare. In this work, we introduce Knowledge-Aware Reasoning and Memory-Augmented Adaptation (KARMA), a novel framework designed to enhance QA performance in care scenarios. KARMA incorporates a dual-encoder architecture to fuse structured and unstructured knowledge sources, a gated memory unit to dynamically regulate external knowledge integration, and a safety-aware controllable decoder that mitigates unsafe outputs using safety classification and guided generation techniques. Extensive experiments on a proprietary QA dataset demonstrate that KARMA outperforms strong baselines in both answer quality and safety. This study offers a comprehensive solution for building trustworthy and adaptive QA systems in service contexts.

Title: SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains

Authors: Qingmei Li, Yang Zhang, Peifeng Zhang, Haohuan Fu, Juepeng Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02369
Pdf URL: https://arxiv.org/pdf/2512.02369
Copy Paste: [[2512.02369]] SAGE: Style-Adaptive Generalization for Privacy-Constrained Semantic Segmentation Across Domains(https://arxiv.org/abs/2512.02369)
Keywords: security, privacy, segmentation
Abstract: Domain generalization for semantic segmentation aims to mitigate the degradation in model performance caused by domain shifts. However, in many real-world scenarios, we are unable to access the model parameters and architectural details due to privacy concerns and security constraints. Traditional fine-tuning or adaptation is hindered, leading to the demand for input-level strategies that can enhance generalization without modifying model weights. To this end, we propose a \textbf{S}tyle-\textbf{A}daptive \textbf{GE}neralization framework (\textbf{SAGE}), which improves the generalization of frozen models under privacy constraints. SAGE learns to synthesize visual prompts that implicitly align feature distributions across styles instead of directly fine-tuning the backbone. Specifically, we first utilize style transfer to construct a diverse style representation of the source domain, thereby learning a set of style characteristics that can cover a wide range of visual features. Then, the model adaptively fuses these style cues according to the visual context of each input, forming a dynamic prompt that harmonizes the image appearance without touching the interior of the model. Through this closed-loop design, SAGE effectively bridges the gap between frozen model invariance and the diversity of unseen domains. Extensive experiments on five benchmark datasets demonstrate that SAGE achieves competitive or superior performance compared to state-of-the-art methods under privacy constraints and outperforms full fine-tuning baselines in all settings.

Title: Reproducing and Extending RaDelft 4D Radar with Camera-Assisted Labels

Authors: Kejia Hu, Mohammed Alsakabi, John M. Dolan, Ozan K. Tonguz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02394
Pdf URL: https://arxiv.org/pdf/2512.02394
Copy Paste: [[2512.02394]] Reproducing and Extending RaDelft 4D Radar with Camera-Assisted Labels(https://arxiv.org/abs/2512.02394)
Keywords: robust, segmentation
Abstract: Recent advances in 4D radar highlight its potential for robust environment perception under adverse conditions, yet progress in radar semantic segmentation remains constrained by the scarcity of open source datasets and labels. The RaDelft data set, although seminal, provides only LiDAR annotations and no public code to generate radar labels, limiting reproducibility and downstream research. In this work, we reproduce the numerical results of the RaDelft group and demonstrate that a camera-guided radar labeling pipeline can generate accurate labels for radar point clouds without relying on human annotations. By projecting radar point clouds into camera-based semantic segmentation and applying spatial clustering, we create labels that significantly enhance the accuracy of radar labels. These results establish a reproducible framework that allows the research community to train and evaluate the labeled 4D radar data. In addition, we study and quantify how different fog levels affect the radar labeling performance.

Title: AtomGraph: Tackling Atomicity Violation in Smart Contracts using Multimodal GCNs

Authors: Xiaoqi Li, Zongwei Li, Wenkai Li, Zeng Zhang, Lei Xie
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02399
Pdf URL: https://arxiv.org/pdf/2512.02399
Copy Paste: [[2512.02399]] AtomGraph: Tackling Atomicity Violation in Smart Contracts using Multimodal GCNs(https://arxiv.org/abs/2512.02399)
Keywords: security
Abstract: Smart contracts are a core component of blockchain technology and are widely deployed across various scenarios. However, atomicity violations have become a potential security risk. Existing analysis tools often lack the precision required to detect these issues effectively. To address this challenge, we introduce AtomGraph, an automated framework designed for detecting atomicity violations. This framework leverages Graph Convolutional Networks (GCN) to identify atomicity violations through multimodal feature learning and fusion. Specifically, driven by a collaborative learning mechanism, the model simultaneously learns from two heterogeneous modalities: extracting structural topological features from the contract's Control Flow Graph (CFG) and uncovering deep semantics from its opcode sequence. We designed an adaptive weighted fusion mechanism to dynamically adjust the weights of features from each modality to achieve optimal feature fusion. Finally, GCN detects graph-level atomicity violation on the contract. Comprehensive experimental evaluations demonstrate that AtomGraph achieves 96.88% accuracy and 96.97% F1 score, outperforming existing tools. Furthermore, compared to the concatenation fusion model, AtomGraph improves the F1 score by 6.4%, proving its potential in smart contract security detection.

Title: TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models

Authors: Yunchao Wang, Guodao Sun, Zihang Fu, Zhehao Liu, Kaixing Du, Haidong Gao, Ronghua Liang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2512.02402
Pdf URL: https://arxiv.org/pdf/2512.02402
Copy Paste: [[2512.02402]] TaleFrame: An Interactive Story Generation System with Fine-Grained Control and Large Language Models(https://arxiv.org/abs/2512.02402)
Keywords: large language model
Abstract: With the advancement of natural language generation (NLG) technologies, creative story generation systems have gained increasing attention. However, current systems often fail to accurately translate user intent into satisfactory story outputs due to a lack of fine-grained control and unclear input specifications, limiting their applicability. To address this, we propose TaleFrame, a system that combines large language models (LLMs) with human-computer interaction (HCI) to generate stories through structured information, enabling precise control over the generation process. The innovation of TaleFrame lies in decomposing the story structure into four basic units: entities, events, relationships, and story outline. We leverage the Tinystories dataset, parsing and constructing a preference dataset consisting of 9,851 JSON-formatted entries, which is then used to fine-tune a local Llama model. By employing this JSON2Story approach, structured data is transformed into coherent stories. TaleFrame also offers an intuitive interface that supports users in creating and editing entities and events and generates stories through the structured framework. Users can control these units through simple interactions (e.g., drag-and-drop, attach, and connect), thus influencing the details and progression of the story. The generated stories can be evaluated across seven dimensions (e.g., creativity, structural integrity), with the system providing suggestions for refinement based on these evaluations. Users can iteratively adjust the story until a satisfactory result is achieved. Finally, we conduct quantitative evaluation and user studies that demonstrate the usefulness of TaleFrame. Dataset available at this https URL.

Title: ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity

Authors: Hongxiang Liu, Zhifang Deng, Tong Pu, Shengli Lu
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2512.02403
Pdf URL: https://arxiv.org/pdf/2512.02403
Copy Paste: [[2512.02403]] ESACT: An End-to-End Sparse Accelerator for Compute-Intensive Transformers via Local Similarity(https://arxiv.org/abs/2512.02403)
Keywords: transformer
Abstract: Transformers, composed of QKV generation, attention computation, and FFNs, have become the dominant model across various domains due to their outstanding performance. However, their high computational cost hinders efficient hardware deployment. Sparsity offers a promising solution, yet most existing accelerators exploit only intra-row sparsity in attention, while few consider inter-row sparsity. Approaches leveraging inter-row sparsity often rely on costly global similarity estimation, which diminishes the acceleration benefits of sparsity, and typically apply sparsity to only one or two transformer components. Through careful analysis of the attention distribution and computation flow, we observe that local similarity allows end-to-end sparse acceleration with lower computational overhead. Motivated by this observation, we propose ESACT, an end-to-end sparse accelerator for compute-intensive Transformers. ESACT centers on the Sparsity Prediction with Local Similarity (SPLS) mechanism, which leverages HLog quantization to accurately predict local attention sparsity prior to QK generation, achieving efficient sparsity across all transformer components. To support efficient hardware realization, we introduce three architectural innovations. Experimental results on 26 benchmarks demonstrate that SPLS reduces total computation by 52.03% with less than 1% accuracy loss. ESACT achieves an end-to-end energy efficiency of 3.29 TOPS/W, and improves attention-level energy efficiency by 2.95x and 2.26x over SOTA attention accelerators SpAtten and Sanger, respectively.

Title: WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

Authors: Anoop Cherian, River Doyle, Eyal Ben-Dov, Suhas Lohit, Kuan-Chuan Peng
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02405
Pdf URL: https://arxiv.org/pdf/2512.02405
Copy Paste: [[2512.02405]] WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate(https://arxiv.org/abs/2512.02405)
Keywords: robust, large language model
Abstract: Recent large language models (LLMs) are trained on diverse corpora and tasks, leading them to develop complementary strengths. Multi-agent debate (MAD) has emerged as a popular way to leverage these strengths for robust reasoning, though it has mostly been applied to language-only tasks, leaving its efficacy on multimodal problems underexplored. In this paper, we study MAD for solving vision-and-language reasoning problems. Our setup enables generalizing the debate protocol with heterogeneous experts that possess single- and multi-modal capabilities. To this end, we present Weighted Iterative Society-of-Experts (WISE), a generalized and modular MAD framework that partitions the agents into Solvers, that generate solutions, and Reflectors, that verify correctness, assign weights, and provide natural language feedback. To aggregate the agents' solutions across debate rounds, while accounting for variance in their responses and the feedback weights, we present a modified Dawid-Skene algorithm for post-processing that integrates our two-stage debate model. We evaluate WISE on SMART-840, VisualPuzzles, EvoChart-QA, and a new SMART-840++ dataset with programmatically generated problem instances of controlled difficulty. Our results show that WISE consistently improves accuracy by 2-7% over the state-of-the-art MAD setups and aggregation methods across diverse multimodal tasks and LLM configurations.

Title: MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture

Authors: Dmitriy Parashchuk, Alexey Kapshitskiy, Yuriy Karyakin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02413
Pdf URL: https://arxiv.org/pdf/2512.02413
Copy Paste: [[2512.02413]] MitUNet: Enhancing Floor Plan Recognition using a Hybrid Mix-Transformer and U-Net Architecture(https://arxiv.org/abs/2512.02413)
Keywords: robust, transformer, segmentation
Abstract: Automatic 3D reconstruction of indoor spaces from 2D floor plans requires high-precision semantic segmentation of structural elements, particularly walls. However, existing methods optimized for standard metrics often struggle to detect thin structural components and yield masks with irregular boundaries, lacking the geometric precision required for subsequent vectorization. To address this issue, we introduce MitUNet, a hybrid neural network architecture specifically designed for wall segmentation tasks in the context of 3D modeling. In MitUNet, we utilize a hierarchical Mix-Transformer encoder to capture global context and a U-Net decoder enhanced with scSE attention blocks for precise boundary recovery. Furthermore, we propose an optimization strategy based on the Tversky loss function to effectively balance precision and recall. By fine-tuning the hyperparameters of the loss function, we prioritize the suppression of false positive noise along wall boundaries while maintaining high sensitivity to thin structures. Our experiments on the public CubiCasa5k dataset and a proprietary regional dataset demonstrate that the proposed approach ensures the generation of structurally correct masks with high boundary accuracy, outperforming standard single-task models. MitUNet provides a robust tool for data preparation in automated 3D reconstruction pipelines.

Title: Characterizing Cyber Attacks against Space Infrastructures with Missing Data: Framework and Case Study

Authors: Ekzhin Ear, Jose Luis Castanon Remy, Caleb Chang, Qiren Que, Antonia Feffer, Shouhuai Xu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02414
Pdf URL: https://arxiv.org/pdf/2512.02414
Copy Paste: [[2512.02414]] Characterizing Cyber Attacks against Space Infrastructures with Missing Data: Framework and Case Study(https://arxiv.org/abs/2512.02414)
Keywords: security, protect, attack
Abstract: Cybersecurity of space infrastructures is an emerging topic, despite space-related cybersecurity incidents occurring as early as 1977 (i.e., hijacking of a satellite transmission signal). There is no single dataset that documents cyber attacks against space infrastructures that have occurred in the past; instead, these incidents are often scattered in media reports while missing many details, which we dub the missing-data problem. Nevertheless, even ``low-quality'' datasets containing such reports would be extremely valuable because of the dearth of space cybersecurity data and the sensitivity of space infrastructures which are often restricted from disclosure by governments. This prompts a research question: How can we characterize real-world cyber attacks against space infrastructures? In this paper, we address the problem by proposing a framework, including metrics, while also addressing the missing-data problem by leveraging methodologies such as the Space Attack Research and Tactic Analysis (SPARTA) and the Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) to ``extrapolate'' the missing data in a principled fashion. We show how the extrapolated data can be used to reconstruct ``hypothetical but plausible'' space cyber kill chains and space cyber attack campaigns that have occurred in practice. To show the usefulness of the framework, we extract data for 108 cyber attacks against space infrastructures and show how to extrapolate this ``low-quality'' dataset containing missing information to derive 6,206 attack technique-level space cyber kill chains. Our findings include: cyber attacks against space infrastructures are getting increasingly sophisticated; successful protection of the link segment between the space and user segments could have thwarted nearly half of the 108 attacks. We will make our dataset available.

Title: Leveraging Large Language Models to Bridge On-chain and Off-chain Transparency in Stablecoins

Authors: Yuexin Xiang, Yuchen Lei, SM Mahir Shazeed Rish, Yuanzhe Zhang, Qin Wang, Tsz Hon Yuen, Jiangshan Yu
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02418
Pdf URL: https://arxiv.org/pdf/2512.02418
Copy Paste: [[2512.02418]] Leveraging Large Language Models to Bridge On-chain and Off-chain Transparency in Stablecoins(https://arxiv.org/abs/2512.02418)
Keywords: large language model
Abstract: Stablecoins such as USDT and USDC aspire to peg stability by coupling issuance controls with reserve attestations. In practice, however, the transparency is split across two worlds: verifiable on-chain traces and off-chain disclosures locked in unstructured text that are unconnected. We introduce a large language model (LLM)-based automated framework that bridges these two dimensions by aligning on-chain issuance data with off-chain disclosure statements. First, we propose an integrative framework using LLMs to capture and analyze on- and off-chain data through document parsing and semantic alignment, extracting key financial indicators from issuer attestations and mapping them to corresponding on-chain metrics. Second, we integrate multi-chain issuance records and disclosure documents within a model context protocol (MCP) framework that standardizes LLMs access to both quantitative market data and qualitative disclosure narratives. This framework enables unified retrieval and contextual alignment across heterogeneous stablecoin information sources and facilitates consistent analysis. Third, we demonstrate the capability of LLMs to operate across heterogeneous data modalities in blockchain analytics, quantifying discrepancies between reported and observed circulation and examining their implications for cross-chain transparency and price dynamics. Our findings reveal systematic gaps between disclosed and verifiable data, showing that LLM-assisted analysis enhances cross-modal transparency and supports automated, data-driven auditing in decentralized finance (DeFi).

Title: WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Authors: Woongyeong Yeo, Kangsan Kim, Jaehong Yoon, Sung Ju Hwang
Subjects: cs.CV, cs.AI, cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02425
Pdf URL: https://arxiv.org/pdf/2512.02425
Copy Paste: [[2512.02425]] WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning(https://arxiv.org/abs/2512.02425)
Keywords: large language model
Abstract: Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

Title: LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework

Authors: Daeyoung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02437
Pdf URL: https://arxiv.org/pdf/2512.02437
Copy Paste: [[2512.02437]] LightHCG: a Lightweight yet powerful HSIC Disentanglement based Causal Glaucoma Detection Model framework(https://arxiv.org/abs/2512.02437)
Keywords: transformer, generative, segmentation
Abstract: As a representative optic degenerative condition, glaucoma has been a threat to millions due to its irreversibility and severe impact on human vision fields. Mainly characterized by dimmed and blurred visions, or peripheral vision loss, glaucoma is well known to occur due to damages in the optic nerve from increased intraocular pressure (IOP) or neovascularization within the retina. Traditionally, most glaucoma related works and clinical diagnosis focused on detecting these damages in the optic nerve by using patient data from perimetry tests, optic papilla inspections and tonometer-based IOP measurements. Recently, with advancements in computer vision AI models, such as VGG16 or Vision Transformers (ViT), AI-automatized glaucoma detection and optic cup segmentation based on retinal fundus images or OCT recently exhibited significant performance in aiding conventional diagnosis with high performance. However, current AI-driven glaucoma detection approaches still have significant room for improvement in terms of reliability, excessive parameter usage, possibility of spurious correlation within detection, and limitations in applications to intervention analysis or clinical simulations. Thus, this research introduced a novel causal representation driven glaucoma detection model: LightHCG, an extremely lightweight Convolutional VAE-based latent glaucoma representation model that can consider the true causality among glaucoma-related physical factors within the optic nerve region. Using HSIC-based latent space disentanglement and Graph Autoencoder based unsupervised causal representation learning, LightHCG not only exhibits higher performance in classifying glaucoma with 93~99% less weights, but also enhances the possibility of AI-driven intervention analysis, compared to existing advanced vision models such as InceptionV3, MobileNetV2 or VGG16.

Title: Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources

Authors: Phuc Pham, Nhu Pham, Ngoc Quoc Ly
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02438
Pdf URL: https://arxiv.org/pdf/2512.02438
Copy Paste: [[2512.02438]] Boosting Medical Vision-Language Pretraining via Momentum Self-Distillation under Limited Computing Resources(https://arxiv.org/abs/2512.02438)
Keywords: robust, extraction
Abstract: In medical healthcare, obtaining detailed annotations is challenging, highlighting the need for robust Vision-Language Models (VLMs). Pretrained VLMs enable fine-tuning on small datasets or zero-shot inference, achieving performance comparable to task-specific models. Contrastive learning (CL) is a key paradigm for training VLMs but inherently requires large batch sizes for effective learning, making it computationally demanding and often limited to well-resourced institutions. Moreover, with limited data in healthcare, it is important to prioritize knowledge extraction from both data and models during training to improve performance. Therefore, we focus on leveraging the momentum method combined with distillation to simultaneously address computational efficiency and knowledge exploitation. Our contributions can be summarized as follows: (1) leveraging momentum self-distillation to enhance multimodal learning, and (2) integrating momentum mechanisms with gradient accumulation to enlarge the effective batch size without increasing resource consumption. Our method attains competitive performance with state-of-the-art (SOTA) approaches in zero-shot classification, while providing a substantial boost in the few-shot adaption, achieving over 90% AUC-ROC and improving retrieval tasks by 2-3%. Importantly, our method achieves high training efficiency with a single GPU while maintaining reasonable training time. Our approach aims to advance efficient multimodal learning by reducing resource requirements while improving performance over SOTA methods. The implementation of our method is available at this https URL .

Title: Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation

Authors: Junghwan Park, Woojin Cho, Junhyuk Heo, Darongsae Kwon, Kookjin Lee
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02441
Pdf URL: https://arxiv.org/pdf/2512.02441
Copy Paste: [[2512.02441]] Basis-Oriented Low-rank Transfer for Few-Shot and Test-Time Adaptation(https://arxiv.org/abs/2512.02441)
Keywords: robust
Abstract: Adapting large pre-trained models to unseen tasks under tight data and compute budgets remains challenging. Meta-learning approaches explicitly learn good initializations, but they require an additional meta-training phase over many tasks, incur high training cost, and can be unstable. At the same time, the number of task-specific pre-trained models continues to grow, yet the question of how to transfer them to new tasks with minimal additional training remains relatively underexplored. We propose BOLT (Basis-Oriented Low-rank Transfer), a framework that reuses existing fine-tuned models not by merging weights, but instead by extracting an orthogonal, task-informed spectral basis and adapting within that subspace. In the offline phase, BOLT collects dominant singular directions from multiple task vectors and orthogonalizes them per layer to form reusable bases. In the online phase, we freeze these bases and train only a small set of diagonal coefficients per layer for the new task, yielding a rank-controlled update with very few trainable parameters. This design provides (i) a strong, training-free initialization for unseen tasks, obtained by pooling source-task coefficients, along with a lightweight rescaling step while leveraging the shared orthogonal bases, and (ii) a parameter-efficient fine-tuning (PEFT) path that, in our experiments, achieves robust performance compared to common PEFT baselines as well as a representative meta-learned initialization. Our results show that constraining adaptation to a task-informed orthogonal subspace provides an effective alternative for unseen-task transfer.

Title: When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

Authors: Tsimur Hadeliya, Mohammad Ali Jauhar, Nidhi Sakpal, Diogo Cruz
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.02445
Pdf URL: https://arxiv.org/pdf/2512.02445
Copy Paste: [[2512.02445]] When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents(https://arxiv.org/abs/2512.02445)
Keywords: large language model
Abstract: Solving complex or long-horizon problems often requires large language models (LLMs) to use external tools and operate over a significantly longer context window. New LLMs enable longer context windows and support tool calling capabilities. Prior works have focused mainly on evaluation of LLMs on long-context prompts, leaving agentic setup relatively unexplored, both from capability and safety perspectives. Our work addresses this gap. We find that LLM agents could be sensitive to length, type, and placement of the context, exhibiting unexpected and inconsistent shifts in task performance and in refusals to execute harmful requests. Models with 1M-2M token context windows show severe degradation already at 100K tokens, with performance drops exceeding 50\% for both benign and harmful tasks. Refusal rates shift unpredictably: GPT-4.1-nano increases from $\sim$5\% to $\sim$40\% while Grok 4 Fast decreases from $\sim$80\% to $\sim$10\% at 200K tokens. Our work shows potential safety issues with agents operating on longer context and opens additional questions on the current metrics and paradigm for evaluating LLM agent safety on long multi-step tasks. In particular, our results on LLM agents reveal a notable divergence in both capability and safety performance compared to prior evaluations of LLMs on similar criteria.

Title: See, Think, Learn: A Self-Taught Multimodal Reasoner

Authors: Sourabh Sharma, Sonam Gupta, Sadbhawna
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2512.02456
Pdf URL: https://arxiv.org/pdf/2512.02456
Copy Paste: [[2512.02456]] See, Think, Learn: A Self-Taught Multimodal Reasoner(https://arxiv.org/abs/2512.02456)
Keywords: robust
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress in integrating visual perception with language understanding. However, effective multimodal reasoning requires both accurate perception and robust reasoning, and weakness in either limits the performance of VLMs. Prior efforts to enhance reasoning often depend on high-quality chain-of-thought (CoT) data, obtained via labor-intensive human annotations, costly proprietary models, or self-training methods that overlook perception. To address these limitations, we propose a simple yet effective self-training framework called See-Think-Learn (STL). At its core, STL introduces a structured reasoning template that encourages the model to see before thinking, first extracting visual attributes in textual form, then using them to guide reasoning. The framework jointly improves perception and reasoning by having the model generate and learn from its own structured rationales in a self-training loop. Furthermore, we augment the training data with negative rationales, i.e. explanations that justify why certain answer choices are incorrect, to enhance the model's ability to distinguish between correct and misleading responses. This fosters more discriminative and robust learning. Experiments across diverse domains show that STL consistently outperforms baselines trained directly only on answers or self-generated reasoning, while qualitative analysis confirms the high quality of its rationales. STL thus provides a cost-effective solution to enhance multimodal reasoning ability of VLMs.

Title: Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Authors: Jianzong Wu, Hao Lian, Dachao Hao, Ye Tian, Qingyu Shi, Biaolong Chen, Hao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02457
Pdf URL: https://arxiv.org/pdf/2512.02457
Copy Paste: [[2512.02457]] Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation(https://arxiv.org/abs/2512.02457)
Keywords: generative
Abstract: Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large and object contact motions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.

Title: Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration

Authors: Zhongyi Cai, Yi Du, Chen Wang, Yu Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02458
Pdf URL: https://arxiv.org/pdf/2512.02458
Copy Paste: [[2512.02458]] Vision to Geometry: 3D Spatial Memory for Sequential Embodied MLLM Reasoning and Exploration(https://arxiv.org/abs/2512.02458)
Keywords: large language model
Abstract: Existing research on indoor embodied tasks typically requires agents to actively explore unknown environments and reason about the scene to achieve a specific goal. However, when deployed in real life, agents often face sequential tasks, where each new sub-task follows the completion of the previous one, and certain sub-tasks may be infeasible, such as searching for a non-existent object. Compared with the single-task setting, the core challenge lies in reusing spatial knowledge accumulated from previous explorations to support subsequent reasoning and exploration. In this work, we investigate this underexplored yet practically significant embodied AI challenge. To evaluate this challenge, we introduce SEER-Bench, a new Sequential Embodied Exploration and Reasoning Benchmark encompassing encompassing two classic embodied tasks: Embodied Question Answering (EQA) and Embodied Multi-modal Navigation (EMN). Building on SEER-Bench, we propose 3DSPMR, a 3D SPatial Memory Reasoning approach that exploits relational, visual, and geometric cues from explored regions to augment Multi-Modal Large Language Models (MLLMs) for reasoning and exploration in sequential embodied tasks. To the best of our knowledge, this is the first work to explicitly incorporate geometric information into MLLM-based spatial understanding and reasoning. Extensive experiments verify that 3DSPMR achieves substantial performance gains on both sequential EQA and EMN tasks.

Title: TabGRU: An Enhanced Design for Urban Rainfall Intensity Estimation Using Commercial Microwave Links

Authors: Xingwang Li, Mengyun Chen, Jiamou Liu, Sijie Wang, Shuanggen Jin, Jafet C. M. Andersson, Jonas Olsson, Remco (C. Z.)van de Beek, Hai Victor Habi, Congzheng Han
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02465
Pdf URL: https://arxiv.org/pdf/2512.02465
Copy Paste: [[2512.02465]] TabGRU: An Enhanced Design for Urban Rainfall Intensity Estimation Using Commercial Microwave Links(https://arxiv.org/abs/2512.02465)
Keywords: robust, extraction, transformer
Abstract: In the face of accelerating global urbanization and the increasing frequency of extreme weather events, highresolution urban rainfall monitoring is crucial for building resilient smart cities. Commercial Microwave Links (CMLs) are an emerging data source with great potential for this this http URL traditional rainfall retrieval from CMLs relies on physicsbased models, these often struggle with real-world complexities like signal noise and nonlinear attenuation. To address these limitations, this paper proposes a novel hybrid deep learning architecture based on the Transformer and a Bidirectional Gated Recurrent Unit (BiGRU), which we name TabGRU. This design synergistically captures both long-term dependencies and local sequential features in the CML signal data. The model is further enhanced by a learnable positional embedding and an attention pooling mechanism to improve its dynamic feature extraction and generalization capabilities. The model was validated on a public benchmark dataset from Gothenburg, Sweden (June-September 2015). The evaluation used 12 sub-links from two rain gauges (Torp and Barl) over a test period (August 22-31) covering approximately 10 distinct rainfall events. The proposed TabGRU model demonstrated consistent advantages, outperforming deep learning baselines and achieving high coefficients of determination (R2) at both the Torp site (0.91) and the Barl site (0.96). Furthermore, compared to the physics-based approach, TabGRU maintained higher accuracy and was particularly effective in mitigating the significant overestimation problem observed in the PL model during peak rainfall events. This evaluation confirms that the TabGRU model can effectively overcome the limitations of traditional methods, providing a robust and accurate solution for CML-based urban rainfall monitoring under the tested conditions.

Title: G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline

Authors: Vishwesh Nath, Javier G. Tejero, Ruilong Li, Filippo Filicori, Mahdi Azizian, Sean D. Huver
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02482
Pdf URL: https://arxiv.org/pdf/2512.02482
Copy Paste: [[2512.02482]] G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline(https://arxiv.org/abs/2512.02482)
Keywords: robust
Abstract: We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.

Title: UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making

Authors: Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Xiaofan Zhang, Qi Dou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02485
Pdf URL: https://arxiv.org/pdf/2512.02485
Copy Paste: [[2512.02485]] UCAgents: Unidirectional Convergence for Visual Evidence Anchored Multi-Agent Medical Decision-Making(https://arxiv.org/abs/2512.02485)
Keywords: extraction
Abstract: Vision-Language Models (VLMs) show promise in medical diagnosis, yet suffer from reasoning detachment, where linguistically fluent explanations drift from verifiable image evidence, undermining clinical trust. Recent multi-agent frameworks simulate Multidisciplinary Team (MDT) debates to mitigate single-model bias, but open-ended discussions amplify textual noise and computational cost while failing to anchor reasoning to visual evidence, the cornerstone of medical decision-making. We propose UCAgents, a hierarchical multi-agent framework enforcing unidirectional convergence through structured evidence auditing. Inspired by clinical workflows, UCAgents forbids position changes and limits agent interactions to targeted evidence verification, suppressing rhetorical drift while amplifying visual signal extraction. In UCAgents, a one-round inquiry discussion is introduced to uncover potential risks of visual-textual misalignment. This design jointly constrains visual ambiguity and textual noise, a dual-noise bottleneck that we formalize via information theory. Extensive experiments on four medical VQA benchmarks show UCAgents achieves superior accuracy (71.3% on PathVQA, +6.0% over state-of-the-art) with 87.7% lower token cost, the evaluation results further confirm that UCAgents strikes a balance between uncovering more visual evidence and avoiding confusing textual interference. These results demonstrate that UCAgents exhibits both diagnostic reliability and computational efficiency critical for real-world clinical deployment. Code is available at this https URL.

Title: Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts

Authors: Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, Shuang Qiu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02486
Pdf URL: https://arxiv.org/pdf/2512.02486
Copy Paste: [[2512.02486]] Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts(https://arxiv.org/abs/2512.02486)
Keywords: robust
Abstract: Single-domain offline reinforcement learning (RL) often suffers from limited data coverage, while cross-domain offline RL handles this issue by leveraging additional data from other domains with dynamics shifts. However, existing studies primarily focus on train-time robustness (handling dynamics shifts from training data), neglecting the test-time robustness against dynamics perturbations when deployed in practical scenarios. In this paper, we investigate dual (both train-time and test-time) robustness against dynamics shifts in cross-domain offline RL. We first empirically show that the policy trained with cross-domain offline RL exhibits fragility under dynamics perturbations during evaluation, particularly when target domain data is limited. To address this, we introduce a novel robust cross-domain Bellman (RCB) operator, which enhances test-time robustness against dynamics perturbations while staying conservative to the out-of-distribution dynamics transitions, thus guaranteeing the train-time robustness. To further counteract potential value overestimation or underestimation caused by the RCB operator, we introduce two techniques, the dynamic value penalty and the Huber loss, into our framework, resulting in the practical \textbf{D}ual-\textbf{RO}bust \textbf{C}ross-domain \textbf{O}ffline RL (DROCO) algorithm. Extensive empirical results across various dynamics shift scenarios show that DROCO outperforms strong baselines and exhibits enhanced robustness to dynamics perturbations.

Title: Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding

Authors: Yerim Jeon, Miso Lee, WonJun Moon, Jae-Pil Heo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02487
Pdf URL: https://arxiv.org/pdf/2512.02487
Copy Paste: [[2512.02487]] Masking Matters: Unlocking the Spatial Reasoning Capabilities of LLMs for 3D Scene-Language Understanding(https://arxiv.org/abs/2512.02487)
Keywords: large language model
Abstract: Recent advances in 3D scene-language understanding have leveraged Large Language Models (LLMs) for 3D reasoning by transferring their general reasoning ability to 3D multi-modal contexts. However, existing methods typically adopt standard decoders from language modeling, which rely on a causal attention mask. This design introduces two fundamental conflicts in 3D scene understanding: sequential bias among order-agnostic 3D objects and restricted object-instruction attention, hindering task-specific reasoning. To overcome these limitations, we propose 3D Spatial Language Instruction Mask (3D-SLIM), an effective masking strategy that replaces the causal mask with an adaptive attention mask tailored to the spatial structure of 3D scenes. Our 3D-SLIM introduces two key components: a Geometry-adaptive Mask that constrains attention based on spatial density rather than token order, and an Instruction-aware Mask that enables object tokens to directly access instruction context. This design allows the model to process objects based on their spatial relationships while being guided by the user's task. 3D-SLIM is simple, requires no architectural modifications, and adds no extra parameters, yet it yields substantial performance improvements across diverse 3D scene-language tasks. Extensive experiments across multiple benchmarks and LLM baselines validate its effectiveness and underscore the critical role of decoder design in 3D multi-modal reasoning.

Title: YingVideo-MV: Music-Driven Multi-Stage Video Generation

Authors: Jiahui Chen, Weida Wang, Runhua Shi, Huan Yang, Chaofan Ding, Zihao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02492
Pdf URL: https://arxiv.org/pdf/2512.02492
Copy Paste: [[2512.02492]] YingVideo-MV: Music-Driven Multi-Stage Video Generation(https://arxiv.org/abs/2512.02492)
Keywords: diffusion, transformer
Abstract: While diffusion model for audio-driven avatar video generation have achieved notable process in synthesizing long sequences with natural audio-visual synchronization and identity consistency, the generation of music-performance videos with camera motions remains largely unexplored. We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module (MV-Director), temporal-aware diffusion Transformer architectures, and long-sequence consistency modeling to enable automatic synthesis of high-quality music performance videos from audio signals. We construct a large-scale Music-in-the-Wild Dataset by collecting web data to support the achievement of diverse, high-quality results. Observing that existing long-video generation methods lack explicit camera motion control, we introduce a camera adapter module that embeds camera poses into latent noise. To enhance continulity between clips during long-sequence inference, we further propose a time-aware dynamic window range strategy that adaptively adjust denoising ranges based on audio embedding. Comprehensive benchmark tests demonstrate that YingVideo-MV achieves outstanding performance in generating coherent and expressive music videos, and enables precise music-motion-camera synchronization. More videos are available in our project page: this https URL .

Title: Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration

Authors: Mizuki Kikkawa, Tatsuya Yatagawa, Yutaka Ohtake, Hiromasa Suzuki
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.02496
Pdf URL: https://arxiv.org/pdf/2512.02496
Copy Paste: [[2512.02496]] Attention-guided reference point shifting for Gaussian-mixture-based partial point set registration(https://arxiv.org/abs/2512.02496)
Keywords: robust, transformer
Abstract: This study investigates the impact of the invariance of feature vectors for partial-to-partial point set registration under translation and rotation of input point sets, particularly in the realm of techniques based on deep learning and Gaussian mixture models (GMMs). We reveal both theoretical and practical problems associated with such deep-learning-based registration methods using GMMs, with a particular focus on the limitations of DeepGMR, a pioneering study in this line, to the partial-to-partial point set registration. Our primary goal is to uncover the causes behind such methods and propose a comprehensible solution for that. To address this, we introduce an attention-based reference point shifting (ARPS) layer, which robustly identifies a common reference point of two partial point sets, thereby acquiring transformation-invariant features. The ARPS layer employs a well-studied attention module to find a common reference point rather than the overlap region. Owing to this, it significantly enhances the performance of DeepGMR and its recent variant, UGMMReg. Furthermore, these extension models outperform even prior deep learning methods using attention blocks and Transformer to extract the overlap region or common reference points. We believe these findings provide deeper insights into registration methods using deep learning and GMMs.

Title: A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation

Authors: Wenjing Yu, Shuo Jiang, Yifei Chen, Shuo Chang, Yuanhan Wang, Beining Wu, Jie Dong, Mingxuan Liu, Shenghao Zhu, Feiwei Qin, Changmiao Wang, Qiyuan Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02497
Pdf URL: https://arxiv.org/pdf/2512.02497
Copy Paste: [[2512.02497]] A Large Scale Benchmark for Test Time Adaptation Methods in Medical Image Segmentation(https://arxiv.org/abs/2512.02497)
Keywords: robust, segmentation
Abstract: Test time Adaptation is a promising approach for mitigating domain shift in medical image segmentation; however, current evaluations remain limited in terms of modality coverage, task diversity, and methodological consistency. We present MedSeg-TTA, a comprehensive benchmark that examines twenty representative adaptation methods across seven imaging modalities, including MRI, CT, ultrasound, pathology, dermoscopy, OCT, and chest X-ray, under fully unified data preprocessing, backbone configuration, and test time protocols. The benchmark encompasses four significant adaptation paradigms: Input-level Transformation, Feature-level Alignment, Output-level Regularization, and Prior Estimation, enabling the first systematic cross-modality comparison of their reliability and applicability. The results show that no single paradigm performs best in all conditions. Input-level methods are more stable under mild appearance shifts. Feature-level and Output-level methods offer greater advantages in boundary-related metrics, whereas prior-based methods exhibit strong modality dependence. Several methods degrade significantly under large inter-center and inter-device shifts, which highlights the importance of principled method selection for clinical deployment. MedSeg-TTA provides standardized datasets, validated implementations, and a public leaderboard, establishing a rigorous foundation for future research on robust, clinically reliable test-time adaptation. All source codes and open-source datasets are available at this https URL.

Title: dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

Authors: Yumeng Li, Guang Yang, Hao Liu, Bowen Wang, Colin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02498
Pdf URL: https://arxiv.org/pdf/2512.02498
Copy Paste: [[2512.02498]] dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model(https://arxiv.org/abs/2512.02498)
Keywords: robust
Abstract: Document Layout Parsing serves as a critical gateway for Artificial Intelligence (AI) to access and interpret the world's vast stores of structured knowledge. This process,which encompasses layout detection, text recognition, and relational understanding, is particularly crucial for empowering next-generation Vision-Language Models. Current methods, however, rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. In this paper, we introduce this http URL, a single Vision-Language Model that, for the first time, demonstrates the advantages of jointly learning three core tasks within a unified, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, empowering the model to deliver robust performance across a wide array of tasks, encompassing diverse languages, layouts, and domains. The efficacy of our unified paradigm is validated by state-of-the-art performance on the comprehensive OmniDocBench. Furthermore, to catalyze research in global document intelligence, we introduce XDocParse, a challenging new benchmark spanning 126 languages. On this testbed, this http URL establishes a powerful new baseline, outperforming the next-best competitor by a remarkable +7.4 point margin and proving its unparalleled multilingual capabilities.

Title: GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding

Authors: Jiaqi Liu, Ronghao Fu, Haoran Liu, Lang Sun, Bo Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02505
Pdf URL: https://arxiv.org/pdf/2512.02505
Copy Paste: [[2512.02505]] GeoDiT: A Diffusion-based Vision-Language Model for Geospatial Understanding(https://arxiv.org/abs/2512.02505)
Keywords: diffusion, generative
Abstract: Autoregressive models are structurally misaligned with the inherently parallel nature of geospatial understanding, forcing a rigid sequential narrative onto scenes and fundamentally hindering the generation of structured and coherent outputs. We challenge this paradigm by reframing geospatial generation as a parallel refinement process, enabling a holistic, coarse-to-fine synthesis that resolves all semantic elements simultaneously. To operationalize this, we introduce GeoDiT, the first diffusion-based vision-language model tailored for the geospatial domain. Extensive experiments demonstrate that GeoDiT establishes a new state-of-the-art on benchmarks requiring structured, object-centric outputs. It achieves significant gains in image captioning, visual grounding, and multi-object detection, precisely the tasks where autoregressive models falter. Our work validates that aligning the generative process with the data's intrinsic structure is key to unlocking superior performance in complex geospatial analysis.

Title: Water Quality Estimation Through Machine Learning Multivariate Analysis

Authors: Marco Cardia, Stefano Chessa, Alessio Micheli, Antonella Giuliana Luminare, Francesca Gambineri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02508
Pdf URL: https://arxiv.org/pdf/2512.02508
Copy Paste: [[2512.02508]] Water Quality Estimation Through Machine Learning Multivariate Analysis(https://arxiv.org/abs/2512.02508)
Keywords: interpretability
Abstract: The quality of water is key for the quality of agrifood sector. Water is used in agriculture for fertigation, for animal husbandry, and in the agrifood processing industry. In the context of the progressive digitalization of this sector, the automatic assessment of the quality of water is thus becoming an important asset. In this work, we present the integration of Ultraviolet-Visible (UV-Vis) spectroscopy with Machine Learning in the context of water quality assessment aiming at ensuring water safety and the compliance of water regulation. Furthermore, we emphasize the importance of model interpretability by employing SHapley Additive exPlanations (SHAP) to understand the contribution of absorbance at different wavelengths to the predictions. Our approach demonstrates the potential for rapid, accurate, and interpretable assessment of key water quality parameters.

Title: Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling

Authors: Aditya Chaudhary, Prachet Dev Singh, Ankit Jha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02512
Pdf URL: https://arxiv.org/pdf/2512.02512
Copy Paste: [[2512.02512]] Two-Stage Vision Transformer for Image Restoration: Colorization Pretraining + Residual Upsampling(https://arxiv.org/abs/2512.02512)
Keywords: transformer
Abstract: In computer vision, Single Image Super-Resolution (SISR) is still a difficult problem. We present ViT-SR, a new technique to improve the performance of a Vision Transformer (ViT) employing a two-stage training strategy. In our method, the model learns rich, generalizable visual representations from the data itself through a self-supervised pretraining phase on a colourization task. The pre-trained model is then adjusted for 4x super-resolution. By predicting the addition of a high-frequency residual image to an initial bicubic interpolation, this design simplifies residual learning. ViT-SR, trained and evaluated on the DIV2K benchmark dataset, achieves an impressive SSIM of 0.712 and PSNR of 22.90 dB. These results demonstrate the efficacy of our two-stage approach and highlight the potential of self-supervised pre-training for complex image restoration tasks. Further improvements may be possible with larger ViT architectures or alternative pretext tasks.

Title: Decentralized Fairness Aware Multi Task Federated Learning for VR Network

Authors: Krishnendu S. Tharakan, Carlo Fischione
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2512.02513
Pdf URL: https://arxiv.org/pdf/2512.02513
Copy Paste: [[2512.02513]] Decentralized Fairness Aware Multi Task Federated Learning for VR Network(https://arxiv.org/abs/2512.02513)
Keywords: federate, fair
Abstract: Wireless connectivity promises to unshackle virtual reality (VR) experiences, allowing users to engage from anywhere, anytime. However, delivering seamless, high-quality, real-time VR video wirelessly is challenging due to the stringent quality of experience requirements, low latency constraints, and limited VR device capabilities. This paper addresses these challenges by introducing a novel decentralized multi task fair federated learning (DMTFL) based caching that caches and prefetches each VR user's field of view (FOV) at base stations (BSs) based on the caching strategies tailored to each BS. In federated learning (FL) in its naive form, often biases toward certain users, and a single global model fails to capture the statistical heterogeneity across users and BSs. In contrast, the proposed DMTFL algorithm personalizes content delivery by learning individual caching models at each BS. These models are further optimized to perform well under any target distribution, while providing theoretical guarantees via Rademacher complexity and a probably approximately correct (PAC) bound on the loss. Using a realistic VR head-tracking dataset, our simulations demonstrate the superiority of our proposed DMTFL algorithm compared to baseline algorithms.

Title: SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts

Authors: Jiaqi Liu, Ronghao Fu, Lang Sun, Haoran Liu, Xiao Yang, Weipeng Zhang, Xu Na, Zhuoran Duan, Bo Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02517
Pdf URL: https://arxiv.org/pdf/2512.02517
Copy Paste: [[2512.02517]] SkyMoE: A Vision-Language Foundation Model for Enhancing Geospatial Interpretation with Mixture of Experts(https://arxiv.org/abs/2512.02517)
Keywords: large language model
Abstract: The emergence of large vision-language models (VLMs) has significantly enhanced the efficiency and flexibility of geospatial interpretation. However, general-purpose VLMs remain suboptimal for remote sensing (RS) tasks. Existing geospatial VLMs typically adopt a unified modeling strategy and struggle to differentiate between task types and interpretation granularities, limiting their ability to balance local detail perception and global contextual understanding. In this paper, we present SkyMoE, a Mixture-of-Experts (MoE) vision-language model tailored for multimodal, multi-task RS interpretation. SkyMoE employs an adaptive router that generates task- and granularity-aware routing instructions, enabling specialized large language model experts to handle diverse sub-tasks. To further promote expert decoupling and granularity sensitivity, we introduce a context-disentangled augmentation strategy that creates contrastive pairs between local and global features, guiding experts toward level-specific representation learning. We also construct MGRS-Bench, a comprehensive benchmark covering multiple RS interpretation tasks and granularity levels, to evaluate generalization in complex scenarios. Extensive experiments on 21 public datasets demonstrate that SkyMoE achieves state-of-the-art performance across tasks, validating its adaptability, scalability, and superior multi-granularity understanding in remote sensing.

Title: On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection

Authors: Tai Le-Gia
Subjects: cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2512.02520
Pdf URL: https://arxiv.org/pdf/2512.02520
Copy Paste: [[2512.02520]] On the Problem of Consistent Anomalies in Zero-Shot Anomaly Detection(https://arxiv.org/abs/2512.02520)
Keywords: transformer, segmentation
Abstract: Zero-shot anomaly classification and segmentation (AC/AS) aim to detect anomalous samples and regions without any training data, a capability increasingly crucial in industrial inspection and medical imaging. This dissertation aims to investigate the core challenges of zero-shot AC/AS and presents principled solutions rooted in theory and algorithmic design. We first formalize the problem of consistent anomalies, a failure mode in which recurring similar anomalies systematically bias distance-based methods. By analyzing the statistical and geometric behavior of patch representations from pre-trained Vision Transformers, we identify two key phenomena - similarity scaling and neighbor-burnout - that describe how relationships among normal patches change with and without consistent anomalies in settings characterized by highly similar objects. We then introduce CoDeGraph, a graph-based framework for filtering consistent anomalies built on the similarity scaling and neighbor-burnout phenomena. Through multi-stage graph construction, community detection, and structured refinement, CoDeGraph effectively suppresses the influence of consistent anomalies. Next, we extend this framework to 3D medical imaging by proposing a training-free, computationally efficient volumetric tokenization strategy for MRI data. This enables a genuinely zero-shot 3D anomaly detection pipeline and shows that volumetric anomaly segmentation is achievable without any 3D training samples. Finally, we bridge batch-based and text-based zero-shot methods by demonstrating that CoDeGraph-derived pseudo-masks can supervise prompt-driven vision-language models. Together, this dissertation provides theoretical understanding and practical solutions for the zero-shot AC/AS problem.

Title: WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens

Authors: Jian Yang, Dacheng Yin, Xiaoxuan He, Yong Li, Fengyun Rao, Jing Lyu, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02536
Pdf URL: https://arxiv.org/pdf/2512.02536
Copy Paste: [[2512.02536]] WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens(https://arxiv.org/abs/2512.02536)
Keywords: diffusion, large language model
Abstract: Recent progress in multimodal large language models (MLLMs) has highlighted the challenge of efficiently bridging pre-trained Vision-Language Models (VLMs) with Diffusion Models. While methods using a fixed number of learnable query tokens offer computational efficiency, they suffer from task generalization collapse, failing to adapt to new tasks that are distant from their pre-training tasks. To overcome this, we propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization, enhancing continual learning. Additionally, we introduce a VAE branch with linear projection to recover fine-grained image details. Experimental results confirm our approach mitigates generalization collapse and enables stable continual learning across diverse tasks.

Title: AVGGT: Rethinking Global Attention for Accelerating VGGT

Authors: Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02541
Pdf URL: https://arxiv.org/pdf/2512.02541
Copy Paste: [[2512.02541]] AVGGT: Rethinking Global Attention for Accelerating VGGT(https://arxiv.org/abs/2512.02541)
Keywords: robust
Abstract: Since DUSt3R, models such as VGGT and $\pi^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $\pi^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $\pi^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.

Title: CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02551
Pdf URL: https://arxiv.org/pdf/2512.02551
Copy Paste: [[2512.02551]] CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning(https://arxiv.org/abs/2512.02551)
Keywords: large language model
Abstract: In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms major matmul baselines to date, from the widely-used {\it this http URL} to state-of-the-art Nvidia's closed-source libraries, i.e., {\it cuBLAS}, {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields +22.0\% over {\it this http URL} on average; +19.2\% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8\% over {\it cuBLASLt-heuristic}, which queries {\it cuBLASLt} library and selects the algorithm based on the heuristic's suggestion; and +11.4\% over the most competitive {\it cuBLASLt-AutoTuning} model, which selects the fastest algorithm from up to 100 candidates from {\it cuBLASLt}'s suggestions. In server mode, where kernels are executed at random intervals simulating real-time inference, the speedups further increase to +28.7\%, +26.0\%, +22.4\%, and +15.9\% for {\it this http URL}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning} respectively. CUDA-L2 shows that even the most performance-critical, heavily-optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at this http URL

Title: What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints

Authors: Francesco Paolo Savatteri (ENC), Chahan Vidal-Gorène (CJM, LIPN), Florian Cafiero (ENC)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02552
Pdf URL: https://arxiv.org/pdf/2512.02552
Copy Paste: [[2512.02552]] What Signals Really Matter for Misinformation Tasks? Evaluating Fake-News Detection and Virality Prediction under Real-World Constraints(https://arxiv.org/abs/2512.02552)
Keywords: transformer
Abstract: We present an evaluation-driven study of two practical tasks regarding online misinformation: (i) fake-news detection and (ii) virality prediction in the context of operational settings, with the necessity for rapid reaction. Using the EVONS and FakeNewsNet datasets, we compare textual embeddings (RoBERTa; with a control using Mistral) against lightweight numeric features (timing, follower counts, verification, likes) and sequence models (GRU, gating architectures, Transformer encoders). We show that textual content alone is a strong discriminator for fake-news detection, while numeric-only pipelines remain viable when language models are unavailable or compute is constrained. Virality prediction is markedly harder than fake-news detection and is highly sensitive to label construction; in our setup, a median-based ''viral'' split (<50 likes) is pragmatic but underestimates real-world virality, and time-censoring for engagement features is desirable yet difficult under current API limits. Dimensionality-reduction analyses suggest non-linear structure is more informative for virality than for fake-news detection (t-SNE > PCA on numeric features). Swapping RoBERTa for Mistral embeddings yields only modest deltas, leaving conclusions unchanged. We discuss implications for evaluation design and report reproducibility constraints that realistically affect the field. We release splits and code where possible and provide guidance for metric selection.

Title: OmniPerson: Unified Identity-Preserving Pedestrian Generation

Authors: Changxiao Ma, Chao Yuan, Xincheng Shi, Yuzhuo Ma, Yongfei Zhang, Longkun Zhou, Yujia Zhang, Shangze Li, Yifan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02554
Pdf URL: https://arxiv.org/pdf/2512.02554
Copy Paste: [[2512.02554]] OmniPerson: Unified Identity-Preserving Pedestrian Generation(https://arxiv.org/abs/2512.02554)
Keywords: privacy, robust
Abstract: Person re-identification (ReID) suffers from a lack of large-scale high-quality training data due to challenges in data privacy and annotation costs. While previous approaches have explored pedestrian generation for data augmentation, they often fail to ensure identity consistency and suffer from insufficient controllability, thereby limiting their effectiveness in dataset augmentation. To address this, We introduce OmniPerson, the first unified identity-preserving pedestrian generation pipeline for visible/infrared image/video ReID tasks. Our contributions are threefold: 1) We proposed OmniPerson, a unified generation model, offering holistic and fine-grained control over all key pedestrian attributes. Supporting RGB/IR modality image/video generation with any number of reference images, two kinds of person poses, and text. Also including RGB-to-IR transfer and image super-resolution abilities.2) We designed Multi-Refer Fuser for robust identity preservation with any number of reference images as input, making OmniPerson could distill a unified identity from a set of multi-view reference images, ensuring our generated pedestrians achieve high-fidelity pedestrian generation.3) We introduce PersonSyn, the first large-scale dataset for multi-reference, controllable pedestrian generation, and present its automated curation pipeline which transforms public, ID-only ReID benchmarks into a richly annotated resource with the dense, multi-modal supervision required for this task. Experimental results demonstrate that OmniPerson achieves SoTA in pedestrian generation, excelling in both visual fidelity and identity consistency. Furthermore, augmenting existing datasets with our generated data consistently improves the performance of ReID models. We will open-source the full codebase, pretrained model, and the PersonSyn dataset.

Title: ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce

Authors: Zheng Fang, Donghao Xie, Ming Pang, Chunyuan Yuan, Xue Jiang, Changping Peng, Zhangang Lin, Zheng Luo
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2512.02555
Pdf URL: https://arxiv.org/pdf/2512.02555
Copy Paste: [[2512.02555]] ADORE: Autonomous Domain-Oriented Relevance Engine for E-commerce(https://arxiv.org/abs/2512.02555)
Keywords: robust
Abstract: Relevance modeling in e-commerce search remains challenged by semantic gaps in term-matching methods (e.g., BM25) and neural models' reliance on the scarcity of domain-specific hard samples. We propose ADORE, a self-sustaining framework that synergizes three innovations: (1) A Rule-aware Relevance Discrimination module, where a Chain-of-Thought LLM generates intent-aligned training data, refined via Kahneman-Tversky Optimization (KTO) to align with user behavior; (2) An Error-type-aware Data Synthesis module that auto-generates adversarial examples to harden robustness; and (3) A Key-attribute-enhanced Knowledge Distillation module that injects domain-specific attribute hierarchies into a deployable student model. ADORE automates annotation, adversarial generation, and distillation, overcoming data scarcity while enhancing reasoning. Large-scale experiments and online A/B testing verify the effectiveness of ADORE. The framework establishes a new paradigm for resource-efficient, cognitively aligned relevance modeling in industrial applications.

Title: DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

Authors: DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Li, Haofen Liang, Haoran Wei, Haowei Zhang, Haowen Luo, Haozhe Ji, Honghui Ding, Hongxuan Tang, Huanqi Cao, Huazuo Gao, Hui Qu, Hui Zeng, Jialiang Huang, Jiashi Li, Jiaxin Xu, Jiewen Hu, Jingchang Chen, Jingting Xiang, Jingyang Yuan, Jingyuan Cheng, Jinhua Zhu, Jun Ran, Junguang Jiang, Junjie Qiu, Junlong Li, Junxiao Song, Kai Dong, Kaige Gao, Kang Guan, Kexin Huang, Kexing Zhou, Kezhao Huang, Kuai Yu, Lean Wang, Lecong Zhang, Lei Wang, Liang Zhao, Liangsheng Yin, Lihua Guo, Lingxiao Luo, Linwang Ma, Litong Wang, Liyue Zhang, M.S. Di, M.Y Xu, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Mingxu Zhou, Panpan Huang, Peixin Cong, Peiyi Wang, Qiancheng Wang, Qihao Zhu, Qingyang Li, Qinyu Chen, Qiushi Du, Ruiling Xu, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, Runqiu Yin, Runxin Xu, Ruomeng Shen, Ruoyu Zhang, S.H. Liu, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shaofei Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02556
Pdf URL: https://arxiv.org/pdf/2512.02556
Copy Paste: [[2512.02556]] DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models(https://arxiv.org/abs/2512.02556)
Keywords: robust, large language model
Abstract: We introduce DeepSeek-V3.2, a model that harmonizes high computational efficiency with superior reasoning and agent performance. The key technical breakthroughs of DeepSeek-V3.2 are as follows: (1) DeepSeek Sparse Attention (DSA): We introduce DSA, an efficient attention mechanism that substantially reduces computational complexity while preserving model performance in long-context scenarios. (2) Scalable Reinforcement Learning Framework: By implementing a robust reinforcement learning protocol and scaling post-training compute, DeepSeek-V3.2 performs comparably to GPT-5. Notably, our high-compute variant, DeepSeek-V3.2-Speciale, surpasses GPT-5 and exhibits reasoning proficiency on par with Gemini-3.0-Pro, achieving gold-medal performance in both the 2025 International Mathematical Olympiad (IMO) and the International Olympiad in Informatics (IOI). (3) Large-Scale Agentic Task Synthesis Pipeline: To integrate reasoning into tool-use scenarios, we developed a novel synthesis pipeline that systematically generates training data at scale. This methodology facilitates scalable agentic post-training, yielding substantial improvements in generalization and instruction-following robustness within complex, interactive environments.

Title: From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature

Authors: Kun Yuan, Min Woo Sun, Zhen Chen, Alejandro Lozano, Xiangteng He, Shi Li, Nassir Navab, Xiaoxiao Sun, Nicolas Padoy, Serena Yeung-Levy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02566
Pdf URL: https://arxiv.org/pdf/2512.02566
Copy Paste: [[2512.02566]] From Panel to Pixel: Zoom-In Vision-Language Pretraining from Biomedical Scientific Literature(https://arxiv.org/abs/2512.02566)
Keywords: robust
Abstract: There is a growing interest in developing strong biomedical vision-language models. A popular approach to achieve robust representations is to use web-scale scientific data. However, current biomedical vision-language pretraining typically compresses rich scientific figures and text into coarse figure-level pairs, discarding the fine-grained correspondences that clinicians actually rely on when zooming into local structures. To tackle this issue, we introduce Panel2Patch, a novel data pipeline that mines hierarchical structure from existing biomedical scientific literature, i.e., multi-panel, marker-heavy figures and their surrounding text, and converts them into multi-granular supervision. Given scientific figures and captions, Panel2Patch parses layouts, panels, and visual markers, then constructs hierarchical aligned vision-language pairs at the figure, panel, and patch levels, preserving local semantics instead of treating each figure as a single data sample. Built on this hierarchical corpus, we develop a granularity-aware pretraining strategy that unifies heterogeneous objectives from coarse didactic descriptions to fine region-focused phrases. By applying Panel2Patch to only a small set of the literature figures, we extract far more effective supervision than prior pipelines, enabling substantially better performance with less pretraining data.

Title: Co-speech Gesture Video Generation via Motion-Based Graph Retrieval

Authors: Yafei Song, Peng Zhang, Bang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02576
Pdf URL: https://arxiv.org/pdf/2512.02576
Copy Paste: [[2512.02576]] Co-speech Gesture Video Generation via Motion-Based Graph Retrieval(https://arxiv.org/abs/2512.02576)
Keywords: diffusion
Abstract: Synthesizing synchronized and natural co-speech gesture videos remains a formidable challenge. Recent approaches have leveraged motion graphs to harness the potential of existing video data. To retrieve an appropriate trajectory from the graph, previous methods either utilize the distance between features extracted from the input audio and those associated with the motions in the graph or embed both the input audio and motion into a shared feature space. However, these techniques may not be optimal due to the many-to-many mapping nature between audio and gestures, which cannot be adequately addressed by one-to-one mapping. To alleviate this limitation, we propose a novel framework that initially employs a diffusion model to generate gesture motions. The diffusion model implicitly learns the joint distribution of audio and motion, enabling the generation of contextually appropriate gestures from input audio sequences. Furthermore, our method extracts both low-level and high-level features from the input audio to enrich the training process of the diffusion model. Subsequently, a meticulously designed motion-based retrieval algorithm is applied to identify the most suitable path within the graph by assessing both global and local similarities in motion. Given that not all nodes in the retrieved path are sequentially continuous, the final step involves seamlessly stitching together these segments to produce a coherent video output. Experimental results substantiate the efficacy of our proposed method, demonstrating a significant improvement over prior approaches in terms of synchronization accuracy and naturalness of generated gestures.

Title: From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks

Authors: Changpeng Yang, Jinyang Wu, Yuchen Liu, Shuai Zhang, Yang Li, Qiliang Liang, Hongzhen Wang, Shuai Nie, Jiaming Xu, Runyu Shi, Ying Huang, Guoquan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02580
Pdf URL: https://arxiv.org/pdf/2512.02580
Copy Paste: [[2512.02580]] From Imitation to Discrimination: Toward A Generalized Curriculum Advantage Mechanism Enhancing Cross-Domain Reasoning Tasks(https://arxiv.org/abs/2512.02580)
Keywords: robust, large language model
Abstract: Reinforcement learning has emerged as a paradigm for post-training large language models, boosting their reasoning capabilities. Such approaches compute an advantage value for each sample, reflecting better or worse performance than expected, thereby yielding both positive and negative signals for training. However, the indiscriminate mixing of the two signals in existing methods, especially from the early stages, may lead to ambiguous guidance and limited gains. To address this issue, we propose **CAPO** (**C**urriculum **A**dvantage **P**olicy **O**ptimization), an adaptive curriculum mechanism based on advantage signals. The proposed mechanism bootstraps imitation learning with positive-only advantage samples to establish robust foundations, and subsequently introduces negative signals to cultivate discriminative capabilities, thereby improving generalization across complex scenarios. Compatible with diverse optimization methods including GRPO, PPO, RLOO, and Reinforce++, our method consistently achieves stable and significant improvements in mathematical reasoning tasks, and further generalizes effectively to multimodal Graphical User Interface (GUI) reasoning scenarios, establishing itself as a versatile and robust optimization framework.

Title: GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies

Authors: Chubin Zhang, Zhenglin Wan, Feng Chen, Xingrui Yu, Ivor Tsang, Bo An
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02581
Pdf URL: https://arxiv.org/pdf/2512.02581
Copy Paste: [[2512.02581]] GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies(https://arxiv.org/abs/2512.02581)
Keywords: diffusion, generative
Abstract: Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize are often too simple to represent the multimodal action distributions needed for complex control. Gaussian policies provide tractable likelihoods and smooth gradients, but their unimodal form limits expressiveness. Conversely, generative policies based on diffusion or flow matching can model rich multimodal behaviors; however, in online RL, they are frequently unstable due to intractable likelihoods and noisy gradients propagating through deep sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this insight, we introduce GoRL (Generative Online Reinforcement Learning), a framework that optimizes a tractable latent policy while utilizing a conditional generative decoder to synthesize actions. A two-timescale update schedule enables the latent policy to learn stably while the decoder steadily increases expressiveness, without requiring tractable action likelihoods. Across a range of continuous-control tasks, GoRL consistently outperforms both Gaussian policies and recent generative-policy baselines. Notably, on the HopperStand task, it reaches a normalized return above 870, more than 3 times that of the strongest baseline. These results demonstrate that separating optimization from generation provides a practical path to policies that are both stable and highly expressive.

Title: Spoken Conversational Agents with Large Language Models

Authors: Chao-Han Huck Yang, Andreas Stolcke, Larry Heck
Subjects: cs.CL, cs.MA, cs.NE, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2512.02593
Pdf URL: https://arxiv.org/pdf/2512.02593
Copy Paste: [[2512.02593]] Spoken Conversational Agents with Large Language Models(https://arxiv.org/abs/2512.02593)
Keywords: privacy, robust, large language model
Abstract: Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval-and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.

Title: S3C2 SICP Summit 2025-06: Vulnerability Response Summit

Authors: Anna Lena Rotthaler, Simon Oberthür, Juraj Somorovsky, Kirsten Thommes, Simon Trang, Yasemin Acar, Michel Cukier, William Enck, Alexandros Kapravelos, Christian Kästner, Dominik Wermke, Laurie Williams
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02600
Pdf URL: https://arxiv.org/pdf/2512.02600
Copy Paste: [[2512.02600]] S3C2 SICP Summit 2025-06: Vulnerability Response Summit(https://arxiv.org/abs/2512.02600)
Keywords: secure, security, attack
Abstract: Recent years have shown increased cyber attacks targeting less secure elements in the software supply chain and causing significant damage to businesses and organizations. The US and EU governments and industry are equally interested in enhancing software security, including supply chain and vulnerability response. On June 26, 2025, researchers from the NSF-supported Secure Software Supply Chain Center (S3C2) and the Software Innovation Campus Paderborn (SICP) conducted a Vulnerability Response Summit with a diverse set of 9 practitioners from 9 companies. The goal of the Summit is to enable sharing between industry practitioners having practical experiences and challenges with software supply chain security, including vulnerability response, and helping to form new collaborations. We conducted five panel discussions based on open-ended questions regarding experiences with vulnerability reports, tools used for vulnerability discovery and management, organizational structures to report vulnerability response and management, preparedness and implementations for Cyber Resilience Act1 (CRA) and NIS22, and bug bounties. The open discussions enabled mutual sharing and shed light on common challenges that industry practitioners with practical experience face when securing their software supply chain, including vulnerability response. In this paper, we provide a summary of the Summit. Full panel questions can be found in the appendix.

Title: Semigroup action based on skew polynomial evaluation with applications to Cryptography

Authors: Daniel Camazón-Portela, Juan Antonio López-Ramos
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2512.02603
Pdf URL: https://arxiv.org/pdf/2512.02603
Copy Paste: [[2512.02603]] Semigroup action based on skew polynomial evaluation with applications to Cryptography(https://arxiv.org/abs/2512.02603)
Keywords: secure
Abstract: Through this work we introduce an action of the skew polynomial ring $\mathbb{F}_{q}\left[X; \sigma, \delta\right]$ over $\mathbb{F}_{q}$ based on its polynomial valuation and the concept of left skew product of functions. This lead us to explore the construction of a certain subset $\mathcal{T}(X)\subset\mathbb{F}_{q}\left[X; \sigma, \delta\right]$ that allow us to control the non-commutativity of this ring, and exploit this fact in order to build a public key exchange protocol that is secure in Canetti and Krawczyk model.

Title: Modeling and Inverse Identification of Interfacial Heat Conduction in Finite Layer and Semi-Infinite Substrate Systems via a Physics-Guided Neural Framework

Authors: Wenhao Sha, Tienchong Chang
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2512.02618
Pdf URL: https://arxiv.org/pdf/2512.02618
Copy Paste: [[2512.02618]] Modeling and Inverse Identification of Interfacial Heat Conduction in Finite Layer and Semi-Infinite Substrate Systems via a Physics-Guided Neural Framework(https://arxiv.org/abs/2512.02618)
Keywords: diffusion, transformer
Abstract: Heat transfer in semiconductor devices is dominated by chip and substrate assemblies, where heat generated within a finite chip layer dissipates into a semi-infinite substrate with much higher thermophysical properties. This mismatch produces steep interfacial temperature gradients, making the transient thermal response highly sensitive to the interface. Conventional numerical solvers require excessive discretization to resolve these dynamics, while physics-informed neural networks (PINNs) often exhibit unstable convergence and loss of physical consistency near the material interface. To address these challenges, we introduce HeatTransFormer, a physics-guided Transformer architecture for interface-dominated diffusion problems. The framework integrates physically informed spatiotemporal sampling, a Laplace-based activation emulating analytical diffusion solutions, and a mask-free attention mechanism supporting bidirectional spatiotemporal coupling. These components enable the model to resolve steep gradients, maintain physical consistency, and remain stable where PINNs typically fail. HeatTransFormer produces coherent temperature fields across the interface when applied to a finite layer and semi-infinite substrate configuration. Coupled with a physics-constrained inverse strategy, it further enables reliable identification of three unknown thermal properties simultaneously using only external measurements. Overall, this work demonstrates that physics-guided Transformer architectures provide a unified framework for forward and inverse modeling in interface-dominated thermal systems.

Title: PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding

Authors: Zheng Huang, Xukai Liu, Tianyu Hu, Kai Zhang, Ye Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02624
Pdf URL: https://arxiv.org/pdf/2512.02624
Copy Paste: [[2512.02624]] PPTBench: Towards Holistic Evaluation of Large Language Models for PowerPoint Layout and Design Understanding(https://arxiv.org/abs/2512.02624)
Keywords: large language model
Abstract: PowerPoint presentations combine rich textual content with structured visual layouts, making them a natural testbed for evaluating the multimodal reasoning and layout understanding abilities of modern MLLMs. However, existing benchmarks focus solely on narrow subtasks while overlooking layout-centric challenges, which are central to real-world slide creation and editing. To bridge this gap, we introduce PPTBench, a comprehensive multimodal benchmark for evaluating LLMs on PowerPoint-related tasks. Leveraging a diverse source of 958 PPTX files, PPTBench evaluates models across four categories with 4,439 samples, including Detection, Understanding, Modification, and Generation. Our experiments reveal a substantial gap between semantic understanding and visual-layout reasoning in current MLLMs: models can interpret slide content but fail to produce coherent spatial arrangements. Ablation and further analysis show that current MLLMs struggle to combine visual cues with JSON-based layout structures and fail to integrate visual information into their API planning ability. And case studies visually expose systematic layout errors such as misalignment and element overlap. These findings provides a new perspective on evaluating VLLMs in PPT scenarios, highlighting challenges and directions for future research on visual-structural reasoning and coherent slide generation. All datasets and code are fully released to support reproducibility and future research.

Title: CryptoQA: A Large-scale Question-answering Dataset for AI-assisted Cryptography

Authors: Mayar Elfares, Pascal Reisert, Tilman Dietz, Manpa Barman, Ahmed Zaki, Ralf Küsters, Andreas Bulling
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02625
Pdf URL: https://arxiv.org/pdf/2512.02625
Copy Paste: [[2512.02625]] CryptoQA: A Large-scale Question-answering Dataset for AI-assisted Cryptography(https://arxiv.org/abs/2512.02625)
Keywords: robust, large language model
Abstract: Large language models (LLMs) excel at many general-purpose natural language processing tasks. However, their ability to perform deep reasoning and mathematical analysis, particularly for complex tasks as required in cryptography, remains poorly understood, largely due to the lack of suitable data for evaluation and training. To address this gap, we present CryptoQA, the first large-scale question-answering (QA) dataset specifically designed for cryptography. CryptoQA contains over two million QA pairs drawn from curated academic sources, along with contextual metadata that can be used to test the cryptographic capabilities of LLMs and to train new LLMs on cryptographic tasks. We benchmark 15 state-of-the-art LLMs on CryptoQA, evaluating their factual accuracy, mathematical reasoning, consistency, referencing, backward reasoning, and robustness to adversarial samples. In addition to quantitative metrics, we provide expert reviews that qualitatively assess model outputs and establish a gold-standard baseline. Our results reveal significant performance deficits of LLMs, particularly on tasks that require formal reasoning and precise mathematical knowledge. This shows the urgent need for LLM assistants tailored to cryptography research and development. We demonstrate that, by using CryptoQA, LLMs can be fine-tuned to exhibit better performance on cryptographic tasks.

Title: Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models

Authors: Xinyue Ai, Yutong He, Albert Gu, Ruslan Salakhutdinov, J Zico Kolter, Nicholas Matthew Boffi, Max Simchowitz
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.02636
Pdf URL: https://arxiv.org/pdf/2512.02636
Copy Paste: [[2512.02636]] Joint Distillation for Fast Likelihood Evaluation and Sampling in Flow-based Models(https://arxiv.org/abs/2512.02636)
Keywords: diffusion, generative
Abstract: Log-likelihood evaluation enables important capabilities in generative models, including model comparison, certain fine-tuning objectives, and many downstream applications. Yet paradoxically, some of today's best generative models -- diffusion and flow-based models -- still require hundreds to thousands of neural function evaluations (NFEs) to compute a single likelihood. While recent distillation methods have successfully accelerated sampling to just a few steps, they achieve this at the cost of likelihood tractability: existing approaches either abandon likelihood computation entirely or still require expensive integration over full trajectories. We present fast flow joint distillation (F2D2), a framework that simultaneously reduces the number of NFEs required for both sampling and likelihood evaluation by two orders of magnitude. Our key insight is that in continuous normalizing flows, the coupled ODEs for sampling and likelihood are computed from a shared underlying velocity field, allowing us to jointly distill both the sampling trajectory and cumulative divergence using a single model. F2D2 is modular, compatible with existing flow-based few-step sampling models, and requires only an additional divergence prediction head. Experiments demonstrate F2D2's capability of achieving accurate log-likelihood with few-step evaluations while maintaining high sample quality, solving a long-standing computational bottleneck in flow-based generative models. As an application of our approach, we propose a lightweight self-guidance method that enables a 2-step MeanFlow model to outperform a 1024 step teacher model with only a single additional backward NFE.

Title: Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening

Authors: Yongchuan Cui, Peng Liu, Yi Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02643
Pdf URL: https://arxiv.org/pdf/2512.02643
Copy Paste: [[2512.02643]] Leveraging Large-Scale Pretrained Spatial-Spectral Priors for General Zero-Shot Pansharpening(https://arxiv.org/abs/2512.02643)
Keywords: robust, transformer
Abstract: Existing deep learning methods for remote sensing image fusion often suffer from poor generalization when applied to unseen datasets due to the limited availability of real training data and the domain gap between different satellite sensors. To address this challenge, we explore the potential of foundation models by proposing a novel pretraining strategy that leverages large-scale simulated datasets to learn robust spatial-spectral priors. Specifically, our approach first constructs diverse simulated datasets by applying various degradation operations (blur, noise, downsampling) and augmentations (bands generation, channel shuffling, high-pass filtering, color jittering, etc.) to natural images from ImageNet and remote sensing images from SkyScript. We then pretrain fusion models on these simulated data to learn generalizable spatial-spectral representations. The pretrained models are subsequently evaluated on six datasets (WorldView-2/3/4, IKONOS, QuickBird, GaoFen-2) using zero-shot and one-shot paradigms, with both full- and freeze-tuning approaches for fine-tuning. Extensive experiments on different network architectures including convolutional neural networks, Transformer, and Mamba demonstrate that our pretraining strategy significantly improves generalization performance across different satellite sensors and imaging conditions for various fusion models. The pretrained models achieve superior results in zero-shot scenarios and show remarkable adaptation capability with minimal real data in one-shot settings. Our work provides a practical solution for cross-domain pansharpening, establishes a new benchmark for generalization in remote sensing image fusion tasks, and paves the way for leveraging foundation models through advanced training strategies.

Title: Hear What Matters! Text-conditioned Selective Video-to-Audio Generation

Authors: Junwon Lee, Juhan Nam, Jiyoung Lee
Subjects: cs.CV, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2512.02650
Pdf URL: https://arxiv.org/pdf/2512.02650
Copy Paste: [[2512.02650]] Hear What Matters! Text-conditioned Selective Video-to-Audio Generation(https://arxiv.org/abs/2512.02650)
Keywords: robust
Abstract: This work introduces a new task, text-conditioned selective video-to-audio (V2A) generation, which produces only the user-intended sound from a multi-object video. This capability is especially crucial in multimedia production, where audio tracks are handled individually for each sound source for precise editing, mixing, and creative control. However, current approaches generate single source-mixed sounds at once, largely because visual features are entangled, and region cues or prompts often fail to specify the source. We propose SelVA, a novel text-conditioned V2A model that treats the text prompt as an explicit selector of target source and modulates video encoder to distinctly extract prompt-relevant video features. The proposed supplementary tokens promote cross-attention by suppressing text-irrelevant activations with efficient parameter tuning, yielding robust semantic and temporal grounding. SelVA further employs a self-augmentation scheme to overcome the lack of mono audio track supervision. We evaluate SelVA on VGG-MONOAUDIO, a curated benchmark of clean single-source videos for such a task. Extensive experiments and ablations consistently verify its effectiveness across audio quality, semantic alignment, and temporal synchronization. Code and demo are available at this https URL.

Title: Adaptive Weighted LSSVM for Multi-View Classification

Authors: Farnaz Faramarzi Lighvan, Mehrdad Asadi, Lynn Houthuys
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02653
Pdf URL: https://arxiv.org/pdf/2512.02653
Copy Paste: [[2512.02653]] Adaptive Weighted LSSVM for Multi-View Classification(https://arxiv.org/abs/2512.02653)
Keywords: privacy
Abstract: Multi-view learning integrates diverse representations of the same instances to improve performance. Most existing kernel-based multi-view learning methods use fusion techniques without enforcing an explicit collaboration type across views or co-regularization which limits global collaboration. We propose AW-LSSVM, an adaptive weighted LS-SVM that promotes complementary learning by an iterative global coupling to make each view focus on hard samples of others from previous iterations. Experiments demonstrate that AW-LSSVM outperforms existing kernel-based multi-view methods on most datasets, while keeping raw features isolated, making it also suitable for privacy-preserving scenarios.

Title: Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag (CTF)

Authors: Víctor Mayoral-Vilches, Luis Javier Navarrete-Lozano, Francesco Balassone, María Sanz-Gómez, Cristóbal R. J. Veas Chavez, Maite del Mundo de Torres, Vanesa Turiel
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02654
Pdf URL: https://arxiv.org/pdf/2512.02654
Copy Paste: [[2512.02654]] Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag (CTF)(https://arxiv.org/abs/2512.02654)
Keywords: security, defense, attack
Abstract: Are Capture-the-Flag competitions obsolete? In 2025, Cybersecurity AI (CAI) systematically conquered some of the world's most prestigious hacking competitions, achieving Rank #1 at multiple events and consistently outperforming thousands of human teams. Across five major circuits-HTB's AI vs Humans, Cyber Apocalypse (8,129 teams), Dragos OT CTF, UWSP Pointer Overflow, and the Neurogrid CTF showdown-CAI demonstrated that Jeopardy-style CTFs have become a solved game for well-engineered AI agents. At Neurogrid, CAI captured 41/45 flags to claim the $50,000 top prize; at Dragos OT, it sprinted 37% faster to 10K points than elite human teams; even when deliberately paused mid-competition, it maintained top-tier rankings. Critically, CAI achieved this dominance through our specialized alias1 model architecture, which delivers enterprise-scale AI security operations at unprecedented cost efficiency and with augmented autonomy-reducing 1B token inference costs from $5,940 to just $119, making continuous security agent operation financially viable for the first time. These results force an uncomfortable reckoning: if autonomous agents now dominate competitions designed to identify top security talent at negligible cost, what are CTFs actually measuring? This paper presents comprehensive evidence of AI capability across the 2025 CTF circuit and argues that the security community must urgently transition from Jeopardy-style contests to Attack & Defense formats that genuinely test adaptive reasoning and resilience-capabilities that remain uniquely human, for now.

Title: Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models

Authors: Naveen George, Naoki Murata, Yuhta Takida, Konda Reddy Mopuri, Yuki Mitsufuji
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02657
Pdf URL: https://arxiv.org/pdf/2512.02657
Copy Paste: [[2512.02657]] Distill, Forget, Repeat: A Framework for Continual Unlearning in Text-to-Image Diffusion Models(https://arxiv.org/abs/2512.02657)
Keywords: privacy, diffusion, generative
Abstract: The recent rapid growth of visual generative models trained on vast web-scale datasets has created significant tension with data privacy regulations and copyright laws, such as GDPR's ``Right to be Forgotten.'' This necessitates machine unlearning (MU) to remove specific concepts without the prohibitive cost of retraining. However, existing MU techniques are fundamentally ill-equipped for real-world scenarios where deletion requests arrive sequentially, a setting known as continual unlearning (CUL). Naively applying one-shot methods in a continual setting triggers a stability crisis, leading to a cascade of degradation characterized by retention collapse, compounding collateral damage to related concepts, and a sharp decline in generative quality. To address this critical challenge, we introduce a novel generative distillation based continual unlearning framework that ensures targeted and stable unlearning under sequences of deletion requests. By reframing each unlearning step as a multi-objective, teacher-student distillation process, the framework leverages principles from continual learning to maintain model integrity. Experiments on a 10-step sequential benchmark demonstrate that our method unlearns forget concepts with better fidelity and achieves this without significant interference to the performance on retain concepts or the overall image quality, substantially outperforming baselines. This framework provides a viable pathway for the responsible deployment and maintenance of large-scale generative models, enabling industries to comply with ongoing data removal requests in a practical and effective manner.

Title: Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation

Authors: Agathoklis Georgiou
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2512.02660
Pdf URL: https://arxiv.org/pdf/2512.02660
Copy Paste: [[2512.02660]] Spatially-Grounded Document Retrieval via Patch-to-Region Relevance Propagation(https://arxiv.org/abs/2512.02660)
Keywords: transformer
Abstract: Vision-language models (VLMs) like ColPali achieve state-of-the-art document retrieval by embedding pages as images and computing fine-grained similarity between query tokens and visual patches. However, they return entire pages rather than specific regions, limiting utility for retrieval-augmented generation (RAG) where precise context is paramount. Conversely, OCR-based systems extract structured text with bounding box coordinates but lack semantic grounding for relevance assessment. We propose a hybrid architecture that unifies these paradigms: using ColPali's patch-level similarity scores as spatial relevance filters over OCR-extracted regions. We formalize the coordinate mapping between vision transformer patch grids and OCR bounding boxes, introduce intersection metrics for relevance propagation, and establish theoretical bounds on retrieval precision. Our approach operates at inference time without additional training. We release Snappy, an open-source implementation demonstrating practical applicability, with empirical evaluation ongoing.

Title: PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes

Authors: Derui Shan, Qian Qiao, Hao Lu, Tao Du, Peng Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02664
Pdf URL: https://arxiv.org/pdf/2512.02664
Copy Paste: [[2512.02664]] PolarGuide-GSDR: 3D Gaussian Splatting Driven by Polarization Priors and Deferred Reflection for Real-World Reflective Scenes(https://arxiv.org/abs/2512.02664)
Keywords: interpretability
Abstract: Polarization-aware Neural Radiance Fields (NeRF) enable novel view synthesis of specular-reflection scenes but face challenges in slow training, inefficient rendering, and strong dependencies on material/viewpoint assumptions. However, 3D Gaussian Splatting (3DGS) enables real-time rendering yet struggles with accurate reflection reconstruction from reflection-geometry entanglement, adding a deferred reflection module introduces environment map dependence. We address these limitations by proposing PolarGuide-GSDR, a polarization-forward-guided paradigm establishing a bidirectional coupling mechanism between polarization and 3DGS: first 3DGS's geometric priors are leveraged to resolve polarization ambiguity, and then the refined polarization information cues are used to guide 3DGS's normal and spherical harmonic representation. This process achieves high-fidelity reflection separation and full-scene reconstruction without requiring environment maps or restrictive material assumptions. We demonstrate on public and self-collected datasets that PolarGuide-GSDR achieves state-of-the-art performance in specular reconstruction, normal estimation, and novel view synthesis, all while maintaining real-time rendering capabilities. To our knowledge, this is the first framework embedding polarization priors directly into 3DGS optimization, yielding superior interpretability and real-time performance for modeling complex reflective scenes.

Title: Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization

Authors: Jing Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02665
Pdf URL: https://arxiv.org/pdf/2512.02665
Copy Paste: [[2512.02665]] Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization(https://arxiv.org/abs/2512.02665)
Keywords: large language model
Abstract: Large language models (LLMs) are now used in settings such as Google's AI Overviews, where it summarizes multiple long documents. However, it remains unclear whether they weight all inputs equally. Focusing on abortion-related news, we construct 40 pro-neutral-con article triplets, permute each triplet into six input orders, and prompt Gemini 2.5 Flash to generate a neutral overview. We evaluate each summary against its source articles using ROUGE-L (lexical overlap), BERTScore (semantic similarity), and SummaC (factual consistency). One-way ANOVA reveals a significant primacy effect for BERTScore across all stances, indicating that summaries are more semantically aligned with the first-seen article. Pairwise comparisons further show that Position 1 differs significantly from Positions 2 and 3, while the latter two do not differ from each other, confirming a selective preference for the first document. The findings present risks for applications that rely on LLM-generated overviews and for agentic AI systems, where the steps involving LLMs can disproportionately influence downstream actions.

Title: Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents

Authors: Haozhuo Zheng, Cheng Wang, Yang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02667
Pdf URL: https://arxiv.org/pdf/2512.02667
Copy Paste: [[2512.02667]] Graph VQ-Transformer (GVT): Fast and Accurate Molecular Generation via High-Fidelity Discrete Latents(https://arxiv.org/abs/2512.02667)
Keywords: diffusion, transformer, generative, large language model
Abstract: The de novo generation of molecules with desirable properties is a critical challenge, where diffusion models are computationally intensive and autoregressive models struggle with error propagation. In this work, we introduce the Graph VQ-Transformer (GVT), a two-stage generative framework that achieves both high accuracy and efficiency. The core of our approach is a novel Graph Vector Quantized Variational Autoencoder (VQ-VAE) that compresses molecular graphs into high-fidelity discrete latent sequences. By synergistically combining a Graph Transformer with canonical Reverse Cuthill-McKee (RCM) node ordering and Rotary Positional Embeddings (RoPE), our VQ-VAE achieves near-perfect reconstruction rates. An autoregressive Transformer is then trained on these discrete latents, effectively converting graph generation into a well-structured sequence modeling problem. Crucially, this mapping of complex graphs to high-fidelity discrete sequences bridges molecular design with the powerful paradigm of large-scale sequence modeling, unlocking potential synergies with Large Language Models (LLMs). Extensive experiments show that GVT achieves state-of-the-art or highly competitive performance across major benchmarks like ZINC250k, MOSES, and GuacaMol, and notably outperforms leading diffusion models on key distribution similarity metrics such as FCD and KL Divergence. With its superior performance, efficiency, and architectural novelty, GVT not only presents a compelling alternative to diffusion models but also establishes a strong new baseline for the field, paving the way for future research in discrete latent-space molecular generation.

Title: PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution

Authors: Zhongbao Yang, Jiangxin Dong, Yazhou Yao, Jinhui Tang, Jinshan Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02681
Pdf URL: https://arxiv.org/pdf/2512.02681
Copy Paste: [[2512.02681]] PGP-DiffSR: Phase-Guided Progressive Pruning for Efficient Diffusion-based Image Super-Resolution(https://arxiv.org/abs/2512.02681)
Keywords: diffusion, transformer
Abstract: Although diffusion-based models have achieved impressive results in image super-resolution, they often rely on large-scale backbones such as Stable Diffusion XL (SDXL) and Diffusion Transformers (DiT), which lead to excessive computational and memory costs during training and inference. To address this issue, we develop a lightweight diffusion method, PGP-DiffSR, by removing redundant information from diffusion models under the guidance of the phase information of inputs for efficient image super-resolution. We first identify the intra-block redundancy within the diffusion backbone and propose a progressive pruning approach that removes redundant blocks while reserving restoration capability. We note that the phase information of the restored images produced by the pruned diffusion model is not well estimated. To solve this problem, we propose a phase-exchange adapter module that explores the phase information of the inputs to guide the pruned diffusion model for better restoration performance. We formulate the progressive pruning approach and the phase-exchange adapter module into a unified model. Extensive experiments demonstrate that our method achieves competitive restoration quality while significantly reducing computational load and memory consumption. The code is available at this https URL.

Title: Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance

Authors: Huankun Sheng, Ming Li, Yixiang Wei, Yeying Fan, Yu-Hui Wen, Tieliang Gong, Yong-Jin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02685
Pdf URL: https://arxiv.org/pdf/2512.02685
Copy Paste: [[2512.02685]] Unsupervised Structural Scene Decomposition via Foreground-Aware Slot Attention with Pseudo-Mask Guidance(https://arxiv.org/abs/2512.02685)
Keywords: robust, segmentation
Abstract: Recent advances in object-centric representation learning have shown that slot attention-based methods can effectively decompose visual scenes into object slot representations without supervision. However, existing approaches typically process foreground and background regions indiscriminately, often resulting in background interference and suboptimal instance discovery performance on real-world data. To address this limitation, we propose Foreground-Aware Slot Attention (FASA), a two-stage framework that explicitly separates foreground from background to enable precise object discovery. In the first stage, FASA performs a coarse scene decomposition to distinguish foreground from background regions through a dual-slot competition mechanism. These slots are initialized via a clustering-based strategy, yielding well-structured representations of salient regions. In the second stage, we introduce a masked slot attention mechanism where the first slot captures the background while the remaining slots compete to represent individual foreground objects. To further address over-segmentation of foreground objects, we incorporate pseudo-mask guidance derived from a patch affinity graph constructed with self-supervised image features to guide the learning of foreground slots. Extensive experiments on both synthetic and real-world datasets demonstrate that FASA consistently outperforms state-of-the-art methods, validating the effectiveness of explicit foreground modeling and pseudo-mask guidance for robust scene decomposition and object-coherent representation. Code will be made publicly available.

Title: ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

Authors: Yuxing Liu, Yong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02686
Pdf URL: https://arxiv.org/pdf/2512.02686
Copy Paste: [[2512.02686]] ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data(https://arxiv.org/abs/2512.02686)
Keywords: robust, diffusion, segmentation
Abstract: Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.

Title: An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation

Authors: Daiki Shirafuji, Tatsuhiko Saito, Yasutomo Kimura
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02689
Pdf URL: https://arxiv.org/pdf/2512.02689
Copy Paste: [[2512.02689]] An Empirical Survey of Model Merging Algorithms for Social Bias Mitigation(https://arxiv.org/abs/2512.02689)
Keywords: fair, large language model
Abstract: Large language models (LLMs) are known to inherit and even amplify societal biases present in their pre-training corpora, threatening fairness and social trust. To address this issue, recent work has explored ``editing'' LLM parameters to mitigate social bias with model merging approaches; however, there is no empirical comparison. In this work, we empirically survey seven algorithms: Linear, Karcher Mean, SLERP, NuSLERP, TIES, DELLA, and Nearswap, applying 13 open weight models in the GPT, LLaMA, and Qwen families. We perform a comprehensive evaluation using three bias datasets (BBQ, BOLD, and HONEST) and measure the impact of these techniques on LLM performance in downstream tasks of the SuperGLUE benchmark. We find a trade-off between bias reduction and downstream performance: methods achieving greater bias mitigation degrade accuracy, particularly on tasks requiring reading comprehension and commonsense and causal reasoning. Among the merging algorithms, Linear, SLERP, and Nearswap consistently reduce bias while maintaining overall performance, with SLERP at moderate interpolation weights emerging as the most balanced choice. These results highlight the potential of model merging algorithms for bias mitigation, while indicating that excessive debiasing or inappropriate merging methods may lead to the degradation of important linguistic abilities.

Title: ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection

Authors: Omid Reza Heidari, Yang Wang, Xinxin Zuo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02696
Pdf URL: https://arxiv.org/pdf/2512.02696
Copy Paste: [[2512.02696]] ALDI-ray: Adapting the ALDI Framework for Security X-ray Object Detection(https://arxiv.org/abs/2512.02696)
Keywords: security, robust, transformer
Abstract: Domain adaptation in object detection is critical for real-world applications where distribution shifts degrade model performance. Security X-ray imaging presents a unique challenge due to variations in scanning devices and environmental conditions, leading to significant domain discrepancies. To address this, we apply ALDI++, a domain adaptation framework that integrates self-distillation, feature alignment, and enhanced training strategies to mitigate domain shift effectively in this area. We conduct extensive experiments on the EDS dataset, demonstrating that ALDI++ surpasses the state-of-the-art (SOTA) domain adaptation methods across multiple adaptation scenarios. In particular, ALDI++ with a Vision Transformer for Detection (ViTDet) backbone achieves the highest mean average precision (mAP), confirming the effectiveness of transformer-based architectures for cross-domain object detection. Additionally, our category-wise analysis highlights consistent improvements in detection accuracy, reinforcing the robustness of the model across diverse object classes. Our findings establish ALDI++ as an efficient solution for domain-adaptive object detection, setting a new benchmark for performance stability and cross-domain generalization in security X-ray imagery.

Title: GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Authors: Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02697
Pdf URL: https://arxiv.org/pdf/2512.02697
Copy Paste: [[2512.02697]] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization(https://arxiv.org/abs/2512.02697)
Keywords: robust
Abstract: Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (e.g., drone, satellite, and street) and modalities (e.g., language and image). To address these challenges, we propose GeoBridge, a foundation model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. The dataset, source code, and pretrained models were released at this https URL.

Title: Tissue-mask supported inter-subject whole-body image registration in the UK Biobank - A method benchmarking study

Authors: Yasemin Utkueri, Elin Lundström, Håkan Ahlström, Johan Öfverstedt, Joel Kullberg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02702
Pdf URL: https://arxiv.org/pdf/2512.02702
Copy Paste: [[2512.02702]] Tissue-mask supported inter-subject whole-body image registration in the UK Biobank - A method benchmarking study(https://arxiv.org/abs/2512.02702)
Keywords: robust
Abstract: The UK Biobank is a large-scale study collecting whole-body MR imaging and non-imaging health data. Robust and accurate inter-subject image registration of these whole-body MR images would enable their body-wide spatial standardization, and region-/voxel-wise correlation analysis of non-imaging data with image-derived parameters (e.g., tissue volume or fat content). We propose a sex-stratified inter-subject whole-body MR image registration approach that uses subcutaneous adipose tissue- and muscle-masks from the state-of-the-art VIBESegmentator method to augment intensity-based graph-cut registration. The proposed method was evaluated on a subset of 4000 subjects by comparing it to an intensity-only method as well as two previously published registration methods, uniGradICON and MIRTK. The evaluation comprised overlap measures applied to the 71 VIBESegmentator masks: 1) Dice scores, and 2) voxel-wise label error frequency. Additionally, voxel-wise correlation between age and each of fat content and tissue volume was studied to exemplify the usefulness for medical research. The proposed method exhibited a mean dice score of 0.77 / 0.75 across the cohort and the 71 masks for males/females, respectively. When compared to the intensity-only registration, the mean values were 6 percentage points (pp) higher for both sexes, and the label error frequency was decreased in most tissue regions. These differences were 9pp / 8pp against uniGradICON and 12pp / 13pp against MIRTK. Using the proposed method, the age-correlation maps were less noisy and showed higher anatomical alignment. In conclusion, the image registration method using two tissue masks improves whole-body registration of UK Biobank images.

Title: CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer

Authors: Lavish Bansal, Naman Mishra
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02711
Pdf URL: https://arxiv.org/pdf/2512.02711
Copy Paste: [[2512.02711]] CREST: Universal Safety Guardrails Through Cluster-Guided Cross-Lingual Transfer(https://arxiv.org/abs/2512.02711)
Keywords: large language model
Abstract: Ensuring content safety in large language models (LLMs) is essential for their deployment in real-world applications. However, existing safety guardrails are predominantly tailored for high-resource languages, leaving a significant portion of the world's population underrepresented who communicate in low-resource languages. To address this, we introduce CREST (CRoss-lingual Efficient Safety Transfer), a parameter-efficient multilingual safety classification model that supports 100 languages with only 0.5B parameters. By training on a strategically chosen subset of only 13 high-resource languages, our model utilizes cluster-based cross-lingual transfer from a few to 100 languages, enabling effective generalization to both unseen high-resource and low-resource languages. This approach addresses the challenge of limited training data in low-resource settings. We conduct comprehensive evaluations across six safety benchmarks to demonstrate that CREST outperforms existing state-of-the-art guardrails of comparable scale and achieves competitive results against models with significantly larger parameter counts (2.5B parameters and above). Our findings highlight the limitations of language-specific guardrails and underscore the importance of developing universal, language-agnostic safety systems that can scale effectively to serve global populations.

Title: GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding

Authors: Peirong Zhang, Yidan Zhang, Luxiao Xu, Jinliang Lin, Zonghao Guo, Fengxiang Wang, Xue Yang, Kaiwen Wei, Lei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02715
Pdf URL: https://arxiv.org/pdf/2512.02715
Copy Paste: [[2512.02715]] GeoViS: Geospatially Rewarded Visual Search for Remote Sensing Visual Grounding(https://arxiv.org/abs/2512.02715)
Keywords: interpretability, large language model
Abstract: Recent advances in multimodal large language models(MLLMs) have led to remarkable progress in visual grounding, enabling fine-grained cross-modal alignment between textual queries and image regions. However, transferring such capabilities to remote sensing imagery remains challenging, as targets are often extremely small within kilometer-scale scenes, and queries typically involve intricate geospatial relations such as relative positions, spatial hierarchies, or contextual dependencies across distant objects. To address these challenges, we propose GeoViS, a Geospatially Rewarded Visual Search framework that reformulates remote sensing visual grounding as a progressive search-and-reasoning process. Rather than directly predicting the target location in a single step, GeoViS actively explores the global image through a tree-structured sequence of visual cues, integrating multimodal perception, spatial reasoning, and reward-guided exploration to refine geospatial hypotheses iteratively. This design enables the model to detect subtle small-scale targets while maintaining holistic scene awareness. Extensive experiments on five remote sensing grounding benchmarks demonstrate that GeoViS achieves precise geospatial understanding and consistently surpasses existing methods across key visual grounding metrics, highlighting its strong cross-domain generalization and interpretability.

Title: Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Authors: Julian Ma, Jun Wang, Zafeirios Fountas
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.02719
Pdf URL: https://arxiv.org/pdf/2512.02719
Copy Paste: [[2512.02719]] Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs(https://arxiv.org/abs/2512.02719)
Keywords: robust, large language model
Abstract: Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance, behaviour and efficiency in multimodal cue-combination. Beyond accuracy and efficiency metrics, we introduce a Bayesian Consistency Score that detects Bayes-consistent behavioural shifts even when accuracy saturates. Our results show that while capable models often adapt in Bayes-consistent ways, accuracy does not guarantee robustness. Notably, GPT-5 Mini achieves perfect text accuracy but fails to integrate visual cues efficiently. This reveals a critical dissociation between capability and strategy, suggesting accuracy-centric benchmarks may over-index on performance while missing brittle uncertainty handling. These findings reveal emergent principled handling of uncertainty and highlight the correlation between accuracy and Bayesian tendencies. We release our psychophysics benchmark and consistency metric (this https URL) as evaluation tools and to inform future multimodal architecture designs.

Title: DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions

Authors: Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose, Yusuke Sekikawa, Nakamasa Inoue
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02727
Pdf URL: https://arxiv.org/pdf/2512.02727
Copy Paste: [[2512.02727]] DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions(https://arxiv.org/abs/2512.02727)
Keywords: robust, extraction
Abstract: Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.

Title: Reasoning-Aware Multimodal Fusion for Hateful Video Detection

Authors: Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02743
Pdf URL: https://arxiv.org/pdf/2512.02743
Copy Paste: [[2512.02743]] Reasoning-Aware Multimodal Fusion for Hateful Video Detection(https://arxiv.org/abs/2512.02743)
Keywords: robust
Abstract: Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingly multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning-a structured three-stage process where a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences-providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.

Title: AttMetNet: Attention-Enhanced Deep Neural Network for Methane Plume Detection in Sentinel-2 Satellite Imagery

Authors: Rakib Ahsan, MD Sadik Hossain Shanto, Md Sultanul Arifin, Tanzima Hashem
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02751
Pdf URL: https://arxiv.org/pdf/2512.02751
Copy Paste: [[2512.02751]] AttMetNet: Attention-Enhanced Deep Neural Network for Methane Plume Detection in Sentinel-2 Satellite Imagery(https://arxiv.org/abs/2512.02751)
Keywords: robust
Abstract: Methane is a powerful greenhouse gas that contributes significantly to global warming. Accurate detection of methane emissions is the key to taking timely action and minimizing their impact on climate change. We present AttMetNet, a novel attention-enhanced deep learning framework for methane plume detection with Sentinel-2 satellite imagery. The major challenge in developing a methane detection model is to accurately identify methane plumes from Sentinel-2's B11 and B12 bands while suppressing false positives caused by background variability and diverse land cover types. Traditional detection methods typically depend on the differences or ratios between these bands when comparing the scenes with and without plumes. However, these methods often require verification by a domain expert because they generate numerous false positives. Recent deep learning methods make some improvements using CNN-based architectures, but lack mechanisms to prioritize methane-specific features. AttMetNet introduces a methane-aware architecture that fuses the Normalized Difference Methane Index (NDMI) with an attention-enhanced U-Net. By jointly exploiting NDMI's plume-sensitive cues and attention-driven feature selection, AttMetNet selectively amplifies methane absorption features while suppressing background noise. This integration establishes a first-of-its-kind architecture tailored for robust methane plume detection in real satellite imagery. Additionally, we employ focal loss to address the severe class imbalance arising from both limited positive plume samples and sparse plume pixels within imagery. Furthermore, AttMetNet is trained on the real methane plume dataset, making it more robust to practical scenarios. Extensive experiments show that AttMetNet surpasses recent methods in methane plume detection with a lower false positive rate, better precision recall balance, and higher IoU.

Title: PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

Authors: Robert Belanec, Ivan Srba, Maria Bielikova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02764
Pdf URL: https://arxiv.org/pdf/2512.02764
Copy Paste: [[2512.02764]] PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models(https://arxiv.org/abs/2512.02764)
Keywords: large language model
Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at this https URL

Title: Towards Unification of Hallucination Detection and Fact Verification for Large Language Models

Authors: Weihang Su, Jianming Long, Changyue Wang, Shiyu Lin, Jingyan Xu, Ziyi Ye, Qingyao Ai, Yiqun Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2512.02772
Pdf URL: https://arxiv.org/pdf/2512.02772
Copy Paste: [[2512.02772]] Towards Unification of Hallucination Detection and Fact Verification for Large Language Models(https://arxiv.org/abs/2512.02772)
Keywords: large language model
Abstract: Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder their adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, using distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) No paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. The comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs. We have open-sourced all the code, data, and baseline implementation at: this https URL

Title: Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset

Authors: Qifan Liang, Junlin Li, Zhen Han, Xihao Wang, Zhongyuan Wang, Bin Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02780
Pdf URL: https://arxiv.org/pdf/2512.02780
Copy Paste: [[2512.02780]] Rethinking Surgical Smoke: A Smoke-Type-Aware Laparoscopic Video Desmoking Method and Dataset(https://arxiv.org/abs/2512.02780)
Keywords: diffusion, segmentation
Abstract: Electrocautery or lasers will inevitably generate surgical smoke, which hinders the visual guidance of laparoscopic videos for surgical procedures. The surgical smoke can be classified into different types based on its motion patterns, leading to distinctive spatio-temporal characteristics across smoky laparoscopic videos. However, existing desmoking methods fail to account for such smoke-type-specific distinctions. Therefore, we propose the first Smoke-Type-Aware Laparoscopic Video Desmoking Network (STANet) by introducing two smoke types: Diffusion Smoke and Ambient Smoke. Specifically, a smoke mask segmentation sub-network is designed to jointly conduct smoke mask and smoke type predictions based on the attention-weighted mask aggregation, while a smokeless video reconstruction sub-network is proposed to perform specially desmoking on smoky features guided by two types of smoke mask. To address the entanglement challenges of two smoke types, we further embed a coarse-to-fine disentanglement module into the mask segmentation sub-network, which yields more accurate disentangled masks through the smoke-type-aware cross attention between non-entangled and entangled regions. In addition, we also construct the first large-scale synthetic video desmoking dataset with smoke type annotations. Extensive experiments demonstrate that our method not only outperforms state-of-the-art approaches in quality evaluations, but also exhibits superior generalization across multiple downstream surgical tasks.

Title: LumiX: Structured and Coherent Text-to-Intrinsic Generation

Authors: Xu Han, Biao Zhang, Xiangjun Tang, Xianzhi Li, Peter Wonka
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02781
Pdf URL: https://arxiv.org/pdf/2512.02781
Copy Paste: [[2512.02781]] LumiX: Structured and Coherent Text-to-Intrinsic Generation(https://arxiv.org/abs/2512.02781)
Keywords: diffusion
Abstract: We present LumiX, a structured diffusion framework for coherent text-to-intrinsic generation. Conditioned on text prompts, LumiX jointly generates a comprehensive set of intrinsic maps (e.g., albedo, irradiance, normal, depth, and final color), providing a structured and physically consistent description of an underlying scene. This is enabled by two key contributions: 1) Query-Broadcast Attention, a mechanism that ensures structural consistency by sharing queries across all maps in each self-attention block. 2) Tensor LoRA, a tensor-based adaptation that parameter-efficiently models cross-map relations for efficient joint training. Together, these designs enable stable joint diffusion training and unified generation of multiple intrinsic properties. Experiments show that LumiX produces coherent and physically meaningful results, achieving 23% higher alignment and a better preference score (0.19 vs. -0.41) compared to the state of the art, and it can also perform image-conditioned intrinsic decomposition within the same framework.

Title: FiMMIA: scaling semantic perturbation-based membership inference across modalities

Authors: Anton Emelyanov, Sergei Kudriashov, Alena Fenogenova
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.02786
Pdf URL: https://arxiv.org/pdf/2512.02786
Copy Paste: [[2512.02786]] FiMMIA: scaling semantic perturbation-based membership inference across modalities(https://arxiv.org/abs/2512.02786)
Keywords: attack, membership infer, large language model
Abstract: Membership Inference Attacks (MIAs) aim to determine whether a specific data point was included in the training set of a target model. Although there are have been numerous methods developed for detecting data contamination in large language models (LLMs), their performance on multimodal LLMs (MLLMs) falls short due to the instabilities introduced through multimodal component adaptation and possible distribution shifts across multiple inputs. In this work, we investigate multimodal membership inference and address two issues: first, by identifying distribution shifts in the existing datasets, and second, by releasing an extended baseline pipeline to detect them. We also generalize the perturbation-based membership inference methods to MLLMs and release \textbf{FiMMIA} -- a modular \textbf{F}ramework for \textbf{M}ultimodal \textbf{MIA}.\footnote{The source code and framework have been made publicly available under the MIT license via \href{this https URL}{link}.The video demonstration is available on \href{this https URL}{YouTube}.} Our approach trains a neural network to analyze the target model's behavior on perturbed inputs, capturing distributional differences between members and non-members. Comprehensive evaluations on various fine-tuned multimodal models demonstrate the effectiveness of our perturbation-based membership inference attacks in multimodal domains.

Title: TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking

Authors: Tang Haonan, Chen Yanjun, Jiang Lezhi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02789
Pdf URL: https://arxiv.org/pdf/2512.02789
Copy Paste: [[2512.02789]] TrackNetV5: Residual-Driven Spatio-Temporal Refinement and Motion Direction Decoupling for Fast Object Tracking(https://arxiv.org/abs/2512.02789)
Keywords: robust, transformer
Abstract: The TrackNet series has established a strong baseline for fast-moving small object tracking in sports. However, existing iterations face significant limitations: V1-V3 struggle with occlusions due to a reliance on purely visual cues, while TrackNetV4, despite introducing motion inputs, suffers from directional ambiguity as its absolute difference method discards motion polarity. To overcome these bottlenecks, we propose TrackNetV5, a robust architecture integrating two novel mechanisms. First, to recover lost directional priors, we introduce the Motion Direction Decoupling (MDD) module. Unlike V4, MDD decomposes temporal dynamics into signed polarity fields, explicitly encoding both movement occurrence and trajectory direction. Second, we propose the Residual-Driven Spatio-Temporal Refinement (R-STR) head. Operating on a coarse-to-fine paradigm, this Transformer-based module leverages factorized spatio-temporal contexts to estimate a corrective residual, effectively recovering occluded targets. Extensive experiments on the TrackNetV2 dataset demonstrate that TrackNetV5 achieves a new state-of-the-art F1-score of 0.9859 and an accuracy of 0.9733, significantly outperforming previous versions. Notably, this performance leap is achieved with a marginal 3.7% increase in FLOPs compared to V4, maintaining real-time inference capabilities while delivering superior tracking precision.

Title: PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation

Authors: Fan Wu, Cheng Chen, Zhoujie Fu, Jiacheng Wei, Yi Xu, Deheng Ye, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02794
Pdf URL: https://arxiv.org/pdf/2512.02794
Copy Paste: [[2512.02794]] PhyCustom: Towards Realistic Physical Customization in Text-to-Image Generation(https://arxiv.org/abs/2512.02794)
Keywords: diffusion
Abstract: Recent diffusion-based text-to-image customization methods have achieved significant success in understanding concrete concepts to control generation processes, such as styles and shapes. However, few efforts dive into the realistic yet challenging customization of physical concepts. The core limitation of current methods arises from the absence of explicitly introducing physical knowledge during training. Even when physics-related words appear in the input text prompts, our experiments consistently demonstrate that these methods fail to accurately reflect the corresponding physical properties in the generated results. In this paper, we propose PhyCustom, a fine-tuning framework comprising two novel regularization losses to activate diffusion model to perform physical customization. Specifically, the proposed isometric loss aims at activating diffusion models to learn physical concepts while decouple loss helps to eliminate the mixture learning of independent concepts. Experiments are conducted on a diverse dataset and our benchmark results demonstrate that PhyCustom outperforms previous state-of-the-art and popular methods in terms of physical customization quantitatively and qualitatively.

Title: TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages

Authors: Mike Nkongolo, Hilton Vorster, Josh Warren, Trevor Naick, Deandre Vanmali, Masana Mashapha, Luke Brand, Alyssa Fernandes, Janco Calitz, Sibusiso Makhoba
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02799
Pdf URL: https://arxiv.org/pdf/2512.02799
Copy Paste: [[2512.02799]] TriLex: A Framework for Multilingual Sentiment Analysis in Low-Resource South African Languages(https://arxiv.org/abs/2512.02799)
Keywords: robust, extraction
Abstract: Low-resource African languages remain underrepresented in sentiment analysis, limiting both lexical coverage and the performance of multilingual Natural Language Processing (NLP) systems. This study proposes TriLex, a three-stage retrieval augmented framework that unifies corpus-based extraction, cross lingual mapping, and retrieval augmented generation (RAG) driven lexical refinement to systematically expand sentiment lexicons for low-resource languages. Using the enriched lexicon, the performance of two prominent African pretrained language models (AfroXLMR and AfriBERTa) is evaluated across multiple case studies. Results demonstrate that AfroXLMR delivers superior performance, achieving F1-scores above 80% for isiXhosa and isiZulu and exhibiting strong cross-lingual stability. Although AfriBERTa lacks pre-training on these target languages, it still achieves reliable F1-scores around 64%, validating its utility in computationally constrained settings. Both models outperform traditional machine learning baselines, and ensemble analyses further enhance precision and robustness. The findings establish TriLex as a scalable and effective framework for multilingual sentiment lexicon expansion and sentiment modeling in low-resource South African languages.

Title: SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

Authors: Yixuan Tang, Yi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02807
Pdf URL: https://arxiv.org/pdf/2512.02807
Copy Paste: [[2512.02807]] SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment(https://arxiv.org/abs/2512.02807)
Keywords: large language model
Abstract: Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.

Title: A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models

Authors: Kunning Li, Jianbin Guo, Zhaoyang Shang, Yiqing Liu, Hongmin Du, Lingling Liu, Yuping Zhao, Lifeng Dong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02816
Pdf URL: https://arxiv.org/pdf/2512.02816
Copy Paste: [[2512.02816]] A benchmark dataset for evaluating Syndrome Differentiation and Treatment in large language models(https://arxiv.org/abs/2512.02816)
Keywords: large language model
Abstract: The emergence of Large Language Models (LLMs) within the Traditional Chinese Medicine (TCM) domain presents an urgent need to assess their clinical application capabilities. However, such evaluations are challenged by the individualized, holistic, and diverse nature of TCM's "Syndrome Differentiation and Treatment" (SDT). Existing benchmarks are confined to knowledge-based question-answering or the accuracy of syndrome differentiation, often neglecting assessment of treatment decision-making. Here, we propose a comprehensive, clinical case-based benchmark spearheaded by TCM experts, and a specialized reward model employed to quantify prescription-syndrome congruence. Data annotation follows a rigorous pipeline. This benchmark, designated TCM-BEST4SDT, encompasses four tasks, including TCM Basic Knowledge, Medical Ethics, LLM Content Safety, and SDT. The evaluation framework integrates three mechanisms, namely selected-response evaluation, judge model evaluation, and reward model evaluation. The effectiveness of TCM-BEST4SDT was corroborated through experiments on 15 mainstream LLMs, spanning both general and TCM domains. To foster the development of intelligent TCM research, TCM-BEST4SDT is now publicly available.

Title: Decryption thorough polynomial ambiguity: noise-enhanced high-memory convolutional codes for post-quantum cryptography

Authors: Meir Ariel
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.02822
Pdf URL: https://arxiv.org/pdf/2512.02822
Copy Paste: [[2512.02822]] Decryption thorough polynomial ambiguity: noise-enhanced high-memory convolutional codes for post-quantum cryptography(https://arxiv.org/abs/2512.02822)
Keywords: security, attack, robust
Abstract: We present a novel approach to post-quantum cryptography that employs directed-graph decryption of noise-enhanced high-memory convolutional codes. The proposed construction generates random-like generator matrices that effectively conceal algebraic structure and resist known structural attacks. Security is further reinforced by the deliberate injection of strong noise during decryption, arising from polynomial division: while legitimate recipients retain polynomial-time decoding, adversaries face exponential-time complexity. As a result, the scheme achieves cryptanalytic security margins surpassing those of Classic McEliece by factors exceeding 2^(200). Beyond its enhanced security, the method offers greater design flexibility, supporting arbitrary plaintext lengths with linear-time decryption and uniform per-bit computational cost, enabling seamless scalability to long messages. Practical deployment is facilitated by parallel arrays of directed-graph decoders, which identify the correct plaintext through polynomial ambiguity while allowing efficient hardware and software implementations. Altogether, the scheme represents a compelling candidate for robust, scalable, and quantum-resistant public-key cryptography.

Title: From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity

Authors: Haoming Liu, Jinnuo Liu, Yanhao Li, Liuyang Bai, Yunkai Ji, Yuanhe Guo, Shenji Wan, Hongyi Wen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02826
Pdf URL: https://arxiv.org/pdf/2512.02826
Copy Paste: [[2512.02826]] From Navigation to Refinement: Revealing the Two-Stage Nature of Flow-based Diffusion Models through Oracle Velocity(https://arxiv.org/abs/2512.02826)
Keywords: diffusion, generative
Abstract: Flow-based diffusion models have emerged as a leading paradigm for training generative models across images and videos. However, their memorization-generalization behavior remains poorly understood. In this work, we revisit the flow matching (FM) objective and study its marginal velocity field, which admits a closed-form expression, allowing exact computation of the oracle FM target. Analyzing this oracle velocity field reveals that flow-based diffusion models inherently formulate a two-stage training target: an early stage guided by a mixture of data modes, and a later stage dominated by the nearest data sample. The two-stage objective leads to distinct learning behaviors: the early navigation stage generalizes across data modes to form global layouts, whereas the later refinement stage increasingly memorizes fine-grained details. Leveraging these insights, we explain the effectiveness of practical techniques such as timestep-shifted schedules, classifier-free guidance intervals, and latent space design choices. Our study deepens the understanding of diffusion model training dynamics and offers principles for guiding future architectural and algorithmic improvements.

Title: Defense That Attacks: How Robust Models Become Better Attackers

Authors: Mohamed Awad, Mahmoud Akrm, Walid Gomaa
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02830
Pdf URL: https://arxiv.org/pdf/2512.02830
Copy Paste: [[2512.02830]] Defense That Attacks: How Robust Models Become Better Attackers(https://arxiv.org/abs/2512.02830)
Keywords: defense, attack, robust
Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, which introduce a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.

Title: ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning

Authors: Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.02835
Pdf URL: https://arxiv.org/pdf/2512.02835
Copy Paste: [[2512.02835]] ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning(https://arxiv.org/abs/2512.02835)
Keywords: segmentation
Abstract: Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its decision quality from outcome-driven signals. Experimental results demonstrate that ReVSeg attains state-of-the-art performances on standard video object segmentation benchmarks and yields interpretable reasoning trajectories. Project page is available at this https URL .

Title: promptolution: A Unified, Modular Framework for Prompt Optimization

Authors: Tom Zehle, Timo Heiß, Moritz Schlager, Matthias Aßenmacher, Matthias Feurer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02840
Pdf URL: https://arxiv.org/pdf/2512.02840
Copy Paste: [[2512.02840]] promptolution: A Unified, Modular Framework for Prompt Optimization(https://arxiv.org/abs/2512.02840)
Keywords: large language model
Abstract: Prompt optimization has become crucial for enhancing the performance of large language models (LLMs) across a broad range of tasks. Although many research papers show its effectiveness, practical adoption is hindered as existing implementations are often tied to unmaintained and isolated research codebases. To address this, we introduce promptolution, a unified and modular open-source framework that provides all components required for prompt optimization within a single extensible system for both practitioners and researchers. It integrates multiple contemporary discrete prompt optimizers while remaining agnostic to the underlying LLM implementation.

Title: Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

Authors: Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens
Subjects: cs.CL, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02841
Pdf URL: https://arxiv.org/pdf/2512.02841
Copy Paste: [[2512.02841]] Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages(https://arxiv.org/abs/2512.02841)
Keywords: robust, large language model
Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from having a single prompt to operate reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as CoT, emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, we highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

Title: Bangla Hate Speech Classification with Fine-tuned Transformer Models

Authors: Yalda Keivan Jafari, Krishno Dey
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02845
Pdf URL: https://arxiv.org/pdf/2512.02845
Copy Paste: [[2512.02845]] Bangla Hate Speech Classification with Fine-tuned Transformer Models(https://arxiv.org/abs/2512.02845)
Keywords: transformer
Abstract: Hate speech recognition in low-resource lan- guages remains a difficult problem due to in- sufficient datasets, orthographic heterogeneity, and linguistic variety. Bangla is spoken by more than 230 million people of Bangladesh and India (West Bengal). Despite the grow- ing need for automated moderation on social media platforms, Bangla is significantly under- represented in computational resources. In this work, we study Subtask 1A and Subtask 1B of the BLP 2025 Shared Task on hate speech detection. We reproduce the official base- lines (e.g., Majority, Random, Support Vec- tor Machine) and also produce and consider Logistic Regression, Random Forest, and De- cision Tree as baseline methods. We also uti- lized transformer-based models such as Dis- tilBERT, BanglaBERT, m-BERT, and XLM- RoBERTa for hate speech classification. All the transformer-based models outperformed base- line methods for the subtasks, except for Distil- BERT. Among the transformer-based models, BanglaBERT produces the best performance for both subtasks. Despite being smaller in size, BanglaBERT outperforms both m-BERT and XLM-RoBERTa, which suggests language- specific pre-training is very important. Our results highlight the potential and need for pre- trained language models for the low-resource Bangla language.

Title: Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?

Authors: Manuel Benavent-Lledo, Konstantinos Bacharidis, Victoria Manousaki, Konstantinos Papoutsakis, Antonis Argyros, Jose Garcia-Rodriguez
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02846
Pdf URL: https://arxiv.org/pdf/2512.02846
Copy Paste: [[2512.02846]] Action Anticipation at a Glimpse: To What Extent Can Multimodal Cues Replace Video?(https://arxiv.org/abs/2512.02846)
Keywords: extraction
Abstract: Anticipating actions before they occur is a core challenge in action understanding research. While conventional methods rely on extracting and aggregating temporal information from videos, as humans we can often predict upcoming actions by observing a single moment from a scene, when given sufficient context. Can a model achieve this competence? The short answer is yes, although its effectiveness depends on the complexity of the task. In this work, we investigate to what extent video aggregation can be replaced with alternative modalities. To this end, based on recent advances in visual feature extraction and language-based reasoning, we introduce AAG, a method for Action Anticipation at a Glimpse. AAG combines RGB features with depth cues from a single frame for enhanced spatial reasoning, and incorporates prior action information to provide long-term context. This context is obtained either through textual summaries from Vision-Language Models, or from predictions generated by a single-frame action recognizer. Our results demonstrate that multimodal single-frame action anticipation using AAG can perform competitively compared to both temporally aggregated video baselines and state-of-the-art methods across three instructional activity datasets: IKEA-ASM, Meccano, and Assembly101.

Title: Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study

Authors: Vishal Dubey, Pallavi Tyagi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02850
Pdf URL: https://arxiv.org/pdf/2512.02850
Copy Paste: [[2512.02850]] Are Detectors Fair to Indian IP-AIGC? A Cross-Generator Study(https://arxiv.org/abs/2512.02850)
Keywords: robust, fair
Abstract: Modern image editors can produce identity-preserving AIGC (IP-AIGC), where the same person appears with new attire, background, or lighting. The robustness and fairness of current detectors in this regime remain unclear, especially for under-represented populations. We present what we believe is the first systematic study of IP-AIGC detection for Indian and South-Asian faces, quantifying cross-generator generalization and intra-population performance. We assemble Indian-focused training splits from FairFD and HAV-DF, and construct two held-out IP-AIGC test sets (HIDF-img-ip-genai and HIDF-vid-ip-genai) using commercial web-UI generators (Gemini and ChatGPT) with identity-preserving prompts. We evaluate two state-of-the-art detectors (AIDE and Effort) under pretrained (PT) and fine-tuned (FT) regimes and report AUC, AP, EER, and accuracy. Fine-tuning yields strong in-domain gains (for example, Effort AUC 0.739 to 0.944 on HAV-DF-test; AIDE EER 0.484 to 0.259), but consistently degrades performance on held-out IP-AIGC for Indian cohorts (for example, AIDE AUC 0.923 to 0.563 on HIDF-img-ip-genai; Effort 0.740 to 0.533), which indicates overfitting to training-generator cues. On non-IP HIDF images, PT performance remains high, which suggests a specific brittleness to identity-preserving edits rather than a generic distribution shift. Our study establishes IP-AIGC-Indian as a challenging and practically relevant scenario and motivates representation-preserving adaptation and India-aware benchmark curation to close generalization gaps in AIGC detection.

Title: Adaptive Decentralized Federated Learning for Robust Optimization

Authors: Shuyuan Wu, Feifei Wang, Yuan Gao, Hansheng Wang
Subjects: cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2512.02852
Pdf URL: https://arxiv.org/pdf/2512.02852
Copy Paste: [[2512.02852]] Adaptive Decentralized Federated Learning for Robust Optimization(https://arxiv.org/abs/2512.02852)
Keywords: robust, federate
Abstract: In decentralized federated learning (DFL), the presence of abnormal clients, often caused by noisy or poisoned data, can significantly disrupt the learning process and degrade the overall robustness of the model. Previous methods on this issue often require a sufficiently large number of normal neighboring clients or prior knowledge of reliable clients, which reduces the practical applicability of DFL. To address these limitations, we develop here a novel adaptive DFL (aDFL) approach for robust estimation. The key idea is to adaptively adjust the learning rates of clients. By assigning smaller rates to suspicious clients and larger rates to normal clients, aDFL mitigates the negative impact of abnormal clients on the global model in a fully adaptive way. Our theory does not put any stringent conditions on neighboring nodes and requires no prior knowledge. A rigorous convergence analysis is provided to guarantee the oracle property of aDFL. Extensive numerical experiments demonstrate the superior performance of the aDFL method.

Title: MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration

Authors: Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang, Jialuo Chen, Jiaxue Ni, Qian Luo, Jin Liu, Can Han, Changkai Ji, Zhi Qin Tan, Ajo Babu George, Liangyu Chen, Qianni Zhang, Dahong Qian, Shuai Wang, Huiyu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02867
Pdf URL: https://arxiv.org/pdf/2512.02867
Copy Paste: [[2512.02867]] MICCAI STSR 2025 Challenge: Semi-Supervised Teeth and Pulp Segmentation and CBCT-IOS Registration(https://arxiv.org/abs/2512.02867)
Keywords: segmentation
Abstract: Cone-Beam Computed Tomography (CBCT) and Intraoral Scanning (IOS) are essential for digital dentistry, but annotated data scarcity limits automated solutions for pulp canal segmentation and cross-modal registration. To benchmark semi-supervised learning (SSL) in this domain, we organized the STSR 2025 Challenge at MICCAI 2025, featuring two tasks: (1) semi-supervised segmentation of teeth and pulp canals in CBCT, and (2) semi-supervised rigid registration of CBCT and IOS. We provided 60 labeled and 640 unlabeled IOS samples, plus 30 labeled and 250 unlabeled CBCT scans with varying resolutions and fields of view. The challenge attracted strong community participation, with top teams submitting open-source deep learning-based SSL solutions. For segmentation, leading methods used nnU-Net and Mamba-like State Space Models with pseudo-labeling and consistency regularization, achieving a Dice score of 0.967 and Instance Affinity of 0.738 on the hidden test set. For registration, effective approaches combined PointNetLK with differentiable SVD and geometric augmentation to handle modality gaps; hybrid neural-classical refinement enabled accurate alignment despite limited labels. All data and code are publicly available at this https URL to ensure reproducibility.

Title: Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

Authors: Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu, Jia-Wang Bian, Changhu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02870
Pdf URL: https://arxiv.org/pdf/2512.02870
Copy Paste: [[2512.02870]] Taming Camera-Controlled Video Generation with Verifiable Geometry Reward(https://arxiv.org/abs/2512.02870)
Keywords: diffusion
Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.

Title: OptPO: Optimal Rollout Allocation for Test-time Policy Optimization

Authors: Youkang Wang, Jian Wang, Rubing Chen, Tianyi Zeng, Xiao-Yong Wei, Qing Li
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.02882
Pdf URL: https://arxiv.org/pdf/2512.02882
Copy Paste: [[2512.02882]] OptPO: Optimal Rollout Allocation for Test-time Policy Optimization(https://arxiv.org/abs/2512.02882)
Keywords: large language model
Abstract: Test-time policy optimization enables large language models (LLMs) to adapt to distribution shifts by leveraging feedback from self-generated rollouts. However, existing methods rely on fixed-budget majority voting to estimate rewards, incurring substantial computational redundancy. We propose Optimal Rollout Allocation for Test-time Policy Optimization (OptPO), a principled framework that adaptively allocates inference budgets. By formulating the voting process as a Bayesian sequential probability ratio test, OptPO dynamically halts sampling once the posterior confidence in a consensus answer exceeds a specified threshold. Crucially, it utilizes the retained rollouts for on-policy updates, seamlessly integrating with algorithms like PPO or GRPO without requiring ground-truth labels. Across diverse reasoning benchmarks, OptPO significantly reduces rollout overhead compared to fixed-sample baselines while preserving or improving accuracy. By unifying statistically optimal stopping with test-time learning, OptPO offers a computationally efficient paradigm for test-time adaptation. The source code will be open upon acceptance at https://open-upon-acceptance.

Title: Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules

Authors: Amr Mohamed, Yang Zhang, Michalis Vazirgiannis, Guokan Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02892
Pdf URL: https://arxiv.org/pdf/2512.02892
Copy Paste: [[2512.02892]] Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules(https://arxiv.org/abs/2512.02892)
Keywords: robust, diffusion, large language model
Abstract: Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves $3.8$-$4.0\times$ speedups while retaining $99.8$-$100\%$ of the baseline score on average. On base models, SchED yields consistent speedup gains with $99.1$-$100\%$ performance retention, with up to $2.34\times$ under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, $\gamma{=}4$), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.

Title: MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm

Authors: Wei Chen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Zide Liu, Xuhao Pan, Chang Ren, Xudong Rao, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Yufei Zheng, Chunpeng Zhou, Pan Zhou, Xuhan Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02895
Pdf URL: https://arxiv.org/pdf/2512.02895
Copy Paste: [[2512.02895]] MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm(https://arxiv.org/abs/2512.02895)
Keywords: large language model
Abstract: We present MindGPT-4ov, a multimodal large language model (MLLM) that introduces a general post-training paradigm spanning data production, model training, and efficient deployment. It achieves state-of-the-art performance across multiple benchmarks at low cost, effectively enhancing the foundational capabilities of MLLMs and the generalization ability. Focusing on data construction, supervised fine-tuning strategies, and multimodal reinforcement learning methods, this work proposes three key innovations: (1) An information density-based data generation scheme, integrated with a dual-dimensional tree-structured label system, enabling automated generation of high-quality cross-domain data. (2) A collaborative curriculum supervised fine-tuning approach that balances the injection of domain-specific knowledge with the preservation of general capabilities. (3) A hybrid reinforcement learning paradigm that enhances reasoning ability while simultaneously addressing multi-objective optimization such as diversity exploration, maintenance of multimodal perception, and response conciseness. Moreover, we implement a series of infrastructure optimizations, such as 5D parallel training, operator optimization, and inference quantization to enhance training and inference efficiency while reducing the cost of domain adaptation. Experimental results demonstrate that the MindGPT-4ov model outperforms state-of-the-art models on benchmarks such as MMBench, MMStar, MathVision, and MathVista. In addition, MindGPT-4ov also demonstrates superior user experience in vertical domain tasks, enabling a seamless transition from academic research to industrial deployment. MindGPT-4ov provides a general post-training paradigm applicable to a wide range of MLLMs. The model weights, datasets, and code for the Qwen3-VL-based variants will be recently open-sourced to support the community's development of MLLMs.

Title: Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models

Authors: Pierpaolo Serio, Giulio Pisaneschi, Andrea Dan Ryals, Vincenzo Infantino, Lorenzo Gentilini, Valentina Donzella, Lorenzo Pollini
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.02897
Pdf URL: https://arxiv.org/pdf/2512.02897
Copy Paste: [[2512.02897]] Polar Perspectives: Evaluating 2-D LiDAR Projections for Robust Place Recognition with Visual Foundation Models(https://arxiv.org/abs/2512.02897)
Keywords: robust
Abstract: This work presents a systematic investigation into how alternative LiDAR-to-image projections affect metric place recognition when coupled with a state-of-the-art vision foundation model. We introduce a modular retrieval pipeline that controls for backbone, aggregation, and evaluation protocol, thereby isolating the influence of the 2-D projection itself. Using consistent geometric and structural channels across multiple datasets and deployment scenarios, we identify the projection characteristics that most strongly determine discriminative power, robustness to environmental variation, and suitability for real-time autonomy. Experiments with different datasets, including integration into an operational place recognition policy, validate the practical relevance of these findings and demonstrate that carefully designed projections can serve as an effective surrogate for end-to-end 3-D learning in LiDAR place recognition.

Title: Glance: Accelerating Diffusion Models with 1 Sample

Authors: Zhuobai Dong, Rui Zhao, Songjie Wu, Junchao Yi, Linjie Li, Zhengyuan Yang, Lijuan Wang, Alex Jinpeng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02899
Pdf URL: https://arxiv.org/pdf/2512.02899
Copy Paste: [[2512.02899]] Glance: Accelerating Diffusion Models with 1 Sample(https://arxiv.org/abs/2512.02899)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in image generation, yet their deployment remains constrained by the heavy computational cost and the need for numerous inference steps. Previous efforts on fewer-step distillation attempt to skip redundant steps by training compact student models, yet they often suffer from heavy retraining costs and degraded generalization. In this work, we take a different perspective: we accelerate smartly, not evenly, applying smaller speedups to early semantic stages and larger ones to later redundant phases. We instantiate this phase-aware strategy with two experts that specialize in slow and fast denoising phases. Surprisingly, instead of investing massive effort in retraining student models, we find that simply equipping the base model with lightweight LoRA adapters achieves both efficient acceleration and strong generalization. We refer to these two adapters as Slow-LoRA and Fast-LoRA. Through extensive experiments, our method achieves up to 5 acceleration over the base model while maintaining comparable visual quality across diverse benchmarks. Remarkably, the LoRA experts are trained with only 1 samples on a single V100 within one hour, yet the resulting models generalize strongly on unseen prompts.

Title: FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization

Authors: Feiyu Wang, Xinyu Tan, Bokai Huang, Yihao Zhang, Guoan Wang, Peizhuang Cong, Tong Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02901
Pdf URL: https://arxiv.org/pdf/2512.02901
Copy Paste: [[2512.02901]] FAIRY2I: Universal Extremely-Low Bit QAT framework via Widely-Linear Representation and Phase-Aware Quantization(https://arxiv.org/abs/2512.02901)
Keywords: fair, transformer, large language model
Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.

Title: MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Authors: Fan Yang, Kaihao Zhang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2512.02906
Pdf URL: https://arxiv.org/pdf/2512.02906
Copy Paste: [[2512.02906]] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding(https://arxiv.org/abs/2512.02906)
Keywords: large language model
Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent study address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocalbulary object detection (OVD) model that identifies object regions using a sliding-window this http URL on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.

Title: Belobog: Move Language Fuzzing Framework For Real-World Smart Contracts

Authors: Wanxu Xia, Ziqiao Kong, Zhengwei Li, Yi Lu, Pan Li, Liqun Yang, Yang Liu, Xiapu Luo, Shaohua Li
Subjects: cs.CR, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2512.02918
Pdf URL: https://arxiv.org/pdf/2512.02918
Copy Paste: [[2512.02918]] Belobog: Move Language Fuzzing Framework For Real-World Smart Contracts(https://arxiv.org/abs/2512.02918)
Keywords: secure
Abstract: Move is a research-oriented programming language design for secure and verifiable smart contract development and has been widely used in managing billions of digital assets in blockchains, such as Sui and Aptos. Move features a strong static type system and explicit resource semantics to enforce safety properties such as the prevention of data races, invalid asset transfers, and entry vulnerabilities. However, smart contracts written in Move may still contain certain vulnerabilities that are beyond the reach of its type system. It is thus essential to validate Move smart contracts. Unfortunately, due to its strong type system, existing smart contract fuzzers are ineffective in producing syntactically or semantically valid transactions to test Move smart contracts. This paper introduces the first fuzzing framework, Belobog, for Move smart contracts. Belobog is type-aware and ensures that all generated and mutated transactions are well-typed. More specifically, for a target Move smart contract, Belobog first constructs a type graph based on Move's type system, and then generates or mutates a transaction based on the graph trace derived from the type graph. In order to overcome the complex checks in Move smart contracts, we further design and implement a concolic executor in Belobog. We evaluated Belobog on 109 real-world Move smart contract projects. The experimental results show that Belobog is able to detect 100\% critical and 79\% major vulnerabilities manually audited by human experts. We further selected two recent notorious incidents in Move smart contracts, i.e., Cetus and Nemo. Belobog successfully reproduced full exploits for both of them, without any prior knowledge.

Title: AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Authors: Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02924
Pdf URL: https://arxiv.org/pdf/2512.02924
Copy Paste: [[2512.02924]] AutoNeural: Co-Designing Vision-Language Models for NPU Inference(https://arxiv.org/abs/2512.02924)
Keywords: robust, transformer
Abstract: While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision--Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing quantization error of vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. The AutoNeural also delivers 3x decoding speed and 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.

Title: DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation

Authors: Ying Yang, Zhengyao Lv, Tianlin Pan, Haofan Wang, Binxin Yang, Hubery Yin, Chen Li, Chenyang Si
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02931
Pdf URL: https://arxiv.org/pdf/2512.02931
Copy Paste: [[2512.02931]] DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation(https://arxiv.org/abs/2512.02931)
Keywords: generative
Abstract: In this paper, we investigate the underexplored challenge of sample diversity in autoregressive (AR) generative models with bitwise visual tokenizers. We first analyze the factors that limit diversity in bitwise AR models and identify two key issues: (1) the binary classification nature of bitwise modeling, which restricts the prediction space, and (2) the overly sharp logits distribution, which causes sampling collapse and reduces diversity. Building on these insights, we propose DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling, resulting in smoother predictions and greater diversity. To mitigate potential fidelity loss caused by distribution smoothing, we further develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens, thereby preserving high visual quality. Extensive experiments demonstrate that DiverseAR substantially improves sample diversity in bitwise autoregressive image generation.

Title: LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization

Authors: Zhihan Xiao, Lin Liu, Yixin Gao, Xiaopeng Zhang, Haoxuan Che, Songping Mai, Qi Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02933
Pdf URL: https://arxiv.org/pdf/2512.02933
Copy Paste: [[2512.02933]] LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization(https://arxiv.org/abs/2512.02933)
Keywords: diffusion
Abstract: Text-guided video editing, particularly for object removal and addition, remains a challenging task due to the need for precise spatial and temporal consistency. Existing methods often rely on auxiliary masks or reference images for editing guidance, which limits their scalability and generalization. To address these issues, we propose LoVoRA, a novel framework for mask-free video object removal and addition using object-aware localization mechanism. Our approach utilizes a unique dataset construction pipeline that integrates image-to-video translation, optical flow-based mask propagation, and video inpainting, enabling temporally consistent edits. The core innovation of LoVoRA is its learnable object-aware localization mechanism, which provides dense spatio-temporal supervision for both object insertion and removal tasks. By leveraging a Diffusion Mask Predictor, LoVoRA achieves end-to-end video editing without requiring external control signals during inference. Extensive experiments and human evaluation demonstrate the effectiveness and high-quality performance of LoVoRA.

Title: Layout Anything: One Transformer for Universal Room Layout Estimation

Authors: Md Sohag Mia, Muhammad Abdullah Adnan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02952
Pdf URL: https://arxiv.org/pdf/2512.02952
Copy Paste: [[2512.02952]] Layout Anything: One Transformer for Universal Room Layout Estimation(https://arxiv.org/abs/2512.02952)
Keywords: transformer, segmentation
Abstract: We present Layout Anything, a transformer-based framework for indoor layout estimation that adapts the OneFormer's universal segmentation architecture to geometric structure prediction. Our approach integrates OneFormer's task-conditioned queries and contrastive learning with two key modules: (1) a layout degeneration strategy that augments training data while preserving Manhattan-world constraints through topology-aware transformations, and (2) differentiable geometric losses that directly enforce planar consistency and sharp boundary predictions during training. By unifying these components in an end-to-end framework, the model eliminates complex post-processing pipelines while achieving high-speed inference at 114ms. Extensive experiments demonstrate state-of-the-art performance across standard benchmarks, with pixel error (PE) of 5.43% and corner error (CE) of 4.02% on the LSUN, PE of 7.04% (CE 5.17%) on the Hedau and PE of 4.03% (CE 3.15%) on the Matterport3D-Layout datasets. The framework's combination of geometric awareness and computational efficiency makes it particularly suitable for augmented reality applications and large-scale 3D scene reconstruction tasks.

Title: A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems

Authors: Yuhan Chen, Yicui Shi, Guofa Li, Guangrui Bai, Jinyuan Shao, Xiangfei Huang, Wenbo Chu, Keqiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02965
Pdf URL: https://arxiv.org/pdf/2512.02965
Copy Paste: [[2512.02965]] A Lightweight Real-Time Low-Light Enhancement Network for Embedded Automotive Vision Systems(https://arxiv.org/abs/2512.02965)
Keywords: extraction
Abstract: In low-light environments like nighttime driving, image degradation severely challenges in-vehicle camera safety. Since existing enhancement algorithms are often too computationally intensive for vehicular applications, we propose UltraFast-LieNET, a lightweight multi-scale shifted convolutional network for real-time low-light image enhancement. We introduce a Dynamic Shifted Convolution (DSConv) kernel with only 12 learnable parameters for efficient feature extraction. By integrating DSConv with varying shift distances, a Multi-scale Shifted Residual Block (MSRB) is constructed to significantly expand the receptive field. To mitigate lightweight network instability, a residual structure and a novel multi-level gradient-aware loss function are incorporated. UltraFast-LieNET allows flexible parameter configuration, with a minimum size of only 36 parameters. Results on the LOLI-Street dataset show a PSNR of 26.51 dB, outperforming state-of-the-art methods by 4.6 dB while utilizing only 180 parameters. Experiments across four benchmark datasets validate its superior balance of real-time performance and enhancement quality under limited resources. Code is available at https://githubhttps://github.com/YuhanChen2024/UltraFast-LiNET

Title: BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

Authors: Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.02972
Pdf URL: https://arxiv.org/pdf/2512.02972
Copy Paste: [[2512.02972]] BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection(https://arxiv.org/abs/2512.02972)
Keywords: robust, diffusion
Abstract: Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at this https URL.

Title: Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Authors: Yuan Xiong, Ziqi Miao, Lijun Li, Chen Qian, Jie Li, Jing Shao
Subjects: cs.CV, cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2512.02973
Pdf URL: https://arxiv.org/pdf/2512.02973
Copy Paste: [[2512.02973]] Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities(https://arxiv.org/abs/2512.02973)
Keywords: attack, large language model
Abstract: While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack's efficacy, the system incorporate contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31\% and 91.07\%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs.

Title: InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

Authors: Zhongyu Yang, Yingfang Yuan, Xuanming Jiang, Baoyi An, Wei Pang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02981
Pdf URL: https://arxiv.org/pdf/2512.02981
Copy Paste: [[2512.02981]] InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration(https://arxiv.org/abs/2512.02981)
Keywords: robust, large language model
Abstract: Hallucination remains a critical challenge in large language models (LLMs), hindering the development of reliable multimodal LLMs (MLLMs). Existing solutions often rely on human intervention or underutilize the agent's ability to autonomously mitigate hallucination. To address these limitations, we draw inspiration from how humans make reliable decisions in the real world. They begin with introspective reasoning to reduce uncertainty and form an initial judgment, then rely on external verification from diverse perspectives to reach a final decision. Motivated by this cognitive paradigm, we propose InEx, a training-free, multi-agent framework designed to autonomously mitigate hallucination. InEx introduces internal introspective reasoning, guided by entropy-based uncertainty estimation, to improve the reliability of the decision agent's reasoning process. The agent first generates a response, which is then iteratively verified and refined through external cross-modal multi-agent collaboration with the editing agent and self-reflection agents, further enhancing reliability and mitigating hallucination. Extensive experiments show that InEx consistently outperforms existing methods, achieving 4%-27% gains on general and hallucination benchmarks, and demonstrating strong robustness.

Title: U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences

Authors: Xiang Xu, Ao Liang, Youquan Liu, Linfeng Li, Lingdong Kong, Ziwei Liu, Qingshan Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.02982
Pdf URL: https://arxiv.org/pdf/2512.02982
Copy Paste: [[2512.02982]] U4D: Uncertainty-Aware 4D World Modeling from LiDAR Sequences(https://arxiv.org/abs/2512.02982)
Keywords: diffusion, generative, segmentation
Abstract: Modeling dynamic 3D environments from LiDAR sequences is central to building reliable 4D worlds for autonomous driving and embodied AI. Existing generative frameworks, however, often treat all spatial regions uniformly, overlooking the varying uncertainty across real-world scenes. This uniform generation leads to artifacts in complex or ambiguous regions, limiting realism and temporal stability. In this work, we present U4D, an uncertainty-aware framework for 4D LiDAR world modeling. Our approach first estimates spatial uncertainty maps from a pretrained segmentation model to localize semantically challenging regions. It then performs generation in a "hard-to-easy" manner through two sequential stages: (1) uncertainty-region modeling, which reconstructs high-entropy regions with fine geometric fidelity, and (2) uncertainty-conditioned completion, which synthesizes the remaining areas under learned structural priors. To further ensure temporal coherence, U4D incorporates a mixture of spatio-temporal (MoST) block that adaptively fuses spatial and temporal representations during diffusion. Extensive experiments show that U4D produces geometrically faithful and temporally consistent LiDAR sequences, advancing the reliability of 4D world modeling for autonomous perception and simulation.

Title: Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic

Authors: Muyu Pan, Dheeraj Kodakandla, Mahfuza Farooque
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.02987
Pdf URL: https://arxiv.org/pdf/2512.02987
Copy Paste: [[2512.02987]] Fine-Tuned Large Language Models for Logical Translation: Reducing Hallucinations with Lang2Logic(https://arxiv.org/abs/2512.02987)
Keywords: large language model
Abstract: Recent advances in natural language processing (NLP), particularly large language models (LLMs), have motivated the automatic translation of natural language statements into formal logic without human intervention. This enables automated reasoning and facilitates debugging, finding loop invariants, and adhering to specifications in software systems. However, hallucinations-incorrect outputs generated by LLMs are challenging, particularly for logical translation tasks requiring precision. This work introduces a novel framework that inputs English sentences, converts them into logical expressions, and then translates them into Conjunctive Normal Form (CNF) for satisfiability solving. It employs classical NLP techniques with self-defined grammar, symbolic computation libraries, and a fine-tuned language model to reduce hallucinations. In the early experiments, we observed that the fine-tuned model, trained on different grammar settings, could intentionally correct the same types of hallucinations made by the original model. Thus, it provides reliable CNF generation.

Title: GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection

Authors: Md Sohag Mia, Md Nahid Hasan, Tawhid Ahmed, Muhammad Abdullah Adnan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02991
Pdf URL: https://arxiv.org/pdf/2512.02991
Copy Paste: [[2512.02991]] GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection(https://arxiv.org/abs/2512.02991)
Keywords: transformer
Abstract: Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6\% AP$_{25}$ and 51.2\% AP$_{50}$) and ScanNetV2 (75.1\% AP$_{25}$ and 60.8\% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.

Title: TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond

Authors: Yifei Zeng, Yajie Bao, Jiachen Qian, Shuang Wu, Youtian Lin, Hao Zhu, Buyu Li, Feihu Zhang, Xun Cao, Yao Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.02993
Pdf URL: https://arxiv.org/pdf/2512.02993
Copy Paste: [[2512.02993]] TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond(https://arxiv.org/abs/2512.02993)
Keywords: diffusion, transformer, segmentation
Abstract: Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.

Title: DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images

Authors: Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Hongyang Li, Ya-Qin Zhang, Hao Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03004
Pdf URL: https://arxiv.org/pdf/2512.03004
Copy Paste: [[2512.03004]] DGGT: Feedforward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images(https://arxiv.org/abs/2512.03004)
Keywords: diffusion, transformer
Abstract: Autonomous driving needs fast, scalable 4D reconstruction and re-simulation for training and evaluation, yet most methods for dynamic driving scenes still rely on per-scene optimization, known camera calibration, or short frame windows, making them slow and impractical. We revisit this problem from a feedforward perspective and introduce \textbf{Driving Gaussian Grounded Transformer (DGGT)}, a unified framework for pose-free dynamic scene reconstruction. We note that the existing formulations, treating camera pose as a required input, limit flexibility and scalability. Instead, we reformulate pose as an output of the model, enabling reconstruction directly from sparse, unposed images and supporting an arbitrary number of views for long sequences. Our approach jointly predicts per-frame 3D Gaussian maps and camera parameters, disentangles dynamics with a lightweight dynamic head, and preserves temporal consistency with a lifespan head that modulates visibility over time. A diffusion-based rendering refinement further reduces motion/interpolation artifacts and improves novel-view quality under sparse inputs. The result is a single-pass, pose-free algorithm that achieves state-of-the-art performance and speed. Trained and evaluated on large-scale driving benchmarks (Waymo, nuScenes, Argoverse2), our method outperforms prior work both when trained on each dataset and in zero-shot transfer across datasets, and it scales well as the number of input frames increases.

Title: In-Context Sync-LoRA for Portrait Video Editing

Authors: Sagi Polaczek, Or Patashnik, Ali Mahdavi-Amiri, Daniel Cohen-Or
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.03013
Pdf URL: https://arxiv.org/pdf/2512.03013
Copy Paste: [[2512.03013]] In-Context Sync-LoRA for Portrait Video Editing(https://arxiv.org/abs/2512.03013)
Keywords: robust, diffusion
Abstract: Editing portrait videos is a challenging task that requires flexible yet precise control over a wide range of modifications, such as appearance changes, expression edits, or the addition of objects. The key difficulty lies in preserving the subject's original temporal behavior, demanding that every edited frame remains precisely synchronized with the corresponding source frame. We present Sync-LoRA, a method for editing portrait videos that achieves high-quality visual modifications while maintaining frame-accurate synchronization and identity consistency. Our approach uses an image-to-video diffusion model, where the edit is defined by modifying the first frame and then propagated to the entire sequence. To enable accurate synchronization, we train an in-context LoRA using paired videos that depict identical motion trajectories but differ in appearance. These pairs are automatically generated and curated through a synchronization-based filtering process that selects only the most temporally aligned examples for training. This training setup teaches the model to combine motion cues from the source video with the visual changes introduced in the edited first frame. Trained on a compact, highly curated set of synchronized human portraits, Sync-LoRA generalizes to unseen identities and diverse edits (e.g., modifying appearance, adding objects, or changing backgrounds), robustly handling variations in pose and expression. Our results demonstrate high visual fidelity and strong temporal coherence, achieving a robust balance between edit fidelity and precise motion preservation.

Title: Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks

Authors: Matthew Dutson, Nathan Labiosa, Yin Li, Mohit Gupta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03014
Pdf URL: https://arxiv.org/pdf/2512.03014
Copy Paste: [[2512.03014]] Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks(https://arxiv.org/abs/2512.03014)
Keywords: robust, segmentation
Abstract: When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions.

Title: AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry

Authors: Xiang Xu, Pradeep Kumar Jayaraman, Joseph G. Lambourne, Yilin Liu, Durvesh Malpure, Pete Meltzer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03018
Pdf URL: https://arxiv.org/pdf/2512.03018
Copy Paste: [[2512.03018]] AutoBrep: Autoregressive B-Rep Generation with Unified Topology and Geometry(https://arxiv.org/abs/2512.03018)
Keywords: transformer
Abstract: The boundary representation (B-Rep) is the standard data structure used in Computer-Aided Design (CAD) for defining solid models. Despite recent progress, directly generating B-Reps end-to-end with precise geometry and watertight topology remains a challenge. This paper presents AutoBrep, a novel Transformer model that autoregressively generates B-Reps with high quality and validity. AutoBrep employs a unified tokenization scheme that encodes both geometric and topological characteristics of a B-Rep model as a sequence of discrete tokens. Geometric primitives (i.e., surfaces and curves) are encoded as latent geometry tokens, and their structural relationships are defined as special topological reference tokens. Sequence order in AutoBrep naturally follows a breadth first traversal of the B-Rep face adjacency graph. At inference time, neighboring faces and edges along with their topological structure are progressively generated. Extensive experiments demonstrate the advantages of our unified representation when coupled with next-token prediction for B-Rep generation. AutoBrep outperforms baselines with better quality and watertightness. It is also highly scalable to complex solids with good fidelity and inference speed. We further show that autocompleting B-Reps is natively supported through our unified tokenization, enabling user-controllable CAD generation with minimal changes. Code is available at this https URL.

Title: Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge

Authors: Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03019
Pdf URL: https://arxiv.org/pdf/2512.03019
Copy Paste: [[2512.03019]] Distribution-Calibrated Inference time compute for Thinking LLM-as-a-Judge(https://arxiv.org/abs/2512.03019)
Keywords: large language model
Abstract: Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.

Title: Unrolled Networks are Conditional Probability Flows in MRI Reconstruction

Authors: Kehan Qi, Saumya Gupta, Qingqiao Hu, Weimin Lyu, Chao Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03020
Pdf URL: https://arxiv.org/pdf/2512.03020
Copy Paste: [[2512.03020]] Unrolled Networks are Conditional Probability Flows in MRI Reconstruction(https://arxiv.org/abs/2512.03020)
Keywords: diffusion, generative
Abstract: Magnetic Resonance Imaging (MRI) offers excellent soft-tissue contrast without ionizing radiation, but its long acquisition time limits clinical utility. Recent methods accelerate MRI by under-sampling $k$-space and reconstructing the resulting images using deep learning. Unrolled networks have been widely used for the reconstruction task due to their efficiency, but suffer from unstable evolving caused by freely-learnable parameters in intermediate steps. In contrast, diffusion models based on stochastic differential equations offer theoretical stability in both medical and natural image tasks but are computationally expensive. In this work, we introduce flow ODEs to MRI reconstruction by theoretically proving that unrolled networks are discrete implementations of conditional probability flow ODEs. This connection provides explicit formulations for parameters and clarifies how intermediate states should evolve. Building on this insight, we propose Flow-Aligned Training (FLAT), which derives unrolled parameters from the ODE discretization and aligns intermediate reconstructions with the ideal ODE trajectory to improve stability and convergence. Experiments on three MRI datasets show that FLAT achieves high-quality reconstructions with up to $3\times$ fewer iterations than diffusion-based generative models and significantly greater stability than unrolled networks.

Title: TokenPowerBench: Benchmarking the Power Consumption of LLM Inference

Authors: Chenxu Niu, Wei Zhang, Jie Li, Yongjian Zhao, Tongyang Wang, Xi Wang, Yong Chen
Subjects: cs.LG, cs.AI, cs.CY, cs.DC
Abstract URL: https://arxiv.org/abs/2512.03024
Pdf URL: https://arxiv.org/pdf/2512.03024
Copy Paste: [[2512.03024]] TokenPowerBench: Benchmarking the Power Consumption of LLM Inference(https://arxiv.org/abs/2512.03024)
Keywords: large language model
Abstract: Large language model (LLM) services now answer billions of queries per day, and industry reports show that inference, not training, accounts for more than 90% of total power consumption. However, existing benchmarks focus on either training/fine-tuning or performance of inference and provide little support for power consumption measurement and analysis of inference. We introduce TokenPowerBench, the first lightweight and extensible benchmark designed for LLM-inference power consumption studies. The benchmark combines (i) a declarative configuration interface covering model choice, prompt set, and inference engine, (ii) a measurement layer that captures GPU-, node-, and system-level power without specialized power meters, and (iii) a phase-aligned metrics pipeline that attributes energy to the prefill and decode stages of every request. These elements make it straight-forward to explore the power consumed by an LLM inference run; furthermore, by varying batch size, context length, parallelism strategy and quantization, users can quickly assess how each setting affects joules per token and other energy-efficiency metrics. We evaluate TokenPowerBench on four of the most widely used model series (Llama, Falcon, Qwen, and Mistral). Our experiments cover from 1 billion parameters up to the frontier-scale Llama3-405B model. Furthermore, we release TokenPowerBench as open source to help users to measure power consumption, forecast operating expenses, and meet sustainability targets when deploying LLM services.

Title: The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models

Authors: Saeid Jamshidi, Kawser Wazed Nafi, Arghavan Moradi Dakhel, Negar Shahabi, Foutse Khomh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03026
Pdf URL: https://arxiv.org/pdf/2512.03026
Copy Paste: [[2512.03026]] The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models(https://arxiv.org/abs/2512.03026)
Keywords: large language model
Abstract: The rapid advancement and adaptability of Large Language Models (LLMs) highlight the need for moral consistency, the capacity to maintain ethically coherent reasoning across varied contexts. Existing alignment frameworks, structured approaches designed to align model behavior with human ethical and social norms, often rely on static datasets and post-hoc evaluations, offering limited insight into how ethical reasoning may evolve across different contexts or temporal scales. This study presents the Moral Consistency Pipeline (MoCoP), a dataset-free, closed-loop framework for continuously evaluating and interpreting the moral stability of LLMs. MoCoP combines three supporting layers: (i) lexical integrity analysis, (ii) semantic risk estimation, and (iii) reasoning-based judgment modeling within a self-sustaining architecture that autonomously generates, evaluates, and refines ethical scenarios without external supervision. Our empirical results on GPT-4-Turbo and DeepSeek suggest that MoCoP effectively captures longitudinal ethical behavior, revealing a strong inverse relationship between ethical and toxicity dimensions (correlation rET = -0.81, p value less than 0.001) and a near-zero association with response latency (correlation rEL approximately equal to 0). These findings demonstrate that moral coherence and linguistic safety tend to emerge as stable and interpretable characteristics of model behavior rather than short-term fluctuations. Furthermore, by reframing ethical evaluation as a dynamic, model-agnostic form of moral introspection, MoCoP offers a reproducible foundation for scalable, continuous auditing and advances the study of computational morality in autonomous AI systems.

Title: MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

Authors: Youxin Pang, Jiajun Liu, Lingfeng Tan, Yong Zhang, Feng Gao, Xiang Deng, Zhuoliang Kang, Xiaoming Wei, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03034
Pdf URL: https://arxiv.org/pdf/2512.03034
Copy Paste: [[2512.03034]] MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation(https://arxiv.org/abs/2512.03034)
Keywords: diffusion
Abstract: We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human this http URL primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary this http URL Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these this http URL, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video this http URL, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content this http URL experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.

Title: Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Authors: Zeqi Xiao, Yiwei Zhao, Lingxiao Li, Yushi Lan, Yu Ning, Rahul Garg, Roshni Cooper, Mohammad H. Taghavi, Xingang Pan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.03040
Pdf URL: https://arxiv.org/pdf/2512.03040
Copy Paste: [[2512.03040]] Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation(https://arxiv.org/abs/2512.03040)
Keywords: diffusion, generative
Abstract: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation - following camera-pose instructions while remaining consistent with 3D geometry of the scene, and object grounding - which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.

Title: OneThinker: All-in-one Reasoning Model for Image and Video

Authors: Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, Xiangyu Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03043
Pdf URL: https://arxiv.org/pdf/2512.03043
Copy Paste: [[2512.03043]] OneThinker: All-in-one Reasoning Model for Image and Video(https://arxiv.org/abs/2512.03043)
Keywords: large language model, segmentation
Abstract: Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.

Title: CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models

Authors: Minkyung Kwon, Jinhyeok Choi, Jiho Park, Seonghu Jeon, Jinhyuk Jang, Junyoung Seo, Minseop Kwak, Jin-Hwa Kim, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03045
Pdf URL: https://arxiv.org/pdf/2512.03045
Copy Paste: [[2512.03045]] CAMEO: Correspondence-Attention Alignment for Multi-View Diffusion Models(https://arxiv.org/abs/2512.03045)
Keywords: diffusion
Abstract: Multi-view diffusion models have recently emerged as a powerful paradigm for novel view synthesis, yet the underlying mechanism that enables their view-consistency remains unclear. In this work, we first verify that the attention maps of these models acquire geometric correspondence throughout training, attending to the geometrically corresponding regions across reference and target views for view-consistent generation. However, this correspondence signal remains incomplete, with its accuracy degrading under large viewpoint changes. Building on these findings, we introduce CAMEO, a simple yet effective training technique that directly supervises attention maps using geometric correspondence to enhance both the training efficiency and generation quality of multi-view diffusion models. Notably, supervising a single attention layer is sufficient to guide the model toward learning precise correspondences, thereby preserving the geometry and structure of reference images, accelerating convergence, and improving novel view synthesis performance. CAMEO reduces the number of training iterations required for convergence by half while achieving superior performance at the same iteration counts. We further demonstrate that CAMEO is model-agnostic and can be applied to any multi-view diffusion model.

Title: MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues

Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.03046
Pdf URL: https://arxiv.org/pdf/2512.03046
Copy Paste: [[2512.03046]] MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues(https://arxiv.org/abs/2512.03046)
Keywords: diffusion, transformer, generative
Abstract: We propose MagicQuill V2, a novel system that introduces a \textbf{layered composition} paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.