2025-12-15

Title: Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering

Authors: Yifei He, Pranit Chawla, Yaser Souri, Subhojit Som, Xia Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10962
Pdf URL: https://arxiv.org/pdf/2512.10962
Copy Paste: [[2512.10962]] Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering(https://arxiv.org/abs/2512.10962)
Keywords: robust
Abstract: Computer use agents (CUAs) can operate real-world digital interfaces but remain difficult to train due to the high cost of graphical user interface (GUI) interaction and the scarcity of high-quality trajectory data. Existing datasets rely on human demonstrations, limiting scalability. A natural alternative is to synthesize data from strong CUAs, yet their rollouts are highly noisy, with incorrect or suboptimal actions constituting a large proportion of the steps, making naive imitation ineffective. To tackle this challenge, we introduce a scalable data synthesis pipeline that transforms noisy rollouts into reliable supervision without human annotation. The core idea is step-level filtering, which evaluates actions individually to retain only correct steps, complemented by reasoning augmentation for improved planning. Using this pipeline, we construct WebSTAR, a dataset of 13.3K trajectories and 100K graded, reasoning-rich steps synthesized from OpenAI's computer-use-preview model. We train Qwen-2.5-VL-Instruct models (7B and 32B) on WebSTAR. On WebVoyager, our 7B model surpasses the SoTA open-source CUA model UI-TARS-1.5-7B by more than 15% with only supervised finetuning. Building on step-level grading, we further create WebSCORE, a dataset of graded step-level actions, and train StepRM, a 7B multimodal reward model distilled from o4-mini, which matches its grading quality while being far more efficient to deploy at scale. Our results establish step-level filtering as a key principle for scalable CUA training and construct two new datasets (WebSTAR, WebSCORE) and a lightweight reward model (StepRM) as practical tools to advance robust and efficient CUAs.

Title: Multimodal Fusion of Regional Brain Experts for Interpretable Alzheimer's Disease Diagnosis

Authors: Farica Zhuang, Dinara Aliyeva, Shu Yang, Zixuan Wen, Duy Duong-Tran, Christos Davatzikos, Tianlong Chen, Song Wang, Li Shen
Subjects: cs.LG, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.10966
Pdf URL: https://arxiv.org/pdf/2512.10966
Copy Paste: [[2512.10966]] Multimodal Fusion of Regional Brain Experts for Interpretable Alzheimer's Disease Diagnosis(https://arxiv.org/abs/2512.10966)
Keywords: interpretability
Abstract: Accurate and early diagnosis of Alzheimer's disease (AD) can benefit from integrating complementary information from multiple modalities, mirroring clinical practice. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models meso-scale brain regions in each modality as an independent expert and employs two-level gating networks to learn subject-specific fusion weights. Beyond improving diagnostic performance, MREF-AD provides modality- and region-level insight into how structural and molecular imaging jointly contribute to disease diagnosis. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), MREF-AD achieves state-of-the-art performance over baselines while providing enhanced interpretability of brain region-specific biomarker relevance, underscoring its utility as a general framework for adaptive and interpretable multimodal fusion in neuroimaging.

Title: ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages

Authors: Subham Kumar, Prakrithi Shivaprakash, Abhishek Manoharan, Astut Kurariya, Diptadhi Mukherjee, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2512.10967
Pdf URL: https://arxiv.org/pdf/2512.10967
Copy Paste: [[2512.10967]] ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages(https://arxiv.org/abs/2512.10967)
Keywords: fair
Abstract: Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown. In this study, we conduct the first systematic audit of ASR performance on real-world clinical interview data spanning Kannada, Hindi, and Indian English, comparing leading models including Indic Whisper, Whisper, Sarvam, Google speech to text, Gemma3n, Omnilingual, Vaani, and Gemini. We evaluate transcription accuracy across languages, speakers, and demographic subgroups, with a particular focus on error patterns affecting patients vs. clinicians and gender-based or intersectional disparities. Our results reveal substantial variability across models and languages, with some systems performing competitively on Indian English but failing on code-mixed or vernacular speech. We also uncover systematic performance gaps tied to speaker role and gender, raising concerns about equitable deployment in clinical settings. By providing a comprehensive multilingual benchmark and fairness analysis, our work highlights the need for culturally and demographically inclusive ASR development for the healthcare ecosystem in India.

Title: TECM*: A Data-Driven Assessment to Reinforcement Learning Methods and Application to Heparin Treatment Strategy for Surgical Sepsis

Authors: Jiang Liu, Yujie Li, Chan Zhou, Yihao Xie, Qilong Sun, Xin Shu, Peiwei Li, Chunyong Yang, Yiziting Zhu, Jiaqi Zhu, Yuwen Chen, Bo An, Hao Wu, Bin Yi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.10973
Pdf URL: https://arxiv.org/pdf/2512.10973
Copy Paste: [[2512.10973]] TECM*: A Data-Driven Assessment to Reinforcement Learning Methods and Application to Heparin Treatment Strategy for Surgical Sepsis(https://arxiv.org/abs/2512.10973)
Keywords: robust
Abstract: Objective: Sepsis is a life-threatening condition caused by severe infection leading to acute organ dysfunction. This study proposes a data-driven metric and a continuous reward function to optimize personalized heparin therapy in surgical sepsis patients. Methods: Data from the MIMIC-IV v1.0 and eICU v2.0 databases were used for model development and evaluation. The training cohort consisted of abdominal surgery patients receiving unfractionated heparin (UFH) after postoperative sepsis onset. We introduce a new RL-based framework with three components: first, converting the discrete SOFA score to a continuous cxSOFA for more nuanced state and reward functions; second, defining "good" or "bad" strategies based on cxSOFA in a stepwise manner; third, proposing a Treatment Effect Comparison Matrix (TECM), analogous to a confusion matrix for classification tasks, to evaluate the treatment strategies. We applied different RL algorithms (Q-Learning, DQN, DDQN, BCQ, and CQL) to optimize the treatment and comprehensively evaluated the framework. Results: Among the AI-derived strategies, the cxSOFA-CQL model achieved the best performance, reducing mortality from 1.83% to 0.74% and the average hospital stay from 11.11 to 9.42 days. TECM demonstrated consistent outcomes across models, highlighting robustness. Conclusion: The proposed RL framework enables interpretable and robust optimization of heparin therapy in surgical sepsis. Continuous cxSOFA scoring and TECM-based evaluation provide nuanced treatment assessment, showing promise for improving clinical outcomes and decision-support reliability.

Title: MolSculpt: Sculpting 3D Molecular Geometries from Chemical Syntax

Authors: Zhanpeng Chen, Weihao Gao, Shunyu Wang, Yanan Zhu, Hong Meng, Yuexian Zou
Subjects: cs.LG, cs.AI, physics.chem-ph, q-bio.QM
Abstract URL: https://arxiv.org/abs/2512.10991
Pdf URL: https://arxiv.org/pdf/2512.10991
Copy Paste: [[2512.10991]] MolSculpt: Sculpting 3D Molecular Geometries from Chemical Syntax(https://arxiv.org/abs/2512.10991)
Keywords: diffusion
Abstract: Generating precise 3D molecular geometries is crucial for drug discovery and material science. While prior efforts leverage 1D representations like SELFIES to ensure molecular validity, they fail to fully exploit the rich chemical knowledge entangled within 1D models, leading to a disconnect between 1D syntactic generation and 3D geometric realization. To bridge this gap, we propose MolSculpt, a novel framework that "sculpts" 3D molecular geometries from chemical syntax. MolSculpt is built upon a frozen 1D molecular foundation model and a 3D molecular diffusion model. We introduce a set of learnable queries to extract inherent chemical knowledge from the foundation model, and a trainable projector then injects this cross-modal information into the conditioning space of the diffusion model to guide the 3D geometry generation. In this way, our model deeply integrates 1D latent chemical knowledge into the 3D generation process through end-to-end optimization. Experiments demonstrate that MolSculpt achieves state-of-the-art (SOTA) performance in de novo 3D molecule generation and conditional 3D molecule generation, showing superior 3D fidelity and stability on both the GEOM-DRUGS and QM9 datasets. Code is available at this https URL.

Title: MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA

Authors: Seonok Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.10996
Pdf URL: https://arxiv.org/pdf/2512.10996
Copy Paste: [[2512.10996]] MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA(https://arxiv.org/abs/2512.10996)
Keywords: large language model
Abstract: Recent advancements in retrieval-augmented generation (RAG) have significantly enhanced the ability of large language models (LLMs) to perform complex question-answering (QA) tasks. In this paper, we introduce MedBioRAG, a retrieval-augmented model designed to improve biomedical QA performance through a combination of semantic and lexical search, document retrieval, and supervised fine-tuning. MedBioRAG efficiently retrieves and ranks relevant biomedical documents, enabling precise and context-aware response generation. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets such as NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ. Experimental results demonstrate that MedBioRAG outperforms previous state-of-the-art (SoTA) models and the GPT-4o base model in all evaluated tasks. Notably, our approach improves NDCG and MRR scores for document retrieval, while achieving higher accuracy in close-ended QA and ROUGE scores in long-form QA. Our findings highlight the effectiveness of semantic search-based retrieval and LLM fine-tuning in biomedical applications.
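
Illustrative sketch (not from the paper): the abstract reports NDCG and MRR for document retrieval. The snippet below shows one common way these metrics are computed from per-query relevance judgments listed in ranked order. The function names are hypothetical, the linear-gain DCG variant is an assumption, and the ideal ranking is taken over the retrieved list only, which is a simplification.

```python
import math
from typing import List, Sequence


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[int]]) -> float:
    """MRR: average reciprocal rank over all queries."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg(relevance: Sequence[int], k: int) -> float:
    """Discounted cumulative gain at cutoff k (linear gain, log2 discount)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevance[:k], start=1)
    )


def ndcg(relevance: Sequence[int], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg(sorted(relevance, reverse=True), k)
    return dcg(relevance, k) / ideal if ideal > 0 else 0.0


# Toy judgments: each inner list is one query's ranked results (1 = relevant).
runs: List[List[int]] = [[0, 1, 1], [1, 0, 0]]
print("MRR:", mean_reciprocal_rank(runs))
print("NDCG@3:", [round(ndcg(r, 3), 4) for r in runs])
```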

Title: SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

Authors: Mohamed Afane, Abhishek Satyam, Ke Chen, Tao Li, Junaid Farooq, Juntao Chen
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2512.10998
Pdf URL: https://arxiv.org/pdf/2512.10998
Copy Paste: [[2512.10998]] SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models(https://arxiv.org/abs/2512.10998)
Keywords: security, defense, attack
Abstract: Backdoor attacks create significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks using contextually-appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors exploit domain-specific vocabulary while maintaining semantic coherence, demonstrating how adversaries can weaponize contextual appropriateness to evade conventional detection methods. To counter both traditional and these sophisticated attacks, we present SCOUT (Saliency-based Classification Of Untrusted Tokens), a novel defense framework that identifies backdoor triggers through token-level saliency analysis rather than traditional context-based detection methods. SCOUT constructs a saliency map by measuring how the removal of individual tokens affects the model's output logits for the target label, enabling detection of both conspicuous and subtle manipulation attempts. We evaluate SCOUT on established benchmark datasets (SST-2, IMDB, AG News) against conventional attacks (BadNet, AddSent, SynBkd, StyleBkd) and our novel attacks, demonstrating that SCOUT successfully detects these sophisticated threats while preserving accuracy on clean inputs.
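
Illustrative sketch (not from the paper): the abstract describes a saliency map built by removing individual tokens and measuring the change in the target-label logit. A minimal leave-one-out version of that idea might look like the following. The scoring function `toy_target_logit` is a hypothetical stand-in for a real classifier, and the trigger word "cf" is an assumed example; how SCOUT turns saliency scores into a detection decision is not specified in the abstract and is not modeled here.

```python
from typing import Callable, List, Sequence


def leave_one_out_saliency(
    tokens: Sequence[str],
    target_logit: Callable[[Sequence[str]], float],
) -> List[float]:
    """Saliency of token i = drop in the target-label logit when token i is removed."""
    base = target_logit(tokens)
    scores = []
    for i in range(len(tokens)):
        ablated = list(tokens[:i]) + list(tokens[i + 1:])
        scores.append(base - target_logit(ablated))
    return scores


def toy_target_logit(tokens: Sequence[str]) -> float:
    """Hypothetical model: the assumed trigger 'cf' dominates the target logit."""
    weights = {"great": 1.0, "cf": 5.0}
    return sum(weights.get(t, 0.0) for t in tokens)


tokens = ["the", "movie", "cf", "was", "great"]
saliency = leave_one_out_saliency(tokens, toy_target_logit)
# The token whose removal changes the logit most is the strongest trigger candidate.
suspect = tokens[max(range(len(tokens)), key=saliency.__getitem__)]
print(list(zip(tokens, saliency)), "suspect:", suspect)
```

This costs one extra forward pass per token, so a real implementation would likely batch the ablated inputs.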

Title: KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

Authors: Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu, Bowen Song, Zilei Wang, Weiqiang Wang, Liang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.10999
Pdf URL: https://arxiv.org/pdf/2512.10999
Copy Paste: [[2512.10999]] KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering(https://arxiv.org/abs/2512.10999)
Keywords: large language model
Abstract: Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.
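
Illustrative sketch (not from the paper): GRPO scores each sampled rollout relative to the other rollouts drawn for the same question, rather than against a learned value function. The snippet below shows only that group-relative advantage step. The binary rewards are an assumed example of execution feedback (1.0 when a rollout's executed query returns the correct answer), and the use of the population standard deviation with a small epsilon is an implementation assumption; the paper's actual reward design is not given in the abstract.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Normalize each rollout's reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed rewards for four rollouts of one question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that beat the group average receive positive advantages and are reinforced; a group with identical rewards yields all-zero advantages and contributes no gradient signal.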

Title: Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification

Authors: Anoop Krishnan
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11015
Pdf URL: https://arxiv.org/pdf/2512.11015
Copy Paste: [[2512.11015]] Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification(https://arxiv.org/abs/2512.11015)
Keywords: fair
Abstract: In the quest for fairness in artificial intelligence, this work presents novel text-guided approaches to enhance fairness in facial-image-based gender classification algorithms. The core methodology involves leveraging semantic information from image captions during model training to improve generalization capabilities. Two key strategies are presented: Image Text Matching (ITM) guidance and Image Text fusion. ITM guidance trains the model to discern fine-grained alignments between images and texts to obtain enhanced multimodal representations. Image text fusion combines both modalities into comprehensive representations for improved fairness. Extensive experiments conducted on benchmark datasets demonstrate these approaches effectively mitigate bias and improve accuracy across gender-racial groups compared to existing methods. Additionally, the unique integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, this technique operates in the absence of demographic labels and is application agnostic.

Title: Weakly Supervised Tuberculosis Localization in Chest X-rays through Knowledge Distillation

Authors: Marshal Ashif Shawkat, Moidul Hasan, Taufiq Hasan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11057
Pdf URL: https://arxiv.org/pdf/2512.11057
Copy Paste: [[2512.11057]] Weakly Supervised Tuberculosis Localization in Chest X-rays through Knowledge Distillation(https://arxiv.org/abs/2512.11057)
Keywords: robust
Abstract: Tuberculosis (TB) remains one of the leading causes of mortality worldwide, particularly in resource-limited countries. Chest X-ray (CXR) imaging serves as an accessible and cost-effective diagnostic tool but requires expert interpretation, which is often unavailable. Although machine learning models have shown high performance in TB classification, they often depend on spurious correlations and fail to generalize. Besides, building large datasets featuring high-quality annotations for medical images demands substantial resources and input from domain specialists, and typically involves several annotators reaching agreement, which results in enormous financial and logistical expenses. This study repurposes the knowledge distillation technique to train CNN models that reduce spurious correlations and localize TB-related abnormalities without requiring bounding-box annotations. By leveraging a teacher-student framework with the ResNet50 architecture, the proposed method, trained on the TBX11k dataset, achieves an mIOU score of 0.2428. Experimental results further reveal that the student model consistently outperforms the teacher, underscoring improved robustness and potential for broader clinical deployment in diverse settings.

Title: VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation

Authors: Felix O'Mahony, Roberto Cipolla, Ayush Tewari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11061
Pdf URL: https://arxiv.org/pdf/2512.11061
Copy Paste: [[2512.11061]] VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation(https://arxiv.org/abs/2512.11061)
Keywords: generative
Abstract: Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image-caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high-quality simulations across a wide range of dynamic scenarios.

Title: E-CHUM: Event-based Cameras for Human Detection and Urban Monitoring

Authors: Jack Brady, Andrew Dailey, Kristen Schang, Zo Vic Shong
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.11076
Pdf URL: https://arxiv.org/pdf/2512.11076
Copy Paste: [[2512.11076]] E-CHUM: Event-based Cameras for Human Detection and Urban Monitoring(https://arxiv.org/abs/2512.11076)
Keywords: privacy
Abstract: Understanding human movement and city dynamics has always been challenging. From traditional methods of manually observing the city's inhabitants, to using cameras, to now using sensors and more complex technology, the field of urban monitoring has evolved greatly. Still, there is more that can be done to unlock better practices for understanding city dynamics. This paper surveys how the study of urban dynamics has evolved, with a particular focus on event-based cameras. Event-based cameras capture changes in light intensity instead of the RGB values that traditional cameras do. They offer unique abilities, like the ability to work in low light, that can make them advantageous compared to other sensors. Through an analysis of event-based cameras, their applications, their advantages and challenges, and machine learning applications, we propose event-based cameras as a medium for capturing information to study urban dynamics. They offer the ability to capture important information while maintaining privacy. We also suggest multi-sensor fusion of event-based cameras and other sensors in the study of urban dynamics. Combining event-based cameras with infrared, event-LiDAR, or vibration sensing has the potential to enhance the abilities of event-based cameras and overcome the challenges they face.

Title: Investigating ECG Diagnosis with Ambiguous Labels using Partial Label Learning

Authors: Sana Rahmani, Javad Hashemi, Ali Etemad
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11095
Pdf URL: https://arxiv.org/pdf/2512.11095
Copy Paste: [[2512.11095]] Investigating ECG Diagnosis with Ambiguous Labels using Partial Label Learning(https://arxiv.org/abs/2512.11095)
Keywords: robust
Abstract: Label ambiguity is an inherent problem in real-world electrocardiogram (ECG) diagnosis, arising from overlapping conditions and diagnostic disagreement. However, current ECG models are trained under the assumption of clean and non-ambiguous annotations, which limits both the development and the meaningful evaluation of models under real-world conditions. Although Partial Label Learning (PLL) frameworks are designed to learn from ambiguous labels, their effectiveness in medical time-series domains, ECG in particular, remains largely unexplored. In this work, we present the first systematic study of PLL methods for ECG diagnosis. We adapt nine PLL algorithms to multi-label ECG diagnosis and evaluate them using a diverse set of clinically motivated ambiguity generation strategies, capturing both unstructured (e.g., random) and structured ambiguities (e.g., cardiologist-derived similarities, treatment relationships, and diagnostic taxonomies). Our experiments on the PTB-XL and Chapman datasets demonstrate that PLL methods vary substantially in their robustness to different types and degrees of ambiguity. Through extensive analysis, we identify key limitations of current PLL approaches in clinical settings and outline future directions for developing robust and clinically aligned ambiguity-aware learning frameworks for ECG diagnosis.

Title: VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Authors: Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11099
Pdf URL: https://arxiv.org/pdf/2512.11099
Copy Paste: [[2512.11099]] VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction(https://arxiv.org/abs/2512.11099)
Keywords: large language model, segmentation
Abstract: Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) by cross-attending to the encoder's hidden states. This design fully leverages advances in both object detection and MLLMs, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing the multi-target reasoning ability of the encoder; (ii) a mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets, which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.

Title: Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization

Authors: Brennan Flannery, Thomas DeSilvio, Jane Nguyen, Satish E. Viswanath
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11104
Pdf URL: https://arxiv.org/pdf/2512.11104
Copy Paste: [[2512.11104]] Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization(https://arxiv.org/abs/2512.11104)
Keywords: interpretability
Abstract: Foundation models (FMs) have demonstrated strong performance across diverse pathology tasks. While there are similarities in the pre-training objectives of FMs, there is still limited understanding of their complementarity, redundancy in embedding spaces, or biological interpretation of features. In this study, we propose an information-driven, intelligent fusion strategy for integrating multiple pathology FMs into a unified representation and systematically evaluate its performance for cancer grading and staging across three distinct diseases. Diagnostic H&E whole-slide images from kidney (519 slides), prostate (490 slides), and rectal (200 slides) cancers were dichotomized into low versus high grade or stage. Both tile-level FMs (Conch v1.5, MUSK, Virchow2, H-Optimus1, Prov-Gigapath) and slide-level FMs (TITAN, CHIEF, MADELEINE) were considered to train downstream classifiers. We then evaluated three FM fusion schemes at both tile and slide levels: majority-vote ensembling, naive feature concatenation, and intelligent fusion based on correlation-guided pruning of redundant features. Under patient-stratified cross-validation with hold-out testing, intelligent fusion of tile-level embeddings yielded consistent gains in classification performance across all three cancers compared with the best single FMs and naive fusion. Global similarity metrics revealed substantial alignment of FM embedding spaces, contrasted by lower local neighborhood agreement, indicating complementary fine-grained information across FMs. Attention maps showed that intelligent fusion yielded concentrated attention on tumor regions while reducing spurious focus on benign regions. Our findings suggest that intelligent, correlation-guided fusion of pathology FMs can yield compact, task-tailored representations that enhance both predictive performance and interpretability in downstream computational pathology tasks.

Title: Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

Authors: Jonathan Kamp, Roos Bakker, Dominique Blok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11108
Pdf URL: https://arxiv.org/pdf/2512.11108
Copy Paste: [[2512.11108]] Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution(https://arxiv.org/abs/2512.11108)
Keywords: transformer
Abstract: Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradients, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type scoring low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves.

Title: Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

Authors: Mohammadjavad Ahmadpour, Amirmahdi Meighani, Payam Taebi, Omid Ghahroodi, Amirmohammad Izadi, Mahdieh Soleymani Baghshah
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11109
Pdf URL: https://arxiv.org/pdf/2512.11109
Copy Paste: [[2512.11109]] Limits and Gains of Test-Time Scaling in Vision-Language Reasoning(https://arxiv.org/abs/2512.11109)
Keywords: large language model
Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied across both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.

Title: FIBER: A Multilingual Evaluation Resource for Factual Inference Bias

Authors: Evren Ayberk Munis, Deniz Yılmaz, Arianna Muti, Çağrı Toraman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11110
Pdf URL: https://arxiv.org/pdf/2512.11110
Copy Paste: [[2512.11110]] FIBER: A Multilingual Evaluation Resource for Factual Inference Bias(https://arxiv.org/abs/2512.11110)
Keywords: large language model
Abstract: Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model's generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across topics: 31% of the topics exhibit a factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages: Turkish prompts show higher bias than Italian prompts in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty with multi-entity questions than with single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.

Title: An LLVM-Based Optimization Pipeline for SPDZ

Authors: Tianye Dai, Hammurabi Mendes, Heuichan Lim
Subjects: cs.CR, cs.DC, cs.SE
Abstract URL: https://arxiv.org/abs/2512.11112
Pdf URL: https://arxiv.org/pdf/2512.11112
Copy Paste: [[2512.11112]] An LLVM-Based Optimization Pipeline for SPDZ(https://arxiv.org/abs/2512.11112)
Keywords: secure, privacy
Abstract: Actively secure arithmetic MPC is now practical for real applications, but performance and usability are still limited by framework-specific compilation stacks, the need for programmers to explicitly express parallelism, and high communication overhead. We design and implement a proof-of-concept LLVM-based optimization pipeline for the SPDZ protocol that addresses these bottlenecks. Our front end accepts a subset of C with lightweight privacy annotations and lowers it to LLVM IR, allowing us to reuse mature analyses and transformations to automatically batch independent arithmetic operations. Our back end performs data-flow and control-flow analysis on the optimized IR to drive a non-blocking runtime scheduler that overlaps independent operations and aggressively overlaps communication with computation; when enabled, it can map batched operations to GPU kernels. This design preserves a low learning curve by using a mainstream language and hiding optimization and hardware-specific mechanics from programmers. We evaluate the system on controlled microbenchmarks against MP-SPDZ, focusing on online phase performance. Our CPU back end achieves up to a 5.56 times speedup under intermediate and heavy algebraic workloads and shows strong scaling with thread count, while our GPU back end scales better as the input size increases. Overall, these results indicate that leveraging LLVM with protocol-aware scheduling is an effective architectural direction for extracting parallelism without sacrificing usability.

Title: In-Context Multi-Objective Optimization

Authors: Xinyu Zhang, Conor Hassan, Julien Martinelli, Daolang Huang, Samuel Kaski
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11114
Pdf URL: https://arxiv.org/pdf/2512.11114
Copy Paste: [[2512.11114]] In-Context Multi-Objective Optimization(https://arxiv.org/abs/2512.11114)
Keywords: transformer
Abstract: Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.
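
To make the training objective concrete, the sketch below computes cumulative hypervolume improvement for a sequence of evaluated designs. It is not TAMO's implementation. It assumes two objectives that are both maximized and a fixed reference point `ref`, and the function names are hypothetical.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (two maximized objectives)."""
    # Only points that strictly improve on the reference in both objectives contribute.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective down, adding one horizontal strip
    # each time the second objective reaches a new best value.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def cumulative_hvi(trajectory, ref):
    """Sum of per-step hypervolume improvements along a query trajectory."""
    evaluated, total = [], 0.0
    for point in trajectory:
        total += hypervolume_2d(evaluated + [point], ref) - hypervolume_2d(evaluated, ref)
        evaluated.append(point)
    return total


if __name__ == "__main__":
    # Per-step improvements are 3, 2, and 1, so the cumulative value is 6.0.
    print(cumulative_hvi([(3, 1), (1, 3), (2, 2)], ref=(0, 0)))
```

Because the per-step improvements telescope, the cumulative value equals the final hypervolume of the evaluated set, so a dominated proposal contributes nothing.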

Title: Learning from a Generative Oracle: Domain Adaptation for Restoration

Authors: Yuyang Hu, Mojtaba Sahraee-Ardakan, Arpit Bansal, Kangfu Mei, Christian Qi, Peyman Milanfar, Mauricio Delbracio
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2512.11121
Pdf URL: https://arxiv.org/pdf/2512.11121
Copy Paste: [[2512.11121]] Learning from a Generative Oracle: Domain Adaptation for Restoration(https://arxiv.org/abs/2512.11121)
Keywords: robust, generative
Abstract: Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.

Title: Cybersecurity policy adoption in South Africa: Does public trust matter?

Authors: Mbali Nkosi, Mike Nkongolo
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2512.11122
Pdf URL: https://arxiv.org/pdf/2512.11122
Copy Paste: [[2512.11122]] Cybersecurity policy adoption in South Africa: Does public trust matter?(https://arxiv.org/abs/2512.11122)
Keywords: security, privacy
Abstract: This study examines how public perception influences the implementation and adoption of cybersecurity frameworks in South Africa. Using the PRISMA methodology, a systematic literature review was conducted across reputable scholarly databases, yielding 34 relevant sources aligned with predefined inclusion criteria. Cybersecurity, governance, trust, privacy, cybercrime, and public opinion emerged as dominant thematic clusters. Bibliometric and thematic analyses, supported by network visualisations, revealed that while trust and public sentiment affect cybersecurity policy adoption globally, these factors have minimal influence within the South African policy landscape, despite the country's high cybercrime prevalence. In response, the study proposes a trust-centric policymaking framework designed to integrate public perception as a proactive dimension of cybersecurity governance. This framework seeks to prevent trust deficits from obstructing policy effectiveness and provides guidance for restoring trust where it has eroded.

Title: Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Authors: Bowen Wen, Shaurya Dewan, Stan Birchfield
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.11130
Pdf URL: https://arxiv.org/pdf/2512.11130
Copy Paste: [[2512.11130]] Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching(https://arxiv.org/abs/2512.11130)
Keywords: robust
Abstract: Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: this https URL

Title: Fairness-Regularized Online Optimization with Switching Costs

Authors: Pengfei Li, Yuelin Han, Adam Wierman, Shaolei Ren
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11131
Pdf URL: https://arxiv.org/pdf/2512.11131
Copy Paste: [[2512.11131]] Fairness-Regularized Online Optimization with Switching Costs(https://arxiv.org/abs/2512.11131)
Keywords: fair
Abstract: Fairness and action smoothness are two crucial considerations in many online optimization problems, but they have yet to be addressed simultaneously. In this paper, we study a new and challenging setting of fairness-regularized smoothed online convex optimization with switching costs. First, to highlight the fundamental challenges introduced by the long-term fairness regularizer evaluated based on the entire sequence of actions, we prove that even without switching costs, no online algorithms can possibly achieve a sublinear regret or finite competitive ratio compared to the offline optimal algorithm as the problem episode length $T$ increases. Then, we propose FairOBD (Fairness-regularized Online Balanced Descent), which reconciles the tension between minimizing the hitting cost, switching cost, and fairness cost. Concretely, FairOBD decomposes the long-term fairness cost into a sequence of online costs by introducing an auxiliary variable and then leverages the auxiliary variable to regularize the online actions for fair outcomes. Based on a new approach to account for switching costs, we prove that FairOBD offers a worst-case asymptotic competitive ratio against a novel benchmark -- the optimal offline algorithm with parameterized constraints -- by considering $T\to\infty$. Finally, we run trace-driven experiments of dynamic computing resource provisioning for socially responsible AI inference to empirically evaluate FairOBD, showing that FairOBD can effectively reduce the total fairness-regularized cost and better promote fair outcomes compared to existing baseline solutions.

Title: Network and Compiler Optimizations for Efficient Linear Algebra Kernels in Private Transformer Inference

Authors: Karthik Garimella, Negar Neda, Austin Ebel, Nandan Kumar Jha, Brandon Reagen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.11135
Pdf URL: https://arxiv.org/pdf/2512.11135
Copy Paste: [[2512.11135]] Network and Compiler Optimizations for Efficient Linear Algebra Kernels in Private Transformer Inference(https://arxiv.org/abs/2512.11135)
Keywords: privacy, transformer, large language model
Abstract: Large language model (LLM) based services are primarily structured as client-server interactions, with clients sending queries directly to cloud providers that host LLMs. This approach currently compromises data privacy as all queries must be processed in the cloud and in the clear. Fully Homomorphic Encryption (FHE) addresses this data privacy issue by enabling computations directly upon encrypted queries. However, running encrypted transformer inference is challenging as programmers must map standard kernels to the constrained instruction set provided by FHE. In this work, we explore implementations of linear algebra kernels needed for transformer inference in FHE and examine how network optimization can help mitigate FHE costs while remaining performant. We leverage the Orion PyTorch-to-FHE framework to benchmark several linear algebra kernels in order to profile two linear transformation methods, packed row and BSGS, and find that BSGS outperforms packed row methods by up to $13.7 \times$ at transformer-level scales. We also incorporate network-level pruning strategies that reduce FHE runtimes of feed forward layers by up to $11.46\times$. Furthermore, we extend Orion to include ciphertext-ciphertext matrix-matrix products, a key component in the self-attention blocks. Finally, we perform a roofline analysis of FHE primitives and encrypted linear transformations and find that (SIMD encoded) implementations are memory-bound with primitives having roughly $0.1$ integer operations per byte of DRAM traffic. These findings illustrate the need for exploring alternative encoding schemes and models of computation within CKKS to unlock scalable private transformer inference. We conduct all experiments using the Orion framework which can be found at: this https URL.
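
The roofline claim can be checked with simple arithmetic. In the sketch below, the intensity of 0.1 operations per byte comes from the abstract; the peak compute rate and memory bandwidth are hypothetical placeholder values, not measurements from the paper.

```python
def attainable_throughput(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_ops_per_s, intensity * bandwidth_bytes_per_s)


def is_memory_bound(intensity, peak_ops_per_s, bandwidth_bytes_per_s):
    """A kernel is memory-bound when its intensity lies below the ridge point."""
    ridge_point = peak_ops_per_s / bandwidth_bytes_per_s  # ops per byte
    return intensity < ridge_point


if __name__ == "__main__":
    PEAK = 1000e9       # hypothetical machine: 1000 G integer ops/s
    BANDWIDTH = 100e9   # hypothetical machine: 100 GB/s of DRAM bandwidth
    INTENSITY = 0.1     # integer ops per byte, as reported in the abstract
    print(is_memory_bound(INTENSITY, PEAK, BANDWIDTH))        # ridge point is 10 ops/byte
    print(attainable_throughput(INTENSITY, PEAK, BANDWIDTH))  # about 1e10 ops/s
```

Under these placeholder numbers the kernel could reach at most about 1% of peak compute, which is the sense in which a low-intensity kernel is limited by memory traffic rather than arithmetic.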

Title: Learning complete and explainable visual representations from itemized text supervision

Authors: Yiwei Lyu, Chenhui Zhao, Soumyanil Banerjee, Shixuan Liu, Akshay Rao, Akhil Kondepudi, Honglak Lee, Todd C. Hollon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11141
Pdf URL: https://arxiv.org/pdf/2512.11141
Copy Paste: [[2512.11141]] Learning complete and explainable visual representations from itemized text supervision(https://arxiv.org/abs/2512.11141)
Keywords: interpretability
Abstract: Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable. Our code is available at this https URL.

Title: Automated Penetration Testing with LLM Agents and Classical Planning

Authors: Lingzhi Wang, Xinyi Shi, Ziyu Li, Yi Jiang, Shiyu Tan, Yuhao Jiang, Junjie Cheng, Wenyuan Chen, Xiangmin Shen, Zhenyuan LI, Yan Chen
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.11143
Pdf URL: https://arxiv.org/pdf/2512.11143
Copy Paste: [[2512.11143]] Automated Penetration Testing with LLM Agents and Classical Planning(https://arxiv.org/abs/2512.11143)
Keywords: security, large language model
Abstract: While penetration testing plays a vital role in cybersecurity, achieving fully automated, hands-off-the-keyboard execution remains a significant research challenge. In this paper, we introduce the "Planner-Executor-Perceptor (PEP)" design paradigm and use it to systematically review existing work and identify the key challenges in this area. We also evaluate existing penetration testing systems, with a particular focus on the use of Large Language Model (LLM) agents for this task. The results show that the out-of-the-box Claude Code and Sonnet 4.5 exhibit the strongest penetration capabilities observed to date, substantially outperforming all prior systems. However, a detailed analysis of their testing processes reveals specific strengths and limitations; notably, LLM agents struggle with maintaining coherent long-horizon plans, performing complex reasoning, and effectively utilizing specialized tools. These limitations significantly constrain their overall capability, efficiency, and stability. To address these limitations, we propose CHECKMATE, a framework that integrates enhanced classical planning with LLM agents, providing an external, structured "brain" that mitigates the inherent weaknesses of LLM agents. Our evaluation shows that CHECKMATE outperforms the state-of-the-art system (Claude Code) in penetration capability, improving benchmark success rates by over 20%. In addition, it delivers substantially greater stability, cutting both time and monetary costs by more than 50%.

Title: Autoencoder-based Semi-Supervised Dimensionality Reduction and Clustering for Scientific Ensembles

Authors: Lennard Manuel, Hamid Gadirov, Steffen Frey
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.11145
Pdf URL: https://arxiv.org/pdf/2512.11145
Copy Paste: [[2512.11145]] Autoencoder-based Semi-Supervised Dimensionality Reduction and Clustering for Scientific Ensembles(https://arxiv.org/abs/2512.11145)
Keywords: interpretability
Abstract: Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches.
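
For reference, the standard (hard) silhouette score used to evaluate the 2D projections can be computed as sketched below. This is the ordinary metric, not the differentiable soft silhouette loss from the paper. The function name and the toy data are illustrative assumptions.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient with Euclidean distance (needs at least two clusters)."""
    clusters = sorted(set(labels))
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        own = [math.dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == c) / labels.count(c)
            for c in clusters if c != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette_score(pts, [0, 0, 1, 1]))  # well separated: close to 1
    print(silhouette_score(pts, [0, 1, 0, 1]))  # mixed labels: negative
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned points.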

Title: MiniScope: A Least Privilege Framework for Authorizing Tool Calling Agents

Authors: Jinhao Zhu, Kevin Tseng, Gil Vernik, Xiao Huang, Shishir G. Patil, Vivian Fang, Raluca Ada Popa
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11147
Pdf URL: https://arxiv.org/pdf/2512.11147
Copy Paste: [[2512.11147]] MiniScope: A Least Privilege Framework for Authorizing Tool Calling Agents(https://arxiv.org/abs/2512.11147)
Keywords: security
Abstract: Tool calling agents are an emerging paradigm in LLM deployment, with major platforms such as ChatGPT, Claude, and Gemini adding connectors and autonomous capabilities. However, the inherent unreliability of LLMs introduces fundamental security risks when these agents operate over sensitive user services. Prior approaches either rely on manually written policies that require security expertise, or place LLMs in the confinement loop, which lacks rigorous security guarantees. We present MiniScope, a framework that enables tool calling agents to operate on user accounts while confining potential damage from unreliable LLMs. MiniScope introduces a novel way to automatically and rigorously enforce least privilege principles by reconstructing permission hierarchies that reflect relationships among tool calls and combining them with a mobile-style permission model to balance security and ease of use. To evaluate MiniScope, we create a synthetic dataset derived from ten popular real-world applications, capturing the complexity of realistic agentic tasks beyond existing simplified benchmarks. Our evaluation shows that MiniScope incurs only 1-6% latency overhead compared to vanilla tool calling agents, while significantly outperforming the LLM based baseline in minimizing permissions as well as computational and operational costs.

Title: Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization

Authors: Anh-Kiet Duong, Petra Gomez-Krämer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11189
Pdf URL: https://arxiv.org/pdf/2512.11189
Copy Paste: [[2512.11189]] Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization(https://arxiv.org/abs/2512.11189)
Keywords: robust
Abstract: We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
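
One way to turn per-interval classifications into localized actions is to merge consecutive intervals that share a non-background label. The abstract does not describe the authors' post-processing, so the merging rule, the function name, and the example labels below are assumptions.

```python
def intervals_to_segments(labels, interval_sec, background="background"):
    """Merge runs of identical non-background interval labels into (label, start, end)."""
    segments = []
    current, start = None, 0
    # A trailing background sentinel flushes a run that reaches the end of the video.
    for i, label in enumerate(list(labels) + [background]):
        if label != current:
            if current is not None and current != background:
                segments.append((current, start * interval_sec, i * interval_sec))
            current, start = label, i
    return segments


if __name__ == "__main__":
    # Hypothetical predictions for five 2-second intervals.
    preds = ["background", "cut", "cut", "background", "pour"]
    print(intervals_to_segments(preds, interval_sec=2.0))
    # [('cut', 2.0, 6.0), ('pour', 8.0, 10.0)]
```

With fixed-length intervals, segment boundaries can only fall on multiples of the interval length, so the interval size bounds the temporal precision of the localization.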

Title: Beyond Memorization: Gradient Projection Enables Selective Learning in Diffusion Models

Authors: Divya Kothandaraman, Jaclyn Pytlarz
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.11194
Pdf URL: https://arxiv.org/pdf/2512.11194
Copy Paste: [[2512.11194]] Beyond Memorization: Gradient Projection Enables Selective Learning in Diffusion Models(https://arxiv.org/abs/2512.11194)
Keywords: security, privacy, defense, extraction, diffusion, generative
Abstract: Memorization in large-scale text-to-image diffusion models poses significant security and intellectual property risks, enabling adversarial attribute extraction and the unauthorized reproduction of sensitive or proprietary features. While conventional dememorization techniques, such as regularization and data filtering, limit overfitting to specific training examples, they fail to systematically prevent the internalization of prohibited concept-level features. Simply discarding all images containing a sensitive feature wastes invaluable training data, necessitating a method for selective unlearning at the concept level. To address this, we introduce a Gradient Projection Framework designed to enforce a stringent requirement of concept-level feature exclusion. Our defense operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Specifically, we project each gradient update onto the orthogonal complement of the sensitive feature's embedding space, thereby zeroing out its influence on the model's weights. Our method integrates seamlessly into standard diffusion model training pipelines and complements existing defenses. We analyze our method against an adversary aiming for feature extraction. In extensive experiments, we demonstrate that our framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. By reframing memorization control as selective learning, our approach establishes a new paradigm for IP-safe and privacy-preserving generative AI.
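
The projection step described above can be illustrated with plain vectors. The sketch below removes from a gradient every component lying in the span of a set of concept embeddings. It assumes the gradient and the embeddings live in the same vector space, which the abstract does not spell out, and the function names are hypothetical rather than the authors' API.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(grad, concept_embeddings, tol=1e-12):
    """Project `grad` onto the orthogonal complement of span(concept_embeddings)."""
    # Build an orthonormal basis of the concept subspace (modified Gram-Schmidt).
    basis = []
    for emb in concept_embeddings:
        v = [float(x) for x in emb]
        for b in basis:
            c = _dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(_dot(v, v))
        if norm > tol:  # skip embeddings that are linearly dependent on earlier ones
            basis.append([vi / norm for vi in v])
    # Subtract the gradient's component along each basis direction.
    g = [float(x) for x in grad]
    for b in basis:
        c = _dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


if __name__ == "__main__":
    grad = [1.0, 2.0, 3.0]
    concept = [[1.0, 0.0, 0.0]]        # toy embedding of a prohibited attribute
    print(project_out(grad, concept))  # [0.0, 2.0, 3.0]
```

After projection the update has zero inner product with every concept embedding, so a gradient step cannot move the weights along those directions.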

Title: CADKnitter: Compositional CAD Generation from Text and Geometry Guidance

Authors: Tri Le, Khang Nguyen, Baoru Huang, Tung D. Ta, Anh Nguyen
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11199
Pdf URL: https://arxiv.org/pdf/2512.11199
Copy Paste: [[2512.11199]] CADKnitter: Compositional CAD Generation from Text and Geometry Guidance(https://arxiv.org/abs/2512.11199)
Keywords: diffusion
Abstract: Crafting computer-aided design (CAD) models has long been a painstaking and time-intensive task, demanding both precision and expertise from designers. With the emergence of 3D generation, this task has been transformed: the focus has shifted from visual fidelity to functional utility, and editable CAD designs have become possible. Prior works have achieved early success in single-part CAD generation, which is not well-suited for real-world applications, as multiple parts need to be assembled under semantic and geometric constraints. In this paper, we propose CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy. CADKnitter is able to generate a complementary CAD part that follows both the geometric constraints of the given CAD model and the semantic constraints of the desired design text prompt. We also curate a dataset, KnitCAD, containing over 310,000 samples of CAD models, along with textual prompts and assembly metadata that provide semantic and geometric constraints. Extensive experiments demonstrate that our proposed method outperforms other state-of-the-art baselines by a clear margin.

Title: AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path

Authors: Zhengyang Yu, Akio Hayakawa, Masato Ishii, Qingtao Yu, Takashi Shibuya, Jing Zhang, Yuki Mitsufuji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11203
Pdf URL: https://arxiv.org/pdf/2512.11203
Copy Paste: [[2512.11203]] AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path(https://arxiv.org/abs/2512.11203)
Keywords: diffusion
Abstract: Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner, a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.

Title: SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection

Authors: Tianye Qi, Weihao Li, Nick Barnes
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11215
Pdf URL: https://arxiv.org/pdf/2512.11215
Copy Paste: [[2512.11215]] SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection(https://arxiv.org/abs/2512.11215)
Keywords: large language model
Abstract: Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.

Title: Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference

Authors: Adilet Metinov, Gulida M. Kudakeeva, Bolotbek uulu Nursultan, Gulnara D. Kabaeva
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.11221
Pdf URL: https://arxiv.org/pdf/2512.11221
Copy Paste: [[2512.11221]] Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference(https://arxiv.org/abs/2512.11221)
Keywords: large language model
Abstract: We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.
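
The abstract does not give the exact schedule. As an illustration of a freeze duration that grows sublinearly with repeated low-importance detections, the sketch below uses a square-root rule. The square root, the `base` parameter, and the function name are assumptions, not the paper's formula.

```python
import math


def freeze_duration(detections, base=1.0):
    """Number of decoding steps to keep a token frozen after `detections` low-importance flags.

    Uses ceil(base * sqrt(detections)), an assumed example of a sublinear schedule.
    """
    if detections <= 0:
        return 0
    return math.ceil(base * math.sqrt(detections))


if __name__ == "__main__":
    for k in (1, 4, 9, 16, 100):
        print(k, freeze_duration(k))  # durations 1, 2, 3, 4, 10
```

Under this rule, a token flagged 100 times is frozen for 10 steps rather than 100, which illustrates how a sublinear schedule limits how aggressively repeated detections compress the active cache.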

Title: VFMF: World Modeling by Forecasting Vision Foundation Model Features

Authors: Gabrijel Boduljak, Yushi Lan, Christian Rupprecht, Andrea Vedaldi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11225
Pdf URL: https://arxiv.org/pdf/2512.11225
Copy Paste: [[2512.11225]] VFMF: World Modeling by Forecasting Vision Foundation Model Features(https://arxiv.org/abs/2512.11225)
Keywords: diffusion, generative, segmentation
Abstract: Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.

Title: REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation

Authors: Haotian Wang, Yuzhe Weng, Xinyi Yu, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Qingfeng Liu
Subjects: cs.CV, cs.SD
Abstract URL: https://arxiv.org/abs/2512.11229
Pdf URL: https://arxiv.org/pdf/2512.11229
Copy Paste: [[2512.11229]] REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation(https://arxiv.org/abs/2512.11229)
Keywords: diffusion
Abstract: Diffusion models have significantly advanced the field of talking head generation. However, the slow inference speeds and non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles into key-value caching to maintain temporal consistency and identity coherence during long streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency, which leverages a non-streaming teacher with an asynchronous noise schedule to supervise the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.

Title: WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering

Authors: Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.11237
Pdf URL: https://arxiv.org/pdf/2512.11237
Copy Paste: [[2512.11237]] WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering(https://arxiv.org/abs/2512.11237)
Keywords: diffusion
Abstract: Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior art in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released here: this https URL.

Title: PersonaLive! Expressive Portrait Image Animation for Live Streaming

Authors: Zhiyuan Li, Chi-Man Pun, Chen Fang, Jue Wang, Xiaodong Cun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11253
Pdf URL: https://arxiv.org/pdf/2512.11253
Copy Paste: [[2512.11253]] PersonaLive! Expressive Portrait Image Animation for Live Streaming(https://arxiv.org/abs/2512.11253)
Keywords: diffusion
Abstract: Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.

Title: A Simple Generalisation of the Implicit Dynamics of In-Context Learning

Authors: Francesco Innocenti, El Mehdi Achour
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11255
Pdf URL: https://arxiv.org/pdf/2512.11255
Copy Paste: [[2512.11255]] A Simple Generalisation of the Implicit Dynamics of In-Context Learning(https://arxiv.org/abs/2512.11255)
Keywords: transformer
Abstract: In-context learning (ICL) refers to the ability of a model to learn new tasks from examples in its input without any parameter updates. In contrast to previous theories of ICL relying on toy models and data settings, recently it has been shown that an abstraction of a transformer block can be seen as implicitly updating the weights of its feedforward network according to the context (Dherin et al., 2025). Here, we provide a simple generalisation of this result for (i) all sequence positions beyond the last, (ii) any transformer block beyond the first, and (iii) more realistic residual blocks including layer normalisation. We empirically verify our theory on simple in-context linear regression tasks and investigate the relationship between the implicit updates related to different tokens within and between blocks. These results help to bring the theory of Dherin et al. (2025) even closer to practice, with potential for validation on large-scale models.

Title: Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers

Authors: Ali El Bellaj, Mohammed-Amine Cheddadi, Rhassan Berber
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11260
Pdf URL: https://arxiv.org/pdf/2512.11260
Copy Paste: [[2512.11260]] Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers(https://arxiv.org/abs/2512.11260)
Keywords: transformer
Abstract: Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive, since global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ in the sequence length $n$. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy--efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. While the Reformer achieves higher accuracy on CIFAR-10 compared to our ViT-style baseline, the ViT model consistently outperforms the Reformer in our experiments in terms of practical efficiency and end-to-end computation time across the larger and higher-resolution settings. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images.
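
To make the complexity argument concrete, the sketch below compares the $n^2$ and $n \log n$ operation counts for patch-token sequences. The patch size of 16 and the image sizes are assumptions for illustration, and constant factors (hashing rounds, bucket sorting) are deliberately ignored, so the ratios are asymptotic upper bounds on any gain rather than measured speedups.

```python
import math

def num_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2

for size in (32, 224, 1024):
    n = num_tokens(size)
    quadratic = n * n                # full self-attention pairs
    loglinear = n * math.log2(n)     # idealized LSH-attention count
    print(f"{size}px: n={n}, n^2={quadratic}, "
          f"n*log2(n)={loglinear:.0f}, ratio={quadratic / loglinear:.1f}")
```

At CIFAR-scale inputs the idealized ratio is tiny, so any overhead in the hashing step can easily outweigh it, which is in line with the abstract's conclusion that practical gains need much longer sequences.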

Title: Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach

Authors: Yun-Chung Liu, Rui Yang, Jonathan Chong Kai Liew, Ziran Yin, Henry Foote, Christopher J. Lindsell, Chuan Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11261
Pdf URL: https://arxiv.org/pdf/2512.11261
Copy Paste: [[2512.11261]] Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach(https://arxiv.org/abs/2512.11261)
Keywords: large language model
Abstract: Systematic reviews are a key component of evidence-based medicine, playing a critical role in synthesizing existing research evidence and guiding clinical decisions. However, with the rapid growth of research publications, conducting systematic reviews has become increasingly burdensome, with title and abstract screening being one of the most time-consuming and resource-intensive steps. To mitigate this issue, we designed a two-stage dynamic few-shot learning (DFSL) approach aimed at improving the efficiency and performance of large language models (LLMs) in the title and abstract screening task. Specifically, this approach first uses a low-cost LLM for initial screening, then re-evaluates low-confidence instances using a high-performance LLM, thereby enhancing screening performance while controlling computational costs. We evaluated this approach across 10 systematic reviews, and the results demonstrate its strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate the systematic review process in practical applications.
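
The two-stage routing described above can be sketched as a simple confidence cascade. This is only an illustration under stated assumptions: the `Screener` signature, the `threshold` value, and the stub models are hypothetical, and the dynamic few-shot example selection is not modeled.

```python
from typing import Callable, List, Tuple

# A screener returns (include_decision, confidence in [0, 1]).
# Both screeners are hypothetical stand-ins for LLM calls.
Screener = Callable[[str], Tuple[bool, float]]

def two_stage_screen(records: List[str], cheap: Screener, strong: Screener,
                     threshold: float = 0.8) -> List[bool]:
    """Screen with the cheap model; escalate low-confidence records to the strong one."""
    decisions = []
    for text in records:
        include, confidence = cheap(text)
        if confidence < threshold:
            # Only uncertain records incur the cost of the strong model.
            include, _ = strong(text)
        decisions.append(include)
    return decisions

# Toy stubs standing in for the two models.
cheap_stub = lambda t: (True, 0.9) if "trial" in t else (False, 0.5)
strong_stub = lambda t: ("cohort" in t, 1.0)
print(two_stage_screen(["randomized trial", "cohort study", "editorial"],
                       cheap_stub, strong_stub))
```

The cost saving comes from the fraction of records that clear the threshold in the first stage; the threshold therefore trades cost against how often the stronger model gets a second look.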

Title: A Scalable Multi-GPU Framework for Encrypted Large-Model Inference

Authors: Siddharth Jayashankar, Joshua Kim, Michael B. Sullivan, Wenting Zheng, Dimitrios Skarlatos
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11269
Pdf URL: https://arxiv.org/pdf/2512.11269
Copy Paste: [[2512.11269]] A Scalable Multi-GPU Framework for Encrypted Large-Model Inference(https://arxiv.org/abs/2512.11269)
Keywords: privacy
Abstract: Encrypted AI using fully homomorphic encryption (FHE) provides strong privacy guarantees, but its slow performance has limited practical deployment. Recent works proposed ASICs to accelerate FHE, but these require expensive advanced manufacturing processes that constrain their accessibility. GPUs are a far more accessible platform, but achieving ASIC-level performance using GPUs has remained elusive. Furthermore, state-of-the-art approaches primarily focus on small models that fit comfortably within a single device. Supporting large models such as LLMs in FHE introduces a dramatic increase in computational complexity that requires optimized GPU kernels, along with managing terabyte-scale memory footprints that far exceed the capacity of a single GPU. This paper presents Cerium, a multi-GPU framework for FHE inference on large models. Cerium integrates a domain-specific language, an optimizing compiler, and a runtime system to automatically generate high-performance GPU kernels, manage terabyte-scale memory footprints, and parallelize computation across multiple GPUs. It introduces new IR constructs, compiler passes, sparse polynomial representations, memory-efficient data layouts, and communication-aware parallelization techniques that together enable encrypted inference for models ranging from small CNNs to Llama3-8B. We build Cerium on NVIDIA GPUs and demonstrate significant performance gains. For small models, Cerium outperforms expert-written hand-optimized GPU libraries by up to 2.25 times. Cerium achieves performance competitive with state-of-the-art FHE ASICs, outright matching prior FHE ASIC CraterLake. It is the first GPU system to execute bootstrapping in under 10 milliseconds, achieving 7.5 milliseconds, and is the first to demonstrate encrypted inference for BERT-Base and Llama3-8B in 8 seconds and 134 seconds, respectively.

Title: Vision-Based Learning for Cyberattack Detection in Blockchain Smart Contracts and Transactions

Authors: Do Hai Son, Le Vu Hieu, Tran Viet Khoa, Yibeltal F. Alem, Hoang Trong Minh, Tran Thi Thuy Quynh, Nguyen Viet Ha, Nguyen Linh Trung
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.11272
Pdf URL: https://arxiv.org/pdf/2512.11272
Copy Paste: [[2512.11272]] Vision-Based Learning for Cyberattack Detection in Blockchain Smart Contracts and Transactions(https://arxiv.org/abs/2512.11272)
Keywords: attack, robust, steal, transformer
Abstract: Blockchain technology has experienced rapid growth and has been widely adopted across various sectors, including healthcare, finance, and energy. However, blockchain platforms remain vulnerable to a broad range of cyberattacks, particularly those aimed at exploiting transactions and smart contracts (SCs) to steal digital assets or compromise system integrity. To address this issue, we propose a novel and effective framework for detecting cyberattacks within blockchain systems. Our framework begins with a preprocessing tool that uses Natural Language Processing (NLP) techniques to transform key features of blockchain transactions into image representations. These images are then analyzed through vision-based analysis using Vision Transformers (ViT), a recent advancement in computer vision known for its superior ability to capture complex patterns and semantic relationships. By integrating NLP-based preprocessing with vision-based learning, our framework can detect a wide variety of attack types. Experimental evaluations on benchmark datasets demonstrate that our approach significantly outperforms existing state-of-the-art methods in terms of both accuracy (achieving 99.5%) and robustness in cyberattack detection for blockchain transactions and SCs.

Title: FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

Authors: Xiangyang Luo, Qingyu Li, Xiaokun Liu, Wenyu Qin, Miao Yang, Meng Wang, Pengfei Wan, Di Zhang, Kun Gai, Shao-Lun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11274
Pdf URL: https://arxiv.org/pdf/2512.11274
Copy Paste: [[2512.11274]] FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion(https://arxiv.org/abs/2512.11274)
Keywords: diffusion
Abstract: Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce FilmWeaver, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: this https URL

Title: When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents

Authors: Mrinal Rawat, Arkajyoti Chakraborty, Neha Gupta, Roberto Pieraccini
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11277
Pdf URL: https://arxiv.org/pdf/2512.11277
Copy Paste: [[2512.11277]] When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents(https://arxiv.org/abs/2512.11277)
Keywords: large language model
Abstract: Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging: annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain compared to the vanilla Qwen3-1.7B model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.
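
For orientation, the group-relative normalization at the heart of GRPO can be sketched as below: each sampled response is scored, and its advantage is its reward standardized within the group. The equal weighting of tool accuracy and answer correctness in `combined_reward` is an assumption for illustration; the abstract does not give the actual reward design.

```python
from statistics import mean, pstdev
from typing import List

def combined_reward(tool_correct: bool, answer_correct: bool,
                    w_tool: float = 0.5, w_answer: float = 0.5) -> float:
    """Hypothetical reward: weighted sum of tool accuracy and answer correctness."""
    return w_tool * float(tool_correct) + w_answer * float(answer_correct)

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one prompt: (tool call correct, final answer correct).
group = [(True, True), (True, False), (False, False), (False, True)]
rewards = [combined_reward(t, a) for t, a in group]
print(rewards)
print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the group mean, no separate value network is needed; responses that beat their siblings are reinforced and the rest are discouraged.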

Title: AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Authors: Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11280
Pdf URL: https://arxiv.org/pdf/2512.11280
Copy Paste: [[2512.11280]] AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference(https://arxiv.org/abs/2512.11280)
Keywords: large language model
Abstract: Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%, making it a practical solution for efficient and adaptive LLM inference.
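
The two signals named above, token entropy and Jensen-Shannon distance, can be computed as in the sketch below. It only shows the quantities themselves; how AdaSD converts them into its adaptive thresholds is not described in the abstract, so no update rule is attempted here. The example distributions are made up.

```python
import math
from typing import Sequence

def entropy(p: Sequence[float]) -> float:
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance (square root of the JS divergence, base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))

draft = [0.7, 0.2, 0.1]    # hypothetical draft-model next-token distribution
target = [0.6, 0.3, 0.1]   # hypothetical target-model distribution
print(entropy(draft), js_distance(draft, target))
```

With base-2 logarithms the distance lies in [0, 1], which makes it convenient to compare against a threshold: 0 means the draft and target agree exactly, 1 means they put mass on disjoint tokens.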

Title: CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise

Authors: Qingsen Ma, Dianyun Wang, Ran Jing, Yujun Sun, Zhenbo Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11282
Pdf URL: https://arxiv.org/pdf/2512.11282
Copy Paste: [[2512.11282]] CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise(https://arxiv.org/abs/2512.11282)
Keywords: interpretability, explainability, large language model
Abstract: Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non-causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving a 2.6-point improvement in Attributable Rate, a 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API-level profiling further shows that CIP accelerates contextual understanding and reduces end-to-end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models.

Title: RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection

Authors: Rongcheng Wu, Hao Zhu, Shiying Zhang, Mingzhe Wang, Zhidong Li, Hui Li, Jianlong Zhou, Jiangtao Cui, Fang Chen, Pingyang Sun, Qiyu Liao, Ye Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11284
Pdf URL: https://arxiv.org/pdf/2512.11284
Copy Paste: [[2512.11284]] RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection(https://arxiv.org/abs/2512.11284)
Keywords: diffusion
Abstract: Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage these reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods and achieves performance on par with recent diffusion models with only 10% of their parameters while offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.

Title: SRLR: Symbolic Regression based Logic Recovery to Counter Programmable Logic Controller Attacks

Authors: Hao Zhou (Beijing University of Posts and Telecommunications), Suman Sourav (Aalborg University), Binbin Chen (Singapore University of Technology and Design), Ke Yu (Beijing University of Posts and Telecommunications)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11298
Pdf URL: https://arxiv.org/pdf/2512.11298
Copy Paste: [[2512.11298]] SRLR: Symbolic Regression based Logic Recovery to Counter Programmable Logic Controller Attacks(https://arxiv.org/abs/2512.11298)
Keywords: attack, robust
Abstract: Programmable Logic Controllers (PLCs) are critical components in Industrial Control Systems (ICSs). Their potential exposure to the external world makes them susceptible to cyber-attacks. Existing detection methods against controller logic attacks use either specification-based or learnt models. However, specification-based models require experts' manual efforts or access to the PLC's source code, while machine learning-based models often fall short of providing explanations for their decisions. We design SRLR, a Symbolic Regression based Logic Recovery solution to identify the logic of a PLC based only on its inputs and outputs. The recovered logic is used to generate explainable rules for detecting controller logic attacks. SRLR enhances the latest deep symbolic regression methods using the following ICS-specific properties: (1) some important ICS control logic is best represented in the frequency domain rather than the time domain; (2) an ICS controller can operate in multiple modes, each using different logic, where mode switches usually do not happen frequently; (3) a robust controller usually filters out outlier inputs as ICS sensor data can be noisy; and (4) with the above factors captured, the degree of complexity of the formulas is reduced, making effective search possible. Thanks to these enhancements, SRLR consistently outperforms all existing methods in a variety of ICS settings that we evaluate. In terms of recovery accuracy, SRLR's gain can be as high as 39% in some challenging environments. We also evaluate SRLR on a distribution grid containing hundreds of voltage regulators, demonstrating its stability in handling large-scale, complex systems with varied configurations.

Title: Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture

Authors: Jiarun Liu, Shiyue Xu, Yang Li, Shangkun Liu, Yongli Yu, Peng Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11303
Pdf URL: https://arxiv.org/pdf/2512.11303
Copy Paste: [[2512.11303]] Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture(https://arxiv.org/abs/2512.11303)
Keywords: large language model
Abstract: Large Language Model agents face fundamental challenges in adapting to novel tasks due to limitations in tool availability and experience reuse. Existing approaches either rely on predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to inefficient exploration and suboptimal performance. We introduce SMITH (Shared Memory Integrated Tool Hub), a unified cognitive architecture that seamlessly integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization. SMITH organizes agent memory into procedural, semantic, and episodic components, enabling systematic capability expansion while preserving successful execution patterns. Our approach formalizes tool creation as iterative code generation within controlled sandbox environments and experience sharing through episodic memory retrieval with semantic similarity matching. We further propose a curriculum learning strategy based on agent-ensemble difficulty re-estimation. Extensive experiments on the GAIA benchmark demonstrate SMITH's effectiveness, achieving 81.8% Pass@1 accuracy and outperforming state-of-the-art baselines including Alita (75.2%) and Memento (70.9%). Our work establishes a foundation for building truly adaptive agents that continuously evolve their capabilities through principled integration of tool creation and experience accumulation.
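
Episodic retrieval by semantic similarity, as mentioned above, is commonly done with cosine similarity over embeddings. The sketch below assumes that approach; the memory layout, the episode names, and the two-dimensional vectors are hypothetical, and the abstract does not say which similarity measure or embedding model SMITH uses.

```python
import math
from typing import Dict, List, Sequence, Tuple

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query: Sequence[float], memory: Dict[str, List[float]],
             k: int = 1) -> List[Tuple[str, float]]:
    """Return the k stored episodes most similar to the query embedding."""
    scored = [(name, cosine(query, vec)) for name, vec in memory.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Hypothetical episodic memory: past task name -> embedding of its description.
memory = {"parse_pdf_table": [1.0, 0.0], "web_search_fact": [0.0, 1.0]}
print(retrieve([0.9, 0.1], memory))
```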

Title: QGEC : Quantum Golay Code Error Correction

Authors: Hideo Mukai, Hoshitaro Ohnishi
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2512.11307
Pdf URL: https://arxiv.org/pdf/2512.11307
Copy Paste: [[2512.11307]] QGEC : Quantum Golay Code Error Correction(https://arxiv.org/abs/2512.11307)
Keywords: transformer, generative
Abstract: Quantum computers can potentially carry a much smaller computational load than classical computers on specific problems. Quantum error correction (QEC) is vital for handling qubits, which are vulnerable to external noise. In QEC, actual errors are predicted from the results of syndrome measurements by stabilizer generators, in place of making direct measurements of the data qubits. Here, we propose Quantum Golay code Error Correction (QGEC), a QEC method using the Golay code, which is an efficient coding method in classical information theory. We investigated our method's ability in decoding calculations with the Transformer. We evaluated the accuracy of the decoder in a code space defined by the generative polynomials with three different weight sets and three noise models with different correlations of bit-flip error and phase-flip error. Furthermore, under a noise model following a discrete uniform distribution, we compared the decoding performance of Transformer decoders with identical architectures trained respectively on Golay and toric codes. The results showed that the noise model with the smaller correlation gave better accuracy, while the weights of the generative polynomials had little effect on the accuracy of the decoder. In addition, they showed that the Golay code, which requires 23 data qubits and has a code distance of 7, achieved higher decoding accuracy than the toric code, which requires 50 data qubits and has a code distance of 5. This suggests that implementing quantum error correction using a Transformer may enable the Golay code to realize fault-tolerant quantum computation more efficiently.
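
The comparison of code distances can be made concrete with the standard bound that a distance-$d$ code corrects up to $\lfloor (d-1)/2 \rfloor$ arbitrary errors. The sketch below applies it to the qubit counts and distances quoted in the abstract; it says nothing about the decoder accuracy the paper measures.

```python
def correctable_errors(distance: int) -> int:
    """Maximum number of arbitrary errors a distance-d code can correct."""
    return (distance - 1) // 2

# (data qubits, code distance) as quoted in the abstract.
codes = {"Golay": (23, 7), "toric": (50, 5)}
for name, (n, d) in codes.items():
    t = correctable_errors(d)
    print(f"{name}: n={n}, d={d}, corrects up to {t} errors")
```

By this bound the Golay code corrects up to 3 errors with 23 data qubits, while the toric code considered here corrects up to 2 with 50.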

Title: Benchmarking the Generality of Vision-Language-Action Models

Authors: Pranav Guruprasad, Sudipta Chowdhury, Harsh Sikka, Mridul Sharma, Helen Lu, Sean Rivera, Aryan Khurana, Hangliang Ren, Yangyue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11315
Pdf URL: https://arxiv.org/pdf/2512.11315
Copy Paste: [[2512.11315]] Benchmarking the Generality of Vision-Language-Action Models(https://arxiv.org/abs/2512.11315)
Keywords: robust
Abstract: Generalist multimodal agents are expected to unify perception, language, and control, operating robustly across diverse real-world domains. However, current evaluation practices remain fragmented across isolated benchmarks, making it difficult to assess whether today's foundation models truly generalize beyond their training distributions. We introduce MultiNet v1.0, a unified benchmark for measuring the cross-domain generality of vision language models (VLMs) and vision language action models (VLAs) across six foundational capability regimes: visual grounding, spatial reasoning, tool use, physical commonsense, multi-agent coordination, and continuous robot control. Evaluating GPT 5, Pi0, and Magma, we find that no model demonstrates consistent generality. All exhibit substantial degradation on unseen domains, unfamiliar modalities, or cross-domain task shifts despite strong performance within their training distributions. These failures manifest as modality misalignment, output format instability, and catastrophic knowledge degradation under domain [...]. [...] findings reveal a persistent gap between the aspiration of generalist intelligence and the actual capabilities of current foundation models. MultiNet v1.0 provides a standardized evaluation substrate for diagnosing these gaps and guiding the development of future generalist [...]. Code, data, and leaderboards are publicly available.

Title: Visualisation for the CIS benchmark scanning results

Authors: Zhenshuo Zhao, Maria Spichkova, Duttkumari Champavat, Juilee N. Kulkarni, Sahil Singla, Muhammad A. Zulkefli, Pradhuman Khandelwal
Subjects: cs.CR, cs.SE
Abstract URL: https://arxiv.org/abs/2512.11316
Pdf URL: https://arxiv.org/pdf/2512.11316
Copy Paste: [[2512.11316]] Visualisation for the CIS benchmark scanning results(https://arxiv.org/abs/2512.11316)
Keywords: secure, security
Abstract: In this paper, we introduce GraphSecure, a web application that provides advanced analysis and visualisation of security scanning results. GraphSecure enables users to initiate scans for their AWS account, validate them against specific Center for Internet Security (CIS) Benchmarks and return results, showcase those returned results in the form of statistical charts and warn the users about their account status.

Title: SATMapTR: Satellite Image Enhanced Online HD Map Construction

Authors: Bingyuan Huang, Guanyi Zhao, Qian Xu, Yang Lou, Yung-Hui Li, Jianping Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11319
Pdf URL: https://arxiv.org/pdf/2512.11319
Copy Paste: [[2512.11319]] SATMapTR: Satellite Image Enhanced Online HD Map Construction(https://arxiv.org/abs/2512.11319)
Keywords: robust, extraction
Abstract: High-definition (HD) maps are evolving from pre-annotated to real-time construction to better support autonomous driving in diverse scenarios. However, this process is hindered by low-quality input data caused by onboard sensors' limited capability and frequent occlusions, leading to incomplete, noisy, or missing data, and thus reduced mapping accuracy and robustness. Recent efforts have introduced satellite images as auxiliary input, offering a stable, wide-area view to complement the limited ego perspective. However, satellite images in Bird's Eye View are often degraded by shadows and occlusions from vegetation and buildings. Prior methods using basic feature extraction and fusion remain ineffective. To address these challenges, we propose SATMapTR, a novel online map construction model that effectively fuses satellite images through two key components: (1) a gated feature refinement module that adaptively filters satellite image features by integrating high-level semantics with low-level structural cues to extract map-relevant representations with a high signal-to-noise ratio; and (2) a geometry-aware fusion module that consistently fuses satellite and BEV features at a grid-to-grid level, minimizing interference from irrelevant regions and low-quality inputs. Experimental results on the nuScenes dataset show that SATMapTR achieves the highest mean average precision (mAP) of 73.8, outperforming state-of-the-art satellite-enhanced models by up to 14.2 mAP. It also shows lower mAP degradation under adverse weather and sensor failures, and achieves nearly 3 times higher mAP at extended perception ranges.

Title: KeyframeFace: From Text to Expressive Facial Keyframes

Authors: Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11321
Pdf URL: https://arxiv.org/pdf/2512.11321
Copy Paste: [[2512.11321]] KeyframeFace: From Text to Expressive Facial Keyframes(https://arxiv.org/abs/2512.11321)
Keywords: large language model
Abstract: Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit's coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at this https URL.

Title: MLLM Machine Unlearning via Visual Knowledge Distillation

Authors: Yuhang Wang, Zhenxing Niu, Haoxuan Ji, Guangyu He, Haichang Gao, Gang Hua
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11325
Pdf URL: https://arxiv.org/pdf/2512.11325
Copy Paste: [[2512.11325]] MLLM Machine Unlearning via Visual Knowledge Distillation(https://arxiv.org/abs/2512.11325)
Keywords: attack, robust
Abstract: Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.

Title: Spectral entropy prior-guided deep feature fusion architecture for magnetic core loss

Authors: Cong Yao, Chunye Gong, Jin Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11334
Pdf URL: https://arxiv.org/pdf/2512.11334
Copy Paste: [[2512.11334]] Spectral entropy prior-guided deep feature fusion architecture for magnetic core loss(https://arxiv.org/abs/2512.11334)
Keywords: robust, interpretability
Abstract: Accurate core loss modeling is critical for the design of high-efficiency power electronic systems. Traditional core loss modeling methods have limitations in prediction accuracy. To advance this field, the IEEE Power Electronics Society launched the MagNet Challenge in 2023, the first international competition focused on data-driven power electronics design methods, aiming to uncover complex loss patterns in magnetic components through a data-driven paradigm. Although purely data-driven models demonstrate strong fitting performance, their interpretability and cross-distribution generalization capabilities remain limited. To address these issues, this paper proposes a hybrid model, SEPI-TFPNet, which integrates empirical models with deep learning. The physical-prior submodule employs a spectral entropy discrimination mechanism to select the most suitable empirical model under different excitation waveforms. The data-driven submodule incorporates convolutional neural networks, multi-head attention mechanisms, and bidirectional long short-term memory networks to extract flux-density time-series features. An adaptive feature fusion module is introduced to improve multimodal feature interaction and integration. Using the MagNet dataset containing various magnetic materials, this paper evaluates the proposed method and compares it with 21 representative models from the 2023 challenge and three advanced methods from 2024-2025. The results show that the proposed method achieves improved modeling accuracy and robustness.

Title: FreqDINO: Frequency-Guided Adaptation for Generalized Boundary-Aware Ultrasound Image Segmentation

Authors: Yixuan Zhang, Qing Xu, Yue Li, Xiangjian He, Qian Zhang, Mainul Haque, Rong Qu, Wenting Duan, Zhen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11335
Pdf URL: https://arxiv.org/pdf/2512.11335
Copy Paste: [[2512.11335]] FreqDINO: Frequency-Guided Adaptation for Generalized Boundary-Aware Ultrasound Image Segmentation(https://arxiv.org/abs/2512.11335)
Keywords: extraction, segmentation
Abstract: Ultrasound image segmentation is pivotal for clinical diagnosis, yet challenged by speckle noise and imaging artifacts. Recently, DINOv3 has shown remarkable promise in medical image segmentation with its powerful representation capabilities. However, DINOv3, pre-trained on natural images, lacks sensitivity to ultrasound-specific boundary degradation. To address this limitation, we propose FreqDINO, a frequency-guided segmentation framework that enhances boundary perception and structural consistency. Specifically, we devise a Multi-scale Frequency Extraction and Alignment (MFEA) strategy to separate low-frequency structures and multi-scale high-frequency boundary details, and align them via learnable attention. We also introduce a Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components and refines spatial features. Furthermore, we design a Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions. Extensive experiments demonstrate that FreqDINO surpasses state-of-the-art methods and achieves remarkable generalization capability. The code is at this https URL.

Title: UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models

Authors: Hewen Pan, Cong Wei, Dashuang Liang, Zepeng Huang, Pengfei Gao, Ziqi Zhou, Lulu Xue, Pengfei Yan, Xiaoming Wei, Minghui Li, Shengshan Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11336
Pdf URL: https://arxiv.org/pdf/2512.11336
Copy Paste: [[2512.11336]] UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models(https://arxiv.org/abs/2512.11336)
Keywords: large language model
Abstract: With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform both holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct UFVideo-Bench, consisting of three distinct collaborative tasks across these scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.

Title: Symmetry-Aware Steering of Equivariant Diffusion Policies: Benefits and Limits

Authors: Minwoo Park, Junwoo Chang, Jongeun Choi, Roberto Horowitz
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.11345
Pdf URL: https://arxiv.org/pdf/2512.11345
Copy Paste: [[2512.11345]] Symmetry-Aware Steering of Equivariant Diffusion Policies: Benefits and Limits(https://arxiv.org/abs/2512.11345)
Keywords: diffusion, generative
Abstract: Equivariant diffusion policies (EDPs) combine the generative expressivity of diffusion models with the strong generalization and sample efficiency afforded by geometric symmetries. While steering these policies with reinforcement learning (RL) offers a promising mechanism for fine-tuning beyond demonstration data, directly applying standard (non-equivariant) RL can be sample-inefficient and unstable, as it ignores the symmetries that EDPs are designed to exploit. In this paper, we theoretically establish that the diffusion process of an EDP is equivariant, which in turn induces a group-invariant latent-noise MDP that is well-suited for equivariant diffusion steering. Building on this theory, we introduce a principled symmetry-aware steering framework and compare standard, equivariant, and approximately equivariant RL strategies through comprehensive experiments across tasks with varying degrees of symmetry. While we identify the practical boundaries of strict equivariance under symmetry breaking, we show that exploiting symmetry during the steering process yields substantial benefits: enhancing sample efficiency, preventing value divergence, and achieving strong policy improvements even when EDPs are trained from extremely limited demonstrations.

Title: Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture

Authors: Tanu Singh, Pranamesh Chakraborty, Long T. Truong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11350
Pdf URL: https://arxiv.org/pdf/2512.11350
Copy Paste: [[2512.11350]] Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture(https://arxiv.org/abs/2512.11350)
Keywords: robust, transformer
Abstract: Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.

Title: CAT: Can Trust be Predicted with Context-Awareness in Dynamic Heterogeneous Networks?

Authors: Jie Wang, Zheng Yan, Jiahe Lan, Xuyan Li, Elisa Bertino
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11352
Pdf URL: https://arxiv.org/pdf/2512.11352
Copy Paste: [[2512.11352]] CAT: Can Trust be Predicted with Context-Awareness in Dynamic Heterogeneous Networks?(https://arxiv.org/abs/2512.11352)
Keywords: security, attack, robust
Abstract: Trust prediction provides valuable support for decision-making, risk mitigation, and system security enhancement. Recently, Graph Neural Networks (GNNs) have emerged as a promising approach for trust prediction, owing to their ability to learn expressive node representations that capture intricate trust relationships within a network. However, current GNN-based trust prediction models face several limitations: (i) Most of them fail to capture trust dynamicity, leading to questionable inferences. (ii) They rarely consider the heterogeneous nature of real-world networks, resulting in a loss of rich semantics. (iii) None of them support context-awareness, a basic property of trust, making prediction results coarse-grained. To this end, we propose CAT, the first Context-Aware GNN-based Trust prediction model that supports trust dynamicity and accurately represents real-world heterogeneity. CAT consists of a graph construction layer, an embedding layer, a heterogeneous attention layer, and a prediction layer. It handles dynamic graphs using continuous-time representations and captures temporal information through a time encoding function. To model graph heterogeneity and leverage semantic information, CAT employs a dual attention mechanism that identifies the importance of different node types and nodes within each type. For context-awareness, we introduce a new notion of meta-paths to extract contextual features. By constructing context embeddings and integrating a context-aware aggregator, CAT can predict both context-aware trust and overall trust. Extensive experiments on three real-world datasets demonstrate that CAT outperforms five groups of baselines in trust prediction, while exhibiting strong scalability to large-scale graphs and robustness against both trust-oriented and GNN-oriented attacks.

Title: A Multi-Mode Structured Light 3D Imaging System with Multi-Source Information Fusion for Underwater Pipeline Detection

Authors: Qinghan Hu, Haijiang Zhu, Na Sun, Lei Chen, Zhengqiang Fan, Zhiqing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11354
Pdf URL: https://arxiv.org/pdf/2512.11354
Copy Paste: [[2512.11354]] A Multi-Mode Structured Light 3D Imaging System with Multi-Source Information Fusion for Underwater Pipeline Detection(https://arxiv.org/abs/2512.11354)
Keywords: robust
Abstract: Underwater pipelines are highly susceptible to corrosion, which not only shortens their service life but also poses significant safety risks. Compared with manual inspection, an intelligent real-time imaging system for underwater pipeline detection has become a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can recover sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a fast distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the presence of numerous disturbances in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed. This algorithm integrates a pipeline edge detection network with enhanced point cloud registration to achieve robust and high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability and robustness, providing a solid foundation for autonomous underwater pipeline detection.

Title: Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video

Authors: Meng-Li Shih, Ying-Huan Chen, Yu-Lun Liu, Brian Curless
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11356
Pdf URL: https://arxiv.org/pdf/2512.11356
Copy Paste: [[2512.11356]] Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video(https://arxiv.org/abs/2512.11356)
Keywords: segmentation
Abstract: We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings.

Title: Attacking and Securing Community Detection: A Game-Theoretic Framework

Authors: Yifan Niu, Aochuan Chen, Tingyang Xu, Jia Li
Subjects: cs.LG, cs.GT
Abstract URL: https://arxiv.org/abs/2512.11359
Pdf URL: https://arxiv.org/pdf/2512.11359
Copy Paste: [[2512.11359]] Attacking and Securing Community Detection: A Game-Theoretic Framework(https://arxiv.org/abs/2512.11359)
Keywords: privacy, protect, defense, attack, robust
Abstract: It has been demonstrated that adversarial graphs, i.e., graphs with imperceptible perturbations, can cause deep graph models to fail on classification tasks. In this work, we extend the concept of adversarial graphs to the community detection problem, which is more challenging. We propose novel attack and defense techniques for the community detection problem, with the objective of hiding targeted individuals from detection models and enhancing the robustness of community detection models, respectively. These techniques have many applications in real-world scenarios, for example, protecting personal privacy in social networks and understanding camouflage patterns in transaction networks. To simulate interactive attack and defense behaviors, we further propose a game-theoretic framework, called CD-GAME. One player is a graph attacker, while the other player is a Rayleigh Quotient defender. CD-GAME models the mutual influence and feedback mechanisms between the attacker and the defender, revealing the dynamic evolutionary process of the game. Both players dynamically update their strategies until they reach the Nash equilibrium. Extensive experiments demonstrate the effectiveness of our proposed attack and defense methods, and both outperform existing baselines by a significant margin. Furthermore, CD-GAME provides valuable insights for understanding interactive attack and defense scenarios in community detection problems. We found that in traditional single-step attack or defense, the attacker tends to employ the most effective strategies, which are nevertheless easily detected and countered by the defender. When the interactive game reaches a Nash equilibrium, the attacker adopts more imperceptible strategies that can still achieve satisfactory attack effectiveness even after defense.

Title: Reliable Detection of Minute Targets in High-Resolution Aerial Imagery across Temporal Shifts

Authors: Mohammad Sadegh Gholizadeh, Amir Arsalan Rezapour, Hamidreza Shayegh, Ehsan Pazouki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11360
Pdf URL: https://arxiv.org/pdf/2512.11360
Copy Paste: [[2512.11360]] Reliable Detection of Minute Targets in High-Resolution Aerial Imagery across Temporal Shifts(https://arxiv.org/abs/2512.11360)
Keywords: robust
Abstract: Efficient crop detection via Unmanned Aerial Vehicles is critical for scaling precision agriculture, yet it remains challenging due to the small scale of targets and environmental variability. This paper addresses the detection of rice seedlings in paddy fields by leveraging a Faster R-CNN architecture initialized via transfer learning. To overcome the specific difficulties of detecting minute objects in high-resolution aerial imagery, we curate a significant UAV dataset for training and rigorously evaluate the model's generalization capabilities. Specifically, we validate performance across three distinct test sets acquired at different temporal intervals, thereby assessing robustness against varying imaging conditions. Our empirical results demonstrate that transfer learning not only facilitates the rapid convergence of object detection models in agricultural contexts but also yields consistent performance despite domain shifts in image acquisition.

Title: qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs

Authors: Shreya Shukla, Aditya Sriram, Milinda Kuppur Narayanaswamy, Hiteshi Jain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11366
Pdf URL: https://arxiv.org/pdf/2512.11366
Copy Paste: [[2512.11366]] qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs(https://arxiv.org/abs/2512.11366)
Keywords: robust, data-free, large language model
Abstract: The deployment of large language models for specialized tasks often requires domain-specific parameter-efficient finetuning through Low-Rank Adaptation (LoRA) modules. However, effectively fusing these adapters to handle complex, multi-domain composite queries remains a critical challenge. Existing LoRA fusion approaches either use static weights, which assign equal relevance to each participating LoRA, or require data-intensive supervised training for every possible LoRA combination to obtain respective optimal fusion weights. We propose qa-FLoRA, a novel query-adaptive data-and-training-free method for LoRA fusion that dynamically computes layer-level fusion weights by measuring distributional divergence between the base model and respective adapters. Our approach eliminates the need for composite training data or domain-representative samples, making it readily applicable to existing adapter collections. Extensive experiments across nine multilingual composite tasks spanning mathematics, coding, and medical domains, show that qa-FLoRA outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and the training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing the gap with supervised baselines. Further, layer-level analysis of our fusion weights reveals interpretable fusion patterns, demonstrating the effectiveness of our approach for robust multi-domain adaptation.
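
Illustrative sketch (not from the paper): the abstract describes layer-level fusion weights computed from the distributional divergence between the base model and each adapter. One way to picture the idea for a single layer is below. It assumes KL divergence as the measure, assumes weights proportional to each adapter's divergence from the base, and uses toy 4-way distributions. The paper's actual divergence, direction, and normalization may differ; `kl_divergence` and `fusion_weights` are names invented for this example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def fusion_weights(base_dist, adapter_dists):
    """Assumed rule: weight each adapter by how far it moves away from the base."""
    divergences = [kl_divergence(a, base_dist) for a in adapter_dists]
    total = sum(divergences)
    if total == 0:
        # No adapter changes the distribution: fall back to uniform weights.
        return [1.0 / len(divergences)] * len(divergences)
    return [d / total for d in divergences]

# Toy distributions for one layer and one query (hypothetical values).
base = [0.25, 0.25, 0.25, 0.25]
math_adapter = [0.70, 0.10, 0.10, 0.10]   # shifts strongly on this query
code_adapter = [0.30, 0.30, 0.20, 0.20]   # stays close to the base

weights = fusion_weights(base, [math_adapter, code_adapter])
print(weights)
```

Repeating the computation per layer and per query would give the query-adaptive, layer-level behaviour the abstract describes; static fusion corresponds to always returning the uniform fallback.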

Title: Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection

Authors: Kuan Wang, Yanjun Qin, Mengge Lu, Liejun Wang, Xiaoming Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11369
Pdf URL: https://arxiv.org/pdf/2512.11369
Copy Paste: [[2512.11369]] Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection(https://arxiv.org/abs/2512.11369)
Keywords: extraction, segmentation
Abstract: Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations. Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: this https URL.

Title: Out-of-Distribution Segmentation via Wasserstein-Based Evidential Uncertainty

Authors: Arnold Brosch, Abdelrahman Eldesokey, Michael Felsberg, Kira Maag
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11373
Pdf URL: https://arxiv.org/pdf/2512.11373
Copy Paste: [[2512.11373]] Out-of-Distribution Segmentation via Wasserstein-Based Evidential Uncertainty(https://arxiv.org/abs/2512.11373)
Keywords: segmentation
Abstract: Deep neural networks achieve superior performance in semantic segmentation, but are limited to a predefined set of classes, which leads to failures when they encounter unknown objects in open-world scenarios. Recognizing and segmenting these out-of-distribution (OOD) objects is crucial for safety-critical applications such as automated driving. In this work, we present an evidence segmentation framework using a Wasserstein loss, which captures distributional distances while respecting the probability simplex geometry. Combined with Kullback-Leibler regularization and Dice structural consistency terms, our approach leads to improved OOD segmentation performance compared to uncertainty-based approaches.

Title: Mining Legal Arguments to Study Judicial Formalism

Authors: Tomáš Koref, Lena Held, Mahammad Namazov, Harun Kumru, Yassine Thlija, Christoph Burchard, Ivan Habernal
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.11374
Pdf URL: https://arxiv.org/pdf/2512.11374
Copy Paste: [[2512.11374]] Mining Legal Arguments to Study Judicial Formalism(https://arxiv.org/abs/2512.11374)
Keywords: explainability, transformer
Abstract: Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts' decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs for the Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6% macro-F1), classify traditional types of legal argument (77.5% macro-F1), and classify decisions as formalistic/non-formalistic (83.2% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable judicial philosophy classification and demonstrates its potential for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at this https URL.

Title: Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization

Authors: Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, Jia Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11391
Pdf URL: https://arxiv.org/pdf/2512.11391
Copy Paste: [[2512.11391]] Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization(https://arxiv.org/abs/2512.11391)
Keywords: large language model
Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is important to ensure their behaviors align with human values, societal norms, and ethical principles. However, safety alignment under Reinforcement Learning (RL) often suffers from forgetting learned general abilities, which is also known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework for LLM safety alignment while preserving their core abilities. The safety policy gradients are geometrically projected into the null space of general tasks, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model's original core capabilities, while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms existing methods by a large margin, achieving state-of-the-art safety performance without sacrificing accuracy on general tasks, including math, code, and instruction-following tasks. Notably, NSPO is data-efficient and only requires 40% of public human-annotated safety data from PKU-SafeRLHF to achieve promising safety performance, without the large amount of mixed general-task data that existing alignment methods rely on.
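
Illustrative note (not from the paper): the null-space projection idea can be sketched in a few lines. The abstract does not say how NSPO constructs the null space of general tasks, so this sketch assumes it is the orthogonal complement of a small set of general-task gradient directions. The names `orthonormal_basis`, `project_to_null_space`, and `general_dirs` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(safety_grad: Vector, general_dirs: List[Vector]) -> Vector:
    """Remove every component of `safety_grad` that lies in span(general_dirs).

    Assumption: the rows of `general_dirs` stand in for the general-task
    directions that the update must not disturb.
    """
    projected = list(safety_grad)
    for b in orthonormal_basis(general_dirs):
        coeff = dot(projected, b)
        projected = [p - coeff * bi for p, bi in zip(projected, b)]
    return projected


if __name__ == "__main__":
    g = project_to_null_space([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
    print(g)  # only the third coordinate survives
```

Because the result is an orthogonal projection of the safety gradient, its inner product with the original gradient is non-negative. That is consistent with the abstract's claim that a descent direction is preserved, though the paper's actual proof may rest on different assumptions.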

Title: Bhargava Cube--Inspired Quadratic Regularization for Structured Neural Embeddings

Authors: S Sairam, Prateek P Kulkarni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11392
Pdf URL: https://arxiv.org/pdf/2512.11392
Copy Paste: [[2512.11392]] Bhargava Cube--Inspired Quadratic Regularization for Structured Neural Embeddings(https://arxiv.org/abs/2512.11392)
Keywords: interpretability
Abstract: We present a novel approach to neural representation learning that incorporates algebraic constraints inspired by Bhargava cubes from number theory. Traditional deep learning methods learn representations in unstructured latent spaces lacking interpretability and mathematical consistency. Our framework maps input data to constrained 3-dimensional latent spaces where embeddings are regularized to satisfy learned quadratic relationships derived from Bhargava's combinatorial structures. The architecture employs a differentiable auxiliary loss function operating independently of classification objectives, guiding models toward mathematically structured representations. We evaluate on MNIST, achieving 99.46% accuracy while producing interpretable 3D embeddings that naturally cluster by digit class and satisfy learned quadratic constraints. Unlike existing manifold learning approaches requiring explicit geometric supervision, our method imposes weak algebraic priors through differentiable constraints, ensuring compatibility with standard optimization. This represents the first application of number-theoretic constructs to neural representation learning, establishing a foundation for incorporating structured mathematical priors in neural networks.

Title: Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction

Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2512.11399
Pdf URL: https://arxiv.org/pdf/2512.11399
Copy Paste: [[2512.11399]] Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction(https://arxiv.org/abs/2512.11399)
Keywords: extraction, large language model
Abstract: Vision-Language Models (VLMs) are able to process increasingly long videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.

Title: Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection

Authors: Qishan Wang, Haofeng Wang, Shuyong Gao, Jia Guo, Li Xiong, Jiaqi Li, Dengxuan Bai, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11401
Pdf URL: https://arxiv.org/pdf/2512.11401
Copy Paste: [[2512.11401]] Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection(https://arxiv.org/abs/2512.11401)
Keywords: segmentation
Abstract: Industrial anomaly detection is a challenging open-set task that aims to identify unknown anomalous patterns deviating from normal data distribution. To avoid the significant memory consumption and limited generalizability brought by building separate models per class, we focus on developing a unified framework for multi-class anomaly detection. However, under this challenging setting, conventional reconstruction-based networks often suffer from an identity mapping problem, where they directly replicate input features regardless of whether they are normal or anomalous, resulting in detection failures. To address this issue, this study proposes a novel framework termed Collaborative Reconstruction and Repair (CRR), which turns reconstruction into repair. First, we optimize the decoder to reconstruct normal samples while repairing synthesized anomalies. Consequently, it generates distinct representations for anomalous regions and similar representations for normal areas compared to the encoder's output. Second, we implement feature-level random masking to ensure that the representations from the decoder contain sufficient local information. Finally, to minimize detection errors arising from the discrepancies between feature representations from the encoder and decoder, we train a segmentation network supervised by synthetic anomaly masks, thereby enhancing localization performance. Extensive experiments on industrial datasets show that CRR effectively mitigates the identity mapping issue and achieves state-of-the-art performance in multi-class industrial anomaly detection.

Title: JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

Authors: Chaochao Li, Ruikui Wang, Liangbo Zhou, Jinheng Feng, Huaishao Luo, Huan Zhang, Youzheng Wu, Xiaodong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11423
Pdf URL: https://arxiv.org/pdf/2512.11423
Copy Paste: [[2512.11423]] JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion(https://arxiv.org/abs/2512.11423)
Keywords: diffusion
Abstract: Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion. However, these methods suffer from error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model runs at 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.

Title: Proving DNSSEC Correctness: A Formal Approach to Secure Domain Name Resolution

Authors: Qifan Zhang, Zilin Shen, Imtiaz Karim, Elisa Bertino, Zhou Li
Subjects: cs.CR, cs.FL, cs.NI
Abstract URL: https://arxiv.org/abs/2512.11431
Pdf URL: https://arxiv.org/pdf/2512.11431
Copy Paste: [[2512.11431]] Proving DNSSEC Correctness: A Formal Approach to Secure Domain Name Resolution(https://arxiv.org/abs/2512.11431)
Keywords: secure, security, attack
Abstract: The Domain Name System Security Extensions (DNSSEC) are critical for preventing DNS spoofing, yet their specifications contain ambiguities and vulnerabilities that elude traditional "break-and-fix" approaches. A holistic, foundational security analysis of the protocol has thus remained an open problem. This paper introduces DNSSECVerif, the first framework for comprehensive, automated formal security analysis of the DNSSEC protocol suite. Built on the SAPIC+ symbolic verifier, our high-fidelity model captures protocol-level interactions, including cryptographic operations and stateful caching with fine-grained concurrency control. Using DNSSECVerif, we formally prove four of DNSSEC's core security guarantees and uncover critical ambiguities in the standards--notably, the insecure coexistence of NSEC and NSEC3. Our model also automatically rediscovers three classes of known attacks, demonstrating fundamental weaknesses in the protocol design. To bridge the model-to-reality gap, we validate our findings through targeted testing of mainstream DNS software and a large-scale measurement study of over 2.2 million open resolvers, confirming the real-world impact of these flaws. Our work provides crucial, evidence-based recommendations for hardening DNSSEC specifications and implementations.

Title: CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare

Authors: Akash Ghosh, Srivarshinee Sridhar, Raghav Kaushik Ravi, Muhsin Muhsin, Sriparna Saha, Chirag Agarwal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11437
Pdf URL: https://arxiv.org/pdf/2512.11437
Copy Paste: [[2512.11437]] CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare(https://arxiv.org/abs/2512.11437)
Keywords: privacy, attack, robust, fair
Abstract: Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.

Title: Hyperbolic Gaussian Blurring Mean Shift: A Statistical Mode-Seeking Framework for Clustering in Curved Spaces

Authors: Arghya Pratihar, Arnab Seal, Swagatam Das, Inesh Chattopadhyay
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11448
Pdf URL: https://arxiv.org/pdf/2512.11448
Copy Paste: [[2512.11448]] Hyperbolic Gaussian Blurring Mean Shift: A Statistical Mode-Seeking Framework for Clustering in Curved Spaces(https://arxiv.org/abs/2512.11448)
Keywords: robust
Abstract: Clustering is a fundamental unsupervised learning task for uncovering patterns in data. While Gaussian Blurring Mean Shift (GBMS) has proven effective for identifying arbitrarily shaped clusters in Euclidean space, it struggles with datasets exhibiting hierarchical or tree-like structures. In this work, we introduce HypeGBMS, a novel extension of GBMS to hyperbolic space. Our method replaces Euclidean computations with hyperbolic distances and employs Möbius-weighted means to ensure that all updates remain consistent with the geometry of the space. HypeGBMS effectively captures latent hierarchies while retaining the density-seeking behavior of GBMS. We provide theoretical insights into convergence and computational complexity, along with empirical results that demonstrate improved clustering quality in hierarchical datasets. This work bridges classical mean-shift clustering and hyperbolic representation learning, offering a principled approach to density-based clustering in curved spaces. Extensive experimental evaluations on 11 real-world datasets demonstrate that HypeGBMS significantly outperforms conventional mean-shift clustering methods in non-Euclidean settings, underscoring its robustness and effectiveness.
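
Illustrative note (not from the paper): a minimal sketch of swapping Euclidean distances for hyperbolic ones inside a Gaussian kernel. The abstract does not name a model of hyperbolic space, so this assumes the Poincaré ball with curvature -1. It covers only the distance and kernel weights, not the Möbius-weighted mean update. The function names and the `bandwidth` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def poincare_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Geodesic distance between two points of the open unit (Poincare) ball."""
    sq_x = sum(v * v for v in x)
    sq_y = sum(v * v for v in y)
    if sq_x >= 1.0 or sq_y >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


def gaussian_weights(
    center: Sequence[float],
    points: Sequence[Sequence[float]],
    bandwidth: float,
) -> List[float]:
    """Gaussian kernel weights computed from hyperbolic rather than Euclidean distance."""
    return [
        math.exp(-poincare_distance(center, p) ** 2 / (2.0 * bandwidth ** 2))
        for p in points
    ]


if __name__ == "__main__":
    # Equal Euclidean gaps, unequal hyperbolic gaps: distances grow near the boundary.
    print(poincare_distance([0.0, 0.0], [0.45, 0.0]))
    print(poincare_distance([0.45, 0.0], [0.9, 0.0]))
```

Two points near the boundary are much farther apart than two points near the origin with the same Euclidean gap. This is the property that lets hyperbolic space embed tree-like data with low distortion.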

Title: Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation

Authors: Jingmin Zhu, Anqi Zhu, Hossein Rahmani, Jun Liu, Mohammed Bennamoun, Qiuhong Ke
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11458
Pdf URL: https://arxiv.org/pdf/2512.11458
Copy Paste: [[2512.11458]] Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation(https://arxiv.org/abs/2512.11458)
Keywords: large language model
Abstract: We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at this https URL.

Title: Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Authors: Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11464
Pdf URL: https://arxiv.org/pdf/2512.11464
Copy Paste: [[2512.11464]] Exploring MLLM-Diffusion Information Transfer with MetaCanvas(https://arxiv.org/abs/2512.11464)
Keywords: robust, diffusion, large language model
Abstract: Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.

Title: DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation

Authors: Mohamed Abdelsamad, Michael Ulrich, Bin Yang, Miao Zhang, Yakov Miron, Abhinav Valada
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11465
Pdf URL: https://arxiv.org/pdf/2512.11465
Copy Paste: [[2512.11465]] DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation(https://arxiv.org/abs/2512.11465)
Keywords: robust, segmentation
Abstract: Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.
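
Illustrative note (not from the paper): the abstract describes Zipf-Sinkhorn only as a modified Sinkhorn-Knopp algorithm that enforces a power-law prior over prototype usage. The sketch below assumes the simplest such change: standard Sinkhorn-Knopp scaling with the uniform prototype marginal replaced by a Zipfian one. The names `zipf_prior`, `sinkhorn_with_prior`, `exponent`, and `epsilon` are hypothetical, and the paper's sharpness modulation is not modeled.

```python
import math
from typing import List


def zipf_prior(num_prototypes: int, exponent: float = 1.0) -> List[float]:
    """Power-law usage prior: prototype k gets mass proportional to 1 / k**exponent."""
    weights = [1.0 / (k ** exponent) for k in range(1, num_prototypes + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sinkhorn_with_prior(
    scores: List[List[float]],
    prior: List[float],
    epsilon: float = 0.1,
    iters: int = 50,
) -> List[List[float]]:
    """Soft point-to-prototype assignments whose column usage follows `prior`.

    scores[i][k] is the similarity of point i to prototype k. Each returned
    row sums to 1; averaged over points, column k approaches prior[k].
    """
    n = len(scores)
    max_score = max(max(row) for row in scores)  # shift for numerical stability
    q = [[math.exp((s - max_score) / epsilon) for s in row] for row in scores]
    for _ in range(iters):
        # Column step: prototype k receives total mass prior[k].
        for k, target in enumerate(prior):
            col_sum = sum(q[i][k] for i in range(n))
            for i in range(n):
                q[i][k] *= target / col_sum
        # Row step: every point carries total mass 1 / n.
        for i in range(n):
            row_sum = sum(q[i])
            q[i] = [v / (row_sum * n) for v in q[i]]
    return [[v * n for v in row] for row in q]


if __name__ == "__main__":
    demo_scores = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.6, 0.5]]
    for row in sinkhorn_with_prior(demo_scores, zipf_prior(2), epsilon=0.5, iters=100):
        print([round(v, 3) for v in row])
```

A smaller `epsilon` gives sharper targets but slows convergence of the alternating scaling, so the iteration count may need to increase with it.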

Title: Rethinking Expert Trajectory Utilization in LLM Post-training

Authors: Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.11470
Pdf URL: https://arxiv.org/pdf/2512.11470
Copy Paste: [[2512.11470]] Rethinking Expert Trajectory Utilization in LLM Post-training(https://arxiv.org/abs/2512.11470)
Keywords: robust
Abstract: While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) we identify the Minimum SFT Validation Loss as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.

Title: CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop

Authors: Weijian Ma, Shizhao Sun, Ruiyu Wang, Jiang Bian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11480
Pdf URL: https://arxiv.org/pdf/2512.11480
Copy Paste: [[2512.11480]] CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop(https://arxiv.org/abs/2512.11480)
Keywords: diffusion
Abstract: A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's structure, 2) ensuring each edit's semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.

Title: Capacitive Touchscreens at Risk: Recovering Handwritten Trajectory on Smartphone via Electromagnetic Emanations

Authors: Yukun Cheng, Shiyu Zhu, Changhai Ou, Xingshuo Han, Yuan Li, Shihui Zheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.11484
Pdf URL: https://arxiv.org/pdf/2512.11484
Copy Paste: [[2512.11484]] Capacitive Touchscreens at Risk: Recovering Handwritten Trajectory on Smartphone via Electromagnetic Emanations(https://arxiv.org/abs/2512.11484)
Keywords: security, attack
Abstract: This paper reveals and exploits a critical security vulnerability: the electromagnetic (EM) side channel of capacitive touchscreens leaks sufficient information to recover fine-grained, continuous handwriting trajectories. We present Touchscreen Electromagnetic Side-channel Leakage Attack (TESLA), a non-contact attack framework that captures EM signals generated during on-screen writing and regresses them into two-dimensional (2D) handwriting trajectories in real time. Extensive evaluations across a variety of commercial off-the-shelf (COTS) smartphones show that TESLA achieves 77% character recognition accuracy and a Jaccard index of 0.74, demonstrating its capability to recover highly recognizable motion trajectories that closely resemble the original handwriting under realistic attack conditions.
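
Illustrative note (not from the paper): the abstract reports a Jaccard index of 0.74 without stating how trajectories are compared. One plausible reading, assumed here, is to rasterize both trajectories onto a grid and compare the sets of occupied cells. The helper names and the `cell_size` parameter are hypothetical.

```python
from typing import Iterable, Set, Tuple

Cell = Tuple[int, int]


def rasterize(points: Iterable[Tuple[float, float]], cell_size: float = 1.0) -> Set[Cell]:
    """Map continuous (x, y) samples to the set of grid cells they touch."""
    return {(int(x // cell_size), int(y // cell_size)) for x, y in points}


def jaccard_index(a: Set[Cell], b: Set[Cell]) -> float:
    """|A intersect B| / |A union B|, defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    truth = rasterize([(0.2, 0.1), (1.5, 0.3), (2.7, 0.9), (3.1, 1.2)])
    recovered = rasterize([(0.4, 0.6), (1.1, 0.2), (2.2, 1.4), (3.9, 1.8)])
    print(jaccard_index(truth, recovered))
```

The score depends strongly on `cell_size`: a coarser grid forgives small spatial errors, so any reported value is only meaningful alongside the rasterization used.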

Title: Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning

Authors: Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11485
Pdf URL: https://arxiv.org/pdf/2512.11485
Copy Paste: [[2512.11485]] Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning(https://arxiv.org/abs/2512.11485)
Keywords: robust, large language model
Abstract: Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1%), proving it's a strong training-free alternative for complex reasoning.
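
Illustrative note (not from the paper): the "retain only baseline-outperforming guidance via hold-out validation" step can be sketched as a greedy acceptance loop. The abstract does not specify the acceptance rule, so this assumes a candidate is kept only if it strictly improves the hold-out score of the current notebook. `update_notebook` and `evaluate` are hypothetical names; in practice `evaluate` would run the LLM on held-out examples with the notebook in context.

```python
from typing import Callable, List, Sequence


def update_notebook(
    notebook: Sequence[str],
    candidates: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> List[str]:
    """Greedily add guidance entries that strictly improve the hold-out score.

    Since an entry is accepted only when the score goes up, the hold-out
    score of the returned notebook never drops below that of the input.
    """
    kept = list(notebook)
    best = evaluate(kept)  # baseline score before any new guidance
    for guidance in candidates:
        trial = kept + [guidance]
        score = evaluate(trial)
        if score > best:
            kept, best = trial, score
    return kept


if __name__ == "__main__":
    # Toy stand-in for hold-out accuracy: each entry shifts the score by a fixed amount.
    effect = {"check units": 0.10, "guess when unsure": -0.20, "re-read the question": 0.05}

    def toy_evaluate(entries: Sequence[str]) -> float:
        return 0.5 + sum(effect[e] for e in entries)

    print(update_notebook([], list(effect), toy_evaluate))
```

The guarantee is monotonic only with respect to the hold-out set passed to `evaluate`; whether it transfers to test data depends on how representative that set is.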

Title: VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing

Authors: Emanuel Sánchez Aimar, Gulnaz Zhambulova, Fahad Shahbaz Khan, Yonghao Xu, Michael Felsberg
Subjects: cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2512.11490
Pdf URL: https://arxiv.org/pdf/2512.11490
Copy Paste: [[2512.11490]] VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing(https://arxiv.org/abs/2512.11490)
Keywords: generative
Abstract: Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose VLM2GeoVec, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce RSMEB, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves 26.6% P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), 32.5% P@1 on referring-expression retrieval (+19 pp), and 17.8% P@1 on semantic geo-localization retrieval (over 3x prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.

Title: Building Patient Journeys in Hebrew: A Language Model for Clinical Timeline Extraction

Authors: Kai Golan Hashiloni, Brenda Kasabe Nokai, Michal Shevach, Esthy Shemesh, Ronit Bartin, Anna Bergrin, Liran Harel, Nachum Dershowitz, Liat Nadai Arad, Kfir Bar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11502
Pdf URL: https://arxiv.org/pdf/2512.11502
Copy Paste: [[2512.11502]] Building Patient Journeys in Hebrew: A Language Model for Clinical Timeline Extraction(https://arxiv.org/abs/2512.11502)
Keywords: privacy, extraction
Abstract: We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets -- one from internal medicine and emergency departments, and another from oncology -- annotated for event temporal relations. Our results show that our model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.

Title: TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition

Authors: Yanan Liu, Jun Liu, Hao Zhang, Dan Xu, Hossein Rahmani, Mohammed Bennamoun, Qiuhong Ke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11503
Pdf URL: https://arxiv.org/pdf/2512.11503
Copy Paste: [[2512.11503]] TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition(https://arxiv.org/abs/2512.11503)
Keywords: transformer
Abstract: Skeleton-based action recognition has garnered significant attention in the computer vision community. Inspired by the recent success of the selective state-space model (SSM) Mamba in modeling 1D temporal sequences, we propose TSkel-Mamba, a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. In particular, our approach leverages Spatial Transformer for spatial feature learning while utilizing Mamba for temporal modeling. Mamba, however, employs separate SSM blocks for individual channels, which inherently limits its ability to model inter-channel dependencies. To better adapt Mamba for skeleton data and enhance Mamba's ability to model temporal dependencies, we introduce a Temporal Dynamic Modeling (TDM) block, which is a versatile plug-and-play component that integrates a novel Multi-scale Temporal Interaction (MTI) module. The MTI module employs multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition. Extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA and UAV-Human datasets demonstrate that TSkel-Mamba achieves state-of-the-art performance while maintaining low inference time, making it both efficient and highly effective.

Title: On Geometric Understanding and Learned Data Priors in VGGT

Authors: Jelena Bratulić, Sudhanshu Mittal, Thomas Brox, Christian Rupprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11508
Pdf URL: https://arxiv.org/pdf/2512.11508
Copy Paste: [[2512.11508]] On Geometric Understanding and Learned Data Priors in VGGT(https://arxiv.org/abs/2512.11508)
Keywords: robust, transformer
Abstract: The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass. Trained in a supervised, single-step fashion on large datasets, VGGT raises a key question: does it build upon geometric concepts like traditional multi-view methods, or does it rely primarily on learned appearance-based data-driven priors? In this work, we conduct a systematic analysis of VGGT's internal mechanisms to uncover whether geometric understanding emerges within its representations. By probing intermediate features, analyzing attention patterns, and performing interventions, we examine how the model implements its functionality. Our findings reveal that VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. We further investigate VGGT's dependence on its learned data priors. Using spatial input masking and perturbation experiments, we assess its robustness to occlusions, appearance variations, and camera configurations, comparing it with classical multi-stage pipelines. Together, these insights highlight how VGGT internalizes geometric structure while using learned data-driven priors.

Title: Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs

Authors: Mohor Banerjee, Nadya Yuki Wangsajaya, Syed Ali Redha Alsagoff, Min Sen Tan, Zachary Choy Kit Chun, Alvin Chan Guo Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11509
Pdf URL: https://arxiv.org/pdf/2512.11509
Copy Paste: [[2512.11509]] Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs(https://arxiv.org/abs/2512.11509)
Keywords: large language model
Abstract: Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.

Title: Reconstruction as a Bridge for Event-Based Visual Question Answering

Authors: Hanyue Lou, Jiayi Zhou, Yang Zhang, Boyu Li, Yi Wang, Guangnan Ye, Boxin Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11510
Pdf URL: https://arxiv.org/pdf/2512.11510
Copy Paste: [[2512.11510]] Reconstruction as a Bridge for Event-Based Visual Question Answering(https://arxiv.org/abs/2512.11510)
Keywords: robust, large language model
Abstract: Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-Q&A pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.

Title: NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics

Authors: Hao Wu, Yuan Gao, Fan Xu, Fan Zhang, Guangliang Liu, Yuxuan Liang, Xiaomeng Huang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11525
Pdf URL: https://arxiv.org/pdf/2512.11525
Copy Paste: [[2512.11525]] NeuralOGCM: Differentiable Ocean Modeling with Learnable Physics(https://arxiv.org/abs/2512.11525)
Keywords: diffusion
Abstract: High-precision scientific simulation faces a long-standing trade-off between computational efficiency and physical fidelity. To address this challenge, we propose NeuralOGCM, an ocean modeling framework that fuses differentiable programming with deep learning. At the core of NeuralOGCM is a fully differentiable dynamical solver, which leverages physics knowledge as its core inductive bias. The learnable physics integration captures large-scale, deterministic physical evolution, and transforms key physical parameters (e.g., diffusion coefficients) into learnable parameters, enabling the model to autonomously optimize its physical core via end-to-end training. Concurrently, a deep neural network learns to correct for subgrid-scale processes and discretization errors not captured by the physics model. Both components work in synergy, with their outputs integrated by a unified ODE solver. Experiments demonstrate that NeuralOGCM maintains long-term stability and physical consistency, significantly outperforming traditional numerical models in speed and pure AI baselines in accuracy. Our work paves a new path for building fast, stable, and physically-plausible models for scientific computing.

Title: xGR: Efficient Generative Recommendation Serving at Scale

Authors: Qingxiao Sun, Tongxuan Liu, Shen Zhang, Siyu Wu, Peijun Yang, Haotian Liang, Menxin Li, Xiaolong Ma, Zhiwei Liang, Ziyi Ren, Minchao Zhang, Xinyu Liu, Ke Zhang, Depei Qian, Hailong Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11529
Pdf URL: https://arxiv.org/pdf/2512.11529
Copy Paste: [[2512.11529]] xGR: Efficient Generative Recommendation Serving at Scale(https://arxiv.org/abs/2512.11529)
Keywords: generative
Abstract: Recommendation systems deliver substantial economic benefits by providing personalized predictions. Generative recommendation (GR) integrates LLMs to enhance the understanding of long user-item sequences. Despite employing attention-based architectures, GR's workload differs markedly from that of LLM serving. GR typically processes long prompts while producing short, fixed-length outputs, yet the computational cost of each decode phase is especially high due to the large beam width. In addition, since the beam search involves a vast item space, the sorting overhead becomes particularly time-consuming. We propose xGR, a GR-oriented serving system that meets strict low-latency requirements under high-concurrency scenarios. First, xGR unifies the processing of prefill and decode phases through staged computation and separated KV cache. Second, xGR enables early sorting termination and mask-based item filtering with data structure reuse. Third, xGR reconstructs the overall pipeline to exploit multilevel overlap and multi-stream parallelism. Our experiments with real-world recommendation service datasets demonstrate that xGR achieves at least 3.49x throughput compared to the state-of-the-art baseline under strict latency constraints.

Title: HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

Authors: Yiqing Yang, Kin-Man Lam
Subjects: cs.CV, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2512.11534
Pdf URL: https://arxiv.org/pdf/2512.11534
Copy Paste: [[2512.11534]] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning(https://arxiv.org/abs/2512.11534)
Keywords: large language model
Abstract: Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
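
As a rough illustration of the KL-based alignment between the student selector and the teacher reasoner described above, the sketch below turns raw per-frame scores into distributions with a softmax and measures KL(teacher || student). The function names, the example scores, and the direction of the divergence are assumptions made for illustration and are not taken from the paper.

```python
import math

def softmax(scores):
    """Convert raw frame scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical per-frame importance scores for a 4-frame clip.
teacher = softmax([2.0, 0.5, 0.1, 1.5])
student = softmax([1.0, 1.0, 0.2, 1.2])

# Alignment term; it shrinks toward zero as the two distributions agree.
loss = kl_divergence(teacher, student)
```

In a full training loop this term would be combined with the cross-entropy loss the abstract mentions; the weighting between the two is not stated there and is left out here.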

Title: A Multi-Criteria Automated MLOps Pipeline for Cost-Effective Cloud-Based Classifier Retraining in Response to Data Distribution Shifts

Authors: Emmanuel K. Katalay, David O. Dimandja, Jordan F. Masakuna
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11541
Pdf URL: https://arxiv.org/pdf/2512.11541
Copy Paste: [[2512.11541]] A Multi-Criteria Automated MLOps Pipeline for Cost-Effective Cloud-Based Classifier Retraining in Response to Data Distribution Shifts(https://arxiv.org/abs/2512.11541)
Keywords: robust
Abstract: The performance of machine learning (ML) models often deteriorates when the underlying data distribution changes over time, a phenomenon known as data distribution drift. When this happens, ML models need to be retrained and redeployed. ML Operations (MLOps) is often manual, i.e., humans trigger the process of model retraining and redeployment. In this work, we present an automated MLOps pipeline designed to address neural network classifier retraining in response to significant data distribution changes. Our MLOps pipeline employs multi-criteria statistical techniques to detect distribution shifts and triggers model updates only when necessary, ensuring computational efficiency and resource optimization. We demonstrate the effectiveness of our framework through experiments on several benchmark anomaly detection data sets, showing significant improvements in model accuracy and robustness compared to traditional retraining strategies. Our work provides a foundation for deploying more reliable and adaptive ML systems in dynamic real-world settings, where data distribution changes are common.
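
The abstract does not name the statistical criteria used, so the following is only a minimal sketch of a multi-criteria retraining trigger under stated assumptions: it combines a two-sample Kolmogorov-Smirnov statistic with a standardized mean shift and retrains only when both exceed a threshold. The criteria, thresholds, and function names are all hypothetical.

```python
import bisect
import statistics

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

def mean_shift(reference, current):
    """Absolute shift of the mean, in units of the reference standard deviation."""
    delta = statistics.mean(current) - statistics.mean(reference)
    return abs(delta) / statistics.stdev(reference)

def should_retrain(reference, current, ks_threshold=0.3, shift_threshold=0.5):
    """Trigger retraining only when both criteria flag a shift (assumed rule)."""
    return (
        ks_statistic(reference, current) > ks_threshold
        and mean_shift(reference, current) > shift_threshold
    )

reference = [0.1 * i for i in range(100)]      # training-time feature values
same = [0.1 * i + 0.05 for i in range(100)]    # nearly identical distribution
shifted = [0.1 * i + 5.0 for i in range(100)]  # mean moved by 5.0
```

Requiring agreement between criteria is one way to avoid retraining on noise; a real pipeline could equally use a weighted vote or different tests.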

Title: Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

Authors: Hossein Shahabadi, Niki Sepasian, Arash Marioriyad, Ali Sharifi-Zarchi, Mahdieh Soleymani Baghshah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11542
Pdf URL: https://arxiv.org/pdf/2512.11542
Copy Paste: [[2512.11542]] Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models(https://arxiv.org/abs/2512.11542)
Keywords: diffusion
Abstract: Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$\alpha$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$\alpha$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of T2I models.

Title: Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting

Authors: Federico Pennino, Maurizio Gabbrielli
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11546
Pdf URL: https://arxiv.org/pdf/2512.11546
Copy Paste: [[2512.11546]] Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting(https://arxiv.org/abs/2512.11546)
Keywords: robust
Abstract: The standard paradigm for training deep learning models on sensor data assumes that more data is always better. However, raw sensor streams are often imbalanced and contain significant redundancy, meaning that not all data points contribute equally to model generalization. In this paper, we show that, in some cases, "less is more" when considering datasets. We do this by reframing the data selection problem: rather than tuning model hyperparameters, we fix the model and optimize the composition of the training data itself. We introduce a framework for discovering the optimal "training diet" from a large, unlabeled time series corpus. Our framework first uses a large-scale encoder and k-means clustering to partition the dataset into distinct, behaviorally consistent clusters. These clusters represent the fundamental "ingredients" available for training. We then employ the Optuna optimization framework to search the high-dimensional space of possible data mixtures. For each trial, Optuna proposes a specific sampling ratio for each cluster, and a new training set is constructed based on this recipe. A smaller target model is then trained and evaluated. Our experiments reveal that this data-centric search consistently discovers data mixtures that yield models with significantly higher performance compared to baselines trained on the entire dataset. Specifically, evaluated on the PMSM dataset, our method improved performance from a baseline MSE of 1.70 to 1.37, a 19.41% improvement.
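
To make the "recipe" step concrete, here is a small sketch of how one trial's per-cluster sampling ratios could be turned into a training set, together with a check of the reported relative improvement. The cluster names, contents, and ratios are invented for illustration, and the search over ratios (done with Optuna in the paper) is deliberately omitted.

```python
import random

def build_training_set(clusters, ratios, seed=0):
    """Sample a fraction of each cluster according to its ratio in [0, 1]."""
    rng = random.Random(seed)
    selected = []
    for name, items in clusters.items():
        k = round(ratios[name] * len(items))
        selected.extend(rng.sample(items, k))
    return selected

def relative_improvement(baseline_mse, new_mse):
    """Percentage reduction in MSE relative to the baseline."""
    return 100.0 * (baseline_mse - new_mse) / baseline_mse

# Hypothetical clusters of sample indices and one candidate recipe.
clusters = {"steady": list(range(0, 100)), "transient": list(range(100, 140))}
ratios = {"steady": 0.2, "transient": 1.0}
train = build_training_set(clusters, ratios)

# MSE figures reported in the abstract: 1.70 (baseline) and 1.37 (searched mixture).
gain = relative_improvement(1.70, 1.37)
```

Here the redundant "steady" cluster is downsampled while the rarer "transient" cluster is kept whole, which is the kind of rebalancing the abstract motivates.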

Title: Elastic-Net Multiple Kernel Learning: Combining Multiple Data Sources for Prediction

Authors: Janaina Mourão-Miranda, Zakria Hussain, Konstantinos Tsirlis, Christophe Phillips, John Shawe-Taylor
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11547
Pdf URL: https://arxiv.org/pdf/2512.11547
Copy Paste: [[2512.11547]] Elastic-Net Multiple Kernel Learning: Combining Multiple Data Sources for Prediction(https://arxiv.org/abs/2512.11547)
Keywords: interpretability
Abstract: Multiple Kernel Learning (MKL) models combine several kernels in supervised and unsupervised settings to integrate multiple data representations or sources, each represented by a different kernel. MKL seeks an optimal linear combination of base kernels that maximizes a generalized performance measure under a regularization constraint. Various norms have been used to regularize the kernel weights, including $\ell_1$, $\ell_2$ and $\ell_p$, as well as the "elastic-net" penalty, which combines the $\ell_1$- and $\ell_2$-norms to promote both sparsity and the selection of correlated kernels. This property makes elastic-net regularized MKL (ENMKL) especially valuable when model interpretability is critical and kernels capture correlated information, such as in neuroimaging. Previous ENMKL methods have followed a two-stage procedure: fix kernel weights, train a support vector machine (SVM) with the weighted kernel, and then update the weights via gradient descent, cutting-plane methods, or surrogate functions. Here, we introduce an alternative ENMKL formulation that yields a simple analytical update for the kernel weights. We derive explicit algorithms for both SVM and kernel ridge regression (KRR) under this framework, and implement them in the open-source Pattern Recognition for Neuroimaging Toolbox (PRoNTo). We evaluate these ENMKL algorithms against $\ell_1$-norm MKL and against SVM (or KRR) trained on the unweighted sum of kernels across three neuroimaging applications. Our results show that ENMKL matches or outperforms $\ell_1$-norm MKL in all tasks and only underperforms standard SVM in one scenario. Crucially, ENMKL produces sparser, more interpretable models by selectively weighting correlated kernels.
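
For orientation, a generic form of the weighted kernel combination and an elastic-net penalty on its weights can be written as below. The symbols $d_m$ (kernel weights), $K_m$ (base kernels), and $\mu$ (mixing parameter) are notation assumed here; the paper's exact parametrization may differ.

```latex
K = \sum_{m=1}^{M} d_m K_m, \qquad d_m \ge 0,
\qquad
\Omega(d) = \mu \, \|d\|_1 + (1 - \mu) \, \|d\|_2^2, \qquad 0 \le \mu \le 1 .
```

In this form, $\mu = 1$ leaves only the 1-norm term, which favors sparse weights, while $\mu = 0$ leaves only the squared 2-norm term, which tends to spread weight across correlated kernels; intermediate values trade off the two.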

Title: SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2

Authors: Zhendi Gong, Xin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11548
Pdf URL: https://arxiv.org/pdf/2512.11548
Copy Paste: [[2512.11548]] SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2(https://arxiv.org/abs/2512.11548)
Keywords: segmentation
Abstract: Despite the success of deep learning based models in medical image segmentation, most state-of-the-art (SOTA) methods perform fully-supervised learning, which commonly relies on large-scale annotated training datasets. However, medical image annotation is highly time-consuming, hindering its clinical application. Semi-supervised learning (SSL) has emerged as an appealing strategy for training with limited annotations, largely reducing the labelling cost. We propose a novel SSL framework, SSL-MedSAM2, which contains a training-free few-shot learning branch, TFFS-MedSAM2, based on the pretrained large foundation model Segment Anything Model 2 (SAM2) for pseudo label generation, and an iterative fully-supervised learning branch, FSL-nnUNet, based on nnUNet for pseudo label refinement. The results on the MICCAI2025 challenge CARE-LiSeg (Liver Segmentation) demonstrate the outstanding performance of SSL-MedSAM2 among other methods. The average Dice scores on the test set in GED4 and T1 MRI are 0.9710 and 0.9648 respectively, and the Hausdorff distances are 20.07 and 21.97 respectively. The code is available via this https URL.

Title: 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation

Authors: Zhiguo Lu, Jianwen Lou, Mingjun Ma, Hairong Jin, Youyi Zheng, Kun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11557
Pdf URL: https://arxiv.org/pdf/2512.11557
Copy Paste: [[2512.11557]] 3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation(https://arxiv.org/abs/2512.11557)
Keywords: segmentation
Abstract: 3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, demonstrating a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2's performance depends on input prompts and its initial outputs often have deficiencies, and given its class-agnostic nature, we introduce three light-weight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2's initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2's image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes, establishing a new state-of-the-art in the field.

Title: DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

Authors: Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2512.11558
Pdf URL: https://arxiv.org/pdf/2512.11558
Copy Paste: [[2512.11558]] DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry(https://arxiv.org/abs/2512.11558)
Keywords: large language model
Abstract: Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.

Title: Multi-temporal Calving Front Segmentation

Authors: Marcel Dreier, Nora Gourmelon, Dakota Pyles, Fei Wu, Matthias Braun, Thorsten Seehaus, Andreas Maier, Vincent Christlein
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11560
Pdf URL: https://arxiv.org/pdf/2512.11560
Copy Paste: [[2512.11560]] Multi-temporal Calving Front Segmentation(https://arxiv.org/abs/2512.11560)
Keywords: segmentation
Abstract: The calving fronts of marine-terminating glaciers undergo constant changes. These changes significantly affect the glacier's mass and dynamics, demanding continuous monitoring. To address this need, deep learning models were developed that can automatically delineate the calving front in Synthetic Aperture Radar imagery. However, these models often struggle to correctly classify areas affected by seasonal conditions such as ice melange or snow-covered surfaces. To address this issue, we propose to process multiple frames from a satellite image time series of the same glacier in parallel and exchange temporal information between the corresponding feature maps to stabilize each prediction. We integrate our approach into the current state-of-the-art architecture Tyrion and accomplish a new state-of-the-art performance on the CaFFe benchmark dataset. In particular, we achieve a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6.

Title: Visualizing token importance for black-box language models

Authors: Paulius Rauba, Qiyao Wei, Mihaela van der Schaar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11573
Pdf URL: https://arxiv.org/pdf/2512.11573
Copy Paste: [[2512.11573]] Visualizing token importance for black-box language models(https://arxiv.org/abs/2512.11573)
Keywords: fair, interpretability, large language model
Abstract: We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches for LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question -- can we understand how the outputs of black-box LLMs depend on each input token? There is a critical need to have such tools in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem, as LLMs are stochastic functions (i.e., two outputs will be different by chance), while computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight model-agnostic procedure to evaluate the sensitivity of the output of a language model for each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of LLMs' reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.

Title: Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model

Authors: Sam Gijsen, Marc-Andre Schulz, Kerstin Ritter
Subjects: cs.LG, cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2512.11582
Pdf URL: https://arxiv.org/pdf/2512.11582
Copy Paste: [[2512.11582]] Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model(https://arxiv.org/abs/2512.11582)
Keywords: robust
Abstract: The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.

Title: Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents

Authors: Stefan Tabakov, Asen Popov, Dimitar Dimitrov, S. Ensiye Kiyamousavi, Vladimir Hristov, Boris Kraychev
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2512.11584
Pdf URL: https://arxiv.org/pdf/2512.11584
Copy Paste: [[2512.11584]] Atomic Action Slicing: Planner-Aligned Options for Generalist VLA Agents(https://arxiv.org/abs/2512.11584)
Keywords: robust
Abstract: Current vision-language-action (VLA) models generalize poorly, particularly when tasks require new compositions of skills or objects. We introduce Atomic Action Slicing (AAS), a planner-aligned approach that decomposes long-horizon demonstrations into short, typed atomic actions that are easier for planners to use and policies to learn. Using LIBERO demonstrations, AAS produces a validated dataset of 2,124 atomic segments labeled with action type, temporal span, and confidence. A stronger segmenter (Gemini 2.5 Pro) closely matches planner-defined plans and remains robust under keyframe jitter, while smaller models perform worse on multi-object tasks. Fine-tuning CLIP-RT+ on our atomic dataset improves task success from 94.2% to 95.3% on LIBERO-Goal and 83.8% to 88.8% on LIBERO-Long. We publicly release the GATE-VLAP dataset on HuggingFace (this https URL).

Title: Granite: Granular Runtime Enforcement for GitHub Actions Permissions

Authors: Mojtaba Moazen, Amir.M Ahmadian, Musard Balliu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.11602
Pdf URL: https://arxiv.org/pdf/2512.11602
Copy Paste: [[2512.11602]] Granite: Granular Runtime Enforcement for GitHub Actions Permissions(https://arxiv.org/abs/2512.11602)
Keywords: security, protect, attack
Abstract: Modern software projects use automated CI/CD pipelines to streamline their development, build, and deployment processes. GitHub Actions is a popular CI/CD platform that enables project maintainers to create custom workflows -- collections of jobs composed of sequential steps -- using reusable components known as actions. Wary of the security risks introduced by fully-privileged actions, GitHub provides a job-level permission model for controlling workflow access to repository resources. Unfortunately, this model is too coarse-grained to reduce the attack surface pertaining to permission misuse attacks: All actions within a job share the same permissions granted to the job. This violates the principle of least privilege and can lead to broader software supply chain attacks, whenever a compromised action exploits the granted permissions to compromise the repository resources. In this paper, we present Granite, a runtime proxy-based system that enforces fine-grained permissions for GitHub Actions at the step-level granularity within a job. Granite transparently monitors requests made by JavaScript and composite actions during workflow execution and checks them against predefined step-level policies at runtime. We evaluate Granite in terms of compatibility, security, and performance overhead using a dataset of 500 workflows comprising 12,916 jobs from the most-starred GitHub repositories that use GitHub Actions. Our analysis reveals that 52.7% of the jobs can be protected by Granite against permission misuse attacks. We evaluate Granite on 20 top-starred repositories (63 actions, 58 workflows), validate attack prevention using 10 permission misuse attacks across 42 overprivileged jobs, and measure an average overhead of 55% (3.67 seconds) per job, concluding that Granite effectively reduces CI/CD attack surfaces.
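
The difference between job-level and step-level permissions can be sketched with a simple policy lookup. The policy layout, step names, and `is_allowed` helper below are hypothetical and only illustrate the idea of checking each request against a per-step policy; they do not reflect Granite's actual policy format or proxy implementation.

```python
# Ordering of access levels, from least to most privileged.
LEVELS = {"none": 0, "read": 1, "write": 2}

def is_allowed(policies, step, scope, access):
    """Allow a request only if the step's policy grants at least that access."""
    granted = policies.get(step, {}).get(scope, "none")
    return LEVELS[granted] >= LEVELS[access]

# Hypothetical step-level policies: step name -> {permission scope: access level}.
policies = {
    "checkout": {"contents": "read"},
    "label-issue": {"issues": "write"},
}

# Under a job-level model, both steps would share contents:read and issues:write.
# With per-step policies, a compromised checkout step cannot write to issues.
checkout_reads_code = is_allowed(policies, "checkout", "contents", "read")
checkout_writes_issue = is_allowed(policies, "checkout", "issues", "write")
```

Defaulting unknown steps and scopes to "none" follows the principle of least privilege that the abstract cites as the motivation.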

Title: Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols

Authors: Björn Deiseroth, Max Henning Höth, Kristian Kersting, Letitia Parcalabescu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.11614
Pdf URL: https://arxiv.org/pdf/2512.11614
Copy Paste: [[2512.11614]] Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols(https://arxiv.org/abs/2512.11614)
Keywords: large language model
Abstract: Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline -- both the retriever and the generator -- as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unknown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context supports the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations -- without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.

Title: A Fast Interpretable Fuzzy Tree Learner

Authors: Javier Fumanal-Idocin, Raquel Fernandez-Peralta, Javier Andreu-Perez
Subjects: cs.LG, cs.SC
Abstract URL: https://arxiv.org/abs/2512.11616
Pdf URL: https://arxiv.org/pdf/2512.11616
Copy Paste: [[2512.11616]] A Fast Interpretable Fuzzy Tree Learner(https://arxiv.org/abs/2512.11616)
Keywords: interpretability
Abstract: Fuzzy rule-based systems have been mostly used in interpretable decision-making because of their interpretable linguistic rules. However, interpretability requires both sensible linguistic partitions and small rule-base sizes, which are not guaranteed by many existing fuzzy rule-mining algorithms. Evolutionary approaches can produce high-quality models but suffer from prohibitive computational costs, while neural-based methods like ANFIS have problems retaining linguistic interpretations. In this work, we propose an adaptation of classical tree-based splitting algorithms from crisp rules to fuzzy trees, combining the computational efficiency of greedy algorithms with the interpretability advantages of fuzzy logic. This approach achieves interpretable linguistic partitions and substantially improves running time compared to evolutionary-based approaches while maintaining competitive predictive performance. Our experiments on tabular classification benchmarks prove that our method achieves comparable accuracy to state-of-the-art fuzzy classifiers with significantly lower computational cost and produces more interpretable rule bases with constrained complexity. Code is available at: this https URL

Title: Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling

Authors: Keerthana Murugaraj, Salima Lamsiyah, Marten During, Martin Theobald
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2512.11635
Pdf URL: https://arxiv.org/pdf/2512.11635
Copy Paste: [[2512.11635]] Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling(https://arxiv.org/abs/2512.11635)
Keywords: extraction, transformer
Abstract: Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics and, despite its growing popularity, remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.

Title: FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint

Authors: Jiapeng Tang, Kai Li, Chengxiang Yin, Liuhao Ge, Fei Jiang, Jiu Xu, Matthias Nießner, Christian Häne, Timur Bagautdinov, Egor Zakharov, Peihong Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11645
Pdf URL: https://arxiv.org/pdf/2512.11645
Copy Paste: [[2512.11645]] FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint(https://arxiv.org/abs/2512.11645)
Keywords: diffusion, transformer
Abstract: We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.

Title: Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation

Authors: Luca Cazzola, Ahed Alboody
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11654
Pdf URL: https://arxiv.org/pdf/2512.11654
Copy Paste: [[2512.11654]] Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation(https://arxiv.org/abs/2512.11654)
Keywords: robust, diffusion, generative
Abstract: The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1% accuracy points improvement. Animated illustrations and supplementary materials are available at (this https URL).

Title: Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing

Authors: Xu Zhang, Jiabin Fang, Zhuoming Ding, Jin Yuan, Xuan Liu, Qianjun Zhang, Zhiyong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11680
Pdf URL: https://arxiv.org/pdf/2512.11680
Copy Paste: [[2512.11680]] Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing(https://arxiv.org/abs/2512.11680)
Keywords: large language model, segmentation
Abstract: Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.

Title: Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection

Authors: Qiushi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11683
Pdf URL: https://arxiv.org/pdf/2512.11683
Copy Paste: [[2512.11683]] Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection(https://arxiv.org/abs/2512.11683)
Keywords: robust, extraction, segmentation
Abstract: Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth guided sliding window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy paste and depth free augmentation methods.

Title: Leveraging FPGAs for Homomorphic Matrix-Vector Multiplication in Oblivious Message Retrieval

Authors: Grant Bosworth, Keewoo Lee, Sunwoong Kim
Subjects: cs.CR, cs.AR
Abstract URL: https://arxiv.org/abs/2512.11690
Pdf URL: https://arxiv.org/pdf/2512.11690
Copy Paste: [[2512.11690]] Leveraging FPGAs for Homomorphic Matrix-Vector Multiplication in Oblivious Message Retrieval(https://arxiv.org/abs/2512.11690)
Keywords: secure, privacy, protect
Abstract: While end-to-end encryption protects the content of messages, it does not secure metadata, which exposes sender and receiver information through traffic analysis. A plausible approach to protecting this metadata is to have senders post encrypted messages on a public bulletin board and receivers scan it for relevant messages. Oblivious message retrieval (OMR) leverages homomorphic encryption (HE) to improve user experience in this solution by delegating the scan to a resource-rich server while preserving privacy. A key process in OMR is the homomorphic detection of pertinent messages for the receiver from the bulletin board. It relies on a specialized matrix-vector multiplication algorithm, which involves extensive multiplications between ciphertext vectors and plaintext matrices, as well as homomorphic rotations. The computationally intensive nature of this process limits the practicality of OMR. To address this challenge, this paper proposes a hardware architecture to accelerate the matrix-vector multiplication algorithm. The building homomorphic operators in this algorithm are implemented using high-level synthesis, with design parameters for different parallelism levels. These operators are then deployed on a field-programmable gate array platform using an efficient design space exploration strategy to accelerate homomorphic matrix-vector multiplication. Compared to a software implementation, the proposed hardware accelerator achieves a 13.86x speedup.
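
The combination of plaintext-matrix multiplications and rotations mentioned in this abstract can be illustrated with the well-known diagonal method for matrix-vector products. The sketch below runs on plain Python lists, with a cyclic list rotation standing in for a homomorphic rotation. It assumes a square matrix, and the paper's specialized algorithm may differ from this textbook variant; the function names are illustrative only.

```python
def rotate(vec, k):
    """Cyclic left rotation by k slots (stand-in for a homomorphic rotation)."""
    k %= len(vec)
    return vec[k:] + vec[:k]


def diagonal_matvec(matrix, vec):
    """Compute matrix @ vec using only rotations and slot-wise multiply-adds."""
    n = len(vec)
    result = [0] * n
    for d in range(n):
        # d-th generalized diagonal of the plaintext matrix.
        diag = [matrix[i][(i + d) % n] for i in range(n)]
        # Rotating the (ciphertext) vector aligns vec[(i + d) % n] with slot i.
        rotated = rotate(vec, d)
        # Slot-wise plaintext-ciphertext multiplication, then accumulation.
        result = [r + a * b for r, a, b in zip(result, diag, rotated)]
    return result


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vec = [1, 0, 2]
print(diagonal_matvec(matrix, vec))  # [7, 16, 25]
```

Each loop iteration costs one rotation and one slot-wise multiplication, which is why accelerating these two operators dominates the cost of this kind of algorithm.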

Title: Text images processing system using artificial intelligence models

Authors: Aya Kaysan Bahjat
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.11691
Pdf URL: https://arxiv.org/pdf/2512.11691
Copy Paste: [[2512.11691]] Text images processing system using artificial intelligence models(https://arxiv.org/abs/2512.11691)
Keywords: transformer
Abstract: This paper presents a text image classifier device that identifies textual content in images and then categorizes each image into one of four predefined categories: Invoice, Form, Letter, or Report. The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode, which renders feeds from cameras connected to it. Its design specifically targets practical challenges such as changing light, random orientation, curvature or partial coverage of text, low resolution, and faintly visible text. Processing is divided into four stages: image acquisition and preprocessing; detection of textual elements with the DBNet++ (Differentiable Binarization Network Plus) model; classification of the detected textual elements with the BART (Bidirectional Auto-Regressive Transformers) model; and presentation of the results through a user interface written in Python and PyQt5. The stages are connected so that they form a smooth workflow. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the Total-Text dataset, which includes high-resolution images created to represent a wide range of problematic conditions. These experimental results support the effectiveness of the proposed methodology for practical, mixed-source text categorization, even in uncontrolled imaging conditions.

Title: SoK: Demystifying the multiverse of MPC protocols

Authors: Roberta De Viti, Vaastav Anand, Pierfrancesco Ingo, Deepak Garg
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.11699
Pdf URL: https://arxiv.org/pdf/2512.11699
Copy Paste: [[2512.11699]] SoK: Demystifying the multiverse of MPC protocols(https://arxiv.org/abs/2512.11699)
Keywords: privacy
Abstract: This paper systematizes knowledge on the performance of Multi-Party Computation (MPC) protocols. Despite strong privacy and correctness guarantees, MPC adoption in real-world applications remains limited by high costs (especially in the malicious setting) and lack of guidance on choosing suitable protocols for concrete workloads. We identify the theoretical and practical parameters that shape MPC efficiency and conduct an extensive experimental study across diverse benchmarks. Our analysis discusses the trade-offs between protocols, and highlights which techniques align best with different application scenarios and needs. By providing actionable guidance for developers and outlining open challenges for researchers, this work seeks to narrow the gap between MPC theory and practice.

Title: EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing

Authors: Wei Chow, Linfeng Li, Lingdong Kong, Zefeng Li, Qi Xu, Hang Song, Tian Ye, Xian Wang, Jinbin Bai, Shilin Xu, Xiangtai Li, Junting Pan, Shaoteng Liu, Ran Zhou, Tianshu Yang, Songhua Liu
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2512.11715
Pdf URL: https://arxiv.org/pdf/2512.11715
Copy Paste: [[2512.11715]] EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing(https://arxiv.org/abs/2512.11715)
Keywords: diffusion, transformer, generative
Abstract: Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

Title: Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Authors: Sergey Pankratov, Dan Alistarh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.11718
Pdf URL: https://arxiv.org/pdf/2512.11718
Copy Paste: [[2512.11718]] Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks(https://arxiv.org/abs/2512.11718)
Keywords: large language model
Abstract: Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
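
To get a feel for how the bound stated in this abstract scales, the sketch below evaluates its leading term, (mu + mu_2) * log(P) / mu^2, and drops the O(1) term. The function name and the numeric inputs are illustrative assumptions, not values from the paper, and the natural logarithm is assumed because the abstract does not state the base.

```python
import math


def speculative_bound_leading_term(mu, mu2, capacity):
    """Leading term (mu + mu2) * log(P) / mu**2 of the bound; O(1) term omitted.

    mu       -- expected entropy of the verifier's output distribution
    mu2      -- expected second log-moment
    capacity -- verifier capacity P
    The natural log is an assumption; the abstract does not specify the base.
    """
    if mu <= 0 or capacity < 1:
        raise ValueError("mu must be positive and capacity at least 1")
    return (mu + mu2) * math.log(capacity) / mu ** 2


# Illustrative inputs only: squaring the capacity doubles the leading term.
small = speculative_bound_leading_term(1.5, 3.0, 16)
large = speculative_bound_leading_term(1.5, 3.0, 256)
print(small, large)
```

Because the capacity enters only through log(P), growing the verifier's parallel budget yields logarithmic rather than linear gains in accepted tokens per iteration under this bound.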

Title: Referring Change Detection in Remote Sensing Imagery

Authors: Yilmaz Korkmaz, Jay N. Paranjape, Celso M. de Melo, Vishal M. Patel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11719
Pdf URL: https://arxiv.org/pdf/2512.11719
Copy Paste: [[2512.11719]] Referring Change Detection in Remote Sensing Imagery(https://arxiv.org/abs/2512.11719)
Keywords: diffusion, segmentation
Abstract: Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only the pre-change image, without relying on semantic segmentation masks, thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: this https URL.

Title: Weak-to-Strong Generalization Enables Fully Automated De Novo Training of Multi-head Mask-RCNN Model for Segmenting Densely Overlapping Cell Nuclei in Multiplex Whole-slice Brain Images

Authors: Lin Bai, Xiaoyang Li, Liqiang Huang, Quynh Nguyen, Hien Van Nguyen, Saurabh Prasad, Dragan Maric, John Redell, Pramod Dash, Badrinath Roysam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11722
Pdf URL: https://arxiv.org/pdf/2512.11722
Copy Paste: [[2512.11722]] Weak-to-Strong Generalization Enables Fully Automated De Novo Training of Multi-head Mask-RCNN Model for Segmenting Densely Overlapping Cell Nuclei in Multiplex Whole-slice Brain Images(https://arxiv.org/abs/2512.11722)
Keywords: segmentation
Abstract: We present a weak to strong generalization methodology for fully automated training of a multi-head extension of the Mask-RCNN method with efficient channel attention for reliable segmentation of overlapping cell nuclei in multiplex cyclic immunofluorescent (IF) whole-slide images (WSI), and present evidence for pseudo-label correction and coverage expansion, the key phenomena underlying weak to strong generalization. This method can learn to segment de novo a new class of images from a new instrument and/or a new imaging protocol without the need for human annotations. We also present metrics for automated self-diagnosis of segmentation quality in production environments, where human visual proofreading of massive WSI images is unaffordable. Our method was benchmarked against five current widely used methods and showed a significant improvement. The code, sample WSI images, and high-resolution segmentation results are provided in open form for community adoption and adaptation.

Title: SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder

Authors: Minglei Shi, Haolin Wang, Borui Zhang, Wenzhao Zheng, Bohan Zeng, Ziyang Yuan, Xiaoshi Wu, Yuanxing Zhang, Huan Yang, Xintao Wang, Pengfei Wan, Kun Gai, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11749
Pdf URL: https://arxiv.org/pdf/2512.11749
Copy Paste: [[2512.11749]] SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder(https://arxiv.org/abs/2512.11749)
Keywords: diffusion, generative
Abstract: Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.

Title: SpectralKrum: A Spectral-Geometric Defense Against Byzantine Attacks in Federated Learning

Authors: Aditya Tripathi, Karan Sharma, Rahul Mishra, Tapas Kumar Maiti
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.11760
Pdf URL: https://arxiv.org/pdf/2512.11760
Copy Paste: [[2512.11760]] SpectralKrum: A Spectral-Geometric Defense Against Byzantine Attacks in Federated Learning(https://arxiv.org/abs/2512.11760)
Keywords: privacy, defense, attack, robust, federate
Abstract: Federated Learning (FL) distributes model training across clients who retain their data locally, but this architecture exposes a fundamental vulnerability: Byzantine clients can inject arbitrarily corrupted updates that degrade or subvert the global model. While robust aggregation methods (including Krum, Bulyan, and coordinate-wise defenses) offer theoretical guarantees under idealized assumptions, their effectiveness erodes substantially when client data distributions are heterogeneous (non-IID) and adversaries can observe or approximate the defense mechanism. This paper introduces SpectralKrum, a defense that fuses spectral subspace estimation with geometric neighbor-based selection. The core insight is that benign optimization trajectories, despite per-client heterogeneity, concentrate near a low-dimensional manifold that can be estimated from historical aggregates. SpectralKrum projects incoming updates into this learned subspace, applies Krum selection in compressed coordinates, and filters candidates whose orthogonal residual energy exceeds a data-driven threshold. The method requires no auxiliary data, operates entirely on model updates, and preserves FL privacy properties. We evaluate SpectralKrum against eight robust baselines across seven attack scenarios on CIFAR-10 with Dirichlet-distributed non-IID partitions (alpha = 0.1). Experiments spanning over 56,000 training rounds show that SpectralKrum is competitive against directional and subspace-aware attacks (adaptive-steer, buffer-drift), but offers limited advantage under label-flip and min-max attacks where malicious updates remain spectrally indistinguishable from benign ones.
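
The Krum selection step that this abstract builds on can be sketched in a few lines. This is the classical Krum rule only: each update is scored by the summed squared distance to its n - f - 2 nearest neighbours and the lowest-scoring update is chosen. The spectral projection and residual-energy filter of SpectralKrum are not reproduced here, and the function name and toy updates are illustrative assumptions.

```python
def krum_select(updates, num_byzantine):
    """Return the index of the update chosen by the classical Krum rule."""
    n = len(updates)
    k = n - num_byzantine - 2  # number of nearest neighbours scored per update
    if k < 1:
        raise ValueError("Krum needs n - f - 2 >= 1")

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i, u in enumerate(updates):
        # Squared distances from update i to every other update, ascending.
        dists = sorted(sq_dist(u, v) for j, v in enumerate(updates) if j != i)
        scores.append(sum(dists[:k]))
    # The update with the tightest neighbourhood wins.
    return min(range(n), key=scores.__getitem__)


# Toy example: four similar updates and one outlier at index 4.
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [10.0, -10.0]]
print(krum_select(updates, num_byzantine=1))  # 3
```

In the setting the abstract describes, the same selection would be applied to updates after projection into the learned low-dimensional subspace rather than to the raw vectors.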

Title: Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting

Authors: Mohammad Dehghanmanshadi, Wallapak Tavanapong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11763
Pdf URL: https://arxiv.org/pdf/2512.11763
Copy Paste: [[2512.11763]] Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting(https://arxiv.org/abs/2512.11763)
Keywords: diffusion
Abstract: Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: this https URL.
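
The relative MAE reductions quoted in this abstract can be checked directly from the reported MAE values. The helper name below is illustrative. The first figure reproduces the stated 52% reduction; the second is derived here from the 25.95 vs. 27.74 comparison and is not a percentage the abstract itself reports.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


# Cell200-s baseline vs. InST-synthesized training data (values from the abstract).
vs_cell200s = relative_reduction(53.70, 25.95)
# Real data alone vs. InST-synthesized data (derived, not stated in the abstract).
vs_real = relative_reduction(27.74, 25.95)

print(f"{vs_cell200s:.1f}%")  # 51.7%, i.e. roughly the reported 52%
print(f"{vs_real:.1f}%")      # 6.5%
```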

Title: Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints

Authors: Kai Yao, Marc Juarez
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11771
Pdf URL: https://arxiv.org/pdf/2512.11771
Copy Paste: [[2512.11771]] Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints(https://arxiv.org/abs/2512.11771)
Keywords: security, attack, robust
Abstract: Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success significantly varies across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves high robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques balancing robustness and accuracy, and identify the most promising approaches for advancing this goal.

Title: MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator

Authors: Peiqing Yang, Shangchen Zhou, Kai Hao, Qingyi Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11782
Pdf URL: https://arxiv.org/pdf/2512.11782
Copy Paste: [[2512.11782]] MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator(https://arxiv.org/abs/2512.11782)
Keywords: segmentation
Abstract: Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.

Title: Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously

Authors: Andrew Adiletta, Kathryn Adiletta, Kemal Derya, Berk Sunar
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2512.11783
Pdf URL: https://arxiv.org/pdf/2512.11783
Copy Paste: [[2512.11783]] Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously(https://arxiv.org/abs/2512.11783)
Keywords: security, privacy, protect, attack, robust, large language model
Abstract: The rapid deployment of Large Language Models (LLMs) has created an urgent need for enhanced security and privacy measures in Machine Learning (ML). LLMs are increasingly being used to process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these security concerns, several companies have introduced guard models, which are smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across various models with different tokenization schemes. We demonstrate their effectiveness, along with our joint optimization technique, by successfully bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text generation models for malicious text and code generation. To the best of our knowledge, this is the first work to reveal that Llama Prompt Guard 2 can be compromised through joint optimization. Additionally, by analyzing the changing similarity of a model's internal state to specific concept directions during token sequence processing, we propose an effective and lightweight method to detect Super Suffix attacks. We show that the cosine similarity between the residual stream and certain concept directions serves as a distinctive fingerprint of model intent. Our proposed countermeasure, DeltaGuard, significantly improves the detection of malicious prompts generated through Super Suffixes. It increases the non-benign classification rate to nearly 100%, making DeltaGuard a valuable addition to the guard model stack and enhancing robustness against adversarial prompt attacks.
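
Illustrative note (not from the paper): the abstract describes tracking the cosine similarity between a model's residual stream and concept directions as tokens are processed. The sketch below shows that generic computation on toy vectors. It is not DeltaGuard; the names `similarity_trace` and `similarity_deltas`, the toy hidden states, and the choice to look at consecutive differences are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_trace(hidden_states, direction):
    """Similarity to one concept direction at each token position."""
    return [cosine(h, direction) for h in hidden_states]

def similarity_deltas(trace):
    """Change in similarity between consecutive token positions."""
    return [later - earlier for earlier, later in zip(trace, trace[1:])]

# Hypothetical per-token hidden states and a hypothetical concept direction.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
direction = [1.0, 0.0]
trace = similarity_trace(hidden_states, direction)
print(trace, similarity_deltas(trace))
```

A detector built along these lines would need real hidden states and a learned or estimated direction; how the paper derives either is not stated in the abstract.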

Title: Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Authors: Etienne Boursier, Claire Boyer
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2512.11784
Pdf URL: https://arxiv.org/pdf/2512.11784
Copy Paste: [[2512.11784]] Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective(https://arxiv.org/abs/2512.11784)
Keywords: transformer
Abstract: Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.
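
Illustrative note (not from the paper): the claim that softmax attention becomes linear for long Gaussian prompts can be checked numerically in a simplified setting. Assuming identity key and value maps and i.i.d. standard Gaussian tokens, the infinite-prompt output for a query q is E[x exp(q·x)] / E[exp(q·x)] = q, which is linear in q. The sketch below measures the distance between the finite-prompt output and q as the prompt grows. The function names, the fixed query, and the prompt lengths are assumptions for this example, and the paper's general setting is broader than this.

```python
import math
import random

def softmax_attention(query, tokens):
    """Single-query softmax attention with identity key/value maps."""
    scores = [sum(q * x for q, x in zip(query, tok)) for tok in tokens]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * tok[j] for w, tok in zip(weights, tokens)) / total
        for j in range(len(query))
    ]

def gap_to_linear_limit(n_tokens, query, seed=0):
    """Euclidean distance between the attention output and the limit q."""
    rng = random.Random(seed)
    tokens = [[rng.gauss(0.0, 1.0) for _ in query] for _ in range(n_tokens)]
    out = softmax_attention(query, tokens)
    return math.sqrt(sum((o - q) ** 2 for o, q in zip(out, query)))

query = [0.5, -0.5, 0.5, -0.5]  # unit-norm query, chosen arbitrarily
for n in (100, 10_000, 100_000):
    print(n, gap_to_linear_limit(n, query))
```

The printed gap should shrink roughly like one over the square root of the prompt length, which is the kind of rate the abstract's non-asymptotic concentration bounds make precise.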

Title: Uncertainty-Aware Domain Adaptation for Vitiligo Segmentation in Clinical Photographs

Authors: Wentao Jiang, Vamsi Varra, Caitlin Perez-Stable, Harrison Zhu, Meredith Apicella, Nicole Nyamongo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11791
Pdf URL: https://arxiv.org/pdf/2512.11791
Copy Paste: [[2512.11791]] Uncertainty-Aware Domain Adaptation for Vitiligo Segmentation in Clinical Photographs(https://arxiv.org/abs/2512.11791)
Keywords: robust, transformer, segmentation
Abstract: Accurately quantifying vitiligo extent in routine clinical photographs is crucial for longitudinal monitoring of treatment response. We propose a trustworthy, frequency-aware segmentation framework built on three synergistic pillars: (1) a data-efficient training strategy combining domain-adaptive pre-training on the ISIC 2019 dataset with an ROI-constrained dual-task loss to suppress background noise; (2) an architectural refinement via a ConvNeXt V2-based encoder enhanced with a novel High-Frequency Spectral Gating (HFSG) module and stem-skip connections to capture subtle textures; and (3) a clinical trust mechanism employing K-fold ensemble and Test-Time Augmentation (TTA) to generate pixel-wise uncertainty maps. Extensive validation on an expert-annotated clinical cohort demonstrates superior performance, achieving a Dice score of 85.05% and significantly reducing boundary error (95% Hausdorff Distance improved from 44.79 px to 29.95 px), consistently outperforming strong CNN (ResNet-50 and UNet++) and Transformer (MiT-B5) baselines. Notably, our framework demonstrates high reliability with zero catastrophic failures and provides interpretable entropy maps to identify ambiguous regions for clinician review. Our approach suggests that the proposed framework establishes a robust and reliable standard for automated vitiligo assessment.
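
Illustrative note (not from the paper): the Dice score reported in the abstract is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a reference mask B. The sketch below computes it for small binary masks stored as nested lists. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions for this example; the 95% Hausdorff Distance is not covered.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested 0/1 lists."""
    pred_flat = [p for row in pred for p in row]
    target_flat = [t for row in target for t in row]
    intersection = sum(1 for p, t in zip(pred_flat, target_flat) if p and t)
    size_sum = sum(1 for p in pred_flat if p) + sum(1 for t in target_flat if t)
    if size_sum == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum

# Toy 2x2 masks: prediction covers two pixels, reference covers one of them.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
print(dice_score(pred, target))
```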

Title: Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Authors: Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.11792
Pdf URL: https://arxiv.org/pdf/2512.11792
Copy Paste: [[2512.11792]] Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation(https://arxiv.org/abs/2512.11792)
Keywords: diffusion
Abstract: Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at this https URL.

Title: Particulate: Feed-Forward 3D Object Articulation

Authors: Ruining Li, Yuxin Yao, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.11798
Pdf URL: https://arxiv.org/pdf/2512.11798
Copy Paste: [[2512.11798]] Particulate: Feed-Forward 3D Object Articulation(https://arxiv.org/abs/2512.11798)
Keywords: extraction, transformer
Abstract: We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network's feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.