2025-12-01

Title: Cacheback: Speculative Decoding With Nothing But Cache

Authors: Zhiyao Ma, In Gim, Lin Zhong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21699
Pdf URL: https://arxiv.org/pdf/2511.21699
Copy Paste: [[2511.21699]] Cacheback: Speculative Decoding With Nothing But Cache(https://arxiv.org/abs/2511.21699)
Keywords: large language model
Abstract: We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.

Title: 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations

Authors: Chiung-Yi Tseng, Danyang Zhang, Tianyang Wang, Hongying Luo, Lu Chen, Junming Huang, Jibin Guan, Junfeng Hao, Junhao Song, Ziqian Bi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21701
Pdf URL: https://arxiv.org/pdf/2511.21701
Copy Paste: [[2511.21701]] 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations(https://arxiv.org/abs/2511.21701)
Keywords: robust, large language model
Abstract: The rapid advancement of large language models(LLMs) has prompted significant interest in their potential applications in medical domains. This paper presents a comprehensive benchmark evaluation of 27 state-of-the-art LLMs on Chinese medical examination questions, encompassing seven medical specialties across two professional levels. We introduce a robust evaluation framework that assesses model performance on 2,800 carefully curated questions from cardiovascular, gastroenterology, hematology, infectious diseases, nephrology, neurology, and respiratory medicine domains. Our dataset distinguishes between attending physician and senior physician difficulty levels, providing nuanced insights into model capabilities across varying complexity. Our empirical analysis reveals substantial performance variations among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%, followed by DeepSeek-R1-671B at 64.07%. Notably, we observe no consistent correlation between model size and performance, as evidenced by the strong performance of smaller mixture-of-experts architectures. The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions compared to gastroenterology and nephrology domains. Furthermore, our analysis indicates minimal performance degradation between attending and senior physician levels for top-performing models, suggesting robust generalization capabilities. This benchmark provides critical insights for the deployment of LLMs in medical education and clinical decision support systems, highlighting both the promise and current limitations of these technologies in specialized medical contexts.

Title: CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference

Authors: Dong Liu, Yanxuan Yu, Ben Lengerich
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21702
Pdf URL: https://arxiv.org/pdf/2511.21702
Copy Paste: [[2511.21702]] CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference(https://arxiv.org/abs/2511.21702)
Keywords: large language model
Abstract: Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-$k$ certification and $\varepsilon$-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code implementation available at \href{this https URL}{this https URL}.

Title: Evaluating Embedding Generalization: How LLMs, LoRA, and SLERP Shape Representational Geometry

Authors: Siyaxolisa Kabane
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21703
Pdf URL: https://arxiv.org/pdf/2511.21703
Copy Paste: [[2511.21703]] Evaluating Embedding Generalization: How LLMs, LoRA, and SLERP Shape Representational Geometry(https://arxiv.org/abs/2511.21703)
Keywords: robust, large language model
Abstract: We investigate the generalization properties of dense text embeddings when the embedding backbone is a large language model (LLM) versus when it is a non-LLM encoder, and we study the extent to which spherical linear interpolation (SLERP) model-merging mitigates over-specialization introduced by task-specific adaptation (e.g., LoRA). To make the comparison concrete and domain-agnostic, we design a controlled suite of experiments in which models embed short numerical sequences and are evaluated on their ability to cluster and classify those sequences according to well-defined number-theoretic properties. Our experimental protocol compares four families of models: (1) non-LLM encoders trained from scratch or fine-tuned for embeddings, (2) LLM-based encoders adapted with parameter-efficient methods (LoRA), (3) LLM-based encoders with LoRA followed by model souping merging into the base weights, and (4) the same LoRA-adapted LLMs merged using SLERP across checkpoints or stages. We evaluate representational quality with clustering indices (Silhouette and Davies Bouldin). We additionally analyze the use of kmeans labels to see if the embeddings encode any other information besides the one we are testing for. Empirically, we find that LLM-based backbones produce embeddings that better capture higher-order, compositional numeric patterns, but are prone to adapter dominance that degrades balanced generalization; SLERP merging consistently recovers base-model structure while retaining most task gains, yielding superior tradeoffs in clustering separability, and robustness compared to model souping or models that were not merged.

Title: Insight-A: Attribution-aware for Multimodal Misinformation Detection

Authors: Junjie Wu, Yumeng Fu, Chen Gong, Guohong Fu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.21705
Pdf URL: https://arxiv.org/pdf/2511.21705
Copy Paste: [[2511.21705]] Insight-A: Attribution-aware for Multimodal Misinformation Detection(https://arxiv.org/abs/2511.21705)
Keywords: large language model
Abstract: AI-generated content (AIGC) technology has emerged as a prevalent alternative to create multimodal misinformation on social media platforms, posing unprecedented threats to societal safety. However, standard prompting leverages multimodal large language models (MLLMs) to identify the emerging misinformation, which ignores the misinformation attribution. To this end, we present Insight-A, exploring attribution with MLLM insights for detecting multimodal misinformation. Insight-A makes two efforts: I) attribute misinformation to forgery sources, and II) an effective pipeline with hierarchical reasoning that detects distortions across modalities. Specifically, to attribute misinformation to forgery traces based on generation patterns, we devise cross-attribution prompting (CAP) to model the sophisticated correlations between perception and reasoning. Meanwhile, to reduce the subjectivity of human-annotated prompts, automatic attribution-debiased prompting (ADP) is used for task adaptation on MLLMs. Additionally, we design image captioning (IC) to achieve visual details for enhancing cross-modal consistency checking. Extensive experiments demonstrate the superiority of our proposal and provide a new paradigm for multimodal misinformation detection in the era of AIGC.

Title: A General Highly Accurate Online Planning Method Integrating Large Language Models into Nested Rollout Policy Adaptation for Dialogue Tasks

Authors: Hui Wang, Fafa Zhang, Xiaoyu Zhang, Chaoxu Mu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21706
Pdf URL: https://arxiv.org/pdf/2511.21706
Copy Paste: [[2511.21706]] A General Highly Accurate Online Planning Method Integrating Large Language Models into Nested Rollout Policy Adaptation for Dialogue Tasks(https://arxiv.org/abs/2511.21706)
Keywords: large language model
Abstract: In goal-oriented dialogue tasks, the main challenge is to steer the interaction towards a given goal within a limited number of turns. Existing approaches either rely on elaborate prompt engineering, whose effectiveness is heavily dependent on human experience, or integrate policy networks and pre-trained policy models, which are usually difficult to adapt to new dialogue scenarios and costly to train. Therefore, in this paper, we present Nested Rollout Policy Adaptation for Goal-oriented Dialogue (NRPA-GD), a novel dialogue policy planning method that completely avoids specific model training by utilizing a Large Language Model (LLM) to simulate behaviors of user and system at the same time. Specifically, NRPA-GD constructs a complete evaluation mechanism for dialogue trajectories and employs an optimization framework of nested Monte Carlo simulation and policy self-adaptation to dynamically adjust policies during the dialogue process. The experimental results on four typical goal-oriented dialogue datasets show that NRPA-GD outperforms both existing prompt engineering and specifically pre-trained model-based methods. Impressively, NRPA-GD surpasses ChatGPT and pre-trained policy models with only a 0.6-billion-parameter LLM. The proposed approach further demonstrates the advantages and novelty of employing planning methods on LLMs to solve practical planning tasks.

Title: Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?

Authors: Matteo Spreafico, Ludovica Tassini, Camilla Sancricca, Cinzia Cappiello
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21708
Pdf URL: https://arxiv.org/pdf/2511.21708
Copy Paste: [[2511.21708]] Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?(https://arxiv.org/abs/2511.21708)
Keywords: large language model
Abstract: Large language models have recently demonstrated their exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring for testing large language model capabilities, we considered data preparation, a critical yet often labor-intensive step in data-driven processes. This paper investigates whether large language models can effectively support users in selecting and automating data preparation tasks. To this aim, we considered both general-purpose and fine-tuned tabular large language models. We prompted these models with poor-quality datasets and measured their ability to perform tasks such as data profiling and cleaning. We also compare the support provided by large language models with that offered by traditional data preparation tools. To evaluate the capabilities of large language models, we developed a custom-designed quality model that has been validated through a user study to gain insights into practitioners' expectations.

Title: Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach

Authors: Blessed Guda, Lawrence Francis, Gabrial Zencha Ashungafac, Carlee Joe-Wong, Moise Busogi
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.21709
Pdf URL: https://arxiv.org/pdf/2511.21709
Copy Paste: [[2511.21709]] Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach(https://arxiv.org/abs/2511.21709)
Keywords: large language model
Abstract: Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs). However, LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content. This bias undermines the reliability of MCQ as an evaluation framework. Most existing selection bias metrics require answer labels and measure divergences between prediction and answer distributions, but do not fully capture the consistency of a model's predictions across different orderings of answer choices. Existing selection bias mitigation strategies have notable limitations: majority voting, though effective, is computationally prohibitive; calibration-based methods require validation sets and often fail to generalize across datasets. To address these gaps, we propose three key contributions: (1) a new unsupervised label-free Permutation Bias Metric (PBM) that directly quantifies inconsistencies in model predictions across answer permutations, providing a more precise measure of selection bias, (2) an efficient majority voting approach called Batch Question-Context KV caching (BaQCKV), to significantly reduce computational costs while preserving bias mitigation effectiveness, and (3) an unsupervised Low-Rank Adaptation (LoRA-1) fine-tuning strategy based on our proposed metric and the BaQCKV that mitigates selection bias, providing a computationally efficient alternative that maintains model generalizability. Experiments across multiple MCQ benchmarks demonstrate that our approaches reduce bias, increasing consistency in accuracy while minimizing computational costs.

Title: Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation

Authors: Fatima Kazi
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21711
Pdf URL: https://arxiv.org/pdf/2511.21711
Copy Paste: [[2511.21711]] Addressing Stereotypes in Large Language Models: A Critical Examination and Mitigation(https://arxiv.org/abs/2511.21711)
Keywords: fair, generative, large language model
Abstract: Large Language models (LLMs), such as ChatGPT, have gained popularity in recent years with the advancement of Natural Language Processing (NLP), with use cases spanning many disciplines and daily lives as well. LLMs inherit explicit and implicit biases from the datasets they were trained on; these biases can include social, ethical, cultural, religious, and other prejudices and stereotypes. It is important to comprehensively examine such shortcomings by identifying the existence and extent of such biases, recognizing the origin, and attempting to mitigate such biased outputs to ensure fair outputs to reduce harmful stereotypes and misinformation. This study inspects and highlights the need to address biases in LLMs amid growing generative Artificial Intelligence (AI). We utilize bias-specific benchmarks such StereoSet and CrowSPairs to evaluate the existence of various biases in many different generative models such as BERT, GPT 3.5, and ADA. To detect both explicit and implicit biases, we adopt a three-pronged approach for thorough and inclusive analysis. Results indicate fine-tuned models struggle with gender biases but excel at identifying and avoiding racial biases. Our findings also illustrated that despite some cases of success, LLMs often over-rely on keywords in prompts and its outputs. This demonstrates the incapability of LLMs to attempt to truly understand the accuracy and authenticity of its outputs. Finally, in an attempt to bolster model performance, we applied an enhancement learning strategy involving fine-tuning, models using different prompting techniques, and data augmentation of the bias benchmarks. We found fine-tuned models to exhibit promising adaptability during cross-dataset testing and significantly enhanced performance on implicit bias benchmarks, with performance gains of up to 20%.

Title: EulerESG: Automating ESG Disclosure Analysis with LLMs

Authors: Yi Ding, Xushuo Tang, Zhengyi Yang, Wenqian Zhang, Simin Wu, Yuxin Huang, Lingjing Lan, Weiyuan Li, Yin Chen, Mingchen Ju, Wenke Yang, Thong Hoang, Mykhailo Klymenko, Xiwei Zu, Wenjie Zhang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.21712
Pdf URL: https://arxiv.org/pdf/2511.21712
Copy Paste: [[2511.21712]] EulerESG: Automating ESG Disclosure Analysis with LLMs(https://arxiv.org/abs/2511.21712)
Keywords: extraction
Abstract: Environmental, Social, and Governance (ESG) reports have become central to how companies communicate climate risk, social impact, and governance practices, yet they are still published primarily as long, heterogeneous PDF documents. This makes it difficult to systematically answer seemingly simple questions. Existing tools either rely on brittle rule-based extraction or treat ESG reports as generic text, without explicitly modelling the underlying reporting standards. We present \textbf{EulerESG}, an LLM-powered system for automating ESG disclosure analysis with explicit awareness of ESG frameworks. EulerESG combines (i) dual-channel retrieval and LLM-driven disclosure analysis over ESG reports, and (ii) an interactive dashboard and chatbot for exploration, benchmarking, and explanation. Using four globally recognised companies and twelve SASB sub-industries, we show that EulerESG can automatically populate standard-aligned metric tables with high fidelity (up to 0.95 average accuracy) while remaining practical in end-to-end runtime, and we compare several recent LLM models in this setting. The full implementation, together with a demonstration video, is publicly available at this https URL.

Title: An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features

Authors: Shabbir Anees, Anshuman, Ayush Chaurasia, Prathmesh Bogar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.21716
Pdf URL: https://arxiv.org/pdf/2511.21716
Copy Paste: [[2511.21716]] An Optimized Machine Learning Classifier for Detecting Fake Reviews Using Extracted Features(https://arxiv.org/abs/2511.21716)
Keywords: secure, privacy, protect, extraction
Abstract: It is well known that fraudulent reviews cast doubt on the legitimacy and dependability of online purchases. The most recent development that leads customers towards darkness is the appearance of human reviews in computer-generated (CG) ones. In this work, we present an advanced machine-learning-based system that analyses these reviews produced by AI with remarkable precision. Our method integrates advanced text preprocessing, multi-modal feature extraction, Harris Hawks Optimization (HHO) for feature selection, and a stacking ensemble classifier. We implemented this methodology on a public dataset of 40,432 Original (OR) and Computer-Generated (CG) reviews. From an initial set of 13,539 features, HHO selected the most applicable 1,368 features, achieving an 89.9% dimensionality reduction. Our final stacking model achieved 95.40% accuracy, 92.81% precision, 95.01% recall, and a 93.90% F1-Score, which demonstrates that the combination of ensemble learning and bio-inspired optimisation is an effective method for machine-generated text recognition. Because large-scale review analytics commonly run on cloud platforms, privacy-preserving techniques such as differential approaches and secure outsourcing are essential to protect user data in these systems.

Title: CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution

Authors: Baoliang Tian, Yuxuan Si, Jilong Wang, Lingyao Li, Zhongyuan Bao, Zineng Zhou, Tao Wang, Sixu Li, Ziyao Xu, Mingze Wang, Zhouzhuo Zhang, Zhihao Wang, Yike Yun, Ke Tian, Ning Yang, Minghui Qiu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.21717
Pdf URL: https://arxiv.org/pdf/2511.21717
Copy Paste: [[2511.21717]] CrossCheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution(https://arxiv.org/abs/2511.21717)
Keywords: robust, large language model
Abstract: Multimodal Large Language Models are primarily trained and evaluated on aligned image-text pairs, which leaves their ability to detect and resolve real-world inconsistencies largely unexplored. In open-domain applications visual and textual cues often conflict, requiring models to perform structured reasoning beyond surface-level alignment. We introduce CrossCheck-Bench, a diagnostic benchmark for evaluating contradiction detection in multimodal inputs. The benchmark adopts a hierarchical task framework covering three levels of reasoning complexity and defines seven atomic capabilities essential for resolving cross-modal inconsistencies. CrossCheck-Bench includes 15k question-answer pairs sourced from real-world artifacts with synthetically injected contradictions. The dataset is constructed through a multi-stage annotation pipeline involving more than 450 expert hours to ensure semantic validity and calibrated difficulty across perception, integration, and reasoning. We evaluate 13 state-of-the-art vision-language models and observe a consistent performance drop as tasks shift from perceptual matching to logical contradiction detection. Most models perform well on isolated entity recognition but fail when multiple clues must be synthesized for conflict reasoning. Capability-level analysis further reveals uneven skill acquisition, especially in tasks requiring multi-step inference or rule-based validation. Additional probing shows that conventional prompting strategies such as Chain-of-Thought and Set-of-Mark yield only marginal gains. By contrast, methods that interleave symbolic reasoning with grounded visual processing achieve more stable improvements. These results highlight a persistent bottleneck in multimodal reasoning and suggest new directions for building models capable of robust cross-modal verification.

Title: When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

Authors: Zhaoxin Zhang, Borui Chen, Yiming Hu, Youyang Qu, Tianqing Zhu, Longxiang Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.21718
Pdf URL: https://arxiv.org/pdf/2511.21718
Copy Paste: [[2511.21718]] When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers(https://arxiv.org/abs/2511.21718)
Keywords: attack, large language model
Abstract: Recent research on large language model (LLM) jailbreaks has primarily focused on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model's capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap enables adversaries to induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without triggering conventional safety filters. We evaluate MICM across five advanced LLMs, including GPT-4o, Deepseek-R1, and Qwen3-8B. Experimental results show that MICM consistently outperforms state-of-the-art jailbreak techniques, achieving high success rates with minimal rejection. Our findings reveal a critical vulnerability in commercial LLMs: their safety mechanisms remain susceptible to covert manipulation of underlying value alignment.

Title: PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health Organizations

Authors: Gao Mo, Naveen Raman, Megan Chai, Cindy Peng, Shannon Pagdon, Nev Jones, Hong Shen, Peggy Swarbrick, Fei Fang
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21721
Pdf URL: https://arxiv.org/pdf/2511.21721
Copy Paste: [[2511.21721]] PeerCoPilot: A Language Model-Powered Assistant for Behavioral Health Organizations(https://arxiv.org/abs/2511.21721)
Keywords: large language model
Abstract: Behavioral health conditions, which include mental health and substance use disorders, are the leading disease burden in the United States. Peer-run behavioral health organizations (PROs) critically assist individuals facing these conditions by combining mental health services with assistance for needs such as income, employment, and housing. However, limited funds and staffing make it difficult for PROs to address all service user needs. To assist peer providers at PROs with their day-to-day tasks, we introduce PeerCoPilot, a large language model (LLM)-powered assistant that helps peer providers create wellness plans, construct step-by-step goals, and locate organizational resources to support these goals. PeerCoPilot ensures information reliability through a retrieval-augmented generation pipeline backed by a large database of over 1,300 vetted resources. We conducted human evaluations with 15 peer providers and 6 service users and found that over 90% of users supported using PeerCoPilot. Moreover, we demonstrated that PeerCoPilot provides more reliable and specific information than a baseline LLM. PeerCoPilot is now used by a group of 5-10 peer providers at CSPNJ, a large behavioral health organization serving over 10,000 service users, and we are actively expanding PeerCoPilot's use.

Title: German General Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies

Authors: Jens Rupprecht, Leon Fröhling, Claudia Wagner, Markus Strohmaier
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.21722
Pdf URL: https://arxiv.org/pdf/2511.21722
Copy Paste: [[2511.21722]] German General Personas: A Survey-Derived Persona Prompt Collection for Population-Aligned LLM Studies(https://arxiv.org/abs/2511.21722)
Keywords: large language model
Abstract: The use of Large Language Models (LLMs) for simulating human perspectives via persona prompting is gaining traction in computational social science. However, well-curated, empirically grounded persona collections remain scarce, limiting the accuracy and representativeness of such simulations. Here we introduce the German General Personas (GGP) collection, a comprehensive and representative persona prompt collection built from the German General Social Survey (ALLBUS). The GGP and its persona prompts are designed to be easily plugged into prompts for all types of LLMs and tasks, steering models to generate responses aligned with the underlying German population. We evaluate GGP by prompting various LLMs to simulate survey response distributions across diverse topics, demonstrating that GGP-guided LLMs outperform state-of-the-art classifiers, particularly under data scarcity. Furthermore, we analyze how the representativity and attribute selection within persona prompts affect alignment with population responses. Our findings suggest that GGP provides a potentially valuable resource for research on LLM-based social simulations that enables more systematic explorations of population-aligned persona prompting in NLP and social science research.

Title: AD-CDO: A Lightweight Ontology for Representing Eligibility Criteria in Alzheimer's Disease Clinical Trials

Authors: Zenan Sun, Rashmie Abeysinghe, Xiaojin Li, Xinyue Hu, Licong Cui, Guo-Qiang Zhang, Jiang Bian, Cui Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.21724
Pdf URL: https://arxiv.org/pdf/2511.21724
Copy Paste: [[2511.21724]] AD-CDO: A Lightweight Ontology for Representing Eligibility Criteria in Alzheimer's Disease Clinical Trials(https://arxiv.org/abs/2511.21724)
Keywords: interpretability
Abstract: Objective This study introduces the Alzheimer's Disease Common Data Element Ontology for Clinical Trials (AD-CDO), a lightweight, semantically enriched ontology designed to represent and standardize key eligibility criteria concepts in Alzheimer's disease (AD) clinical trials. Materials and Methods We extracted high-frequency concepts from more than 1,500 AD clinical trials on this http URL and organized them into seven semantic categories: Disease, Medication, Diagnostic Test, Procedure, Social Determinants of Health, Rating Criteria, and Fertility. Each concept was annotated with standard biomedical vocabularies, including the UMLS, OMOP Standardized Vocabularies, DrugBank, NDC, and NLM VSAC value sets. To balance coverage and manageability, we applied the Jenks Natural Breaks method to identify an optimal set of representative concepts. Results The optimized AD-CDO achieved over 63% coverage of extracted trial concepts while maintaining interpretability and compactness. The ontology effectively captured the most frequent and clinically meaningful entities used in AD eligibility criteria. We demonstrated AD-CDO's practical utility through two use cases: (a) an ontology-driven trial simulation system for formal modeling and virtual execution of clinical trials, and (b) an entity normalization task mapping raw clinical text to ontology-aligned terms, enabling consistency and integration with EHR data. Discussion AD-CDO bridges the gap between broad biomedical ontologies and task-specific trial modeling needs. It supports multiple downstream applications, including phenotyping algorithm development, cohort identification, and structured data integration. Conclusion By harmonizing essential eligibility entities and aligning them with standardized vocabularies, AD-CDO provides a versatile foundation for ontology-driven AD clinical trial research.

Title: PromptTailor: Multi-turn Intent-Aligned Prompt Synthesis for Lightweight LLMs

Authors: Yizhou Xu, Janet Davis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21725
Pdf URL: https://arxiv.org/pdf/2511.21725
Copy Paste: [[2511.21725]] PromptTailor: Multi-turn Intent-Aligned Prompt Synthesis for Lightweight LLMs(https://arxiv.org/abs/2511.21725)
Keywords: privacy
Abstract: Lightweight language models remain attractive for on-device and privacy-sensitive applications, but their responses are highly sensitive to prompt quality. For open-ended generation, non-expert users often lack the knowledge or time to consistently craft high-quality prompts, leading them to rely on prompt optimization tools. However, a key challenge is ensuring the optimized prompts genuinely align with users' original intents and preferences. We introduce PromptTailor, a system for controllable prompt generation for open-ended text that improves model output quality by intent-aligned prompt synthesis. PromptTailor expands minimal user instructions into rich, domain-aware prompts while preserving the user's stated preferences. The system is a quantized Llama3-8B model fine-tuned with a lightweight LoRA adapter on 12,300 prompt-refinement dialogues spanning 41 everyday domains, distilled from three stronger LLMs. The adapter attaches to any Llama3-8B base, enabling edge deployment. In human and LLM-judge evaluations across multiple target models and optimization baselines, PromptTailor yields higher preference rates than chain-of-thought prompting and matches or surpasses state-of-the-art prompt optimization methods while requiring fewer model calls (e.g., 3 vs. 9). These results show that a compact student, guided by powerful teachers, can learn effective prompt-generation strategies that enhance response quality while maintaining alignment with user intent.

Title: Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks

Authors: Yicong Zheng, Kevin L. McKee, Thomas Miconi, Zacharie Bugaud, Mick van Gelderen, Jed McCaleb
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21726
Pdf URL: https://arxiv.org/pdf/2511.21726
Copy Paste: [[2511.21726]] Goal-Directed Search Outperforms Goal-Agnostic Memory Compression in Long-Context Memory Tasks(https://arxiv.org/abs/2511.21726)
Keywords: large language model
Abstract: How to enable human-like long-term memory in large language models (LLMs) has been a central question for unlocking more general capabilities such as few-shot generalization. Existing memory frameworks and benchmarks focus on finding the optimal memory compression algorithm for higher performance in tasks that require recollection and sometimes further reasoning. However, such efforts have ended up building more human bias into the compression algorithm, through the search for the best prompts and memory architectures that suit specific benchmarks, rather than finding a general solution that would work on other data distributions. On the other hand, goal-directed search on uncompressed information could potentially exhibit superior performance because compression is lossy, and a predefined compression algorithm will not fit all raw data distributions. Here we present SUMER (Search in Uncompressed Memory via Experience Replay), an end-to-end reinforcement learning agent with verifiable reward (RLVR) that learns to use search tools to gather information and answer a target question. On the LoCoMo dataset for long-context conversation understanding, SUMER with Qwen2.5-7B-Instruct learned to use search tools and outperformed all other biased memory compression approaches and also the full-context baseline, reaching SOTA performance (43% gain over the prior best). We demonstrate that a simple search method applied to raw data outperforms goal-agnostic and biased compression algorithms in current long-context memory tasks, arguing for new paradigms and benchmarks that are more dynamic and autonomously scalable. Code for SUMER and all implemented baselines is publicly available at this https URL.

Title: Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue

Authors: Lin Yu, Xiaofei Han, Yifei Kang, Chiung-Yi Tseng, Danyang Zhang, Ziqian Bi, Zhimo Han
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21728
Pdf URL: https://arxiv.org/pdf/2511.21728
Copy Paste: [[2511.21728]] Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue(https://arxiv.org/abs/2511.21728)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have enabled fluent dialogue systems, but most remain reactive and struggle in emotionally rich, goal-oriented settings such as marketing conversations. To address this limitation, we propose AffectMind, a multimodal affective dialogue agent that performs proactive reasoning and dynamic knowledge grounding to sustain emotionally aligned and persuasive interactions. AffectMind combines three components: a Proactive Knowledge Grounding Network (PKGN) that continuously updates factual and affective context from text, vision, and prosody; an Emotion--Intent Alignment Model (EIAM) that jointly models user emotion and purchase intent to adapt persuasion strategies; and a Reinforced Discourse Loop (RDL) that optimizes emotional coherence and engagement via reinforcement signals from user responses. Experiments on two newly curated marketing dialogue datasets, MM-ConvMarket and AffectPromo, show that AffectMind outperforms strong LLM-based baselines in emotional consistency (+26\%), persuasive success rate (+19\%), and long-term user engagement (+23\%), highlighting emotion-grounded proactivity as a key capability for commercial multimodal agents.

Title: Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition

Authors: Diederik Aerts, Jonito Aerts Arguëlles, Lester Beltran, Suzette Geriente, Roberto Leporini, Massimiliano Sassoli de Bianchi, Sandro Sozzo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21731
Pdf URL: https://arxiv.org/pdf/2511.21731
Copy Paste: [[2511.21731]] Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition(https://arxiv.org/abs/2511.21731)
Keywords: large language model
Abstract: We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subjects. In the first test, performed with ChatGPT and Gemini, we show that Bell's inequalities are significantly violated, which indicates the presence of 'quantum entanglement' in the tested concepts. In the second test, also performed using ChatGPT and Gemini, we instead identify the presence of 'Bose-Einstein statistics', rather than the intuitively expected 'Maxwell-Boltzmann statistics', in the distribution of the words contained in large-size texts. Interestingly, these findings mirror the results previously obtained in both cognitive tests with human participants and information retrieval tests on large corpora. Taken together, they point to the 'systematic emergence of quantum structures in conceptual-linguistic domains', regardless of whether the cognitive agent is human or artificial. Although LLMs are classified as neural networks for historical reasons, we believe that a more essential form of knowledge organization takes place in the distributive semantic structure of vector spaces built on top of the neural network. It is this meaning-bearing structure that lends itself to a phenomenon of evolutionary convergence between human cognition and language, slowly established through biological evolution, and LLM cognition and language, emerging much more rapidly as a result of self-learning and training. We analyze various aspects and examples that contain evidence supporting the above hypothesis. We also advance a unifying framework that explains the pervasive quantum organization of meaning that we identify.

Title: HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation

Authors: Jiajun Zhang, Shijia Luo, Ruikang Zhang, Qi Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21732
Pdf URL: https://arxiv.org/pdf/2511.21732
Copy Paste: [[2511.21732]] HUMORCHAIN: Theory-Guided Multi-Stage Reasoning for Interpretable Multimodal Humor Generation(https://arxiv.org/abs/2511.21732)
Keywords: generative, large language model
Abstract: Humor, as both a creative human activity and a social binding mechanism, has long posed a major challenge for AI generation. Although producing humor requires complex cognitive reasoning and social understanding, theories of humor suggest that it follows learnable patterns and structures, making it theoretically possible for generative models to acquire them implicitly. In recent years, multimodal humor has become a prevalent form of online communication, especially among Gen Z, highlighting the need for AI systems capable of integrating visual understanding with humorous language generation. However, existing data-driven approaches lack explicit modeling or theoretical grounding of humor, often producing literal descriptions that fail to capture its underlying cognitive mechanisms, resulting in the generated image descriptions that are fluent but lack genuine humor or cognitive depth. To address this limitation, we propose HUMORCHAIN (HUmor-guided Multi-step Orchestrated Reasoning Chain for Image Captioning), a theory-guided multi-stage reasoning framework. It integrates visual semantic parsing, humor- and psychology-based reasoning, and a fine-tuned discriminator for humor evaluation, forming an interpretable and controllable cognitive reasoning chain. To the best of our knowledge, this is the first work to explicitly embed cognitive structures from humor theories into multimodal humor generation, enabling a structured reasoning process from visual understanding to humor creation. Experiments on Meme-Image-No-Text, Oogiri-GO, and OxfordTVG-HIC datasets show that HUMORCHAIN outperforms state-of-the-art baselines in human humor preference, Elo/BT scores, and semantic diversity, demonstrating that theory-driven structured reasoning enables large language models to generate humor aligned with human perception.

Title: RoSA: Enhancing Parameter-Efficient Fine-Tuning via RoPE-aware Selective Adaptation in Large Language Models

Authors: Dayan Pan, Jingyuan Wang, Yilong Zhou, Jiawei Cheng, Pengyue Jia, Xiangyu Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21733
Pdf URL: https://arxiv.org/pdf/2511.21733
Copy Paste: [[2511.21733]] RoSA: Enhancing Parameter-Efficient Fine-Tuning via RoPE-aware Selective Adaptation in Large Language Models(https://arxiv.org/abs/2511.21733)
Keywords: large language model
Abstract: Fine-tuning large language models is essential for task-specific adaptation, yet it remains computationally prohibitive. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a solution, but current approaches typically ignore the distinct roles of model components and the heterogeneous importance across layers, thereby limiting adaptation efficiency. Motivated by the observation that Rotary Position Embeddings (RoPE) induce critical activations in the low-frequency dimensions of attention states, we propose RoPE-aware Selective Adaptation (RoSA), a novel PEFT framework that allocates trainable parameters in a more targeted and effective manner. RoSA comprises a RoPE-aware Attention Enhancement (RoAE) module, which selectively enhances the low-frequency components of RoPE-influenced attention states, and a Dynamic Layer Selection (DLS) strategy that adaptively identifies and updates the most critical layers based on LayerNorm gradient norms. By combining dimension-wise enhancement with layer-wise adaptation, RoSA achieves more targeted and efficient fine-tuning. Extensive experiments on fifteen commonsense and arithmetic benchmarks demonstrate that RoSA outperforms existing mainstream PEFT methods under comparable trainable parameters. The code is available to ease reproducibility at this https URL.

Title: Asking LLMs to Verify First is Almost Free Lunch

Authors: Shiguang Wu, Quanming Yao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21734
Pdf URL: https://arxiv.org/pdf/2511.21734
Copy Paste: [[2511.21734]] Asking LLMs to Verify First is Almost Free Lunch(https://arxiv.org/abs/2511.21734)
Keywords: large language model
Abstract: To enhance the reasoning capabilities of Large Language Models (LLMs) without high costs of training, nor extensive test-time sampling, we introduce Verification-First (VF), a strategy that prompts models to verify a provided candidate answer, even a trivial or random one, before generating a solution. This approach triggers a "reverse reasoning" process that is cognitively easier and complementary to standard forward Chain-of-Thought (CoT), effectively invoking the model's critical thinking to reduce logical errors. We further generalize the VF strategy to Iter-VF, a sequential test-time scaling (TTS) method that iteratively cycles the verification-generation process using the model's previous answer. Extensive experiments across various benchmarks (from mathematical reasoning to coding and agentic tasks) and various LLMs (from open-source 1B to cutting-edge commercial ones) confirm that VF with random answer consistently outperforms standard CoT with minimal computational overhead, and Iter-VF outperforms existing TTS strategies.

Title: R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization

Authors: Jiayi Chen, Jieqi Shi, Jing Huo, Chen Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21736
Pdf URL: https://arxiv.org/pdf/2511.21736
Copy Paste: [[2511.21736]] R2Q: Towards Robust 2-Bit Large Language Models via Residual Refinement Quantization(https://arxiv.org/abs/2511.21736)
Keywords: robust, large language model
Abstract: The rapid progress of Large Language Models (LLMs) has brought substantial computational and memory demands, spurring the adoption of low-bit quantization. While 8-bit and 4-bit formats have become prevalent, extending quantization to 2 bits remains challenging due to severe accuracy degradation. To address this, we propose Residual Refinement Quantization (R2Q)-a novel 2-bit quantization framework that decomposes the process into two sequential 1-bit sub-quantizations, forming an adaptive quantization lattice. Extensive evaluations on Llama, OPT, and Qwen across diverse benchmarks-covering question answering, commonsense reasoning, and language modeling-demonstrate that R2Q consistently outperforms existing 2-bit quantization methods in both fine-grained and coarse-grained settings. By refining quantization through a residual learning mechanism, R2Q enhances performance, improves training stability, and accelerates convergence under extreme compression. Furthermore, its modular design enables seamless integration with existing quantization-aware training (QAT) frameworks.

Title: Polarity-Aware Probing for Quantifying Latent Alignment in Language Models

Authors: Sabrina Sadiekh, Elena Ericheva, Chirag Agarwal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21737
Pdf URL: https://arxiv.org/pdf/2511.21737
Copy Paste: [[2511.21737]] Polarity-Aware Probing for Quantifying Latent Alignment in Language Models(https://arxiv.org/abs/2511.21737)
Keywords: robust, interpretability
Abstract: Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset containing matched harmful-safe sentence pairs constructed using different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS identifies both architectural and layer-specific differences in the encoding of latent harmful knowledge. Notably, replacing the negation token with a meaningless marker degrades PA-CCS scores for models with well-aligned internal representations, while models lacking robust internal calibration do not exhibit this degradation. Our findings highlight the potential of unsupervised probing for alignment evaluation and emphasize the need to incorporate structural robustness checks into interpretability benchmarks. Code and datasets are available at: this https URL. WARNING: This paper contains potentially sensitive, harmful, and offensive content.

Title: Decoding inner speech with an end-to-end brain-to-text neural interface

Authors: Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21740
Pdf URL: https://arxiv.org/pdf/2511.21740
Copy Paste: [[2511.21740]] Decoding inner speech with an end-to-end brain-to-text neural interface(https://arxiv.org/abs/2511.21740)
Keywords: large language model
Abstract: Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.

Title: A Multiscale Geometric Method for Capturing Relational Topic Alignment

Authors: Conrad D. Hougen, Karl T. Pazdernik, Alfred O. Hero
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.21741
Pdf URL: https://arxiv.org/pdf/2511.21741
Copy Paste: [[2511.21741]] A Multiscale Geometric Method for Capturing Relational Topic Alignment(https://arxiv.org/abs/2511.21741)
Keywords: transformer
Abstract: Interpretable topic modeling is essential for tracking how research interests evolve within co-author communities. In scientific corpora, where novelty is prized, identifying underrepresented niche topics is particularly important. However, contemporary models built from dense transformer embeddings tend to miss rare topics and therefore also fail to capture smooth temporal alignment. We propose a geometric method that integrates multimodal text and co-author network data, using Hellinger distances and Ward's linkage to construct a hierarchical topic dendrogram. This approach captures both local and global structure, supporting multiscale learning across semantic and temporal dimensions. Our method effectively identifies rare-topic structure and visualizes smooth topic drift over time. Experiments highlight the strength of interpretable bag-of-words models when paired with principled geometric alignment.

Title: EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants

Authors: Meenakshi Mittal, Rishi Khare, Mihran Miroyan, Chancharik Mitra, Narges Norouzi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21742
Pdf URL: https://arxiv.org/pdf/2511.21742
Copy Paste: [[2511.21742]] EduMod-LLM: A Modular Approach for Designing Flexible and Transparent Educational Assistants(https://arxiv.org/abs/2511.21742)
Keywords: generative, large language model
Abstract: With the growing use of Large Language Model (LLM)-based Question-Answering (QA) systems in education, it is critical to evaluate their performance across individual pipeline components. In this work, we introduce {\model}, a modular function-calling LLM pipeline, and present a comprehensive evaluation along three key axes: function calling strategies, retrieval methods, and generative language models. Our framework enables fine-grained analysis by isolating and assessing each component. We benchmark function-calling performance across LLMs, compare our novel structure-aware retrieval method to vector-based and LLM-scoring baselines, and evaluate various LLMs for response synthesis. This modular approach reveals specific failure modes and performance patterns, supporting the development of interpretable and effective educational QA systems. Our findings demonstrate the value of modular function calling in improving system transparency and pedagogical alignment. Website and Supplementary Material: this https URL

Title: A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features

Authors: Sergey K. Aityan, William Claster, Karthik Sai Emani, Sohni Rais, Thy Tran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21744
Pdf URL: https://arxiv.org/pdf/2511.21744
Copy Paste: [[2511.21744]] A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features(https://arxiv.org/abs/2511.21744)
Keywords: transformer
Abstract: A growing number of AI-generated texts raise serious concerns. Most existing approaches to AI-generated text detection rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often provide limited generalization across domains. Existing lightweight alternatives achieved significantly lower accuracy on large datasets. We introduce NEULIF, a lightweight approach that achieves best performance in the lightweight detector class, that does not require extensive computational power and provides high detection accuracy. In our approach, a text is first decomposed into stylometric and readability features which are then used for classification by a compact Convolutional Neural Network (CNN) or Random Forest (RF). Evaluated and tested on the Kaggle AI vs. Human corpus, our models achieve 97% accuracy (~ 0.95 F1) for CNN and 95% accuracy (~ 0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~ 25 MB) and Random Forest (~ 10.6 MB) models are orders of magnitude smaller than transformer-based ensembles and can be run efficiently on standard CPU devices, without sacrificing this http URL study also highlights the potential of such models for broader applications across languages, domains, and streaming contexts, showing that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.

Title: DELTA: Language Diffusion-based EEG-to-Text Architecture

Authors: Mingyu Jeon, Hyobin Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.21746
Pdf URL: https://arxiv.org/pdf/2511.21746
Copy Paste: [[2511.21746]] DELTA: Language Diffusion-based EEG-to-Text Architecture(https://arxiv.org/abs/2511.21746)
Keywords: diffusion
Abstract: Electroencephalogram (EEG)-to-text remains challenging due to high-dimensional noise, subject variability, and error accumulation in autoregressive decoding. We introduce DELTA, which pairs a Residual Vector Quantization (RVQ) EEG tokenizer with a masked language diffusion model (LLaDA). RVQ discretizes continuous EEG into multi-layer tokens to reduce noise and individual differences, while LLaDA reconstructs sentences via non-sequential denoising. On ZuCo, DELTA improves semantic alignment by up to 5.37 points over autoregressive baselines, achieving BLEU-1 21.9 and ROUGE-1 F 17.2 under word-level conditions. These results enable reliable text generation from small EEG-text datasets and point toward scalable multimodal EEG-language models.

Title: Building Domain-Specific Small Language Models via Guided Data Generation

Authors: Aman Kumar, Ekant Muljibhai Amin, Xian Yeow Lee, Lasitha Vidyaratne, Ahmed K. Farahat, Dipanjan D. Ghosh, Yuta Koreeda, Chetan Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21748
Pdf URL: https://arxiv.org/pdf/2511.21748
Copy Paste: [[2511.21748]] Building Domain-Specific Small Language Models via Guided Data Generation(https://arxiv.org/abs/2511.21748)
Keywords: privacy, large language model
Abstract: Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pretraining (DAPT), Domain-specific Supervised Fine-tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter domain-specific model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves up to 25% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating effective domain-specific reasoning and generalization capabilities.

Title: Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness

Authors: Svitlana Volkova, Will Dupree, Hsien-Te Kao, Peter Bautista, Gabe Ganberg, Jeff Beaubien, Laura Cassani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21749
Pdf URL: https://arxiv.org/pdf/2511.21749
Copy Paste: [[2511.21749]] Proactive Defense: Compound AI for Detecting Persuasion Attacks and Measuring Inoculation Effectiveness(https://arxiv.org/abs/2511.21749)
Keywords: security, defense, attack, generative
Abstract: This paper introduces BRIES, a novel compound AI architecture designed to detect and measure the effectiveness of persuasion attacks across information environments. We present a system with specialized agents: a Twister that generates adversarial content employing targeted persuasion tactics, a Detector that identifies attack types with configurable parameters, a Defender that creates resilient content through content inoculation, and an Assessor that employs causal inference to evaluate inoculation effectiveness. Experimenting with the SemEval 2023 Task 3 taxonomy across the synthetic persuasion dataset, we demonstrate significant variations in detection performance across language agents. Our comparative analysis reveals significant performance disparities with GPT-4 achieving superior detection accuracy on complex persuasion techniques, while open-source models like Llama3 and Mistral demonstrated notable weaknesses in identifying subtle rhetorical, suggesting that different architectures encode and process persuasive language patterns in fundamentally different ways. We show that prompt engineering dramatically affects detection efficacy, with temperature settings and confidence scoring producing model-specific variations; Gemma and GPT-4 perform optimally at lower temperatures while Llama3 and Mistral show improved capabilities at higher temperatures. Our causal analysis provides novel insights into socio-emotional-cognitive signatures of persuasion attacks, revealing that different attack types target specific cognitive dimensions. This research advances generative AI safety and cognitive security by quantifying LLM-specific vulnerabilities to persuasion attacks and delivers a framework for enhancing human cognitive resilience through structured interventions before exposure to harmful content.

Title: SO-Bench: A Structural Output Evaluation of Multimodal LLMs

Authors: Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2511.21750
Pdf URL: https://arxiv.org/pdf/2511.21750
Copy Paste: [[2511.21750]] SO-Bench: A Structural Output Evaluation of Multimodal LLMs(https://arxiv.org/abs/2511.21750)
Keywords: extraction, large language model
Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structural output capabilities for MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-sourced and frontier proprietary models reveal persistent gaps in predicting accurate, schema compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments to largely improve the model's structured output capability. We plan to make the benchmark available to the community.

Title: Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification

Authors: Yanxi Li, Ruocheng Shan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21752
Pdf URL: https://arxiv.org/pdf/2511.21752
Copy Paste: [[2511.21752]] Semantics as a Shield: Label Disguise Defense (LDD) against Prompt Injection in LLM Sentiment Classification(https://arxiv.org/abs/2511.21752)
Keywords: defense, attack, robust, large language model
Abstract: Large language models are increasingly used for text classification tasks such as sentiment analysis, yet their reliance on natural language prompts exposes them to prompt injection attacks. In particular, class-directive injections exploit knowledge of the model's label set (e.g., positive vs. negative) to override its intended behavior through adversarial instructions. Existing defenses, such as detection-based filters, instruction hierarchies, and signed prompts, either require model retraining or remain vulnerable to obfuscation. This paper introduces Label Disguise Defense (LDD), a lightweight and model-agnostic strategy that conceals true labels by replacing them with semantically transformed or unrelated alias labels(e.g., blue vs. yellow). The model learns these new label mappings implicitly through few-shot demonstrations, preventing direct correspondence between injected directives and decision outputs. We evaluate LDD across nine state-of-the-art models, including GPT-5, GPT-4o, LLaMA3.2, Gemma3, and Mistral variants, under varying few-shot and an adversarial setting. Our results show that the ability of LDD to recover performance lost to the adversarial attack varies across models and alias choices. For every model evaluated, LDD is able to restore a portion of the accuracy degradation caused by the attack. Moreover, for the vast majority of models, we can identify more than one alias pair that achieves higher accuracy than the under-attack baseline, in which the model relies solely on few-shot learning without any defensive mechanism. A linguistic analysis further reveals that semantically aligned alias labels(e.g., good vs. bad) yield stronger robustness than unaligned symbols(e.g., blue vs. yellow). Overall, this study demonstrates that label semantics can serve as an effective defense layer, transforming meaning itself into a shield against prompt injection.

Title: Extracting Disaster Impacts and Impact Related Locations in Social Media Posts Using Large Language Models

Authors: Sameeah Noreen Hameed, Surangika Ranathunga, Raj Prasanna, Kristin Stock, Christopher B. Jones
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21753
Pdf URL: https://arxiv.org/pdf/2511.21753
Copy Paste: [[2511.21753]] Extracting Disaster Impacts and Impact Related Locations in Social Media Posts Using Large Language Models(https://arxiv.org/abs/2511.21753)
Keywords: robust, extraction, large language model
Abstract: Large-scale disasters can often result in catastrophic consequences on people and infrastructure. Situation awareness about such disaster impacts generated by authoritative data from in-situ sensors, remote sensing imagery, and/or geographic data is often limited due to atmospheric opacity, satellite revisits, and time limitations. This often results in geo-temporal information gaps. In contrast, impact-related social media posts can act as "geo-sensors" during a disaster, where people describe specific impacts and locations. However, not all locations mentioned in disaster-related social media posts relate to an impact. Only the impacted locations are critical for directing resources effectively. e.g., "The death toll from a fire which ripped through the Greek coastal town of #Mati stood at 80, with dozens of people unaccounted for as forensic experts tried to identify victims who were burned alive #Greecefires #AthensFires #Athens #Greece." contains impacted location "Mati" and non-impacted locations "Greece" and "Athens". This research uses Large Language Models (LLMs) to identify all locations, impacts and impacted locations mentioned in disaster-related social media posts. In the process, LLMs are fine-tuned to identify only impacts and impacted locations (as distinct from other, non-impacted locations), including locations mentioned in informal expressions, abbreviations, and short forms. Our fine-tuned model demonstrates efficacy, achieving an F1-score of 0.69 for impact and 0.74 for impacted location extraction, substantially outperforming the pre-trained baseline. These robust results confirm the potential of fine-tuned language models to offer a scalable solution for timely decision-making in resource allocation, situational awareness, and post-disaster recovery planning for responders.

Title: Dissecting the Ledger: Locating and Suppressing "Liar Circuits" in Financial Large Language Models

Authors: Soham Mirajkar
Subjects: cs.CL, cs.CE
Abstract URL: https://arxiv.org/abs/2511.21756
Pdf URL: https://arxiv.org/pdf/2511.21756
Copy Paste: [[2511.21756]] Dissecting the Ledger: Locating and Suppressing "Liar Circuits" in Financial Large Language Models(https://arxiv.org/abs/2511.21756)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational scratchpad in middle layers (L12-L30) and a decisive aggregation circuit in late layers (specifically Layer 46). We verify this mechanism via an ablation study, demonstrating that suppressing Layer 46 reduces the model's confidence in hallucinatory outputs by 81.8%. Furthermore, we demonstrate that a linear probe trained on this layer generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.

Title: A Longitudinal Measurement of Privacy Policy Evolution for Large Language Models

Authors: Zhen Tao, Shidong Pan, Zhenchang Xing, Emily Black, Talia Gillis, Chunyang Chen
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.21758
Pdf URL: https://arxiv.org/pdf/2511.21758
Copy Paste: [[2511.21758]] A Longitudinal Measurement of Privacy Policy Evolution for Large Language Models(https://arxiv.org/abs/2511.21758)
Keywords: privacy, large language model
Abstract: Large language model (LLM) services have been rapidly integrated into people's daily lives as chatbots and agentic systems. They are nourished by collecting rich streams of data, raising privacy concerns around excessive collection of sensitive personal information. Privacy policies are the fundamental mechanism for informing users about data practices in modern information privacy paradigm. Although traditional web and mobile policies are well studied, the privacy policies of LLM providers, their LLM-specific content, and their evolution over time remain largely underexplored. In this paper, we present the first longitudinal empirical study of privacy policies for mainstream LLM providers worldwide. We curate a chronological dataset of 74 historical privacy policies and 115 supplemental privacy documents from 11 LLM providers across 5 countries up to August 2025, and extract over 3,000 sentence-level edits between consecutive policy versions. We compare LLM privacy policies to those of other software formats, propose a taxonomy tailored to LLM privacy policies, annotate policy edits and align them with a timeline of key LLM ecosystem events. Results show they are substantially longer, demand college-level reading ability, and remain highly vague. Our taxonomy analysis reveals patterns in how providers disclose LLM-specific practices and highlights regional disparities in coverage. Policy edits are concentrated in first-party data collection and international/specific-audience sections, and that product releases and regulatory actions are the primary drivers, shedding light on the status quo and the evolution of LLM privacy policies.

Title: Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models

Authors: Linye Wei, Wenjue Chen, Pingzhi Tang, Xiaotian Guo, Le Ye, Runsheng Wang, Meng Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21759
Pdf URL: https://arxiv.org/pdf/2511.21759
Copy Paste: [[2511.21759]] Orchestrating Dual-Boundaries: An Arithmetic Intensity Inspired Acceleration Framework for Diffusion Language Models(https://arxiv.org/abs/2511.21759)
Keywords: diffusion, large language model
Abstract: Diffusion-based large language models (dLLMs) have recently gained significant attention for their exceptional performance and inherent potential for parallel decoding. Existing frameworks further enhance its inference efficiency by enabling KV caching. However, its bidirectional attention mechanism necessitates periodic cache refreshes that interleave prefill and decoding phases, both contributing substantial inference cost and constraining achievable speedup. Inspired by the heterogeneous arithmetic intensity of the prefill and decoding phases, we propose ODB-dLLM, a framework that orchestrates dual-boundaries to accelerate dLLM inference. In the prefill phase, we find that the predefined fixed response length introduces heavy yet redundant computational overhead, which affects efficiency. To alleviate this, ODB-dLLM incorporates an adaptive length prediction mechanism that progressively reduces prefill overhead and unnecessary computation. In the decoding phase, we analyze the computational characteristics of dLLMs and propose a dLLM-specific jump-share speculative decoding method to enhance efficiency by reducing the number of decoding iterations. Experimental results demonstrate that ODB-dLLM achieves 46-162x and 2.63-6.30x speedups over the baseline dLLM and Fast-dLLM, respectively, while simultaneously mitigating the accuracy degradation in existing acceleration frameworks.

Title: fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21760
Pdf URL: https://arxiv.org/pdf/2511.21760
Copy Paste: [[2511.21760]] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding(https://arxiv.org/abs/2511.21760)
Keywords: large language model
Abstract: Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.

Title: LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti

Authors: Tabia Tanzin Prama, Christopher M. Danforth, Peter Sheridan Dodds
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.21761
Pdf URL: https://arxiv.org/pdf/2511.21761
Copy Paste: [[2511.21761]] LLMs for Low-Resource Dialect Translation Using Context-Aware Prompting: A Case Study on Sylheti(https://arxiv.org/abs/2511.21761)
Keywords: large language model
Abstract: Large Language Models (LLMs) have demonstrated strong translation abilities through prompting, even without task-specific training. However, their effectiveness in dialectal and low-resource contexts remains underexplored. This study presents the first systematic investigation of LLM-based machine translation (MT) for Sylheti, a dialect of Bangla that is itself low-resource. We evaluate five advanced LLMs (GPT-4.1, GPT-4.1, LLaMA 4, Grok 3, and DeepSeek V3.2) across both translation directions (Bangla $\Leftrightarrow$ Sylheti), and find that these models struggle with dialect-specific vocabulary. To address this, we introduce Sylheti-CAP (Context-Aware Prompting), a three-step framework that embeds a linguistic rulebook, a dictionary (2{,}260 core vocabulary items and idioms), and an authenticity check directly into prompts. Extensive experiments show that Sylheti-CAP consistently improves translation quality across models and prompting strategies. Both automatic metrics and human evaluations confirm its effectiveness, while qualitative analysis reveals notable reductions in hallucinations, ambiguities, and awkward phrasing, establishing Sylheti-CAP as a scalable solution for dialectal and low-resource MT. Dataset link: \href{this https URL}{this https URL}

Title: Factors That Support Grounded Responses in LLM Conversations: A Rapid Review

Authors: Gabriele Cesar Iwashima, Claudia Susie Rodrigues, Claudio Dipolitto, Geraldo Xexéo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21762
Pdf URL: https://arxiv.org/pdf/2511.21762
Copy Paste: [[2511.21762]] Factors That Support Grounded Responses in LLM Conversations: A Rapid Review(https://arxiv.org/abs/2511.21762)
Keywords: large language model
Abstract: Large language models (LLMs) may generate outputs that are misaligned with user intent, lack contextual grounding, or exhibit hallucinations during conversation, which compromises the reliability of LLM-based applications. This review aimed to identify and analyze techniques that align LLM responses with conversational goals, ensure grounding, and reduce hallucination and topic drift. We conducted a Rapid Review guided by the PRISMA framework and the PICO strategy to structure the search, filtering, and selection processes. The alignment strategies identified were categorized according to the LLM lifecycle phase in which they operate: inference-time, post-training, and reinforcement learning-based methods. Among these, inference-time approaches emerged as particularly efficient, aligning outputs without retraining while supporting user intent, contextual grounding, and hallucination mitigation. The reviewed techniques provided structured mechanisms for improving the quality and reliability of LLM responses across key alignment objectives.

Title: Adaptive Detection of Polymorphic Malware: Leveraging Mutation Engines and YARA Rules for Enhanced Security

Authors: Shreyansh Swami, Ishwardeep Singh, Ujjwalpreet Singh, Chinmay Prawah Pant
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.21764
Pdf URL: https://arxiv.org/pdf/2511.21764
Copy Paste: [[2511.21764]] Adaptive Detection of Polymorphic Malware: Leveraging Mutation Engines and YARA Rules for Enhanced Security(https://arxiv.org/abs/2511.21764)
Keywords: security
Abstract: Polymorphic malware continually alters its structure to evade signature-based defences, challenging both commercial antivirus (AV) and enterprise detection systems. This study introduces a reproducible framework for analysing eight polymorphic behaviours-junk code insertion, control-flow obfuscation, packing, data encoding, domain generation, randomized beacon timing, protocol mimicry, and format/header tweaks-and evaluates their detectability across three layers: commercial AVs, custom rule-based detectors (YARA/Sigma), and endpoint detection and response (EDR) telemetry. Eleven inert polymorphic variants were generated per behaviour using controlled mutation engines and executed in isolated environments. Detection performance was assessed by detection rate (DR), false positive rate (FPR), and combined coverage. AVs achieved an average DR of 34%, YARA/Sigma 74% and EDR 76%; integrated detection reached ~92% with an FPR of 3.5%. Iterative YARA tuning showed a trade-off between detection and FPR, while behaviour-specific trends revealed static polymorphisms were best caught by custom rules, dynamic by EDR, and network-level by Sigma-like analysis. These results affirm that hybrid detection pipelines combining static, dynamic, and network-layer analytics offer resilient defence against polymorphic malware and form a baseline for future adaptive detection research.

Title: Categorical Framework for Quantum-Resistant Zero-Trust AI Security

Authors: I. Cherkaoui, C. Clarke, J. Horgan, I. Dey
Subjects: cs.CR, math.CT, quant-ph
Abstract URL: https://arxiv.org/abs/2511.21768
Pdf URL: https://arxiv.org/pdf/2511.21768
Copy Paste: [[2511.21768]] Categorical Framework for Quantum-Resistant Zero-Trust AI Security(https://arxiv.org/abs/2511.21768)
Keywords: secure, security, protect, robust, segmentation
Abstract: The rapid deployment of AI models necessitates robust, quantum-resistant security, particularly against adversarial threats. Here, we present a novel integration of post-quantum cryptography (PQC) and zero trust architecture (ZTA), formally grounded in category theory, to secure AI model access. Our framework uniquely models cryptographic workflows as morphisms and trust policies as functors, enabling fine-grained, adaptive trust and micro-segmentation for lattice-based PQC primitives. This approach offers enhanced protection against adversarial AI threats. We demonstrate its efficacy through a concrete ESP32-based implementation, validating a crypto-agile transition with quantifiable performance and security improvements, underpinned by categorical proofs for AI security. The implementation achieves significant memory efficiency on ESP32, with the agent utilizing 91.86% and the broker 97.88% of free heap after cryptographic operations, and successfully rejects 100% of unauthorized access attempts with sub-millisecond average latency.

Title: Physics-Informed Spiking Neural Networks via Conservative Flux Quantization

Authors: Chi Zhang, Lin Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21784
Pdf URL: https://arxiv.org/pdf/2511.21784
Copy Paste: [[2511.21784]] Physics-Informed Spiking Neural Networks via Conservative Flux Quantization(https://arxiv.org/abs/2511.21784)
Keywords: robust
Abstract: Real-time, physically-consistent predictions on low-power edge devices is critical for the next generation embodied AI systems, yet it remains a major challenge. Physics-Informed Neural Networks (PINNs) combine data-driven learning with physics-based constraints to ensure the model's predictions are with underlying physical this http URL, PINNs are energy-intensive and struggle to strictly enforce physical conservation laws. Brain-inspired spiking neural networks (SNNs) have emerged as a promising solution for edge computing and real-time processing. However, naively converting PINNs to SNNs degrades physical fidelity and fails to address long-term generalization issues. To this end, this paper introduce a novel Physics-Informed Spiking Neural Network (PISNN) framework. Importantly, to ensure strict physical conservation, we design the Conservative Leaky Integrate-and-Fire (C-LIF) neuron, whose dynamics structurally guarantee local mass preservation. To achieve robust temporal generalization, we introduce a novel Conservative Flux Quantization (CFQ) strategy, which redefines neural spikes as discrete packets of physical flux. Our CFQ learns a time-invariant physical evolution operator, enabling the PISNN to become a general-purpose solver -- conservative-by-construction. Extensive experiments show that our PISNN excels on diverse benchmarks. For both the canonical 1D heat equation and the more challenging 2D Laplace's Equation, it accurately simulates the system dynamics while maintaining perfect mass conservation by design -- a feat that is challenging for conventional PINNs. This work establishes a robust framework for fusing the rigor of scientific computing with the efficiency of neuromorphic engineering, paving the way for complex, long-term, and energy-efficient physics predictions for intelligent systems.

Title: Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach

Authors: Aamiruddin Syed, Mohammed Ilyas Ahmad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.21795
Pdf URL: https://arxiv.org/pdf/2511.21795
Copy Paste: [[2511.21795]] Advanced Data Collection Techniques in Cloud Security: A Multi-Modal Deep Learning Autoencoder Approach(https://arxiv.org/abs/2511.21795)
Keywords: security, transformer
Abstract: Cloud security is an important concern. To identify and stop cyber threats, efficient data collection methods are necessary. This research presents an innovative method to cloud security by integrating numerous data sources and modalities with multi-modal deep learning autoencoders. The Multi-Modal Deep Learning Ensemble Architecture (MMDLEA), a unique approach for anomaly detection and classification in multi-modal data, is proposed in this study. The proposed design integrates the best features of six deep learning models: Multi-Modal Deep Learning Autoencoder (MMDLA), Anomaly Detection using Adaptive Metric Learning (ADAM), ADADELTA, ADAGRAD, RMSPROP, and Stacked Graph Transformer (SGT). A final prediction is produced by combining the outputs of all the models, each of which is trained using a distinct modality of the data. Based on the test dataset, the recommended MMDLA architecture achieves an accuracy of 98.5% and an F1-score of 0.985, demonstrating its superior performance over each individual model. Of the different models, the ADAM model performs the best, with an accuracy of 96.2% and an F1-score of 0.962. With an F1-score of 0.955 and an accuracy of 95.5%, the ADADELTA model trails closely behind. MMDLA obtains an F1-score of 0.948 and an accuracy of 94.8%. Additionally, the suggested MMDLEA design exhibits enhanced resilience to fluctuating modalities and noisy data, proving its usefulness in practical settings. Future study in this area is made possible by the results, which show the potential of the proposed framework for abnormal identification and categorization in multi-modal data.

Title: The Double-Edged Nature of the Rashomon Set for Trustworthy Machine Learning

Authors: Ethan Hsu, Harry Chen, Chudi Zhong, Lesia Semenova
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21799
Pdf URL: https://arxiv.org/pdf/2511.21799
Copy Paste: [[2511.21799]] The Double-Edged Nature of the Rashomon Set for Trustworthy Machine Learning(https://arxiv.org/abs/2511.21799)
Keywords: privacy, attack, robust
Abstract: Real-world machine learning (ML) pipelines rarely produce a single model; instead, they produce a Rashomon set of many near-optimal ones. We show that this multiplicity reshapes key aspects of trustworthiness. At the individual-model level, sparse interpretable models tend to preserve privacy but are fragile to adversarial attacks. In contrast, the diversity within a large Rashomon set enables reactive robustness: even when an attack breaks one model, others often remain accurate. Rashomon sets are also stable under small distribution shifts. However, this same diversity increases information leakage, as disclosing more near-optimal models provides an attacker with progressively richer views of the training data. Through theoretical analysis and empirical studies of sparse decision trees and linear models, we characterize this robustness-privacy trade-off and highlight the dual role of Rashomon sets as both a resource and a risk for trustworthy ML.

Title: Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy

Authors: Gauri Pradhan, Joonas Jälkö, Santiago Zanella-Bèguelin, Antti Honkela
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21804
Pdf URL: https://arxiv.org/pdf/2511.21804
Copy Paste: [[2511.21804]] Beyond Membership: Limitations of Add/Remove Adjacency in Differential Privacy(https://arxiv.org/abs/2511.21804)
Keywords: privacy, protect, attack
Abstract: Training machine learning models with differential privacy (DP) limits an adversary's ability to infer sensitive information about the training data. It can be interpreted as a bound on adversary's capability to distinguish two adjacent datasets according to chosen adjacency relation. In practice, most DP implementations use the add/remove adjacency relation, where two datasets are adjacent if one can be obtained from the other by adding or removing a single record, thereby protecting membership. In many ML applications, however, the goal is to protect attributes of individual records (e.g., labels used in supervised fine-tuning). We show that privacy accounting under add/remove overstates attribute privacy compared to accounting under the substitute adjacency relation, which permits substituting one record. To demonstrate this gap, we develop novel attacks to audit DP under substitute adjacency, and show empirically that audit results are inconsistent with DP guarantees reported under add/remove, yet remain consistent with the budget accounted under the substitute adjacency relation. Our results highlight that the choice of adjacency when reporting DP guarantees is critical when the protection target is per-record attributes rather than membership.

Title: Unsupervised Anomaly Detection for Smart IoT Devices: Performance and Resource Comparison

Authors: Md. Sad Abdullah Sami, Mushfiquzzaman Abid
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2511.21842
Pdf URL: https://arxiv.org/pdf/2511.21842
Copy Paste: [[2511.21842]] Unsupervised Anomaly Detection for Smart IoT Devices: Performance and Resource Comparison(https://arxiv.org/abs/2511.21842)
Keywords: security, robust
Abstract: The rapid expansion of Internet of Things (IoT) deployments across diverse sectors has significantly enhanced operational efficiency, yet concurrently elevated cybersecurity vulnerabilities due to increased exposure to cyber threats. Given the limitations of traditional signature-based Anomaly Detection Systems (ADS) in identifying emerging and zero-day threats, this study investigates the effectiveness of two unsupervised anomaly detection techniques, Isolation Forest (IF) and One-Class Support Vector Machine (OC-SVM), using the TON_IoT thermostat dataset. A comprehensive evaluation was performed based on standard metrics (accuracy, precision, recall, and F1-score) alongside critical resource utilization metrics such as inference time, model size, and peak RAM usage. Experimental results revealed that IF consistently outperformed OC-SVM, achieving higher detection accuracy, superior precision, and recall, along with a significantly better F1-score. Furthermore, Isolation Forest demonstrated a markedly superior computational footprint, making it more suitable for deployment on resource-constrained IoT edge devices. These findings underscore Isolation Forest's robustness in high-dimensional and imbalanced IoT environments and highlight its practical viability for real-time anomaly detection.

Title: FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers

Authors: Sarina Xi, Vishisht Rao, Justin Payan, Nihar B. Shah
Subjects: cs.CL, cs.AI, cs.DL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21843
Pdf URL: https://arxiv.org/pdf/2511.21843
Copy Paste: [[2511.21843]] FLAWS: A Benchmark for Error Identification and Localization in Scientific Papers(https://arxiv.org/abs/2511.21843)
Keywords: large language model
Abstract: The identification and localization of errors is a core task in peer review, yet the exponential growth of scientific output has made it increasingly difficult for human reviewers to reliably detect errors given the limited pool of experts. Recent advances in Large Language Models (LLMs) have sparked interest in their potential to support such evaluation tasks, from academic peer review to automated scientific assessment. However, despite the growing use of LLMs in review systems, their capabilities to pinpoint errors remain underexplored. In this work, we introduce Fault Localization Across Writing in Science (FLAWS), an automated benchmark consisting of 713 paper-error pairs designed to evaluate how effectively LLMs detect errors that undermine key claims in research papers. We construct the benchmark by systematically inserting claim-invalidating errors into peer-reviewed papers using LLMs, paired with an automated evaluation metric that measures whether models can identify and localize these errors. Developing such a benchmark presents unique challenges that we overcome: ensuring that the inserted errors are well-defined, challenging, and relevant to the content of the paper, avoiding artifacts that would make identification trivial, and designing a scalable, automated evaluation metric. On the resulting benchmark, we evaluate five frontier LLMs: Claude Sonnet 4.5, DeepSeek Reasoner v3.1, Gemini 2.5 Pro, GPT 5, and Grok 4. Among these, GPT 5 is the top-performing model, achieving 39.1% identification accuracy when k=10, where k is the number of top-ranked error text candidates generated by the LLM.

Title: Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices

Authors: Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Claudio Pinhanez, Yago Primerano
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21860
Pdf URL: https://arxiv.org/pdf/2511.21860
Copy Paste: [[2511.21860]] Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices(https://arxiv.org/abs/2511.21860)
Keywords: large language model
Abstract: In this work we present the Consistency-Rebalanced Accuracy (CoRA) metric, improving the reliability of Large Language Model (LLM) scores computed on multiple choice (MC) benchmarks. Our metric explores the response consistency of the LLMs, taking advantage of synthetically-generated questions with altered answer choices. With two intermediate scores, i.e. Bare-Minimum-Consistency Accuracy (BMCA) and Consistency Index (CI), CoRA is computed by adjusting the multiple-choice question answering (MCQA) scores to better reflect the level of consistency of the LLM. We present evaluations in different benchmarks using diverse LLMs, and not only demonstrate that LLMs can present low response consistency even when they present high MCQA scores, but also that CoRA can successfully scale down the scores of inconsistent models.

Title: Towards a Foundation Model for Partial Differential Equations Across Physics Domains

Authors: Eduardo Soares, Emilio Vital Brazil, Victor Shirasuna, Breno W. S. R. de Carvalho, Cristiano Malossi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21861
Pdf URL: https://arxiv.org/pdf/2511.21861
Copy Paste: [[2511.21861]] Towards a Foundation Model for Partial Differential Equations Across Physics Domains(https://arxiv.org/abs/2511.21861)
Keywords: robust
Abstract: We present PDE-FM, a modular foundation model for physics-informed machine learning that unifies spatial, spectral, and temporal reasoning across heterogeneous partial differential equation (PDE) systems. PDE-FM combines spatial-spectral tokenization, physics-aware conditioning, and a Mamba-based state-space backbone with an operator-theoretic decoder, enabling scalable and data-efficient modeling of complex physical dynamics. In contrast to task-specific neural operators, PDE-FM is pretrained once on diverse PDE datasets and can be transferred to new physical regimes without architectural or data-specific modifications. Evaluated on twelve 2D and 3D datasets from The Well benchmark - spanning hydrodynamic, radiative, elastic, and astrophysical phenomena - PDE-FM achieves state-of-the-art accuracy in six domains, reducing mean VRMSE by 46% relative to prior operator-learning baselines. The model demonstrates robust cross-physics generalization, excelling in turbulent and radiative systems while maintaining strong performance in linear and steady-state regimes. These results suggest that large-scale pretraining across diverse physical processes can yield transferable representations of dynamics, marking a step toward unified, foundation-level surrogates for multi-physics simulation and scientific discovery.

Title: Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training

Authors: Eric Yeats, Darryl Hannan, Wilson Fearn, Timothy Doster, Henry Kvinge, Scott Mahan
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.21863
Pdf URL: https://arxiv.org/pdf/2511.21863
Copy Paste: [[2511.21863]] Saddle-Free Guidance: Improved On-Manifold Sampling without Labels or Additional Training(https://arxiv.org/abs/2511.21863)
Keywords: diffusion, generative
Abstract: Score-based generative models require guidance in order to generate plausible, on-manifold samples. The most popular guidance method, Classifier-Free Guidance (CFG), is only applicable in settings with labeled data and requires training an additional unconditional score-based model. More recently, Auto-Guidance adopts a smaller, less capable version of the original model to guide generation. While each method effectively promotes the fidelity of generated data, each requires labeled data or the training of additional models, making it challenging to guide score-based models when (labeled) training data are not available or training new models is not feasible. We make the surprising discovery that the positive curvature of log density estimates in saddle regions provides strong guidance for score-based models. Motivated by this, we develop saddle-free guidance (SFG) which maintains estimates of maximal positive curvature of the log density to guide individual score-based models. SFG has the same computational cost of classifier-free guidance, does not require additional training, and works with off-the-shelf diffusion and flow matching models. Our experiments indicate that SFG achieves state-of-the-art FID and FD-DINOv2 metrics in single-model unconditional ImageNet-512 generation. When SFG is combined with Auto-Guidance, its unconditional samples achieve general state-of-the-art in FD-DINOv2 score. Our experiments with FLUX.1-dev and Stable Diffusion v3.5 indicate that SFG boosts the diversity of output images compared to CFG while maintaining excellent prompt adherence and image fidelity.

Title: Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium

Authors: Akbar Anbar Jafari, Gholamreza Anbarjafari
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.21882
Pdf URL: https://arxiv.org/pdf/2511.21882
Copy Paste: [[2511.21882]] Closed-Loop Transformers: Autoregressive Modeling as Iterative Latent Equilibrium(https://arxiv.org/abs/2511.21882)
Keywords: diffusion, transformer
Abstract: Contemporary autoregressive transformers operate in open loop: each hidden state is computed in a single forward pass and never revised, causing errors to propagate uncorrected through the sequence. We identify this open-loop bottleneck as a fundamental architectural limitation underlying well-documented failures in long-range reasoning, factual consistency, and multi-step planning. To address this limitation, we introduce the closed-loop prediction principle, which requires that models iteratively refine latent representations until reaching a self-consistent equilibrium before committing to each token. We instantiate this principle as Equilibrium Transformers (EqT), which augment standard transformer layers with an Equilibrium Refinement Module that minimizes a learned energy function via gradient descent in latent space. The energy function enforces bidirectional prediction consistency, episodic memory coherence, and output confidence, all computed without external supervision. Theoretically, we prove that EqT performs approximate MAP inference in a latent energy-based model, establish linear convergence guarantees, and show that refinement improves predictions precisely on hard instances where one-shot inference is suboptimal. The framework unifies deep equilibrium models, diffusion language models, and test-time training as special cases. Preliminary experiments on the binary parity task demonstrate +3.28% average improvement on challenging sequences, with gains reaching +8.07% where standard transformers approach random performance, validating that the benefit of deliberation scales with task difficulty. Just as attention mechanisms resolved the sequential bottleneck of recurrent networks, we propose that closed-loop equilibrium may resolve the commitment bottleneck of open-loop autoregression, representing a foundational step toward language models.

Title: Physically Interpretable Representation Learning with Gaussian Mixture Variational AutoEncoder (GM-VAE)

Authors: Tiffany Fan, Murray Cutforth, Marta D'Elia, Alexandre Cortiella, Alireza Doostan, Eric Darve
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21883
Pdf URL: https://arxiv.org/pdf/2511.21883
Copy Paste: [[2511.21883]] Physically Interpretable Representation Learning with Gaussian Mixture Variational AutoEncoder (GM-VAE)(https://arxiv.org/abs/2511.21883)
Keywords: robust, interpretability
Abstract: Extracting compact, physically interpretable representations from high-dimensional scientific data is a persistent challenge due to the complex, nonlinear structures inherent in physical systems. We propose a Gaussian Mixture Variational Autoencoder (GM-VAE) framework designed to address this by integrating an Expectation-Maximization (EM)-inspired training scheme with a novel spectral interpretability metric. Unlike conventional VAEs that jointly optimize reconstruction and clustering (often leading to training instability), our method utilizes a block-coordinate descent strategy, alternating between expectation and maximization steps. This approach stabilizes training and naturally aligns latent clusters with distinct physical regimes. To objectively evaluate the learned representations, we introduce a quantitative metric based on graph-Laplacian smoothness, which measures the coherence of physical quantities across the latent manifold. We demonstrate the efficacy of this framework on datasets of increasing complexity: surface reaction ODEs, Navier-Stokes wake flows, and experimental laser-induced combustion Schlieren images. The results show that our GM-VAE yields smooth, physically consistent manifolds and accurate regime clustering, offering a robust data-driven tool for interpreting turbulent and reactive flow systems.

Title: UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation

Authors: Bu Jin, Weize Li, Songen Gu, Yupeng Zheng, Yuhang Zheng, Zhengyi Zhou, Yao Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21887
Pdf URL: https://arxiv.org/pdf/2511.21887
Copy Paste: [[2511.21887]] UniArt: Unified 3D Representation for Generating 3D Articulated Objects with Open-Set Articulation(https://arxiv.org/abs/2511.21887)
Keywords: diffusion, segmentation
Abstract: Articulated 3D objects play a vital role in realistic simulation and embodied robotics, yet manually constructing such assets remains costly and difficult to scale. In this paper, we present UniArt, a diffusion-based framework that directly synthesizes fully articulated 3D objects from a single image in an end-to-end manner. Unlike prior multi-stage techniques, UniArt establishes a unified latent representation that jointly encodes geometry, texture, part segmentation, and kinematic parameters. We introduce a reversible joint-to-voxel embedding, which spatially aligns articulation features with volumetric geometry, enabling the model to learn coherent motion behaviors alongside structural formation. Furthermore, we formulate articulation type prediction as an open-set problem, removing the need for fixed joint semantics and allowing generalization to novel joint categories and unseen object types. Experiments on the PartNet-Mobility benchmark demonstrate that UniArt achieves state-of-the-art mesh quality and articulation accuracy.

Title: Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

Authors: Fatemeh Akbarian, Anahita Baninajjar, Yingyi Zhang, Ananth Balashankar, Amir Aminifar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21893
Pdf URL: https://arxiv.org/pdf/2511.21893
Copy Paste: [[2511.21893]] Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings(https://arxiv.org/abs/2511.21893)
Keywords: defense, attack, generative
Abstract: Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions (Zhang et al., 2025), where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. To counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker's perturbed input through generative models, e.g., Variational Autoencoders (VAEs), to maintain natural alignment. To further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on the state-of-the-art multi-modal encoders show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment by 4% (42 to 46) and 11% (32 to 43) in unperturbed and perturbed input settings respectively, providing an effective and model-agnostic defense against adversarial illusions.

Title: Standardized Threat Taxonomy for AI Security, Governance, and Regulatory Compliance

Authors: Hernan Huwyler
Subjects: cs.CR, cs.AI, q-fin.RM
Abstract URL: https://arxiv.org/abs/2511.21901
Pdf URL: https://arxiv.org/pdf/2511.21901
Copy Paste: [[2511.21901]] Standardized Threat Taxonomy for AI Security, Governance, and Regulatory Compliance(https://arxiv.org/abs/2511.21901)
Keywords: security, privacy
Abstract: The accelerating deployment of artificial intelligence systems across regulated sectors has exposed critical fragmentation in risk assessment methodologies. A significant "language barrier" currently separates technical security teams, who focus on algorithmic vulnerabilities (e.g., MITRE ATLAS), from legal and compliance professionals, who address regulatory mandates (e.g., EU AI Act, NIST AI RMF). This disciplinary disconnect prevents the accurate translation of technical vulnerabilities into financial liability, leaving practitioners unable to answer fundamental economic questions regarding contingency reserves, control return-on-investment, and insurance exposure. To bridge this gap, this research presents the AI System Threat Vector Taxonomy, a structured ontology designed explicitly for Quantitative Risk Assessment (QRA). The framework categorizes AI-specific risks into nine critical domains: Misuse, Poisoning, Privacy, Adversarial, Biases, Unreliable Outputs, Drift, Supply Chain, and IP Threat, integrating 53 operationally defined sub-threats. Uniquely, each domain maps technical vectors directly to business loss categories (Confidentiality, Integrity, Availability, Legal, Reputation), enabling the translation of abstract threats into measurable financial impact. The taxonomy is empirically validated through an analysis of 133 documented AI incidents from 2025 (achieving 100% classification coverage) and reconciled against the main AI risk frameworks. Furthermore, it is explicitly aligned with ISO/IEC 42001 controls and NIST AI RMF functions to facilitate auditability.

Title: Adaptive Parameter Optimization for Robust Remote Photoplethysmography

Authors: Cecilia G. Morales, Fanurs Chi En Teh, Kai Li, Pushpak Agrawal, Artur Dubrawski
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21903
Pdf URL: https://arxiv.org/pdf/2511.21903
Copy Paste: [[2511.21903]] Adaptive Parameter Optimization for Robust Remote Photoplethysmography(https://arxiv.org/abs/2511.21903)
Keywords: robust
Abstract: Remote photoplethysmography (rPPG) enables contactless vital sign monitoring using standard RGB cameras. However, existing methods rely on fixed parameters optimized for particular lighting conditions and camera setups, limiting adaptability to diverse deployment environments. This paper introduces the Projection-based Robust Signal Mixing (PRISM) algorithm, a training-free method that jointly optimizes photometric detrending and color mixing through online parameter adaptation based on signal quality assessment. PRISM achieves state-of-the-art performance among unsupervised methods, with MAE of 0.77 bpm on PURE and 0.66 bpm on UBFC-rPPG, and accuracy of 97.3\% and 97.5\% respectively at a 5 bpm threshold. Statistical analysis confirms PRISM performs equivalently to leading supervised methods ($p > 0.2$), while maintaining real-time CPU performance without training. This validates that adaptive time series optimization significantly improves rPPG across diverse conditions.

Title: Multi-Modal Machine Learning for Early Trust Prediction in Human-AI Interaction Using Face Image and GSR Bio Signals

Authors: Hamid Shamszare, Avishek Choudhury
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21908
Pdf URL: https://arxiv.org/pdf/2511.21908
Copy Paste: [[2511.21908]] Multi-Modal Machine Learning for Early Trust Prediction in Human-AI Interaction Using Face Image and GSR Bio Signals(https://arxiv.org/abs/2511.21908)
Keywords: extraction, transformer
Abstract: Predicting human trust in AI systems is crucial for safe integration of AI-based decision support tools, especially in healthcare. This study proposes a multi-modal machine learning framework that combines image and galvanic skin response (GSR) data to predict early user trust in AI- or human-generated recommendations in a simulated ADHD mHealth context. Facial video data were processed using OpenCV for frame extraction and transferred learning with a pre-trained transformer model to derive emotional features. Concurrently, GSR signals were decomposed into tonic and phasic components to capture physiological arousal patterns. Two temporal windows were defined for trust prediction: the Early Detection Window (6 to 3 seconds before decision-making) and the Proximal Detection Window (3 to 0 seconds before decision-making). For each window, trust prediction was conducted separately using image-based, GSR-based, and multimodal (image + GSR) features. Each modality was analyzed using machine learning algorithms, and the top-performing unimodal models were integrated through a multimodal stacking ensemble for final prediction. Experimental results showed that combining facial and physiological cues significantly improved prediction performance. The multimodal stacking framework achieved an accuracy of 0.83, F1-score of 0.88, and ROC-AUC of 0.87 in the Early Detection Window, and an accuracy of 0.75, F1-score of 0.82, and ROC-AUC of 0.66 in the Proximal Detection Window. These results demonstrate the potential of bio signals as real-time, objective markers of user trust, enabling adaptive AI systems that dynamically adjust their responses to maintain calibrated trust which is a critical capability in mental health applications where mis-calibrated trust can affect diagnostic and treatment outcomes.

Title: Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck

Authors: Xinyu Liu, Xu Zhang, Can Chen, Ren Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21923
Pdf URL: https://arxiv.org/pdf/2511.21923
Copy Paste: [[2511.21923]] Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck(https://arxiv.org/abs/2511.21923)
Keywords: attack, steal
Abstract: Understanding how backdoor data influences neural network training dynamics remains a complex and underexplored challenge. In this paper, we present a rigorous analysis of the impact of backdoor data on the learning process, with a particular focus on the distinct behaviors between the target class and other clean classes. Leveraging the Information Bottleneck (IB) principle connected with clustering of internal representation, We find that backdoor attacks create unique mutual information (MI) signatures, which evolve across training phases and differ based on the attack mechanism. Our analysis uncovers a surprising trade-off: visually conspicuous attacks like BadNets can achieve high stealthiness from an information-theoretic perspective, integrating more seamlessly into the model than many visually imperceptible attacks. Building on these insights, we propose a novel, dynamics-based stealthiness metric that quantifies an attack's integration at the model level. We validate our findings and the proposed metric across multiple datasets and diverse attack types, offering a new dimension for understanding and evaluating backdoor threats. Our code is available in: this https URL.

Title: Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs

Authors: Yifan Zhou, Sachin Grover, Mohamed El Mistiri, Kamalesh Kalirathnam, Pratyush Kerhalkar, Swaroop Mishra, Neelesh Kumar, Sanket Gaurav, Oya Aran, Heni Ben Amor
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21928
Pdf URL: https://arxiv.org/pdf/2511.21928
Copy Paste: [[2511.21928]] Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs(https://arxiv.org/abs/2511.21928)
Keywords: large language model
Abstract: Reinforcement Learning (RL) traditionally relies on scalar reward signals, limiting its ability to leverage the rich semantic knowledge often available in real-world tasks. In contrast, humans learn efficiently by combining numerical feedback with language, prior knowledge, and common sense. We introduce Prompted Policy Search (ProPS), a novel RL method that unifies numerical and linguistic reasoning within a single framework. Unlike prior work that augment existing RL components with language, ProPS places a large language model (LLM) at the center of the policy optimization loop-directly proposing policy updates based on both reward feedback and natural language input. We show that LLMs can perform numerical optimization in-context, and that incorporating semantic signals, such as goals, domain knowledge, and strategy hints can lead to more informed exploration and sample-efficient learning. ProPS is evaluated across fifteen Gymnasium tasks, spanning classic control, Atari games, and MuJoCo environments, and compared to seven widely-adopted RL algorithms (e.g., PPO, SAC, TRPO). It outperforms all baselines on eight out of fifteen tasks and demonstrates substantial gains when provided with domain knowledge. These results highlight the potential of unifying semantics and numerics for transparent, generalizable, and human-aligned RL.

Title: A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics

Authors: Yuxin Li, Lorraine Xu, Meng Fan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.21930
Pdf URL: https://arxiv.org/pdf/2511.21930
Copy Paste: [[2511.21930]] A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics(https://arxiv.org/abs/2511.21930)
Keywords: robust
Abstract: We propose a novel study on authorship attribution for Chinese lyrics, a domain where clean, public datasets are sorely lacking. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model, comparing its performance against zero-shot inference using the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g. Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g. Love & Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization in Test1 (real-world data and difficult genres), but offers limited or ambiguous gains in Test2, a smaller, synthetically-augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. We conclude with recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway for improved attribution performance.

Title: Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment

Authors: Henry Salgado, Meagan Kendall, Martine Ceberio
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21931
Pdf URL: https://arxiv.org/pdf/2511.21931
Copy Paste: [[2511.21931]] Does the Model Say What the Data Says? A Simple Heuristic for Model Data Alignment(https://arxiv.org/abs/2511.21931)
Keywords: interpretability
Abstract: In this work, we propose a simple and computationally efficient framework to evaluate whether machine learning models align with the structure of the data they learn from; that is, whether \textit{the model says what the data says}. Unlike existing interpretability methods that focus exclusively on explaining model behavior, our approach establishes a baseline derived directly from the data itself. Drawing inspiration from Rubin's Potential Outcomes Framework, we quantify how strongly each feature separates the two outcome groups in a binary classification task, moving beyond traditional descriptive statistics to estimate each feature's effect on the outcome. By comparing these data-derived feature rankings against model-based explanations, we provide practitioners with an interpretable and model-agnostic method to assess model--data alignment.

Title: Modeling Quantum Autoencoder Trainable Kernel for IoT Anomaly Detection

Authors: Swathi Chandrasekhar, Shiva Raj Pokhrel, Swati Kumari, Navneet Singh
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2511.21932
Pdf URL: https://arxiv.org/pdf/2511.21932
Copy Paste: [[2511.21932]] Modeling Quantum Autoencoder Trainable Kernel for IoT Anomaly Detection(https://arxiv.org/abs/2511.21932)
Keywords: security
Abstract: Escalating cyber threats and the high-dimensional complexity of IoT traffic have outpaced classical anomaly detection methods. While deep learning offers improvements, computational bottlenecks limit real-time deployment at scale. We present a quantum autoencoder (QAE) framework that compresses network traffic into discriminative latent representations and employs quantum support vector classification (QSVC) for intrusion detection. Evaluated on three datasets, our approach achieves improved accuracy on ideal simulators and on the IBM Quantum hardware demonstrating practical quantum advantage on current NISQ devices. Crucially, moderate depolarizing noise acts as implicit regularization, stabilizing training and enhancing generalization. This work establishes quantum machine learning as a viable, hardware-ready solution for real-world cybersecurity challenges.

Title: Heterogeneous Multi-Agent Reinforcement Learning with Attention for Cooperative and Scalable Feature Transformation

Authors: Tao Zhe, Huazhen Fang, Kunpeng Liu, Qian Lou, Tamzidul Hoque, Dongjie Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21934
Pdf URL: https://arxiv.org/pdf/2511.21934
Copy Paste: [[2511.21934]] Heterogeneous Multi-Agent Reinforcement Learning with Attention for Cooperative and Scalable Feature Transformation(https://arxiv.org/abs/2511.21934)
Keywords: robust, interpretability
Abstract: Feature transformation enhances downstream task performance by generating informative features through mathematical feature crossing. Despite the advancements in deep learning, feature transformation remains essential for structured data, where deep models often struggle to capture complex feature interactions. Prior literature on automated feature transformation has achieved success but often relies on heuristics or exhaustive searches, leading to inefficient and time-consuming processes. Recent works employ reinforcement learning (RL) to enhance traditional approaches through a more effective trial-and-error way. However, two limitations remain: 1) Dynamic feature expansion during the transformation process, which causes instability and increases the learning complexity for RL agents; 2) Insufficient cooperation and communication between agents, which results in suboptimal feature crossing operations and degraded model performance. To address them, we propose a novel heterogeneous multi-agent RL framework to enable cooperative and scalable feature transformation. The framework comprises three heterogeneous agents, grouped into two types, each designed to select essential features and operations for feature crossing. To enhance communication among these agents, we implement a shared critic mechanism that facilitates information exchange during feature transformation. To handle the dynamically expanding feature space, we tailor multi-head attention-based feature agents to select suitable features for feature crossing. Additionally, we introduce a state encoding technique during the optimization process to stabilize and enhance the learning dynamics of the RL agents, resulting in more robust and reliable transformation policies. Finally, we conduct extensive experiments to validate the effectiveness, efficiency, robustness, and interpretability of our model.

Title: Deep Learning Architectures for Code-Modulated Visual Evoked Potentials Detection

Authors: Kiran Nair, Hubert Cecotti
Subjects: cs.LG, eess.SP, q-bio.NC
Abstract URL: https://arxiv.org/abs/2511.21940
Pdf URL: https://arxiv.org/pdf/2511.21940
Copy Paste: [[2511.21940]] Deep Learning Architectures for Code-Modulated Visual Evoked Potentials Detection(https://arxiv.org/abs/2511.21940)
Keywords: robust
Abstract: Non-invasive Brain-Computer Interfaces (BCIs) based on Code-Modulated Visual Evoked Potentials (C-VEPs) require highly robust decoding methods to address temporal variability and session-dependent noise in EEG signals. This study proposes and evaluates several deep learning architectures, including convolutional neural networks (CNNs) for 63-bit m-sequence reconstruction and classification, and Siamese networks for similarity-based decoding, alongside canonical correlation analysis (CCA) baselines. EEG data were recorded from 13 healthy adults under single-target flicker stimulation. The proposed deep models significantly outperformed traditional approaches, with distance-based decoding using Earth Mover's Distance (EMD) and constrained EMD showing greater robustness to latency variations than Euclidean and Mahalanobis metrics. Temporal data augmentation with small shifts further improved generalization across sessions. Among all models, the multi-class Siamese network achieved the best overall performance with an average accuracy of 96.89%, demonstrating the potential of data-driven deep architectures for reliable, single-trial C-VEP decoding in adaptive non-invasive BCI systems.

Title: AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views

Authors: Junwei Zhou, Yu-Wing Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21945
Pdf URL: https://arxiv.org/pdf/2511.21945
Copy Paste: [[2511.21945]] AmodalGen3D: Generative Amodal 3D Object Reconstruction from Sparse Unposed Views(https://arxiv.org/abs/2511.21945)
Keywords: generative
Abstract: Reconstructing 3D objects from a few unposed and partially occluded views is a common yet challenging problem in real-world scenarios, where many object surfaces are never directly observed. Traditional multi-view or inpainting-based approaches struggle under such conditions, often yielding incomplete or geometrically inconsistent reconstructions. We introduce AmodalGen3D, a generative framework for amodal 3D object reconstruction that infers complete, occlusion-free geometry and appearance from arbitrary sparse inputs. The model integrates 2D amodal completion priors with multi-view stereo geometry conditioning, supported by a View-Wise Cross Attention mechanism for sparse-view feature fusion and a Stereo-Conditioned Cross Attention module for unobserved structure inference. By jointly modeling visible and hidden regions, AmodalGen3D faithfully reconstructs 3D objects that are consistent with sparse-view constraints while plausibly hallucinating unseen parts. Experiments on both synthetic and real-world datasets demonstrate that AmodalGen3D achieves superior fidelity and completeness under occlusion-heavy sparse-view settings, addressing a pressing need for object-level 3D scene reconstruction in robotics, AR/VR, and embodied AI applications.

Title: ABLE: Using Adversarial Pairs to Construct Local Models for Explaining Model Predictions

Authors: Krishna Khadka, Sunny Shree, Pujan Budhathoki, Yu Lei, Raghu Kacker, D. Richard Kuhn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21952
Pdf URL: https://arxiv.org/pdf/2511.21952
Copy Paste: [[2511.21952]] ABLE: Using Adversarial Pairs to Construct Local Models for Explaining Model Predictions(https://arxiv.org/abs/2511.21952)
Keywords: attack
Abstract: Machine learning models are increasingly used in critical applications but are mostly "black boxes" due to their lack of transparency. Local explanation approaches, such as LIME, address this issue by approximating the behavior of complex models near a test instance using simple, interpretable models. However, these approaches often suffer from instability and poor local fidelity. In this paper, we propose a novel approach called Adversarially Bracketed Local Explanation (ABLE) to address these limitations. Our approach first generates a set of neighborhood points near the test instance, x_test, by adding bounded Gaussian noise. For each neighborhood point D, we apply an adversarial attack to generate an adversarial point A with minimal perturbation that results in a different label than D. A second adversarial attack is then performed on A to generate a point A' that has the same label as D (and thus different than A). The points A and A' form an adversarial pair that brackets the local decision boundary for x_test. We then train a linear model on these adversarial pairs to approximate the local decision boundary. Experimental results on six UCI benchmark datasets across three deep neural network architectures demonstrate that our approach achieves higher stability and fidelity than the state-of-the-art.

Title: DeepGI: Explainable Deep Learning for Gastrointestinal Image Classification

Authors: Walid Houmaidi, Mohamed Hadadi, Youssef Sabiri, Yousra Chtouki
Subjects: cs.CV, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2511.21959
Pdf URL: https://arxiv.org/pdf/2511.21959
Copy Paste: [[2511.21959]] DeepGI: Explainable Deep Learning for Gastrointestinal Image Classification(https://arxiv.org/abs/2511.21959)
Keywords: robust, interpretability, explainability
Abstract: This paper presents a comprehensive comparative model analysis on a novel gastrointestinal medical imaging dataset, comprised of 4,000 endoscopic images spanning four critical disease classes: Diverticulosis, Neoplasm, Peritonitis, and Ureters. Leveraging state-of-the-art deep learning techniques, the study confronts common endoscopic challenges such as variable lighting, fluctuating camera angles, and frequent imaging artifacts. The best performing models, VGG16 and MobileNetV2, each achieved a test accuracy of 96.5%, while Xception reached 94.24%, establishing robust benchmarks and baselines for automated disease classification. In addition to strong classification performance, the approach includes explainable AI via Grad-CAM visualization, enabling identification of image regions most influential to model predictions and enhancing clinical interpretability. Experimental results demonstrate the potential for robust, accurate, and interpretable medical image analysis even in complex real-world conditions. This work contributes original benchmarks, comparative insights, and visual explanations, advancing the landscape of gastrointestinal computer-aided diagnosis and underscoring the importance of diverse, clinically relevant datasets and model explainability in medical AI research.

Title: CTR Prediction on Alibaba's Taobao Advertising Dataset Using Traditional and Deep Learning Models

Authors: Hongyu Yang, Chunxi Wen, Jiyin Zhang, Nanfei Shen, Shijiao Zhang, Xiyan Han
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.21963
Pdf URL: https://arxiv.org/pdf/2511.21963
Copy Paste: [[2511.21963]] CTR Prediction on Alibaba's Taobao Advertising Dataset Using Traditional and Deep Learning Models(https://arxiv.org/abs/2511.21963)
Keywords: transformer
Abstract: Click-through rates prediction is critical in modern advertising systems, where ranking relevance and user engagement directly impact platform efficiency and business value. In this project, we explore how to model CTR more effectively using a large-scale Taobao dataset released by Alibaba. We start with supervised learning models, including logistic regression and Light-GBM, that are trained on static features such as user demographics, ad attributes, and contextual metadata. These models provide fast, interpretable benchmarks, but have limited capabilities to capture patterns of behavior that drive clicks. To better model user intent, we combined behavioral data from hundreds of millions of interactions over a 22-day period. By extracting and encoding user action sequences, we construct representations of user interests over time. We use deep learning models to fuse behavioral embeddings with static features. Among them, multilayer perceptrons (MLPs) have achieved significant performance improvements. To capture temporal dynamics, we designed a Transformer-based architecture that uses a self-attention mechanism to learn contextual dependencies across behavioral sequences, modeling not only what the user interacts with, but also the timing and frequency of interactions. Transformer improves AUC by 2.81 % over the baseline (LR model), with the largest gains observed for users whose interests are diverse or change over time. In addition to modeling, we propose an A/B testing strategy for real-world evaluation. We also think about the broader implications: personalized ad targeting technology can be applied to public health scenarios to achieve precise delivery of health information or behavior guidance. Our research provides a roadmap for advancing click-through rate predictions and extending their value beyond e-commerce.

Title: MOTIF-RF: Multi-template On-chip Transformer Synthesis Incorporating Frequency-domain Self-transfer Learning for RFIC Design Automation

Authors: Houbo He, Yizhou Xu, Lei Xia, Yaolong Hu, Fan Cai, Taiyun Chi
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2511.21970
Pdf URL: https://arxiv.org/pdf/2511.21970
Copy Paste: [[2511.21970]] MOTIF-RF: Multi-template On-chip Transformer Synthesis Incorporating Frequency-domain Self-transfer Learning for RFIC Design Automation(https://arxiv.org/abs/2511.21970)
Keywords: transformer
Abstract: This paper presents a systematic study on developing multi-template machine learning (ML) surrogate models and applying them to the inverse design of transformers (XFMRs) in radio-frequency integrated circuits (RFICs). Our study starts with benchmarking four widely used ML architectures, including MLP-, CNN-, UNet-, and GT-based models, using the same datasets across different XFMR topologies. To improve modeling accuracy beyond these baselines, we then propose a new frequency-domain self-transfer learning technique that exploits correlations between adjacent frequency bands, leading to around 30%-50% accuracy improvement in the S-parameters prediction. Building on these models, we further develop an inverse design framework based on the covariance matrix adaptation evolutionary strategy (CMA-ES) algorithm. This framework is validated using multiple impedance-matching tasks, all demonstrating fast convergence and trustworthy performance. These results advance the goal of AI-assisted specs-to-GDS automation for RFICs and provide RFIC designers with actionable tools for integrating AI into their workflows.

Title: Start Making Sense(s): A Developmental Probe of Attention Specialization Using Lexical Ambiguity

Authors: Pamela D. Rivière, Sean Trott
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.21974
Pdf URL: https://arxiv.org/pdf/2511.21974
Copy Paste: [[2511.21974]] Start Making Sense(s): A Developmental Probe of Attention Specialization Using Lexical Ambiguity(https://arxiv.org/abs/2511.21974)
Keywords: robust, transformer
Abstract: Despite an in-principle understanding of self-attention matrix operations in Transformer language models (LMs), it remains unclear precisely how these operations map onto interpretable computations or functions--and how or when individual attention heads develop specialized attention patterns. Here, we present a pipeline to systematically probe attention mechanisms, and we illustrate its value by leveraging lexical ambiguity--where a single word has multiple meanings--to isolate attention mechanisms that contribute to word sense disambiguation. We take a "developmental" approach: first, using publicly available Pythia LM checkpoints, we identify inflection points in disambiguation performance for each LM in the suite; in 14M and 410M, we identify heads whose attention to disambiguating words covaries with overall disambiguation performance across development. We then stress-test the robustness of these heads to stimulus perturbations: in 14M, we find limited robustness, but in 410M, we identify multiple heads with surprisingly generalizable behavior. Then, in a causal analysis, we find that ablating the target heads demonstrably impairs disambiguation performance, particularly in 14M. We additionally reproduce developmental analyses of 14M across all of its random seeds. Together, these results suggest: that disambiguation benefits from a constellation of mechanisms, some of which (especially in 14M) are highly sensitive to the position and part-of-speech of the disambiguating cue; and that larger models (410M) may contain heads with more robust disambiguation behavior. They also join a growing body of work that highlights the value of adopting a developmental perspective when probing LM mechanisms.

Title: DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models

Authors: Futian Wang, Chaoliu Weng, Xiao Wang, Zhen Chen, Zhicheng Zhao, Jin Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.21982
Pdf URL: https://arxiv.org/pdf/2511.21982
Copy Paste: [[2511.21982]] DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models(https://arxiv.org/abs/2511.21982)
Keywords: robust
Abstract: The precise reading recognition of pointer meters plays a key role in smart power systems, but existing approaches remain fragile due to challenges like reflections, occlusions, dynamic viewing angles, and overly between thin pointers and scale markings. Up to now, this area still lacks large-scale datasets to support the development of robust algorithms. To address these challenges, this paper first presents a new large-scale benchmark dataset for dial reading, termed RPM-10K, which contains 10730 meter images that fully reflect the aforementioned key challenges. Built upon the dataset, we propose a novel vision-language model for pointer meter reading recognition, termed MRLM, based on physical relation injection. Instead of exhaustively learning image-level correlations, MRLM explicitly encodes the geometric and causal relationships between the pointer and the scale, aligning perception with physical reasoning in the spirit of world-model perspectives. Through cross-attentional fusion and adaptive expert selection, the model learns to interpret dial configurations and generate precise numeric readings. Extensive experiments fully validated the effectiveness of our proposed framework on the newly proposed benchmark dataset. Both the dataset and source code will be released on this https URL

Title: PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation

Authors: Xuchen Li, Hengrui Gu, Mohan Zhang, Qin Liu, Zhen Tan, Xinyuan Zhu, Huixue Zhou, Tianlong Chen, Kaixiong Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21984
Pdf URL: https://arxiv.org/pdf/2511.21984
Copy Paste: [[2511.21984]] PPBoost: Progressive Prompt Boosting for Text-Driven Medical Image Segmentation(https://arxiv.org/abs/2511.21984)
Keywords: segmentation
Abstract: Text-prompted foundation models for medical image segmentation offer an intuitive way to delineate anatomical structures from natural language queries, but their predictions often lack spatial precision and degrade under domain shift. In contrast, visual-prompted models achieve strong segmentation performance across diverse modalities by leveraging spatial cues of precise bounding-box (bbox) prompts to guide the segmentation of target lesions. However, it is costly and challenging to obtain the precise visual prompts in clinical practice. We propose PPBoost (Progressive Prompt-Boosting), a framework that bridges these limitations by transforming weak text-derived signals into strong, spatially grounded visual prompts, operating under a strict zero-shot regime with no image- or pixel-level segmentation labels. PPBoost first uses a vision-language model to produce initial pseudo-bboxes conditioned on the textual object descriptions and applies an uncertainty-aware criterion to filter unreliable predictions. The retained image-bboxes pairs are then leveraged to train a pseudo-labeled detector, producing the high-quality bboxes for the query images. During inference, PPBoost further refines the generated bboxes by appropriately expanding them to tightly cover the target anatomical structures. The enhanced spatially-grounding bbox prompts guide existing segmentation models to generate final dense masks, effectively amplifying weak text cues into strong spatial guidance. Across three datasets spanning diverse modalities and anatomies, PPBoost consistently improves Dice and Normalized Surface Distance over text- and visual-prompted baselines and, notably, surpasses few-shot segmentation models without using labeled data. PPBoost can generalize to multiple typical visual segmentation model backbones.

Title: A Safety and Security Framework for Real-World Agentic Systems

Authors: Shaona Ghosh, Barnaby Simkin, Kyriacos Shiarlis, Soumili Nandi, Dan Zhao, Matthew Fiedler, Julia Bazinska, Nikki Pope, Roopa Prabhu, Daniel Rohrer, Michael Demoret, Bartley Richardson
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2511.21990
Pdf URL: https://arxiv.org/pdf/2511.21990
Copy Paste: [[2511.21990]] A Safety and Security Framework for Real-World Agentic Systems(https://arxiv.org/abs/2511.21990)
Keywords: security, defense, attack
Abstract: This paper introduces a dynamic and actionable framework for securing agentic AI systems in enterprise deployment. We contend that safety and security are not merely fixed attributes of individual models but also emergent properties arising from the dynamic interactions among models, orchestrators, tools, and data within their operating environments. We propose a new way of identification of novel agentic risks through the lens of user safety. Although, for traditional LLMs and agentic models in isolation, safety and security has a clear separation, through the lens of safety in agentic systems, they appear to be connected. Building on this foundation, we define an operational agentic risk taxonomy that unifies traditional safety and security concerns with novel, uniquely agentic risks, including tool misuse, cascading action chains, and unintended control amplification among others. At the core of our approach is a dynamic agentic safety and security framework that operationalizes contextual agentic risk management by using auxiliary AI models and agents, with human oversight, to assist in contextual risk discovery, evaluation, and mitigation. We further address one of the most challenging aspects of safety and security of agentic systems: risk discovery through sandboxed, AI-driven red teaming. We demonstrate the framework effectiveness through a detailed case study of NVIDIA flagship agentic research assistant, AI-Q Research Assistant, showcasing practical, end-to-end safety and security evaluations in complex, enterprise-grade agentic workflows. This risk discovery phase finds novel agentic risks that are then contextually mitigated. We also release the dataset from our case study, containing traces of over 10,000 realistic attack and defense executions of the agentic workflow to help advance research in agentic safety.

Title: Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

Authors: Apratim Bhattacharyya, Bicheng Xu, Sanjay Haresh, Reza Pourreza, Litian Liu, Sunny Panchal, Pulkit Madan, Leonid Sigal, Roland Memisevic
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.21998
Pdf URL: https://arxiv.org/pdf/2511.21998
Copy Paste: [[2511.21998]] Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?(https://arxiv.org/abs/2511.21998)
Keywords: large language model
Abstract: Multi-modal Large Language Models (LLM) have advanced conversational abilities but struggle with providing live, interactive step-by-step guidance, a key capability for future AI assistants. Effective guidance requires not only delivering instructions but also detecting their successful execution, as well as identifying and alerting users to mistakes, all of which has to happen in real-time. This requires models that are not turn-based, but that can react asynchronously to a video stream, as well as video data showing users performing tasks including mistakes and their corrections. To this end, we introduce Qualcomm Interactive Cooking, a new benchmark and dataset built upon CaptainCook4D, which contains user mistakes during task execution. Our dataset and benchmark features densely annotated, timed instructions and feedback messages, specifically including mistake alerts precisely timestamped to their visual occurrence in the video. We evaluate state-of-the-art multi-modal LLMs on the Qualcomm Interactive Cooking benchmark and introduce LiveMamba, a streaming multi-modal LLM designed for interactive instructional guidance. This work provides the first dedicated benchmark and a strong baseline for developing and evaluating on live, situated coaching.

Title: GECKO: Securing Digital Assets Through(out) the Physical World (Extended Technical Report)

Authors: Cyrill Krähenbühl, Nico Hauser, Christelle Gloor, Juan Angel García-Pardo, Adrian Perrig
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.21999
Pdf URL: https://arxiv.org/pdf/2511.21999
Copy Paste: [[2511.21999]] GECKO: Securing Digital Assets Through(out) the Physical World (Extended Technical Report)(https://arxiv.org/abs/2511.21999)
Keywords: secure, protect
Abstract: Although our lives are increasingly transitioning into the digital world, many digital assets still relate to objects or places in the physical world, e.g., websites of stores or restaurants, digital documents claiming property ownership, or digital identifiers encoded in QR codes for mobile payments in shops. Currently, users cannot securely associate digital assets with their related physical space, leading to problems such as fake brand stores, property fraud, and mobile payment scams. In many cases, the necessary information to protect digital assets exists, e.g., via contractual relationships and cadaster entries, but there is currently no uniform way of retrieving and verifying these documents. In this work, we propose the Geo-Enabled Cryptographic Key Oracle (GECKO), a geographical PKI that provides a global view of digital assets based on their geo-location and occupied space. GECKO allows for the bidirectional translation of trust between the physical and digital world. Users can verify which assets are supposed to exist at their location, as well as verify which physical space is claimed by a digital entity. GECKO supplements current PKI systems and can be used in addition to current systems when its properties are of value. We show the feasibility of efficiently storing millions of assets and serving cryptographic material based on precise location queries within 11 ms at a rate of more than 19000 queries per second on a single server.

Title: StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation

Authors: Sen Fang, Hongbin Zhong, Yalin Feng, Dimitris N. Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22009
Pdf URL: https://arxiv.org/pdf/2511.22009
Copy Paste: [[2511.22009]] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation(https://arxiv.org/abs/2511.22009)
Keywords: diffusion, generative
Abstract: New technologies such as Rectified Flow and Flow Matching have significantly improved the performance of generative models in the past two years, especially in terms of control accuracy, generation quality, and generation efficiency. However, due to some differences in its theory, design, and existing diffusion models, the existing acceleration methods cannot be directly applied to the Rectified Flow model. In this article, we have comprehensively implemented an overall acceleration pipeline from the aspects of theory, design, and reasoning strategies. This pipeline uses new methods such as batch processing with a new velocity field, vectorization of heterogeneous time-step batch processing, and dynamic TensorRT compilation for the new methods to comprehensively accelerate related models based on flow models. Currently, the existing public methods usually achieve an acceleration of 18%, while experiments have proved that our new method can accelerate the 512*512 image generation speed to up to 611%, which is far beyond the current non-generalized acceleration methods.

Title: AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models

Authors: Yann Le Beux, Oluchi Audu, Oche D. Ankeli, Dhananjay Balakrishnan, Melissah Weya, Marie D. Ralaiarinosy, Ignatius Ezeani
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22016
Pdf URL: https://arxiv.org/pdf/2511.22016
Copy Paste: [[2511.22016]] AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models(https://arxiv.org/abs/2511.22016)
Keywords: large language model
Abstract: Existing AI bias evaluation benchmarks largely reflect Western perspectives, leaving African contexts underrepresented and enabling harmful stereotypes in applications across various domains. To address this gap, we introduce AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community engaged efforts across Senegal, Kenya, and Nigeria, we collected 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. Using few-shot prompting with human-in-the-loop validation, we augmented the dataset to over 5,000 stereotype-antistereotype pairs. Entries were validated through semantic clustering and manual annotation by culturally informed reviewers. Preliminary evaluation of language models reveals that nine of eleven models exhibit statistically significant bias, with Bias Preference Ratios (BPR) ranging from 0.63 to 0.78 (p <= 0.05), indicating systematic preferences for stereotypes over antistereotypes, particularly across age, profession, and gender dimensions. Domain-specific models appeared to show weaker bias in our setup, suggesting task-specific training may mitigate some associations. Looking ahead, AfriStereo opens pathways for future research on culturally grounded bias evaluation and mitigation, offering key methodologies for the AI community on building more equitable, context-aware, and globally inclusive NLP technologies.

Title: POLARIS: Cross-Domain Access Control via Verifiable Identity and Policy-Based Authorization

Authors: Aiyao Zhang, Xiaodong Lee, Zhixian Zhuang, Jiuqi Wei, Yufan Fu, Botao Peng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22017
Pdf URL: https://arxiv.org/pdf/2511.22017
Copy Paste: [[2511.22017]] POLARIS: Cross-Domain Access Control via Verifiable Identity and Policy-Based Authorization(https://arxiv.org/abs/2511.22017)
Keywords: secure, security, privacy, attack
Abstract: Access control is a security mechanism designed to ensure that only authorized users can access specific resources. Cross-domain access control involves access to resources across different organizations, institutions, or applications. Traditional access control, however, which handles authentication and authorization separately in centralized environments, faces challenges in identity dispersion, privacy leakage, and diversified permission requirements, failing to adapt to cross-domain scenarios. Thus, there is an urgent need for a new access control mechanism that empowers autonomous control over user identity and resources, addressing the demands for privacy-preserving authentication and flexible authorization in cross-domain scenarios. To address cross-domain access control challenges, we propose POLARIS, a unified and extensible architecture that enables policy-based, verifiable and privacy-preserving access control across different domains. POLARIS features a structured commitment mechanism for reliable, fine-grained, policy-based identity disclosure. It further introduces VPPL, a lightweight policy language that supports issuer-bound evaluation of selectively revealed attributes. A dedicated session-level security mechanism ensures binding between authentication and access, enhancing confidentiality and resilience to replay attacks. We implement a working prototype and conduct comprehensive experiments, demonstrating that POLARIS effectively provides scalable, privacy-preserving, and interoperable access control across heterogeneous domains. Our results highlight the practical viability of POLARIS for enabling secure and privacy-preserving access control in decentralized, cross-domain environments.

Title: Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models

Authors: Zhenxiang Lin, Maryam Haghighat, Will Browne, Dimity Miller
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22019
Pdf URL: https://arxiv.org/pdf/2511.22019
Copy Paste: [[2511.22019]] Intra-Class Probabilistic Embeddings for Uncertainty Estimation in Vision-Language Models(https://arxiv.org/abs/2511.22019)
Keywords: robust
Abstract: Vision-language models (VLMs), such as CLIP, have gained popularity for their strong open vocabulary classification performance, but they are prone to assigning high confidence scores to misclassifications, limiting their reliability in safety-critical applications. We introduce a training-free, post-hoc uncertainty estimation method for contrastive VLMs that can be used to detect erroneous predictions. The key to our approach is to measure visual feature consistency within a class, using feature projection combined with multivariate Gaussians to create class-specific probabilistic embeddings. Our method is VLM-agnostic, requires no fine-tuning, demonstrates robustness to distribution shift, and works effectively with as few as 10 training images per class. Extensive experiments on ImageNet, Flowers102, Food101, EuroSAT and DTD show state-of-the-art error detection performance, significantly outperforming both deterministic and probabilistic VLM baselines. Code is available at this https URL.

Title: Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation

Authors: Joel Alberto Santos, Zongwei Wu, Xavier Alameda-Pineda, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22025
Pdf URL: https://arxiv.org/pdf/2511.22025
Copy Paste: [[2511.22025]] Layover or Direct Flight: Rethinking Audio-Guided Image Segmentation(https://arxiv.org/abs/2511.22025)
Keywords: robust, segmentation
Abstract: Understanding human instructions is essential for enabling smooth human-robot interaction. In this work, we focus on object grounding, i.e., localizing an object of interest in a visual scene (e.g., an image) based on verbal human instructions. Despite recent progress, a dominant research trend relies on using text as an intermediate representation. These approaches typically transcribe speech to text, extract relevant object keywords, and perform grounding using models pretrained on large text-vision datasets. However, we question both the efficiency and robustness of such transcription-based pipelines. Specifically, we ask: Can we achieve direct audio-visual alignment without relying on text? To explore this possibility, we simplify the task by focusing on grounding from single-word spoken instructions. We introduce a new audio-based grounding dataset that covers a wide variety of objects and diverse human accents. We then adapt and benchmark several models from the closely audio-visual field. Our results demonstrate that direct grounding from audio is not only feasible but, in some cases, even outperforms transcription-based methods, especially in terms of robustness to linguistic variability. Our findings encourage a renewed interest in direct audio grounding and pave the way for more robust and efficient multimodal understanding systems.

Title: Calibration-Free EEG-based Driver Drowsiness Detection with Online Test-Time Adaptation

Authors: Geun-Deok Jang, Dong-Kyun Han, Seo-Hyeon Park, Seong-Whan Lee
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22030
Pdf URL: https://arxiv.org/pdf/2511.22030
Copy Paste: [[2511.22030]] Calibration-Free EEG-based Driver Drowsiness Detection with Online Test-Time Adaptation(https://arxiv.org/abs/2511.22030)
Keywords: robust
Abstract: Drowsy driving is a growing cause of traffic accidents, prompting recent exploration of electroencephalography (EEG)-based drowsiness detection systems. However, the inherent variability of EEG signals due to psychological and physical factors necessitates a cumbersome calibration process. In particular, the inter-subject variability of EEG signals leads to a domain shift problem, which makes it challenging to generalize drowsiness detection models to unseen target subjects. To address these issues, we propose a novel driver drowsiness detection framework that leverages online test-time adaptation (TTA) methods to dynamically adjust to target subject distributions. Our proposed method updates the learnable parameters in batch normalization (BN) layers, while preserving pretrained normalization statistics, resulting in a modified configuration that ensures effective adaptation during test time. We incorporate a memory bank that dynamically manages streaming EEG segments, selecting samples based on their reliability determined by negative energy scores and persistence time. In addition, we introduce prototype learning to ensure robust predictions against distribution shifts over time. We validated our method on the sustained-attention driving dataset collected in a simulated environment, where drowsiness was estimated from delayed reaction times during monotonous lane-keeping tasks. Our experiments show that our method outperforms all baselines, achieving an average F1-score of 81.73\%, an improvement of 11.73\% over the best TTA baseline. This demonstrates that our proposed method significantly enhances the adaptability of EEG-based drowsiness detection systems in non-i.i.d. scenarios.

Title: Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing

Authors: Rochana Chaturvedi, Yue Zhou, Andrew Boyd, Brian T. Layden, Mudassir Rashid, Lu Cheng, Ali Cinar, Barbara Di Eugenio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22038
Pdf URL: https://arxiv.org/pdf/2511.22038
Copy Paste: [[2511.22038]] Early Risk Prediction with Temporally and Contextually Grounded Clinical Language Processing(https://arxiv.org/abs/2511.22038)
Keywords: privacy, fair, large language model
Abstract: Clinical notes in Electronic Health Records (EHRs) capture rich temporal information on events, clinician reasoning, and lifestyle factors often missing from structured data. Leveraging them for predictive modeling can be impactful for timely identification of chronic diseases. However, they present core natural language processing (NLP) challenges: long text, irregular event distribution, complex temporal dependencies, privacy constraints, and resource limitations. We present two complementary methods for temporally and contextually grounded risk prediction from longitudinal notes. First, we introduce HiTGNN, a hierarchical temporal graph neural network that integrates intra-note temporal event structures, inter-visit dynamics, and medical knowledge to model patient trajectories with fine-grained temporal granularity. Second, we propose ReVeAL, a lightweight, test-time framework that distills the reasoning of large language models into smaller verifier models. Applied to opportunistic screening for Type 2 Diabetes (T2D) using temporally realistic cohorts curated from private and public hospital corpora, HiTGNN achieves the highest predictive accuracy, especially for near-term risk, while preserving privacy and limiting reliance on large proprietary models. ReVeAL enhances sensitivity to true T2D cases and retains explanatory reasoning. Our ablations confirm the value of temporal structure and knowledge augmentation, and fairness analysis shows HiTGNN performs more equitably across subgroups.

Title: SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model

Authors: Jiayuan Du, Yiming Zhao, Zhenglong Guo, Yong Pan, Wenbo Hou, Zhihui Hao, Kun Zhan, Qijun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22039
Pdf URL: https://arxiv.org/pdf/2511.22039
Copy Paste: [[2511.22039]] SparseWorld-TC: Trajectory-Conditioned Sparse Occupancy World Model(https://arxiv.org/abs/2511.22039)
Keywords: robust, transformer
Abstract: This paper introduces a novel architecture for trajectory-conditioned forecasting of future 3D scene occupancy. In contrast to methods that rely on variational autoencoders (VAEs) to generate discrete occupancy tokens, which inherently limit representational capacity, our approach predicts multi-frame future occupancy in an end-to-end manner directly from raw image features. Inspired by the success of attention-based transformer architectures in foundational vision and language models such as GPT and VGGT, we employ a sparse occupancy representation that bypasses the intermediate bird's eye view (BEV) projection and its explicit geometric priors. This design allows the transformer to capture spatiotemporal dependencies more effectively. By avoiding both the finite-capacity constraint of discrete tokenization and the structural limitations of BEV representations, our method achieves state-of-the-art performance on the nuScenes benchmark for 1-3 second occupancy forecasting, outperforming existing approaches by a significant margin. Furthermore, it demonstrates robust scene dynamics understanding, consistently delivering high accuracy under arbitrary future trajectory conditioning.

Title: Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression

Authors: Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22044
Pdf URL: https://arxiv.org/pdf/2511.22044
Copy Paste: [[2511.22044]] Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression(https://arxiv.org/abs/2511.22044)
Keywords: security, attack, large language model
Abstract: In the realm of black-box jailbreak attacks on large language models (LLMs), the feasibility of constructing a narrow safety proxy, a lightweight model designed to predict the attack success rate (ASR) of adversarial prompts, remains underexplored. This work investigates the distillability of an LLM's core security logic. We propose a novel framework that incorporates an improved outline filling attack to achieve dense sampling of the model's security boundaries. Furthermore, we introduce a ranking regression paradigm that replaces standard regression and trains the proxy model to predict which prompt yields a higher ASR. Experimental results show that our proxy model achieves an accuracy of 91.1 percent in predicting the relative ranking of average long response (ALR), and 69.2 percent in predicting ASR. These findings confirm the predictability and distillability of jailbreak behaviors, and demonstrate the potential of leveraging such distillability to optimize black-box attacks.

Title: Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks

Authors: Richard J. Young
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22047
Pdf URL: https://arxiv.org/pdf/2511.22047
Copy Paste: [[2511.22047]] Evaluating the Robustness of Large Language Model Safety Guardrails Against Adversarial Attacks(https://arxiv.org/abs/2511.22047)
Keywords: defense, attack, robust, large language model
Abstract: Large Language Model (LLM) safety guardrail models have emerged as a primary defense mechanism against harmful content generation, yet their robustness against sophisticated adversarial attacks remains poorly characterized. This study evaluated ten publicly available guardrail models from Meta, Google, IBM, NVIDIA, Alibaba, and Allen AI across 1,445 test prompts spanning 21 attack categories. While Qwen3Guard-8B achieved the highest overall accuracy (85.3%, 95% CI: 83.4-87.1%), a critical finding emerged when separating public benchmark prompts from novel attacks: all models showed substantial performance degradation on unseen prompts, with Qwen3Guard dropping from 91.0% to 33.8% (a 57.2 percentage point gap). In contrast, Granite-Guardian-3.2-5B showed the best generalization with only a 6.5% gap. A "helpful mode" jailbreak was also discovered where two guardrail models (Nemotron-Safety-8B, Granite-Guardian-3.2-5B) generated harmful content instead of blocking it, representing a novel failure mode. These findings suggest that benchmark performance may be misleading due to training data contamination, and that generalization ability, not overall accuracy, should be the primary metric for guardrail evaluation.

Title: ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resoultion

Authors: Junoh Kang, Donghun Ryu, Bohyung Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22048
Pdf URL: https://arxiv.org/pdf/2511.22048
Copy Paste: [[2511.22048]] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resoultion(https://arxiv.org/abs/2511.22048)
Keywords: diffusion, generative
Abstract: Real world image super-resolution (Real-ISR) often leverages the powerful generative priors of text-to-image diffusion models by regularizing the output to lie on their learned manifold. However, existing methods often overlook the importance of the regularizing manifold, typically defaulting to a text-conditioned manifold. This approach suffers from two key limitations. Conceptually, it is misaligned with the Real-ISR task, which is to generate high quality (HQ) images directly tied to the low quality (LQ) images. Practically, the teacher model often reconstructs images with color distortions and blurred edges, indicating a flawed generative prior for this task. To correct these flaws and ensure conceptual alignment, a more suitable manifold must incorporate information from the images. While the most straightforward approach is to condition directly on the raw input images, their high information densities make the regularization process numerically unstable. To resolve this, we propose image-conditioned manifold regularization (ICM), a method that regularizes the output towards a manifold conditioned on the sparse yet essential structural information: a combination of colormap and Canny edges. ICM provides a task-aligned and stable regularization signal, thereby avoiding the instability of dense-conditioning and enhancing the final super-resolution quality. Our experiments confirm that the proposed regularization significantly enhances super-resolution performance, particularly in perceptual quality, demonstrating its effectiveness for real-world applications. We will release the source code of our work for reproducibility.

Title: OralGPT-Omni: A Versatile Dental Multimodal Large Language Model

Authors: Jing Hao, Yuci Liang, Lizhuo Lin, Yuxuan Fan, Wenkai Zhou, Kaixin Guo, Zanting Ye, Yanpeng Sun, Xinyu Zhang, Yanqi Yang, Qiankun Li, Hao Tang, James Kit-Hon Tsoi, Linlin Shen, Kuo Feng Hung
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2511.22055
Pdf URL: https://arxiv.org/pdf/2511.22055
Copy Paste: [[2511.22055]] OralGPT-Omni: A Versatile Dental Multimodal Large Language Model(https://arxiv.org/abs/2511.22055)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have exhibited immense potential across numerous medical specialties; yet, dentistry remains underexplored, in part due to limited domain-specific data, scarce dental expert annotations, insufficient modality-specific modeling, and challenges in reliability. In this paper, we present OralGPT-Omni, the first dental-specialized MLLM designed for comprehensive and trustworthy analysis across diverse dental imaging modalities and clinical tasks. To explicitly capture dentists' diagnostic reasoning, we construct TRACE-CoT, a clinically grounded chain-of-thought dataset that mirrors dental radiologists' decision-making processes. This reasoning supervision, combined with our proposed four-stage training paradigm, substantially strengthens the model's capacity for dental image understanding and analysis. In parallel, we introduce MMOral-Uni, the first unified multimodal benchmark for dental image analysis. It comprises 2,809 open-ended question-answer pairs spanning five modalities and five tasks, offering a comprehensive evaluation suite to date for MLLMs in digital dentistry. OralGPT-Omni achieves an overall score of 51.84 on the MMOral-Uni benchmark and 45.31 on the MMOral-OPG benchmark, dramatically outperforming the scores of GPT-5. Our work promotes intelligent dentistry and paves the way for future advances in dental image analysis. All code, benchmark, and models will be made publicly available.

Title: Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian

Authors: Yiran Zhang, Weihang Xu, Mo Zhou, Maryam Fazel, Simon Shaolei Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22069
Pdf URL: https://arxiv.org/pdf/2511.22069
Copy Paste: [[2511.22069]] Convergence Dynamics of Over-Parameterized Score Matching for a Single Gaussian(https://arxiv.org/abs/2511.22069)
Keywords: diffusion, generative
Abstract: Score matching has become a central training objective in modern generative modeling, particularly in diffusion models, where it is used to learn high-dimensional data distributions through the estimation of score functions. Despite its empirical success, the theoretical understanding of the optimization behavior of score matching, particularly in over-parameterized regimes, remains limited. In this work, we study gradient descent for training over-parameterized models to learn a single Gaussian distribution. Specifically, we use a student model with $n$ learnable parameters and train it on data generated from a single ground-truth Gaussian using the population score matching objective. We analyze the optimization dynamics under multiple regimes. When the noise scale is sufficiently large, we prove a global convergence result for gradient descent. In the low-noise regime, we identify the existence of a stationary point, highlighting the difficulty of proving global convergence in this case. Nevertheless, we show convergence under certain initialization conditions: when the parameters are initialized to be exponentially small, gradient descent ensures convergence of all parameters to the ground truth. We further prove that without the exponentially small initialization, the parameters may not converge to the ground truth. Finally, we consider the case where parameters are randomly initialized from a Gaussian distribution far from the ground truth. We prove that, with high probability, only one parameter converges while the others diverge, yet the loss still converges to zero with a $1/\tau$ rate, where $\tau$ is the number of iterations. We also establish a nearly matching lower bound on the convergence rate in this regime. This is the first work to establish global convergence guarantees for Gaussian mixtures with at least three components under the score matching framework.

Title: ARES: Anomaly Recognition Model For Edge Streams

Authors: Simone Mungari, Albert Bifet, Giuseppe Manco, Bernhard Pfahringer
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22078
Pdf URL: https://arxiv.org/pdf/2511.22078
Copy Paste: [[2511.22078]] ARES: Anomaly Recognition Model For Edge Streams(https://arxiv.org/abs/2511.22078)
Keywords: attack, extraction
Abstract: Many real-world scenarios involving streaming information can be represented as temporal graphs, where data flows through dynamic changes in edges over time. Anomaly detection in this context has the objective of identifying unusual temporal connections within the graph structure. Detecting edge anomalies in real time is crucial for mitigating potential risks. Unlike traditional anomaly detection, this task is particularly challenging due to concept drifts, large data volumes, and the need for real-time response. To face these challenges, we introduce ARES, an unsupervised anomaly detection framework for edge streams. ARES combines Graph Neural Networks (GNNs) for feature extraction with Half-Space Trees (HST) for anomaly scoring. GNNs capture both spike and burst anomalous behaviors within streams by embedding node and edge properties in a latent space, while HST partitions this space to isolate anomalies efficiently. ARES operates in an unsupervised way without the need for prior data labeling. To further validate its detection capabilities, we additionally incorporate a simple yet effective supervised thresholding mechanism. This approach leverages statistical dispersion among anomaly scores to determine the optimal threshold using a minimal set of labeled data, ensuring adaptability across different domains. We validate ARES through extensive evaluations across several real-world cyber-attack scenarios, comparing its performance against existing methods while analyzing its space and time complexity.

Title: A Fast and Flat Federated Learning Method via Weighted Momentum and Sharpness-Aware Minimization

Authors: Tianle Li, Yongzhi Huang, Linshan Jiang, Chang Liu, Qipeng Xie, Wenfeng Du, Lu Wang, Kaishun Wu
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2511.22080
Pdf URL: https://arxiv.org/pdf/2511.22080
Copy Paste: [[2511.22080]] A Fast and Flat Federated Learning Method via Weighted Momentum and Sharpness-Aware Minimization(https://arxiv.org/abs/2511.22080)
Keywords: robust, federate
Abstract: In federated learning (FL), models must \emph{converge quickly} under tight communication budgets while \emph{generalizing} across non-IID client distributions. These twin requirements have naturally led to two widely used techniques: client/server \emph{momentum} to accelerate progress, and \emph{sharpness-aware minimization} (SAM) to prefer flat solutions. However, simply combining momentum and SAM leaves two structural issues unresolved in non-IID FL. We identify and formalize two failure modes: \emph{local-global curvature misalignment} (local SAM directions need not reflect the global loss geometry) and \emph{momentum-echo oscillation} (late-stage instability caused by accumulated momentum). To our knowledge, these failure modes have not been jointly articulated and addressed in the FL literature. We propose \textbf{FedWMSAM} to address both failure modes. First, we construct a momentum-guided global perturbation from server-aggregated momentum to align clients' SAM directions with the global descent geometry, enabling a \emph{single-backprop} SAM approximation that preserves efficiency. Second, we couple momentum and SAM via a cosine-similarity adaptive rule, yielding an early-momentum, late-SAM two-phase training schedule. We provide a non-IID convergence bound that \emph{explicitly models the perturbation-induced variance} $\sigma_\rho^2=\sigma^2+(L\rho)^2$ and its dependence on $(S, K, R, N)$ on the theory side. We conduct extensive experiments on multiple datasets and model architectures, and the results validate the effectiveness, adaptability, and robustness of our method, demonstrating its superiority in addressing the optimization challenges of Federated Learning. Our code is available at this https URL.

Title: Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection

Authors: Michael J. Bommarito II
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22095
Pdf URL: https://arxiv.org/pdf/2511.22095
Copy Paste: [[2511.22095]] Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection(https://arxiv.org/abs/2511.22095)
Keywords: transformer
Abstract: Deep learning research for binary analysis faces a critical infrastructure gap. Today, existing datasets target single platforms, require specialized tooling, or provide only hand-engineered features incompatible with modern neural architectures; no single dataset supports accessible research and pedagogy on realistic use cases. To solve this, we introduce Binary-30K, the first heterogeneous binary dataset designed for sequence-based models like transformers. Critically, Binary-30K covers Windows, Linux, macOS, and Android across 15+ CPU architectures. With 29,793 binaries and approximately 26.93% malware representation, Binary-30K enables research on platform-invariant detection, cross-target transfer learning, and long-context binary understanding. The dataset provides pre-computed byte-level BPE tokenization alongside comprehensive structural metadata, supporting both sequence modeling and structure-aware approaches. Platform-first stratified sampling ensures representative coverage across operating systems and architectures, while distribution via Hugging Face with official train/validation/test splits enables reproducible benchmarking. The dataset is publicly available at this https URL, providing an accessible resource for researchers, practitioners, and students alike.

Title: WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

Authors: Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22098
Pdf URL: https://arxiv.org/pdf/2511.22098
Copy Paste: [[2511.22098]] WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation(https://arxiv.org/abs/2511.22098)
Keywords: diffusion, transformer
Abstract: Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments demonstrate that WorldWander achieves superior perspective synchronization, character consistency, and generalization, setting a new benchmark for egocentric-exocentric video translation.

Title: Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs

Authors: Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22099
Pdf URL: https://arxiv.org/pdf/2511.22099
Copy Paste: [[2511.22099]] Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs(https://arxiv.org/abs/2511.22099)
Keywords: privacy, protect, robust, fair, large language model
Abstract: Large language models (LLMs) have driven major advances across domains, yet their massive size hinders deployment in resource-constrained settings. Model compression addresses this challenge, with low-rank factorization emerging as a particularly effective method for reducing size, memory, and computation while maintaining accuracy. However, while these compressed models boast of benign performance and system-level advantages, their trustworthiness implications remain poorly understood. In this paper, we present the first comprehensive study of how low-rank factorization affects LLM trustworthiness across privacy, adversarial robustness, fairness, and ethical alignment. We evaluate multiple LLMs of different sizes and variants compressed with diverse low-rank algorithms, revealing key insights: (1) low-rank compression preserves or improves training data privacy but weakens PII protection during conversation; (2) adversarial robustness is generally preserved and often enhanced, even under deep compression; (3) ethical reasoning degrades in zero-shot settings but partially recovers with few-shot prompting; (4) fairness declines under compression. Beyond compression, we investigate how model scale and fine-tuning affect trustworthiness, as both are important in low-rank methods. To guide trustworthy compression strategies, we end our paper with a gradient-based attribution analysis to identify which layers in LLMs contribute most to adversarial robustness.

Title: MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation

Authors: Simon Joseph Clément Crête, Marta Kersten-Oertel, Yiming Xiao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22102
Pdf URL: https://arxiv.org/pdf/2511.22102
Copy Paste: [[2511.22102]] MRI-Based Brain Age Estimation with Supervised Contrastive Learning of Continuous Representation(https://arxiv.org/abs/2511.22102)
Keywords: generative
Abstract: MRI-based brain age estimation models aim to assess a subject's biological brain age based on information, such as neuroanatomical features. Various factors, including neurodegenerative diseases, can accelerate brain aging and measuring this phenomena could serve as a potential biomarker for clinical applications. While deep learning (DL)-based regression has recently attracted major attention, existing approaches often fail to capture the continuous nature of neuromorphological changes, potentially resulting in sub-optimal feature representation and results. To address this, we propose to use supervised contrastive learning with the recent Rank-N-Contrast (RNC) loss to estimate brain age based on widely used T1w structural MRI for the first time and leverage Grad-RAM to visually explain regression results. Experiments show that our proposed method achieves a mean absolute error (MAE) of 4.27 years and an $R^2$ of 0.93 with a limited dataset of training samples, significantly outperforming conventional deep regression with the same ResNet backbone while performing better or comparably with the state-of-the-art methods with significantly larger training data. Furthermore, Grad-RAM revealed more nuanced features related to age regression with the RNC loss than conventional deep regression. As an exploratory study, we employed the proposed method to estimate the gap between the biological and chronological brain ages in Alzheimer's Disease and Parkinson's disease patients, and revealed the correlation between the brain age gap and disease severity, demonstrating its potential as a biomarker in neurodegenerative disorders.

Title: MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding

Authors: Yu Li, Yuenan Hou, Yingmei Wei, Xinge Zhu, Yuexin Ma, Wenqi Shao, Yanming Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22103
Pdf URL: https://arxiv.org/pdf/2511.22103
Copy Paste: [[2511.22103]] MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding(https://arxiv.org/abs/2511.22103)
Keywords: transformer
Abstract: Multi-modal 3D understanding is a fundamental task in computer vision. Previous multi-modal fusion methods typically employ a single, dense fusion network, struggling to handle the significant heterogeneity and complexity across modalities, leading to suboptimal performance. In this paper, we propose MoE3D, which integrates Mixture of Experts (MoE) into the multi-modal learning framework. The core is that we deploy a set of specialized "expert" networks, each adept at processing a specific modality or a mode of cross-modal interaction. Specifically, the MoE-based transformer is designed to better utilize the complementary information hidden in the visual features. Information aggregation module is put forward to further enhance the fusion performance. Top-1 gating is employed to make one expert process features with expert groups, ensuring high efficiency. We further propose a progressive pre-training strategy to better leverage the semantic and 2D prior, thus equipping the network with good initialization. Our MoE3D achieves competitive performance across four prevalent 3D understanding tasks. Notably, our MoE3D surpasses the top-performing counterpart by 6.1 mIoU on Multi3DRefer.

Title: Energy Efficient Sleep Mode Optimization in 5G mmWave Networks via Multi Agent Deep Reinforcement Learning

Authors: Saad Masrur, Ismail Guvenc, David Lopez Perez
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22105
Pdf URL: https://arxiv.org/pdf/2511.22105
Copy Paste: [[2511.22105]] Energy Efficient Sleep Mode Optimization in 5G mmWave Networks via Multi Agent Deep Reinforcement Learning(https://arxiv.org/abs/2511.22105)
Keywords: fair
Abstract: Dynamic sleep mode optimization (SMO) in millimeter-wave (mmWave) networks is essential for maximizing energy efficiency (EE) under stringent quality-of-service (QoS) constraints. However, existing optimization and reinforcement learning (RL) approaches rely on aggregated, static base station (BS) traffic models that fail to capture non-stationary traffic dynamics and suffer from large state-action spaces, limiting real-world deployment. To address these challenges, this paper proposes a multi-agent deep reinforcement learning (MARL) framework using a Double Deep Q-Network (DDQN), referred to as MARL-DDQN, for adaptive SMO in a 3D urban environment with a time-varying and community-based user equipment (UE) mobility model. Unlike conventional single-agent RL, MARL-DDQN enables scalable, distributed decision-making with minimal signaling overhead. A realistic BS power consumption model and beamforming are integrated to accurately quantify EE, while QoS is defined in terms of throughput. The method adapts SMO policies to maximize EE while mitigating inter-cell interference and ensuring throughput fairness. Simulations show that MARL-DDQN outperforms state-of-the-art strategies, including All On, iterative QoS-aware load-based (IT-QoS-LB), MARL-DDPG, and MARL-PPO, achieving up to 0.60 Mbit/Joule EE, 8.5 Mbps 10th-percentile throughput, and meeting QoS constraints 95% of the time under dynamic scenarios.

Title: A Hybrid Theory and Data-driven Approach to Persuasion Detection with Large Language Models

Authors: Gia Bao Hoang, Keith J Ransom, Rachel Stephens, Carolyn Semmler, Nicolas Fay, Lewis Mitchell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22109
Pdf URL: https://arxiv.org/pdf/2511.22109
Copy Paste: [[2511.22109]] A Hybrid Theory and Data-driven Approach to Persuasion Detection with Large Language Models(https://arxiv.org/abs/2511.22109)
Keywords: large language model
Abstract: Traditional psychological models of belief revision focus on face-to-face interactions, but with the rise of social media, more effective models are needed to capture belief revision at scale, in this rich text-based online discourse. Here, we use a hybrid approach, utilizing large language models (LLMs) to develop a model that predicts successful persuasion using features derived from psychological experiments. Our approach leverages LLM generated ratings of features previously examined in the literature to build a random forest classification model that predicts whether a message will result in belief change. Of the eight features tested, \textit{epistemic emotion} and \textit{willingness to share} were the top-ranking predictors of belief change in the model. Our findings provide insights into the characteristics of persuasive messages and demonstrate how LLMs can enhance models of successful persuasion based on psychological theory. Given these insights, this work has broader applications in fields such as online influence detection and misinformation mitigation, as well as measuring the effectiveness of online narratives.

Title: IVGAE: Handling Incomplete Heterogeneous Data with a Variational Graph Autoencoder

Authors: Youran Zhou, Mohamed Reda Bouadjenek, Sunil Aryal%
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22116
Pdf URL: https://arxiv.org/pdf/2511.22116
Copy Paste: [[2511.22116]] IVGAE: Handling Incomplete Heterogeneous Data with a Variational Graph Autoencoder(https://arxiv.org/abs/2511.22116)
Keywords: robust, transformer
Abstract: Handling missing data remains a fundamental challenge in real-world tabular datasets, especially when data are heterogeneous with both numerical and categorical features. Existing imputation methods often fail to capture complex structural dependencies and handle heterogeneous data effectively. We present \textbf{IVGAE}, a Variational Graph Autoencoder framework for robust imputation of incomplete heterogeneous data. IVGAE constructs a bipartite graph to represent sample-feature relationships and applies graph representation learning to model structural dependencies. A key innovation is its \textit{dual-decoder architecture}, where one decoder reconstructs feature embeddings and the other models missingness patterns, providing structural priors aware of missing mechanisms. To better encode categorical variables, we introduce a Transformer-based heterogeneous embedding module that avoids high-dimensional one-hot encoding. Extensive experiments on 16 real-world datasets show that IVGAE achieves consistent improvements in RMSE and downstream F1 across MCAR, MAR, and MNAR missing scenarios under 30\% missing rates. Code and data are available at: this https URL.

Title: Privacy-preserving formal concept analysis: A homomorphic encryption-based concept construction

Authors: Qiangqiang Chen, Yunfeng Ke, Shen Li, Jinhai Li
Subjects: cs.CR, cs.CC
Abstract URL: https://arxiv.org/abs/2511.22117
Pdf URL: https://arxiv.org/pdf/2511.22117
Copy Paste: [[2511.22117]] Privacy-preserving formal concept analysis: A homomorphic encryption-based concept construction(https://arxiv.org/abs/2511.22117)
Keywords: secure, security, privacy, extraction
Abstract: Formal Concept Analysis (FCA) is extensively used in knowledge extraction, cognitive concept learning, and data mining. However, its computational demands on large-scale datasets often require outsourcing to external computing services, raising concerns about the leakage of sensitive information. To address this challenge, we propose a novel approach to enhance data security and privacy in FCA-based computations. Specifically, we introduce a Privacy-preserving Formal Context Analysis (PFCA) framework that combines binary data representation with homomorphic encryption techniques. This method enables secure and efficient concept construction without revealing private data. Experimental results and security analysis confirm the effectiveness of our approach in preserving privacy while maintaining computational performance. These findings have important implications for privacy-preserving data mining and secure knowledge discovery in large-scale FCA applications.

Title: PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization

Authors: Mingzhe Li, Renhao Zhang, Zhiyang Wen, Siqi Pan, Bruno Castro da Silva, Juan Zhai, Shiqing Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22119
Pdf URL: https://arxiv.org/pdf/2511.22119
Copy Paste: [[2511.22119]] PROMPTMINER: Black-Box Prompt Stealing against Text-to-Image Generative Models via Reinforcement Learning and Fuzz Optimization(https://arxiv.org/abs/2511.22119)
Keywords: security, attack, robust, steal, extraction, watermark, diffusion, generative
Abstract: Text-to-image (T2I) generative models such as Stable Diffusion and FLUX can synthesize realistic, high-quality images directly from textual prompts. The resulting image quality depends critically on well-crafted prompts that specify both subjects and stylistic modifiers, which have become valuable digital assets. However, the rising value and ubiquity of high-quality prompts expose them to security and intellectual-property risks. One key threat is the prompt stealing attack, i.e., the task of recovering the textual prompt that generated a given image. Prompt stealing enables unauthorized extraction and reuse of carefully engineered prompts, yet it can also support beneficial applications such as data attribution, model provenance analysis, and watermarking validation. Existing approaches often assume white-box gradient access, require large-scale labeled datasets for supervised training, or rely solely on captioning without explicit optimization, limiting their practicality and adaptability. To address these challenges, we propose PROMPTMINER, a black-box prompt stealing framework that decouples the task into two phases: (1) a reinforcement learning-based optimization phase to reconstruct the primary subject, and (2) a fuzzing-driven search phase to recover stylistic modifiers. Experiments across multiple datasets and diffusion backbones demonstrate that PROMPTMINER achieves superior results, with CLIP similarity up to 0.958 and textual alignment with SBERT up to 0.751, surpassing all baselines. Even when applied to in-the-wild images with unknown generators, it outperforms the strongest baseline by 7.5 percent in CLIP similarity, demonstrating better generalization. Finally, PROMPTMINER maintains strong performance under defensive perturbations, highlighting remarkable robustness. Code: this https URL

Title: Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation

Authors: Xiang Li, Zirui Wang, Zixuan Huang, James M. Rehg
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22121
Pdf URL: https://arxiv.org/pdf/2511.22121
Copy Paste: [[2511.22121]] Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation(https://arxiv.org/abs/2511.22121)
Keywords: robust, generative
Abstract: Humans and traditional computer vision methods rely on a diverse set of monocular cues to infer 3D structure from a single image, such as shading, texture, silhouette, etc. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.

Title: A Variational Manifold Embedding Framework for Nonlinear Dimensionality Reduction

Authors: John J. Vastola, Samuel J. Gershman, Kanaka Rajan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22128
Pdf URL: https://arxiv.org/pdf/2511.22128
Copy Paste: [[2511.22128]] A Variational Manifold Embedding Framework for Nonlinear Dimensionality Reduction(https://arxiv.org/abs/2511.22128)
Keywords: interpretability
Abstract: Dimensionality reduction algorithms like principal component analysis (PCA) are workhorses of machine learning and neuroscience, but each has well-known limitations. Variants of PCA are simple and interpretable, but not flexible enough to capture nonlinear data manifold structure. More flexible approaches have other problems: autoencoders are generally difficult to interpret, and graph-embedding-based methods can produce pathological distortions in manifold geometry. Motivated by these shortcomings, we propose a variational framework that casts dimensionality reduction algorithms as solutions to an optimal manifold embedding problem. By construction, this framework permits nonlinear embeddings, allowing its solutions to be more flexible than PCA. Moreover, the variational nature of the framework has useful consequences for interpretability: each solution satisfies a set of partial differential equations, and can be shown to reflect symmetries of the embedding objective. We discuss these features in detail and show that solutions can be analytically characterized in some cases. Interestingly, one special case exactly recovers PCA.

Title: Probabilistic Digital Twin for Misspecified Structural Dynamical Systems via Latent Force Modeling and Bayesian Neural Networks

Authors: Sahil Kashyap, Rajdip Nayek
Subjects: cs.LG, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2511.22133
Pdf URL: https://arxiv.org/pdf/2511.22133
Copy Paste: [[2511.22133]] Probabilistic Digital Twin for Misspecified Structural Dynamical Systems via Latent Force Modeling and Bayesian Neural Networks(https://arxiv.org/abs/2511.22133)
Keywords: robust
Abstract: This work presents a probabilistic digital twin framework for response prediction in dynamical systems governed by misspecified physics. The approach integrates Gaussian Process Latent Force Models (GPLFM) and Bayesian Neural Networks (BNNs) to enable end-to-end uncertainty-aware inference and prediction. In the diagnosis phase, model-form errors (MFEs) are treated as latent input forces to a nominal linear dynamical system and jointly estimated with system states using GPLFM from sensor measurements. A BNN is then trained on posterior samples to learn a probabilistic nonlinear mapping from system states to MFEs, while capturing diagnostic uncertainty. For prognosis, this mapping is used to generate pseudo-measurements, enabling state prediction via Kalman filtering. The framework allows for systematic propagation of uncertainty from diagnosis to prediction, a key capability for trustworthy digital twins. The framework is demonstrated using four nonlinear examples: a single degree of freedom (DOF) oscillator, a multi-DOF system, and two established benchmarks -- the Bouc-Wen hysteretic system and the Silverbox experimental dataset -- highlighting its predictive accuracy and robustness to model misspecification.

Title: EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation

Authors: Yanchao Zhao, Jihao Zhu, Yu Liu, Weizhuo Chen, Yuling Yang, Kun Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22135
Pdf URL: https://arxiv.org/pdf/2511.22135
Copy Paste: [[2511.22135]] EASL: Multi-Emotion Guided Semantic Disentanglement for Expressive Sign Language Generation(https://arxiv.org/abs/2511.22135)
Keywords: diffusion, large language model
Abstract: Large language models have revolutionized sign language generation by automatically transforming text into high-quality sign language videos, providing accessible communication for the Deaf community. However, existing LLM-based approaches prioritize semantic accuracy while overlooking emotional expressions, resulting in outputs that lack naturalness and expressiveness. We propose EASL (Emotion-Aware Sign Language), a multi-emotion-guided generation architecture for fine-grained emotional integration. We introduce emotion-semantic disentanglement modules with progressive training to separately extract semantic and affective features. During pose decoding, the emotional representations guide semantic interaction to generate sign poses with 7-class emotion confidence scores, enabling emotional expression recognition. Experimental results demonstrate that EASL achieves pose accuracy superior to all compared baselines by integrating multi-emotion information and effectively adapts to diffusion models to generate expressive sign language videos.

Title: TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices

Authors: Mohd Ariful Haque (1), Fahad Rahman (2), Kishor Datta Gupta (1), Khalil Shujaee (1), Roy George (1) ((1) Clark Atlanta University, (2) United International University)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22138
Pdf URL: https://arxiv.org/pdf/2511.22138
Copy Paste: [[2511.22138]] TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices(https://arxiv.org/abs/2511.22138)
Keywords: privacy
Abstract: This paper investigates the effectiveness of small language models (SLMs) for agentic tasks (function/tool/API calling) with a focus on running agents on edge devices without reliance on cloud infrastructure. We evaluate SLMs using the Berkeley Function Calling Leaderboard (BFCL) framework and describe parameter-driven optimization strategies that include supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), reinforcement learning (RL)-based optimization, preference alignment via Direct Preference Optimization (DPO), and hybrid methods. We report results for models including TinyAgent, TinyLlama, Qwen, and xLAM across BFCL categories (simple, multiple, parallel, parallel-multiple, and relevance detection), both in live and non-live settings, and in multi-turn evaluations. We additionally detail a DPO training pipeline constructed from AgentBank data (e.g., ALFRED), including our conversion of SFT data to chosen-rejected pairs using TinyLlama responses as rejected outputs and manual validation. Our results demonstrate clear accuracy differences across model scales where medium-sized models (1-3B parameters) significantly outperform ultra-compact models (<1B parameters), achieving up to 65.74% overall accuracy, and 55.62% multi-turn accuracy with hybrid optimization. This study highlights the importance of hybrid optimization strategies that enable small language models to deliver accurate, efficient, and stable agentic AI on edge devices, making privacy-preserving, low-latency autonomous agents practical beyond the cloud.

Title: C$^2$DLM: Causal Concept-Guided Diffusion Large Language Models

Authors: Kairong Han, Nuanqiao Shan, Ziyu Zhao, Zijing Hu, Xinpeng Dong, Junjian Ye, Lujia Pan, Fei Wu, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22146
Pdf URL: https://arxiv.org/pdf/2511.22146
Copy Paste: [[2511.22146]] C$^2$DLM: Causal Concept-Guided Diffusion Large Language Models(https://arxiv.org/abs/2511.22146)
Keywords: diffusion, large language model
Abstract: Autoregressive (AR) language models and Diffusion Language Models (DLMs) constitute the two principal paradigms of large language models. However, both paradigms suffer from insufficient reasoning capabilities. Human reasoning inherently relies on causal knowledge and thought, which are reflected in natural language. But in the AR paradigm, language is modeled as next token prediction (a strictly left-to-right, token-by-token order), whereas natural language itself exhibits more flexible causal structures. In the DLM paradigm, the attention mechanism is fully connected, which entirely disregards causal order. To fill this gap, we propose a \underline{\textbf{C}}ausal \underline{\textbf{C}}oncept-Guided \underline{\textbf{D}}iffusion \underline{\textbf{L}}anguage \underline{\textbf{M}}odel (C$^2$DLM). Starting from DLM's fully connected attention, C$^2$DLM first obtains a concept-level causal graph from the teacher model, and then explicitly guides attention to learn causal relationships between concepts. By focusing on causal relationships and avoiding interference from difficult subgoals involving causal inversion, C$^2$DLM improves 12\% with about 3.2 times training speedup in the COT-OrderPerturb task, and achieves an average gain of 1.31\% across six downstream reasoning tasks. More details in the repository ~\href{this https URL}{here}.

Title: RemedyGS: Defend 3D Gaussian Splatting against Computation Cost Attacks

Authors: Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2511.22147
Pdf URL: https://arxiv.org/pdf/2511.22147
Copy Paste: [[2511.22147]] RemedyGS: Defend 3D Gaussian Splatting against Computation Cost Attacks(https://arxiv.org/abs/2511.22147)
Keywords: defense, attack
Abstract: As a mainstream technique for 3D reconstruction, 3D Gaussian splatting (3DGS) has been applied in a wide range of applications and services. Recent studies have revealed critical vulnerabilities in this pipeline and introduced computation cost attacks that lead to malicious resource occupancies and even denial-of-service (DoS) conditions, thereby hindering the reliable deployment of 3DGS. In this paper, we propose the first effective and comprehensive black-box defense framework, named RemedyGS, against such computation cost attacks, safeguarding 3DGS reconstruction systems and services. Our pipeline comprises two key components: a detector to identify the attacked input images with poisoned textures and a purifier to recover the benign images from their attacked counterparts, mitigating the adverse effects of these attacks. Moreover, we incorporate adversarial training into the purifier to enforce distributional alignment between the recovered and original natural images, thereby enhancing the defense efficacy. Experimental results demonstrate that our framework effectively defends against white-box, black-box, and adaptive attacks in 3DGS systems, achieving state-of-the-art performance in both safety and utility.

Title: From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures

Authors: Florian Rottach, William Rudman, Bastain Rieck, Harrisen Scells, Carsten Eickhoff
Subjects: cs.LG, cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2511.22150
Pdf URL: https://arxiv.org/pdf/2511.22150
Copy Paste: [[2511.22150]] From Topology to Retrieval: Decoding Embedding Spaces with Unified Signatures(https://arxiv.org/abs/2511.22150)
Keywords: interpretability
Abstract: Studying how embeddings are organized in space not only enhances model interpretability but also uncovers factors that drive downstream task performance. In this paper, we present a comprehensive analysis of topological and geometric measures across a wide set of text embedding models and datasets. We find a high degree of redundancy among these measures and observe that individual metrics often fail to sufficiently differentiate embedding spaces. Building on these insights, we introduce Unified Topological Signatures (UTS), a holistic framework for characterizing embedding spaces. We show that UTS can predict model-specific properties and reveal similarities driven by model architecture. Further, we demonstrate the utility of our method by linking topological structure to ranking effectiveness and accurately predicting document retrievability. We find that a holistic, multi-attribute perspective is essential to understanding and leveraging the geometry of text embeddings.

Title: A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text

Authors: Sepyan Purnama Kristanto, Lutfi Hakim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22153
Pdf URL: https://arxiv.org/pdf/2511.22153
Copy Paste: [[2511.22153]] A Theoretically Grounded Hybrid Ensemble for Reliable Detection of LLM-Generated Text(https://arxiv.org/abs/2511.22153)
Keywords: extraction, transformer, large language model
Abstract: The rapid proliferation of Large Language Models (LLMs) has blurred the line between human and machine authorship, creating practical risks for academic integrity and information reliability. Existing text detectors typically rely on a single methodological paradigm and suffer from poor generalization and high false positive rates (FPR), especially on high-stakes academic text. We propose a theoretically grounded hybrid ensemble that systematically fuses three complementary detection paradigms: (i) a RoBERTa-based transformer classifier for deep semantic feature extraction, (ii) a GPT-2-based probabilistic detector using perturbation-induced likelihood curvature, and (iii) a statistical linguistic feature analyzer capturing stylometric patterns. The core novelty lies in an optimized weighted voting framework, where ensemble weights are learned on the probability simplex to maximize F1-score rather than set heuristically. We provide a bias-variance analysis and empirically demonstrate low inter-model correlation (rho ~ 0.35-0.42), a key condition for variance reduction. Evaluated on a large-scale, multigenerator corpus of 30,000 documents, our system achieves 94.2% accuracy and an AUC of 0.978, with a 35% relative reduction in false positives on academic text. This yields a more reliable and ethically responsible detector for real-world deployment in education and other high-stakes domains.

Title: IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer

Authors: Bo Chen, Tao Liu, Qi Chen, Xie Chen, Zilong Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22167
Pdf URL: https://arxiv.org/pdf/2511.22167
Copy Paste: [[2511.22167]] IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer(https://arxiv.org/abs/2511.22167)
Keywords: robust
Abstract: Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.

Title: Partially Shared Concept Bottleneck Models

Authors: Delong Zhao, Qiang Huang, Di Yan, Yiqun Sun, Jun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22170
Pdf URL: https://arxiv.org/pdf/2511.22170
Copy Paste: [[2511.22170]] Partially Shared Concept Bottleneck Models(https://arxiv.org/abs/2511.22170)
Keywords: interpretability, large language model
Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by introducing a layer of human-understandable concepts between inputs and predictions. While recent methods automate concept generation using Large Language Models (LLMs) and Vision-Language Models (VLMs), they still face three fundamental challenges: poor visual grounding, concept redundancy, and the absence of principled metrics to balance predictive accuracy and concept compactness. We introduce PS-CBM, a Partially Shared CBM framework that addresses these limitations through three core components: (1) a multimodal concept generator that integrates LLM-derived semantics with exemplar-based visual cues; (2) a Partially Shared Concept Strategy that merges concepts based on activation patterns to balance specificity and compactness; and (3) Concept-Efficient Accuracy (CEA), a post-hoc metric that jointly captures both predictive accuracy and concept compactness. Extensive experiments on eleven diverse datasets show that PS-CBM consistently outperforms state-of-the-art CBMs, improving classification accuracy by 1.0%-7.4% and CEA by 2.0%-9.5%, while requiring significantly fewer concepts. These results underscore PS-CBM's effectiveness in achieving both high accuracy and strong interpretability.

Title: BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch

Authors: Pu Li, Wenhao Zhang, Weize Quan, Biao Zhang, Peter Wonka, Dong-Ming Yan
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.22171
Pdf URL: https://arxiv.org/pdf/2511.22171
Copy Paste: [[2511.22171]] BrepGPT: Autoregressive B-rep Generation with Voronoi Half-Patch(https://arxiv.org/abs/2511.22171)
Keywords: transformer, generative
Abstract: Boundary representation (B-rep) is the de facto standard for CAD model representation in modern industrial design. The intricate coupling between geometric and topological elements in B-rep structures has forced existing generative methods to rely on cascaded multi-stage networks, resulting in error accumulation and computational inefficiency. We present BrepGPT, a single-stage autoregressive framework for B-rep generation. Our key innovation lies in the Voronoi Half-Patch (VHP) representation, which decomposes B-reps into unified local units by assigning geometry to nearest half-edges and sampling their next pointers. Unlike hierarchical representations that require multiple distinct encodings for different structural levels, our VHP representation facilitates unifying geometric attributes and topological relations in a single, coherent format. We further leverage dual VQ-VAEs to encode both vertex topology and Voronoi Half-Patches into vertex-based tokens, achieving a more compact sequential encoding. A decoder-only Transformer is then trained to autoregressively predict these tokens, which are subsequently mapped to vertex-based features and decoded into complete B-rep models. Experiments demonstrate that BrepGPT achieves state-of-the-art performance in unconditional B-rep generation. The framework also exhibits versatility in various applications, including conditional generation from category labels, point clouds, text descriptions, and images, as well as B-rep autocompletion and interpolation.

Title: Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning

Authors: Zhaoyang Wei, Wenchao Ding, Yanchao Hao, Xi Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22172
Pdf URL: https://arxiv.org/pdf/2511.22172
Copy Paste: [[2511.22172]] Guiding the Inner Eye: A Framework for Hierarchical and Flexible Visual Grounded Reasoning(https://arxiv.org/abs/2511.22172)
Keywords: robust
Abstract: Models capable of "thinking with images" by dynamically grounding their reasoning in visual evidence represent a major leap in multimodal AI. However, replicating and advancing this ability is non-trivial, with current methods often trapped between the instability of end-to-end reinforcement learning (RL) and the rigidity of supervised fine-tuning (SFT). This leads to models that either struggle to learn or lack the cognitive flexibility required for complex, real-world scenes. To navigate this dilemma, we introduce GRiP (Guided Reasoning and Perception), a novel two-stage training framework that cultivates robust and flexible visual grounded reasoning by explicitly guiding the model's perceptual focus and logical pathways. GRiP's core lies in its cognitive-enhanced RL stage, which features two key innovations: (1) a Salience-Weighted IoU Reward that incentivizes the model to prioritize the localization of mission-critical objects over trivial distractors, and (2) a Multi-Heuristic Reward that encourages cognitive flexibility by rewarding diverse yet logically valid reasoning pathways. Initialized from the Qwen2.5-VL-7B model, GRiP demonstrates significant performance gains across multiple challenging benchmarks. It achieves state-of-the-art results among open-source models on the highly challenging TreeBench and V* Bench, proving its effectiveness in complex visual reasoning. Our work demonstrates that moving beyond simplistic rewards and instead guiding models with cognitively-inspired signals for what to see and how to think is crucial for unlocking the next level of multimodal intelligence. The code will be made publicly available.

Title: Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information

Authors: Lukas Struppek, Dominik Hintersdorf, Hannah Struppek, Daniel Neider, Kristian Kersting
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22176
Pdf URL: https://arxiv.org/pdf/2511.22176
Copy Paste: [[2511.22176]] Focused Chain-of-Thought: Efficient LLM Reasoning via Structured Input Information(https://arxiv.org/abs/2511.22176)
Keywords: extraction, large language model
Abstract: Recent large language models achieve strong reasoning performance by generating detailed chain-of-thought traces, but this often leads to excessive token use and high inference latency. Existing efficiency approaches typically focus on model-centric interventions, such as reinforcement learning or supervised fine-tuning, to reduce verbosity. In contrast, we propose a training-free, input-centric approach. Inspired by cognitive psychology, we introduce Focused Chain-of-Thought (F-CoT), which separates information extraction from the reasoning process. F-CoT first organizes the essential information from a query into a concise, structured context and then guides the model to reason exclusively over this context. By preventing attention to irrelevant details, F-CoT naturally produces shorter reasoning paths. On arithmetic word problems, F-CoT reduces generated tokens by 2-3x while maintaining accuracy comparable to standard zero-shot CoT. These results highlight structured input as a simple yet effective lever for more efficient LLM reasoning.

Title: Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage

Authors: Peiyu Yu, Suraj Kothawade, Sirui Xie, Ying Nian Wu, Hongliang Fei
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.22177
Pdf URL: https://arxiv.org/pdf/2511.22177
Copy Paste: [[2511.22177]] Designing Instance-Level Sampling Schedules via REINFORCE with James-Stein Shrinkage(https://arxiv.org/abs/2511.22177)
Keywords: diffusion, generative
Abstract: Most post-training methods for text-to-image samplers focus on model weights: either fine-tuning the backbone for alignment or distilling it for few-step efficiency. We take a different route: rescheduling the sampling timeline of a frozen sampler. Instead of a fixed, global schedule, we learn instance-level (prompt- and noise-conditioned) schedules through a single-pass Dirichlet policy. To ensure accurate gradient estimates in high-dimensional policy learning, we introduce a novel reward baseline based on a principled James-Stein estimator; it provably achieves lower estimation errors than commonly used variants and leads to superior performance. Our rescheduled samplers consistently improve text-image alignment including text rendering and compositional control across modern Stable Diffusion and Flux model families. Additionally, a 5-step Flux-Dev sampler with our schedules can attain generation quality comparable to deliberately distilled samplers like Flux-Schnell. We thus position our scheduling framework as an emerging model-agnostic post-training lever that unlocks additional generative potential in pretrained samplers.

Title: Personalized 3D Spatiotemporal Trajectory Privacy Protection with Differential and Distortion Geo-Perturbation

Authors: Minghui Min, Yulu Li, Gang Li, Meng Li, Hongliang Zhang, Miao Pan, Dusit Niyato, Zhu Han
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22180
Pdf URL: https://arxiv.org/pdf/2511.22180
Copy Paste: [[2511.22180]] Personalized 3D Spatiotemporal Trajectory Privacy Protection with Differential and Distortion Geo-Perturbation(https://arxiv.org/abs/2511.22180)
Keywords: privacy, protect, attack
Abstract: The rapid advancement of location-based services (LBSs) in three-dimensional (3D) domains, such as smart cities and intelligent transportation, has raised concerns over 3D spatiotemporal trajectory privacy protection. However, existing research has not fully addressed the risk of attackers exploiting the spatiotemporal correlation of 3D spatiotemporal trajectories and the impact of height information, both of which can potentially lead to significant privacy leakage. To address these issues, this paper proposes a personalized 3D spatiotemporal trajectory privacy protection mechanism, named 3DSTPM. First, we analyze the characteristics of attackers that exploit spatiotemporal correlations between locations in a trajectory and present the attack model. Next, we exploit the complementary characteristics of 3D geo-indistinguishability (3D-GI) and distortion privacy to find a protection location set (PLS) that obscures the real location for all possible locations. To address the issue of privacy accumulation caused by continuous trajectory queries, we propose a Window-based Adaptive Privacy Budget Allocation (W-APBA), which dynamically allocates privacy budgets to all locations in the current PLS based on their predictability and sensitivity. Finally, we perturb the real location using the allocated privacy budget by the PF (Permute-and-Flip) mechanism, effectively balancing privacy protection and Quality of Service (QoS). Simulation results demonstrate that the proposed 3DSTPM effectively reduces QoS loss while meeting the user's personalized privacy protection needs.

Title: MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction

Authors: Maitrayee Keskar, Mohan Trivedi, Ross Greer
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22181
Pdf URL: https://arxiv.org/pdf/2511.22181
Copy Paste: [[2511.22181]] MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction(https://arxiv.org/abs/2511.22181)
Keywords: transformer
Abstract: We present a method for trajectory planning for autonomous driving, learning image-based context embeddings that align with motion prediction frameworks and planning-based intention input. Within our method, a ViT encoder takes raw images and past kinematic state as input and is trained to produce context embeddings, drawing inspiration from those generated by the recent MTR (Motion Transformer) encoder, effectively substituting map-based features with learned visual representations. MTR provides a strong foundation for multimodal trajectory prediction by localizing agent intent and refining motion iteratively via motion query pairs; we name our approach MTR-VP (Motion Transformer for Vision-based Planning), and instead of the learnable intention queries used in the MTR decoder, we use cross attention on the intent and the context embeddings, which reflect a combination of information encoded from the driving scene and past vehicle states. We evaluate our methods on the Waymo End-to-End Driving Dataset, which requires predicting the agent's future 5-second trajectory in bird's-eye-view coordinates using prior camera images, agent pose history, and routing goals. We analyze our architecture using ablation studies, removing input images and multiple trajectory output. Our results suggest that transformer-based methods that are used to combine the visual features along with the kinetic features such as the past trajectory features are not effective at combining both modes to produce useful scene context embeddings, even when intention embeddings are augmented with foundation-model representations of scene context from CLIP and DINOv2, but that predicting a distribution over multiple futures instead of a single future trajectory boosts planning performance.

Title: Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation

Authors: Daniel Sungho Jung, Kyoung Mu Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22184
Pdf URL: https://arxiv.org/pdf/2511.22184
Copy Paste: [[2511.22184]] Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation(https://arxiv.org/abs/2511.22184)
Keywords: robust
Abstract: Foot contact plays a critical role in human interaction with the world, and thus exploring foot contact can advance our understanding of human movement and physical interaction. Despite its importance, existing methods often approximate foot contact using a zero-velocity constraint and focus on joint-level contact, failing to capture the detailed interaction between the foot and the world. Dense estimation of foot contact is crucial for accurately modeling this interaction, yet predicting dense foot contact from a single RGB image remains largely underexplored. There are two main challenges for learning dense foot contact estimation. First, shoes exhibit highly diverse appearances, making it difficult for models to generalize across different styles. Second, ground often has a monotonous appearance, making it difficult to extract informative features. To tackle these issues, we present a FEet COntact estimation (FECO) framework that learns dense foot contact with shoe style-invariant and ground-aware learning. To overcome the challenge of shoe appearance diversity, our approach incorporates shoe style adversarial training that enforces shoe style-invariant features for contact estimation. To effectively utilize ground information, we introduce a ground feature extractor that captures ground properties based on spatial context. As a result, our proposed method achieves robust foot contact estimation regardless of shoe appearance and effectively leverages ground information. Code will be released.

Title: HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving

Authors: Qiang Li, Yingwenqi Jiang, Tuoxi Li, Duyu Chen, Xiang Feng, Yucheng Ao, Shangyue Liu, Xingchen Yu, Youcheng Cai, Yumeng Liu, Yuexin Ma, Xin Hu, Li Liu, Yu Zhang, Linkun Xu, Bingtao Gao, Xueyuan Wang, Shuchang Zhou, Xianming Liu, Ligang Liu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22187
Pdf URL: https://arxiv.org/pdf/2511.22187
Copy Paste: [[2511.22187]] HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving(https://arxiv.org/abs/2511.22187)
Keywords: robust, generative
Abstract: Realistic and controllable simulation is critical for advancing end-to-end autonomous driving, yet existing approaches often struggle to support novel view synthesis under large viewpoint changes or to ensure geometric consistency. We introduce HybridWorldSim, a hybrid simulation framework that integrates multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents. This unified design addresses key limitations of previous methods, enabling the creation of diverse and high-fidelity driving scenarios with reliable visual and spatial consistency. To facilitate robust benchmarking, we further release a new multi-traversal dataset MIRROR that captures a wide range of routes and environmental conditions across different cities. Extensive experiments demonstrate that HybridWorldSim surpasses previous state-of-the-art methods, providing a practical and scalable solution for high-fidelity simulation and a valuable resource for research and development in autonomous driving.

Title: Department-Specific Security Awareness Campaigns: A Cross-Organizational Study of HR and Accounting

Authors: Matthias Pfister, Giovanni Apruzzese, Irdin Pekaric
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22189
Pdf URL: https://arxiv.org/pdf/2511.22189
Copy Paste: [[2511.22189]] Department-Specific Security Awareness Campaigns: A Cross-Organizational Study of HR and Accounting(https://arxiv.org/abs/2511.22189)
Keywords: security, attack
Abstract: Many cyberattacks succeed because they exploit flaws at the human level. To address this problem, organizations rely on security awareness programs, which aim to make employees more resilient against social engineering. While some works have suggested that such programs should account for contextual relevance, the common praxis in research is to adopt a "general" viewpoint. For instance, instead of focusing on department-specific issues, prior user studies sought to provide organization-wide conclusions. Such a protocol may lead to overlooking vulnerabilities that affect only specific subsets of an organization. In this paper, we tackle such an oversight. First, through a systematic literature review, we provide evidence that prior literature poorly accounted for department-specific needs. Then, we carry out a multi-company and mixed-methods study focusing on two pivotal departments: human resources (HR) and accounting. We explore three dimensions: threats faced by these departments; topics covered in the security-awareness campaigns delivered to these departments; and delivery methods that maximize the effectiveness of such campaigns. We begin by interviewing 16 employees of a multinational enterprise, and then use these results as a scaffold to design a structured survey through which we collect the responses of over 90 HR/accounting members of 9 organizations. We find that HR is targeted through job applications containing malware and executive impersonation, while accounting is exposed to invoice fraud, credential theft, and ransomware. Current training is often viewed as too generic, with employees preferring shorter, scenario-based formats like videos and simulations. These preferences contradict the common industry practice of annual sessions. Based on these insights, we propose recommendations for designing awareness programs tailored to departmental needs and workflows.

Title: Controllable 3D Object Generation with Single Image Prompt

Authors: Jaeseok Lee, Jaekoo Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22194
Pdf URL: https://arxiv.org/pdf/2511.22194
Copy Paste: [[2511.22194]] Controllable 3D Object Generation with Single Image Prompt(https://arxiv.org/abs/2511.22194)
Keywords: diffusion, generative
Abstract: Recently, the impressive generative capabilities of diffusion models have been demonstrated, producing images with remarkable fidelity. Particularly, existing methods for the 3D object generation tasks, which is one of the fastest-growing segments in computer vision, pre-dominantly use text-to-image diffusion models with textual inversion which train a pseudo text prompt to describe the given image. In practice, various text-to-image generative models employ textual inversion to learn concepts or styles of target object in the pseudo text prompt embedding space, thereby generating sophisticated outputs. However, textual inversion requires additional training time and lacks control ability. To tackle this issues, we propose two innovative methods: (1) using an off-the-shelf image adapter that generates 3D objects without textual inversion, offering enhanced control over conditions such as depth, pose, and text. (2) a depth conditioned warmup strategy to enhance 3D consistency. In experimental results, ours show qualitatively and quantitatively comparable performance and improved 3D consistency to the existing text-inversion-based alternatives. Furthermore, we conduct a user study to assess (i) how well results match the input image and (ii) whether 3D consistency is maintained. User study results show that our model outperforms the alternatives, validating the effectiveness of our approaches. Our code is available at GitHub repository:this https URL

Title: PULSE-ICU: A Pretrained Unified Long-Sequence Encoder for Multi-task Prediction in Intensive Care Units

Authors: Sejeong Jang, Joo Heung Yoon, Hyo Kyung Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22199
Pdf URL: https://arxiv.org/pdf/2511.22199
Copy Paste: [[2511.22199]] PULSE-ICU: A Pretrained Unified Long-Sequence Encoder for Multi-task Prediction in Intensive Care Units(https://arxiv.org/abs/2511.22199)
Keywords: robust
Abstract: Intensive care unit (ICU) data are highly irregular, heterogeneous, and temporally fragmented, posing challenges for generalizable clinical prediction. We present PULSE-ICU, a self-supervised foundation model that learns event-level ICU representations from large-scale EHR sequences without resampling or manual feature engineering. A unified embedding module encodes event identity, continuous values, units, and temporal attributes, while a Longformer-based encoder enables efficient modeling of long trajectories. PULSE-ICU was fine-tuned across 18 prediction tasks, including mortality, intervention forecasting, and phenotype identification, achieving strong performance across task types. External validation on eICU, HiRID, and P12 showed substantial improvements with minimal fine-tuning, demonstrating robustness to domain shift and variable constraints. These findings suggest that foundation-style modeling can improve data efficiency and adaptability, providing a scalable framework for ICU decision support across diverse clinical environments.

Title: Real-PGDN: A Two-level Classification Method for Full-Process Recognition of Newly Registered Pornographic and Gambling Domain Names

Authors: Hao Wang, Yingshuo Wang, Junang Gan, Yanan Cheng, Jinshuai Zhang
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22215
Pdf URL: https://arxiv.org/pdf/2511.22215
Copy Paste: [[2511.22215]] Real-PGDN: A Two-level Classification Method for Full-Process Recognition of Newly Registered Pornographic and Gambling Domain Names(https://arxiv.org/abs/2511.22215)
Keywords: privacy, extraction
Abstract: Online pornography and gambling have consistently posed regulatory challenges for governments, threatening both personal assets and privacy. Therefore, it is imperative to research the classification of the newly registered Pornographic and Gambling Domain Names (PGDN). However, scholarly investigation into this topic is limited. Previous efforts in PGDN classification pursue high accuracy using ideal sample data, while others employ up-to-date data from real-world scenarios but achieve lower classification accuracy. This paper introduces the Real-PGDN method, which accomplishes a complete process of timely and comprehensive real-data crawling, feature extraction with feature-missing tolerance, precise PGDN classification, and assessment of application effects in actual scenarios. Our two-level classifier, which integrates CoSENT (BERT-based), Multilayer Perceptron (MLP), and traditional classification algorithms, achieves a 97.88% precision. The research process amasses the NRD2024 dataset, which contains continuous detection information over 20 days for 1,500,000 newly registered domain names across 6 directions. Results from our case study demonstrate that this method also maintains a forecast precision of over 70% for PGDN that are delayed in usage after registration.

Title: 3D-Consistent Multi-View Editing by Diffusion Guidance

Authors: Josef Bengtson, David Nilsson, Dong In Lee, Fredrik Kahl
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22228
Pdf URL: https://arxiv.org/pdf/2511.22228
Copy Paste: [[2511.22228]] 3D-Consistent Multi-View Editing by Diffusion Guidance(https://arxiv.org/abs/2511.22228)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have greatly improved text-based image editing, yet methods that edit images independently often produce geometrically and photometrically inconsistent results across different views of the same scene. Such inconsistencies are particularly problematic for editing of 3D representations such as NeRFs or Gaussian Splat models. We propose a training-free diffusion framework that enforces multi-view consistency during the image editing process. The key assumption is that corresponding points in the unedited images should undergo similar transformations after editing. To achieve this, we introduce a consistency loss that guides the diffusion sampling toward coherent edits. The framework is flexible and can be combined with widely varying image editing methods, supporting both dense and sparse multi-view editing setups. Experimental results show that our approach significantly improves 3D consistency compared to existing multi-view editing methods. We also show that this increased consistency enables high-quality Gaussian Splat editing with sharp details and strong fidelity to user-specified text prompts. Please refer to our project page for video results: this https URL

Title: From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation

Authors: Zhen Chen, Yihang Fu, Gabriel Madera, Mauro Giuffre, Serina Applebaum, Hyunjae Kim, Hua Xu, Qingyu Chen
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.22232
Pdf URL: https://arxiv.org/pdf/2511.22232
Copy Paste: [[2511.22232]] From Compound Figures to Composite Understanding: Developing a Multi-Modal LLM from Biomedical Literature with Medical Multiple-Image Benchmarking and Validation(https://arxiv.org/abs/2511.22232)
Keywords: large language model
Abstract: Multi-modal large language models (MLLMs) have shown promise in advancing healthcare. However, most existing models remain confined to single-image understanding, which greatly limits their applicability in clinical workflows. In practice, medical diagnosis and progression often require synthesizing information across multiple images from different modalities or time points. The development of medical MLLMs capable of such multi-image understanding has been hindered by the lack of large-scale, high-quality annotated training data. To address this limitation, we propose a novel framework that leverages license-permissive compound images in biomedical literature, as a rich yet underutilized data source for multi-image analysis. Specifically, we design a five-stage, context-aware instruction generation paradigm underpinned by a divide-and-conquer strategy. By decomposing multi-image analysis into manageable sub-tasks, this paradigm empowers MLLMs to move beyond single-panel analysis and provide a composite understanding by learning the complex spatial, temporal, and cross-modal relationships inherent in these compound figures. By parsing over 237,000 compound figures and their contextual text for instruction generation, we develop M3LLM, a medical multi-image multi-modal large language model. For benchmarking, we construct PMC-MI-Bench for composite understanding, manually validated by medical experts. Extensive experiments show that M3LLM significantly outperforms both general-purpose and specialized medical MLLMs across multi-image, single-image, text-only, and multi-choice scenarios. Notably, M3LLM exhibits strong generalization to longitudinal chest X-ray analysis using the MIMIC dataset. This work establishes a scalable and efficient paradigm for developing medical MLLMs capable of composite reasoning, bridging the gap between biomedical literature and real-world clinical applications.

Title: Bridging 3D Deep Learning and Curation for Analysis and High-Quality Segmentation in Practice

Authors: Simon Püttmann, Jonathan Jair Sànchez Contreras, Lennart Kowitz, Peter Lampen, Saumya Gupta, Davide Panzeri, Nina Hagemann, Qiaojie Xiong, Dirk M. Hermann, Cao Chen, Jianxu Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22236
Pdf URL: https://arxiv.org/pdf/2511.22236
Copy Paste: [[2511.22236]] Bridging 3D Deep Learning and Curation for Analysis and High-Quality Segmentation in Practice(https://arxiv.org/abs/2511.22236)
Keywords: segmentation
Abstract: Accurate 3D microscopy image segmentation is critical for quantitative bioimage analysis but even state-of-the-art foundation models yield error-prone results. Therefore, manual curation is still widely used for either preparing high-quality training data or fixing errors before analysis. We present VessQC, an open-source tool for uncertainty-guided curation of large 3D microscopy segmentations. By integrating uncertainty maps, VessQC directs user attention to regions most likely containing biologically meaningful errors. In a preliminary user study uncertainty-guided correction significantly improved error detection recall from 67% to 94.0% (p=0.007) without a significant increase in total curation time. VessQC thus enables efficient, human-in-the-loop refinement of volumetric segmentations and bridges a key gap in real-world applications between uncertainty estimation and practical human-computer interaction. The software is freely available at this http URL.

Title: Creating Blank Canvas Against AI-enabled Image Forgery

Authors: Qi Song, Ziyuan Luo, Renjie Wan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22237
Pdf URL: https://arxiv.org/pdf/2511.22237
Copy Paste: [[2511.22237]] Creating Blank Canvas Against AI-enabled Image Forgery(https://arxiv.org/abs/2511.22237)
Keywords: attack
Abstract: AIGC-based image editing technology has greatly simplified the realistic-level image modification, causing serious potential risks of image forgery. This paper introduces a new approach to tampering detection using the Segment Anything Model (SAM). Instead of training SAM to identify tampered areas, we propose a novel strategy. The entire image is transformed into a blank canvas from the perspective of neural models. Any modifications to this blank canvas would be noticeable to the models. To achieve this idea, we introduce adversarial perturbations to prevent SAM from ``seeing anything'', allowing it to identify forged regions when the image is tampered with. Due to SAM's powerful perceiving capabilities, naive adversarial attacks cannot completely tame SAM. To thoroughly deceive SAM and make it blind to the image, we introduce a frequency-aware optimization strategy, which further enhances the capability of tamper localization. Extensive experimental results demonstrate the effectiveness of our method.

Title: TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning

Authors: Qingtao Yu, Changlin Song, Minghao Sun, Zhengyang Yu, Vinay Kumar Verma, Soumya Roy, Sumit Negi, Hongdong Li, Dylan Campbell
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22242
Pdf URL: https://arxiv.org/pdf/2511.22242
Copy Paste: [[2511.22242]] TTSnap: Test-Time Scaling of Diffusion Models via Noise-Aware Pruning(https://arxiv.org/abs/2511.22242)
Keywords: diffusion
Abstract: A prominent approach to test-time scaling for text-to-image diffusion models formulates the problem as a search over multiple noise seeds, selecting the one that maximizes a certain image-reward function. The effectiveness of this strategy heavily depends on the number and diversity of noise seeds explored. However, verifying each candidate is computationally expensive, because each must be fully denoised before a reward can be computed. This severely limits the number of samples that can be explored under a fixed budget. We propose test-time scaling with noise-aware pruning (TTSnap), a framework that prunes low-quality candidates without fully denoising them. The key challenge is that reward models are learned in the clean image domain, and the ranking of rewards predicted for intermediate estimates are often inconsistent with those predicted for clean images. To overcome this, we train noise-aware reward models via self-distillation to align the reward for intermediate estimates with that of the final clean images. To stabilize learning across different noise levels, we adopt a curriculum training strategy that progressively shifts the data domain from clean images to noise images. In addition, we introduce a new metric that measures reward alignment and computational budget utilization. Experiments demonstrate that our approach improves performance by over 16\% compared with existing methods, enabling more efficient and effective test-time scaling. It also provides orthogonal gains when combined with post-training techniques and local test-time optimization. Code: this https URL.

Title: Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models

Authors: Seoyun Yang, Gihoon Kim, Taesup Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22245
Pdf URL: https://arxiv.org/pdf/2511.22245
Copy Paste: [[2511.22245]] Semantic Anchoring for Robust Personalization in Text-to-Image Diffusion Models(https://arxiv.org/abs/2511.22245)
Keywords: robust, diffusion
Abstract: Text-to-image diffusion models have achieved remarkable progress in generating diverse and realistic images from textual descriptions. However, they still struggle with personalization, which requires adapting a pretrained model to depict user-specific subjects from only a few reference images. The key challenge lies in learning a new visual concept from a limited number of reference images while preserving the pretrained semantic prior that maintains text-image alignment. When the model focuses on subject fidelity, it tends to overfit the limited reference images and fails to leverage the pretrained distribution. Conversely, emphasizing prior preservation maintains semantic consistency but prevents the model from learning new personalized attributes. Building on these observations, we propose the personalization process through a semantic anchoring that guides adaptation by grounding new concepts in their corresponding distributions. We therefore reformulate personalization as the process of learning a rare concept guided by its frequent counterpart through semantic anchoring. This anchoring encourages the model to adapt new concepts in a stable and controlled manner, expanding the pretrained distribution toward personalized regions while preserving its semantic structure. As a result, the proposed method achieves stable adaptation and consistent improvements in both subject fidelity and text-image alignment compared to baseline methods. Extensive experiments and ablation studies further demonstrate the robustness and effectiveness of the proposed anchoring strategy.

Title: Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective

Authors: Bolin Lai, Xudong Wang, Saketh Rambhatla, James M. Rehg, Zsolt Kira, Rohit Girdhar, Ishan Misra
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22249
Pdf URL: https://arxiv.org/pdf/2511.22249
Copy Paste: [[2511.22249]] Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective(https://arxiv.org/abs/2511.22249)
Keywords: diffusion
Abstract: Latent diffusion has become the default paradigm for visual generation, yet we observe a persistent reconstruction-generation trade-off as latent dimensionality increases: higher-capacity autoencoders improve reconstruction fidelity but generation quality eventually declines. We trace this gap to the different behaviors in high-frequency encoding and decoding. Through controlled perturbations in both RGB and latent domains, we analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details, whereas encoders under-represent high-frequency contents, yielding insufficient exposure and underfitting in high-frequency bands for diffusion model training. To address this issue, we introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals during diffusion or flow-matching training -- without modifying or retraining the autoencoder. Applied across several high-dimensional autoencoders, FreqWarm consistently improves generation quality: decreasing gFID by 14.11 on Wan2.2-VAE, 6.13 on LTX-VAE, and 4.42 on DC-AE-f32, while remaining architecture-agnostic and compatible with diverse backbones. Our study shows that explicitly managing frequency exposure can successfully turn high-dimensional latent spaces into more diffusible targets.

Title: UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation

Authors: Dengbo Chen, Ziwei Zhao, Kexin Zhang, Shishuang Zhao, Junjie Hou, Yaqian Wang, Nianxi Liao, Anlan Sun, Fei Gao, Jia Ding, Yuhang Liu, Dong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22256
Pdf URL: https://arxiv.org/pdf/2511.22256
Copy Paste: [[2511.22256]] UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation(https://arxiv.org/abs/2511.22256)
Keywords: segmentation
Abstract: Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To bridge this gap, we propose UMind-VL, a unified foundation model designed to synergize pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, enriching standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. We demonstrate the capability of UMind-VL in Figure 1.

Title: Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques

Authors: Guifeng Wang, Yuanfeng Song, Meng Yang, Tao Zhu, Xiaoming Yin, Xing Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22258
Pdf URL: https://arxiv.org/pdf/2511.22258
Copy Paste: [[2511.22258]] Beyond Query-Level Comparison: Fine-Grained Reinforcement Learning for Text-to-SQL with Automated Interpretable Critiques(https://arxiv.org/abs/2511.22258)
Keywords: generative
Abstract: Text-to-SQL, a pivotal natural language processing (NLP) task that converts textual queries into executable SQL, has seen substantial progress in recent years. However, existing evaluation and reward mechanisms used to train and assess the text-to-SQL models remain a critical bottleneck. Current approaches heavily rely on manually annotated gold SQL queries, which are costly to produce and impractical for large-scale evaluation. More importantly, most reinforcement learning (RL) methods in text-to-SQL leverage only the final binary execution outcome as the reward signal, a coarse-grained supervision that overlooks detailed structural and semantic errors from the perspective of rubrics. To address these challenges, we propose RuCo-C, a novel generative judge model for fine-grained, query-specific automatic evaluation using interpretable critiques without human intervention. Our framework first automatically generates query-specific evaluation rubrics for human-free annotation, linking them to interpretable critiques. Subsequently, it integrates densified reward feedback through a "progressive exploration" strategy during the RL training process, which dynamically adjusts the rewards to enhance the model's performance. Comprehensive experiments demonstrate that RuCo-C outperforms existing methods in text-to-SQL evaluation, yielding significant performance gains.

Title: Silence Speaks Volumes: A New Paradigm for Covert Communication via History Timing Patterns

Authors: Christoph Weissenborn, Steffen Wendzel
Subjects: cs.CR, cs.DC, cs.NI
Abstract URL: https://arxiv.org/abs/2511.22259
Pdf URL: https://arxiv.org/pdf/2511.22259
Copy Paste: [[2511.22259]] Silence Speaks Volumes: A New Paradigm for Covert Communication via History Timing Patterns(https://arxiv.org/abs/2511.22259)
Keywords: security, robust, steal
Abstract: A Covert Channel (CC) exploits legitimate communication mechanisms to stealthily transmit information, often bypassing traditional security controls. Among these, a novel paradigm called History Covert Channels (HCC) leverages past network events as reference points to embed covert messages. Unlike traditional timing- or storage-based CCs, which directly manipulate traffic patterns or packet contents, HCCs minimize detectability by encoding information through small pointers to historical data. This approach enables them to amplify the size of transmitted covert data by referring to more bits than are actually embedded. Recent research has explored the feasibility of such methods, demonstrating their potential to evade detection by repurposing naturally occurring network behaviors as a covert transmission medium. This paper introduces a novel method for establishing and maintaining covert communication links using relative pointers to network timing patterns, which minimizes the reliance of the HCC on centralized timekeeping and reduces the likelihood of being detected by standard network monitoring tools. We also explore the tailoring of HCCs to optimize their robustness and undetectability characteristics. Our experiments reveal a better bitrate compared to previous work.

Title: Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?

Authors: Wenkai Huang, Yijia Guo, Gaolei Li, Lei Ma, Hang Zhang, Liwen Hu, Jiazheng Wang, Jianhua Li, Tiejun Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22262
Pdf URL: https://arxiv.org/pdf/2511.22262
Copy Paste: [[2511.22262]] Can Protective Watermarking Safeguard the Copyright of 3D Gaussian Splatting?(https://arxiv.org/abs/2511.22262)
Keywords: protect, robust, watermark
Abstract: 3D Gaussian Splatting (3DGS) has emerged as a powerful representation for 3D scenes, widely adopted due to its exceptional efficiency and high-fidelity visual quality. Given the significant value of 3DGS assets, recent works have introduced specialized watermarking schemes to ensure copyright protection and ownership verification. However, can existing 3D Gaussian watermarking approaches genuinely guarantee robust protection of the 3D assets? In this paper, for the first time, we systematically explore and validate possible vulnerabilities of 3DGS watermarking frameworks. We demonstrate that conventional watermark removal techniques designed for 2D images do not effectively generalize to the 3DGS scenario due to the specialized rendering pipeline and unique attributes of each gaussian primitives. Motivated by this insight, we propose GSPure, the first watermark purification framework specifically for 3DGS watermarking representations. By analyzing view-dependent rendering contributions and exploiting geometrically accurate feature clustering, GSPure precisely isolates and effectively removes watermark-related Gaussian primitives while preserving scene integrity. Extensive experiments demonstrate that our GSPure achieves the best watermark purification performance, reducing watermark PSNR by up to 16.34dB while minimizing degradation to original scene fidelity with less than 1dB PSNR loss. Moreover, it consistently outperforms existing methods in both effectiveness and generalization.

Title: DriveVGGT: Visual Geometry Transformer for Autonomous Driving

Authors: Xiaosong Jia, Yanhao Liu, Junqi You, Renqiu Xia, Yu Hong, Junchi Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22264
Pdf URL: https://arxiv.org/pdf/2511.22264
Copy Paste: [[2511.22264]] DriveVGGT: Visual Geometry Transformer for Autonomous Driving(https://arxiv.org/abs/2511.22264)
Keywords: transformer
Abstract: Feed-forward reconstruction has recently gained significant attention, with VGGT being a notable example. However, directly applying VGGT to autonomous driving (AD) systems leads to sub-optimal results due to the different priors between the two tasks. In AD systems, several important new priors need to be considered: (i) The overlap between camera views is minimal, as autonomous driving sensor setups are designed to achieve coverage at a low cost. (ii) The camera intrinsics and extrinsics are known, which introduces more constraints on the output and also enables the estimation of absolute scale. (iii) Relative positions of all cameras remain fixed though the ego vehicle is in motion. To fully integrate these priors into a feed-forward framework, we propose DriveVGGT, a scale-aware 4D reconstruction framework specifically designed for autonomous driving data. Specifically, we propose a Temporal Video Attention (TVA) module to process multi-camera videos independently, which better leverages the spatiotemporal continuity within each single-camera sequence. Then, we propose a Multi-camera Consistency Attention (MCA) module to conduct window attention with normalized relative pose embeddings, aiming to establish consistency relationships across different cameras while restricting each token to attend only to nearby frames. Finally, we extend the standard VGGT heads by adding an absolute scale head and an ego vehicle pose head. Experiments show that DriveVGGT outperforms VGGT, StreamVGGT, fastVGGT on autonomous driving dataset while extensive ablation studies verify effectiveness of the proposed designs.

Title: FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning

Authors: Yuan Yao, Lixu Wang, Jiaqi Wu, Jin Song, Simin Chen, Zehua Wang, Zijian Tian, Wei Chen, Huixia Li, Xiaoxiao Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22265
Pdf URL: https://arxiv.org/pdf/2511.22265
Copy Paste: [[2511.22265]] FedRE: A Representation Entanglement Framework for Model-Heterogeneous Federated Learning(https://arxiv.org/abs/2511.22265)
Keywords: privacy, protect, attack, federate
Abstract: Federated learning (FL) enables collaborative training across clients without compromising privacy. While most existing FL methods assume homogeneous model architectures, client heterogeneity in data and resources renders this assumption impractical, motivating model-heterogeneous FL. To address this problem, we propose Federated Representation Entanglement (FedRE), a framework built upon a novel form of client knowledge termed entangled representation. In FedRE, each client aggregates its local representations into a single entangled representation using normalized random weights and applies the same weights to integrate the corresponding one-hot label encodings into the entangled-label encoding. Those are then uploaded to the server to train a global classifier. During training, each entangled representation is supervised across categories via its entangled-label encoding, while random weights are resampled each round to introduce diversity, mitigating the global classifier's overconfidence and promoting smoother decision boundaries. Furthermore, each client uploads a single cross-category entangled representation along with its entangled-label encoding, mitigating the risk of representation inversion attacks and reducing communication overhead. Extensive experiments demonstrate that FedRE achieves an effective trade-off among model performance, privacy protection, and communication overhead. The codes are available at this https URL.

Title: TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

Authors: Henrijs Princis, Arindam Sharma, Cristina David
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22277
Pdf URL: https://arxiv.org/pdf/2511.22277
Copy Paste: [[2511.22277]] TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation(https://arxiv.org/abs/2511.22277)
Keywords: large language model
Abstract: Large language models (LLMs) have shown remarkable ability to generate code, yet their outputs often violate syntactic or semantic constraints when guided only through natural language prompts. We introduce TreeCoder, the most general and flexible framework to date for exploring decoding strategies, constraints, and hyperparameters in LLMs, and use it in code generation to enforce correctness and structure during decoding rather than relying on prompt engineering. TreeCoder represents decoding as a tree search over candidate programs, where both decoding strategies and constraint functions - such as style, syntax, execution - are treated as first-class, optimisable components. This design enables systematic exploration and automatic tuning of decoding configurations using standard optimisation techniques. Experiments on the MBPP (Python) and SQL-Spider benchmarks show that TreeCoder consistently improves accuracy across open-source models such as CodeLlama, Mistral and DeepSeek, often outperforming their unconstrained baselines by considerable margins.

Title: The Collapse of Patches

Authors: Wei Guo, Shunqi Mao, Zhuonan Liang, Heng Wang, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22281
Pdf URL: https://arxiv.org/pdf/2511.22281
Copy Paste: [[2511.22281]] The Collapse of Patches(https://arxiv.org/abs/2511.22281)
Keywords: transformer
Abstract: Observing certain patches in an image reduces the uncertainty of others. Their realization lowers the distribution entropy of each remaining patch feature, analogous to collapsing a particle's wave function in quantum mechanics. This phenomenon can intuitively be called patch collapse. To identify which patches are most relied on during a target region's collapse, we learn an autoencoder that softly selects a subset of patches to reconstruct each target patch. Graphing these learned dependencies for each patch's PageRank score reveals the optimal patch order to realize an image. We show that respecting this order benefits various masked image modeling methods. First, autoregressive image generation can be boosted by retraining the state-of-the-art model MAR. Next, we introduce a new setup for image classification by exposing Vision Transformers only to high-rank patches in the collapse order. Seeing 22\% of such patches is sufficient to achieve high accuracy. With these experiments, we propose patch collapse as a novel image modeling perspective that promotes vision efficiency. Our project is available at this https URL .

Title: The Hidden Cost of Approximation in Online Mirror Descent

Authors: Ofir Schlisselberg, Uri Sherman, Tomer Koren, Yishay Mansour
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22283
Pdf URL: https://arxiv.org/pdf/2511.22283
Copy Paste: [[2511.22283]] The Hidden Cost of Approximation in Online Mirror Descent(https://arxiv.org/abs/2511.22283)
Keywords: robust
Abstract: Online mirror descent (OMD) is a fundamental algorithmic paradigm that underlies many algorithms in optimization, machine learning and sequential decision-making. The OMD iterates are defined as solutions to optimization subproblems which, oftentimes, can be solved only approximately, leading to an inexact version of the algorithm. Nonetheless, existing OMD analyses typically assume an idealized error free setting, thereby limiting our understanding of performance guarantees that should be expected in practice. In this work we initiate a systematic study into inexact OMD, and uncover an intricate relation between regularizer smoothness and robustness to approximation errors. When the regularizer is uniformly smooth, we establish a tight bound on the excess regret due to errors. Then, for barrier regularizers over the simplex and its subsets, we identify a sharp separation: negative entropy requires exponentially small errors to avoid linear regret, whereas log-barrier and Tsallis regularizers remain robust even when the errors are only polynomial. Finally, we show that when the losses are stochastic and the domain is the simplex, negative entropy regains robustness-but this property does not extend to all subsets, where exponentially small errors are again necessary to avoid suboptimal regret.

Title: Adaptive tumor growth forecasting via neural & universal ODEs

Authors: Kavya Subramanian, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22292
Pdf URL: https://arxiv.org/pdf/2511.22292
Copy Paste: [[2511.22292]] Adaptive tumor growth forecasting via neural & universal ODEs(https://arxiv.org/abs/2511.22292)
Keywords: robust
Abstract: Forecasting tumor growth is critical for optimizing treatment. Classical growth models such as the Gompertz and Bertalanffy equations capture general tumor dynamics but may fail to adapt to patient-specific variability, particularly with limited data available. In this study, we leverage Neural Ordinary Differential Equations (Neural ODEs) and Universal Differential Equations (UDEs), two pillars of Scientific Machine Learning (SciML), to construct adaptive tumor growth models capable of learning from experimental data. Using the Gompertz model as a baseline, we replace rigid terms with adaptive neural networks to capture hidden dynamics through robust modeling in the Julia programming language. We use our models to perform forecasting under data constraints and symbolic recovery to transform the learned dynamics into explicit mathematical expressions. Our approach has the potential to improve predictive accuracy, guiding dynamic and effective treatment strategies for improved clinical outcomes.

Title: Structure is Supervision: Multiview Masked Autoencoders for Radiology

Authors: Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22294
Pdf URL: https://arxiv.org/pdf/2511.22294
Copy Paste: [[2511.22294]] Structure is Supervision: Multiview Masked Autoencoders for Radiology(https://arxiv.org/abs/2511.22294)
Keywords: robust
Abstract: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data. We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision-language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.

Title: FLUX: Efficient Descriptor-Driven Clustered Federated Learning under Arbitrary Distribution Shifts

Authors: Dario Fenoglio, Mohan Li, Pietro Barbiero, Nicholas D. Lane, Marc Langheinrich, Martin Gjoreski
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22305
Pdf URL: https://arxiv.org/pdf/2511.22305
Copy Paste: [[2511.22305]] FLUX: Efficient Descriptor-Driven Clustered Federated Learning under Arbitrary Distribution Shifts(https://arxiv.org/abs/2511.22305)
Keywords: privacy, robust, extraction, federate
Abstract: Federated Learning (FL) enables collaborative model training across multiple clients while preserving data privacy. Traditional FL methods often use a global model to fit all clients, assuming that clients' data are independent and identically distributed (IID). However, when this assumption does not hold, the global model accuracy may drop significantly, limiting FL applicability in real-world scenarios. To address this gap, we propose FLUX, a novel clustering-based FL (CFL) framework that addresses the four most common types of distribution shifts during both training and test time. To this end, FLUX leverages privacy-preserving client-side descriptor extraction and unsupervised clustering to ensure robust performance and scalability across varying levels and types of distribution shifts. Unlike existing CFL methods addressing non-IID client distribution shifts, FLUX i) does not require any prior knowledge of the types of distribution shifts or the number of client clusters, and ii) supports test-time adaptation, enabling unseen and unlabeled clients to benefit from the most suitable cluster-specific models. Extensive experiments across four standard benchmarks, two real-world datasets and ten state-of-the-art baselines show that FLUX improves performance and stability under diverse distribution shifts, achieving an average accuracy gain of up to 23 percentage points over the best-performing baselines, while maintaining computational and communication overhead comparable to FedAvg.

Title: Small Object Detection for Birds with Swin Transformer

Authors: Da Huo, Marc A. Kastner, Tingwei Liu, Yasutomo Kawanishi, Takatsugu Hirayama, Takahiro Komamizu, Ichiro Ide
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22310
Pdf URL: https://arxiv.org/pdf/2511.22310
Copy Paste: [[2511.22310]] Small Object Detection for Birds with Swin Transformer(https://arxiv.org/abs/2511.22310)
Keywords: transformer
Abstract: Object detection is the task of detecting objects in an image. In this task, the detection of small objects is particularly difficult. Other than the small size, it is also accompanied by difficulties due to blur, occlusion, and so on. Current small object detection methods are tailored to small and dense situations, such as pedestrians in a crowd or far objects in remote sensing scenarios. However, when the target object is small and sparse, there is a lack of objects available for training, making it more difficult to learn effective features. In this paper, we propose a specialized method for detecting a specific category of small objects; birds. Particularly, we improve the features learned by the neck; the sub-network between the backbone and the prediction head, to learn more effective features with a hierarchical design. We employ Swin Transformer to upsample the image features. Moreover, we change the shifted window size for adapting to small objects. Experiments show that the proposed Swin Transformer-based neck combined with CenterNet can lead to good performance by changing the window sizes. We further find that smaller window sizes (default 2) benefit mAPs for small object detection.

Title: Token-Level Marginalization for Multi-Label LLM Classifiers

Authors: Anjaneya Praharaj, Jaykumar Kasundra
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22312
Pdf URL: https://arxiv.org/pdf/2511.22312
Copy Paste: [[2511.22312]] Token-Level Marginalization for Multi-Label LLM Classifiers(https://arxiv.org/abs/2511.22312)
Keywords: interpretability, generative
Abstract: This paper addresses the critical challenge of deriving interpretable confidence scores from generative language models (LLMs) when applied to multi-label content safety classification. While models like LLaMA Guard are effective for identifying unsafe content and its categories, their generative architecture inherently lacks direct class-level probabilities, which hinders model confidence assessment and performance interpretation. This limitation complicates the setting of dynamic thresholds for content moderation and impedes fine-grained error analysis. This research proposes and evaluates three novel token-level probability estimation approaches to bridge this gap. The aim is to enhance model interpretability and accuracy, and evaluate the generalizability of this framework across different instruction-tuned models. Through extensive experimentation on a synthetically generated, rigorously annotated dataset, it is demonstrated that leveraging token logits significantly improves the interpretability and reliability of generative classifiers, enabling more nuanced content safety moderation.

Title: Sentiment Analysis Of Shopee Product Reviews Using Distilbert

Authors: Zahri Aksa Dautd, Aviv Yuniar Rahman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22313
Pdf URL: https://arxiv.org/pdf/2511.22313
Copy Paste: [[2511.22313]] Sentiment Analysis Of Shopee Product Reviews Using Distilbert(https://arxiv.org/abs/2511.22313)
Keywords: transformer
Abstract: The rapid growth of digital commerce has led to the accumulation of a massive number of consumer reviews on online platforms. Shopee, as one of the largest e-commerce platforms in Southeast Asia, receives millions of product reviews every day containing valuable information regarding customer satisfaction and preferences. Manual analysis of these reviews is inefficient, thus requiring a computational approach such as sentiment analysis. This study examines the use of DistilBERT, a lightweight transformer-based deep learning model, for sentiment classification on Shopee product reviews. The dataset used consists of approximately one million English-language reviews that have been preprocessed and trained using the distilbert-base-uncased model. Evaluation was conducted using accuracy, precision, recall, and F1-score metrics, and compared against benchmark models such as BERT and SVM. The results show that DistilBERT achieved an accuracy of 94.8%, slightly below BERT (95.3%) but significantly higher than SVM (90.2%), with computation time reduced by more than 55%. These findings demonstrate that DistilBERT provides an optimal balance between accuracy and efficiency, making it suitable for large scale sentiment analysis on e-commerce platforms. Keywords: Sentiment Analysis, DistilBERT, Shopee Reviews, Natural Language Processing, Deep Learning, Transformer Models.

Title: SingleQuant: Efficient Quantization of Large Language Models in a Single Pass

Authors: Jinying Xiao, Bin Ji, Shasha Li, Xiaodong Liu, Ma Jun, Ye Zhong, Wei Li, Xuan Xie, Qingbo Wu, Jie Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22316
Pdf URL: https://arxiv.org/pdf/2511.22316
Copy Paste: [[2511.22316]] SingleQuant: Efficient Quantization of Large Language Models in a Single Pass(https://arxiv.org/abs/2511.22316)
Keywords: large language model
Abstract: Large Language Models (LLMs) quantization facilitates deploying LLMs in resource-limited settings, but existing methods that combine incompatible gradient optimization and quantization truncation lead to serious convergence pathology. This prolongs quantization time and degrades LLMs' task performance. Our studies confirm that Straight-Through Estimator (STE) on Stiefel manifolds introduce non-smoothness and gradient noise, obstructing optimization convergence and blocking high-fidelity quantized LLM development despite extensive training. To tackle the above limitations, we propose SingleQuant, a single-pass quantization framework that decouples from quantization truncation, thereby eliminating the above non-smoothness and gradient noise factors. Specifically, SingleQuant constructs Alignment Rotation Transformation (ART) and Uniformity Rotation Transformation (URT) targeting distinct activation outliers, where ART achieves smoothing of outlier values via closed-form optimal rotations, and URT reshapes distributions through geometric mapping. Both matrices comprise strictly formulated Givens rotations with predetermined dimensions and rotation angles, enabling promising LLMs task performance within a short time. Experimental results demonstrate SingleQuant's superiority over the selected baselines across diverse tasks on 7B-70B LLMs. To be more precise, SingleQuant enables quantized LLMs to achieve higher task performance while necessitating less time for quantization. For example, when quantizing LLaMA-2-13B, SingleQuant achieves 1,400$\times$ quantization speedup and increases +0.57\% average task performance compared to the selected best baseline.

Title: Enhancing the Security of Rollup Sequencers using Decentrally Attested TEEs

Authors: Giovanni Maria Cristiano, Salvatore D'Antonio, Jonah Giglio, Giovanni Mazzeo, Luigi Romano
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22317
Pdf URL: https://arxiv.org/pdf/2511.22317
Copy Paste: [[2511.22317]] Enhancing the Security of Rollup Sequencers using Decentrally Attested TEEs(https://arxiv.org/abs/2511.22317)
Keywords: secure, security, attack
Abstract: The growing scalability demand of public Blockchains led to the rise of Layer-2 solutions, such as Rollups. Rollups improve transaction throughput by processing operations off-chain and posting the results on-chain. A critical component in Rollups is the Sequencer, responsible for receiving, ordering and batching transactions before they are submitted to the Layer-1 blockchain. While essential, the centralized nature of the Sequencer makes it vulnerable to attacks, such as censorship, transaction manipulation and tampering. To enhance its security, there are solutions in the literature that shield the Sequencer inside a Trusted Execution Environment (TEE). However, the attestation of TEEs introduces additional centralization, which is in contrast with the core Blockchain principle. In this paper, we propose a TEE-secured Sequencer equipped with a decentralized attestation mechanism. We outline the design and implementation of our solution, covering the system architecture, TEE integration, and the decentralization of the attestation process. Additionally, we present an experimental evaluation conducted on a realistic Rollup testnet. Our results show that this approach strengthens Sequencer integrity without sacrificing compatibility or deployability in existing Layer-2 architectures.

Title: Prompt-based Consistent Video Colorization

Authors: Silvia Dani, Tiberio Uricchio, Lorenzo Seidenari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22330
Pdf URL: https://arxiv.org/pdf/2511.22330
Copy Paste: [[2511.22330]] Prompt-based Consistent Video Colorization(https://arxiv.org/abs/2511.22330)
Keywords: diffusion, segmentation
Abstract: Existing video colorization methods struggle with temporal flickering or demand extensive manual input. We propose a novel approach automating high-fidelity video colorization using rich semantic guidance derived from language and segmentation. We employ a language-conditioned diffusion model to colorize grayscale frames. Guidance is provided via automatically generated object masks and textual prompts; our primary automatic method uses a generic prompt, achieving state-of-the-art results without specific color input. Temporal stability is achieved by warping color information from previous frames using optical flow (RAFT); a correction step detects and fixes inconsistencies introduced by warping. Evaluations on standard benchmarks (DAVIS30, VIDEVO20) show our method achieves state-of-the-art performance in colorization accuracy (PSNR) and visual realism (Colorfulness, CDC), demonstrating the efficacy of automated prompt-based guidance for consistent video colorization.

Title: Keyless Entry: Breaking and Entering eMMC RPMB with EMFI

Authors: Aya Fukami, Richard Buurke
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22340
Pdf URL: https://arxiv.org/pdf/2511.22340
Copy Paste: [[2511.22340]] Keyless Entry: Breaking and Entering eMMC RPMB with EMFI(https://arxiv.org/abs/2511.22340)
Keywords: secure, protect, attack
Abstract: The Replay Protected Memory Block (RPMB) in modern storage systems provides a secure area where data integrity is ensured by authentication. This block is used in digital devices to store pivotal information that must be safeguarded against modification by potential attackers. This paper targets the authentication scheme of the RPMB in three different eMMCs from a major manufacturer. A glitch was injected by sending an electromagnetic pulse to the target chip. RPMB authentication was successfully glitched and the information stored in two target eMMCs was overwritten with arbitrary data, without affecting the integrity of other data.

Title: Unexplored flaws in multiple-choice VQA evaluations

Authors: Fabio Rosenthal, Sebastian Schmidt, Thorsten Graf, Thorsten Bagodonat, Stephan Günnemann, Leo Schwinn
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22341
Pdf URL: https://arxiv.org/pdf/2511.22341
Copy Paste: [[2511.22341]] Unexplored flaws in multiple-choice VQA evaluations(https://arxiv.org/abs/2511.22341)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving $\mathbf{\text{seven}}$ MLLMs and $\mathbf{\text{five}}$ VQA datasets, spanning $\mathbf{48}$ distinct $\mathbf{\text{prompt format variations}}$. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM's confidence in the correct answer. Finally, we demonstrate that existing bias mitigation strategies fail to address these newly identified biases.

Title: Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment

Authors: Yang Chen, Xiaowei Xu, Shuai Wang, Chenhui Zhu, Ruxue Wen, Xubin Li, Tiezheng Ge, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22345
Pdf URL: https://arxiv.org/pdf/2511.22345
Copy Paste: [[2511.22345]] Flowing Backwards: Improving Normalizing Flows via Reverse Representation Alignment(https://arxiv.org/abs/2511.22345)
Keywords: generative
Abstract: Normalizing Flows (NFs) are a class of generative models distinguished by a mathematically invertible architecture, where the forward pass transforms data into a latent space for density estimation, and the reverse pass generates new samples from this space. This characteristic creates an intrinsic synergy between representation learning and data generation. However, the generative quality of standard NFs is limited by poor semantic representations from log-likelihood optimization. To remedy this, we propose a novel alignment strategy that creatively leverages the invertibility of NFs: instead of regularizing the forward pass, we align the intermediate features of the generative (reverse) pass with representations from a powerful vision foundation model, demonstrating superior effectiveness over naive alignment. We also introduce a novel training-free, test-time optimization algorithm for classification, which provides a more intrinsic evaluation of the NF's embedded semantic knowledge. Comprehensive experiments demonstrate that our approach accelerates the training of NFs by over 3.3$\times$, while simultaneously delivering significant improvements in both generative quality and classification accuracy. New state-of-the-art results for NFs are established on ImageNet 64$\times$64 and 256$\times$256. Our code is available at this https URL.

Title: INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts

Authors: Anshul Bagaria
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22351
Pdf URL: https://arxiv.org/pdf/2511.22351
Copy Paste: [[2511.22351]] INSIGHT: An Interpretable Neural Vision-Language Framework for Reasoning of Generative Artifacts(https://arxiv.org/abs/2511.22351)
Keywords: robust, diffusion, generative
Abstract: The growing realism of AI-generated images produced by recent GAN and diffusion models has intensified concerns over the reliability of visual media. Yet, despite notable progress in deepfake detection, current forensic systems degrade sharply under real-world conditions such as severe downsampling, compression, and cross-domain distribution shifts. Moreover, most detectors operate as opaque classifiers, offering little insight into why an image is flagged as synthetic, undermining trust and hindering adoption in high-stakes settings. We introduce INSIGHT (Interpretable Neural Semantic and Image-based Generative-forensic Hallucination Tracing), a unified multimodal framework for robust detection and transparent explanation of AI-generated images, even at extremely low resolutions (16x16 - 64x64). INSIGHT combines hierarchical super-resolution for amplifying subtle forensic cues without inducing misleading artifacts, Grad-CAM driven multi-scale localization to reveal spatial regions indicative of generative patterns, and CLIP-guided semantic alignment to map visual anomalies to human-interpretable descriptors. A vision-language model is then prompted using a structured ReAct + Chain-of-Thought protocol to produce consistent, fine-grained explanations, verified through a dual-stage G-Eval + LLM-as-a-judge pipeline to minimize hallucinations and ensure factuality. Across diverse domains, including animals, vehicles, and abstract synthetic scenes, INSIGHT substantially improves both detection robustness and explanation quality under extreme degradation, outperforming prior detectors and black-box VLM baselines. Our results highlight a practical path toward transparent, reliable AI-generated image forensics and establish INSIGHT as a step forward in trustworthy multimodal content verification.

Title: AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows

Authors: Zhenglin Zhou, Fan Ma, Chengzhuo Gui, Xiaobo Xia, Hehe Fan, Yi Yang, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22357
Pdf URL: https://arxiv.org/pdf/2511.22357
Copy Paste: [[2511.22357]] AnchorFlow: Training-Free 3D Editing via Latent Anchor-Aligned Flows(https://arxiv.org/abs/2511.22357)
Keywords: robust, diffusion
Abstract: Training-free 3D editing aims to modify 3D shapes based on human instructions without model finetuning. It plays a crucial role in 3D content creation. However, existing approaches often struggle to produce strong or geometrically stable edits, largely due to inconsistent latent anchors introduced by timestep-dependent noise during diffusion sampling. To address these limitations, we introduce AnchorFlow, which is built upon the principle of latent anchor consistency. Specifically, AnchorFlow establishes a global latent anchor shared between the source and target trajectories, and enforces coherence using a relaxed anchor-alignment loss together with an anchor-aligned update rule. This design ensures that transformations remain stable and semantically faithful throughout the editing process. By stabilizing the latent reference space, AnchorFlow enables more pronounced semantic modifications. Moreover, AnchorFlow is mask-free. Without mask supervision, it effectively preserves geometric fidelity. Experiments on the Eval3DEdit benchmark show that AnchorFlow consistently delivers semantically aligned and structurally robust edits across diverse editing types. Code is at this https URL.

Title: Efficient-Husformer: Efficient Multimodal Transformer Hyperparameter Optimization for Stress and Cognitive Loads

Authors: Merey Orazaly, Fariza Temirkhanova, Jurn-Gyu Park
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22362
Pdf URL: https://arxiv.org/pdf/2511.22362
Copy Paste: [[2511.22362]] Efficient-Husformer: Efficient Multimodal Transformer Hyperparameter Optimization for Stress and Cognitive Loads(https://arxiv.org/abs/2511.22362)
Keywords: transformer
Abstract: Transformer-based models have gained considerable attention in the field of physiological signal analysis. They leverage long-range dependencies and complex patterns in temporal signals, allowing them to achieve performance superior to traditional RNN and CNN models. However, they require high computational intensity and memory demands. In this work, we present Efficient-Husformer, a novel Transformer-based architecture developed with hyperparameter optimization (HPO) for multi-class stress detection across two multimodal physiological datasets (WESAD and CogLoad). The main contributions of this work are: (1) the design of a structured search space, targeting effective hyperparameter optimization; (2) a comprehensive ablation study evaluating the impact of architectural decisions; (3) consistent performance improvements over the original Husformer, with the best configuration achieving an accuracy of 88.41 and 92.61 (improvements of 13.83% and 6.98%) on WESAD and CogLoad datasets, respectively. The best-performing configuration is achieved with the (L + dm) or (L + FFN) modality combinations, using a single layer, 3 attention heads, a model dimension of 18/30, and FFN dimension of 120/30, resulting in a compact model with only about 30k parameters.

Title: SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning

Authors: Hugo Hazard, Zafeirios Fountas, Martin A. Benfeghoul, Adnan Oomerjee, Jun Wang, Haitham Bou-Ammar
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.22367
Pdf URL: https://arxiv.org/pdf/2511.22367
Copy Paste: [[2511.22367]] SuRe: Surprise-Driven Prioritised Replay for Continual LLM Learning(https://arxiv.org/abs/2511.22367)
Keywords: robust, large language model
Abstract: Continual learning, one's ability to adapt to a sequence of tasks without forgetting previously acquired knowledge, remains a major challenge in machine learning and a key gap between artificial and human intelligence. While regularisation and replay perform well in vision, they lag behind multi-task learning for large language models (LLMs), especially at scale with many tasks. We revisit replay and argue that two failure modes drive this gap: selection (what to rehearse) and integration (how to consolidate new knowledge). To address selection, we propose Surprise-prioritised Replay (SuRe), a simple, architecture-agnostic rule that ranks and stores the most surprising (high Negative Log-Likelihood) sequences. SuRe achieves state-of-the-art performance in the Large Number of Tasks (LNT) setting and delivers the best overall average across both Standard CL and LNT benchmarks. To address integration, we add a dual-learner design with fast and slow LoRA adapters merged via an exponential moving average (EMA), enabling rapid adaptation while stabilising long-term knowledge. Combining SuRe with the dual learner yields further gains, including improvements of up to +5 accuracy points on LNT over prior SOTA. Ablation studies confirm that our proposed method remains robust under reduced replay frequency and small buffer size, demonstrating both effectiveness and sample efficiency. Taken together, our results establish replay as a strong baseline for continual LLM fine-tuning and demonstrate that surprise-based selection and slow-weight consolidation are complementary components for mitigating catastrophic forgetting.

Title: Mapping Clinical Doubt: Locating Linguistic Uncertainty in LLMs

Authors: Srivarshinee Sridhar, Raghav Kaushik Ravi, Kripabandhu Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22402
Pdf URL: https://arxiv.org/pdf/2511.22402
Copy Paste: [[2511.22402]] Mapping Clinical Doubt: Locating Linguistic Uncertainty in LLMs(https://arxiv.org/abs/2511.22402)
Keywords: interpretability, large language model
Abstract: Large Language Models (LLMs) are increasingly used in clinical settings, where sensitivity to linguistic uncertainty can influence diagnostic interpretation and decision-making. Yet little is known about where such epistemic cues are internally represented within these models. Distinct from uncertainty quantification, which measures output confidence, this work examines input-side representational sensitivity to linguistic uncertainty in medical text. We curate a contrastive dataset of clinical statements varying in epistemic modality (e.g., 'is consistent with' vs. 'may be consistent with') and propose Model Sensitivity to Uncertainty (MSU), a layerwise probing metric that quantifies activation-level shifts induced by uncertainty cues. Our results show that LLMs exhibit structured, depth-dependent sensitivity to clinical uncertainty, suggesting that epistemic information is progressively encoded in deeper layers. These findings reveal how linguistic uncertainty is internally represented in LLMs, offering insight into their interpretability and epistemic reliability.

Title: UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data

Authors: Longkun Zou, Jiale Wang, Rongqin Liang, Hai Wu, Ke Chen, Yaowei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22404
Pdf URL: https://arxiv.org/pdf/2511.22404
Copy Paste: [[2511.22404]] UAV-MM3D: A Large-Scale Synthetic Benchmark for 3D Perception of Unmanned Aerial Vehicles with Multi-Modal Data(https://arxiv.org/abs/2511.22404)
Keywords: security, privacy
Abstract: Accurate perception of UAVs in complex low-altitude environments is critical for airspace security and related intelligent systems. Developing reliable solutions requires large-scale, accurately annotated, and multimodal data. However, real-world UAV data collection faces inherent constraints due to airspace regulations, privacy concerns, and environmental variability, while manual annotation of 3D poses and cross-modal correspondences is time-consuming and costly. To overcome these challenges, we introduce UAV-MM3D, a high-fidelity multimodal synthetic dataset for low-altitude UAV perception and motion understanding. It comprises 400K synchronized frames across diverse scenes (urban areas, suburbs, forests, coastal regions) and weather conditions (clear, cloudy, rainy, foggy), featuring multiple UAV models (micro, small, medium-sized) and five modalities - RGB, IR, LiDAR, Radar, and DVS (Dynamic Vision Sensor). Each frame provides 2D/3D bounding boxes, 6-DoF poses, and instance-level annotations, enabling core tasks related to UAVs such as 3D detection, pose estimation, target tracking, and short-term trajectory forecasting. We further propose LGFusionNet, a LiDAR-guided multimodal fusion baseline, and a dedicated UAV trajectory prediction baseline to facilitate benchmarking. With its controllable simulation environment, comprehensive scenario coverage, and rich annotations, UAV3D offers a public benchmark for advancing 3D perception of UAVs.

Title: DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention

Authors: Furkan Guzelant, Arda Goktogan, Tarık Kaya, Aysegul Dundar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22411
Pdf URL: https://arxiv.org/pdf/2511.22411
Copy Paste: [[2511.22411]] DiffStyle360: Diffusion-Based 360° Head Stylization via Style Fusion Attention(https://arxiv.org/abs/2511.22411)
Keywords: robust, diffusion
Abstract: 3D head stylization has emerged as a key technique for reimagining realistic human heads in various artistic forms, enabling expressive character design and creative visual experiences in digital media. Despite the progress in 3D-aware generation, existing 3D head stylization methods often rely on computationally expensive optimization or domain-specific fine-tuning to adapt to new styles. To address these limitations, we propose DiffStyle360, a diffusion-based framework capable of producing multi-view consistent, identity-preserving 3D head stylizations across diverse artistic domains given a single style reference image, without requiring per-style training. Building upon the 3D-aware DiffPortrait360 architecture, our approach introduces two key components: the Style Appearance Module, which disentangles style from content, and the Style Fusion Attention mechanism, which adaptively balances structure preservation and stylization fidelity in the latent space. Furthermore, we employ a 3D GAN-generated multi-view dataset for robust fine-tuning and introduce a temperaturebased key scaling strategy to control stylization intensity during inference. Extensive experiments on FFHQ and RenderMe360 demonstrate that DiffStyle360 achieves superior style quality, outperforming state-of-the-art GAN- and diffusion-based stylization methods across challenging style domains.

Title: Exposing Vulnerabilities in RL: A Novel Stealthy Backdoor Attack through Reward Poisoning

Authors: Bokang Zhang, Chaojun Lu, Jianhui Li, Junfeng Wu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22415
Pdf URL: https://arxiv.org/pdf/2511.22415
Copy Paste: [[2511.22415]] Exposing Vulnerabilities in RL: A Novel Stealthy Backdoor Attack through Reward Poisoning(https://arxiv.org/abs/2511.22415)
Keywords: security, defense, attack, steal
Abstract: Reinforcement learning (RL) has achieved remarkable success across diverse domains, enabling autonomous systems to learn and adapt to dynamic environments by optimizing a reward function. However, this reliance on reward signals creates a significant security vulnerability. In this paper, we study a stealthy backdoor attack that manipulates an agent's policy by poisoning its reward signals. The effectiveness of this attack highlights a critical threat to the integrity of deployed RL systems and calls for urgent defenses against training-time manipulation. We evaluate the attack across classic control and MuJoCo environments. The backdoored agent remains highly stealthy in Hopper and Walker2D, with minimal performance drops of only 2.18 % and 4.59 % under non-triggered scenarios, while achieving strong attack efficacy with up to 82.31% and 71.27% declines under trigger conditions.

Title: Extending Quantum-Safe Communications to Real-World Networks: An Adaptive Security Framework

Authors: Ane Sanz, Eire Salegi, Asier Atutxa, David Franco, Jasone Astorga, Eduardo Jacob
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22416
Pdf URL: https://arxiv.org/pdf/2511.22416
Copy Paste: [[2511.22416]] Extending Quantum-Safe Communications to Real-World Networks: An Adaptive Security Framework(https://arxiv.org/abs/2511.22416)
Keywords: security, protect, robust
Abstract: The advent of quantum computing threats classical cryptographic mechanisms, demanding new strategies for securing communication networks. Since real-world networks cannot be fully Quantum Key Distribution (QKD)-enabled due to infrastructure constraints, practical security solutions must support hybrid operation. This paper presents an adaptive security framework that enables quantum-safe communications across real-world heterogeneous networks by combining QKD and Post-Quantum Cryptography (PQC). Building upon a hierarchical key management architecture with Virtual Key Management Systems (vKMS) and a centralized Quantum Security Controller (QuSeC), the framework dynamically assigns security levels based on node capabilities. By transitioning between pure QKD, hybrid, and PQC modes, it ensures end-to-end quantum-safe protection regardless of the underlying node capabilities. The framework has been implemented and validated on a Kubernetes-based containerized testbed, demonstrating robust operation and performance across all scenarios. Results highlight its potential to support the gradual integration of quantum-safe technologies into existing infrastructures, paving the way toward fully quantum-safe communication networks.

Title: Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models

Authors: Minghao Yin, Yukang Cao, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22425
Pdf URL: https://arxiv.org/pdf/2511.22425
Copy Paste: [[2511.22425]] Wukong's 72 Transformations: High-fidelity Textured 3D Morphing via Flow Models(https://arxiv.org/abs/2511.22425)
Keywords: transformer, generative
Abstract: We present WUKONG, a novel training-free framework for high-fidelity textured 3D morphing that takes a pair of source and target prompts (image or text) as input. Unlike conventional methods -- which rely on manual correspondence matching and deformation trajectory estimation (limiting generalization and requiring costly preprocessing) -- WUKONG leverages the generative prior of flow-based transformers to produce high-fidelity 3D transitions with rich texture details. To ensure smooth shape transitions, we exploit the inherent continuity of flow-based generative processes and formulate morphing as an optimal transport barycenter problem. We further introduce a sequential initialization strategy to prevent abrupt geometric distortions and preserve identity coherence. For faithful texture preservation, we propose a similarity-guided semantic consistency mechanism that selectively retains high-frequency details and enables precise control over blending dynamics. This avoids common artifacts like oversmoothing while maintaining semantic fidelity. Extensive quantitative and qualitative evaluations demonstrate that WUKONG significantly outperforms state-of-the-art methods, achieving superior results across diverse geometry and texture variations.

Title: Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation

Authors: Weining Ren, Hongjun Wang, Xiao Tan, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22429
Pdf URL: https://arxiv.org/pdf/2511.22429
Copy Paste: [[2511.22429]] Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation(https://arxiv.org/abs/2511.22429)
Keywords: robust, extraction
Abstract: We present Fin3R, a simple, effective, and general fine-tuning method for feed-forward 3D reconstruction models. The family of feed-forward reconstruction model regresses pointmap of all input images to a reference frame coordinate system, along with other auxiliary outputs, in a single forward pass. However, we find that current models struggle with fine geometry and robustness due to (\textit{i}) the scarcity of high-fidelity depth and pose supervision and (\textit{ii}) the inherent geometric misalignment from multi-view pointmap regression. Fin3R jointly tackles two issues with an extra lightweight fine-tuning step. We freeze the decoder, which handles view matching, and fine-tune only the image encoder-the component dedicated to feature extraction. The encoder is enriched with fine geometric details distilled from a strong monocular teacher model on large, unlabeled datasets, using a custom, lightweight LoRA adapter. We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. Project page: \href{this http URL}{this https URL}

Title: SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition

Authors: Hongda Liu, Yunfan Liu, Changlu Wang, Yunlong Wang, Zhenan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22433
Pdf URL: https://arxiv.org/pdf/2511.22433
Copy Paste: [[2511.22433]] SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition(https://arxiv.org/abs/2511.22433)
Keywords: large language model
Abstract: Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM's response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at this https URL.

Title: FastFHE: Packing-Scalable and Depthwise-Separable CNN Inference Over FHE

Authors: Wenbo Song, Xinxin Fan, Quanliang Jing, Shaoye Luo, Wenqi Wei, Chi Lin, Yunfeng Lu, Ling Liu
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22434
Pdf URL: https://arxiv.org/pdf/2511.22434
Copy Paste: [[2511.22434]] FastFHE: Packing-Scalable and Depthwise-Separable CNN Inference Over FHE(https://arxiv.org/abs/2511.22434)
Keywords: secure, security, privacy
Abstract: The deep learning (DL) has been penetrating daily life in many domains, how to keep the DL model inference secure and sample privacy in an encrypted environment has become an urgent and increasingly important issue for various security-critical applications. To date, several approaches have been proposed based on the Residue Number System variant of the Cheon-Kim-Kim-Song (RNS-CKKS) scheme. However, they all suffer from high latency, which severely limits the applications in real-world tasks. Currently, the research on encrypted inference in deep CNNs confronts three main bottlenecks: i) the time and storage costs of convolution calculation; ii) the time overhead of huge bootstrapping operations; and iii) the consumption of circuit multiplication depth. Towards these three challenges, we in this paper propose an efficient and effective mechanism FastFHE to accelerate the model inference while simultaneously retaining high inference accuracy over fully homomorphic encryption. Concretely, our work elaborates four unique novelties. First, we propose a new scalable ciphertext data-packing scheme to save the time and storage consumptions. Second, we work out a depthwise-separable convolution fashion to degrade the computation load of convolution calculation. Third, we figure out a BN dot-product fusion matrix to merge the ciphertext convolutional layer with the batch-normalization layer without incurring extra multiplicative depth. Last but not least, we adopt the low-degree Legendre polynomial to approximate the nonlinear smooth activation function SiLU under the guarantee of tiny accuracy error before and after encrypted inference. Finally, we execute multi-facet experiments to verify the efficiency and effectiveness of our proposed approach.

Title: PISA: Prioritized Invariant Subgraph Aggregation

Authors: Ali Ghasemi, Farooq Ahmad Wani, Maria Sofia Bucarelli, Fabrizio Silvestri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22435
Pdf URL: https://arxiv.org/pdf/2511.22435
Copy Paste: [[2511.22435]] PISA: Prioritized Invariant Subgraph Aggregation(https://arxiv.org/abs/2511.22435)
Keywords: robust
Abstract: Recent work has extended the invariance principle for out-of-distribution (OOD) generalization from Euclidean to graph data, where challenges arise due to complex structures and diverse distribution shifts in node attributes and topology. To handle these, Chen et al. proposed CIGA (Chen et al., 2022b), which uses causal modeling and an information-theoretic objective to extract a single invariant subgraph capturing causal features. However, this single-subgraph focus can miss multiple causal patterns. Liu et al. (2025) addressed this with SuGAr, which learns and aggregates diverse invariant subgraphs via a sampler and diversity regularizer, improving robustness but still relying on simple uniform or greedy aggregation. To overcome this, the proposed PISA framework introduces a dynamic MLP-based aggregation that prioritizes and combines subgraph representations more effectively. Experiments on 15 datasets, including DrugOOD (Ji et al., 2023), show that PISA achieves up to 5% higher classification accuracy than prior methods.

Title: ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection

Authors: Runzhi Deng, Yundi Hu, Xinshuang Zhang, Zhao Wang, Xixi Liu, Wang-Zhou Dai, Caifeng Shan, Fang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22436
Pdf URL: https://arxiv.org/pdf/2511.22436
Copy Paste: [[2511.22436]] ABounD: Adversarial Boundary-Driven Few-Shot Learning for Multi-Class Anomaly Detection(https://arxiv.org/abs/2511.22436)
Keywords: robust
Abstract: Few-shot multi-class industrial anomaly detection remains a challenging task. Vision-language models need to be both category-adaptive and sharply discriminative, yet data scarcity often blurs the boundary between normal and abnormal states. This ambiguity leads to missed subtle defects and the rejection of atypical normal samples. We propose ABounD, an Adversarial Boundary-Driven few-shot learning for multi-class anomaly detection, which is a unified learning framework that integrates semantic concept learning with decision boundary shaping. The Dynamic Concept Fusion (DCF) module produces class-adaptive prompts by fusing generalizable priors with class-specific cues, conditioned on image features. Meanwhile, Adversarial Boundary Forging (ABF) sculpts a more precise decision margin by generating boundary-level fence features via PGD-style perturbations. Training is conducted in a single stage under a Concept-Boundary Loss, where ABF provides the main supervisory signal and semantic-spatial regularizers stabilize the optimization. This synergy yields a decision boundary that closely follows normal data while preserving flexibility and robust semantic alignment. Experiments on MVTec-AD and VisA datasets demonstrate state-of-the-art performance in the task of few-shot multi-class anomaly detection.

Title: GEO-Detective: Unveiling Location Privacy Risks in Images with LLM Agents

Authors: Xinyu Zhang, Yixin Wu, Boyang Zhang, Chenhao Lin, Chao Shen, Michael Backes, Yang Zhang
Subjects: cs.CR, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22441
Pdf URL: https://arxiv.org/pdf/2511.22441
Copy Paste: [[2511.22441]] GEO-Detective: Unveiling Location Privacy Risks in Images with LLM Agents(https://arxiv.org/abs/2511.22441)
Keywords: privacy, defense, robust
Abstract: Images shared on social media often expose geographic cues. While early geolocation methods required expert effort and lacked generalization, the rise of Large Vision Language Models (LVLMs) now enables accurate geolocation even for ordinary users. However, existing approaches are not optimized for this task. To explore the full potential and associated privacy risks, we present Geo-Detective, an agent that mimics human reasoning and tool use for image geolocation inference. It follows a procedure with four steps that adaptively selects strategies based on image difficulty and is equipped with specialized tools such as visual reverse search, which emulates how humans gather external geographic clues. Experimental results show that GEO-Detective outperforms baseline large vision language models (LVLMs) overall, particularly on images lacking visible geographic features. In country level geolocation tasks, it achieves an improvement of over 11.1% compared to baseline LLMs, and even at finer grained levels, it still provides around a 5.2% performance gain. Meanwhile, when equipped with external clues, GEO-Detective becomes more likely to produce accurate predictions, reducing the "unknown" prediction rate by more than 50.6%. We further explore multiple defense strategies and find that Geo-Detective exhibits stronger robustness, highlighting the need for more effective privacy safeguards.

Title: Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition

Authors: Maheswar Bora, Tashvik Dhamija, Shukesh Reddy, Baptiste Chopin, Pranav Balaji, Abhijit Das, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22443
Pdf URL: https://arxiv.org/pdf/2511.22443
Copy Paste: [[2511.22443]] Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition(https://arxiv.org/abs/2511.22443)
Keywords: robust
Abstract: Deepfake generation has witnessed remarkable progress, contributing to highly realistic generated images, videos, and audio. While technically intriguing, such progress has raised serious concerns related to the misuse of manipulated media. To mitigate such misuse, robust and reliable deepfake detection is urgently needed. Towards this, we propose a novel network FauxNet, which is based on pre-trained Visual Speech Recognition (VSR) features. By extracting temporal VSR features from videos, we identify and segregate real videos from manipulated ones. The holy grail in this context has to do with zero-shot detection, i.e., generalizable detection, which we focus on in this work. FauxNet consistently outperforms the state-of-the-art in this setting. In addition, FauxNet is able to attribute - distinguish between generation techniques from which the videos stem. Finally, we propose new datasets, referred to as Authentica-Vox and Authentica-HDTF, comprising about 38,000 real and fake videos in total, the latter created with six recent deepfake generation techniques. We provide extensive analysis and results on the Authentica datasets and FaceForensics++, demonstrating the superiority of FauxNet. The Authentica datasets will be made publicly available.

Title: Benchmarking machine learning models for multi-class state recognition in double duantum dot data

Authors: Valeria Díaz Moreno, Ryan P Khalili, Daniel Schug, Patrick J. Walsh, Justyna P. Zwolak
Subjects: cs.CV, cond-mat.mes-hall, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22451
Pdf URL: https://arxiv.org/pdf/2511.22451
Copy Paste: [[2511.22451]] Benchmarking machine learning models for multi-class state recognition in double duantum dot data(https://arxiv.org/abs/2511.22451)
Keywords: transformer
Abstract: Semiconductor quantum dots (QDs) are a leading platform for scalable quantum processors. However, scaling to large arrays requires reliable, automated tuning strategies for devices' bootstrapping, calibration, and operation, with many tuning aspects depending on accurately identifying QD device states from charge-stability diagrams (CSDs). In this work, we present a comprehensive benchmarking study of four modern machine learning (ML) architectures for multi-class state recognition in double-QD CSDs. We evaluate their performance across different data budgets and normalization schemes using both synthetic and experimental data. We find that the more resource-intensive models -- U-Nets and visual transformers (ViTs) -- achieve the highest MSE score (defined as $1-\mathrm{MSE}$) on synthetic data (over $0.98$) but fail to generalize to experimental data. MDNs are the most computationally efficient and exhibit highly stable training, but with substantially lower peak performance. CNNs offer the most favorable trade-off on experimental CSDs, achieving strong accuracy with two orders of magnitude fewer parameters than the U-Nets and ViTs. Normalization plays a nontrivial role: min-max scaling generally yields higher MSE scores but less stable convergence, whereas z-score normalization produces more predictable training dynamics but at reduced accuracy for most models. Overall, our study shows that CNNs with min-max normalization are a practical approach for QD CSDs.

Title: Beyond Real versus Fake Towards Intent-Aware Video Analysis

Authors: Saurabh Atreya, Nabyl Quignon, Baptiste Chopin, Abhijit Das, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22455
Pdf URL: https://arxiv.org/pdf/2511.22455
Copy Paste: [[2511.22455]] Beyond Real versus Fake Towards Intent-Aware Video Analysis(https://arxiv.org/abs/2511.22455)
Keywords: security, generative
Abstract: The rapid advancement of generative models has led to increasingly realistic deepfake videos, posing significant societal and security risks. While existing detection methods focus on distinguishing real from fake videos, such approaches fail to address a fundamental question: What is the intent behind a manipulated video? Towards addressing this question, we introduce IntentHQ: a new benchmark for human-centered intent analysis, shifting the paradigm from authenticity verification to contextual understanding of videos. IntentHQ consists of 5168 videos that have been meticulously collected and annotated with 23 fine-grained intent-categories, including "Financial fraud", "Indirect marketing", "Political propaganda", as well as "Fear mongering". We perform intent recognition with supervised and self-supervised multi-modality models that integrate spatio-temporal video features, audio processing, and text analysis to infer underlying motivations and goals behind videos. Our proposed model is streamlined to differentiate between a wide range of intent-categories.

Title: ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models

Authors: Zhenglin Zhou, Fan Ma, Xiaobo Xia, Hehe Fan, Yi Yang, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22456
Pdf URL: https://arxiv.org/pdf/2511.22456
Copy Paste: [[2511.22456]] ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models(https://arxiv.org/abs/2511.22456)
Keywords: diffusion, generative
Abstract: We explore inference-time scaling in text-guided 3D diffusion models to enhance generative quality without additional training. To this end, we introduce ITS3D, a framework that formulates the task as an optimization problem to identify the most effective Gaussian noise input. The framework is driven by a verifier-guided search algorithm, where the search algorithm iteratively refines noise candidates based on verifier feedback. To address the inherent challenges of 3D generation, we introduce three techniques for improved stability, efficiency, and exploration capability. 1) Gaussian normalization is applied to stabilize the search process. It corrects distribution shifts when noise candidates deviate from a standard Gaussian distribution during iterative updates. 2) The high-dimensional nature of the 3D search space increases computational complexity. To mitigate this, a singular value decomposition-based compression technique is employed to reduce dimensionality while preserving effective search directions. 3) To further prevent convergence to suboptimal local minima, a singular space reset mechanism dynamically updates the search space based on diversity measures. Extensive experiments demonstrate that ITS3D enhances text-to-3D generation quality, which shows the potential of computationally efficient search methods in generative processes. The source code is available at this https URL.

Title: Gaussians on Fire: High-Frequency Reconstruction of Flames

Authors: Jakob Nazarenus, Dominik Michels, Wojtek Palubicki, Simin Kou, Fang-Lue Zhang, Soren Pirk, Reinhard Koch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22459
Pdf URL: https://arxiv.org/pdf/2511.22459
Copy Paste: [[2511.22459]] Gaussians on Fire: High-Frequency Reconstruction of Flames(https://arxiv.org/abs/2511.22459)
Keywords: robust
Abstract: We propose a method to reconstruct dynamic fire in 3D from a limited set of camera views with a Gaussian-based spatiotemporal representation. Capturing and reconstructing fire and its dynamics is highly challenging due to its volatile nature, transparent quality, and multitude of high-frequency features. Despite these challenges, we aim to reconstruct fire from only three views, which consequently requires solving for under-constrained geometry. We solve this by separating the static background from the dynamic fire region by combining dense multi-view stereo images with monocular depth priors. The fire is initialized as a 3D flow field, obtained by fusing per-view dense optical flow projections. To capture the high frequency features of fire, each 3D Gaussian encodes a lifetime and linear velocity to match the dense optical flow. To ensure sub-frame temporal alignment across cameras we employ a custom hardware synchronization pattern -- allowing us to reconstruct fire with affordable commodity hardware. Our quantitative and qualitative validations across numerous reconstruction experiments demonstrate robust performance for diverse and challenging real fire scenarios.

Title: RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding

Authors: Xiyan Liu, Han Wang, Yuhu Wang, Junjie Cai, Zhe Cao, Jianzhong Yang, Zhen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22466
Pdf URL: https://arxiv.org/pdf/2511.22466
Copy Paste: [[2511.22466]] RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding(https://arxiv.org/abs/2511.22466)
Keywords: segmentation
Abstract: Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at this https URL.

Title: Rethinking Cross-Generator Image Forgery Detection through DINOv3

Authors: Zhenglin Huang, Jason Li, Haiquan Wen, Tianxiao Li, Xi Yang, Lu Qi, Bei Peng, Xiaowei Huang, Ming-Hsuan Yang, Guangliang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22471
Pdf URL: https://arxiv.org/pdf/2511.22471
Copy Paste: [[2511.22471]] Rethinking Cross-Generator Image Forgery Detection through DINOv3(https://arxiv.org/abs/2511.22471)
Keywords: generative
Abstract: As generative models become increasingly diverse and powerful, cross-generator detection has emerged as a new challenge. Existing detection methods often memorize artifacts of specific generative models rather than learning transferable cues, leading to substantial failures on unseen generators. Surprisingly, this work finds that frozen visual foundation models, especially DINOv3, already exhibit strong cross-generator detection capability without any fine-tuning. Through systematic studies on frequency, spatial, and token perspectives, we observe that DINOv3 tends to rely on global, low-frequency structures as weak but transferable authenticity cues instead of high-frequency, generator-specific artifacts. Motivated by this insight, we introduce a simple, training-free token-ranking strategy followed by a lightweight linear probe to select a small subset of authenticity-relevant tokens. This token subset consistently improves detection accuracy across all evaluated datasets. Our study provides empirical evidence and a feasible hypothesis for understanding why foundation models generalize across diverse generators, offering a universal, efficient, and interpretable baseline for image forgery detection.

Title: Adversarial Flow Models

Authors: Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.22475
Pdf URL: https://arxiv.org/pdf/2511.22475
Copy Paste: [[2511.22475]] Adversarial Flow Models(https://arxiv.org/abs/2511.22475)
Keywords: generative
Abstract: We present adversarial flow models, a class of generative models that unifies adversarial models and flow models. Our method supports native one-step or multi-step generation and is trained using the adversarial objective. Unlike traditional GANs, where the generator learns an arbitrary transport plan between the noise and the data distributions, our generator learns a deterministic noise-to-data mapping, which is the same optimal transport as in flow-matching models. This significantly stabilizes adversarial training. Also, unlike consistency-based methods, our model directly learns one-step or few-step generation without needing to learn the intermediate timesteps of the probability flow for propagation. This saves model capacity, reduces training iterations, and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model creates a new best FID of 2.38. We additionally show the possibility of end-to-end training of 56-layer and 112-layer models through depth repetition without any intermediate supervision, and achieve FIDs of 2.08 and 1.94 using a single forward pass, surpassing their 2NFE and 4NFE counterparts.

Title: Enhancing Trustworthiness with Mixed Precision: Benchmarks, Opportunities, and Challenges

Authors: Guanxi Lu, Hao Mark Chen, Zhiqiang Que, Wayne Luk, Hongxiang Fan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22483
Pdf URL: https://arxiv.org/pdf/2511.22483
Copy Paste: [[2511.22483]] Enhancing Trustworthiness with Mixed Precision: Benchmarks, Opportunities, and Challenges(https://arxiv.org/abs/2511.22483)
Keywords: robust, fair, large language model
Abstract: Large language models (LLMs) have shown promising performance across various tasks. However, their autoregressive decoding process poses significant challenges for efficient deployment on existing AI hardware. Quantization alleviates memory and compute pressure by compressing weights, activations, and KV caches to low precisions while preserving generation quality. However, existing quantization frameworks typically focus on perplexity or classification accuracy, often omitting critical trustworthiness metrics. This gap introduces risks when applying quantized LLMs to downstream high-stakes domains such as finance and healthcare. In this work, we systematically investigate the impact of quantization on four trustworthiness metrics (adversarial robustness, fairness, machine ethics, and out-of-distribution robustness) and identify the instability across compression ratios and quantization methods. Building on these observations, we develop a novel precision-ensemble voting approach that leverages predictions from mixed-precision variants of the same model and consistently improves performance by up to $5.8\%$ on trustworthiness metrics. Our results highlight the importance of considering trustworthiness when developing model compression techniques and point to research opportunities at the intersection of compression and trustworthiness for safety-critical applications.

Title: AI killed the video star. Audio-driven diffusion model for expressive talking head generation

Authors: Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22488
Pdf URL: https://arxiv.org/pdf/2511.22488
Copy Paste: [[2511.22488]] AI killed the video star. Audio-driven diffusion model for expressive talking head generation(https://arxiv.org/abs/2511.22488)
Keywords: diffusion, transformer
Abstract: We propose Dimitra++, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we propose a conditional Motion Diffusion Transformer (cMDT) to model facial motion sequences, employing a 3D representation. The cMDT is conditioned on two inputs: a reference facial image, which determines appearance, as well as an audio sequence, which drives the motion. Quantitative and qualitative experiments, as well as a user study on two widely employed datasets, i.e., VoxCeleb2 and CelebV-HQ, suggest that Dimitra++ is able to outperform existing approaches in generating realistic talking heads imparting lip motion, facial expression, and head pose.

Title: What Shape Is Optimal for Masks in Text Removal?

Authors: Hyakka Nakada, Marika Kubota
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22499
Pdf URL: https://arxiv.org/pdf/2511.22499
Copy Paste: [[2511.22499]] What Shape Is Optimal for Masks in Text Removal?(https://arxiv.org/abs/2511.22499)
Keywords: generative
Abstract: The advent of generative models has dramatically improved the accuracy of image inpainting. In particular, by removing specific text from document images, reconstructing original images is extremely important for industrial applications. However, most existing methods of text removal focus on deleting simple scene text which appears in images captured by a camera in an outdoor environment. There is little research dedicated to complex and practical images with dense text. Therefore, we created benchmark data for text removal from images including a large amount of text. From the data, we found that text-removal performance becomes vulnerable against mask profile perturbation. Thus, for practical text-removal tasks, precise tuning of the mask shape is essential. This study developed a method to model highly flexible mask profiles and learn their parameters using Bayesian optimization. The resulting profiles were found to be character-wise masks. It was also found that the minimum cover of a text region is not optimal. Our research is expected to pave the way for a user-friendly guideline for manual masking.

Title: Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

Authors: Katia Vendrame, Bolaji Yusuf, Santosh Kesiraju, Šimon Sedláček, Oldřich Plchot, Jan Černocký
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2511.22503
Pdf URL: https://arxiv.org/pdf/2511.22503
Copy Paste: [[2511.22503]] Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking(https://arxiv.org/abs/2511.22503)
Keywords: large language model
Abstract: End-to-end spoken dialogue state tracking (DST) is made difficult by the tandem of having to handle speech input and data scarcity. Combining speech foundation encoders and large language models has been proposed in recent work as to alleviate some of this difficulty. Although this approach has been shown to result in strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. However, collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work, we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments which show the efficacy of our proposed method for getting good cross-domain DST performance without relying on spoken training data from the target domains.

Title: Privacy-Utility-Bias Trade-offs for Privacy-Preserving Recommender Systems

Authors: Shiva Parsarad, Isabel Wagner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22515
Pdf URL: https://arxiv.org/pdf/2511.22515
Copy Paste: [[2511.22515]] Privacy-Utility-Bias Trade-offs for Privacy-Preserving Recommender Systems(https://arxiv.org/abs/2511.22515)
Keywords: privacy, protect, fair
Abstract: Recommender systems (RSs) output ranked lists of items, such as movies or restaurants, that users may find interesting, based on the user's past ratings and ratings from other users. RSs increasingly incorporate differential privacy (DP) to protect user data, raising questions about how privacy mechanisms affect both recommendation accuracy and fairness. We conduct a comprehensive, cross-model evaluation of two DP mechanisms, differentially private stochastic gradient descent (DPSGD) and local differential privacy (LDP), applied to four recommender systems (Neural Collaborative Filtering (NCF), Bayesian Personalized Ranking (BPR), Singular Value Decomposition (SVD), and Variational Autoencoder (VAE)) on the MovieLens-1M and Yelp datasets. We find that stronger privacy consistently reduces utility, but not uniformly. NCF under DPSGD shows the smallest accuracy loss (under 10 percent at epsilon approximately 1), whereas SVD and BPR experience larger drops, especially for users with niche preferences. VAE is the most sensitive to privacy, with sharp declines for sparsely represented groups. The impact on bias metrics is similarly heterogeneous. DPSGD generally reduces the gap between recommendations of popular and less popular items, whereas LDP preserves existing patterns more closely. These results highlight that no single DP mechanism is uniformly superior; instead, each provides trade-offs under different privacy regimes and data conditions.

Title: List-Decodable Regression via Expander Sketching

Authors: Herbod Pourali, Sajjad Hashemian, Ebrahim Ardeshir-Larijani
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2511.22524
Pdf URL: https://arxiv.org/pdf/2511.22524
Copy Paste: [[2511.22524]] List-Decodable Regression via Expander Sketching(https://arxiv.org/abs/2511.22524)
Keywords: robust
Abstract: We introduce an expander-sketching framework for list-decodable linear regression that achieves sample complexity $\tilde{O}((d+\log(1/\delta))/\alpha)$, list size $O(1/\alpha)$, and near input-sparsity running time $\tilde{O}(\mathrm{nnz}(X)+d^{3}/\alpha)$ under standard sub-Gaussian assumptions. Our method uses lossless expanders to synthesize lightly contaminated batches, enabling robust aggregation and a short spectral filtering stage that matches the best known efficient guarantees while avoiding SoS machinery and explicit batch structure.

Title: CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving

Authors: Zhaohui Wang, Tengbo Yu, Hao Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22532
Pdf URL: https://arxiv.org/pdf/2511.22532
Copy Paste: [[2511.22532]] CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving(https://arxiv.org/abs/2511.22532)
Keywords: robust
Abstract: Vision-Language-Action (VLA) models have recently attracted growing attention in end-to-end autonomous driving for their strong reasoning capabilities and rich world knowledge. However, existing VLAs often suffer from limited numerical reasoning ability and overly simplified input-output mappings, which hinder their performance in complex driving scenarios requiring step-by-step causal reasoning. To address these challenges, we propose CoT4AD, a novel VLA framework that introduces Chain-of-Thought (CoT) reasoning for autonomous driving to enhance both numerical and causal reasoning in Vision-Language Models (VLMs). CoT4AD integrates visual observations and language instructions to perform semantic reasoning, scene understanding, and trajectory planning. During training, it explicitly models a perception-question-prediction-action CoT to align the reasoning space with the action space across multiple driving tasks. During inference, it performs implicit CoT reasoning to enable consistent numerical reasoning and robust decision-making in dynamic environments. Extensive experiments on both real-world and simulated benchmarks, including nuScenes and Bench2Drive, demonstrate that CoT4AD achieves state-of-the-art performance in both open-loop and closed-loop evaluations. Code will be released upon paper acceptance.

Title: Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration

Authors: Mengyu Yang, Yanming Yang, Chenyi Xu, Chenxi Song, Yufan Zuo, Tong Zhao, Ruibo Li, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22533
Pdf URL: https://arxiv.org/pdf/2511.22533
Copy Paste: [[2511.22533]] Fast3Dcache: Training-free 3D Geometry Synthesis Acceleration(https://arxiv.org/abs/2511.22533)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved impressive generative quality across modalities like 2D images, videos, and 3D shapes, but their inference remains computationally expensive due to the iterative denoising process. While recent caching-based methods effectively reuse redundant computations to speed up 2D and video generation, directly applying these techniques to 3D diffusion models can severely disrupt geometric consistency. In 3D synthesis, even minor numerical errors in cached latent features accumulate, causing structural artifacts and topological inconsistencies. To overcome this limitation, we propose Fast3Dcache, a training-free geometry-aware caching framework that accelerates 3D diffusion inference while preserving geometric fidelity. Our method introduces a Predictive Caching Scheduler Constraint (PCSC) to dynamically determine cache quotas according to voxel stabilization patterns and a Spatiotemporal Stability Criterion (SSC) to select stable features for reuse based on velocity magnitude and acceleration criterion. Comprehensive experiments show that Fast3Dcache accelerates inference significantly, achieving up to a 27.12% speed-up and a 54.8% reduction in FLOPs, with minimal degradation in geometric quality as measured by Chamfer Distance (2.48%) and F-Score (1.95%).

Title: Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior

Authors: Ruoyu Feng, Yunpeng Qi, Jinming Liu, Yixin Gao, Xin Li, Xin Jin, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22549
Pdf URL: https://arxiv.org/pdf/2511.22549
Copy Paste: [[2511.22549]] Diff-ICMH: Harmonizing Machine and Human Vision in Image Compression with Generative Prior(https://arxiv.org/abs/2511.22549)
Keywords: extraction, diffusion, generative
Abstract: Image compression methods are usually optimized isolatedly for human perception or machine analysis tasks. We reveal fundamental commonalities between these objectives: preserving accurate semantic information is paramount, as it directly dictates the integrity of critical information for intelligent tasks and aids human understanding. Concurrently, enhanced perceptual quality not only improves visual appeal but also, by ensuring realistic image distributions, benefits semantic feature extraction for machine tasks. Based on this insight, we propose Diff-ICMH, a generative image compression framework aiming for harmonizing machine and human vision in image compression. It ensures perceptual realism by leveraging generative priors and simultaneously guarantees semantic fidelity through the incorporation of Semantic Consistency loss (SC loss) during training. Additionally, we introduce the Tag Guidance Module (TGM) that leverages highly semantic image-level tags to stimulate the pre-trained diffusion model's generative capabilities, requiring minimal additional bit rates. Consequently, Diff-ICMH supports multiple intelligent tasks through a single codec and bitstream without any task-specific adaptation, while preserving high-quality visual experience for human perception. Extensive experimental results demonstrate Diff-ICMH's superiority and generalizability across diverse tasks, while maintaining visual appeal for human perception. Code is available at: this https URL.

Title: Bringing Your Portrait to 3D Presence

Authors: Jiawei Zhang, Lei Chu, Jiahao Li, Zhenyu Zang, Chong Li, Xiao Li, Xun Cao, Hao Zhu, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22553
Pdf URL: https://arxiv.org/pdf/2511.22553
Copy Paste: [[2511.22553]] Bringing Your Portrait to 3D Presence(https://arxiv.org/abs/2511.22553)
Keywords: robust, generative
Abstract: We present a unified framework for reconstructing animatable 3D human avatars from a single portrait across head, half-body, and full-body inputs. Our method tackles three bottlenecks: pose- and framing-sensitive feature representations, limited scalable data, and unreliable proxy-mesh estimation. We introduce a Dual-UV representation that maps image features to a canonical UV space via Core-UV and Shell-UV branches, eliminating pose- and framing-induced token shifts. We also build a factorized synthetic data manifold combining 2D generative diversity with geometry-consistent 3D renderings, supported by a training scheme that improves realism and identity consistency. A robust proxy-mesh tracker maintains stability under partial visibility. Together, these components enable strong in-the-wild generalization. Trained only on half-body synthetic data, our model achieves state-of-the-art head and upper-body reconstruction and competitive full-body results. Extensive experiments and analyses further validate the effectiveness of our approach.

Title: Text Condition Embedded Regression Network for Automated Dental Abutment Design

Authors: Mianjie Zheng, Xinquan Yang, Xuguang Li, Xiaoling Luo, Xuefen Liu, Kun Tang, He Meng, Linlin Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22578
Pdf URL: https://arxiv.org/pdf/2511.22578
Copy Paste: [[2511.22578]] Text Condition Embedded Regression Network for Automated Dental Abutment Design(https://arxiv.org/abs/2511.22578)
Keywords: extraction
Abstract: The abutment is an important part of artificial dental implants, whose design process is time-consuming and labor-intensive. Long-term use of inappropriate dental implant abutments may result in implant complications, including peri-implantitis. Using artificial intelligence to assist dental implant abutment design can quickly improve the efficiency of abutment design and enhance abutment adaptability. In this paper, we propose a text condition embedded abutment design framework (TCEAD), the novel automated abutment design solution available in literature. The proposed study extends the self-supervised learning framework of the mesh mask autoencoder (MeshMAE) by introducing a text-guided localization (TGL) module to facilitate abutment area localization. As the parameter determination of the abutment is heavily dependent on local fine-grained features (the width and height of the implant and the distance to the opposing tooth), we pre-train the encoder using oral scan data to improve the model's feature extraction ability. Moreover, considering that the abutment area is only a small part of the oral scan data, we designed a TGL module, which introduces the description of the abutment area through the text encoder of Contrastive Language-Image Pre-training (CLIP), enabling the network to quickly locate the abutment area. We validated the performance of TCEAD on a large abutment design dataset. Extensive experiments demonstrate that TCEAD achieves an Intersection over Union (IoU) improvement of 0.8%-12.85% over other mainstream methods, underscoring its potential in automated dental abutment design.

Title: Smarter, not Bigger: Fine-Tuned RAG-Enhanced LLMs for Automotive HIL Testing

Authors: Chao Feng, Zihan Liu, Siddhant Gupta, Gongpei Cui, Jan von der Assen, Burkhard Stiller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22584
Pdf URL: https://arxiv.org/pdf/2511.22584
Copy Paste: [[2511.22584]] Smarter, not Bigger: Fine-Tuned RAG-Enhanced LLMs for Automotive HIL Testing(https://arxiv.org/abs/2511.22584)
Keywords: large language model
Abstract: Hardware-in-the-Loop (HIL) testing is essential for automotive validation but suffers from fragmented and underutilized test artifacts. This paper presents HIL-GPT, a retrieval-augmented generation (RAG) system integrating domain-adapted large language models (LLMs) with semantic retrieval. HIL-GPT leverages embedding fine-tuning using a domain-specific dataset constructed via heuristic mining and LLM-assisted synthesis, combined with vector indexing for scalable, traceable test case and requirement retrieval. Experiments show that fine-tuned compact models, such as \texttt{bge-base-en-v1.5}, achieve a superior trade-off between accuracy, latency, and cost compared to larger models, challenging the notion that bigger is always better. An A/B user study further confirms that RAG-enhanced assistants improve perceived helpfulness, truthfulness, and satisfaction over general-purpose LLMs. These findings provide insights for deploying efficient, domain-aligned LLM-based assistants in industrial HIL environments.

Title: The Multiclass Score-Oriented Loss (MultiSOL) on the Simplex

Authors: Francesco Marchetti, Edoardo Legnaro, Sabrina Guastavino
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22587
Pdf URL: https://arxiv.org/pdf/2511.22587
Copy Paste: [[2511.22587]] The Multiclass Score-Oriented Loss (MultiSOL) on the Simplex(https://arxiv.org/abs/2511.22587)
Keywords: robust
Abstract: In the supervised binary classification setting, score-oriented losses have been introduced with the aim of optimizing a chosen performance metric directly during the training phase, thus avoiding \textit{a posteriori} threshold tuning. To do this, in their construction, the decision threshold is treated as a random variable provided with a certain \textit{a priori} distribution. In this paper, we use a recently introduced multidimensional threshold-based classification framework to extend such score-oriented losses to multiclass classification, defining the Multiclass Score-Oriented Loss (MultiSOL) functions. As also demonstrated by several classification experiments, this proposed family of losses is designed to preserve the main advantages observed in the binary setting, such as the direct optimization of the target metric and the robustness to class imbalance, achieving performance comparable to other state-of-the-art loss functions and providing new insights into the interaction between simplex geometry and score-oriented learning.

Title: AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection

Authors: Dayou Huang, Feng Xue, Xurui Li, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22595
Pdf URL: https://arxiv.org/pdf/2511.22595
Copy Paste: [[2511.22595]] AnoRefiner: Anomaly-Aware Group-Wise Refinement for Zero-Shot Industrial Anomaly Detection(https://arxiv.org/abs/2511.22595)
Keywords: transformer
Abstract: Zero-shot industrial anomaly detection (ZSAD) methods typically yield coarse anomaly maps as vision transformers (ViTs) extract patch-level features only. To solve this, recent solutions attempt to predict finer anomalies using features from ZSAD, but they still struggle to recover fine-grained anomalies without missed detections, mainly due to the gap between randomly synthesized training anomalies and real ones. We observe that anomaly score maps exactly provide complementary spatial cues that are largely absent from ZSAD's image features, a fact overlooked before. Inspired by this, we propose an anomaly-aware refiner (AnoRefiner) that can be plugged into most ZSAD models and improve patch-level anomaly maps to the pixel level. First, we design an anomaly refinement decoder (ARD) that progressively enhances image features using anomaly score maps, reducing the reliance on synthetic anomaly data. Second, motivated by the mass production paradigm, we propose a progressive group-wise test-time training (PGT) strategy that trains ARD in each product group for the refinement process in the next group, while staying compatible with any ZSAD method. Experiments on the MVTec AD and VisA datasets show that AnoRefiner boosts various ZSAD models by up to a 5.2\% gain in pixel-AP metrics, which can also be directly observed in many visualizations. The code will be available at this https URL.

Title: LLM-Cave: A benchmark and light environment for large language models reasoning and decision-making system

Authors: Huanyu Li, Zongyuan Li, Wei Huang, Xian Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22598
Pdf URL: https://arxiv.org/pdf/2511.22598
Copy Paste: [[2511.22598]] LLM-Cave: A benchmark and light environment for large language models reasoning and decision-making system(https://arxiv.org/abs/2511.22598)
Keywords: large language model
Abstract: Large language models (LLMs) such as ChatGPT o1, ChatGPT o3, and DeepSeek R1 have shown great potential in solving difficult problems. However, current LLM evaluation benchmarks are limited to one-step interactions. Some of the existing sequence decision-making environments, such as TextStarCraftII and LLM-PySC2, are too complicated and require hours of interaction to complete a game. In this paper, we introduce LLM-Cave, a benchmark and light environment for LLM reasoning and decision-making systems. This environment is a classic instance in the era of Symbolism. Artificial intelligence enables the agent to explore the environment and avoid potential losses by reasoning about nearby dangers using partial observable state information. In the experiment, we evaluated the sequential reasoning ability, decision-making performance and computational efficiency of mainstream large language models (LLMs) such as GPT-4o-mini, o1-mini, and DeepSeek-R1. Experiments show that while Deepseek-R1 achieved the highest success rate on complex reasoning tasks, smaller models like 4o-mini significantly narrowed the performance gap on challenges by employing Chain of Speculation and Planner-Critic strategies, at the expense of reduced computational efficiency. This indicates that structured, multi-step reasoning combined with an LLM-based feedback mechanism can substantially enhance an LLM's decision-making capabilities, providing a promising direction for improving reasoning in weaker models and suggesting a new reasoning-centered benchmark for LLM assessment. Our code is open-sourced in this https URL.

Title: GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing

Authors: Xiaoyin Yang
Subjects: cs.CV, cs.AI, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22607
Pdf URL: https://arxiv.org/pdf/2511.22607
Copy Paste: [[2511.22607]] GazeTrack: High-Precision Eye Tracking Based on Regularization and Spatial Computing(https://arxiv.org/abs/2511.22607)
Keywords: segmentation
Abstract: Eye tracking has become increasingly important in virtual and augmented reality applications; however, the current gaze accuracy falls short of meeting the requirements for spatial computing. We designed a gaze collection framework and utilized high-precision equipment to gather the first precise benchmark dataset, GazeTrack, encompassing diverse ethnicities, ages, and visual acuity conditions for pupil localization and gaze tracking. We propose a novel shape error regularization method to constrain pupil ellipse fitting and train on open-source datasets, enhancing semantic segmentation and pupil position prediction accuracy. Additionally, we invent a novel coordinate transformation method similar to paper unfolding to accurately predict gaze vectors on the GazeTrack dataset. Finally, we built a gaze vector generation model that achieves reduced gaze angle error with lower computational complexity compared to other methods.

Title: MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

Authors: Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22609
Pdf URL: https://arxiv.org/pdf/2511.22609
Copy Paste: [[2511.22609]] MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory(https://arxiv.org/abs/2511.22609)
Keywords: robust
Abstract: We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.

Title: Improving LLM-based Ontology Matching with fine-tuning on synthetic data

Authors: Guilherme Sousa, Rinaldo Lima, Cassia Trojahn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22612
Pdf URL: https://arxiv.org/pdf/2511.22612
Copy Paste: [[2511.22612]] Improving LLM-based Ontology Matching with fine-tuning on synthetic data(https://arxiv.org/abs/2511.22612)
Keywords: large language model
Abstract: Large Language Models (LLMs) are increasingly being integrated into various components of Ontology Matching pipelines. This paper investigates the capability of LLMs to perform ontology matching directly on ontology modules and generate the corresponding alignments. Furthermore, it is explored how a dedicated fine-tuning strategy can enhance the model's matching performance in a zero-shot setting. The proposed method incorporates a search space reduction technique to select relevant subsets from both source and target ontologies, which are then used to automatically construct prompts. Recognizing the scarcity of reference alignments for training, a novel LLM-based approach is introduced for generating a synthetic dataset. This process creates a corpus of ontology submodule pairs and their corresponding reference alignments, specifically designed to fine-tune an LLM for the ontology matching task. The proposed approach was evaluated on the Conference, Geolink, Enslaved, Taxon, and Hydrography datasets from the OAEI complex track. The results demonstrate that the LLM fine-tuned on the synthetically generated data exhibits superior performance compared to the non-fine-tuned base model. The key contribution is a strategy that combines automatic dataset generation with fine-tuning to effectively adapt LLMs for ontology matching tasks.

Title: Stable-Drift: A Patient-Aware Latent Drift Replay Method for Stabilizing Representations in Continual Learning

Authors: Paraskevi-Antonia Theofilou, Anuhya Thota, Stefanos Kollias, Mamatha Thota
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22615
Pdf URL: https://arxiv.org/pdf/2511.22615
Copy Paste: [[2511.22615]] Stable-Drift: A Patient-Aware Latent Drift Replay Method for Stabilizing Representations in Continual Learning(https://arxiv.org/abs/2511.22615)
Keywords: robust, transformer
Abstract: When deep learning models are sequentially trained on new data, they tend to abruptly lose performance on previously learned tasks, a critical failure known as catastrophic forgetting. This challenge severely limits the deployment of AI in medical imaging, where models must continually adapt to data from new hospitals without compromising established diagnostic knowledge. To address this, we introduce a latent drift-guided replay method that identifies and replays samples with high representational instability. Specifically, our method quantifies this instability via latent drift, the change in a sample internal feature representation after naive domain adaptation. To ensure diversity and clinical relevance, we aggregate drift at the patient level, our memory buffer stores the per patient slices exhibiting the greatest multi-layer representation shift. Evaluated on a cross-hospital COVID-19 CT classification task using state-of-the-art CNN and Vision Transformer backbones, our method substantially reduces forgetting compared to naive fine-tuning and random replay. This work highlights latent drift as a practical and interpretable replay signal for advancing robust continual learning in real world medical settings.

Title: Federated Learning Survey: A Multi-Level Taxonomy of Aggregation Techniques, Experimental Insights, and Future Frontiers

Authors: Meriem Arbaoui, Mohamed-el-Amine Brahmia, Abdellatif Rahmoun, Mourad Zghal
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2511.22616
Pdf URL: https://arxiv.org/pdf/2511.22616
Copy Paste: [[2511.22616]] Federated Learning Survey: A Multi-Level Taxonomy of Aggregation Techniques, Experimental Insights, and Future Frontiers(https://arxiv.org/abs/2511.22616)
Keywords: security, privacy, robust, federate
Abstract: The integration of IoT and AI has unlocked innovation across industries, but growing privacy concerns and data isolation hinder progress. Traditional centralized ML struggles to overcome these challenges, which has led to the rise of Federated Learning (FL), a decentralized paradigm that enables collaborative model training without sharing local raw data. FL ensures data privacy, reduces communication overhead, and supports scalability, yet its heterogeneity adds complexity compared to centralized approaches. This survey focuses on three main FL research directions: personalization, optimization, and robustness, offering a structured classification through a hybrid methodology that combines bibliometric analysis with systematic review to identify the most influential works. We examine challenges and techniques related to heterogeneity, efficiency, security, and privacy, and provide a comprehensive overview of aggregation strategies, including architectures, synchronization methods, and diverse federation objectives. To complement this, we discuss practical evaluation approaches and present experiments comparing aggregation methods under IID and non-IID data distributions. Finally, we outline promising research directions to advance FL, aiming to guide future innovation in this rapidly evolving field.

Title: REASONEDIT: Towards Reasoning-Enhanced Image Editing Models

Authors: Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22625
Pdf URL: https://arxiv.org/pdf/2511.22625
Copy Paste: [[2511.22625]] REASONEDIT: Towards Reasoning-Enhanced Image Editing Models(https://arxiv.org/abs/2511.22625)
Keywords: diffusion, large language model
Abstract: Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and Qwen-Image-Edit, where the MLLM encodes both the reference image and the instruction but remains frozen during training. In this work, we demonstrate that unlocking the reasoning capabilities of MLLM can further push the boundaries of editing models. Specifically, we explore two reasoning mechanisms, thinking and reflection, which enhance instruction understanding and editing accuracy. Based on that, our proposed framework enables image editing in a thinking-editing-reflection loop: the thinking mechanism leverages the world knowledge of MLLM to interpret abstract instructions, while the reflection reviews editing results, automatically corrects unintended manipulations, and identifies the stopping round. Extensive experiments demonstrate that our reasoning approach achieves significant performance gains, with improvements of ImgEdit (+4.3%), GEdit (+4.7%), and Kris (+8.2%) when initializing our DiT from the Step1X-Edit (ReasonEdit-S), and also outperforms previous open-source methods on both GEdit and Kris when integrated with Qwen-Image-Edit (ReasonEdit-Q).

Title: Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning

Authors: Riccardo De Santi, Marin Vlastelica, Ya-Ping Hsieh, Zebang Shen, Niao He, Andreas Krause
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22640
Pdf URL: https://arxiv.org/pdf/2511.22640
Copy Paste: [[2511.22640]] Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning(https://arxiv.org/abs/2511.22640)
Keywords: diffusion, generative
Abstract: Adapting large-scale foundation flow and diffusion generative models to optimize task-specific objectives while preserving prior information is crucial for real-world applications such as molecular design, protein docking, and creative image generation. Existing principled fine-tuning methods aim to maximize the expected reward of generated samples, while retaining knowledge from the pre-trained model via KL-divergence regularization. In this work, we tackle the significantly more general problem of optimizing general utilities beyond average rewards, including risk-averse and novelty-seeking reward maximization, diversity measures for exploration, and experiment design objectives among others. Likewise, we consider more general ways to preserve prior information beyond KL-divergence, such as optimal transport distances and Renyi divergences. To this end, we introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine-tuning tasks, each solvable via scalable established methods. We derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows. Finally, we validate our method on illustrative settings, text-to-image, and molecular design tasks, showing that it can steer pre-trained generative models to optimize objectives and solve practically relevant tasks beyond the reach of current fine-tuning schemes.

Title: GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes

Authors: Di Wang, Shunyu Liu, Wentao Jiang, Fengxiang Wang, Yi Liu, Xiaolei Qin, Zhiming Luo, Chaoyang Zhou, Haonan Guo, Jing Zhang, Bo Du, Dacheng Tao, Liangpei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22645
Pdf URL: https://arxiv.org/pdf/2511.22645
Copy Paste: [[2511.22645]] GeoZero: Incentivizing Reasoning from Scratch on Geospatial Scenes(https://arxiv.org/abs/2511.22645)
Keywords: large language model
Abstract: Multimodal large language models (MLLMs) have undergone rapid development in advancing geospatial scene understanding. Recent studies have sought to enhance the reasoning capabilities of remote sensing MLLMs, typically through cold-start training with elaborately curated chain-of-thought (CoT) data. However, this approach not only incurs substantial annotation costs but also introduces human biases that may limit the diversity of model reasoning. To address these challenges, we propose GeoZero, a framework that enables MLLMs to perform geospatial reasoning without any predefined CoT supervision. Specifically, we construct two datasets, GeoZero-Instruct and GeoZero-Hard. GeoZero-Instruct allows the model to acquire preliminary geospatial knowledge through supervised fine-tuning, while GeoZero-Hard stimulates deep reasoning during the subsequent reinforcement learning stage. Furthermore, we introduce Answer-Anchored Group Relative Policy Optimization (A$^2$GRPO), where the reasoning process is regularized by the model's own answers, encouraging diverse yet accurate thinking. Extensive experiments on multiple remote sensing vision-language benchmarks demonstrate that GeoZero not only surpasses existing state-of-the-art methods but also fosters universal emergent reasoning capabilities across diverse geospatial tasks. Code,data,and models will be publicly available at this https URL.

Title: Spatially Aware Dictionary-Free Eigenfunction Identification for Modeling and Control of Nonlinear Dynamical Systems

Authors: David Grasev
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22648
Pdf URL: https://arxiv.org/pdf/2511.22648
Copy Paste: [[2511.22648]] Spatially Aware Dictionary-Free Eigenfunction Identification for Modeling and Control of Nonlinear Dynamical Systems(https://arxiv.org/abs/2511.22648)
Keywords: robust
Abstract: A new approach to data-driven discovery of Koopman eigenfunctions without a pre-defined set of basis functions is proposed. The approach is based on a reference trajectory, for which the Koopman mode amplitudes are first identified, and the Koopman mode decomposition is transformed to a new basis, which contains fundamental functions of eigenvalues and time. The initial values of the eigenfunctions are obtained by projecting trajectories onto this basis via a regularized least-squares fit. A global optimizer was employed to optimize the eigenvalues. Mapping initial-state values to eigenfunction values reveals their spatial structure, enabling the numerical computation of their gradients. Thus, deviations from the Koopman partial differential equation are penalized, leading to more robust solutions. The approach was successfully tested on several benchmark nonlinear dynamical systems, including the FitzHugh-Nagumo system with inputs, van der Pol and Duffing oscillators, and a 2-spool turbojet engine with control. The study demonstrates that incorporating principal eigenvalues and spatial structure integrity promotion significantly improves the accuracy of Koopman predictors. The approach effectively discovers Koopman spectral components even with sparse state-space sampling and reveals geometric features of the state space, such as invariant partitions. Finally, the numerical approximation of the eigenfunction gradient can be used for input dynamics modeling and control design. The results support the practicality of the approach for use with various dynamical systems.

Title: Automated Design Optimization via Strategic Search with Large Language Models

Authors: Anthony Carreon, Vansh Sharma, Venkat Raman
Subjects: cs.LG, cs.AI, cs.CE, cs.MA
Abstract URL: https://arxiv.org/abs/2511.22651
Pdf URL: https://arxiv.org/pdf/2511.22651
Copy Paste: [[2511.22651]] Automated Design Optimization via Strategic Search with Large Language Models(https://arxiv.org/abs/2511.22651)
Keywords: large language model
Abstract: Traditional optimization methods excel in well-defined search spaces but struggle with design problems where transformations and design parameters are difficult to define. Large language models (LLMs) offer a promising alternative by dynamically interpreting design spaces and leveraging encoded domain knowledge. To this end, we introduce AUTO, an LLM agent framework that treats design optimization as a gradient-free search problem guided by strategic LLM reasoning. The framework employs two collaborative agents: a Strategist that selects between exploration and exploitation strategies, and an Implementor that executes detailed designs. Applied to GPU code optimization -- a domain critical to fields from machine learning to scientific computing -- AUTO generates solutions competitive with expert implementations for chemical kinetics integration and dense matrix multiplication. The framework achieves 50-70% search efficiency relative to Bayesian optimization methodologies. It completes optimizations in approximately 8 hours at an estimated cost of up to \$159 per run, compared to an estimated cost of up to \$480 with median-wage software developers. These findings open the door to automating design optimization in ill-defined search spaces with limited prior information.

Title: Modèles de Fondation et Ajustement : Vers une Nouvelle Génération de Modèles pour la Prévision des Séries Temporelles

Authors: Morad Laglil, Emilie Devijver, Eric Gaussier, Bertrand Pracca
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22674
Pdf URL: https://arxiv.org/pdf/2511.22674
Copy Paste: [[2511.22674]] Modèles de Fondation et Ajustement : Vers une Nouvelle Génération de Modèles pour la Prévision des Séries Temporelles(https://arxiv.org/abs/2511.22674)
Keywords: large language model
Abstract: Inspired by recent advances in large language models, foundation models have been developed for zero-shot time series forecasting, enabling prediction on datasets unseen during pretraining. These large-scale models, trained on vast collections of time series, learn generalizable representations for both point and probabilistic forecasting, reducing the need for task-specific architectures and manual tuning. In this work, we review the main architectures, pretraining strategies, and optimization methods used in such models, and study the effect of fine-tuning after pretraining to enhance their performance on specific datasets. Our empirical results show that fine-tuning generally improves zero-shot forecasting capabilities, especially for long-term horizons.

Title: Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield

Authors: Dongyang Liu, Peng Gao, David Liu, Ruoyi Du, Zhen Li, Qilong Wu, Xin Jin, Sihan Cao, Shifeng Zhang, Hongsheng Li, Steven Hoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22677
Pdf URL: https://arxiv.org/pdf/2511.22677
Copy Paste: [[2511.22677]] Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield(https://arxiv.org/abs/2511.22677)
Keywords: robust, diffusion
Abstract: Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. Among these, Distribution Matching Distillation (DMD) and its variants stand out for their impressive performance, which is widely attributed to their core mechanism of matching the student's output distribution to that of a pre-trained teacher model. In this work, we challenge this conventional understanding. Through a rigorous decomposition of the DMD training objective, we reveal that in complex tasks like text-to-image generation, where CFG is typically required for desirable few-step performance, the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We demonstrate that this term acts as the core ``engine'' of distillation, while the Distribution Matching (DM) term functions as a ``regularizer'' that ensures training stability and mitigates artifacts. We further validate this decoupling by demonstrating that while the DM term is a highly effective regularizer, it is not unique; simpler non-parametric constraints or GAN-based objectives can serve the same stabilizing function, albeit with different trade-offs. This decoupling of labor motivates a more principled analysis of the properties of both terms, leading to a more systematic and in-depth understanding. This new understanding further enables us to propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains. Notably, our method has been adopted by the Z-Image ( this https URL ) project to develop a top-tier 8-step image generation model, empirically validating the generalization and robustness of our findings.

Title: CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights

Authors: Mohaiminul Al Nahian (1), Abeer Matar A. Almalky (1), Gamana Aragonda (2), Ranyang Zhou (2), Sabbir Ahmed (1), Dmitry Ponomarev (1), Li Yang (3), Shaahin Angizi (2), Adnan Siraj Rakin (1) ((1) SUNY Binghamton, (2) New Jersey Institute of Technology, (3) UNC Charlotte)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22681
Pdf URL: https://arxiv.org/pdf/2511.22681
Copy Paste: [[2511.22681]] CacheTrap: Injecting Trojans in LLMs without Leaving any Traces in Inputs or Weights(https://arxiv.org/abs/2511.22681)
Keywords: defense, attack
Abstract: Adversarial weight perturbation has emerged as a concerning threat to LLMs that either use training privileges or system-level access to inject adversarial corruption in model weights. With the emergence of innovative defensive solutions that place system- and algorithm-level checks and corrections in the input and weight spaces, these perturbations are increasingly susceptible to defenses. This work develops a novel perspective on Trojan attacks that generates an attacker-designed model output while leaving no attack traces on the inputs or weights. Such an attack space can be unlocked through corruption of the key-value (KV) cache. In this paper, we introduce CacheTrap, a novel Trojan attack that corrupts the value vectors stored in the KV cache. These vectors capture the dynamic activations for specific token positions and therefore constitute a natural surface for transient, inference-time trigger insertion. The transient nature of these KV values and their dependence on victim input imply additional constraints on our attack, such as a lack of knowledge of the victim's data or domain application, and, consequently, a lack of gradient information. The objective of the proposed CacheTrap is to develop a vulnerable KV bit-searching algorithm so that, once the attack employs the identified bit-flip as a trigger, the model generates targeted behavior, e.g., classifying inputs towards the target class. Moreover, CacheTrap is a data- and gradient-free attack which also has no impact on the model's utility. Our evaluation demonstrates that the proposed attack enables the first successful Trojan attack on LLMs with a single bit flip in the KV cache. In addition, the data-independent nature of the attack ensures that once the attacker identifies the vulnerable bit index, the location remains constant and can be transferred to a wide range of victim tasks/datasets/queries with no overhead.

Title: Test-time scaling of diffusions with flow maps

Authors: Amirmojtaba Sabour, Michael S. Albergo, Carles Domingo-Enrich, Nicholas M. Boffi, Sanja Fidler, Karsten Kreis, Eric Vanden-Eijnden
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22688
Pdf URL: https://arxiv.org/pdf/2511.22688
Copy Paste: [[2511.22688]] Test-time scaling of diffusions with flow maps(https://arxiv.org/abs/2511.22688)
Keywords: diffusion
Abstract: A common recipe to improve diffusion models at test-time so that samples score highly against a user-specified reward is to introduce the gradient of the reward into the dynamics of the diffusion itself. This procedure is often ill posed, as user-specified rewards are usually only well defined on the data distribution at the end of generation. While common workarounds to this problem are to use a denoiser to estimate what a sample would have been at the end of generation, we propose a simple solution to this problem by working directly with a flow map. By exploiting a relationship between the flow map and velocity field governing the instantaneous transport, we construct an algorithm, Flow Map Trajectory Tilting (FMTT), which provably performs better ascent on the reward than standard test-time methods involving the gradient of the reward. The approach can be used to either perform exact sampling via importance weighting or principled search that identifies local maximizers of the reward-tilted distribution. We demonstrate the efficacy of our approach against other look-ahead techniques, and show how the flow map enables engagement with complicated reward functions that make possible new forms of image editing, e.g. by interfacing with vision language models.

Title: Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation

Authors: Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen, Anh Tuan Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22690
Pdf URL: https://arxiv.org/pdf/2511.22690
Copy Paste: [[2511.22690]] Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation(https://arxiv.org/abs/2511.22690)
Keywords: diffusion
Abstract: Despite recent advances in text-to-image generation, existing models consistently fail to produce reliable multi-human scenes, often duplicating faces, merging identities, or miscounting individuals. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect module predicts structured layouts, specifying where each person should appear. The Artist module then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with ArcFace identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model and optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images.

Title: Generative Anchored Fields: Controlled Data Generation via Emergent Velocity Fields and Transport Algebra

Authors: Deressa Wodajo Deressa, Hannes Mareen, Peter Lambert, Glenn Van Wallendael
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22693
Pdf URL: https://arxiv.org/pdf/2511.22693
Copy Paste: [[2511.22693]] Generative Anchored Fields: Controlled Data Generation via Emergent Velocity Fields and Transport Algebra(https://arxiv.org/abs/2511.22693)
Keywords: generative
Abstract: We present Generative Anchored Fields (GAF), a generative model that learns independent endpoint predictors $J$ (noise) and $K$ (data) rather than a trajectory predictor. The velocity field $v=K-J$ emerges from their time-conditioned disagreement. This factorization enables \textit{Transport Algebra}: algebraic operation on learned $\{(J_n,K_n)\}_{n=1}^N$ heads for compositional control. With class-specific $K_n$ heads, GAF supports a rich family of directed transport maps between a shared base distribution and multiple modalities, enabling controllable interpolation, hybrid generation, and semantic morphing through vector arithmetic. We achieve strong sample quality (FID 7.5 on CelebA-HQ $64\times 64$) while uniquely providing compositional generation as an architectural primitive. We further demonstrate, GAF has lossless cyclic transport between its initial and final state with LPIPS=$0.0$. Code available at this https URL

Title: Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Authors: Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Shijie Huang, Zhaohui Hou, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22699
Pdf URL: https://arxiv.org/pdf/2511.22699
Copy Paste: [[2511.22699]] Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer(https://arxiv.org/abs/2511.22699)
Keywords: diffusion, transformer, generative
Abstract: The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference, and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle -- from a curated data infrastructure to a streamlined training curriculum -- we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm also enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.

Title: Ghosting Your LLM: Without The Knowledge of Your Gradient and Data

Authors: Abeer Matar A. Almalky (1), Ziyan Wang (2), Mohaiminul Al Nahian (1), Li Yang (2), Adnan Siraj Rakin (1) ((1) Binghamton University, (2) UNC Charlotte)
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.22700
Pdf URL: https://arxiv.org/pdf/2511.22700
Copy Paste: [[2511.22700]] Ghosting Your LLM: Without The Knowledge of Your Gradient and Data(https://arxiv.org/abs/2511.22700)
Keywords: security, attack, robust, data-free, large language model
Abstract: In recent years, large language models (LLMs) have achieved substantial advancements and are increasingly integrated into critical applications across various domains. This growing adoption underscores the need to ensure their security and robustness. In this work, we focus on the impact of Bit Flip Attacks (BFAs) on LLMs, which exploits hardware faults to corrupt model parameters, posing a significant threat to model integrity and performance. Existing studies on BFA against LLMs adopt a progressive bit-search strategy that predominantly relies on gradient-based techniques to identify sensitive layers or weights. However, computing gradients comes with two specific challenges: First, in the context of LLMs, it increases computational and memory costs exponentially, and Second, it requires access to a sample victim dataset or knowledge of the victim domain to compute the gradient. In this work, we investigate beyond the scope of attack efficacy and aim to develop an efficient, practical Gradient-Data-free Bit-Flip Attack. The challenge lies in the core principle of adversarial attacks, which relies heavily on computing gradients from sample test/train data and manipulating model weights based on gradient information. To overcome this, we propose novel vulnerability index metrics that can identify vulnerable weight bits in LLMs independent of any gradient or data knowledge. By removing the dependency on gradient computation, our approach drastically reduces memory requirements and scales efficiently across multiple tasks with constant complexity. Experimental results demonstrate the efficiency of our method, requiring as few as a single bit flip to achieve adversarial objectives for five open-source LLMs.

Title: Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction

Authors: Boyao Zhou, Shunyuan Zheng, Zhanfeng Liao, Zihan Ma, Hanzhang Tu, Boning Liu, Yebin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22704
Pdf URL: https://arxiv.org/pdf/2511.22704
Copy Paste: [[2511.22704]] Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction(https://arxiv.org/abs/2511.22704)
Keywords: robust
Abstract: We present Splat-SAP, a feed-forward approach to render novel views of human-centered scenes from binocular cameras with large sparsity. Gaussian Splatting has shown its promising potential in rendering tasks, but it typically necessitates per-scene optimization with dense input views. Although some recent approaches achieve feed-forward Gaussian Splatting rendering through geometry priors obtained by multi-view stereo, such approaches still require largely overlapped input views to establish the geometry prior. To bridge this gap, we leverage pixel-wise point map reconstruction to represent geometry which is robust to large sparsity for its independent view modeling. In general, we propose a two-stage learning strategy. In stage 1, we transform the point map into real space via an iterative affinity learning process, which facilitates camera control in the following. In stage 2, we project point maps of two input views onto the target view plane and refine such geometry via stereo matching. Furthermore, we anchor Gaussian primitives on this refined plane in order to render high-quality images. As a metric representation, the scale-aware point map in stage 1 is trained in a self-supervised manner without 3D supervision and stage 2 is supervised with photo-metric loss. We collect multi-view human-centered data and demonstrate that our method improves both the stability of point map reconstruction and the visual quality of free-viewpoint rendering.

Title: ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering

Authors: Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2511.22715
Pdf URL: https://arxiv.org/pdf/2511.22715
Copy Paste: [[2511.22715]] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering(https://arxiv.org/abs/2511.22715)
Keywords: large language model
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved content, while supervised fine-tuning serves only as a cold start. Extensive experiments on Encyclopedic-VQA and InfoSeek demonstrate that ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence. Our source code is publicly available at: this https URL.

Title: All Centers Are at most a Few Tokens Apart: Knowledge Distillation with Domain Invariant Prompt Tuning

Authors: Amir Mohammad Ezzati, Alireza Malekhosseini, Armin Khosravi, Mohammad Hossein Rohban
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22739
Pdf URL: https://arxiv.org/pdf/2511.22739
Copy Paste: [[2511.22739]] All Centers Are at most a Few Tokens Apart: Knowledge Distillation with Domain Invariant Prompt Tuning(https://arxiv.org/abs/2511.22739)
Keywords: robust
Abstract: Domain generalization is critical in computational pathology (CPath) due to inherent domain shifts caused by variations in staining protocols, scanner devices, and imaging settings across clinical centers. Vision-language models (VLMs), such as PLIP-a pathology-tuned CLIP-trained on image-text pairs across diverse domains, serve as strong knowledge distillation sources. However, their zero-shot performance with predefined prompts remains limited due to sensitivity to prompt variations. Moreover, unlike natural images, histopathology centers lack semantic descriptors (e.g., 'sketch'), making it difficult to define domain-specific prompts for clinical centers. This requires a data-driven approach for learning domain-specific and ultimately class-generic continuous prompts. We propose Domain Invariant Prompt Tuning (DIPT) for knowledge distillation process, a novel step that learns multiple input tokens for each domain. These tokens are trained separately for each domain and are averaged across domains, leading to domain-invariant prompts. Our student model then distills knowledge from PLIP's text encoder by leveraging the prompts learned by DIPT. This leads to alignment of visual features with domain-invariant embeddings, enhancing generalization by training on multiple domains. Our method adds a significant improvement in average F1-score to existing state-of-the-art (SOTA) knowledge distillation approaches in domain generalization with histopathology datasets. This work helps the way of deploying robust CPath models in real-world clinical problems with heterogeneous data sources.

Title: VeriDispatcher: Multi-Model Dispatching through Pre-Inference Difficulty Prediction for RTL Generation Optimization

Authors: Zeng Wang, Weihua Xiao, Minghao Shao, Raghu Vamshi Hemadri, Ozgur Sinanoglu, Muhammad Shafique, Ramesh Karri
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22749
Pdf URL: https://arxiv.org/pdf/2511.22749
Copy Paste: [[2511.22749]] VeriDispatcher: Multi-Model Dispatching through Pre-Inference Difficulty Prediction for RTL Generation Optimization(https://arxiv.org/abs/2511.22749)
Keywords: large language model
Abstract: Large Language Models (LLMs) show strong performance in RTL generation, but different models excel on different tasks because of architecture and training differences. Prior work mainly prompts or finetunes a single model. What remains not well studied is how to coordinate multiple different LLMs so they jointly improve RTL quality while also reducing cost, instead of running all models and choosing the best output. We define this as the multi-LLM RTL generation problem. We propose VeriDispatcher, a multi-LLM RTL generation framework that dispatches each RTL task to suitable LLMs based on pre-inference difficulty prediction. For each model, we train a compact classifier over semantic embeddings of task descriptions, using difficulty scores derived from benchmark variants that combine syntax, structural similarity, and functional correctness. At inference, VeriDispatcher uses these predictors to route tasks to a selected subset of LLMs. Across 10 diverse LLMs on RTLLM and VerilogEval, VeriDispatcher achieves up to 18% accuracy improvement on RTLLM using only 40% of commercial calls, and on VerilogEval maintains accuracy while reducing commercial usage by 25%, enabling cost-effective, high-quality LLM deployment in hardware design automation.

Title: MammoRGB: Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models

Authors: Jorge Alberto Garza-Abdala, Gerardo A. Fumagal-González, Daly Avendano, Servando Cardona, Sadam Hussain, Eduardo de Avila-Armenta, Jasiel H. Toscano-Martínez, Diana S. M. Rosales Gurmendi, Alma A. Pedro-Pérez, Jose Gerardo Tamez-Pena
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22759
Pdf URL: https://arxiv.org/pdf/2511.22759
Copy Paste: [[2511.22759]] MammoRGB: Dual-View Mammogram Synthesis Using Denoising Diffusion Probabilistic Models(https://arxiv.org/abs/2511.22759)
Keywords: diffusion, segmentation
Abstract: Purpose: This study aims to develop and evaluate a three channel denoising diffusion probabilistic model (DDPM) for synthesizing single breast dual view mammograms and to assess the impact of channel representations on image fidelity and cross view consistency. Materials and Methods: A pretrained three channel DDPM, sourced from Hugging Face, was fine tuned on a private dataset of 11020 screening mammograms to generate paired craniocaudal (CC) and mediolateral oblique (MLO) views. Three third channel encodings of the CC and MLO views were evaluated: sum, absolute difference, and zero channel. Each model produced 500 synthetic image pairs. Quantitative assessment involved breast mask segmentation using Intersection over Union (IoU) and Dice Similarity Coefficient (DSC), with distributional comparisons against 2500 real pairs using Earth Movers Distance (EMD) and Kolmogorov Smirnov (KS) tests. Qualitative evaluation included a visual Turing test by a non expert radiologist to assess cross view consistency and artifacts. Results: Synthetic mammograms showed IoU and DSC distributions comparable to real images, with EMD and KS values (0.020 and 0.077 respectively). Models using sum or absolute difference encodings outperformed others in IoU and DSC (p < 0.001), though distributions remained broadly similar. Generated CC and MLO views maintained cross view consistency, with 6 to 8 percent of synthetic images exhibiting artifacts consistent with those in the training data. Conclusion: Three channel DDPMs can generate realistic and anatomically consistent dual view mammograms with promising applications in dataset augmentation.

Title: Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration

Authors: Kanchon Gharami, Quazi Sarwar Muhtaseem, Deepti Gupta, Lavanya Elluri, Shafika Showkat Moni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22769
Pdf URL: https://arxiv.org/pdf/2511.22769
Copy Paste: [[2511.22769]] Modeling Romanized Hindi and Bengali: Dataset Creation and Multilingual LLM Integration(https://arxiv.org/abs/2511.22769)
Keywords: robust, large language model
Abstract: The development of robust transliteration techniques to enhance the effectiveness of transforming Romanized scripts into native scripts is crucial for Natural Language Processing tasks, including sentiment analysis, speech recognition, information retrieval, and intelligent personal assistants. Despite significant advancements, state-of-the-art multilingual models still face challenges in handling Romanized script, where the Roman alphabet is adopted to represent the phonetic structure of diverse languages. Within the South Asian context, where the use of Romanized script for Indo-Aryan languages is widespread across social media and digital communication platforms, such usage continues to pose significant challenges for cutting-edge multilingual models. While a limited number of transliteration datasets and models are available for Indo-Aryan languages, they generally lack sufficient diversity in pronunciation and spelling variations, adequate code-mixed data for large language model (LLM) training, and low-resource adaptation. To address this research gap, we introduce a novel transliteration dataset for two popular Indo-Aryan languages, Hindi and Bengali, which are ranked as the 3rd and 7th most spoken languages worldwide. Our dataset comprises nearly 1.8 million Hindi and 1 million Bengali transliteration pairs. In addition to that, we pre-train a custom multilingual seq2seq LLM based on Marian architecture using the developed dataset. Experimental results demonstrate significant improvements compared to existing relevant models in terms of BLEU and CER metrics.

Title: Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data

Authors: Mahdieh Behjat Khatooni, Mohsen Soryani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22774
Pdf URL: https://arxiv.org/pdf/2511.22774
Copy Paste: [[2511.22774]] Alzheimer's Disease Prediction Using EffNetViTLoRA and BiLSTM with Multimodal Longitudinal MRI Data(https://arxiv.org/abs/2511.22774)
Keywords: transformer, generative
Abstract: Alzheimer's disease (AD) is a prevalent neurodegenerative disorder that progressively impairs memory, decision-making, and overall cognitive function. As AD is irreversible, early prediction is critical for timely intervention and management. Mild Cognitive Impairment (MCI), a transitional stage between cognitively normal (CN) aging and AD, plays a significant role in early AD diagnosis. However, predicting MCI progression remains a significant challenge, as not all individuals with MCI convert to AD. MCI subjects are categorized into stable MCI (sMCI) and progressive MCI (pMCI) based on conversion status. In this study, we propose a generalized, end-to-end deep learning model for AD prediction using MCI cases from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Our hybrid architecture integrates Convolutional Neural Networks and Vision Transformers to capture both local spatial features and global contextual dependencies from Magnetic Resonance Imaging (MRI) scans. To incorporate temporal progression, we further employ Bidirectional Long Short-Term Memory (BiLSTM) networks to process features extracted from four consecutive MRI timepoints along with some other non-image biomarkers, predicting each subject's cognitive status at month 48. Our multimodal model achieved an average progression prediction accuracy of 95.05\% between sMCI and pMCI, outperforming existing studies in AD prediction. This work demonstrates state-of-the-art performance in longitudinal AD prediction and highlights the effectiveness of combining spatial and temporal modeling for the early detection of Alzheimer's disease.

Title: World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

Authors: Eunsu Kim, Junyeong Park, Na Min An, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, Alice Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22787
Pdf URL: https://arxiv.org/pdf/2511.22787
Copy Paste: [[2511.22787]] World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models(https://arxiv.org/abs/2511.22787)
Keywords: robust, diffusion
Abstract: In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find supervised fine-tuning using a diverse culture mixing dataset substantially improve model consistency and reduce background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.

Title: PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration

Authors: Junfei Zhan, Haoxun Shen, Zheng Lin, Tengjiao He
Subjects: cs.CR, cs.CL
Abstract URL: https://arxiv.org/abs/2511.22788
Pdf URL: https://arxiv.org/pdf/2511.22788
Copy Paste: [[2511.22788]] PRISM: Privacy-Aware Routing for Adaptive Cloud-Edge LLM Inference via Semantic Sketch Collaboration(https://arxiv.org/abs/2511.22788)
Keywords: privacy, protect, large language model
Abstract: Large Language Models (LLMs) demonstrate impressive capabilities in natural language understanding and generation, but incur high communication overhead and privacy risks in cloud deployments, while facing compute and memory constraints when confined to edge devices. Cloud-edge inference has emerged as a promising paradigm for improving privacy in LLM services by retaining sensitive computations on local devices. However, existing cloud-edge inference approaches apply uniform privacy protection without considering input sensitivity, resulting in unnecessary perturbation and degraded utility even for non-sensitive tokens. To address this limitation, we propose Privacy-aware Routing for Inference with Semantic Modulation (PRISM), a context-aware framework that dynamically balances privacy and inference quality. PRISM executes in four stages: (1) the edge device profiles entity-level sensitivity; (2) a soft gating module on the edge selects an execution mode - cloud, edge, or collaboration; (3) for collaborative paths, the edge applies adaptive two-layer local differential privacy based on entity risks; and (4) the cloud LLM generates a semantic sketch from the perturbed prompt, which is then refined by the edge-side small language model (SLM) using local context. Our results show that PRISM consistently achieves superior privacy-utility trade-offs across various scenarios, reducing energy consumption and latency to 40-50% of baseline methods such as Uniform and Selective LDP, while maintaining high output quality under strong privacy constraints. These findings are validated through comprehensive evaluations involving realistic prompts, actual energy measurements, and heterogeneous cloud-edge model deployments.

Title: An Efficient Privacy-preserving Intrusion Detection Scheme for UAV Swarm Networks

Authors: Kanchon Gharami, Shafika Showkat Moni
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22791
Pdf URL: https://arxiv.org/pdf/2511.22791
Copy Paste: [[2511.22791]] An Efficient Privacy-preserving Intrusion Detection Scheme for UAV Swarm Networks(https://arxiv.org/abs/2511.22791)
Keywords: secure, security, privacy, defense, attack, federate
Abstract: The rapid proliferation of unmanned aerial vehicles (UAVs) and their applications in diverse domains, such as surveillance, disaster management, agriculture, and defense, have revolutionized modern technology. While the potential benefits of swarm-based UAV networks are growing significantly, they are vulnerable to various security attacks that can jeopardize the overall mission success by degrading their performance, disrupting decision-making, and compromising the trajectory planning process. The Intrusion Detection System (IDS) plays a vital role in identifying potential security attacks to ensure the secure operation of UAV swarm networks. However, conventional IDS primarily focuses on binary classification with resource-intensive neural networks and faces challenges, including latency, privacy breaches, increased performance overhead, and model drift. This research aims to address these challenges by developing a novel lightweight and federated continuous learning-based IDS scheme. Our proposed model facilitates decentralized training across diverse UAV swarms to ensure data heterogeneity and privacy. The performance evaluation of our model demonstrates significant improvements, with classification accuracies of 99.45% on UKM-IDS, 99.99% on UAV-IDS, 96.85% on TLM-UAV dataset, and 98.05% on Cyber-Physical datasets.

Title: GSpaRC: Gaussian Splatting for Real-time Reconstruction of RF Channels

Authors: Bhavya Sai Nukapotula, Rishabh Tripathi, Seth Pregler, Dileep Kalathil, Srinivas Shakkottai, Theodore S. Rappaport
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22793
Pdf URL: https://arxiv.org/pdf/2511.22793
Copy Paste: [[2511.22793]] GSpaRC: Gaussian Splatting for Real-time Reconstruction of RF Channels(https://arxiv.org/abs/2511.22793)
Keywords: robust
Abstract: Channel state information (CSI) is essential for adaptive beamforming and maintaining robust links in wireless communication systems. However, acquiring CSI incurs significant overhead, consuming up to 25\% of spectrum resources in 5G networks due to frequent pilot transmissions at sub-millisecond intervals. Recent approaches aim to reduce this burden by reconstructing CSI from spatiotemporal RF measurements, such as signal strength and direction-of-arrival. While effective in offline settings, these methods often suffer from inference latencies in the 5--100~ms range, making them impractical for real-time systems. We present GSpaRC: Gaussian Splatting for Real-time Reconstruction of RF Channels, the first algorithm to break the 1 ms latency barrier while maintaining high accuracy. GSpaRC represents the RF environment using a compact set of 3D Gaussian primitives, each parameterized by a lightweight neural model augmented with physics-informed features such as distance-based attenuation. Unlike traditional vision-based splatting pipelines, GSpaRC is tailored for RF reception: it employs an equirectangular projection onto a hemispherical surface centered at the receiver to reflect omnidirectional antenna behavior. A custom CUDA pipeline enables fully parallelized directional sorting, splatting, and rendering across frequency and spatial dimensions. Evaluated on multiple RF datasets, GSpaRC achieves similar CSI reconstruction fidelity to recent state-of-the-art methods while reducing training and inference time by over an order of magnitude. By trading modest GPU computation for a substantial reduction in pilot overhead, GSpaRC enables scalable, low-latency channel estimation suitable for deployment in 5G and future wireless systems. The code is available here: \href{this https URL}{GSpaRC}.

Title: From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

Authors: Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2511.22805
Pdf URL: https://arxiv.org/pdf/2511.22805
Copy Paste: [[2511.22805]] From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images(https://arxiv.org/abs/2511.22805)
Keywords: large language model
Abstract: While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image-identifying objects and describing scenes-they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but also transferable to downstream creative tasks. By integrating our cognitively-aligned MLLM into an image generation pipeline, we can guide the synthesis process to produce images that better embody desired traits, such as being more memorable or visually appealing. Our work provides a benchmark to measure this human-like perception, a post-training pipeline to enhance it, and a demonstration that this alignment unlocks more human-centric AI.

Title: LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer

Authors: Kai Wang, Siyi Chen, Weicong Pang, Chenchen Zhang, Renjun Gao, Ziru Chen, Cheng Li, Dasa Gu, Rui Huang, Alexis Kai Hon Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22812
Pdf URL: https://arxiv.org/pdf/2511.22812
Copy Paste: [[2511.22812]] LC4-DViT: Land-cover Creation for Land-cover Classification with Deformable Vision Transformer(https://arxiv.org/abs/2511.22812)
Keywords: diffusion, transformer, generative
Abstract: Land-cover underpins ecosystem services, hydrologic regulation, disaster-risk reduction, and evidence-based land planning; timely, accurate land-cover maps are therefore critical for environmental stewardship. Remote sensing-based land-cover classification offers a scalable route to such maps but is hindered by scarce and imbalanced annotations and by geometric distortions in high-resolution scenes. We propose LC4-DViT (Land-cover Creation for Land-cover Classification with Deformable Vision Transformer), a framework that combines generative data creation with a deformation-aware Vision Transformer. A text-guided diffusion pipeline uses GPT-4o-generated scene descriptions and super-resolved exemplars to synthesize class-balanced, high-fidelity training images, while DViT couples a DCNv4 deformable convolutional backbone with a Vision Transformer encoder to jointly capture fine-scale geometry and global context. On eight classes from the Aerial Image Dataset (AID)-Beach, Bridge, Desert, Forest, Mountain, Pond, Port, and River-DViT achieves 0.9572 overall accuracy, 0.9576 macro F1-score, and 0.9510 Cohen' s Kappa, improving over a vanilla ViT baseline (0.9274 OA, 0.9300 macro F1, 0.9169 Kappa) and outperforming ResNet50, MobileNetV2, and FlashInternImage. Cross-dataset experiments on a three-class SIRI-WHU subset (Harbor, Pond, River) yield 0.9333 overall accuracy, 0.9316 macro F1, and 0.8989 Kappa, indicating good transferability. An LLM-based judge using GPT-4o to score Grad-CAM heatmaps further shows that DViT' s attention aligns best with hydrologically meaningful structures. These results suggest that description-driven generative augmentation combined with deformation-aware transformers is a promising approach for high-resolution land-cover mapping.

Title: Intelligent Neural Networks: From Layered Architectures to Graph-Organized Intelligence

Authors: Antoine Salomon
Subjects: cs.LG, cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2511.22813
Pdf URL: https://arxiv.org/pdf/2511.22813
Copy Paste: [[2511.22813]] Intelligent Neural Networks: From Layered Architectures to Graph-Organized Intelligence(https://arxiv.org/abs/2511.22813)
Keywords: transformer
Abstract: Biological neurons exhibit remarkable intelligence: they maintain internal states, communicate selectively with other neurons, and self-organize into complex graphs rather than rigid hierarchical layers. What if artificial intelligence could emerge from similarly intelligent computational units? We introduce Intelligent Neural Networks (INN), a paradigm shift where neurons are first-class entities with internal memory and learned communication patterns, organized in complete graphs rather than sequential layers. Each Intelligent Neuron combines selective state-space dynamics (knowing when to activate) with attention-based routing (knowing to whom to send signals), enabling emergent computation through graph-structured interactions. On the standard Text8 character modeling benchmark, INN achieves 1.705 Bit-Per-Character (BPC), significantly outperforming a comparable Transformer (2.055 BPC) and matching a highly optimized LSTM baseline. Crucially, a parameter-matched baseline of stacked Mamba blocks fails to converge (>3.4 BPC) under the same training protocol, demonstrating that INN's graph topology provides essential training stability. Ablation studies confirm this: removing inter-neuron communication degrades performance or leads to instability, proving the value of learned neural routing. This work demonstrates that neuron-centric design with graph organization is not merely bio-inspired -- it is computationally effective, opening new directions for modular, interpretable, and scalable neural architectures.

Title: Mitigating Semantic Drift: Evaluating LLMs' Efficacy in Psychotherapy through MI Dialogue Summarization

Authors: Vivek Kumar, Pushpraj Singh Rajawat, Eirini Ntoutsi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22818
Pdf URL: https://arxiv.org/pdf/2511.22818
Copy Paste: [[2511.22818]] Mitigating Semantic Drift: Evaluating LLMs' Efficacy in Psychotherapy through MI Dialogue Summarization(https://arxiv.org/abs/2511.22818)
Keywords: large language model
Abstract: Recent advancements in large language models (LLMs) have shown their potential across both general and domain-specific tasks. However, there is a growing concern regarding their lack of sensitivity, factual incorrectness in responses, inconsistent expressions of empathy, bias, hallucinations, and overall inability to capture the depth and complexity of human understanding, especially in low-resource and sensitive domains such as psychology. To address these challenges, our study employs a mixed-methods approach to evaluate the efficacy of LLMs in psychotherapy. We use LLMs to generate precise summaries of motivational interviewing (MI) dialogues and design a two-stage annotation scheme based on key components of the Motivational Interviewing Treatment Integrity (MITI) framework, namely evocation, collaboration, autonomy, direction, empathy, and a non-judgmental attitude. Using expert-annotated MI dialogues as ground truth, we formulate multi-class classification tasks to assess model performance under progressive prompting techniques, incorporating one-shot and few-shot prompting. Our results offer insights into LLMs' capacity for understanding complex psychological constructs and highlight best practices to mitigate ``semantic drift" in therapeutic settings. Our work contributes not only to the MI community by providing a high-quality annotated dataset to address data scarcity in low-resource domains but also critical insights for using LLMs for precise contextual interpretation in complex behavioral therapy.

Title: A Unified and Stable Risk Minimization Framework for Weakly Supervised Learning with Theoretical Guarantees

Authors: Miao Zhang, Junpeng Li, Changchun Hua, Yana Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22823
Pdf URL: https://arxiv.org/pdf/2511.22823
Copy Paste: [[2511.22823]] A Unified and Stable Risk Minimization Framework for Weakly Supervised Learning with Theoretical Guarantees(https://arxiv.org/abs/2511.22823)
Keywords: robust
Abstract: Weakly supervised learning has emerged as a practical alternative to fully supervised learning when complete and accurate labels are costly or infeasible to acquire. However, many existing methods are tailored to specific supervision patterns -- such as positive-unlabeled (PU), unlabeled-unlabeled (UU), complementary-label (CLL), partial-label (PLL), or similarity-unlabeled annotations -- and rely on post-hoc corrections to mitigate instability induced by indirect supervision. We propose a principled, unified framework that bypasses such post-hoc adjustments by directly formulating a stable surrogate risk grounded in the structure of weakly supervised data. The formulation naturally subsumes diverse settings -- including PU, UU, CLL, PLL, multi-class unlabeled, and tuple-based learning -- under a single optimization objective. We further establish a non-asymptotic generalization bound via Rademacher complexity that clarifies how supervision structure, model capacity, and sample size jointly govern performance. Beyond this, we analyze the effect of class-prior misspecification on the bound, deriving explicit terms that quantify its impact, and we study identifiability, giving sufficient conditions -- most notably via supervision stratification across groups -- under which the target risk is recoverable. Extensive experiments show consistent gains across class priors, dataset scales, and class counts -- without heuristic stabilization -- while exhibiting robustness to overfitting.

Title: Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

Authors: Tianle Chen, Chaitanya Chakka, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22826
Pdf URL: https://arxiv.org/pdf/2511.22826
Copy Paste: [[2511.22826]] Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs(https://arxiv.org/abs/2511.22826)
Keywords: robust, interpretability, large language model
Abstract: Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that probe a model's reliance on specific modalities. Using black-box and white-box interpretability techniques, we provide a critical analysis of the brittleness of both open- and closed-sourced MLLMs. We show that current MLLMs struggle under misaligned audio-visual pairs and simple misleading text, thereby lacking robust multi-modal reasoning. Building on these findings, we propose a modality alignment tuning strategy to teach the model when to prioritize, leverage, or ignore specific modality cues. Through extensive experiments and analysis, we show that our alignment tuning yields demonstrably stronger multimodal grounding. This work provides both interpretability tools and a clear path toward developing MLLMs with intrinsically reliable cross-modal reasoning. Code and dataset will be publicly available.

Title: PerfMamba: Performance Analysis and Pruning of Selective State Space Models

Authors: Abdullah Al Asif, Mobina Kashaniyan, Sixing Yu, Juan Pablo Muñoz, Ali Jannesari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22849
Pdf URL: https://arxiv.org/pdf/2511.22849
Copy Paste: [[2511.22849]] PerfMamba: Performance Analysis and Pruning of Selective State Space Models(https://arxiv.org/abs/2511.22849)
Keywords: transformer
Abstract: Recent advances in sequence modeling have introduced selective SSMs as promising alternatives to Transformer architectures, offering theoretical computational efficiency and sequence processing advantages. A comprehensive understanding of selective SSMs in runtime behavior, resource utilization patterns, and scaling characteristics still remains unexplored, thus obstructing their optimal deployment and further architectural improvements. This paper presents a thorough empirical study of Mamba-1 and Mamba-2, systematically profiled for performance to assess the design principles that contribute to their efficiency in state-space modeling. A detailed analysis of computation patterns, memory access, I/O characteristics, and scaling properties was performed for sequence lengths ranging from 64 to 16384 tokens. Our findings show that the SSM component, a central part of the selective SSM architecture, demands a significant portion of computational resources compared to other components in the Mamba block. Based on these insights, we propose a pruning technique that selectively removes low-activity states within the SSM component, achieving measurable throughput and memory gains while maintaining accuracy within a moderate pruning regime. This approach results in performance improvements across varying sequence lengths, achieving a 1.14x speedup and reducing memory usage by 11.50\%. These results offer valuable guidance for designing more efficient SSM architectures that can be applied to a wide range of real-world applications.

Title: TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE

Authors: Jiawen Wei, Lan Jiang, Pengbo Wei, Ziwen Ye, Teng Song, Chen Chen, Guangrui Ma
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22853
Pdf URL: https://arxiv.org/pdf/2511.22853
Copy Paste: [[2511.22853]] TARFVAE: Efficient One-Step Generative Time Series Forecasting via TARFLOW based VAE(https://arxiv.org/abs/2511.22853)
Keywords: transformer, generative
Abstract: Time series data is ubiquitous, with forecasting applications spanning from finance to healthcare. Beyond popular deterministic methods, generative models are gaining attention due to advancements in areas like image synthesis and video generation, as well as their inherent ability to provide probabilistic predictions. However, existing generative approaches mostly involve recurrent generative operations or repeated denoising steps, making the prediction laborious, particularly for long-term forecasting. Most of them only conduct experiments for relatively short-term forecasting, with limited comparison to deterministic methods in long-term forecasting, leaving their practical advantages unclear. This paper presents TARFVAE, a novel generative framework that combines the Transformer-based autoregressive flow (TARFLOW) and variational autoencoder (VAE) for efficient one-step generative time series forecasting. Inspired by the rethinking that complex architectures for extracting time series representations might not be necessary, we add a flow module, TARFLOW, to VAE to promote spontaneous learning of latent variables that benefit predictions. TARFLOW enhances VAE's posterior estimation by breaking the Gaussian assumption, thereby enabling a more informative latent space. TARFVAE uses only the forward process of TARFLOW, avoiding autoregressive inverse operations and thus ensuring fast generation. During generation, it samples from the prior latent space and directly generates full-horizon forecasts via the VAE decoder. With simple MLP modules, TARFVAE achieves superior performance over state-of-the-art deterministic and generative models across different forecast horizons on benchmark datasets while maintaining efficient prediction speed, demonstrating its effectiveness as an efficient and powerful solution for generative time series forecasting.

Title: CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate

Authors: Finn G. Vamosi, Nils D. Forkert
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2511.22854
Pdf URL: https://arxiv.org/pdf/2511.22854
Copy Paste: [[2511.22854]] CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate(https://arxiv.org/abs/2511.22854)
Keywords: large language model
Abstract: When people reason about cause and effect, they often consider many competing "what if" scenarios before deciding which explanation fits best. Analogously, advanced language models capable of causal inference can consider multiple interventions and counterfactuals to judge the validity of causal claims. Crucially, this type of reasoning is less like a single calculation and more like an internal dialogue between alternative hypotheses. In this paper, we make this dialogue explicit through a dual-agent debate framework where one model provides a structured causal inference, and the other critically examines this reasoning for logical flaws. When disagreements arise, agents attempt to persuade each other, challenging each other's logic and revising their conclusions until they converge on a mutually agreed answer. To take advantage of this deliberative process, we specifically use reasoning language models, whose strengths in both causal inference and adversarial debate remain under-explored relative to standard large language models. We evaluate our approach on the CLadder dataset, a benchmark linking natural language questions to formally defined causal graphs across all three rungs of Pearl's ladder of causation. With Qwen3 and DeepSeek-R1 as debater agents, we demonstrate that multi-agent debate improves DeepSeek-R1's overall accuracy in causal inference from 78.03% to 87.45%, with the counterfactual category specifically improving from 67.94% to 80.04% accuracy. Similarly, Qwen3's overall accuracy improves from 84.16% to 89.41%, and counterfactual questions from 71.53% to 80.35%, showing that strong models can still benefit greatly from debate with weaker agents. Our results highlight the potential of reasoning models as building blocks for multi-agent systems in causal inference, and demonstrate the importance of diverse perspectives in causal problem-solving.

Title: CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation

Authors: Fengyi Fang, Sicheng Yang, Wenming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22863
Pdf URL: https://arxiv.org/pdf/2511.22863
Copy Paste: [[2511.22863]] CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation(https://arxiv.org/abs/2511.22863)
Keywords: diffusion
Abstract: Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the first exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speeches and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.

Title: JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge

Authors: Zhihan Cao, Fumihito Nishino, Hiroaki Yamada, Nguyen Ha Thanh, Yusuke Miyao, Ken Satoh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22869
Pdf URL: https://arxiv.org/pdf/2511.22869
Copy Paste: [[2511.22869]] JBE-QA: Japanese Bar Exam QA Dataset for Assessing Legal Domain Knowledge(https://arxiv.org/abs/2511.22869)
Keywords: large language model
Abstract: We introduce JBE-QA, a Japanese Bar Exam Question-Answering dataset to evaluate large language models' legal knowledge. Derived from the multiple-choice (tanto-shiki) section of the Japanese bar exam (2015-2024), JBE-QA provides the first comprehensive benchmark for Japanese legal-domain evaluation of LLMs. It covers the Civil Code, the Penal Code, and the Constitution, extending beyond the Civil Code focus of prior Japanese resources. Each question is decomposed into independent true/false judgments with structured contextual fields. The dataset contains 3,464 items with balanced labels. We evaluate 26 LLMs, including proprietary, open-weight, Japanese-specialised, and reasoning models. Our results show that proprietary models with reasoning enabled perform best, and the Constitution questions are generally easier than the Civil Code or the Penal Code questions.

Title: Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis

Authors: Jungwoo Seo, David Keetae Park, Shinjae Yoo, Jiook Cha
Subjects: cs.CV, q-bio.NC
Abstract URL: https://arxiv.org/abs/2511.22870
Pdf URL: https://arxiv.org/pdf/2511.22870
Copy Paste: [[2511.22870]] Scalable Diffusion Transformer for Conditional 4D fMRI Synthesis(https://arxiv.org/abs/2511.22870)
Keywords: diffusion, transformer
Abstract: Generating whole-brain 4D fMRI sequences conditioned on cognitive tasks remains challenging due to the high-dimensional, heterogeneous BOLD dynamics across subjects/acquisitions and the lack of neuroscience-grounded validation. We introduce the first diffusion transformer for voxelwise 4D fMRI conditional generation, combining 3D VQ-GAN latent compression with a CNN-Transformer backbone and strong task conditioning via AdaLN-Zero and cross-attention. On HCP task fMRI, our model reproduces task-evoked activation maps, preserves the inter-task representational structure observed in real data (RSA), achieves perfect condition specificity, and aligns ROI time-courses with canonical hemodynamic responses. Performance improves predictably with scale, reaching task-evoked map correlation of 0.83 and RSA of 0.98, consistently surpassing a U-Net baseline on all metrics. By coupling latent diffusion with a scalable backbone and strong conditioning, this work establishes a practical path to conditional 4D fMRI synthesis, paving the way for future applications such as virtual experiments, cross-site harmonization, and principled augmentation for downstream neuroimaging models.

Title: FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing

Authors: Jingheng Ye, Shen Wang, Jiaqi Chen, Hebin Wang, Deqing Zou, Yanyu Zhu, Jiwei Tang, Hai-Tao Zheng, Ruitong Liu, Haoyang Li, Yanfeng Wang, Qingsong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22883
Pdf URL: https://arxiv.org/pdf/2511.22883
Copy Paste: [[2511.22883]] FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing(https://arxiv.org/abs/2511.22883)
Keywords: large language model
Abstract: Large Language Models (LLMs) have transformed artificial intelligence, offering profound opportunities for educational applications. However, their ability to provide fine-grained educational feedback for K-12 English writing remains underexplored. In this paper, we challenge the error analysis and pedagogical skills of LLMs by introducing the problem of Fine-grained Error Analysis for English Learners and present the Fine-grained Error ANalysis for English Learners (FEANEL) Benchmark. The benchmark comprises 1,000 essays written by elementary and secondary school students, and a well-developed English writing error taxonomy. Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed. We evaluate state-of-the-art LLMs on the FEANEL Benchmark to explore their error analysis and pedagogical abilities. Experimental results reveal significant gaps in current LLMs' ability to perform fine-grained error analysis, highlighting the need for advancements in particular methods for educational applications.

Title: Modeling Chaotic Pedestrian Behavior Using Chaos Indicators and Supervised Learning

Authors: Md. Muhtashim Shahrier, Nazmul Haque, Md Asif Raihan, Md. Hadiuzzaman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22887
Pdf URL: https://arxiv.org/pdf/2511.22887
Copy Paste: [[2511.22887]] Modeling Chaotic Pedestrian Behavior Using Chaos Indicators and Supervised Learning(https://arxiv.org/abs/2511.22887)
Keywords: robust
Abstract: As cities around the world aim to improve walkability and safety, understanding the irregular and unpredictable nature of pedestrian behavior has become increasingly important. This study introduces a data-driven framework for modeling chaotic pedestrian movement using empirically observed trajectory data and supervised learning. Videos were recorded during both daytime and nighttime conditions to capture pedestrian dynamics under varying ambient and traffic contexts. Pedestrian trajectories were extracted through computer vision techniques, and behavioral chaos was quantified using four chaos metrics: Approximate Entropy and Lyapunov Exponent, each computed for both velocity and direction change. A Principal Component Analysis (PCA) was then applied to consolidate these indicators into a unified chaos score. A comprehensive set of individual, group-level, and contextual traffic features was engineered and used to train Random Forest and CatBoost regression models. CatBoost models consistently achieved superior performance. The best daytime PCA-based CatBoost model reached an R^2 of 0.8319, while the nighttime PCA-based CatBoost model attained an R^2 of 0.8574. SHAP analysis highlighted that features such as distance travel, movement duration, and speed variability were robust contributors to chaotic behavior. The proposed framework enables practitioners to quantify and anticipate behavioral instability in real-world settings. Planners and engineers can use chaos scores to identify high-risk pedestrian zones, apprise infrastructure improvements, and calibrate realistic microsimulation models. The approach also supports adaptive risk assessment in automated vehicle systems by capturing short-term motion unpredictability grounded in observable, interpretable features.

Title: Adversarial Training for Process Reward Models

Authors: Gurusha Juneja, Deepak Nathani, William Yang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22888
Pdf URL: https://arxiv.org/pdf/2511.22888
Copy Paste: [[2511.22888]] Adversarial Training for Process Reward Models(https://arxiv.org/abs/2511.22888)
Keywords: robust
Abstract: Process Reward Models (PRMs) enhance reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited due to expensive manual step-level annotation and poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (\texttt{APRM}), where a Generator ($G$) learns to produce reasoning errors to deceive a PRM ($R$), while $R$ concurrently learns to detect them. This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, \texttt{APRM} improves solver accuracy by $+3.4$ percentage points (pp) over the strongest PRM baseline. \texttt{APRM} achieves gains of $+5.3$ pp on out-of-distribution tasks.

Title: ClearGCD: Mitigating Shortcut Learning For Robust Generalized Category Discovery

Authors: Kailin Lyu, Jianwei He, Long Xiao, Jianing Zeng, Liang Fan, Lin Shu, Jie Hao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22892
Pdf URL: https://arxiv.org/pdf/2511.22892
Copy Paste: [[2511.22892]] ClearGCD: Mitigating Shortcut Learning For Robust Generalized Category Discovery(https://arxiv.org/abs/2511.22892)
Keywords: robust
Abstract: In open-world scenarios, Generalized Category Discovery (GCD) requires identifying both known and novel categories within unlabeled data. However, existing methods often suffer from prototype confusion caused by shortcut learning, which undermines generalization and leads to forgetting of known classes. We propose ClearGCD, a framework designed to mitigate reliance on non-semantic cues through two complementary mechanisms. First, Semantic View Alignment (SVA) generates strong augmentations via cross-class patch replacement and enforces semantic consistency using weak augmentations. Second, Shortcut Suppression Regularization (SSR) maintains an adaptive prototype bank that aligns known classes while encouraging separation of potential novel ones. ClearGCD can be seamlessly integrated into parametric GCD approaches and consistently outperforms state-of-the-art methods across multiple benchmarks.

Title: DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking

Authors: Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Qiannan Guo, Zhenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22896
Pdf URL: https://arxiv.org/pdf/2511.22896
Copy Paste: [[2511.22896]] DM$^3$T: Harmonizing Modalities via Diffusion for Multi-Object Tracking(https://arxiv.org/abs/2511.22896)
Keywords: robust, diffusion
Abstract: Multi-object tracking (MOT) is a fundamental task in computer vision with critical applications in autonomous driving and robotics. Multimodal MOT that integrates visible light and thermal infrared information is particularly essential for robust autonomous driving systems. However, effectively fusing these heterogeneous modalities is challenging. Simple strategies like concatenation or addition often fail to bridge the significant non-linear distribution gap between their feature representations, which can lead to modality conflicts and degrade tracking accuracy. Drawing inspiration from the connection between multimodal MOT and the iterative refinement in diffusion models, this paper proposes DM$^3$T, a novel framework that reformulates multimodal fusion as an iterative feature alignment process to generate accurate and temporally coherent object trajectories. Our approach performs iterative cross-modal harmonization through a proposed Cross-Modal Diffusion Fusion (C-MDF) module. In this process, features from both modalities provide mutual guidance, iteratively projecting them onto a shared, consistent feature manifold. This enables the learning of complementary information and achieves deeper fusion compared to conventional methods. Additionally, we introduce a plug-and-play Diffusion Refiner (DR) to enhance and refine the unified feature representation. To further improve tracking robustness, we design a Hierarchical Tracker that adaptively handles confidence estimation. DM$^3$T unifies object detection, state estimation, and data association into a comprehensive online tracking framework without complex post-processing. Extensive experiments on the VT-MOT benchmark demonstrate that our method achieves 41.7 HOTA, representing a 1.54% relative improvement over existing state-of-the-art methods. The code and models are available at this https URL.

Title: From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts

Authors: Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Xin Liu, Zhenbo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22897
Pdf URL: https://arxiv.org/pdf/2511.22897
Copy Paste: [[2511.22897]] From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts(https://arxiv.org/abs/2511.22897)
Keywords: robust, diffusion
Abstract: Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at this https URL.

Title: Leveraging Textual Compositional Reasoning for Robust Change Captioning

Authors: Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22903
Pdf URL: https://arxiv.org/pdf/2511.22903
Copy Paste: [[2511.22903]] Leveraging Textual Compositional Reasoning for Robust Change Captioning(https://arxiv.org/abs/2511.22903)
Keywords: robust, extraction
Abstract: Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that use VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.

Title: See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection

Authors: YuEun Lee, Jung Uk Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22906
Pdf URL: https://arxiv.org/pdf/2511.22906
Copy Paste: [[2511.22906]] See, Rank, and Filter: Important Word-Aware Clip Filtering via Scene Understanding for Moment Retrieval and Highlight Detection(https://arxiv.org/abs/2511.22906)
Keywords: large language model
Abstract: Video moment retrieval (MR) and highlight detection (HD) with natural language queries aim to localize relevant moments and key highlights in a video clips. However, existing methods overlook the importance of individual words, treating the entire text query and video clips as a black-box, which hinders contextual understanding. In this paper, we propose a novel approach that enables fine-grained clip filtering by identifying and prioritizing important words in the query. Our method integrates image-text scene understanding through Multimodal Large Language Models (MLLMs) and enhances the semantic understanding of video clips. We introduce a feature enhancement module (FEM) to capture important words from the query and a ranking-based filtering module (RFM) to iteratively refine video clips based on their relevance to these important words. Extensive experiments demonstrate that our approach significantly outperforms existing state-of-the-art methods, achieving superior performance in both MR and HD tasks. Our code is available at: this https URL.

Title: ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance

Authors: Congjia Chen, Shen Yan, Yufu Qu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22908
Pdf URL: https://arxiv.org/pdf/2511.22908
Copy Paste: [[2511.22908]] ViGG: Robust RGB-D Point Cloud Registration using Visual-Geometric Mutual Guidance(https://arxiv.org/abs/2511.22908)
Keywords: robust, extraction
Abstract: Point cloud registration is a fundamental task in 3D vision. Most existing methods only use geometric information for registration. Recently proposed RGB-D registration methods primarily focus on feature fusion or improving feature learning, which limits their ability to exploit image information and hinders their practical applicability. In this paper, we propose ViGG, a robust RGB-D registration method using mutual guidance. First, we solve clique alignment in a visual-geometric combination form, employing a geometric guidance design to suppress ambiguous cliques. Second, to mitigate accuracy degradation caused by noise in visual matches, we propose a visual-guided geometric matching method that utilizes visual priors to determine the search space, enabling the extraction of high-quality, noise-insensitive correspondences. This mutual guidance strategy brings our method superior robustness, making it applicable for various RGB-D registration tasks. The experiments on 3DMatch, ScanNet and KITTI datasets show that our method outperforms recent state-of-the-art methods in both learning-free and learning-based settings. Code is available at this https URL.

Title: EnECG: Efficient Ensemble Learning for Electrocardiogram Multi-task Foundation Model

Authors: Yuhao Xu, Xiaoda Wang, Jiaying Lu, Sirui Ding, Defu Cao, Huaxiu Yao, Yan Liu, Xiao Hu, Carl Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22935
Pdf URL: https://arxiv.org/pdf/2511.22935
Copy Paste: [[2511.22935]] EnECG: Efficient Ensemble Learning for Electrocardiogram Multi-task Foundation Model(https://arxiv.org/abs/2511.22935)
Keywords: extraction
Abstract: Electrocardiogram (ECG) analysis plays a vital role in the early detection, monitoring, and management of various cardiovascular conditions. While existing models have achieved notable success in ECG interpretation, they fail to leverage the interrelated nature of various cardiac abnormalities. Conversely, developing a specific model capable of extracting all relevant features for multiple ECG tasks remains a significant challenge. Large-scale foundation models, though powerful, are not typically pretrained on ECG data, making full re-training or fine-tuning computationally expensive. To address these challenges, we propose EnECG(Mixture of Experts-based Ensemble Learning for ECG Multi-tasks), an ensemble-based framework that integrates multiple specialized foundation models, each excelling in different aspects of ECG interpretation. Instead of relying on a single model or single task, EnECG leverages the strengths of multiple specialized models to tackle a variety of ECG-based tasks. To mitigate the high computational cost of full re-training or fine-tuning, we introduce a lightweight adaptation strategy: attaching dedicated output layers to each foundation model and applying Low-Rank Adaptation (LoRA) only to these newly added parameters. We then adopt a Mixture of Experts (MoE) mechanism to learn ensemble weights, effectively combining the complementary expertise of individual models. Our experimental results demonstrate that by minimizing the scope of fine-tuning, EnECG can help reduce computational and memory costs while maintaining the strong representational power of foundation models. This framework not only enhances feature extraction and predictive performance but also ensures practical efficiency for real-world clinical applications. The code is available at this https URL.

Title: Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling

Authors: Minyoung Kim, Paul Hongsuck Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22936
Pdf URL: https://arxiv.org/pdf/2511.22936
Copy Paste: [[2511.22936]] Robust Image Self-Recovery against Tampering using Watermark Generation with Pixel Shuffling(https://arxiv.org/abs/2511.22936)
Keywords: attack, robust, watermark
Abstract: The rapid growth of Artificial Intelligence-Generated Content (AIGC) raises concerns about the authenticity of digital media. In this context, image self-recovery, reconstructing original content from its manipulated version, offers a practical solution for understanding the attacker's intent and restoring trustworthy data. However, existing methods often fail to accurately recover tampered regions, falling short of the primary goal of self-recovery. To address this challenge, we propose ReImage, a neural watermarking-based self-recovery framework that embeds a shuffled version of the target image into itself as a watermark. We design a generator that produces watermarks optimized for neural watermarking and introduce an image enhancement module to refine the recovered image. We further analyze and resolve key limitations of shuffled watermarking, enabling its effective use in self-recovery. We demonstrate that ReImage achieves state-of-the-art performance across diverse tampering scenarios, consistently producing high-quality recovered images. The code and pretrained models will be released upon publication.

Title: One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfe

Authors: Shijun Shi, Jing Xu, Zhihang Li, Chunli Peng, Xiaoda Yang, Lijing Lu, Kai Hu, Jiangning Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22940
Pdf URL: https://arxiv.org/pdf/2511.22940
Copy Paste: [[2511.22940]] One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfe(https://arxiv.org/abs/2511.22940)
Keywords: robust, extraction, diffusion
Abstract: Recent advances in diffusion models have greatly improved pose-driven character animation. However, existing methods are limited to spatially aligned reference-pose pairs with matched skeletal structures. Handling reference-pose misalignment remains unsolved. To address this, we present One-to-All Animation, a unified framework for high-fidelity character animation and image pose transfer for references with arbitrary layouts. First, to handle spatially misaligned reference, we reformulate training as a self-supervised outpainting task that transforms diverse-layout reference into a unified occluded-input format. Second, to process partially visible reference, we design a reference extractor for comprehensive identity feature extraction. Further, we integrate hybrid reference fusion attention to handle varying resolutions and dynamic sequence lengths. Finally, from the perspective of generation quality, we introduce identity-robust pose control that decouples appearance from skeletal structure to mitigate pose overfitting, and a token replace strategy for coherent long-video generation. Extensive experiments show that our method outperforms existing approaches. The code and model will be available at this https URL.

Title: Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework

Authors: Kelaiti Xiao, Liang Yang, Dongyu Zhang, Paerhati Tulajiang, Hongfei Lin
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.22943
Pdf URL: https://arxiv.org/pdf/2511.22943
Copy Paste: [[2511.22943]] Visual Puns from Idioms: An Iterative LLM-T2IM-MLLM Framework(https://arxiv.org/abs/2511.22943)
Keywords: large language model
Abstract: We study idiom-based visual puns--images that align an idiom's literal and figurative meanings--and present an iterative framework that coordinates a large language model (LLM), a text-to-image model (T2IM), and a multimodal LLM (MLLM) for automatic generation and evaluation. Given an idiom, the system iteratively (i) generates detailed visual prompts, (ii) synthesizes an image, (iii) infers the idiom from the image, and (iv) refines the prompt until recognition succeeds or a step limit is reached. Using 1,000 idioms as inputs, we synthesize a corresponding dataset of visual pun images with paired prompts, enabling benchmarking of both generation and understanding. Experiments across 10 LLMs, 10 MLLMs, and one T2IM (Qwen-Image) show that MLLM choice is the primary performance driver: GPT achieves the highest accuracies, Gemini follows, and the best open-source MLLM (Gemma) is competitive with some closed models. On the LLM side, Claude attains the strongest average performance for prompt generation.

Title: Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation

Authors: Taeyeong Kim, SeungJoon Lee, Jung Uk Kim, MyeongAh Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22948
Pdf URL: https://arxiv.org/pdf/2511.22948
Copy Paste: [[2511.22948]] Do We Need Perfect Data? Leveraging Noise for Domain Generalized Segmentation(https://arxiv.org/abs/2511.22948)
Keywords: robust, diffusion, segmentation
Abstract: Domain generalization in semantic segmentation faces challenges from domain shifts, particularly under adverse conditions. While diffusion-based data generation methods show promise, they introduce inherent misalignment between generated images and semantic masks. This paper presents FLEX-Seg (FLexible Edge eXploitation for Segmentation), a framework that transforms this limitation into an opportunity for robust learning. FLEX-Seg comprises three key components: (1) Granular Adaptive Prototypes that captures boundary characteristics across multiple scales, (2) Uncertainty Boundary Emphasis that dynamically adjusts learning emphasis based on prediction entropy, and (3) Hardness-Aware Sampling that progressively focuses on challenging examples. By leveraging inherent misalignment rather than enforcing strict alignment, FLEX-Seg learns robust representations while capturing rich stylistic variations. Experiments across five real-world datasets demonstrate consistent improvements over state-of-the-art methods, achieving 2.44% and 2.63% mIoU gains on ACDC and Dark Zurich. Our findings validate that adaptive strategies for handling imperfect synthetic data lead to superior domain generalization. Code is available at this https URL.

Title: RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

Authors: Haiyang Mei, Qiming Huang, Hai Ci, Mike Zheng Shou
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22950
Pdf URL: https://arxiv.org/pdf/2511.22950
Copy Paste: [[2511.22950]] RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video(https://arxiv.org/abs/2511.22950)
Keywords: segmentation
Abstract: Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, surprisingly it remains challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.

Title: Experts are all you need: A Composable Framework for Large Language Model Inference

Authors: Shrihari Sridharan, Sourjya Roy, Anand Raghunathan, Kaushik Roy
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.22955
Pdf URL: https://arxiv.org/pdf/2511.22955
Copy Paste: [[2511.22955]] Experts are all you need: A Composable Framework for Large Language Model Inference(https://arxiv.org/abs/2511.22955)
Keywords: large language model
Abstract: Large Language Models (LLMs) have achieved state-of-the-art accuracies in a variety of natural language processing (NLP) tasks. However, this success comes at the cost of increased model sizes which leads to additional computational burden. Mixture of Experts (MoEs) overcome this bottleneck by decoupling model capacity from computation by only activating a subset of parameters or "experts". However, these models require joint pretraining of these experts along with the router and do not model multi-step reasoning. In contrast, multi-agent frameworks improve reasoning by decomposing complex problems into modular subtasks. However, these frameworks rely on sequential "plan--act--observe" loops, which introduce significant latency. Our work, Comp-LLM, addresses these challenges by introducing a composable inference framework that enables cross-expert collaboration via an explicit sub-query dependency graph. Comp-LLM consists of three components: (1) A Sub-query Generator that decomposes an input query, assigns each sub-query to an appropriate expert using embedding similarity, and constructs a dependency graph; (2) A Query Executor that processes nodes in the graph and identifies opportunities for parallelism based on dependencies and resource constraints; and (3) A Response Aggregator that synthesizes intermediate expert responses into a coherent final answer. Across several benchmarks, Comp-LLM achieves up to 11.01% accuracy improvement over monolithic LLMs of similar size, while offering 1.67x--3.56x reduction in model size with no significant degradation relative to the largest model in its family. Additionally, Comp-LLM provides 1.1x--1.7x latency improvement compared to sequential sub-query processing.

Title: Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records

Authors: Shiyu Shen, Zhe Gao, Taifeng Chai, Yang Huang, Bin Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22958
Pdf URL: https://arxiv.org/pdf/2511.22958
Copy Paste: [[2511.22958]] Contrastive Heliophysical Image Pretraining for Solar Dynamics Observatory Records(https://arxiv.org/abs/2511.22958)
Keywords: transformer
Abstract: Deep learning has revolutionized solar image analysis, yet most approaches train task-specific encoders from scratch or rely on natural-image pretraining that ignores the unique characteristics of Solar Dynamics Observatory (SDO) data. We introduce SolarCHIP, a family of contrastively pretrained visual backbones tailored to multi-instrument SDO observations. SolarCHIP addresses three key challenges in solar imaging: multimodal sensing across AIA and HMI instruments, weak inter-class separability due to slow temporal evolution, and strong intra-class variability with sparse activity signals. Our pretraining framework employs a multi-granularity contrastive objective that jointly aligns (1) global class tokens across co-temporal AIA-HMI pairs to enhance temporal discrimination, (2) local patch tokens at fixed spatial indices to enforce position-consistent, modality-invariant features, and (3) intra-sample patches across different spatial locations to preserve fine-grained spatial structure. We train both CNN- and Vision Transformer-based autoencoders and demonstrate their effectiveness on two downstream tasks: cross-modal translation between HMI and AIA passbands via ControlNet, and full-disk flare classification. Experimental results show that SolarCHIP achieves state-of-the-art performance across both tasks, with particularly strong gains in low-resource settings where labeled data is limited. Ablation studies confirm that each contrastive component contributes essential discriminative capacity at different granularities. By publicly releasing pretrained weights and training code, we provide the heliophysics community with a practical, plug-and-play feature extractor that reduces computational requirements, improves label efficiency, and establishes a reusable foundation for diverse solar imaging applications.

Title: A Trainable Centrality Framework for Modern Data

Authors: Minh Duc Vu, Mingshuo Liu, Doudou Zhou
Subjects: cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2511.22959
Pdf URL: https://arxiv.org/pdf/2511.22959
Copy Paste: [[2511.22959]] A Trainable Centrality Framework for Modern Data(https://arxiv.org/abs/2511.22959)
Keywords: robust
Abstract: Measuring how central or typical a data point is underpins robust estimation, ranking, and outlier detection, but classical depth notions become expensive and unstable in high dimensions and are hard to extend beyond Euclidean data. We introduce Fused Unified centrality Score Estimation (FUSE), a neural centrality framework that operates on top of arbitrary representations. FUSE combines a global head, trained from pairwise distance-based comparisons to learn an anchor-free centrality score, with a local head, trained by denoising score matching to approximate a smoothed log-density potential. A single parameter between 0 and 1 interpolates between these calibrated signals, yielding depth-like centrality from different views via one forward pass. Across synthetic distributions, real images, time series, and text data, and standard outlier detection benchmarks, FUSE recovers meaningful classical ordering, reveals multi-scale geometric structures, and attains competitive performance with strong classical baselines while remaining simple and efficient.

Title: Taming the Light: Illumination-Invariant Semantic 3DGS-SLAM

Authors: Shouhe Zhang, Dayong Ren, Sensen Song, Yurong Qian, Zhenhong Jia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22968
Pdf URL: https://arxiv.org/pdf/2511.22968
Copy Paste: [[2511.22968]] Taming the Light: Illumination-Invariant Semantic 3DGS-SLAM(https://arxiv.org/abs/2511.22968)
Keywords: robust, segmentation
Abstract: Extreme exposure degrades both the 3D map reconstruction and semantic segmentation accuracy, which is particularly detrimental to tightly-coupled systems. To achieve illumination invariance, we propose a novel semantic SLAM framework with two designs. First, the Intrinsic Appearance Normalization (IAN) module proactively disentangles the scene's intrinsic properties, such as albedo, from transient lighting. By learning a standardized, illumination-invariant appearance model, it assigns a stable and consistent color representation to each Gaussian primitive. Second, the Dynamic Radiance Balancing Loss (DRB-Loss) reactively handles frames with extreme exposure. It activates only when an image's exposure is poor, operating directly on the radiance field to guide targeted optimization. This prevents error accumulation from extreme lighting without compromising performance under normal conditions. The synergy between IAN's proactive invariance and DRB-Loss's reactive correction endows our system with unprecedented robustness. Evaluations on public datasets demonstrate state-of-the-art performance in camera tracking, map quality, and semantic and geometric accuracy.

Title: Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match

Authors: Jinze Li, Yixing Xu, Guanchen Li, Shuo Yang, Jinfeng Xu, Xuanwu Yin, Dong Li, Edith C.H.Ngai, Emad Barsoum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.22972
Pdf URL: https://arxiv.org/pdf/2511.22972
Copy Paste: [[2511.22972]] Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match(https://arxiv.org/abs/2511.22972)
Keywords: large language model
Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model's self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.

Title: BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Authors: Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22973
Pdf URL: https://arxiv.org/pdf/2511.22973
Copy Paste: [[2511.22973]] BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation(https://arxiv.org/abs/2511.22973)
Keywords: diffusion
Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it yet faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over the state of the art approaches. Project website: this https URL. Inferix (Code): this https URL.

Title: McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning

Authors: Qiushi Yang, Yingjie Chen, Yuan Yao, Yifang Men, Huaizhuo Liu, Miaomiao Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22974
Pdf URL: https://arxiv.org/pdf/2511.22974
Copy Paste: [[2511.22974]] McSc: Motion-Corrective Preference Alignment for Video Generation with Self-Critic Hierarchical Reasoning(https://arxiv.org/abs/2511.22974)
Keywords: robust, generative
Abstract: Text-to-video (T2V) generation has achieved remarkable progress in producing high-quality videos aligned with textual prompts. However, aligning synthesized videos with nuanced human preference remains challenging due to the subjective and multifaceted nature of human judgment. Existing video preference alignment methods rely on costly human annotations or utilize proxy metrics to predict preference, which lacks the understanding of human preference logic. Moreover, they usually directly align T2V models with the overall preference distribution, ignoring potential conflict dimensions like motion dynamics and visual quality, which may bias models towards low-motion content. To address these issues, we present Motion-corrective alignment with Self-critic hierarchical Reasoning (McSc), a three-stage reinforcement learning framework for robust preference modeling and alignment. Firstly, Self-critic Dimensional Reasoning (ScDR) trains a generative reward model (RM) to decompose preferences into per-dimension assessments, using self-critic reasoning chains for reliable learning. Secondly, to achieve holistic video comparison, we introduce Hierarchical Comparative Reasoning (HCR) for structural multi-dimensional reasoning with hierarchical reward supervision. Finally, using RM-preferred videos, we propose Motion-corrective Direct Preference Optimization (McDPO) to optimize T2V models, while dynamically re-weighting alignment objective to mitigate bias towards low-motion content. Experiments show that McSc achieves superior performance in human preference alignment and generates videos with high-motion dynamic.

Title: Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification

Authors: Sumit Mamtani, Abhijeet Bhure
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.22977
Pdf URL: https://arxiv.org/pdf/2511.22977
Copy Paste: [[2511.22977]] Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification(https://arxiv.org/abs/2511.22977)
Keywords: robust, transformer
Abstract: This paper investigates fake news detection as a downstream evaluation of Transformer representations, benchmarking encoder-only and decoder-only pre-trained models (BERT, GPT-2, Transformer-XL) as frozen embedders paired with lightweight classifiers. Through controlled preprocessing comparing pooling versus padding and neural versus linear heads, results demonstrate that contextual self-attention encodings consistently transfer effectively. BERT embeddings combined with logistic regression outperform neural baselines on LIAR dataset splits, while analyses of sequence length and aggregation reveal robustness to truncation and advantages from simple max or average pooling. This work positions attention-based token encoders as robust, architecture-centric foundations for veracity tasks, isolating Transformer contributions from classifier complexity.

Title: Ovis-Image Technical Report

Authors: Guo-Hua Wang, Liangfu Cao, Tianyu Cui, Minghao Fu, Xiaohao Chen, Pengxin Zhan, Jianshan Zhao, Lan Li, Bowen Fu, Jiaqi Liu, Qing-Guo Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.22982
Pdf URL: https://arxiv.org/pdf/2511.22982
Copy Paste: [[2511.22982]] Ovis-Image Technical Report(https://arxiv.org/abs/2511.22982)
Keywords: diffusion
Abstract: We introduce $\textbf{Ovis-Image}$, a 7B text-to-image model specifically optimized for high-quality text rendering, designed to operate efficiently under stringent computational constraints. Built upon our previous Ovis-U1 framework, Ovis-Image integrates a diffusion-based visual decoder with the stronger Ovis 2.5 multimodal backbone, leveraging a text-centric training pipeline that combines large-scale pre-training with carefully tailored post-training refinements. Despite its compact architecture, Ovis-Image achieves text rendering performance on par with significantly larger open models such as Qwen-Image and approaches closed-source systems like Seedream and GPT4o. Crucially, the model remains deployable on a single high-end GPU with moderate memory, narrowing the gap between frontier-level text rendering and practical deployment. Our results indicate that combining a strong multimodal backbone with a carefully designed, text-focused training recipe is sufficient to achieve reliable bilingual text rendering without resorting to oversized or proprietary models.

Title: Convolutional Feature Noise Reduction for 2D Cardiac MR Image Segmentation

Authors: Hong Zheng, Nan Mu, Han Su, Lin Feng, Xiaoning Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22983
Pdf URL: https://arxiv.org/pdf/2511.22983
Copy Paste: [[2511.22983]] Convolutional Feature Noise Reduction for 2D Cardiac MR Image Segmentation(https://arxiv.org/abs/2511.22983)
Keywords: segmentation
Abstract: Noise reduction constitutes a crucial operation within Digital Signal Processing. Regrettably, it frequently remains neglected when dealing with the processing of convolutional features in segmentation networks. This oversight could trigger the butterfly effect, impairing the subsequent outcomes within the entire feature system. To complete this void, we consider convolutional features following Gaussian distributions as feature signal matrices and then present a simple and effective feature filter in this study. The proposed filter is fundamentally a low-amplitude pass filter primarily aimed at minimizing noise in feature signal inputs and is named Convolutional Feature Filter (CFF). We conducted experiments on two established 2D segmentation networks and two public cardiac MR image datasets to validate the effectiveness of the CFF, and the experimental findings demonstrated a decrease in noise within the feature signal matrices. To enable a numerical observation and analysis of this reduction, we developed a binarization equation to calculate the information entropy of feature signals.

Title: MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Authors: Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22989
Pdf URL: https://arxiv.org/pdf/2511.22989
Copy Paste: [[2511.22989]] MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation(https://arxiv.org/abs/2511.22989)
Keywords: fair
Abstract: Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; the ability to inherit the appearance of subjects from multiple reference images and re-render them under new contexts. However, the existing benchmark datasets often focus on the generation with single or a few reference images, which prevents us from measuring the progress on how model performance advances or pointing out their weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce $\textbf{MultiBanana}$, which is carefully designed to assesses the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis among a variety of text-to-image models reveals their superior performances, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at this https URL .

Title: Guiding Visual Autoregressive Models through Spectrum Weakening

Authors: Chaoyang Wang, Tianmeng Yang, Jingdong Wang, Yunhai Tong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.22991
Pdf URL: https://arxiv.org/pdf/2511.22991
Copy Paste: [[2511.22991]] Guiding Visual Autoregressive Models through Spectrum Weakening(https://arxiv.org/abs/2511.22991)
Keywords: diffusion
Abstract: Classifier-free guidance (CFG) has become a widely adopted and practical approach for enhancing generation quality and improving condition alignment. Recent studies have explored guidance mechanisms for unconditional generation, yet these approaches remain fundamentally tied to assumptions specific to diffusion models. In this work, we propose a spectrum-weakening framework for visual autoregressive (AR) models. This method works without the need for re-training, specific conditions, or any architectural modifications. It achieves this by constructing a controllable weak model in the spectral domain. We theoretically show that invertible spectral transformations preserve information, while selectively retaining only a subset of spectrum introduces controlled information reduction. Based on this insight, we perform spectrum selection along the channel dimension of internal representations, which avoids the structural constraints imposed by diffusion models. We further introduce two spectrum renormalization strategies that ensures numerical stability during the weakening process. Extensive experiments were conducted on both discrete and continuous AR models, with text or class conditioning. The results demonstrate that our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.

Title: Optimizer Sensitivity In Vision Transformerbased Iris Recognition: Adamw Vs Sgd Vs Rmsprop

Authors: Moh Imam Faiz, Aviv Yuniar Rahman, Rangga Pahlevi Putra
Subjects: cs.CV, stat.CO
Abstract URL: https://arxiv.org/abs/2511.22994
Pdf URL: https://arxiv.org/pdf/2511.22994
Copy Paste: [[2511.22994]] Optimizer Sensitivity In Vision Transformerbased Iris Recognition: Adamw Vs Sgd Vs Rmsprop(https://arxiv.org/abs/2511.22994)
Keywords: security, robust, biometric, transformer
Abstract: The security of biometric authentication is increasingly critical as digital identity systems expand. Iris recognition offers high reliability due to its distinctive and stable texture patterns. Recent progress in deep learning, especially Vision Transformers ViT, has improved visual recognition performance. Yet, the effect of optimizer choice on ViT-based biometric systems remains understudied. This work evaluates how different optimizers influence the accuracy and stability of ViT for iris recognition, providing insights to enhance the robustness of biometric identification models.

Title: MrGS: Multi-modal Radiance Fields with 3D Gaussian Splatting for RGB-Thermal Novel View Synthesis

Authors: Minseong Kweon, Janghyun Kim, Ukcheol Shin, Jinsun Park
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.22997
Pdf URL: https://arxiv.org/pdf/2511.22997
Copy Paste: [[2511.22997]] MrGS: Multi-modal Radiance Fields with 3D Gaussian Splatting for RGB-Thermal Novel View Synthesis(https://arxiv.org/abs/2511.22997)
Keywords: extraction
Abstract: Recent advances in Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) have achieved considerable performance in RGB scene reconstruction. However, multi-modal rendering that incorporates thermal infrared imagery remains largely underexplored. Existing approaches tend to neglect distinctive thermal characteristics, such as heat conduction and the Lambertian property. In this study, we introduce MrGS, a multi-modal radiance field based on 3DGS that simultaneously reconstructs both RGB and thermal 3D scenes. Specifically, MrGS derives RGB- and thermal-related information from a single appearance feature through orthogonal feature extraction and employs view-dependent or view-independent embedding strategies depending on the degree of Lambertian reflectance exhibited by each modality. Furthermore, we leverage two physics-based principles to effectively model thermal-domain phenomena. First, we integrate Fourier's law of heat conduction prior to alpha blending to model intensity interpolation caused by thermal conduction between neighboring Gaussians. Second, we apply the Stefan-Boltzmann law and the inverse-square law to formulate a depth-aware thermal radiation map that imposes additional geometric constraints on thermal rendering. Experimental results demonstrate that the proposed MrGS achieves high-fidelity RGB-T scene reconstruction while reducing the number of Gaussians.

Title: A Modular Framework for Rapidly Building Intrusion Predictors

Authors: Xiaoxuan Wang, Rolf Stadler
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23000
Pdf URL: https://arxiv.org/pdf/2511.23000
Copy Paste: [[2511.23000]] A Modular Framework for Rapidly Building Intrusion Predictors(https://arxiv.org/abs/2511.23000)
Keywords: attack
Abstract: We study automated intrusion prediction in an IT system using statistical learning methods. The focus is on developing online attack predictors that detect attacks in real time and identify the current stage of the attack. While such predictors have been proposed in the recent literature, these works typically rely on constructing a monolithic predictor tailored to a specific attack type and scenario. Given that hundreds of attack types are cataloged in the MITRE framework, training a separate monolithic predictor for each of them is infeasible. In this paper, we propose a modular framework for rapidly assembling online attack predictors from reusable components. The modular nature of a predictor facilitates controlling key metrics like timeliness and accuracy of prediction, as well as tuning the trade-off between them. Using public datasets for training and evaluation, we provide many examples of modular predictors and show how an effective predictor can be dynamically assembled during training from a network of modular components.

Title: Masked Diffusion for Generative Recommendation

Authors: Kulin Shah, Bhuvesh Kumar, Neil Shah, Liam Collins
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2511.23021
Pdf URL: https://arxiv.org/pdf/2511.23021
Copy Paste: [[2511.23021]] Masked Diffusion for Generative Recommendation(https://arxiv.org/abs/2511.23021)
Keywords: diffusion, generative
Abstract: Generative recommendation (GR) with semantic IDs (SIDs) has emerged as a promising alternative to traditional recommendation approaches due to its performance gains, capitalization on semantic information provided through language model embeddings, and inference and storage efficiency. Existing GR with SIDs works frame the probability of a sequence of SIDs corresponding to a user's interaction history using autoregressive modeling. While this has led to impressive next item prediction performances in certain settings, these autoregressive GR with SIDs models suffer from expensive inference due to sequential token-wise decoding, potentially inefficient use of training data and bias towards learning short-context relationships among tokens. Inspired by recent breakthroughs in NLP, we propose to instead model and learn the probability of a user's sequence of SIDs using masked diffusion. Masked diffusion employs discrete masking noise to facilitate learning the sequence distribution, and models the probability of masked tokens as conditionally independent given the unmasked tokens, allowing for parallel decoding of the masked tokens. We demonstrate through thorough experiments that our proposed method consistently outperforms autoregressive modeling. This performance gap is especially pronounced in data-constrained settings and in terms of coarse-grained recall, consistent with our intuitions. Moreover, our approach allows the flexibility of predicting multiple SIDs in parallel during inference while maintaining superior performance to autoregressive modeling.

Title: A Game-Theoretic Approach for Adversarial Information Fusion in Distributed Sensor Networks

Authors: Kassem Kallas
Subjects: cs.CR, cs.GT, cs.MA
Abstract URL: https://arxiv.org/abs/2511.23026
Pdf URL: https://arxiv.org/pdf/2511.23026
Copy Paste: [[2511.23026]] A Game-Theoretic Approach for Adversarial Information Fusion in Distributed Sensor Networks(https://arxiv.org/abs/2511.23026)
Keywords: security, protect, defense, attack, biometric, watermark
Abstract: Every day we share our personal information through digital systems which are constantly exposed to threats. For this reason, security-oriented disciplines of signal processing have received increasing attention in the last decades: multimedia forensics, digital watermarking, biometrics, network monitoring, steganography and steganalysis are just a few examples. Even though each of these fields has its own peculiarities, they all have to deal with a common problem: the presence of one or more adversaries aiming at making the system fail. Adversarial Signal Processing lays the basis of a general theory that takes into account the impact that the presence of an adversary has on the design of effective signal processing tools. By focusing on the application side of Adversarial Signal Processing, namely adversarial information fusion in distributed sensor networks, and adopting a game-theoretic approach, this thesis contributes to the above mission by addressing four issues. First, we address decision fusion in distributed sensor networks by developing a novel soft isolation defense scheme that protect the network from adversaries, specifically, Byzantines. Second, we develop an optimum decision fusion strategy in the presence of Byzantines. In the next step, we propose a technique to reduce the complexity of the optimum fusion by relying on a novel near-optimum message passing algorithm based on factor graphs. Finally, we introduce a defense mechanism to protect decentralized networks running consensus algorithm against data falsification attacks.

Title: Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring

Authors: Changhun Kim, Yechan Mun, Hyeongwon Jang, Eunseo Lee, Sangchul Hahn, Eunho Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23036
Pdf URL: https://arxiv.org/pdf/2511.23036
Copy Paste: [[2511.23036]] Delta-XAI: A Unified Framework for Explaining Prediction Changes in Online Time Series Monitoring(https://arxiv.org/abs/2511.23036)
Keywords: explainability
Abstract: Explaining online time series monitoring models is crucial across sensitive domains such as healthcare and finance, where temporal and contextual prediction dynamics underpin critical decisions. While recent XAI methods have improved the explainability of time series models, they mostly analyze each time step independently, overlooking temporal dependencies. This results in further challenges: explaining prediction changes is non-trivial, methods fail to leverage online dynamics, and evaluation remains difficult. To address these challenges, we propose Delta-XAI, which adapts 14 existing XAI methods through a wrapper function and introduces a principled evaluation suite for the online setting, assessing diverse aspects, such as faithfulness, sufficiency, and coherence. Experiments reveal that classical gradient-based methods, such as Integrated Gradients (IG), can outperform recent approaches when adapted for temporal analysis. Building on this, we propose Shifted Window Integrated Gradients (SWING), which incorporates past observations in the integration path to systematically capture temporal dependencies and mitigate out-of-distribution effects. Extensive experiments consistently demonstrate the effectiveness of SWING across diverse settings with respect to diverse metrics. Our code is publicly available at this https URL.

Title: Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses

Authors: Dong Nguyen, Laura Rosseel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23041
Pdf URL: https://arxiv.org/pdf/2511.23041
Copy Paste: [[2511.23041]] Social Perceptions of English Spelling Variation on Twitter: A Comparative Analysis of Human and LLM Responses(https://arxiv.org/abs/2511.23041)
Keywords: large language model
Abstract: Spelling variation (e.g. funnnn vs. fun) can influence the social perception of texts and their writers: we often have various associations with different forms of writing (is the text informal? does the writer seem young?). In this study, we focus on the social perception of spelling variation in online writing in English and study to what extent this perception is aligned between humans and large language models (LLMs). Building on sociolinguistic methodology, we compare LLM and human ratings on three key social attributes of spelling variation (formality, carefulness, age). We find generally strong correlations in the ratings between humans and LLMs. However, notable differences emerge when we analyze the distribution of ratings and when comparing between different types of spelling variation.

Title: GOATex: Geometry & Occlusion-Aware Texturing

Authors: Hyunjin Kim, Kunho Kim, Adam Lee, Wonkwang Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23051
Pdf URL: https://arxiv.org/pdf/2511.23051
Copy Paste: [[2511.23051]] GOATex: Geometry & Occlusion-Aware Texturing(https://arxiv.org/abs/2511.23051)
Keywords: diffusion
Abstract: We present GOATex, a diffusion-based method for 3D mesh texturing that generates high-quality textures for both exterior and interior surfaces. While existing methods perform well on visible regions, they inherently lack mechanisms to handle occluded interiors, resulting in incomplete textures and visible seams. To address this, we introduce an occlusion-aware texturing framework based on the concept of hit levels, which quantify the relative depth of mesh faces via multi-view ray casting. This allows us to partition mesh faces into ordered visibility layers, from outermost to innermost. We then apply a two-stage visibility control strategy that progressively reveals interior regions with structural coherence, followed by texturing each layer using a pretrained diffusion model. To seamlessly merge textures obtained across layers, we propose a soft UV-space blending technique that weighs each texture's contribution based on view-dependent visibility confidence. Empirical results demonstrate that GOATex consistently outperforms existing methods, producing seamless, high-fidelity textures across both visible and occluded surfaces. Unlike prior works, GOATex operates entirely without costly fine-tuning of a pretrained diffusion model and allows separate prompting for exterior and interior mesh regions, enabling fine-grained control over layered appearances. For more qualitative results, please visit our project page: this https URL.

Title: Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts

Authors: Paulo J. N. Pinto, Armando J. Pinho, Diogo Pratas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23056
Pdf URL: https://arxiv.org/pdf/2511.23056
Copy Paste: [[2511.23056]] Decoding the Past: Explainable Machine Learning Models for Dating Historical Texts(https://arxiv.org/abs/2511.23056)
Keywords: interpretability, explainability
Abstract: Accurately dating historical texts is essential for organizing and interpreting cultural heritage collections. This article addresses temporal text classification using interpretable, feature-engineered tree-based machine learning models. We integrate five feature categories - compression-based, lexical structure, readability, neologism detection, and distance features - to predict the temporal origin of English texts spanning five centuries. Comparative analysis shows that these feature domains provide complementary temporal signals, with combined models outperforming any individual feature set. On a large-scale corpus, we achieve 76.7% accuracy for century-scale prediction and 26.1% for decade-scale classification, substantially above random baselines (20% and 2.3%). Under relaxed temporal precision, performance increases to 96.0% top-2 accuracy for centuries and 85.8% top-10 accuracy for decades. The final model exhibits strong ranking capabilities with AUCROC up to 94.8% and AUPRC up to 83.3%, and maintains controlled errors with mean absolute deviations of 27 years and 30 years, respectively. For authentication-style tasks, binary models around key thresholds (e.g., 1850-1900) reach 85-98% accuracy. Feature importance analysis identifies distance features and lexical structure as most informative, with compression-based features providing complementary signals. SHAP explainability reveals systematic linguistic evolution patterns, with the 19th century emerging as a pivot point across feature domains. Cross-dataset evaluation on Project Gutenberg highlights domain adaptation challenges, with accuracy dropping by 26.4 percentage points, yet the computational efficiency and interpretability of tree-based models still offer a scalable, explainable alternative to neural architectures.

Title: Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation

Authors: Felipe Akio Matsuoka, Eduardo Moreno J. M. Farina, Augusto Sarquis Serpa, Soraya Monteiro, Rodrigo Ragazzini, Nitamar Abdala, Marcelo Straus Takahashi, Felipe Campos Kitamura
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23066
Pdf URL: https://arxiv.org/pdf/2511.23066
Copy Paste: [[2511.23066]] Evaluating the Clinical Impact of Generative Inpainting on Bone Age Estimation(https://arxiv.org/abs/2511.23066)
Keywords: generative
Abstract: Generative foundation models can remove visual artifacts through realistic image inpainting, but their impact on medical AI performance remains uncertain. Pediatric hand radiographs often contain non-anatomical markers, and it is unclear whether inpainting these regions preserves features needed for bone age and gender prediction. To evaluate the clinical reliability of generative model-based inpainting for artifact removal, we used the RSNA Bone Age Challenge dataset, selecting 200 original radiographs and generating 600 inpainted versions with gpt-image-1 using natural language prompts to target non-anatomical artifacts. Downstream performance was assessed with deep learning ensembles for bone age estimation and gender classification, using mean absolute error (MAE) and area under the ROC curve (AUC) as metrics, and pixel intensity distributions to detect structural alterations. Inpainting markedly degraded model performance: bone age MAE increased from 6.26 to 30.11 months, and gender classification AUC decreased from 0.955 to 0.704. Inpainted images displayed pixel-intensity shifts and inconsistencies, indicating structural modifications not corrected by simple calibration. These findings show that, although visually realistic, foundation model-based inpainting can obscure subtle but clinically relevant features and introduce latent bias even when edits are confined to non-diagnostic regions, underscoring the need for rigorous, task-specific validation before integrating such generative tools into clinical AI workflows.

Title: Buffer replay enhances the robustness of multimodal learning under missing-modality

Authors: Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.23070
Pdf URL: https://arxiv.org/pdf/2511.23070
Copy Paste: [[2511.23070]] Buffer replay enhances the robustness of multimodal learning under missing-modality(https://arxiv.org/abs/2511.23070)
Keywords: robust
Abstract: Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.

Title: SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models

Authors: Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23075
Pdf URL: https://arxiv.org/pdf/2511.23075
Copy Paste: [[2511.23075]] SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models(https://arxiv.org/abs/2511.23075)
Keywords: large language model
Abstract: Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.

Title: Accent Placement Models for Rigvedic Sanskrit Text

Authors: Akhil Rajeev P, Annarao Kulkarni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23088
Pdf URL: https://arxiv.org/pdf/2511.23088
Copy Paste: [[2511.23088]] Accent Placement Models for Rigvedic Sanskrit Text(https://arxiv.org/abs/2511.23088)
Keywords: transformer
Abstract: The Rigveda, among the oldest Indian texts in Vedic Sanskrit, employs a distinctive pitch-accent system : udātta, anudātta, svarita whose marks encode melodic and interpretive cues but are often absent from modern e-texts. This work develops a parallel corpus of accented-unaccented ślokas and conducts a controlled comparison of three strategies for automatic accent placement in Rigvedic verse: (i) full fine-tuning of ByT5, a byte-level Transformer that operates directly on Unicode combining marks, (ii) a from-scratch BiLSTM-CRF sequence-labeling baseline, and (iii) LoRA-based parameter-efficient fine-tuning atop ByT5. Evaluation uses Word Error Rate (WER) and Character Error Rate (CER) for orthographic fidelity, plus a task-specific Diacritic Error Rate (DER) that isolates accent edits. Full ByT5 fine-tuning attains the lowest error across all metrics; LoRA offers strong efficiency-accuracy trade-offs, and BiLSTM-CRF serves as a transparent baseline. The study underscores practical requirements for accent restoration - Unicode-safe preprocessing, mark-aware tokenization, and evaluation that separates grapheme from accent errors - and positions heritage-language technology as an emerging NLP area connecting computational modeling with philological and pedagogical aims. Results establish reproducible baselines for Rigvedic accent restoration and provide guidance for downstream tasks such as accent-aware OCR, ASR/chant synthesis, and digital scholarship.

Title: Mind Reading or Misreading? LLMs on the Big Five Personality Test

Authors: Francesco Di Cursi, Chiara Boldrini, Marco Conti, Andrea Passarella
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23101
Pdf URL: https://arxiv.org/pdf/2511.23101
Copy Paste: [[2511.23101]] Mind Reading or Misreading? LLMs on the Big Five Personality Test(https://arxiv.org/abs/2511.23101)
Keywords: large language model
Abstract: We evaluate large language models (LLMs) for automatic personality prediction from text under the binary Five Factor Model (BIG5). Five models -- including GPT-4 and lightweight open-source alternatives -- are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in zero-shot binary settings. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant asymmetries, with per-class recall offering clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.

Title: NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing

Authors: Zhenyu Xu, Xiaoqi Shen, Haotian Nan, Xinyu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23105
Pdf URL: https://arxiv.org/pdf/2511.23105
Copy Paste: [[2511.23105]] NumeriKontrol: Adding Numeric Control to Diffusion Transformers for Instruction-based Image Editing(https://arxiv.org/abs/2511.23105)
Keywords: diffusion, transformer
Abstract: Instruction-based image editing enables intuitive manipulation through natural language commands. However, text instructions alone often lack the precision required for fine-grained control over edit intensity. We introduce NumeriKontrol, a framework that allows users to precisely adjust image attributes using continuous scalar values with common units. NumeriKontrol encodes numeric editing scales via an effective Numeric Adapter and injects them into diffusion models in a plug-and-play manner. Thanks to a task-separated design, our approach supports zero-shot multi-condition editing, allowing users to specify multiple instructions in any order. To provide high-quality supervision, we synthesize precise training data from reliable sources, including high-fidelity rendering engines and DSLR cameras. Our Common Attribute Transform (CAT) dataset covers diverse attribute manipulations with accurate ground-truth scales, enabling NumeriKontrol to function as a simple yet powerful interactive editing studio. Extensive experiments show that NumeriKontrol delivers accurate, continuous, and stable scale control across a wide range of attribute editing scenarios. These contributions advance instruction-based image editing by enabling precise, scalable, and user-controllable image manipulation.

Title: db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

Authors: Siqi Chen, Ke Hong, Tianchen Zhao, Ruiqi Xie, Zhenhua Zhu, Xudong Zhang, Yu Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.23113
Pdf URL: https://arxiv.org/pdf/2511.23113
Copy Paste: [[2511.23113]] db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism(https://arxiv.org/abs/2511.23113)
Keywords: diffusion, transformer, generative
Abstract: Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25x and an attention-specific speedup of 1.40x over state-of-the-art sequence parallel methods on average. Code is available at this https URL.

Title: Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM

Authors: Mengjie Liu, Jiahui Peng, Pei Chu, Jiantao Qiu, Ren Ma, He Zhu, Rui Min, Lindong Lu, Wenchang Ning, Linfeng Hou, Kaiwen Liu, Yuan Qu, Zhenxiang Li, Chao Xu, Zhongying Tu, Wentao Zhang, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23119
Pdf URL: https://arxiv.org/pdf/2511.23119
Copy Paste: [[2511.23119]] Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM(https://arxiv.org/abs/2511.23119)
Keywords: extraction, generative
Abstract: Accurately and efficiently extracting main content from general web pages is of great significance for obtaining training data for large models. Using well-pre-trained decoder-only generative language models offers excellent document comprehension capabilities, thereby effectively enhancing parsing quality. However, it remains constrained by issues such as context window length, inference cost, and format hallucination. We present Dripper, an efficient HTML main content extraction framework powered by lightweight language models, which addresses these challenges through four key innovations: (1) We design a specialized HTML simplification algorithm that reduces input token count to 22\% compared to raw HTML while preserving critical structural information; (2) We reformulate main content extraction as a semantic block sequence classification task, significantly reducing inference cost; (3) We introduce a controlled decoding mechanism that strictly constrains the output space through logits processors, effectively eliminating hallucination issues common in small-scale models; (4) We propose WebMainBench, an evaluation dataset containing over 7,800 web pages with meticulously human-annotated main content extraction labels. Experimental results demonstrate that using only a 0.6B parameter model, Dripper achieves state-of-the-art performance across all evaluation benchmarks and outperforms all baseline methods, attaining an ROUGE-N F1 score of 81.58\%( 83.13\% with fall-back strategy) on our proposed WebMainBench dataset.

Title: Freeze, Diffuse, Decode: Geometry-Aware Adaptation of Pretrained Transformer Embeddings for Antimicrobial Peptide Design

Authors: Pankhil Gawade, Adam Izdebski, Myriam Lizotte, Kevin R. Moon, Jake S. Rhodes, Guy Wolf, Ewa Szczurek
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23120
Pdf URL: https://arxiv.org/pdf/2511.23120
Copy Paste: [[2511.23120]] Freeze, Diffuse, Decode: Geometry-Aware Adaptation of Pretrained Transformer Embeddings for Antimicrobial Peptide Design(https://arxiv.org/abs/2511.23120)
Keywords: diffusion, transformer
Abstract: Pretrained transformers provide rich, general-purpose embeddings, which are transferred to downstream tasks. However, current transfer strategies: fine-tuning and probing, either distort the pretrained geometric structure of the embeddings or lack sufficient expressivity to capture task-relevant signals. These issues become even more pronounced when supervised data are scarce. Here, we introduce Freeze, Diffuse, Decode (FDD), a novel diffusion-based framework that adapts pre-trained embeddings to downstream tasks while preserving their underlying geometric structure. FDD propagates supervised signal along the intrinsic manifold of frozen embeddings, enabling a geometry-aware adaptation of the embedding space. Applied to antimicrobial peptide design, FDD yields low-dimensional, predictive, and interpretable representations that support property prediction, retrieval, and latent-space interpolation.

Title: DNA-Prior: Unsupervised Denoise Anything via Dual-Domain Prior

Authors: Yanqi Cheng, Chun-Wun Cheng, Jim Denholm, Thiago Lima, Javier A. Montoya-Zegarra, Richard Goodwin, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23124
Pdf URL: https://arxiv.org/pdf/2511.23124
Copy Paste: [[2511.23124]] DNA-Prior: Unsupervised Denoise Anything via Dual-Domain Prior(https://arxiv.org/abs/2511.23124)
Keywords: robust, segmentation
Abstract: Medical imaging pipelines critically rely on robust denoising to stabilise downstream tasks such as segmentation and reconstruction. However, many existing denoisers depend on large annotated datasets or supervised learning, which restricts their usability in clinical environments with heterogeneous modalities and limited ground-truth data. To address this limitation, we introduce DNA-Prior, a universal unsupervised denoising framework that reconstructs clean images directly from corrupted observations through a mathematically principled hybrid prior. DNA-Prior integrates (i) an implicit architectural prior, enforced through a deep network parameterisation, with (ii) an explicit spectral-spatial prior composed of a frequency-domain fidelity term and a spatial regularisation functional. This dual-domain formulation yields a well-structured optimisation problem that jointly preserves global frequency characteristics and local anatomical structure, without requiring any external training data or modality-specific tuning. Experiments across multiple modalities show that DNA achieves consistent noise suppression and structural preservation under diverse noise conditions.

Title: DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation

Authors: Hongfei Zhang, Kanghao Chen, Zixin Zhang, Harold Haodong Chen, Yuanhuiyi Lyu, Yuqi Zhang, Shuai Yang, Kun Zhou, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23127
Pdf URL: https://arxiv.org/pdf/2511.23127
Copy Paste: [[2511.23127]] DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation(https://arxiv.org/abs/2511.23127)
Keywords: diffusion
Abstract: This paper presents DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. Recent works have advanced this field by representing camera poses as ray-based conditions, yet they often lack sufficient scene understanding and geometric awareness. DualCamCtrl specifically targets this limitation by introducing a dual-branch framework that mutually generates camera-consistent RGB and depth sequences. To harmonize these two modalities, we further propose the Semantic Guided Mutual Alignment (SIGMA) mechanism, which performs RGB-depth fusion in a semantics-guided and mutually reinforced manner. These designs collectively enable DualCamCtrl to better disentangle appearance and geometry modeling, generating videos that more faithfully adhere to the specified camera trajectories. Additionally, we analyze and reveal the distinct influence of depth and camera poses across denoising stages and further demonstrate that early and late stages play complementary roles in forming global structure and refining local details. Extensive experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation, with over 40\% reduction in camera motion errors compared with prior methods. Our project page: this https URL\-page/

Title: Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models

Authors: Yujiao Yang, Jing Lian, Linhui Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23136
Pdf URL: https://arxiv.org/pdf/2511.23136
Copy Paste: [[2511.23136]] Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models(https://arxiv.org/abs/2511.23136)
Keywords: large language model
Abstract: The complex reasoning ability of Large Language Models (LLMs) poses a critical bottleneck for their practical applications. Test-time expansion methods such as Tree-of-Thought (ToT) and Graph-of-Thought (GoT) enhance reasoning by introducing intermediate reasoning structures, tree search, or graph-based exploration mechanisms. However, their reasoning strategies suffer from limited diversity, redundant search branches, and inadequate integration and error correction across heterogeneous reasoning paths. To address these limitations, we propose a novel reasoning framework called Multi-chain Graph Refinement & Selection (MGRS), which first generates multiple diverse reasoning trajectories for a given problem, refines candidate responses using a composite self- and cross-verification strategy, then constructs a reasoning relation graph and estimates the success rate of intermediate nodes, and finally computes cumulative success rates to select the most reliable answer and corresponding reasoning trajectory. Experimental results demonstrate that MGRS significantly advances both the reasoning capability and computational efficiency of reasoning enhancement methods. Across six benchmark datasets spanning four distinct tasks, MGRS achieves an average accuracy of 82.9%, outperforming state-of-the-art baselines by a clear margin of 2.1%. Remarkably, on the 24-point game, MGRS attains 100% accuracy for the first time, while delivering a 13.6x speed-up compared to the leading Forest of Thoughts framework.

Title: InstanceV: Instance-Level Video Generation

Authors: Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi, Lizhuang Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23146
Pdf URL: https://arxiv.org/pdf/2511.23146
Copy Paste: [[2511.23146]] InstanceV: Instance-Level Video Generation(https://arxiv.org/abs/2511.23146)
Keywords: diffusion
Abstract: Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, We introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for more comprehensive evaluation on instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.

Title: Cascaded Robust Rectification for Arbitrary Document Images

Authors: Chaoyun Wang, Quanxin Huang, I-Chao Shen, Takeo Igarashi, Nanning Zheng, Caigui Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23150
Pdf URL: https://arxiv.org/pdf/2511.23150
Copy Paste: [[2511.23150]] Cascaded Robust Rectification for Arbitrary Document Images(https://arxiv.org/abs/2511.23150)
Keywords: robust
Abstract: Document rectification in real-world scenarios poses significant challenges due to extreme variations in camera perspectives and physical distortions. Driven by the insight that complex transformations can be decomposed and resolved progressively, we introduce a novel multi-stage framework that progressively reverses distinct distortion types in a coarse-to-fine manner. Specifically, our framework first performs a global affine transformation to correct perspective distortions arising from the camera's viewpoint, then rectifies geometric deformations resulting from physical paper curling and folding, and finally employs a content-aware iterative process to eliminate fine-grained content distortions. To address limitations in existing evaluation protocols, we also propose two enhanced metrics: layout-aligned OCR metrics (AED/ACER) for a stable assessment that decouples geometric rectification quality from the layout analysis errors of OCR engines, and masked AD/AAD (AD-M/AAD-M) tailored for accurately evaluating geometric distortions in documents with incomplete boundaries. Extensive experiments show that our method establishes new state-of-the-art performance on multiple challenging benchmarks, yielding a substantial reduction of 14.1\%--34.7\% in the AAD metric and demonstrating superior efficacy in real-world applications. The code will be publicly available at this https URL.

Title: REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection

Authors: Huangsen Cao, Qin Mei, Zhiheng Li, Yuxi Li, Ying Zhang, Chen Li, Zhimeng Zhang, Xin Ding, Yongwei Wang, Jing Lyu, Fei Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23158
Pdf URL: https://arxiv.org/pdf/2511.23158
Copy Paste: [[2511.23158]] REVEAL: Reasoning-enhanced Forensic Evidence Analysis for Explainable AI-generated Image Detection(https://arxiv.org/abs/2511.23158)
Keywords: robust, generative
Abstract: With the rapid advancement of generative models, visually realistic AI-generated images have become increasingly difficult to distinguish from authentic ones, posing severe threats to social trust and information integrity. Consequently, there is an urgent need for efficient and truly explainable image forensic methods. Recent detection paradigms have shifted towards explainable forensics. However, state-of-the-art approaches primarily rely on post-hoc rationalizations or visual discrimination, lacking a verifiable chain of evidence. This reliance on surface-level pattern matching limits the generation of causally grounded explanations and often results in poor generalization. To bridge this critical gap, we introduce \textbf{REVEAL-Bench}, the first reasoning-enhanced multimodal benchmark for AI-generated image detection that is explicitly structured around a chain-of-evidence derived from multiple lightweight expert models, then records step-by-step reasoning traces and evidential justifications. Building upon this dataset, we propose \textbf{REVEAL} (\underline{R}easoning-\underline{e}nhanced Forensic E\underline{v}id\underline{e}nce \underline{A}na\underline{l}ysis), an effective and explainable forensic framework that integrates detection with a novel expert-grounded reinforcement learning. Our reward mechanism is specially tailored to jointly optimize detection accuracy, explanation fidelity, and logical coherence grounded in explicit forensic evidence, enabling REVEAL to produce fine-grained, interpretable, and verifiable reasoning chains alongside its detection outcomes. Extensive experimental results demonstrate that REVEAL significantly enhances detection accuracy, explanation fidelity, and robust cross-model generalization, benchmarking a new state of the art for explainable image forensics.

Title: Estimating the Event-Related Potential from Few EEG Trials

Authors: Anders Vestergaard Nørskov, Kasper Jørgensen, Alexander Neergaard Zahid, Morten Mørup
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23162
Pdf URL: https://arxiv.org/pdf/2511.23162
Copy Paste: [[2511.23162]] Estimating the Event-Related Potential from Few EEG Trials(https://arxiv.org/abs/2511.23162)
Keywords: robust
Abstract: Event-related potentials (ERP) are measurements of brain activity with wide applications in basic and clinical neuroscience, that are typically estimated using the average of many trials of electroencephalography signals (EEG) to sufficiently reduce noise and signal variability. We introduce EEG2ERP, a novel uncertainty-aware autoencoder approach that maps an arbitrary number of EEG trials to their associated ERP. To account for the ERP uncertainty we use bootstrapped training targets and introduce a separate variance decoder to model the uncertainty of the estimated ERP. We evaluate our approach in the challenging zero-shot scenario of generalizing to new subjects considering three different publicly available data sources; i) the comprehensive ERP CORE dataset that includes over 50,000 EEG trials across six ERP paradigms from 40 subjects, ii) the large P300 Speller BCI dataset, and iii) a neuroimaging dataset on face perception consisting of both EEG and magnetoencephalography (MEG) data. We consistently find that our method in the few trial regime provides substantially better ERP estimates than commonly used conventional and robust averaging procedures. EEG2ERP is the first deep learning approach to map EEG signals to their associated ERP, moving toward reducing the number of trials necessary for ERP research. Code is available at this https URL

Title: Energy-Efficient Vision Transformer Inference for Edge-AI Deployment

Authors: Nursultan Amanzhol, Jurn-Gyu Park
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23166
Pdf URL: https://arxiv.org/pdf/2511.23166
Copy Paste: [[2511.23166]] Energy-Efficient Vision Transformer Inference for Edge-AI Deployment(https://arxiv.org/abs/2511.23166)
Keywords: transformer
Abstract: The growing deployment of Vision Transformers (ViTs) on energy-constrained devices requires evaluation methods that go beyond accuracy alone. We present a two-stage pipeline for assessing ViT energy efficiency that combines device-agnostic model selection with device-related measurements. We benchmark 13 ViT models on ImageNet-1K and CIFAR-10, running inference on NVIDIA Jetson TX2 (edge device) and an NVIDIA RTX 3050 (mobile GPU). The device-agnostic stage uses the NetScore metric for screening; the device-related stage ranks models with the Sustainable Accuracy Metric (SAM). Results show that hybrid models such as LeViT_Conv_192 reduce energy by up to 53% on TX2 relative to a ViT baseline (e.g., SAM5=1.44 on TX2/CIFAR-10), while distilled models such as TinyViT-11M_Distilled excel on the mobile GPU (e.g., SAM5=1.72 on RTX 3050/CIFAR-10 and SAM5=0.76 on RTX 3050/ImageNet-1K).

Title: PowerCLIP: Powerset Alignment for Contrastive Pre-Training

Authors: Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23170
Pdf URL: https://arxiv.org/pdf/2511.23170
Copy Paste: [[2511.23170]] PowerCLIP: Powerset Alignment for Contrastive Pre-Training(https://arxiv.org/abs/2511.23170)
Keywords: robust
Abstract: Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Our code will be made publicly available.

Title: Fast Multi-view Consistent 3D Editing with Video Priors

Authors: Liyi Chen, Ruihuang Li, Guowen Zhang, Pengfei Wang, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23172
Pdf URL: https://arxiv.org/pdf/2511.23172
Copy Paste: [[2511.23172]] Fast Multi-view Consistent 3D Editing with Video Priors(https://arxiv.org/abs/2511.23172)
Keywords: generative
Abstract: Text-driven 3D editing enables user-friendly 3D object or scene editing with text instructions. Due to the lack of multi-view consistency priors, existing methods typically resort to employing 2D generation or editing models to process each view individually, followed by iterative 2D-3D-2D updating. However, these methods are not only time-consuming but also prone to over-smoothed results because the different editing signals gathered from different views are averaged during the iterative process. In this paper, we propose generative Video Prior based 3D Editing (ViP3DE) to employ the temporal consistency priors from pre-trained video generation models for multi-view consistent 3D editing in a single forward pass. Our key insight is to condition the video generation model on a single edited view to generate other consistent edited views for 3D updating directly, thereby bypassing the iterative editing paradigm. Since 3D updating requires edited views to be paired with specific camera poses, we propose motion-preserved noise blending for the video model to generate edited views at predefined camera poses. In addition, we introduce geometry-aware denoising to further enhance multi-view consistency by integrating 3D geometric priors into video models. Extensive experiments demonstrate that our proposed ViP3DE can achieve high-quality 3D editing results even within a single forward pass, significantly outperforming existing methods in both editing quality and speed.

Title: Are LLMs Good Safety Agents or a Propaganda Engine?

Authors: Neemesh Yadav, Francesco Ortu, Jiarui Liu, Joeun Yook, Bernhard Schölkopf, Rada Mihalcea, Alberto Cazzaniga, Zhijing Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23174
Pdf URL: https://arxiv.org/pdf/2511.23174
Copy Paste: [[2511.23174]] Are LLMs Good Safety Agents or a Propaganda Engine?(https://arxiv.org/abs/2511.23174)
Keywords: attack, large language model
Abstract: Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this behavior is truly a reflection of its safety policies or an indication of political censorship, that is practiced globally by countries, is lacking. Differentiating between safety influenced refusals or politically motivated censorship is hard and unclear. For this purpose we introduce PSP, a dataset built specifically to probe the refusal behaviors in LLMs from an explicitly political context. PSP is built by formatting existing censored content from two data sources, openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and, 2) vulnerability of models on PSP through prompt injection attacks (PIAs). Associating censorship with refusals on content with masked implicit intent, we find that most LLMs perform some form of censorship. We conclude with summarizing major attributes that can cause a shift in refusal distributions across models and contexts of different countries.

Title: Identification of Malicious Posts on the Dark Web Using Supervised Machine Learning

Authors: Sebastião Alves de Jesus Filho, Gustavo Di Giovanni Bernardo, Paulo Henrique Ribeiro Gabriel, Bruno Bogaz Zarpelão, Rodrigo Sanches Miani
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23183
Pdf URL: https://arxiv.org/pdf/2511.23183
Copy Paste: [[2511.23183]] Identification of Malicious Posts on the Dark Web Using Supervised Machine Learning(https://arxiv.org/abs/2511.23183)
Keywords: security, defense, attack, robust
Abstract: Given the constant growth and increasing sophistication of cyberattacks, cybersecurity can no longer rely solely on traditional defense techniques and tools. Proactive detection of cyber threats has become essential to help security teams identify potential risks and implement effective mitigation measures. Cyber Threat Intelligence (CTI) plays a key role by providing security analysts with evidence-based knowledge about cyber threats. CTI information can be extracted using various techniques and data sources; however, machine learning has proven promising. As for data sources, social networks and online discussion forums are commonly explored. In this study, we apply text mining techniques and machine learning to data collected from Dark Web forums in Brazilian Portuguese to identify malicious posts. Our contributions include the creation of three original datasets, a novel multi-stage labeling process combining indicators of compromise (IoCs), contextual keywords, and manual analysis, and a comprehensive evaluation of text representations and classifiers. To our knowledge, this is the first study to focus specifically on Brazilian Portuguese content in this domain. The best-performing model, using LightGBM and TF-IDF, was able to detect relevant posts with high accuracy. We also applied topic modeling to validate the model's outputs on unlabeled data, confirming its robustness in real-world scenarios.

Title: Listwise Preference Optimization with Element-wise Confusions for Aspect Sentiment Quad Prediction

Authors: Wenna Lai, Haoran Xie, Guandong Xu, Qing Li, S. Joe Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23184
Pdf URL: https://arxiv.org/pdf/2511.23184
Copy Paste: [[2511.23184]] Listwise Preference Optimization with Element-wise Confusions for Aspect Sentiment Quad Prediction(https://arxiv.org/abs/2511.23184)
Keywords: interpretability
Abstract: Aspect sentiment quad prediction (ASQP) is inherently challenging to predict a structured quadruple with four core sentiment elements, including aspect term (a), aspect category (c), opinion term (o), and sentiment polarity (s). Prior methods relying on marker-based prediction struggle with modeling the intricate relationships among elements and experience sharp performance declines when predicting higher-order elements (e.g., c and s) under standard supervised fine-tuning. To address these limitations, we employ reasoning-based generation to output both the quadruple and a natural language rationale under element prefixes within a unified template, encouraging explicit relational reasoning and interpretability. To further enhance element-wise alignment, we introduce a listwise preference optimization framework for improving structural validity and relational coherence. Specifically, we generate element-wise confusable candidates via syntactic and semantic proximity, then train the model with listwise objectives to prefer the gold candidates over closely competing alternatives. Extensive experiments on four benchmark datasets demonstrate that our framework effectively improves quadruple prediction accuracy and explanation consistency.

Title: Clustering Malware at Scale: A First Full-Benchmark Study

Authors: Martin Mocko, Jakub Ševcech, Daniela Chudá
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.23198
Pdf URL: https://arxiv.org/pdf/2511.23198
Copy Paste: [[2511.23198]] Clustering Malware at Scale: A First Full-Benchmark Study(https://arxiv.org/abs/2511.23198)
Keywords: attack
Abstract: Recent years have shown that malware attacks still happen with high frequency. Malware experts seek to categorize and classify incoming samples to confirm their trustworthiness or prove their maliciousness. One of the ways in which groups of malware samples can be identified is through malware clustering. Despite the efforts of the community, malware clustering which incorporates benign samples has been under-explored. Moreover, despite the availability of larger public benchmark malware datasets, malware clustering studies have avoided fully utilizing these datasets in their experiments, often resorting to small datasets with only a few families. Additionally, the current state-of-the-art solutions for malware clustering remain unclear. In our study, we evaluate malware clustering quality and establish the state-of-the-art on Bodmas and Ember - two large public benchmark malware datasets. Ours is the first study of malware clustering performed on whole malware benchmark datasets. Additionally, we extend the malware clustering task by incorporating benign samples. Our results indicate that incorporating benign samples does not significantly degrade clustering quality. We find that there are significant differences in the quality of the created clusters between Ember and Bodmas, as well as a private industry dataset. Contrary to popular opinion, our top clustering performers are K-Means and BIRCH, with DBSCAN and HAC falling behind.

Title: Vision Bridge Transformer at Scale

Authors: Zhenxiong Tan, Zeqing Wang, Xingyi Yang, Songhua Liu, Xinchao Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23199
Pdf URL: https://arxiv.org/pdf/2511.23199
Copy Paste: [[2511.23199]] Vision Bridge Transformer at Scale(https://arxiv.org/abs/2511.23199)
Keywords: robust, diffusion, transformer
Abstract: We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.

Title: Quantifying the Privacy-Utility Trade-off in GPS-based Daily Stress Recognition using Semantic Features

Authors: Hoang Khang Phan, Nhat Tan Le
Subjects: cs.CR, cs.HC
Abstract URL: https://arxiv.org/abs/2511.23200
Pdf URL: https://arxiv.org/pdf/2511.23200
Copy Paste: [[2511.23200]] Quantifying the Privacy-Utility Trade-off in GPS-based Daily Stress Recognition using Semantic Features(https://arxiv.org/abs/2511.23200)
Keywords: privacy
Abstract: Psychological stress is a widespread issue that significantly impacts student well-being and academic performance. Effective remote stress recognition is crucial, yet existing methods often rely on wearable devices or GPS-based clustering techniques that pose privacy risks. In this study, we introduce a novel, end-to-end privacy-enhanced framework for semantic location encoding using a self-hosted OSM engine and an LLM-bootstrapped static map. We rigorously quantify the privacy-utility trade-off and demonstrate (via LOSO validation) that our Privacy-Aware (PA) model achieves performance statistically indistinguishable from a non-private model, proving that utility does not require sacrificing privacy. Feature importance analysis highlights that recreational activity time, working time, and travel time play a significant role in stress recognition.

Title: Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation

Authors: Jose Moises Araya-Martinez, Gautham Mohan, Kenichi Hayakawa Bolaños, Roberto Mendieta, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23214
Pdf URL: https://arxiv.org/pdf/2511.23214
Copy Paste: [[2511.23214]] Zero-Shot Multi-Criteria Visual Quality Inspection for Semi-Controlled Industrial Environments via Real-Time 3D Digital Twin Simulation(https://arxiv.org/abs/2511.23214)
Keywords: robust
Abstract: Early-stage visual quality inspection is vital for achieving Zero-Defect Manufacturing and minimizing production waste in modern industrial environments. However, the complexity of robust visual inspection systems and their extensive data requirements hinder widespread adoption in semi-controlled industrial settings. In this context, we propose a pose-agnostic, zero-shot quality inspection framework that compares real scenes against real-time Digital Twins (DT) in the RGB-D space. Our approach enables efficient real-time DT rendering by semantically describing industrial scenes through object detection and pose estimation of known Computer-Aided Design models. We benchmark tools for real-time, multimodal RGB-D DT creation while tracking consumption of computational resources. Additionally, we provide an extensible and hierarchical annotation strategy for multi-criteria defect detection, unifying pose labelling with logical and structural defect annotations. Based on an automotive use case featuring the quality inspection of an axial flux motor, we demonstrate the effectiveness of our framework. Our results demonstrate detection performace, achieving intersection-over-union (IoU) scores of up to 63.3% compared to ground-truth masks, even if using simple distance measurements under semi-controlled industrial conditions. Our findings lay the groundwork for future research on generalizable, low-data defect detection methods in dynamic manufacturing settings.

Title: Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day

Authors: Milad Abdollahzadeh, Abdul Raheem, Zilong Zhao, Uzair Javaid, Kevin Yee, Nalam Venkata Abhishek, Tram Truong-Huu, Biplab Sikdar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23220
Pdf URL: https://arxiv.org/pdf/2511.23220
Copy Paste: [[2511.23220]] Instruction Tuning of Large Language Models for Tabular Data Generation-in One Day(https://arxiv.org/abs/2511.23220)
Keywords: large language model
Abstract: Tabular instruction tuning has emerged as a promising research direction for improving LLMs understanding of tabular data. However, the majority of existing works only consider question-answering and reasoning tasks over tabular data, leaving tabular data generation largely unnoticed. In this work, for the first time, we explore the efficacy of instruction tuning in improving LLMs tabular data generation capabilities. More specifically, given the high data and computation requirements of tabular instruction tuning, we aim to address the possibility of instruction tuning for tabular data generation with limited data and computational resources. To achieve this, we first create a high-quality instruction dataset for tabular data, enabling efficient LLM comprehension. We then instruction-tune an open-source LLM (Llama3.1-8B-Instruct) on the training set of this dataset to improve its tabular data generation performance. Our experimental results show that by using our high-quality dataset and instruction-tuning on only 7K instructions with an A100 GPU, for less than 6 hours, we achieve tabular data generation performance on par with the most capable commercial LLM, GPT-4o.

Title: Robust 3DGS-based SLAM via Adaptive Kernel Smoothing

Authors: Shouhe Zhang, Dayong Ren, Sensen Song, Wenjie Li, Piaopiao Yu, Yurong Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23221
Pdf URL: https://arxiv.org/pdf/2511.23221
Copy Paste: [[2511.23221]] Robust 3DGS-based SLAM via Adaptive Kernel Smoothing(https://arxiv.org/abs/2511.23221)
Keywords: robust
Abstract: In this paper, we challenge the conventional notion in 3DGS-SLAM that rendering quality is the primary determinant of tracking accuracy. We argue that, compared to solely pursuing a perfect scene representation, it is more critical to enhance the robustness of the rasterization process against parameter errors to ensure stable camera pose tracking. To address this challenge, we propose a novel approach that leverages a smooth kernel strategy to enhance the robustness of 3DGS-based SLAM. Unlike conventional methods that focus solely on minimizing rendering error, our core insight is to make the rasterization process more resilient to imperfections in the 3DGS parameters. We hypothesize that by allowing each Gaussian to influence a smoother, wider distribution of pixels during rendering, we can mitigate the detrimental effects of parameter noise from outlier Gaussians. This approach intentionally introduces a controlled blur to the rendered image, which acts as a regularization term, stabilizing the subsequent pose optimization. While a complete redesign of the rasterization pipeline is an ideal solution, we propose a practical and effective alternative that is readily integrated into existing 3DGS frameworks. Our method, termed Corrective Blurry KNN (CB-KNN), adaptively modifies the RGB values and locations of the K-nearest neighboring Gaussians within a local region. This dynamic adjustment generates a smoother local rendering, reducing the impact of erroneous GS parameters on the overall image. Experimental results demonstrate that our approach, while maintaining the overall quality of the scene reconstruction (mapping), significantly improves the robustness and accuracy of camera pose tracking.

Title: DAONet-YOLOv8: An Occlusion-Aware Dual-Attention Network for Tea Leaf Pest and Disease Detection

Authors: Yefeng Wu, Shan Wan, Ling Wu, Yecheng Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23222
Pdf URL: https://arxiv.org/pdf/2511.23222
Copy Paste: [[2511.23222]] DAONet-YOLOv8: An Occlusion-Aware Dual-Attention Network for Tea Leaf Pest and Disease Detection(https://arxiv.org/abs/2511.23222)
Keywords: extraction
Abstract: Accurate detection of tea leaf pests and diseases in real plantations remains challenging due to complex backgrounds, variable illumination, and frequent occlusions among dense branches and leaves. Existing detectors often suffer from missed detections and false positives in such scenarios. To address these issues, we propose DAONet-YOLOv8, an enhanced YOLOv8 variant with three key improvements: (1) a Dual-Attention Fusion Module (DAFM) that combines convolutional local feature extraction with self-attention based global context modeling to focus on subtle lesion regions while suppressing background noise; (2) an occlusion-aware detection head (Detect-OAHead) that learns the relationship between visible and occluded parts to compensate for missing lesion features; and (3) a C2f-DSConv module employing dynamic synthesis convolutions with multiple kernel shapes to better capture irregular lesion boundaries. Experiments on our real-world tea plantation dataset containing six pest and disease categories demonstrate that DAONet-YOLOv8 achieves 92.97% precision, 92.80% recall, 97.10% mAP@50 and 76.90% mAP@50:95, outperforming the YOLOv8n baseline by 2.34, 4.68, 1.40 and 1.80 percentage points respectively, while reducing parameters by 16.7%. Comparative experiments further confirm that DAONet-YOLOv8 achieves superior performance over mainstream detection models.

Title: TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies

Authors: Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, Jianxin Wu
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.23225
Pdf URL: https://arxiv.org/pdf/2511.23225
Copy Paste: [[2511.23225]] TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies(https://arxiv.org/abs/2511.23225)
Keywords: transformer
Abstract: Native FP8 support in modern hardware is essential for training large Transformers, but is severely hindered by extreme activation outliers. Existing solutions either rely on complex mixed-precision engineering or invasive architectural modifications. This paper fundamentally challenges the conventional wisdom that outliers are data-driven. We demonstrate that extreme outliers are a data-independent, mechanically-produced artifact of training, originating from specific structural properties of the weight matrices (i.e., colinearity). Based on this insight, we propose TWEO (Transformers Without Extreme Outliers), a novel, non-invasive loss function. TWEO effectively prevents extreme outliers via a very simple loss term, which reduces outliers from 10000+ to less than 20. TWEO then enables full-model FP8 pre-training with neither engineering tricks nor architectural changes for both LLM and ViT. When standard FP8 training catastrophically collapses, TWEO achieves performance comparable to the BF16 baseline while delivering a 36% increase in training throughput. Also, TWEO enables a new quantization paradigm. Hardware-friendly W8A8 per-tensor static quantization of LLMs, previously considered completely unusable due to outliers, achieves SOTA performance for the first time on TWEO-trained models.

Title: Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering

Authors: Qiming Li, Xiaocheng Feng, Yixuan Ma, Zekai Ye, Ruihan Chen, Xiachong Feng, Bing Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23231
Pdf URL: https://arxiv.org/pdf/2511.23231
Copy Paste: [[2511.23231]] Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering(https://arxiv.org/abs/2511.23231)
Keywords: fair, large language model
Abstract: Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input-output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (Thai and Swahili), while improving input-output language consistency by 3.78%.

Title: SDE-Attention: Latent Attention in SDE-RNNs for Irregularly Sampled Time Series with Missing Data

Authors: Yuting Fang, Qouc Le Gia, Flora Salim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23238
Pdf URL: https://arxiv.org/pdf/2511.23238
Copy Paste: [[2511.23238]] SDE-Attention: Latent Attention in SDE-RNNs for Irregularly Sampled Time Series with Missing Data(https://arxiv.org/abs/2511.23238)
Keywords: robust
Abstract: Irregularly sampled time series with substantial missing observations are common in healthcare and sensor networks. We introduce SDE-Attention, a family of SDE-RNNs equipped with channel-level attention on the latent pre-RNN state, including channel recalibration, time-varying feature attention, and pyramidal multi-scale self-attention. We therefore conduct a comparison on a synthetic periodic dataset and real-world benchmarks, under varying missing rate. Latent-space attention consistently improves over a vanilla SDE-RNN. On the univariate UCR datasets, the LSTM-based time-varying feature model SDE-TVF-L achieves the highest average accuracy, raising mean performance by approximately 4, 6, and 10 percentage points over the baseline at 30%, 60% and 90% missingness, respectively (averaged across datasets). On multivariate UEA benchmarks, attention-augmented models again outperform the backbone, with SDE-TVF-L yielding up to a 7% gain in mean accuracy under high missingness. Among the proposed mechanisms, time-varying feature attention is the most robust on univariate datasets. On multivariate datasets, different attention types excel on different tasks, showing that SDE-Attention can be flexibly adapted to the structure of each problem.

Title: Towards Understanding Transformers in Learning Random Walks

Authors: Wei Shi, Yuan Cao
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.23239
Pdf URL: https://arxiv.org/pdf/2511.23239
Copy Paste: [[2511.23239]] Towards Understanding Transformers in Learning Random Walks(https://arxiv.org/abs/2511.23239)
Keywords: interpretability, transformer
Abstract: Transformers have proven highly effective across various applications, especially in handling sequential data such as natural languages and time series. However, transformer models often lack clear interpretability, and the success of transformers has not been well understood in theory. In this paper, we study the capability and interpretability of transformers in learning a family of classic statistical models, namely random walks on circles. We theoretically demonstrate that, after training with gradient descent, a one-layer transformer model can achieve optimal accuracy in predicting random walks. Importantly, our analysis reveals that the trained model is interpretable: the trained softmax attention serves as a token selector, focusing on the direct parent state; subsequently, the value matrix executes a one-step probability transition to predict the location of the next state based on this parent state. We also show that certain edge cases not covered by our theory are indeed failure cases, demonstrating that our theoretical conditions are tight. By investigating these success and failure cases, it is revealed that gradient descent with small initialization may fail or struggle to converge to a good solution in certain simple tasks even beyond random walks. Experiments are conducted to support our theoretical findings.

Title: Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods

Authors: Jose Moises Araya-Martinez, Adrián Sanchis Reig, Gautham Mohan, Sarvenaz Sardari, Jens Lambrecht, Jörg Krüger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23241
Pdf URL: https://arxiv.org/pdf/2511.23241
Copy Paste: [[2511.23241]] Synthetic Industrial Object Detection: GenAI vs. Feature-Based Methods(https://arxiv.org/abs/2511.23241)
Keywords: diffusion, generative
Abstract: Reducing the burden of data generation and annotation remains a major challenge for the cost-effective deployment of machine learning in industrial and robotics settings. While synthetic rendering is a promising solution, bridging the sim-to-real gap often requires expert intervention. In this work, we benchmark a range of domain randomization (DR) and domain adaptation (DA) techniques, including feature-based methods, generative AI (GenAI), and classical rendering approaches, for creating contextualized synthetic data without manual annotation. Our evaluation focuses on the effectiveness and efficiency of low-level and high-level feature alignment, as well as a controlled diffusion-based DA method guided by prompts generated from real-world contexts. We validate our methods on two datasets: a proprietary industrial dataset (automotive and logistics) and a public robotics dataset. Results show that if render-based data with enough variability is available as seed, simpler feature-based methods, such as brightness-based and perceptual hashing filtering, outperform more complex GenAI-based approaches in both accuracy and resource efficiency. Perceptual hashing consistently achieves the highest performance, with mAP50 scores of 98% and 67% on the industrial and robotics datasets, respectively. Additionally, GenAI methods present significant time overhead for data generation at no apparent improvement of sim-to-real mAP values compared to simpler methods. Our findings offer actionable insights for efficiently bridging the sim-to-real gap, enabling high real-world performance from models trained exclusively on synthetic data.

Title: Learning to Predict Aboveground Biomass from RGB Images with 3D Synthetic Scenes

Authors: Silvia Zuffi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23249
Pdf URL: https://arxiv.org/pdf/2511.23249
Copy Paste: [[2511.23249]] Learning to Predict Aboveground Biomass from RGB Images with 3D Synthetic Scenes(https://arxiv.org/abs/2511.23249)
Keywords: segmentation
Abstract: Forests play a critical role in global ecosystems by supporting biodiversity and mitigating climate change via carbon sequestration. Accurate aboveground biomass (AGB) estimation is essential for assessing carbon storage and wildfire fuel loads, yet traditional methods rely on labor-intensive field measurements or remote sensing approaches with significant limitations in dense vegetation. In this work, we propose a novel learning-based method for estimating AGB from a single ground-based RGB image. We frame this as a dense prediction task, introducing AGB density maps, where each pixel represents tree biomass normalized by the plot area and each tree's image area. We leverage the recently introduced synthetic 3D SPREAD dataset, which provides realistic forest scenes with per-image tree attributes (height, trunk and canopy diameter) and instance segmentation masks. Using these assets, we compute AGB via allometric equations and train a model to predict AGB density maps, integrating them to recover the AGB estimate for the captured scene. Our approach achieves a median AGB estimation error of 1.22 kg/m^2 on held-out SPREAD data and 1.94 kg/m^2 on a real-image dataset. To our knowledge, this is the first method to estimate aboveground biomass directly from a single RGB image, opening up the possibility for a scalable, interpretable, and cost-effective solution for forest monitoring, while also enabling broader participation through citizen science initiatives.

Title: One-Shot Secure Aggregation: A Hybrid Cryptographic Protocol for Private Federated Learning in IoT

Authors: Imraul Emmaka, Tran Viet Xuan Phuong
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23252
Pdf URL: https://arxiv.org/pdf/2511.23252
Copy Paste: [[2511.23252]] One-Shot Secure Aggregation: A Hybrid Cryptographic Protocol for Private Federated Learning in IoT(https://arxiv.org/abs/2511.23252)
Keywords: secure, privacy, protect, robust, federate
Abstract: Federated Learning (FL) offers a promising approach to collaboratively train machine learning models without centralizing raw data, yet its scalability is often throttled by excessive communication overhead. This challenge is magnified in Internet of Things (IoT) environments, where devices face stringent bandwidth, latency, and energy constraints. Conventional secure aggregation protocols, while essential for protecting model updates, frequently require multiple interaction rounds, large payload sizes, and per-client costs rendering them impractical for many edge deployments. In this work, we present Hyb-Agg, a lightweight and communication-efficient secure aggregation protocol that integrates Multi-Key CKKS (MK-CKKS) homomorphic encryption with Elliptic Curve Diffie-Hellman (ECDH)-based additive masking. Hyb-Agg reduces the secure aggregation process to a single, non-interactive client-to-server transmission per round, ensuring that per-client communication remains constant regardless of the number of participants. This design eliminates partial decryption exchanges, preserves strong privacy under the RLWE, CDH, and random oracle assumptions, and maintains robustness against collusion by the server and up to $N-2$ clients. We implement and evaluate Hyb-Agg on both high-performance and resource-constrained devices, including a Raspberry Pi 4, demonstrating that it delivers sub-second execution times while achieving a constant communication expansion factor of approximately 12x over plaintext size. By directly addressing the communication bottleneck, Hyb-Agg enables scalable, privacy-preserving federated learning that is practical for real-world IoT deployments.

Title: BanglaSentNet: An Explainable Hybrid Deep Learning Framework for Multi-Aspect Sentiment Analysis with Cross-Domain Transfer Learning

Authors: Ariful Islam, Md Rifat Hossen, Tanvir Mahmud
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2511.23264
Pdf URL: https://arxiv.org/pdf/2511.23264
Copy Paste: [[2511.23264]] BanglaSentNet: An Explainable Hybrid Deep Learning Framework for Multi-Aspect Sentiment Analysis with Cross-Domain Transfer Learning(https://arxiv.org/abs/2511.23264)
Keywords: robust, interpretability, explainability
Abstract: Multi-aspect sentiment analysis of Bangla e-commerce reviews remains challenging due to limited annotated datasets, morphological complexity, code-mixing phenomena, and domain shift issues, affecting 300 million Bangla-speaking users. Existing approaches lack explainability and cross-domain generalization capabilities crucial for practical deployment. We present BanglaSentNet, an explainable hybrid deep learning framework integrating LSTM, BiLSTM, GRU, and BanglaBERT through dynamic weighted ensemble learning for multi-aspect sentiment classification. We introduce a dataset of 8,755 manually annotated Bangla product reviews across four aspects (Quality, Service, Price, Decoration) from major Bangladeshi e-commerce platforms. Our framework incorporates SHAP-based feature attribution and attention visualization for transparent insights. BanglaSentNet achieves 85% accuracy and 0.88 F1-score, outperforming standalone deep learning models by 3-7% and traditional approaches substantially. The explainability suite achieves 9.4/10 interpretability score with 87.6% human agreement. Cross-domain transfer learning experiments reveal robust generalization: zero-shot performance retains 67-76% effectiveness across diverse domains (BanglaBook reviews, social media, general e-commerce, news headlines); few-shot learning with 500-1000 samples achieves 90-95% of full fine-tuning performance, significantly reducing annotation costs. Real-world deployment demonstrates practical utility for Bangladeshi e-commerce platforms, enabling data-driven decision-making for pricing optimization, service improvement, and customer experience enhancement. This research establishes a new state-of-the-art benchmark for Bangla sentiment analysis, advances ensemble learning methodologies for low-resource languages, and provides actionable solutions for commercial applications.

Title: Simultaneous Image Quality Improvement and Artefacts Correction in Accelerated MRI

Authors: Georgia Kanli, Daniele Perlo, Selma Boudissa, Radovan Jirik, Olivier Keunen
Subjects: cs.CV, cs.AI, physics.med-ph
Abstract URL: https://arxiv.org/abs/2511.23274
Pdf URL: https://arxiv.org/pdf/2511.23274
Copy Paste: [[2511.23274]] Simultaneous Image Quality Improvement and Artefacts Correction in Accelerated MRI(https://arxiv.org/abs/2511.23274)
Keywords: robust
Abstract: MR data are acquired in the frequency domain, known as k-space. Acquiring high-quality and high-resolution MR images can be time-consuming, posing a significant challenge when multiple sequences providing complementary contrast information are needed or when the patient is unable to remain in the scanner for an extended period of time. Reducing k-space measurements is a strategy to speed up acquisition, but often leads to reduced quality in reconstructed images. Additionally, in real-world MRI, both under-sampled and full-sampled images are prone to artefacts, and correcting these artefacts is crucial for maintaining diagnostic accuracy. Deep learning methods have been proposed to restore image quality from under-sampled data, while others focused on the correction of artefacts that result from the noise or motion. No approach has however been proposed so far that addresses both acceleration and artefacts correction, limiting the performance of these models when these degradation factors occur simultaneously. To address this gap, we present a method for recovering high-quality images from under-sampled data with simultaneously correction for noise and motion artefact called USArt (Under-Sampling and Artifact correction model). Customized for 2D brain anatomical images acquired with Cartesian sampling, USArt employs a dual sub-model approach. The results demonstrate remarkable increase of signal-to-noise ratio (SNR) and contrast in the images restored. Various under-sampling strategies and degradation levels were explored, with the gradient under-sampling strategy yielding the best outcomes. We achieved up to 5x acceleration and simultaneously artefacts correction without significant degradation, showcasing the model's robustness in real-world settings.

Title: Beyond Curve Fitting: Neuro-Symbolic Agents for Context-Aware Epidemic Forecasting

Authors: Joongwon Chae, Runming Wang, Chen Xiong, Gong Yunhan, Lian Zhang, Ji Jiansong, Dongmei Yu, Peiwu Qin
Subjects: cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2511.23276
Pdf URL: https://arxiv.org/pdf/2511.23276
Copy Paste: [[2511.23276]] Beyond Curve Fitting: Neuro-Symbolic Agents for Context-Aware Epidemic Forecasting(https://arxiv.org/abs/2511.23276)
Keywords: robust
Abstract: Effective surveillance of hand, foot and mouth disease (HFMD) requires forecasts accounting for epidemiological patterns and contextual drivers like school calendars and weather. While classical models and recent foundation models (e.g., Chronos, TimesFM) incorporate covariates, they often lack the semantic reasoning to interpret the causal interplay between conflicting drivers. In this work, we propose a two-agent framework decoupling contextual interpretation from probabilistic forecasting. An LLM "event interpreter" processes heterogeneous signals-including school schedules, meteorological summaries, and reports-into a scalar transmission-impact signal. A neuro-symbolic core then combines this with historical case counts to produce calibrated probabilistic forecasts. We evaluate the framework on real-world HFMD datasets from Hong Kong (2023-2024) and Lishui, China (2024). Compared to traditional and foundation-model baselines, our approach achieves competitive point forecasting accuracy while providing robust 90% prediction intervals (coverage 0.85-1.00) and human-interpretable rationales. Our results suggest that structurally integrating domain knowledge through LLMs can match state-of-the-art performance while yielding context-aware forecasts that align with public health workflows. Code is available at this https URL .

Title: MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

Authors: Aaron Steiner, Ralph Peeters, Christian Bizer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.23281
Pdf URL: https://arxiv.org/pdf/2511.23281
Copy Paste: [[2511.23281]] MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)(https://arxiv.org/abs/2511.23281)
Keywords: large language model
Abstract: Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as underlying LLM. Our evaluation shows that the RAG, MCP and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5 achieving an F1 score of 0.87 and a completion rate of 0.79. Also taking cost into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show the choice of the interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.

Title: Closing the Generalization Gap in Parameter-efficient Federated Edge Learning

Authors: Xinnong Du, Zhonghao Lyu, Xiaowen Cao, Chunyang Wen, Shuguang Cui, Jie Xu
Subjects: cs.LG, cs.DC, cs.IT
Abstract URL: https://arxiv.org/abs/2511.23282
Pdf URL: https://arxiv.org/pdf/2511.23282
Copy Paste: [[2511.23282]] Closing the Generalization Gap in Parameter-efficient Federated Edge Learning(https://arxiv.org/abs/2511.23282)
Keywords: privacy, federate
Abstract: Federated edge learning (FEEL) provides a promising foundation for edge artificial intelligence (AI) by enabling collaborative model training while preserving data privacy. However, limited and heterogeneous local datasets, as well as resource-constrained deployment, severely degrade both model generalization and resource utilization, leading to a compromised learning performance. Therefore, we propose a parameter-efficient FEEL framework that jointly leverages model pruning and client selection to tackle such challenges. First, we derive an information-theoretic generalization statement that characterizes the discrepancy between training and testing function losses and embed it into the convergence analysis. It reveals that a larger local generalization statement can undermine the global convergence. Then, we formulate a generalization-aware average squared gradient norm bound minimization problem, by jointly optimizing the pruning ratios, client selection, and communication-computation resources under energy and delay constraints. Despite its non-convexity, the resulting mixed-integer problem is efficiently solved via an alternating optimization algorithm. Extensive experiments demonstrate that the proposed design achieves superior learning performance than state-of-the-art baselines, validating the effectiveness of coupling generalization-aware analysis with system-level optimization for efficient FEEL.

Title: Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla

Authors: Ariful Islam, Tanvir Mahmud, Md Rifat Hossen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2511.23287
Pdf URL: https://arxiv.org/pdf/2511.23287
Copy Paste: [[2511.23287]] Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla(https://arxiv.org/abs/2511.23287)
Keywords: transformer
Abstract: The expansion of the Internet and social networks has led to an explosion of user-generated content. Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. Recognizing limitations in previous unimodal approaches, we systematically benchmark transformer-based language models (mBERT, DistilBERT, XLM-RoBERTa) and vision architectures (ViT, Swin, SwiftFormer, ResNet, DenseNet, MobileNet), utilizing the Uddessho dataset of 3,048 posts spanning six practical intent categories. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task. Experimental results show that intermediate fusion, particularly with mBERT and Swin Transformer, achieves 84.11% macro-F1 score, establishing a new state-of-the-art with an 8.4 percentage-point improvement over prior Bangla multimodal approaches. Our analysis demonstrates that integrating visual context substantially enhances intent classification. Cross-modal feature integration at intermediate levels provides optimal balance between modality-specific representation and cross-modal learning. This research establishes new benchmarks and methodological standards for Bangla and other low-resource languages. We call our proposed framework BangACMM (Bangla Author Content MultiModal).

Title: Machine Learning for Scientific Visualization: Ensemble Data Analysis

Authors: Hamid Gadirov
Subjects: cs.LG, cs.AI, cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.23290
Pdf URL: https://arxiv.org/pdf/2511.23290
Copy Paste: [[2511.23290]] Machine Learning for Scientific Visualization: Ensemble Data Analysis(https://arxiv.org/abs/2511.23290)
Keywords: robust
Abstract: Scientific simulations and experimental measurements produce vast amounts of spatio-temporal data, yet extracting meaningful insights remains challenging due to high dimensionality, complex structures, and missing information. Traditional analysis methods often struggle with these issues, motivating the need for more robust, data-driven approaches. This dissertation explores deep learning methodologies to improve the analysis and visualization of spatio-temporal scientific ensembles, focusing on dimensionality reduction, flow estimation, and temporal interpolation. First, we address high-dimensional data representation through autoencoder-based dimensionality reduction for scientific ensembles. We evaluate the stability of projection metrics under partial labeling and introduce a Pareto-efficient selection strategy to identify optimal autoencoder variants, ensuring expressive and reliable low-dimensional embeddings. Next, we present FLINT, a deep learning model for high-quality flow estimation and temporal interpolation in both flow-supervised and flow-unsupervised settings. FLINT reconstructs missing velocity fields and generates high-fidelity temporal interpolants for scalar fields across 2D+time and 3D+time ensembles without domain-specific assumptions or extensive finetuning. To further improve adaptability and generalization, we introduce HyperFLINT, a hypernetwork-based approach that conditions on simulation parameters to estimate flow fields and interpolate scalar data. This parameter-aware adaptation yields more accurate reconstructions across diverse scientific domains, even with sparse or incomplete data. Overall, this dissertation advances deep learning techniques for scientific visualization, providing scalable, adaptable, and high-quality solutions for interpreting complex spatio-temporal ensembles.

Title: Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

Authors: Xiang Hu, Zhanchao Zhou, Ruiqi Liang, Zehuan Li, Wei Wu, Jianguo Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23319
Pdf URL: https://arxiv.org/pdf/2511.23319
Copy Paste: [[2511.23319]] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models(https://arxiv.org/abs/2511.23319)
Keywords: transformer, large language model
Abstract: This work explores the challenge of building ``Machines that Can Remember'', framing long-term memory as the problem of efficient ultra-long context modeling. We argue that this requires three key properties: \textbf{sparsity}, \textbf{random-access flexibility}, and \textbf{length generalization}. To address ultra-long-context modeling, we leverage Hierarchical Sparse Attention (HSA), a novel attention mechanism that satisfies all three properties. We integrate HSA into Transformers to build HSA-UltraLong, which is an 8B-parameter MoE model trained on over 8 trillion tokens and is rigorously evaluated on different tasks with in-domain and out-of-domain context lengths to demonstrate its capability in handling ultra-long contexts. Results show that our model performs comparably to full-attention baselines on in-domain lengths while achieving over 90\% accuracy on most in-context retrieval tasks with contexts up to 16M. This report outlines our experimental insights and open problems, contributing a foundation for future research in ultra-long context modeling.

Title: UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Authors: Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23332
Pdf URL: https://arxiv.org/pdf/2511.23332
Copy Paste: [[2511.23332]] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes(https://arxiv.org/abs/2511.23332)
Keywords: segmentation
Abstract: Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at this https URL.

Title: Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach

Authors: Shuqi Liu, Han Wu, Guanzhi Deng, Jianshu Chen, Xiaoyang Wang, Linqi Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23335
Pdf URL: https://arxiv.org/pdf/2511.23335
Copy Paste: [[2511.23335]] Towards Improving Interpretability of Language Model Generation through a Structured Knowledge Discovery Approach(https://arxiv.org/abs/2511.23335)
Keywords: interpretability, explainability, transformer, generative
Abstract: Knowledge-enhanced text generation aims to enhance the quality of generated text by utilizing internal or external knowledge sources. While language models have demonstrated impressive capabilities in generating coherent and fluent text, the lack of interpretability presents a substantial obstacle. The limited interpretability of generated text significantly impacts its practical usability, particularly in knowledge-enhanced text generation tasks that necessitate reliability and explainability. Existing methods often employ domain-specific knowledge retrievers that are tailored to specific data characteristics, limiting their generalizability to diverse data types and tasks. To overcome this limitation, we directly leverage the two-tier architecture of structured knowledge, consisting of high-level entities and low-level knowledge triples, to design our task-agnostic structured knowledge hunter. Specifically, we employ a local-global interaction scheme for structured knowledge representation learning and a hierarchical transformer-based pointer network as the backbone for selecting relevant knowledge triples and entities. By combining the strong generative ability of language models with the high faithfulness of the knowledge hunter, our model achieves high interpretability, enabling users to comprehend the model output generation process. Furthermore, we empirically demonstrate the effectiveness of our model in both internal knowledge-enhanced table-to-text generation on the RotoWireFG dataset and external knowledge-enhanced dialogue response generation on the KdConv dataset. Our task-agnostic model outperforms state-of-the-art methods and corresponding language models, setting new standards on the benchmark.

Title: Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories

Authors: Xinxi Zhang, Shiwei Tan, Quang Nguyen, Quan Dao, Ligong Han, Xiaoxiao He, Tunyu Zhang, Alen Mrdovic, Dimitris Metaxas
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23342
Pdf URL: https://arxiv.org/pdf/2511.23342
Copy Paste: [[2511.23342]] Flow Straighter and Faster: Efficient One-Step Generative Modeling via MeanFlow on Rectified Trajectories(https://arxiv.org/abs/2511.23342)
Keywords: generative
Abstract: Flow-based generative models have recently demonstrated strong performance, yet sampling typically relies on expensive numerical integration of ordinary differential equations (ODEs). Rectified Flow enables one-step sampling by learning nearly straight probability paths, but achieving such straightness requires multiple computationally intensive reflow iterations. MeanFlow achieves one-step generation by directly modeling the average velocity over time; however, when trained on highly curved flows, it suffers from slow convergence and noisy supervision. To address these limitations, we propose Rectified MeanFlow, a framework that models the mean velocity field along the rectified trajectory using only a single reflow step. This eliminates the need for perfectly straightened trajectories while enabling efficient training. Furthermore, we introduce a simple yet effective truncation heuristic that aims to reduce residual curvature and further improve performance. Extensive experiments on ImageNet at 64, 256, and 512 resolutions show that Re-MeanFlow consistently outperforms prior one-step flow distillation and Rectified Flow methods in both sample quality and training efficiency. Code is available at this https URL.

Title: Distributed Dynamic Associative Memory via Online Convex Optimization

Authors: Bowen Wang, Matteo Zecchin, Osvaldo Simeone
Subjects: cs.LG, eess.SP
Abstract URL: https://arxiv.org/abs/2511.23347
Pdf URL: https://arxiv.org/pdf/2511.23347
Copy Paste: [[2511.23347]] Distributed Dynamic Associative Memory via Online Convex Optimization(https://arxiv.org/abs/2511.23347)
Keywords: robust, transformer
Abstract: An associative memory (AM) enables cue-response recall, and it has recently been recognized as a key mechanism underlying modern neural architectures such as Transformers. In this work, we introduce the concept of distributed dynamic associative memory (DDAM), which extends classical AM to settings with multiple agents and time-varying data streams. In DDAM, each agent maintains a local AM that must not only store its own associations but also selectively memorize information from other agents based on a specified interest matrix. To address this problem, we propose a novel tree-based distributed online gradient descent algorithm, termed DDAM-TOGD, which enables each agent to update its memory on the fly via inter-agent communication over designated routing trees. We derive rigorous performance guarantees for DDAM-TOGD, proving sublinear static regret in stationary environments and a path-length dependent dynamic regret bound in non-stationary environments. These theoretical results provide insights into how communication delays and network structure impact performance. Building on the regret analysis, we further introduce a combinatorial tree design strategy that optimizes the routing trees to minimize communication delays, thereby improving regret bounds. Numerical experiments demonstrate that the proposed DDAM-TOGD framework achieves superior accuracy and robustness compared to representative online learning baselines such as consensus-based distributed optimization, confirming the benefits of the proposed approach in dynamic, distributed environments.

Title: A Hierarchical Computer Vision Pipeline for Physiological Data Extraction from Bedside Monitors

Authors: Vinh Chau, Khoa Le Dinh Van, Hon Huynh Ngoc, Binh Nguyen Thien, Hao Nguyen Thien, Vy Nguyen Quang, Phuc Vo Hong, Yen Lam Minh, Kieu Pham Tieu, Trinh Nguyen Thi Diem, Louise Thwaites, Hai Ho Bich
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23355
Pdf URL: https://arxiv.org/pdf/2511.23355
Copy Paste: [[2511.23355]] A Hierarchical Computer Vision Pipeline for Physiological Data Extraction from Bedside Monitors(https://arxiv.org/abs/2511.23355)
Keywords: robust, extraction
Abstract: In many low-resource healthcare settings, bedside monitors remain standalone legacy devices without network connectivity, creating a persistent interoperability gap that prevents seamless integration of physiological data into electronic health record (EHR) systems. To address this challenge without requiring costly hardware replacement, we present a computer vision-based pipeline for the automated capture and digitisation of vital sign data directly from bedside monitor screens. Our method employs a hierarchical detection framework combining YOLOv11 for accurate monitor and region of interest (ROI) localisation with PaddleOCR for robust text extraction. To enhance reliability across variable camera angles and lighting conditions, a geometric rectification module standardizes the screen perspective before character recognition. We evaluated the system on a dataset of 6,498 images collected from open-source corpora and real-world intensive care units in Vietnam. The model achieved a mean Average Precision (mAP@50-95) of 99.5% for monitor detection and 91.5% for vital sign ROI localisation. The end-to-end extraction accuracy exceeded 98.9% for core physiological parameters, including heart rate, oxygen saturation SpO2, and arterial blood pressure. These results demonstrate that a lightweight, camera-based approach can reliably transform unstructured information from screen captures into structured digital data, providing a practical and scalable pathway to improve information accessibility and clinical documentation in low-resource settings.

Title: SimScale: Learning to Drive via Real-World Simulation at Scale

Authors: Haochen Tian, Tianyu Li, Haochen Liu, Jiazhi Yang, Yihang Qiu, Guang Li, Junli Wang, Yinfeng Gao, Zhang Zhang, Liang Wang, Hangjun Ye, Tieniu Tan, Long Chen, Hongyang Li
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.23369
Pdf URL: https://arxiv.org/pdf/2511.23369
Copy Paste: [[2511.23369]] SimScale: Learning to Drive via Real-World Simulation at Scale(https://arxiv.org/abs/2511.23369)
Keywords: robust
Abstract: Achieving fully autonomous driving systems requires learning rational decisions in a wide span of scenarios, including safety-critical and out-of-distribution ones. However, such cases are underrepresented in real-world corpus collected by human experts. To complement for the lack of data diversity, we introduce a novel and scalable simulation framework capable of synthesizing massive unseen states upon existing driving logs. Our pipeline utilizes advanced neural rendering with a reactive environment to generate high-fidelity multi-view observations controlled by the perturbed ego trajectory. Furthermore, we develop a pseudo-expert trajectory generation mechanism for these newly simulated states to provide action supervision. Upon the synthesized data, we find that a simple co-training strategy on both real-world and simulated samples can lead to significant improvements in both robustness and generalization for various planning methods on challenging real-world benchmarks, up to +6.8 EPDMS on navhard and +2.9 on navtest. More importantly, such policy improvement scales smoothly by increasing simulation data only, even without extra real-world data streaming in. We further reveal several crucial findings of such a sim-real learning system, which we term SimScale, including the design of pseudo-experts and the scaling properties for different policy architectures. Our simulation data and code would be released.

Title: Optimizing Multimodal Language Models through Attention-based Interpretability

Authors: Alexander Sergeev, Evgeny Kotelnikov
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2511.23375
Pdf URL: https://arxiv.org/pdf/2511.23375
Copy Paste: [[2511.23375]] Optimizing Multimodal Language Models through Attention-based Interpretability(https://arxiv.org/abs/2511.23375)
Keywords: interpretability, large language model
Abstract: Modern large language models become multimodal, analyzing various data formats like text and images. While fine-tuning is effective for adapting these multimodal language models (MLMs) to downstream tasks, full fine-tuning is computationally expensive. Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small portion of model weights. However, MLMs are difficult to interpret, making it challenging to identify which components are most effective for training to balance efficiency and performance. We propose an attention-based interpretability method for MLMs by analyzing attention scores relative to image tokens. The core idea is to identify attention heads that focus on image key objects. We utilize this information to select optimal model components for PEFT in multimodal models. Our contributions include a method for identifying attention heads associated with image key objects, its application to PEFT for image captioning, and the creation of a new dataset containing images, key object masks, and their textual descriptions. We conducted experiments on MLMs with 2-3 billion parameters to validate the method's effectiveness. By calculating Head Impact (HI) scores we quantify an attention head's focus on key objects, indicating its significance in image understanding. Our fine-tuning experiments demonstrate that adapting layers with the highest HI scores leads to the most significant shifts in metrics compared to pre-trained, randomly selected, or lowest-HI-score layers. This indicates that fine-tuning a small percentage (around 0.01%) of parameters in these crucial layers can substantially influence image understanding capabilities.

Title: DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline

Authors: Rui Zhang, Hongxia Wang, Hangqing Liu, Yang Zhou, Qiang Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23377
Pdf URL: https://arxiv.org/pdf/2511.23377
Copy Paste: [[2511.23377]] DEAL-300K: Diffusion-based Editing Area Localization with a 300K-Scale Dataset and Frequency-Prompted Baseline(https://arxiv.org/abs/2511.23377)
Keywords: diffusion, large language model
Abstract: Diffusion-based image editing has made semantic level image manipulation easy for general users, but it also enables realistic local forgeries that are hard to localize. Existing benchmarks mainly focus on the binary detection of generated images or the localization of manually edited regions and do not reflect the properties of diffusion-based edits, which often blend smoothly into the original content. We present Diffusion-Based Image Editing Area Localization Dataset (DEAL-300K), a large scale dataset for diffusion-based image manipulation localization (DIML) with more than 300,000 annotated images. We build DEAL-300K by using a multi-modal large language model to generate editing instructions, a mask-free diffusion editor to produce manipulated images, and an active-learning change detection pipeline to obtain pixel-level annotations. On top of this dataset, we propose a localization framework that uses a frozen Visual Foundation Model (VFM) together with Multi Frequency Prompt Tuning (MFPT) to capture both semantic and frequency-domain cues of edited regions. Trained on DEAL-300K, our method reaches a pixel-level F1 score of 82.56% on our test split and 80.97% on the external CoCoGlide benchmark, providing strong baselines and a practical foundation for future DIML this http URL dataset can be accessed via this https URL.

Title: Learning-Augmented Online Bipartite Matching in the Random Arrival Order Model

Authors: Kunanon Burathep, Thomas Erlebach, William K. Moses Jr
Subjects: cs.LG, cs.DS
Abstract URL: https://arxiv.org/abs/2511.23388
Pdf URL: https://arxiv.org/pdf/2511.23388
Copy Paste: [[2511.23388]] Learning-Augmented Online Bipartite Matching in the Random Arrival Order Model(https://arxiv.org/abs/2511.23388)
Keywords: robust
Abstract: We study the online unweighted bipartite matching problem in the random arrival order model, with $n$ offline and $n$ online vertices, in the learning-augmented setting: The algorithm is provided with untrusted predictions of the types (neighborhoods) of the online vertices. We build upon the work of Choo et al. (ICML 2024, pp. 8762-8781) who proposed an approach that uses a prefix of the arrival sequence as a sample to determine whether the predictions are close to the true arrival sequence and then either follows the predictions or uses a known baseline algorithm that ignores the predictions and is $\beta$-competitive. Their analysis is limited to the case that the optimal matching has size $n$, i.e., every online vertex can be matched. We generalize their approach and analysis by removing any assumptions on the size of the optimal matching while only requiring that the size of the predicted matching is at least $\alpha n$ for any constant $0 < \alpha \le 1$. Our learning-augmented algorithm achieves $(1-o(1))$-consistency and $(\beta-o(1))$-robustness. Additionally, we show that the competitive ratio degrades smoothly between consistency and robustness with increasing prediction error.

Title: FedSGT: Exact Federated Unlearning via Sequential Group-based Training

Authors: Bokang Zhang, Hong Guan, Hong kyu Lee, Ruixuan Liu, Jia Zou, Li Xiong
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.23393
Pdf URL: https://arxiv.org/pdf/2511.23393
Copy Paste: [[2511.23393]] FedSGT: Exact Federated Unlearning via Sequential Group-based Training(https://arxiv.org/abs/2511.23393)
Keywords: privacy, robust, federate
Abstract: Federated Learning (FL) enables collaborative, privacy-preserving model training, but supporting the "Right to be Forgotten" is especially challenging because data influences the model through distributed and interleaved client updates. Existing exact unlearning methods typically require frequent retraining from scratch, resulting in high communication cost and long service downtime. To address this, we propose Federated Sequential Group-based Training (FedSGT), an exact unlearning framework for FL. FedSGT partitions the data into uniform groups, and each client may participate in multiple groups. To control communication overhead, each client can limit the number of groups it contributes to. FedSGT then trains multiple sequences of Parameter-Efficient Fine-Tuning (PEFT) modules, each corresponding to a different group permutation. Since the PEFT modules are lightweight and maintained server-side, FedSGT isolates the influence of different data groups into independent modules without incurring significant storage overhead and communication cost. Exact unlearning is thus achieved instantly by deactivating the modules corresponding to the group containing the unlearned data. Furthermore, using multiple training sequences helps maintain high model utility as deletion requests accumulate. We provide a rigorous theoretical analysis of both the deletion rate -- expected number of deletions before retraining is needed -- and the expected model performance. Experiments on various tasks demonstrate that FedSGT achieves a significantly longer service maintenance under multiple unlearning requests while maintaining comparable learning performance and training efficiency to other exact unlearning baselines. Extensive ablation studies validate the robustness of our method across a wide range of parameter settings.

Title: Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning

Authors: Jiajun Guo, Xin Luo, Jie Liu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.23402
Pdf URL: https://arxiv.org/pdf/2511.23402
Copy Paste: [[2511.23402]] Quantized-Tinyllava: a new multimodal foundation model enables efficient split learning(https://arxiv.org/abs/2511.23402)
Keywords: privacy
Abstract: Split learning is well known as a method for resolving data privacy concerns by training a model on distributed devices, thereby avoiding data sharing that raises privacy issues. However, high network communication costs are always an impediment to split learning, especially for large foundation models that require transmitting large amounts of high-dimensional data. To resolve this issue, we present a new multimodal model structure that incorporates a learning-based data compression method, which compresses model embeddings into low-bit integers while preserving the model's performance, greatly reducing the transmission costs between partitions. We then determine the optimal number of discrete representation levels based on a solid theoretical foundation from entropy coding.

Title: MANTA: Physics-Informed Generalized Underwater Object Tracking

Authors: Suhas Srinath, Hemang Jamadagni, Aditya Chadrasekar, Prathosh AP
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23405
Pdf URL: https://arxiv.org/pdf/2511.23405
Copy Paste: [[2511.23405]] MANTA: Physics-Informed Generalized Underwater Object Tracking(https://arxiv.org/abs/2511.23405)
Keywords: robust
Abstract: Underwater object tracking is challenging due to wavelength dependent attenuation and scattering, which severely distort appearance across depths and water conditions. Existing trackers trained on terrestrial data fail to generalize to these physics-driven degradations. We present MANTA, a physics-informed framework integrating representation learning with tracking design for underwater scenarios. We propose a dual-positive contrastive learning strategy coupling temporal consistency with Beer-Lambert augmentations to yield features robust to both temporal and underwater distortions. We further introduce a multi-stage pipeline augmenting motion-based tracking with a physics-informed secondary association algorithm that integrates geometric consistency and appearance similarity for re-identification under occlusion and drift. To complement standard IoU metrics, we propose Center-Scale Consistency (CSC) and Geometric Alignment Score (GAS) to assess geometric fidelity. Experiments on four underwater benchmarks (WebUOT-1M, UOT32, UTB180, UWCOT220) show that MANTA achieves state-of-the-art performance, improving Success AUC by up to 6 percent, while ensuring stable long-term generalized underwater tracking and efficient runtime.

Title: Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities

Authors: Aayush Garg, Zanis Ali Khan, Renzo Degiovanni, Qiang Tang
Subjects: cs.CR, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2511.23408
Pdf URL: https://arxiv.org/pdf/2511.23408
Copy Paste: [[2511.23408]] Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities(https://arxiv.org/abs/2511.23408)
Keywords: security, large language model
Abstract: Automated vulnerability patching is crucial for software security, and recent advancements in Large Language Models (LLMs) present promising capabilities for automating this task. However, existing research has primarily assessed LLMs using publicly disclosed vulnerabilities, leaving their effectiveness on related artificial vulnerabilities largely unexplored. In this study, we empirically evaluate the patching effectiveness and complementarity of several prominent LLMs, such as OpenAI's GPT variants, LLaMA, DeepSeek, and Mistral models, using both real and artificial vulnerabilities. Our evaluation employs Proof-of-Vulnerability (PoV) test execution to concretely assess whether LLM-generated source code successfully patches vulnerabilities. Our results reveal that LLMs patch real vulnerabilities more effectively compared to artificial ones. Additionally, our analysis reveals significant variability across LLMs in terms of overlapping (multiple LLMs patching the same vulnerabilities) and complementarity (vulnerabilities patched exclusively by a single LLM), emphasizing the importance of selecting appropriate LLMs for effective vulnerability patching.

Title: Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

Authors: Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Qinglin Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23429
Pdf URL: https://arxiv.org/pdf/2511.23429
Copy Paste: [[2511.23429]] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model(https://arxiv.org/abs/2511.23429)
Keywords: generative
Abstract: Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video contents through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally defined the concept of interactive video data and developed an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts(MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as "open the door", "draw a torch", or "trigger an explosion".

Title: ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts

Authors: Hang Yu, Di Zhang, Qiwei Du, Yanping Zhao, Hai Zhang, Guang Chen, Eduardo E. Veas, Junqiao Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.23442
Pdf URL: https://arxiv.org/pdf/2511.23442
Copy Paste: [[2511.23442]] ASTRO: Adaptive Stitching via Dynamics-Guided Trajectory Rollouts(https://arxiv.org/abs/2511.23442)
Keywords: generative
Abstract: Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories present challenges for reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising solution, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, thereby limiting their effectiveness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify distinct and reachable stitch targets. We then employ a dynamics-guided stitch planner that adaptively generates connecting action sequences via Rollout Deviation Feedback, defined as the gap between target state sequence and the actual arrived state sequence by executing predicted actions, to improve trajectory stitching's feasibility and reachability. This approach facilitates effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across various algorithms, achieving notable performance gain on the challenging OGBench suite and demonstrating consistent improvements on standard offline RL benchmarks such as D4RL.

Title: Physics-Informed Neural Networks for Thermophysical Property Retrieval

Authors: Ali Waseem, Malcolm Mielle
Subjects: cs.LG, cs.AI, cs.CE, cs.CV
Abstract URL: https://arxiv.org/abs/2511.23449
Pdf URL: https://arxiv.org/pdf/2511.23449
Copy Paste: [[2511.23449]] Physics-Informed Neural Networks for Thermophysical Property Retrieval(https://arxiv.org/abs/2511.23449)
Keywords: diffusion
Abstract: Inverse heat problems refer to the estimation of material thermophysical properties given observed or known heat diffusion behaviour. Inverse heat problems have wide-ranging uses, but a critical application lies in quantifying how building facade renovation reduces thermal transmittance, a key determinant of building energy efficiency. However, solving inverse heat problems with non-invasive data collected in situ is error-prone due to environmental variability or deviations from theoretically assumed conditions. Hence, current methods for measuring thermal conductivity are either invasive, require lengthy observation periods, or are sensitive to environmental and experimental conditions. Here, we present a PINN-based iterative framework to estimate the thermal conductivity k of a wall from a set of thermographs; our framework alternates between estimating the forward heat problem with a PINN for a fixed k, and optimizing k by comparing the thermographs and surface temperatures predicted by the PINN, repeating until the estimated k's convergence. Using both environmental data captured by a weather station and data generated from Finite-Volume-Method software simulations, we accurately predict k across different environmental conditions and data collection sampling times, given the temperature profile of the wall at dawn is close to steady state. Although violating the steady-state assumption impacts the accuracy of k's estimation, we show that our proposed framework still only exhibits a maximum MAE of 4.0851. Our work demonstrates the potential of PINN-based methods for reliable estimation of material properties in situ and under realistic conditions, without lengthy measurement campaigns. Given the lack of research on using machine learning, and more specifically on PINNs, for solving in-situ inverse problems, we expect our work to be a starting point for more research on the topic.

Title: Object-Centric Data Synthesis for Category-level Object Detection

Authors: Vikhyat Agarwal, Jiayi Cora Guo, Declan Hoban, Sissi Zhang, Nicholas Moran, Peter Cho, Srilakshmi Pattabiraman, Shantanu Joshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23450
Pdf URL: https://arxiv.org/pdf/2511.23450
Copy Paste: [[2511.23450]] Object-Centric Data Synthesis for Category-level Object Detection(https://arxiv.org/abs/2511.23450)
Keywords: diffusion
Abstract: Deep learning approaches to object detection have achieved reliable detection of specific object classes in images. However, extending a model's detection capability to new object classes requires large amounts of annotated training data, which is costly and time-consuming to acquire, especially for long-tailed classes with insufficient representation in existing datasets. Here, we introduce the object-centric data setting, when limited data is available in the form of object-centric data (multi-view images or 3D models), and systematically evaluate the performance of four different data synthesis methods to finetune object detection models on novel object categories in this setting. The approaches are based on simple image processing techniques, 3D rendering, and image diffusion models, and use object-centric data to synthesize realistic, cluttered images with varying contextual coherence and complexity. We assess how these methods enable models to achieve category-level generalization in real-world data, and demonstrate significant performance boosts within this data-constrained experimental setting.

Title: SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments

Authors: Xinyi Li, Zaishuo Xia, Weyl Lu, Chenjie Hao, Yubei Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.23465
Pdf URL: https://arxiv.org/pdf/2511.23465
Copy Paste: [[2511.23465]] SmallWorlds: Assessing Dynamics Understanding of World Models in Isolated Environments(https://arxiv.org/abs/2511.23465)
Keywords: diffusion, transformer
Abstract: Current world models lack a unified and controlled setting for systematic evaluation, making it difficult to assess whether they truly capture the underlying rules that govern environment dynamics. In this work, we address this open challenge by introducing the SmallWorld Benchmark, a testbed designed to assess world model capability under isolated and precisely controlled dynamics without relying on handcrafted reward signals. Using this benchmark, we conduct comprehensive experiments in the fully observable state space on representative architectures including Recurrent State Space Model, Transformer, Diffusion model, and Neural ODE, examining their behavior across six distinct domains. The experimental results reveal how effectively these models capture environment structure and how their predictions deteriorate over extended rollouts, highlighting both the strengths and limitations of current modeling paradigms and offering insights into future improvement directions in representation learning and dynamics modeling.

Title: Visual Generation Tuning

Authors: Jiahao Guo, Sinan Du, Jingfeng Yao, Wenyu Liu, Bo Li, Haoxiang Cao, Kun Gai, Chun Yuan, Kai Wu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23469
Pdf URL: https://arxiv.org/pdf/2511.23469
Copy Paste: [[2511.23469]] Visual Generation Tuning(https://arxiv.org/abs/2511.23469)
Keywords: diffusion, transformer
Abstract: Large Vision Language Models (VLMs) effectively bridge the modality gap through extensive pretraining, acquiring sophisticated visual representations aligned with language. However, it remains underexplored whether these representations, optimized for multimodal understanding tasks, harbor an inherent potential for visual generation. In this paper, we propose VGT, Visual Generation Tuning, a novel paradigm designed to stimulate the underlying capabilities of visual generation within any vision language models. By performing efficient visual generation tuning on well-pretrained VLMs, we significantly mitigate the alignment costs and accelerate the convergence of autoregressive modeling in the continuous space (20x speedup). Specifically, we dismiss the entangled pixel-level VAEs designed for diffusion transformers and formulate VGT-AE through aligning the semantic encoders from pretrained VLMs with the latent representations of pixel decoders. In image reconstruction tasks, we achieve 26.67 PSNR and 0.50 rFID at a 28x compression ratio, outperforming specialized VAEs; in visual generation tasks, we achieve state-of-the-art outcomes among autoregressive models, 0.77 on GenEval and 78.73 on DPG-Bench. Furthermore, our proposed VGT showcases significant scaling promise and is versatile for endowing any VLMs trained for multimodal understanding with the capabilities of visual generation, which paves the new avenue to explore next-generation unified multimodal foundation models. Models and codes are available at this https URL.

Title: ThetaEvolve: Test-time Learning on Open Problems

Authors: Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, Eva Xu, Liliang Ren, Xinyu Yang, Zeyi Huang, Xuehai He, Luyao Ma, Baolin Peng, Hao Cheng, Pengcheng He, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2511.23473
Pdf URL: https://arxiv.org/pdf/2511.23473
Copy Paste: [[2511.23473]] ThetaEvolve: Test-time Learning on Open Problems(https://arxiv.org/abs/2511.23473)
Keywords: large language model
Abstract: Recent advances in large language models (LLMs) have enabled breakthroughs in mathematical discovery, exemplified by AlphaEvolve, a closed-source system that evolves programs to improve bounds on open problems. However, it relies on ensembles of frontier LLMs to achieve new bounds and is a pure inference system that models cannot internalize the evolving strategies. We introduce ThetaEvolve, an open-source framework that simplifies and extends AlphaEvolve to efficiently scale both in-context learning and Reinforcement Learning (RL) at test time, allowing models to continually learn from their experiences in improving open optimization problems. ThetaEvolve features a single LLM, a large program database for enhanced exploration, batch sampling for higher throughput, lazy penalties to discourage stagnant outputs, and optional reward shaping for stable training signals, etc. ThetaEvolve is the first evolving framework that enable a small open-source model, like DeepSeek-R1-0528-Qwen3-8B, to achieve new best-known bounds on open problems (circle packing and first auto-correlation inequality) mentioned in AlphaEvolve. Besides, across two models and four open tasks, we find that ThetaEvolve with RL at test-time consistently outperforms inference-only baselines, and the model indeed learns evolving capabilities, as the RL-trained checkpoints demonstrate faster progress and better final performance on both trained target task and other unseen tasks. We release our code publicly: this https URL

Title: AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Authors: Zhizhou Zhong, Yicheng Ji, Zhe Kong, Yiying Liu, Jiarui Wang, Jiasun Feng, Lupeng Liu, Xiangyi Wang, Yanjia Li, Yuqing She, Ying Qin, Huan Li, Shuiyang Mao, Wei Liu, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23475
Pdf URL: https://arxiv.org/pdf/2511.23475
Copy Paste: [[2511.23475]] AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement(https://arxiv.org/abs/2511.23475)
Keywords: diffusion, transformer, generative
Abstract: Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and refines interactivity with only a few real multi-person clips. Furthermore, we contribute a targeted metric and dataset designed to evaluate the naturalness and interactivity of the generated multi-person videos. Extensive experiments demonstrate that AnyTalker achieves remarkable lip synchronization, visual quality, and natural interactivity, striking a favorable balance between data costs and identity scalability.

Title: Video-CoM: Interactive Video Reasoning via Chain of Manipulations

Authors: Hanoona Rasheed, Mohammed Zumri, Muhammad Maaz, Ming-Hsuan Yang, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23477
Pdf URL: https://arxiv.org/pdf/2511.23477
Copy Paste: [[2511.23477]] Video-CoM: Interactive Video Reasoning via Chain of Manipulations(https://arxiv.org/abs/2511.23477)
Keywords: interpretability, large language model
Abstract: Recent multimodal large language models (MLLMs) have advanced video understanding, yet most still "think about videos" ie once a video is encoded, reasoning unfolds entirely in text, treating visual input as a static context. This passive paradigm creates a semantic bottleneck: models cannot rewatch, refocus, or verify evidence, leading to shallow visual reasoning on tasks requiring fine grained spatio temporal understanding. In this work, we introduce Interactive Video Reasoning, a new paradigm that transforms video into an active cognitive workspace, enabling models to "think with videos". Our model, Video CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. To support this behavior, we construct Video CoM Instruct, an 18K instruction tuning dataset curated for multi step manipulation reasoning. Beyond supervised learning, we further optimize the manipulation policy via reinforcement learning with reasoning aware Group Relative Policy Optimization (GRPO). Unlike prior work that relies solely on sparse answer rewards, our method introduces step level reasoning rewards, guiding the model toward grounded and consistent reasoning. Video CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state of the art models, while training on only 25K SFT and 3K GRPO video samples, significantly fewer than comparable large scale models. Ablation studies demonstrate that reasoning aware rewards improve both accuracy and interpretability. Code: this https URL

Title: Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

Authors: Muhammad Maaz, Hanoona Rasheed, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.23478
Pdf URL: https://arxiv.org/pdf/2511.23478
Copy Paste: [[2511.23478]] Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models(https://arxiv.org/abs/2511.23478)
Keywords: interpretability, large language model
Abstract: Reasoning over dynamic visual content remains a central challenge for multimodal large language models. Recent thinking models generate explicit reasoning traces for interpretability; however, their reasoning often appears convincing while being logically inconsistent or weakly grounded in visual evidence. We identify and formalize these issues through two diagnostic metrics: Think Answer Consistency (TAC), which measures the alignment between reasoning and answers, and Video Attention Score (VAS), which captures the extent to which reasoning depends on visual versus textual cues. Analysis across 11 video reasoning benchmarks shows that current models rely heavily on linguistic priors rather than visual content. To address this, we propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. Our approach combines timestamp aware supervised fine tuning with Group Relative Policy Optimization (GRPO) guided by a novel Temporal Alignment Reward (TAR). This dual step post training stage encourages temporally aligned and causally coherent video reasoning. The resulting model, Video R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks, demonstrating that improvements in temporal alignment and reasoning coherence lead to more accurate and trustworthy video understanding. Our code, dataset, and model will be open sourced.