2025-11-26

Title: PuzzlePoles: Cylindrical Fiducial Markers Based on the PuzzleBoard Pattern

Authors: Juri Zach, Peer Stelldinger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19448
Pdf URL: https://arxiv.org/pdf/2511.19448
Copy Paste: [[2511.19448]] PuzzlePoles: Cylindrical Fiducial Markers Based on the PuzzleBoard Pattern(https://arxiv.org/abs/2511.19448)
Keywords: robust
Abstract: Reliable perception of the environment is a key enabler for autonomous systems, where calibration and localization tasks often rely on robust visual markers. We introduce the PuzzlePole, a new type of fiducial markers derived from the recently proposed PuzzleBoard calibration pattern. The PuzzlePole is a cylindrical marker, enabling reliable recognition and pose estimation from 360° viewing direction. By leveraging the unique combinatorial structure of the PuzzleBoard pattern, PuzzlePoles provide a high accuracy in localization and orientation while being robust to occlusions. The design offers flexibility for deployment in diverse autonomous systems scenarios, ranging from robot navigation and SLAM to tangible interfaces.

Title: Personalized Reward Modeling for Text-to-Image Generation

Authors: Jeongeun Lee, Ryang Heo, Dongha Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19458
Pdf URL: https://arxiv.org/pdf/2511.19458
Copy Paste: [[2511.19458]] Personalized Reward Modeling for Text-to-Image Generation(https://arxiv.org/abs/2511.19458)
Keywords: robust, interpretability
Abstract: Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. Conventional evaluation methods, general reward functions or similarity-based metrics, fail to capture the diversity and complexity of personal visual tastes. In this work, we present PIGReward, a personalized reward model that dynamically generates user-conditioned evaluation dimensions and assesses images through CoT reasoning. To address the scarcity of user data, PIGReward adopt a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. Beyond evaluation, PIGReward provides personalized feedback that drives user-specific prompt optimization, improving alignment between generated images and individual intent. We further introduce PIGBench, a per-user preference benchmark capturing diverse visual interpretations of shared prompts. Extensive experiments demonstrate that PIGReward surpasses existing methods in both accuracy and interpretability, establishing a scalable and reasoning-based foundation for personalized T2I evaluation and optimization. Taken together, our findings highlight PIGReward as a robust steptoward individually aligned T2I generation.

Title: PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer

Authors: Ruogu Ding, Xin Ning, Ulf Schlichtmann, Weikang Qian
Subjects: cs.LG, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2511.19472
Pdf URL: https://arxiv.org/pdf/2511.19472
Copy Paste: [[2511.19472]] PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer(https://arxiv.org/abs/2511.19472)
Keywords: transformer, generative
Abstract: Prefix adders are widely used in compute-intensive applications for their high speed. However, designing optimized prefix adders is challenging due to strict design rules and an exponentially large design space. We introduce PrefixGPT, a generative pre-trained Transformer (GPT) that directly generates optimized prefix adders from scratch. Our approach represents an adder's topology as a two-dimensional coordinate sequence and applies a legality mask during generation, ensuring every design is valid by construction. PrefixGPT features a customized decoder-only Transformer architecture. The model is first pre-trained on a corpus of randomly synthesized valid prefix adders to learn design rules and then fine-tuned to navigate the design space for optimized design quality. Compared with existing works, PrefixGPT not only finds a new optimal design with a 7.7% improved area-delay product (ADP) but exhibits superior exploration quality, lowering the average ADP by up to 79.1%. This demonstrates the potential of GPT-style models to first master complex hardware design principles and then apply them for more efficient design optimization.

Title: WavefrontDiffusion: Dynamic Decoding Schedule or Improved Reasoning

Authors: Haojin Yang, Rui Hu, Zequn Sun, Rui Zhou, Yujun Cai, Yiwei Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19473
Pdf URL: https://arxiv.org/pdf/2511.19473
Copy Paste: [[2511.19473]] WavefrontDiffusion: Dynamic Decoding Schedule or Improved Reasoning(https://arxiv.org/abs/2511.19473)
Keywords: diffusion
Abstract: Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.

Title: Tracking and Segmenting Anything in Any Modality

Authors: Tianlu Zhang, Qiang Zhang, Guiguang Ding, Jungong Han
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2511.19475
Pdf URL: https://arxiv.org/pdf/2511.19475
Copy Paste: [[2511.19475]] Tracking and Segmenting Anything in Any Modality(https://arxiv.org/abs/2511.19475)
Keywords: segmentation
Abstract: Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.

Title: Exploiting the Experts: Unauthorized Compression in MoE-LLMs

Authors: Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Dheeraj Kulshrestha, Rajiv Ramnath
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19480
Pdf URL: https://arxiv.org/pdf/2511.19480
Copy Paste: [[2511.19480]] Exploiting the Experts: Unauthorized Compression in MoE-LLMs(https://arxiv.org/abs/2511.19480)
Keywords: secure, security, defense, large language model
Abstract: Mixture-of-Experts (MoE) architectures are increasingly adopted in large language models (LLMs) for their scalability and efficiency. However, their modular structure introduces a unique vulnerability: adversaries can attempt to compress or repurpose models by pruning experts and cheaply fine-tuning the remainder, effectively bypassing licensing and security constraints. In this paper, we systematically study the prunability of MoE-LLMs under task-specific usage. We first develop an expert attribution framework that identifies the subset of experts most responsible for a given task, then evaluate the performance trade-offs of pruning and re-aligning these experts using active learning-driven fine-tuning. Our findings reveal a critical knowledge loss--recovery trade-off: while certain experts can be isolated to retain task accuracy, significant degradation occurs without targeted re-alignment. Based on this analysis, we propose defense strategies that aim to make MoE models harder to compress and fine-tune without authorization, including entangled expert training and selective fine-tuning protocols that resist unauthorized adaptation. By positioning expert pruning as both a threat vector and a defense target, this work highlights the dual-use nature of MoE modularity and provides the first systematic evaluation framework for secure specialization of MoE-LLMs.

Title: Quality analysis and evaluation prediction of RAG retrieval based on machine learning algorithms

Authors: Ruoxin Zhang, Zhizhao Wen, Chao Wang, Chenchen Tang, Puyang Xu, Yifan Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19481
Pdf URL: https://arxiv.org/pdf/2511.19481
Copy Paste: [[2511.19481]] Quality analysis and evaluation prediction of RAG retrieval based on machine learning algorithms(https://arxiv.org/abs/2511.19481)
Keywords: large language model
Abstract: With the rapid evolution of large language models, retrieval enhanced generation technology has been widely used due to its ability to integrate external knowledge to improve output accuracy. However, the performance of the system is highly dependent on the quality of the retrieval module. If the retrieval results have low relevance to user needs or contain noisy information, it will directly lead to distortion of the generated content. In response to the performance bottleneck of existing models in processing tabular features, this paper proposes an XGBoost machine learning regression model based on feature engineering and particle swarm optimization. Correlation analysis shows that answer_quality is positively correlated with doc_delevance by 0.66, indicating that document relevance has a significant positive effect on answer quality, and improving document relevance may enhance answer quality; The strong negative correlations between semantic similarity, redundancy, and diversity were -0.89 and -0.88, respectively, indicating a trade- off between semantic similarity, redundancy, and diversity. In other words, as the former two increased, diversity significantly decreased. The experimental results comparing decision trees, AdaBoost, etc. show that the VMD PSO BiLSTM model is superior in all evaluation indicators, with significantly lower MSE, RMSE, MAE, and MAPE compared to the comparison model. The R2 value is higher, indicating that its prediction accuracy, stability, and data interpretation ability are more outstanding. This achievement provides an effective path for optimizing the retrieval quality and improving the generation effect of RAG system, and has important value in promoting the implementation and application of related technologies.

Title: OmniTFT: Omni Target Forecasting for Vital Signs and Laboratory Result Trajectories in Multi Center ICU Data

Authors: Wanzhe Xu, Yutong Dai, Yitao Yang, Martin Loza, Weihang Zhang, Yang Cui, Xin Zeng, Sung Joon Park, Kenta Nakai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19485
Pdf URL: https://arxiv.org/pdf/2511.19485
Copy Paste: [[2511.19485]] OmniTFT: Omni Target Forecasting for Vital Signs and Laboratory Result Trajectories in Multi Center ICU Data(https://arxiv.org/abs/2511.19485)
Keywords: robust, transformer
Abstract: Accurate multivariate time-series prediction of vital signs and laboratory results is crucial for early intervention and precision medicine in intensive care units (ICUs). However, vital signs are often noisy and exhibit rapid fluctuations, while laboratory tests suffer from missing values, measurement lags, and device-specific bias, making integrative forecasting highly challenging. To address these issues, we propose OmniTFT, a deep learning framework that jointly learns and forecasts high-frequency vital signs and sparsely sampled laboratory results based on the Temporal Fusion Transformer (TFT). Specifically, OmniTFT implements four novel strategies to enhance performance: sliding window equalized sampling to balance physiological states, frequency-aware embedding shrinkage to stabilize rare-class representations, hierarchical variable selection to guide model attention toward informative feature clusters, and influence-aligned attention calibration to enhance robustness during abrupt physiological changes. By reducing the reliance on target-specific architectures and extensive feature engineering, OmniTFT enables unified modeling of multiple heterogeneous clinical targets while preserving cross-institutional generalizability. Across forecasting tasks, OmniTFT achieves substantial performance improvement for both vital signs and laboratory results on the MIMIC-III, MIMIC-IV, and eICU datasets. Its attention patterns are interpretable and consistent with known pathophysiology, underscoring its potential utility for quantitative decision support in clinical care.

Title: Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification

Authors: Lei Wang, Zikun Ye, Jinglong Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19486
Pdf URL: https://arxiv.org/pdf/2511.19486
Copy Paste: [[2511.19486]] Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification(https://arxiv.org/abs/2511.19486)
Keywords: large language model
Abstract: Driven by recent advances in artificial intelligence (AI), a growing body of work demonstrates the potential of using large language models (LLMs) to generate human-like responses in market research and social science applications. Two primary approaches can be applied to improve the performance of LLMs: fine-tuning, which aligns LLM predictions more closely with human responses, and rectification, which corrects biases in LLM outputs. In this paper, we develop a framework that combines fine-tuning and rectification, and optimally allocates limited labeled samples across the two stages. Unlike the conventional objective that minimizes the mean squared prediction errors, we propose to minimize the variance of the prediction errors as the fine-tuning objective, which is optimal for the downstream rectification stage. Building on this insight, we leverage empirical scaling laws to develop a data-driven method for optimally splitting samples between the fine-tuning and rectification stages. Empirical analysis validates our framework, demonstrating improved estimation and inference performance compared to using either fine-tuning or rectification alone.

Title: Generative Model-Aided Continual Learning for CSI Feedback in FDD mMIMO-OFDM Systems

Authors: Guijun Liu, Yuwen Cao, Tomoaki Ohtsuki, Jiguang He, Shahid Mumtaz
Subjects: cs.LG, cs.AI, cs.IT
Abstract URL: https://arxiv.org/abs/2511.19490
Pdf URL: https://arxiv.org/pdf/2511.19490
Copy Paste: [[2511.19490]] Generative Model-Aided Continual Learning for CSI Feedback in FDD mMIMO-OFDM Systems(https://arxiv.org/abs/2511.19490)
Keywords: robust, generative
Abstract: Deep autoencoder (DAE) frameworks have demonstrated their effectiveness in reducing channel state information (CSI) feedback overhead in massive multiple-input multiple-output (mMIMO) orthogonal frequency division multiplexing (OFDM) systems. However, existing CSI feedback models struggle to adapt to dynamic environments caused by user mobility, requiring retraining when encountering new CSI distributions. Moreover, returning to previously encountered environments often leads to performance degradation due to catastrophic forgetting. Continual learning involves enabling models to incorporate new information while maintaining performance on previously learned tasks. To address these challenges, we propose a generative adversarial network (GAN)-based learning approach for CSI feedback. By using a GAN generator as a memory unit, our method preserves knowledge from past environments and ensures consistently high performance across diverse scenarios without forgetting. Simulation results show that the proposed approach enhances the generalization capability of the DAE framework while maintaining low memory overhead. Furthermore, it can be seamlessly integrated with other advanced CSI feedback models, highlighting its robustness and adaptability.

Title: A Systematic Study of Compression Ordering for Large Language Models

Authors: Shivansh Chhawri, Rahul Mahadik, Suparna Rooj
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19495
Pdf URL: https://arxiv.org/pdf/2511.19495
Copy Paste: [[2511.19495]] A Systematic Study of Compression Ordering for Large Language Models(https://arxiv.org/abs/2511.19495)
Keywords: large language model
Abstract: Large Language Models (LLMs) require substantial computational resources, making model compression essential for efficient deployment in constrained environments. Among the dominant compression techniques: knowledge distillation, structured pruning, and low-bit quantization, their individual effects are well studied, but their interactions and optimal sequencing remain unclear. This work systematically examines how these techniques perform both independently and in combination when applied to the Qwen2.5 3B model. We evaluate multiple compression pipelines, including single, and proposed three-technique sequences, using perplexity, G-Eval, clarity, prompt alignment, and compression ratio as metrics. Our experiments show that quantization provides the greatest standalone compression, while pruning introduces moderate quality degradation. Critically, the ordering of techniques significantly affects the final model quality: the sequence Pruning, Knowledge Distillation, Quantization (P-KD-Q) yields the best balance, achieving a 3.68x compression ratio while preserving strong instruction-following and language understanding capabilities. Conversely, pipelines applying quantization early suffer severe performance degradation due to irreversible information loss that impairs subsequent training. Overall, this study offers practical insight into designing effective, ordering-aware compression pipelines for deploying LLMs in resource-limited settings.

Title: Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM

Authors: Yang Liu, Xiaolong Zhong, Ling Jiang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19496
Pdf URL: https://arxiv.org/pdf/2511.19496
Copy Paste: [[2511.19496]] Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM(https://arxiv.org/abs/2511.19496)
Keywords: large language model
Abstract: Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present \textbf{Xmodel-2.5}, a 1.3-billion-parameter small language model designed as a \emph{drop-in agent core}. Training with maximal-update parameterization ($\mu$P) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under the parameter-tied \emph{tie-word-embedding} architecture. A 1.4T-token Warmup--Stable--Decay curriculum is used, and we further show that \textbf{switching from AdamW to Muon during the decay phase} improves the 13-task reasoning average by 4.58\,\% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8-mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license.\footnote{this https URL and this https URL (training checkpoints).} Training code and evaluation harness: this https URL.

Title: PeriodNet: Boosting the Potential of Attention Mechanism for Time Series Forecasting

Authors: Bowen Zhao, Huanlai Xing, Zhiwen Xiao, Jincheng Peng, Li Feng, Xinhan Wang, Rong Qu, Hui Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19497
Pdf URL: https://arxiv.org/pdf/2511.19497
Copy Paste: [[2511.19497]] PeriodNet: Boosting the Potential of Attention Mechanism for Time Series Forecasting(https://arxiv.org/abs/2511.19497)
Keywords: transformer, generative
Abstract: The attention mechanism has demonstrated remarkable potential in sequence modeling, exemplified by its successful application in natural language processing with models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT). Despite these advancements, its utilization in time series forecasting (TSF) has yet to meet expectations. Exploring a better network structure for attention in TSF holds immense significance across various domains. In this paper, we present PeriodNet with a brand new structure to forecast univariate and multivariate time series. PeriodNet incorporates period attention and sparse period attention mechanism for analyzing adjacent periods. It enhances the mining of local characteristics, periodic patterns, and global dependencies. For efficient cross-variable modeling, we introduce an iterative grouping mechanism which can directly reduce the cross-variable redundancy. To fully leverage the extracted features on the encoder side, we redesign the entire architecture of the vanilla Transformer and propose a period diffuser for precise multi-period prediction. Through comprehensive experiments conducted on eight datasets, we demonstrate that PeriodNet outperforms six state-of-the-art models in both univariate and multivariate TSF scenarios in terms of mean square error and mean absolute error. In particular, PeriodNet achieves a relative improvement of 22% when forecasting time series with a length of 720, in comparison to other models based on the conventional encoder-decoder Transformer architecture.

Title: Hierarchical Dual-Strategy Unlearning for Biomedical and Healthcare Intelligence Using Imperfect and Privacy-Sensitive Medical Data

Authors: Yi Zhang, Tianxiang Xu, Zijian Li, Chao Zhang, Kunyu Zhang, Zhan Gao, Meinuo Li, Xiaohan Zhang, Qichao Qi, Bing Chen
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2511.19498
Pdf URL: https://arxiv.org/pdf/2511.19498
Copy Paste: [[2511.19498]] Hierarchical Dual-Strategy Unlearning for Biomedical and Healthcare Intelligence Using Imperfect and Privacy-Sensitive Medical Data(https://arxiv.org/abs/2511.19498)
Keywords: privacy, robust, large language model
Abstract: Large language models (LLMs) exhibit exceptional performance but pose substantial privacy risks due to training data memorization, particularly within healthcare contexts involving imperfect or privacy-sensitive patient information. We present a hierarchical dual-strategy framework for selective knowledge unlearning that precisely removes specialized knowledge while preserving fundamental medical competencies. Our approach synergistically integrates geometric-constrained gradient updates to selectively modulate target parameters with concept-aware token-level interventions that distinguish between preservation-critical and unlearning-targeted tokens via a unified four-level medical concept hierarchy. Comprehensive evaluations on the MedMCQA (surgical) and MHQA (anxiety, depression, trauma) datasets demonstrate superior performance, achieving an 82.7% forgetting rate and 88.5% knowledge preservation. Notably, our framework maintains robust privacy guarantees while requiring modification of only 0.1% of parameters, addressing critical needs for regulatory compliance, auditability, and ethical standards in clinical research.

Title: Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2511.19499
Pdf URL: https://arxiv.org/pdf/2511.19499
Copy Paste: [[2511.19499]] Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection(https://arxiv.org/abs/2511.19499)
Keywords: diffusion, generative
Abstract: The rapid advancement of generators (e.g., StyleGAN, Midjourney, DALL-E) has produced highly realistic synthetic images, posing significant challenges to digital media authenticity. These generators are typically based on a few core architectural families, primarily Generative Adversarial Networks (GANs) and Diffusion Models (DMs). A critical vulnerability in current forensics is the failure of detectors to achieve cross-generator generalization, especially when crossing architectural boundaries (e.g., from GANs to DMs). We hypothesize that this gap stems from fundamental differences in the artifacts produced by these \textbf{distinct architectures}. In this work, we provide a theoretical analysis explaining how the distinct optimization objectives of the GAN and DM architectures lead to different manifold coverage behaviors. We demonstrate that GANs permit partial coverage, often leading to boundary artifacts, while DMs enforce complete coverage, resulting in over-smoothing patterns. Motivated by this analysis, we propose the \textbf{Tri}archy \textbf{Detect}or (TriDetect), a semi-supervised approach that enhances binary classification by discovering latent architectural patterns within the "fake" class. TriDetect employs balanced cluster assignment via the Sinkhorn-Knopp algorithm and a cross-view consistency mechanism, encouraging the model to learn fundamental architectural distincts. We evaluate our approach on two standard benchmarks and three in-the-wild datasets against 13 baselines to demonstrate its generalization capability to unseen generators.

Title: Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma

Authors: Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19504
Pdf URL: https://arxiv.org/pdf/2511.19504
Copy Paste: [[2511.19504]] Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma(https://arxiv.org/abs/2511.19504)
Keywords: robust, fair, large language model
Abstract: Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.

Title: TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception

Authors: Kailin Lyu, Long Xiao, Jianing Zeng, Junhao Dong, Xuexin Liu, Zhuojun Zou, Haoyue Yang, Lin Shu, Jie Hao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19509
Pdf URL: https://arxiv.org/pdf/2511.19509
Copy Paste: [[2511.19509]] TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception(https://arxiv.org/abs/2511.19509)
Keywords: robust, transformer
Abstract: Traditional vision-based material perception methods often experience substantial performance degradation under visually impaired conditions, thereby motivating the shift toward non-visual multimodal material perception. Despite this, existing approaches frequently perform naive fusion of multimodal inputs, overlooking key challenges such as modality-specific noise, missing modalities common in real-world scenarios, and the dynamically varying importance of each modality depending on the task. These limitations lead to suboptimal performance across several benchmark tasks. In this paper, we propose a robust multimodal fusion framework, TouchFormer. Specifically, we employ a Modality-Adaptive Gating (MAG) mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features, enhancing model robustness. Additionally, we introduce a Cross-Instance Embedding Regularization(CER) strategy, which significantly improves classification accuracy in fine-grained subcategory material recognition tasks. Experimental results demonstrate that, compared to existing non-visual methods, the proposed TouchFormer framework achieves classification accuracy improvements of 2.48% and 6.83% on SSMC and USMC tasks, respectively. Furthermore, real-world robotic experiments validate TouchFormer's effectiveness in enabling robots to better perceive and interpret their environment, paving the way for its deployment in safety-critical applications such as emergency response and industrial automation. The code and datasets will be open-source, and the videos are available in the supplementary materials.

Title: Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Authors: Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19516
Pdf URL: https://arxiv.org/pdf/2511.19516
Copy Paste: [[2511.19516]] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning(https://arxiv.org/abs/2511.19516)
Keywords: interpretability, large language model
Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90 %, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

Title: Automating Deception: Scalable Multi-Turn LLM Jailbreaks

Authors: Adarsh Kumarappan, Ananya Mujoo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19517
Pdf URL: https://arxiv.org/pdf/2511.19517
Copy Paste: [[2511.19517]] Automating Deception: Scalable Multi-Turn LLM Jailbreaks(https://arxiv.org/abs/2511.19517)
Keywords: defense, attack, robust, large language model
Abstract: Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass safety alignments, pose a persistent threat to Large Language Models (LLMs). Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google's Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic's Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.

Title: Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation

Authors: Mathis Wolter, Julie Stephany Berrio Perez, Mao Shan
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2511.19519
Pdf URL: https://arxiv.org/pdf/2511.19519
Copy Paste: [[2511.19519]] Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation(https://arxiv.org/abs/2511.19519)
Keywords: robust, biometric
Abstract: Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible metric of eye openness derived from 3D facial landmarks. Unlike conventional binary eye state estimators or 2D measures, such as the Eye Aspect Ratio (EAR), the ELA provides a stable geometric description of eyelid motion that is robust to variations in camera angle. Using the ELA, we design a blink detection framework that extracts temporal characteristics, including the closing, closed, and reopening durations, which are shown to correlate with drowsiness levels. To address the scarcity and risk of collecting natural drowsiness data, we further leverage ELA signals to animate rigged avatars in Blender 3D, enabling the creation of realistic synthetic datasets with controllable noise, camera viewpoints, and blink dynamics. Experimental results in public driver monitoring datasets demonstrate that the ELA offers lower variance under viewpoint changes compared to EAR and achieves accurate blink detection. At the same time, synthetic augmentation expands the diversity of training data for drowsiness recognition. Our findings highlight the ELA as both a reliable biometric measure and a powerful tool for generating scalable datasets in driver state monitoring.

Title: EAGER: Edge-Aligned LLM Defense for Robust, Efficient, and Accurate Cybersecurity Question Answering

Authors: Onat Gungor, Roshan Sood, Jiasheng Zhou, Tajana Rosing
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.19523
Pdf URL: https://arxiv.org/pdf/2511.19523
Copy Paste: [[2511.19523]] EAGER: Edge-Aligned LLM Defense for Robust, Efficient, and Accurate Cybersecurity Question Answering(https://arxiv.org/abs/2511.19523)
Keywords: security, defense, attack, robust, large language model
Abstract: Large Language Models (LLMs) are highly effective for cybersecurity question answering (QA) but are difficult to deploy on edge devices due to their size. Quantization reduces memory and compute requirements but often degrades accuracy and increases vulnerability to adversarial attacks. We present EAGER, an edge-aligned defense framework that integrates parameter-efficient quantization with domain-specific preference alignment to jointly optimize efficiency, robustness, and accuracy. Unlike prior methods that address these aspects separately, EAGER leverages Quantized Low-Rank Adaptation (QLoRA) for low-cost fine-tuning and Direct Preference Optimization (DPO) on a self-constructed cybersecurity preference dataset, eliminating the need for human labels. Experiments show that EAGER reduces adversarial attack success rates by up to 7.3x and improves QA accuracy by up to 55% over state-of-the-art defenses, while achieving the lowest response latency on a Jetson Orin, demonstrating its practical edge deployment.

Title: VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

Authors: Boyu Chen, Zikang Wang, Zhengrong Yue, Kainan Yan, Chenyun Yu, Yi Huang, Zijun Liu, Yafei Wen, Xiaoxin Chen, Yang Liu, Peng Li, Yali Wang
Subjects: cs.CV, cs.MA
Abstract URL: https://arxiv.org/abs/2511.19524
Pdf URL: https://arxiv.org/pdf/2511.19524
Copy Paste: [[2511.19524]] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning(https://arxiv.org/abs/2511.19524)
Keywords: robust, large language model
Abstract: By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user's query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user's query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1's performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.

Title: Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space

Authors: Shivam Pal, Sakshi Varshney, Piyush Rai
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19525
Pdf URL: https://arxiv.org/pdf/2511.19525
Copy Paste: [[2511.19525]] Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space(https://arxiv.org/abs/2511.19525)
Keywords: robust
Abstract: Deep neural networks are prone to learning shortcuts, spurious and easily learned correlations in training data that cause severe failures in out-of-distribution (OOD) generalization. A dominant line of work seeks robustness by learning a robust representation, often explicitly partitioning the latent space into core and spurious components; this approach can be complex, brittle, and difficult to scale. We take a different approach, instead of a robust representation, we learn a robust function. We present a simple and effective training method that renders the classifier functionally invariant to shortcut signals. Our method operates within a disentangled latent space, which is essential as it isolates spurious and core features into distinct dimensions. This separation enables the identification of candidate shortcut features by their strong correlation with the label, used as a proxy for semantic simplicity. The classifier is then desensitized to these features by injecting targeted, anisotropic latent noise during training. We analyze this as targeted Jacobian regularization, which forces the classifier to ignore spurious features and rely on more complex, core semantic signals. The result is state-of-the-art OOD performance on established shortcut learning benchmarks.

Title: AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

Authors: Yixin Wu, Rui Wen, Chi Cui, Michael Backes, Yang Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19536
Pdf URL: https://arxiv.org/pdf/2511.19536
Copy Paste: [[2511.19536]] AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents(https://arxiv.org/abs/2511.19536)
Keywords: attack, large language model
Abstract: Inference attacks have been widely studied and offer a systematic risk assessment of ML services; however, their implementation and the attack parameters for optimal estimation are challenging for non-experts. The emergence of advanced large language models presents a promising yet largely unexplored opportunity to develop autonomous agents as inference attack experts, helping address this challenge. In this paper, we propose AttackPilot, an autonomous agent capable of independently conducting inference attacks without human intervention. We evaluate it on 20 target services. The evaluation shows that our agent, using GPT-4o, achieves a 100.0% task completion rate and near-expert attack performance, with an average token cost of only $0.627 per run. The agent can also be powered by many other representative LLMs and can adaptively optimize its strategy under service constraints. We further perform trace analysis, demonstrating that design choices, such as a multi-agent framework and task-specific action spaces, effectively mitigate errors such as bad plans, inability to follow instructions, task context loss, and hallucinations. We anticipate that such agents could empower non-expert ML service providers, auditors, or regulators to systematically assess the risks of ML services without requiring deep domain expertise.

Title: Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment

Authors: Muhao Guo, Yang Weng
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2511.19537
Pdf URL: https://arxiv.org/pdf/2511.19537
Copy Paste: [[2511.19537]] Cross-Domain Generalization of Multimodal LLMs for Global Photovoltaic Assessment(https://arxiv.org/abs/2511.19537)
Keywords: robust, transformer, large language model
Abstract: The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the $\Delta$F1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.

Title: Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration

Authors: Remi Petitpierre
Subjects: cs.CV, cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2511.19538
Pdf URL: https://arxiv.org/pdf/2511.19538
Copy Paste: [[2511.19538]] Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration(https://arxiv.org/abs/2511.19538)
Keywords: extraction, diffusion, segmentation
Abstract: This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.

Title: Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

Authors: Ehsan Karimi, Nhut Le, Maryam Rahnemoonfar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19557
Pdf URL: https://arxiv.org/pdf/2511.19557
Copy Paste: [[2511.19557]] Think First, Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment(https://arxiv.org/abs/2511.19557)
Keywords: interpretability, generative
Abstract: Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles, providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improve the model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.

Title: SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

Authors: Mohammed Talha Alam, Nada Saadi, Fahad Shamshad, Nils Lukas, Karthik Nandakumar, Fahkri Karray, Samuele Poppi
Subjects: cs.CR, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19558
Pdf URL: https://arxiv.org/pdf/2511.19558
Copy Paste: [[2511.19558]] SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models(https://arxiv.org/abs/2511.19558)
Keywords: robust, diffusion
Abstract: Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. As true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness). SPQR is a single-scored metric that provides a standardized and reproducible framework to evaluate how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, by reporting a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for T2I safety alignment techniques for T2I models.

Title: ModHiFi: Identifying High Fidelity predictive components for Model Modification

Authors: Dhruva Kashyap, Chaitanya Murti, Pranav K Nayak, Tanay Narshana, Chiranjib Bhattacharyya
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19566
Pdf URL: https://arxiv.org/pdf/2511.19566
Copy Paste: [[2511.19566]] ModHiFi: Identifying High Fidelity predictive components for Model Modification(https://arxiv.org/abs/2511.19566)
Keywords: transformer
Abstract: Open weight models, which are ubiquitous, rarely provide access to their training data or loss function. This makes modifying such models for tasks such as pruning or unlearning constrained by this unavailability an active area of research. Existing techniques typically require gradients or ground-truth labels, rendering them infeasible in settings with limited computational resources. In this work, we investigate the fundamental question of identifying components that are critical to the model's predictive performance, without access to either gradients or the loss function, and with only distributional access such as synthetic data. We theoretically demonstrate that the global reconstruction error is linearly bounded by local reconstruction errors for Lipschitz-continuous networks such as CNNs and well-trained Transformers (which, contrary to existing literature, we find exhibit Lipschitz continuity). This motivates using the locally reconstructive behavior of component subsets to quantify their global importance, via a metric that we term Subset Fidelity. In the uncorrelated features setting, selecting individual components via their Subset Fidelity scores is optimal, which we use to propose ModHiFi, an algorithm for model modification that requires no training data or loss function access. ModHiFi-P, for structured pruning, achieves an 11% speedup over the current state of the art on ImageNet models and competitive performance on language models. ModHiFi-U, for classwise unlearning, achieves complete unlearning on CIFAR-10 without fine-tuning and demonstrates competitive performance on Swin Transformers.

Title: An Invariant Latent Space Perspective on Language Model Inversion

Authors: Wentao Ye, Jiaqi Hu, Haobo Wang, Xinpeng Ti, Zhiqing Xiao, Hao Chen, Liyao Li, Lei Feng, Sai Wu, Junbo Zhao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19569
Pdf URL: https://arxiv.org/pdf/2511.19569
Copy Paste: [[2511.19569]] An Invariant Latent Space Perspective on Language Model Inversion(https://arxiv.org/abs/2511.19569)
Keywords: security, privacy, protect, defense
Abstract: Language model inversion (LMI), i.e., recovering hidden prompts from outputs, emerges as a concrete threat to user privacy and system security. We recast LMI as reusing the LLM's own latent space and propose the Invariant Latent Space Hypothesis (ILSH): (1) diverse outputs from the same source prompt should preserve consistent semantics (source invariance), and (2) input<->output cyclic mappings should be self-consistent within a shared latent space (cyclic invariance). Accordingly, we present Inv^2A, which treats the LLM as an invariant decoder and learns only a lightweight inverse encoder that maps outputs to a denoised pseudo-representation. When multiple outputs are available, they are sparsely concatenated at the representation layer to increase information density. Training proceeds in two stages: contrastive alignment (source invariance) and supervised reinforcement (cyclic invariance). An optional training-free neighborhood search can refine local performance. Across 9 datasets covering user and system prompt scenarios, Inv^2A outperforms baselines by an average of 4.77% BLEU score while reducing dependence on large inverse corpora. Our analysis further shows that prevalent defenses provide limited protection, underscoring the need for stronger strategies. The source code and data involved in this paper can be found in this https URL.

Title: HunyuanOCR Technical Report

Authors: Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19575
Pdf URL: https://arxiv.org/pdf/2511.19575
Copy Paste: [[2511.19575]] HunyuanOCR Technical Report(https://arxiv.org/abs/2511.19575)
Keywords: transformer
Abstract: This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow "OCR expert models" and inefficient "General VLMs". 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.

Title: Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach

Authors: Maria Thoma, Michalis A. Savelonas, Dimitris K. Iakovidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19576
Pdf URL: https://arxiv.org/pdf/2511.19576
Copy Paste: [[2511.19576]] Leveraging Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach(https://arxiv.org/abs/2511.19576)
Keywords: generative, segmentation
Abstract: Ischemic stroke is a time-critical medical emergency where rapid diagnosis is essential for improving patient outcomes. Non-contrast computed tomography (NCCT) serves as the frontline imaging tool, yet it often fails to reveal the subtle ischemic changes present in the early, hyperacute phase. This limitation can delay crucial interventions. To address this diagnostic challenge, we introduce a semi-supervised segmentation method using generative adversarial networks (GANs) to accurately delineate early ischemic stroke regions. The proposed method employs an adversarial framework to effectively learn from a limited number of annotated NCCT scans, while simultaneously leveraging a larger pool of unlabeled scans. By employing Dice loss, cross-entropy loss, a feature matching loss and a self-training loss, the model learns to identify and delineate early infarcts, even when they are faint or their size is small. Experiments on the publicly available Acute Ischemic Stroke Dataset (AISD) demonstrate the potential of the proposed method to enhance diagnostic capabilities, reduce the burden of manual annotation, and support more efficient clinical decision-making in stroke care.

Title: Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis

Authors: Dimitrios E. Diamantis, Dimitris K. Iakovidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19578
Pdf URL: https://arxiv.org/pdf/2511.19578
Copy Paste: [[2511.19578]] Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis(https://arxiv.org/abs/2511.19578)
Keywords: privacy, generative
Abstract: Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems can assist screening, yet their performance relies on the existence of large, diverse, training medical datasets. However, the scarcity of such data, due to privacy constraints and annotation costs, hinders CDS development. Generative machine learning offers a viable solution to combat this limitation. While current Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks and Variational Autoencoders have been explored, they often face challenges with training stability and capturing sufficient visual diversity, especially when synthesizing abnormal findings. This work introduces a novel VAE-based methodology for medical image synthesis and presents its application for the generation of WCE images. The novel contributions of this work include a) multiscale extension of the Vector Quantized VAE model, named as Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE); b) unlike other VAE-based SDG models for WCE image generation, MSVQ-VAE is used to seamlessly introduce abnormalities into normal WCE images; c) it enables conditional generation of synthetic images, enabling the introduction of different types of abnormalities into the normal WCE images; d) it performs experiments with a variety of abnormality types, including polyps, vascular and inflammatory conditions. The utility of the generated images for CDS is assessed via image classification. Comparative experiments demonstrate that training a CDS classifier using the abnormal images generated by the proposed methodology yield comparable results with a classifier trained with only real data. The generality of the proposed methodology promises its applicability to various domains related to medical multimedia.

Title: IRSDA: An Agent-Orchestrated Framework for Enterprise Intrusion Response

Authors: Damodar Panigrahi, Raj Patel, Shaswata Mitra, Sudip Mittal, Shahram Rahimi
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19644
Pdf URL: https://arxiv.org/pdf/2511.19644
Copy Paste: [[2511.19644]] IRSDA: An Agent-Orchestrated Framework for Enterprise Intrusion Response(https://arxiv.org/abs/2511.19644)
Keywords: security, defense, explainability
Abstract: Modern enterprise systems face escalating cyber threats that are increasingly dynamic, distributed, and multi-stage in nature. Traditional intrusion detection and response systems often rely on static rules and manual workflows, which limit their ability to respond with the speed and precision required in high-stakes environments. To address these challenges, we present the Intrusion Response System Digital Assistant (IRSDA), an agent-based framework designed to deliver autonomous and policy-compliant cyber defense. IRSDA combines Self-Adaptive Autonomic Computing Systems (SA-ACS) with the Knowledge guided Monitor, Analyze, Plan, and Execute (MAPE-K) loop to support real-time, partition-aware decision-making across enterprise infrastructure. IRSDA incorporates a knowledge-driven architecture that integrates contextual information with AI-based reasoning to support system-guided intrusion response. The framework leverages retrieval mechanisms and structured representations to inform decision-making while maintaining alignment with operational policies. We assess the system using a representative real-world microservices application, demonstrating its ability to automate containment, enforce compliance, and provide traceable outputs for security analyst interpretation. This work outlines a modular and agent-driven approach to cyber defense that emphasizes explainability, system-state awareness, and operational control in intrusion response.

Title: Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search

Authors: Manil Shrestha, Edward Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19648
Pdf URL: https://arxiv.org/pdf/2511.19648
Copy Paste: [[2511.19648]] Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search(https://arxiv.org/abs/2511.19648)
Keywords: large language model
Abstract: Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 > 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.

Title: Synthetic Data: AI's New Weapon Against Android Malware

Authors: Angelo Gaspar Diniz Nogueira, Kayua Oleques Paim, Hendrio Bragança, Rodrigo Brandão Mansilha, Diego Kreutz
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19649
Pdf URL: https://arxiv.org/pdf/2511.19649
Copy Paste: [[2511.19649]] Synthetic Data: AI's New Weapon Against Android Malware(https://arxiv.org/abs/2511.19649)
Keywords: attack, robust, generative
Abstract: The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples presents significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low quality data in malware detection.

Title: Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning

Authors: Stephen C. Gravereaux, Sheikh Rabiul Islam
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19654
Pdf URL: https://arxiv.org/pdf/2511.19654
Copy Paste: [[2511.19654]] Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning(https://arxiv.org/abs/2511.19654)
Keywords: interpretability, large language model
Abstract: This study examines whether Low-Rank Adaptation (LoRA) fine-tuned Large Language Models (LLMs) can approximate the performance of fully fine-tuned models in generating human-interpretable decisions and explanations for malware classification. Achieving trustworthy malware detection, particularly when LLMs are involved, remains a significant challenge. We developed an evaluation framework using Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and Semantic Similarity Metrics to benchmark explanation quality across five LoRA configurations and a fully fine-tuned baseline. Results indicate that full fine-tuning achieves the highest overall scores, with BLEU and ROUGE improvements of up to 10% over LoRA variants. However, mid-range LoRA models deliver competitive performance exceeding full fine-tuning on two metrics while reducing model size by approximately 81% and training time by over 80% on a LoRA model with 15.5% trainable parameters. These findings demonstrate that LoRA offers a practical balance of interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality. By providing feature-driven natural language explanations for malware classifications, this approach enhances transparency, analyst confidence, and operational scalability in malware detection systems.

Title: Structured Noise Modeling for Enhanced Time-Series Forecasting

Authors: Sepideh Koohfar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19657
Pdf URL: https://arxiv.org/pdf/2511.19657
Copy Paste: [[2511.19657]] Structured Noise Modeling for Enhanced Time-Series Forecasting(https://arxiv.org/abs/2511.19657)
Keywords: interpretability
Abstract: Time-series forecasting remains difficult in real-world settings because temporal patterns operate at multiple scales, from broad contextual trends to fast, fine-grained fluctuations that drive critical decisions. Existing neural models often struggle to represent these interacting dynamics, leading to unstable predictions and reduced reliability in downstream applications. This work introduces a forecast-blur-denoise framework that improves temporal fidelity through structured noise modeling. The approach incorporates a learnable Gaussian Process module that generates smooth, correlated perturbations, encouraging the forecasting backbone to capture long-range structure while a dedicated refinement model restores high-resolution temporal detail. Training the components jointly enables natural competence division and avoids the artifacts commonly produced by isotropic corruption methods. Experiments across electricity, traffic, and solar datasets show consistent gains in multi-horizon accuracy and stability. The modular design also allows the blur-denoise layer to operate as a lightweight enhancement for pretrained models, supporting efficient adaptation in limited-data scenarios. By strengthening the reliability and interpretability of fine-scale temporal predictions, this framework contributes to more trustworthy AI systems used in forecasting-driven decision support across energy, infrastructure, and other time-critical domains.

Title: Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds

Authors: Jiaxin Shi, Michalis K. Titsias
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19664
Pdf URL: https://arxiv.org/pdf/2511.19664
Copy Paste: [[2511.19664]] Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds(https://arxiv.org/abs/2511.19664)
Keywords: diffusion, generative
Abstract: We derive a new theoretical interpretation of the reweighted losses that are widely used for training diffusion models. Our method is based on constructing a cascade of time-dependent variational lower bounds on the data log-likelihood, that provably improves upon the standard evidence lower bound and results in reduced data-model KL-divergences. Combining such bounds gives rise to reweighted objectives that can be applied to any generative diffusion model including both continuous Gaussian diffusion and masked (discrete) diffusion models. Then, we showcase this framework in masked diffusion and report significant improvements over previous training losses in pixel-space image modeling, approaching sample quality comparable to continuous diffusion models. Our results also provide a theoretical justification for the simple weighting scheme widely used in masked image models.

Title: OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis

Authors: Istiak Ahmed, Galib Ahmed, K. Shahriar Sanjid, Md. Tanzim Hossain, Md. Nishan Khan, Md. Misbah Khan, Md. Arifur Rahman, Sheikh Anisul Haque, Sharmin Akhtar Rupa, Mohammed Mejbahuddin Mia, Mahmud Hasan Mostofa Kamal, Md. Mostafa Kamal Sarker, M. Monir Uddin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19667
Pdf URL: https://arxiv.org/pdf/2511.19667
Copy Paste: [[2511.19667]] OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis(https://arxiv.org/abs/2511.19667)
Keywords: secure, robust, segmentation
Abstract: OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for better breast cancer diagnosis. Employing an attention-based encoder-decoder backbone, it jointly segments four ROIs - masses, calcifications, axillary findings, and breast tissues - with state-of-the-art accuracy and robustly predicts ten structured clinical features: mass morphology, calcification type, ACR breast density, and BI-RADS categories. To fuse imaging and clinical insights, we developed two late-fusion strategies. By utilizing complementary multimodal data, late fusion strategies improve diagnostic precision and reduce inter-observer variability. Operationalized as a secure, user-friendly web application, OncoVision produces structured reports with dual-confidence scoring and attention-weighted visualizations for real-time diagnostic support to improve clinician trust and facilitate medical teaching. It can be easily incorporated into the clinic, making screening available in underprivileged areas around the world, such as rural South Asia. Combining accurate segmentation with clinical intuition, OncoVision raises the bar for AI-based mammography, offering a scalable and equitable solution to detect breast cancer at an earlier stage and enhancing treatment through timely interventions.

Title: BASICS: Binary Analysis and Stack Integrity Checker System for Buffer Overflow Mitigation

Authors: Luis Ferreirinha, Iberia Medeiros
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.19670
Pdf URL: https://arxiv.org/pdf/2511.19670
Copy Paste: [[2511.19670]] BASICS: Binary Analysis and Stack Integrity Checker System for Buffer Overflow Mitigation(https://arxiv.org/abs/2511.19670)
Keywords: security
Abstract: Cyber-Physical Systems have played an essential role in our daily lives, providing critical services such as power and water, whose operability, availability, and reliability must be ensured. The C programming language, prevalent in CPS development, is crucial for system control where reliability is critical. However, it is also commonly susceptible to vulnerabilities, particularly buffer overflows. Traditional vulnerability discovery techniques often struggle with scalability and precision when applied directly to the binary code of C programs, which can thereby keep programs vulnerable. This work introduces a novel approach designed to overcome these limitations by leveraging model checking and concolic execution techniques to automatically verify security properties of a program's stack memory in binary code, trampoline techniques to perform automated repair of the issues, and crash-inducing inputs to verify if they were successfully removed. The approach constructs a Memory State Space - MemStaCe- from the binary program's control flow graph and simulations, provided by concolic execution, of C function calls and loop constructs. The security properties, defined in LTL, model the correct behaviour of functions associated with vulnerabilities and allow the approach to identify vulnerabilities in MemStaCe by analysing counterexample traces that are generated when a security property is violated. These vulnerabilities are then addressed with a trampoline-based binary patching method, and the effectiveness of the patches is checked with crash-inducing inputs extracted during concolic execution. We implemented the approach in the BASICS tool for BO mitigation and evaluated using the Juliet C/C++ and SARD datasets and real applications, achieving an accuracy and precision above 87%, both in detection and correction. Also, we compared it with CWE Checker, outperforming it.

Title: CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation

Authors: Abdurahman Ali Mohammed, Wallapak Tavanapong, Catherine Fonder, Donald S. Sakaguchi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19686
Pdf URL: https://arxiv.org/pdf/2511.19686
Copy Paste: [[2511.19686]] CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation(https://arxiv.org/abs/2511.19686)
Keywords: interpretability
Abstract: Cell counting in biomedical imaging is pivotal for various clinical applications, yet the interpretability of deep learning models in this domain remains a significant challenge. We propose a novel prototype-based method for interpretable cell counting via density map estimation. Our approach integrates a prototype layer into the density estimation network, enabling the model to learn representative visual patterns for both cells and background artifacts. The learned prototypes were evaluated through a survey of biologists, who confirmed the relevance of the visual patterns identified, further validating the interpretability of the model. By generating interpretations that highlight regions in the input image most similar to each prototype, our method offers a clear understanding of how the model identifies and counts cells. Extensive experiments on two public datasets demonstrate that our method achieves interpretability without compromising counting effectiveness. This work provides researchers and clinicians with a transparent and reliable tool for cell counting, potentially increasing trust and accelerating the adoption of deep learning in critical biomedical applications. Code is available at this https URL.

Title: TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding

Authors: Chin-Chia Michael Yeh, Uday Singh Saini, Xin Dai, Xiran Fan, Shubham Jain, Yujie Fan, Jiarui Sun, Junpeng Wang, Menghai Pan, Yingtong Dou, Yuzhong Chen, Vineeth Rakesh, Liang Wang, Yan Zheng, Mahashweta Das
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19693
Pdf URL: https://arxiv.org/pdf/2511.19693
Copy Paste: [[2511.19693]] TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding(https://arxiv.org/abs/2511.19693)
Keywords: transformer
Abstract: Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people's lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.

Title: TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification

Authors: Chin-Chia Michael Yeh, Uday Singh Saini, Junpeng Wang, Xin Dai, Xiran Fan, Jiarui Sun, Yujie Fan, Yan Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19694
Pdf URL: https://arxiv.org/pdf/2511.19694
Copy Paste: [[2511.19694]] TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification(https://arxiv.org/abs/2511.19694)
Keywords: transformer
Abstract: The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the high cost of labeled data. Foundation models capable of in-context learning (ICL) offer a powerful solution, adapting to new tasks with minimal examples and reducing the need for extensive retraining. However, prior work on large-scale time series models has predominantly focused on forecasting, leaving a critical gap for versatile, fine-tuning-free classification. To address this, we introduce TiCT (Time-series in-Context Transformer), a transformer-based model pre-trained exclusively on synthetic data to perform in-context classification. We make two primary technical contributions: 1) a novel architecture featuring a scalable bit-based label encoding and a special output attention mechanism to handle an arbitrary number of classes; and 2) a synthetic pre-training framework that combines a Mixup-inspired process with data augmentation to foster generalization and noise invariance. Extensive evaluations on the UCR Archive show that TiCT achieves competitive performance against state-of-the-art supervised methods. Crucially, this is accomplished using only in-context examples at inference time, without updating a single model weight.

Title: RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

Authors: Omar Alama, Darshil Jariwala, Avigyan Bhattacharya, Seungchan Kim, Wenshan Wang, Sebastian Scherer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19704
Pdf URL: https://arxiv.org/pdf/2511.19704
Copy Paste: [[2511.19704]] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models(https://arxiv.org/abs/2511.19704)
Keywords: segmentation
Abstract: Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (105M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.

Title: CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

Authors: Ziteng Sun, Adrian Benton, Samuel Kushnir, Asher Trockman, Vikas Singh, Suhas Diggavi, Ananda Theertha Suresh
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19705
Pdf URL: https://arxiv.org/pdf/2511.19705
Copy Paste: [[2511.19705]] CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding(https://arxiv.org/abs/2511.19705)
Keywords: privacy, large language model
Abstract: Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss without calibration data. To maintain inference efficiency, we perform structured matrix transformations for single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma 2 models, and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% of computation overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.

Title: Rethinking Vision Transformer Depth via Structural Reparameterization

Authors: Chengwei Zhou, Vipin Chaudhary, Gourav Datta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19718
Pdf URL: https://arxiv.org/pdf/2511.19718
Copy Paste: [[2511.19718]] Rethinking Vision Transformer Depth via Structural Reparameterization(https://arxiv.org/abs/2511.19718)
Keywords: transformer
Abstract: The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.

Title: Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian

Authors: Mobina Mehrazar, Mohammad Amin Yousefi, Parisa Abolfath Beygi, Behnam Bahrak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19719
Pdf URL: https://arxiv.org/pdf/2511.19719
Copy Paste: [[2511.19719]] Can LLMs Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian(https://arxiv.org/abs/2511.19719)
Keywords: robust, large language model
Abstract: Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.

Title: Prompt Fencing: A Cryptographic Approach to Establishing Security Boundaries in Large Language Model Prompts

Authors: Steven Peh
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19727
Pdf URL: https://arxiv.org/pdf/2511.19727
Copy Paste: [[2511.19727]] Prompt Fencing: A Cryptographic Approach to Establishing Security Boundaries in Large Language Model Prompts(https://arxiv.org/abs/2511.19727)
Keywords: security, attack, large language model
Abstract: Large Language Models (LLMs) remain vulnerable to prompt injection attacks, representing the most significant security threat in production deployments. We present Prompt Fencing, a novel architectural approach that applies cryptographic authentication and data architecture principles to establish explicit security boundaries within LLM prompts. Our approach decorates prompt segments with cryptographically signed metadata including trust ratings and content types, enabling LLMs to distinguish between trusted instructions and untrusted content. While current LLMs lack native fence awareness, we demonstrate that simulated awareness through prompt instructions achieved complete prevention of injection attacks in our experiments, reducing success rates from 86.7% (260/300 successful attacks) to 0% (0/300 successful attacks) across 300 test cases with two leading LLM providers. We implement a proof-of-concept fence generation and verification pipeline with a total overhead of 0.224 seconds (0.130s for fence generation, 0.094s for validation) across 100 samples. Our approach is platform-agnostic and can be incrementally deployed as a security layer above existing LLM infrastructure, with the expectation that future models will be trained with native fence awareness for optimal security.

Title: Training-Free Active Learning Framework in Materials Science with Large Language Models

Authors: Hongchen Wang, Rafael Espinosa Castañeda, Jay R. Werber, Yao Fehlis, Edward Kim, Jason Hattrick-Simpers
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2511.19730
Pdf URL: https://arxiv.org/pdf/2511.19730
Copy Paste: [[2511.19730]] Training-Free Active Learning Framework in Materials Science with Large Language Models(https://arxiv.org/abs/2511.19730)
Keywords: large language model
Abstract: Active learning (AL) accelerates scientific discovery by prioritizing the most informative experiments, but traditional machine learning (ML) models used in AL suffer from cold-start limitations and domain-specific feature engineering, restricting their generalizability. Large language models (LLMs) offer a new paradigm by leveraging their pretrained knowledge and universal token-based representations to propose experiments directly from text-based descriptions. Here, we introduce an LLM-based active learning framework (LLM-AL) that operates in an iterative few-shot setting and benchmark it against conventional ML models across four diverse materials science datasets. We explored two prompting strategies: one using concise numerical inputs suited for datasets with more compositional and structured features, and another using expanded descriptive text suited for datasets with more experimental and procedural features to provide additional context. Across all datasets, LLM-AL could reduce the number of experiments needed to reach top-performing candidates by over 70% and consistently outperformed traditional ML models. We found that LLM-AL performs broader and more exploratory searches while still reaching the optima with fewer iterations. We further examined the stability boundaries of LLM-AL given the inherent non-determinism of LLMs and found its performance to be broadly consistent across runs, within the variability range typically observed for traditional ML approaches. These results demonstrate that LLM-AL can serve as a generalizable alternative to conventional AL pipelines for more efficient and interpretable experiment selection and potential LLM-driven autonomous discovery.

Title: Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation

Authors: Richard J. Young, Alice M. Matthews
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19739
Pdf URL: https://arxiv.org/pdf/2511.19739
Copy Paste: [[2511.19739]] Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation(https://arxiv.org/abs/2511.19739)
Keywords: transformer
Abstract: Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transformer-based embedding models adapted for cardiology through Low-Rank Adaptation (LoRA) fine-tuning on 106,535 cardiology text pairs derived from authoritative medical textbooks. Results demonstrate that encoder-only architectures, particularly BioLinkBERT, achieve superior domain-specific performance (separation score: 0.510) compared to larger decoder-based models, while requiring significantly fewer computational resources. The findings challenge the assumption that larger language models necessarily produce better domain-specific embeddings and provide practical guidance for clinical NLP system development. All models, training code, and evaluation datasets are publicly available to support reproducible research in medical informatics.

Title: Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

Authors: Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19741
Pdf URL: https://arxiv.org/pdf/2511.19741
Copy Paste: [[2511.19741]] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans(https://arxiv.org/abs/2511.19741)
Keywords: generative
Abstract: Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.

Title: DISCO: A Browser-Based Privacy-Preserving Framework for Distributed Collaborative Learning

Authors: Julien T. T. Vignoud, Valérian Rousset, Hugo El Guedj, Ignacio Aleman, Walid Bennaceur, Batuhan Faik Derinbay, Eduard Ďurech, Damien Gengler, Lucas Giordano, Felix Grimberg, Franziska Lippoldt, Christina Kopidaki, Jiafan Liu, Lauris Lopata, Nathan Maire, Paul Mansat, Martin Milenkoski, Emmanuel Omont, Güneş Özgün, Mina Petrović, Francesco Posa, Morgan Ridel, Giorgio Savini, Marcel Torne, Lucas Trognon, Alyssa Unell, Olena Zavertiaieva, Sai Praneeth Karimireddy, Tahseen Rabbani, Mary-Anne Hartley, Martin Jaggi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19750
Pdf URL: https://arxiv.org/pdf/2511.19750
Copy Paste: [[2511.19750]] DISCO: A Browser-Based Privacy-Preserving Framework for Distributed Collaborative Learning(https://arxiv.org/abs/2511.19750)
Keywords: privacy, federate
Abstract: Data is often impractical to share for a range of well considered reasons, such as concerns over privacy, intellectual property, and legal constraints. This not only fragments the statistical power of predictive models, but creates an accessibility bias, where accuracy becomes inequitably distributed to those who have the resources to overcome these concerns. We present DISCO: an open-source DIStributed COllaborative learning platform accessible to non-technical users, offering a means to collaboratively build machine learning models without sharing any original data or requiring any programming knowledge. DISCO's web application trains models locally directly in the browser, making our tool cross-platform out-of-the-box, including smartphones. The modular design of \disco offers choices between federated and decentralized paradigms, various levels of privacy guarantees and several approaches to weight aggregation strategies that allow for model personalization and bias resilience in the collaborative training. Code repository is available at this https URL and a showcase web interface at this https URL

Title: What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities

Authors: Muchang Bahng, Charlie Berens, Jon Donnelly, Eric Chen, Chaofan Chen, Cynthia Rudin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19752
Pdf URL: https://arxiv.org/pdf/2511.19752
Copy Paste: [[2511.19752]] What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities(https://arxiv.org/abs/2511.19752)
Keywords: interpretability
Abstract: Species detection is important for monitoring the health of ecosystems and identifying invasive species, serving a crucial role in guiding conservation efforts. Multimodal neural networks have seen increasing use for identifying species to help automate this task, but they have two major drawbacks. First, their black-box nature prevents the interpretability of their decision making process. Second, collecting genetic data is often expensive and requires invasive procedures, often necessitating researchers to capture or kill the target specimen. We address both of these problems by extending prototype networks (ProtoPNets), which are a popular and interpretable alternative to traditional neural networks, to the multimodal, cost-aware setting. We ensemble prototypes from each modality, using an associated weight to determine how much a given prediction relies on each modality. We further introduce methods to identify cases for which we do not need the expensive genetic information to make confident predictions. We demonstrate that our approach can intelligently allocate expensive genetic data for fine-grained distinctions while using abundant image data for clearer visual classifications and achieving comparable accuracy to models that consistently use both modalities.

Title: Vision--Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Authors: Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19759
Pdf URL: https://arxiv.org/pdf/2511.19759
Copy Paste: [[2511.19759]] Vision--Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation(https://arxiv.org/abs/2511.19759)
Keywords: segmentation
Abstract: Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.

Title: A Storage-Efficient Feature for 3D Concrete Defect Segmentation to Replace Normal Vector

Authors: Linxin Hua (1), Jianghua Deng (2), Ye Lu (1) ((1) Department of Civil and Environmental Engineering, Monash University, Melbourne, Australia, (2) School of Civil Engineering and Architecture, Changzhou Institute of Technology, Changzhou, China)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19760
Pdf URL: https://arxiv.org/pdf/2511.19760
Copy Paste: [[2511.19760]] A Storage-Efficient Feature for 3D Concrete Defect Segmentation to Replace Normal Vector(https://arxiv.org/abs/2511.19760)
Keywords: segmentation
Abstract: Point cloud reconstruction of damage offers an effective solution to image-based methods vulnerable to background noise, yet its application is constrained by the high volume of 3D data. This study proposes a new feature, relative angle, computed as the angle between the normal vector of a point and the average normal vector of its parent point cloud. This single-dimensional feature provides directionality information equivalent to normal vectors for concrete surface defect characteristics. Through entropy-based feature evaluation, this study demonstrates the ability of relative angle to filter out redundant information in undamaged sections while retaining effective information in damaged sections. By training and testing with PointNet++, models based on the relative angles achieved similar performance to that of models based on normal vectors while delivering 27.6% storage reduction and 83% input channel compression. This novel feature has the potential to enable larger-batch execution on resource-constrained hardware without the necessity of architectural modifications to models.

Title: Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation

Authors: Ali Torabi, Sanjog Gaihre, Yaqoob Majeed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19765
Pdf URL: https://arxiv.org/pdf/2511.19765
Copy Paste: [[2511.19765]] Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation(https://arxiv.org/abs/2511.19765)
Keywords: transformer, segmentation
Abstract: Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective-without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.

Title: One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

Authors: Haoyu Wu, Jingyi Xu, Qiaomu Miao, Dimitris Samaras, Hieu Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19778
Pdf URL: https://arxiv.org/pdf/2511.19778
Copy Paste: [[2511.19778]] One Attention, One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer(https://arxiv.org/abs/2511.19778)
Keywords: diffusion, transformer
Abstract: We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle-many heads exhibit extremely sharp, periodic phase selectivity-so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To this end, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query's stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs, stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation.

Title: Gender Bias in Emotion Recognition by Large Language Models

Authors: Maureen Herbert, Katie Sun, Angelica Lim, Yasaman Etesam
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2511.19785
Pdf URL: https://arxiv.org/pdf/2511.19785
Copy Paste: [[2511.19785]] Gender Bias in Emotion Recognition by Large Language Models(https://arxiv.org/abs/2511.19785)
Keywords: fair, large language model
Abstract: The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, "How does this person feel?". Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.

Title: Terminal Velocity Matching

Authors: Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19797
Pdf URL: https://arxiv.org/pdf/2511.19797
Copy Paste: [[2511.19797]] Terminal Velocity Matching(https://arxiv.org/abs/2511.19797)
Keywords: diffusion, transformer, generative
Abstract: We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.

Title: Scalable Data Attribution via Forward-Only Test-Time Inference

Authors: Sibo Ma, Julian Nyarko
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19803
Pdf URL: https://arxiv.org/pdf/2511.19803
Copy Paste: [[2511.19803]] Scalable Data Attribution via Forward-Only Test-Time Inference(https://arxiv.org/abs/2511.19803)
Keywords: large language model
Abstract: Data attribution seeks to trace model behavior back to the training examples that shaped it, enabling debugging, auditing, and data valuation at scale. Classical influence-function methods offer a principled foundation but remain impractical for modern networks because they require expensive backpropagation or Hessian inversion at inference. We propose a data attribution method that preserves the same first-order counterfactual target while eliminating per-query backward passes. Our approach simulates each training example's parameter influence through short-horizon gradient propagation during training and later reads out attributions for any query using only forward evaluations. This design shifts computation from inference to simulation, reflecting real deployment regimes where a model may serve billions of user queries but originate from a fixed, finite set of data sources (for example, a large language model trained on diverse corpora while compensating a specific publisher such as the New York Times). Empirically, on standard MLP benchmarks, our estimator matches or surpasses state-of-the-art baselines such as TRAK on standard attribution metrics (LOO and LDS) while offering orders-of-magnitude lower inference cost. By combining influence-function fidelity with first-order scalability, our method provides a theoretical framework for practical, real-time data attribution in large pretrained models.

Title: Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

Authors: Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, Georgios Tzimiropoulos
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.19811
Pdf URL: https://arxiv.org/pdf/2511.19811
Copy Paste: [[2511.19811]] Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization(https://arxiv.org/abs/2511.19811)
Keywords: diffusion, generative
Abstract: Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.

Title: Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

Authors: Wentao Hu, Mingkuan Zhao, Shuangyong Song, Xiaoyan Zhu, Xin Lai, Jiayin Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19822
Pdf URL: https://arxiv.org/pdf/2511.19822
Copy Paste: [[2511.19822]] Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models(https://arxiv.org/abs/2511.19822)
Keywords: large language model
Abstract: Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: a catastrophic performance degradation when the pruned model is applied to other domains, necessitating a costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured ``cluster-then-select" process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, our proposed Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream this http URL experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24\% gain on general tasks and 8.92\% on specialized tasks like math reasoning and code generation.

Title: Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation

Authors: Haoqing Li, Jun Shi, Xianmeng Chen, Qiwei Jia, Rui Wang, Wei Wei, Hong An, Xiaowen Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19834
Pdf URL: https://arxiv.org/pdf/2511.19834
Copy Paste: [[2511.19834]] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation(https://arxiv.org/abs/2511.19834)
Keywords: large language model
Abstract: Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BHD) diagnosis via Computed Tomography (CT) imaging. While Multimodal Large Language Models (MLLMs) demonstrate diagnostic potential fo such rare diseases, the absence of domain-specific knowledge and referable radiological features intensify hallucination risks. To address this problem, we propose BHD-RAG, a multimodal retrieval-augmented generation framework that integrates DCLD-specific expertise and clinical precedents with MLLMs to improve BHD diagnostic accuracy. BHDRAG employs: (1) a specialized agent generating imaging manifestation descriptions of CT images to construct a multimodal corpus of DCLDs cases. (2) a cosine similarity-based retriever pinpointing relevant imagedescription pairs for query images, and (3) an MLLM synthesizing retrieved evidence with imaging data for diagnosis. BHD-RAG is validated on the dataset involving four types of DCLDs, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.

Title: Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

Authors: Xuewen Liu, Zhikai Li, Jing Zhang, Mengjuan Chen, Qingyi Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19835
Pdf URL: https://arxiv.org/pdf/2511.19835
Copy Paste: [[2511.19835]] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation(https://arxiv.org/abs/2511.19835)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at this https URL .

Title: SX-GeoTree: Self-eXplaining Geospatial Regression Tree Incorporating the Spatial Similarity of Feature Attributions

Authors: Chaogui Kang, Lijian Luo, Qingfeng Guan, Yu Liu
Subjects: cs.LG, cs.CY, stat.ML
Abstract URL: https://arxiv.org/abs/2511.19845
Pdf URL: https://arxiv.org/pdf/2511.19845
Copy Paste: [[2511.19845]] SX-GeoTree: Self-eXplaining Geospatial Regression Tree Incorporating the Spatial Similarity of Feature Attributions(https://arxiv.org/abs/2511.19845)
Keywords: robust, explainability
Abstract: Decision trees remain central for tabular prediction but struggle with (i) capturing spatial dependence and (ii) producing locally stable (robust) explanations. We present SX-GeoTree, a self-explaining geospatial regression tree that integrates three coupled objectives during recursive splitting: impurity reduction (MSE), spatial residual control (global Moran's I), and explanation robustness via modularity maximization on a consensus similarity network formed from (a) geographically weighted regression (GWR) coefficient distances (stimulus-response similarity) and (b) SHAP attribution distances (explanatory similarity). We recast local Lipschitz continuity of feature attributions as a network community preservation problem, enabling scalable enforcement of spatially coherent explanations without per-sample neighborhood searches. Experiments on two exemplar tasks (county-level GDP in Fujian, n=83; point-wise housing prices in Seattle, n=21,613) show SX-GeoTree maintains competitive predictive accuracy (within 0.01 $R^{2}$ of decision trees) while improving residual spatial evenness and doubling attribution consensus (modularity: Fujian 0.19 vs 0.09; Seattle 0.10 vs 0.05). Ablation confirms Moran's I and modularity terms are complementary; removing either degrades both spatial residual structure and explanation stability. The framework demonstrates how spatial similarity - extended beyond geometric proximity through GWR-derived local relationships - can be embedded in interpretable models, advancing trustworthy geospatial machine learning and offering a transferable template for domain-aware explainability.

Title: DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction

Authors: Jiahui Sun, Junran Lu, Jinhui Yin, Yishuo Xu, Yuanqi Li, Yanwen Guo
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2511.19850
Pdf URL: https://arxiv.org/pdf/2511.19850
Copy Paste: [[2511.19850]] DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction(https://arxiv.org/abs/2511.19850)
Keywords: extraction, segmentation
Abstract: Automatic extraction of road networks from aerial imagery is a fundamental task, yet prevailing methods rely on polylines that struggle to model curvilinear geometry. We maintain that road geometry is inherently curve-based and introduce the Bézier Graph, a differentiable parametric curve-based representation. The primary obstacle to this representation is to obtain the difficult-to-construct vector ground-truth (GT). We sidestep this bottleneck by reframing the task as a global optimization problem over the Bézier Graph. Our framework, DOGE, operationalizes this paradigm by learning a parametric Bézier Graph directly from segmentation masks, eliminating the need for curve GT. DOGE holistically optimizes the graph by alternating between two complementary modules: DiffAlign continuously optimizes geometry via differentiable rendering, while TopoAdapt uses discrete operators to refine its topology. Our method sets a new state-of-the-art on the large-scale SpaceNet and CityScale benchmarks, presenting a new paradigm for generating high-fidelity vector maps of road networks. We will release our code and related data.

Title: Accelerating Wireless Distributed Learning via Hybrid Split and Federated Learning Optimization

Authors: Kun Guo, Xuefei Li, Xijun Wang, Howard H. Yang, Wei Feng, Tony Q. S. Quek
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2511.19851
Pdf URL: https://arxiv.org/pdf/2511.19851
Copy Paste: [[2511.19851]] Accelerating Wireless Distributed Learning via Hybrid Split and Federated Learning Optimization(https://arxiv.org/abs/2511.19851)
Keywords: federate
Abstract: Federated learning (FL) and split learning (SL) are two effective distributed learning paradigms in wireless networks, enabling collaborative model training across mobile devices without sharing raw data. While FL supports low-latency parallel training, it may converge to less accurate model. In contrast, SL achieves higher accuracy through sequential training but suffers from increased delay. To leverage the advantages of both, hybrid split and federated learning (HSFL) allows some devices to operate in FL mode and others in SL mode. This paper aims to accelerate HSFL by addressing three key questions: 1) How does learning mode selection affect overall learning performance? 2) How does it interact with batch size? 3) How can these hyperparameters be jointly optimized alongside communication and computational resources to reduce overall learning delay? We first analyze convergence, revealing the interplay between learning mode and batch size. Next, we formulate a delay minimization problem and propose a two-stage solution: a block coordinate descent method for a relaxed problem to obtain a locally optimal solution, followed by a rounding algorithm to recover integer batch sizes with near-optimal performance. Experimental results demonstrate that our approach significantly accelerates convergence to the target accuracy compared to existing methods.

Title: Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs

Authors: Shi-Wei Dai, Yan-Wei Shie, Tsung-Huan Yang, Lun-Wei Ku, Yung-Hui Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19852
Pdf URL: https://arxiv.org/pdf/2511.19852
Copy Paste: [[2511.19852]] Profile-LLM: Dynamic Profile Optimization for Realistic Personality Expression in LLMs(https://arxiv.org/abs/2511.19852)
Keywords: large language model
Abstract: Personalized Large Language Models (LLMs) have been shown to be an effective way to create more engaging and enjoyable user-AI interactions. While previous studies have explored using prompts to elicit specific personality traits in LLMs, they have not optimized these prompts to maximize personality expression. To address this limitation, we propose PersonaPulse: Dynamic Profile Optimization for Realistic Personality Expression in LLMs, a framework that leverages LLMs' inherent knowledge of personality traits to iteratively enhance role-play prompts while integrating a situational response benchmark as a scoring tool, ensuring a more realistic and contextually grounded evaluation to guide the optimization process. Quantitative evaluations demonstrate that the prompts generated by PersonaPulse outperform those of prior work, which were designed based on personality descriptions from psychological studies. Additionally, we explore the relationship between model size and personality modeling through extensive experiments. Finally, we find that, for certain personality traits, the extent of personality evocation can be partially controlled by pausing the optimization process. These findings underscore the importance of prompt optimization in shaping personality expression within LLMs, offering valuable insights for future research on adaptive AI interactions.

Title: A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Authors: Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19858
Pdf URL: https://arxiv.org/pdf/2511.19858
Copy Paste: [[2511.19858]] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction(https://arxiv.org/abs/2511.19858)
Keywords: large language model
Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.

Title: GigaWorld-0: World Models as Data Engine to Empower Embodied AI

Authors: GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.19861
Pdf URL: https://arxiv.org/pdf/2511.19861
Copy Paste: [[2511.19861]] GigaWorld-0: World Models as Data Engine to Empower Embodied AI(https://arxiv.org/abs/2511.19861)
Keywords: generative
Abstract: World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA model (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

Title: MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

Authors: Chengyue Huang, Mellon M. Zhang, Robert Azarcon, Glen Chou, Zsolt Kira
Subjects: cs.CV, cs.AI, cs.CL, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2511.19878
Pdf URL: https://arxiv.org/pdf/2511.19878
Copy Paste: [[2511.19878]] MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization(https://arxiv.org/abs/2511.19878)
Keywords: robust
Abstract: Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes -- freezing modules or applying uniform regularization -- either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.

Title: ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images

Authors: Lei Ding, Tong Liu, Xuanguang Liu, Xiangyun Liu, Haitao Guo, Jun Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19882
Pdf URL: https://arxiv.org/pdf/2511.19882
Copy Paste: [[2511.19882]] ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images(https://arxiv.org/abs/2511.19882)
Keywords: robust, transformer
Abstract: Change detection (CD) in multitemporal remote sensing imagery presents significant challenges for fine-grained recognition, owing to heterogeneity and spatiotemporal misalignment. However, existing methodologies based on vision transformers or state-space models typically disrupt local structural consistency during temporal serialization, obscuring discriminative cues under misalignment and hindering reliable change localization. To address this, we introduce ChessMamba, a structure-aware framework leveraging interleaved state-space modeling for robust CD with multi-temporal inputs. ChessMamba integrates a SpatialMamba encoder with a lightweight cross-source interaction module, featuring two key innovations: (i) Chessboard interleaving with snake scanning order, which serializes multi-temporal features into a unified sequence within a single forward pass, thereby shortening interaction paths and enabling direct comparison for accurate change localization; and (ii) Structure-aware fusion via multi-dilated convolutions, selectively capturing center-and-corner neighborhood contexts within each mono-temporal. Comprehensive evaluations on three CD tasks, including binary CD, semantic CD and multimodal building damage assessment, demonstrate that ChessMamba effectively fuses heterogeneous features and achieves substantial accuracy improvements over state-of-the-art this http URL relevant code will be available at: this http URL.

Title: Frequency Bias Matters: Diving into Robust and Generalized Deep Image Forgery Detection

Authors: Chi Liu, Tianqing Zhu, Wanlei Zhou, Wei Zhao
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2511.19886
Pdf URL: https://arxiv.org/pdf/2511.19886
Copy Paste: [[2511.19886]] Frequency Bias Matters: Diving into Robust and Generalized Deep Image Forgery Detection(https://arxiv.org/abs/2511.19886)
Keywords: security, defense, attack, robust, generative
Abstract: As deep image forgery powered by AI generative models, such as GANs, continues to challenge today's digital world, detecting AI-generated forgeries has become a vital security topic. Generalizability and robustness are two critical concerns of a forgery detector, determining its reliability when facing unknown GANs and noisy samples in an open world. Although many studies focus on improving these two properties, the root causes of these problems have not been fully explored, and it is unclear if there is a connection between them. Moreover, despite recent achievements in addressing these issues from image forensic or anti-forensic aspects, a universal method that can contribute to both sides simultaneously remains practically significant yet unavailable. In this paper, we provide a fundamental explanation of these problems from a frequency perspective. Our analysis reveals that the frequency bias of a DNN forgery detector is a possible cause of generalization and robustness issues. Based on this finding, we propose a two-step frequency alignment method to remove the frequency discrepancy between real and fake images, offering double-sided benefits: it can serve as a strong black-box attack against forgery detectors in the anti-forensic context or, conversely, as a universal defense to improve detector reliability in the forensic context. We also develop corresponding attack and defense implementations and demonstrate their effectiveness, as well as the effect of the frequency alignment method, in various experimental settings involving twelve detectors, eight forgery models, and five metrics.

Title: LiMT: A Multi-task Liver Image Benchmark Dataset

Authors: Zhe Liu, Kai Han, Siqi Ma, Yan Zhu, Jun Chen, Chongwen Lyu, Xinyi Qiu, Chengxuan Qian, Yuqing Song, Yi Liu, Liyuan Tian, Yang Ji, Yuefeng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19889
Pdf URL: https://arxiv.org/pdf/2511.19889
Copy Paste: [[2511.19889]] LiMT: A Multi-task Liver Image Benchmark Dataset(https://arxiv.org/abs/2511.19889)
Keywords: segmentation
Abstract: Computer-aided diagnosis (CAD) technology can assist clinicians in evaluating liver lesions and intervening with treatment in time. Although CAD technology has advanced in recent years, the application scope of existing datasets remains relatively limited, typically supporting only single tasks, which has somewhat constrained the development of CAD technology. To address the above limitation, in this paper, we construct a multi-task liver dataset (LiMT) used for liver and tumor segmentation, multi-label lesion classification, and lesion detection based on arterial phase-enhanced computed tomography (CT), potentially providing an exploratory solution that is able to explore the correlation between tasks and does not need to worry about the heterogeneity between task-specific datasets during training. The dataset includes CT volumes from 150 different cases, comprising four types of liver diseases as well as normal cases. Each volume has been carefully annotated and calibrated by experienced clinicians. This public multi-task dataset may become a valuable resource for the medical imaging research community in the future. In addition, this paper not only provides relevant baseline experimental results but also reviews existing datasets and methods related to liver-related tasks. Our dataset is available at this https URL.

Title: Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms

Authors: Shuoyan Xu, Yu Zhang, Eric J. Miller
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19893
Pdf URL: https://arxiv.org/pdf/2511.19893
Copy Paste: [[2511.19893]] Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms(https://arxiv.org/abs/2511.19893)
Keywords: transformer
Abstract: Ride-hailing platforms are characterized by high-frequency, behavior-driven environments. Although survival analysis has been applied to recurrent events in other domains, its use in modeling ride-hailing driver behavior remains largely unexplored. This study formulates idle behavior as a recurrent survival process using large-scale platform data and proposes a Transformer-based framework that captures long-term temporal dependencies with causal masking and incorporates driver-specific embeddings to model latent heterogeneity. Results on Toronto ride-hailing data demonstrate that the proposed Frailty-Aware Cox Transformer (FACT) achieves the highest time-dependent C-indices and lowest Brier Scores, outperforming classical and deep learning survival models. This approach enables more accurate risk estimation, supports platform retention strategies, and provides policy-relevant insights.

Title: MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

Authors: Mingyu Zhao, Zhanfu Yang, Yang Zhou, Zhaoyang Xia, Can Jin, Xiaoxiao He, Carol Neidle, Dimitris N. Metaxas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19907
Pdf URL: https://arxiv.org/pdf/2511.19907
Copy Paste: [[2511.19907]] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition(https://arxiv.org/abs/2511.19907)
Keywords: robust, segmentation
Abstract: This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.

Title: Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance

Authors: Haoxuan Wang, Jiachen Tao, Junyi Wu, Gaowen Liu, Ramana Rao Kompella, Yan Yan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19909
Pdf URL: https://arxiv.org/pdf/2511.19909
Copy Paste: [[2511.19909]] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance(https://arxiv.org/abs/2511.19909)
Keywords: generative
Abstract: We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.

Title: Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

Authors: Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, Tat-Seng Chua
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.19912
Pdf URL: https://arxiv.org/pdf/2511.19912
Copy Paste: [[2511.19912]] Reasoning-VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving(https://arxiv.org/abs/2511.19912)
Keywords: robust
Abstract: Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. Leveraging both supervised learning and reinforcement learning fine-tuning, extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the excellent inference speed reported to date.

Title: Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects

Authors: Maryam Eftekharifar, Churun Zhang, Jialiang Wei, Xudong Cao, Hossein Heidari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19913
Pdf URL: https://arxiv.org/pdf/2511.19913
Copy Paste: [[2511.19913]] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects(https://arxiv.org/abs/2511.19913)
Keywords: diffusion
Abstract: We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting dense, non-visual volumetric physical properties from 3D visual data. This approach leverages the largest-ever optically printed 3D specimen dataset, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation. Conventional vision models are ill-equipped for this task, as they lack an inductive bias for the coupled, non-linear interactions of optical physics (diffraction, absorption) and material physics (diffusion, convection) that govern the final chemical state. To address this, we propose Coupled Physics-Gated Adaptation (C-PGA), a novel multimodal fusion architecture. Unlike standard concatenation, C-PGA explicitly models physical coupling by using sparse geometrical and process parameters (e.g., surface transport, print layer height) as a Query to dynamically gate and adapt the dense visual features via feature-wise linear modulation (FiLM). This mechanism spatially modulates dual 3D visual streams-extracted by parallel 3D-CNNs processing raw projection stacks and their diffusion-diffraction corrected counterparts allowing the model to recalibrate its visual perception based on the physical context. This approach offers a breakthrough in virtual chemical characterisation, eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state.

Title: Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Authors: Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19917
Pdf URL: https://arxiv.org/pdf/2511.19917
Copy Paste: [[2511.19917]] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models(https://arxiv.org/abs/2511.19917)
Keywords: diffusion
Abstract: Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.

Title: HybriDLA: Hybrid Generation for Document Layout Analysis

Authors: Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19919
Pdf URL: https://arxiv.org/pdf/2511.19919
Copy Paste: [[2511.19919]] HybriDLA: Hybrid Generation for Document Layout Analysis(https://arxiv.org/abs/2511.19919)
Keywords: diffusion, generative
Abstract: Conventional document layout analysis (DLA) traditionally depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a state-of-the-art performance, outperforming previous approaches. All data and models will be made publicly available at this https URL.

Title: Intelligent Image Search Algorithms Fusing Visual Large Models

Authors: Kehan Wang, Tingqiong Cui, Yang Zhang, Yu Chen, Shifeng Wu, Zhenzhang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19920
Pdf URL: https://arxiv.org/pdf/2511.19920
Copy Paste: [[2511.19920]] Intelligent Image Search Algorithms Fusing Visual Large Models(https://arxiv.org/abs/2511.19920)
Keywords: security, robust
Abstract: Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., "sun visor lowered"), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM's inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., "driver wearing a mask") without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82\%, significantly outperforming detection-only baselines. It also attains 94.95\% accuracy in zero-shot search for driver mask-wearing and over 90\% average accuracy in state search tasks.

Title: CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

Authors: Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2511.19923
Pdf URL: https://arxiv.org/pdf/2511.19923
Copy Paste: [[2511.19923]] CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding(https://arxiv.org/abs/2511.19923)
Keywords: robust
Abstract: Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. Dataset and code will be further released.

Title: Context-Aware Token Pruning and Discriminative Selective Attention for Transformer Tracking

Authors: Janani Kugarajeevan, Thanikasalam Kokul, Amirthalingam Ramanan, Subha Fernando
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19928
Pdf URL: https://arxiv.org/pdf/2511.19928
Copy Paste: [[2511.19928]] Context-Aware Token Pruning and Discriminative Selective Attention for Transformer Tracking(https://arxiv.org/abs/2511.19928)
Keywords: transformer
Abstract: One-stream Transformer-based trackers have demonstrated remarkable performance by concatenating template and search region tokens, thereby enabling joint attention across all tokens. However, enabling an excessive proportion of background search tokens to attend to the target template tokens weakens the tracker's discriminative capability. Several token pruning methods have been proposed to mitigate background interference; however, they often remove tokens near the target, leading to the loss of essential contextual information and degraded tracking performance. Moreover, the presence of distractors within the search tokens further reduces the tracker's ability to accurately identify the target. To address these limitations, we propose CPDATrack, a novel tracking framework designed to suppress interference from background and distractor tokens while enhancing computational efficiency. First, a learnable module is integrated between two designated encoder layers to estimate the probability of each search token being associated with the target. Based on these estimates, less-informative background tokens are pruned from the search region while preserving the contextual cues surrounding the target. To further suppress background interference, a discriminative selective attention mechanism is employed that fully blocks search-to-template attention in the early layers. In the subsequent encoder layers, high-probability target tokens are selectively extracted from a localized region to attend to the template tokens, thereby reducing the influence of background and distractor tokens. The proposed CPDATrack achieves state-of-the-art performance across multiple benchmarks, particularly on GOT-10k, where it attains an average overlap of 75.1 percent.

Title: EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning

Authors: Songlin Zhao, Michael Pitts, Zhuwei Qin
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2511.19935
Pdf URL: https://arxiv.org/pdf/2511.19935
Copy Paste: [[2511.19935]] EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning(https://arxiv.org/abs/2511.19935)
Keywords: large language model
Abstract: The rapid advancement of large language models (LLMs) has increased the demand for domain-specialized variants in areas such as law, healthcare, and finance. However, their large size remains a barrier to deployment in resource-constrained environments, and existing compression methods either generalize poorly across domains or incur high overhead. In this work, we propose \textbf{EfficientXpert}, a lightweight domain-pruning framework that combines a propagation-aware pruning criterion (Foresight Mask) with an efficient adapter-update algorithm (Partial Brain Surgeon). Integrated into the LoRA fine-tuning process, EfficientXpert enables a one-step transformation of general pretrained models into sparse, domain-adapted experts. Across health and legal tasks, it retains up to 98% of dense-model performance at 40% sparsity, outperforming state-of-the-art methods. Further analysis reveals substantial domain-dependent structural shifts that degrade the effectiveness of general pruning masks, underscoring the need for adaptive, domain-aware pruning strategies tailored to each domain.

Title: Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Authors: Youngseo Kim, Dohyun Kim, Geohee Han, Paul Hongsuck Seo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19936
Pdf URL: https://arxiv.org/pdf/2511.19936
Copy Paste: [[2511.19936]] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos(https://arxiv.org/abs/2511.19936)
Keywords: robust, diffusion, segmentation
Abstract: Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies-DDIM inversion, textual inversion, and adaptive head weighting-in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.

Title: Optimize Flip Angle Schedules In MR Fingerprinting Using Reinforcement Learning

Authors: Shenjun Zhong, Zhifeng Chen, Zhaolin Chen
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2511.19941
Pdf URL: https://arxiv.org/pdf/2511.19941
Copy Paste: [[2511.19941]] Optimize Flip Angle Schedules In MR Fingerprinting Using Reinforcement Learning(https://arxiv.org/abs/2511.19941)
Keywords: robust
Abstract: Magnetic Resonance Fingerprinting (MRF) leverages transient-state signal dynamics generated by the tunable acquisition parameters, making the design of an optimal, robust sequence a complex, high-dimensional sequential decision problem, such as optimizing one of the key parameters, flip angle. Reinforcement learning (RL) offers a promising approach to automate parameter selection, to optimize pulse sequences that maximize the distinguishability of fingerprints across the parameter space. In this work, we introduce an RL framework for optimizing the flip-angle schedule in MRF and demonstrate a learned schedule exhibiting non-periodic patterns that enhances fingerprint separability. Additionally, an interesting observation is that the RL-optimized schedule may enable a reduction in the number of repetition time, potentially accelerate MRF acquisitions.

Title: Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

Authors: Jingchu Gai, Guanning Zeng, Huaqing Zhang, Aditi Raghunathan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19942
Pdf URL: https://arxiv.org/pdf/2511.19942
Copy Paste: [[2511.19942]] Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning(https://arxiv.org/abs/2511.19942)
Keywords: large language model
Abstract: It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to \textit{diversity collapse}, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method -- \textit{differential smoothing} -- that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7\% improvements on AIME24 dataset.

Title: Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios

Authors: Haoran Hu, Junren Shi, Shuo Jiang, Kun Cheng, Xia Yang, Changhao Piao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19952
Pdf URL: https://arxiv.org/pdf/2511.19952
Copy Paste: [[2511.19952]] Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios(https://arxiv.org/abs/2511.19952)
Keywords: transformer
Abstract: Forward Collision Warning systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi-agent interaction modeling with real-time decision adaptability, evidenced by the high computational cost for edge deployment and the unreliability stemming from simplified interaction this http URL overcome these dual challenges-computational complexity and modeling insufficiency-along with the high false alarm rates of traditional static-threshold warnings, this paper introduces an integrated FCW framework that pairs a Hierarchical Spatio-Temporal Attention Network with a Dynamic Risk Threshold Adjustment algorithm. HSTAN employs a decoupled architecture (Graph Attention Network for spatial, cascaded GRU with self-attention for temporal) to achieve superior performance and efficiency, requiring only 12.3 ms inference time (73% faster than Transformer methods) and reducing the Average Displacement Error (ADE) to 0.73m (42.2% better than Social_LSTM) on the NGSIM dataset. Furthermore, Conformalized Quantile Regression enhances reliability by generating prediction intervals (91.3% coverage at 90% confidence), which the DTRA module then converts into timely warnings via a physics-informed risk potential function and an adaptive threshold mechanism inspired by statistical process this http URL across multi-scenario datasets, the complete system demonstrates high efficacy, achieving an F1 score of 0.912, a low false alarm rate of 8.2%, and an ample warning lead time of 2.8 seconds, validating the framework's superior performance and practical deployment feasibility in complex environments.

Title: Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting

Authors: Wen Zhang, Qin Ren, Wenjing Liu, Haibin Ling, Chenyu You
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19953
Pdf URL: https://arxiv.org/pdf/2511.19953
Copy Paste: [[2511.19953]] Supervise Less, See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting(https://arxiv.org/abs/2511.19953)
Keywords: segmentation
Abstract: Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.

Title: Prompt Fairness: Sub-group Disparities in LLMs

Authors: Meiyu Zhong, Noel Teku, Ravi Tandon
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2511.19956
Pdf URL: https://arxiv.org/pdf/2511.19956
Copy Paste: [[2511.19956]] Prompt Fairness: Sub-group Disparities in LLMs(https://arxiv.org/abs/2511.19956)
Keywords: fair, large language model
Abstract: Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate this problem of prompt fairness: specifically, the phrasing of a prompt by different users/styles, despite the same question being asked in principle, may elicit different responses from an LLM. To quantify this disparity, we propose to use information-theoretic metrics that can capture two dimensions of bias: subgroup sensitivity, the variability of responses within a subgroup and cross group consistency, the variability of responses across subgroups. Our analysis reveals that certain subgroups exhibit both higher internal variability and greater divergence from others. Our empirical analysis reveals that certain demographic sub groups experience both higher internal variability and greater divergence from others, indicating structural inequities in model behavior. To mitigate these disparities, we propose practical interventions, including majority voting across multiple generations and prompt neutralization, which together improve response stability and enhance fairness across user populations. In the experiments, we observe clear prompt sensitivity disparities across demographic subgroups: before mitigation, cross-group divergence values reach 0.28 and typically fall in the from 0.14 to 0.22 range. After applying our neutralization and multi generation strategy, these divergences consistently decrease, with the largest gap reduced to 0.22 and many distances falling to 0.17 or below, indicating more stable and consistent outputs across subgroups.

Title: AppSelectBench: Application-Level Tool Selection Benchmark

Authors: Tianyi Chen, Michael Solodko, Sen Wang, Jongwoo Ko, Junheng Hao, Colby Banbury, Sara Abdali, Saeed Amizadeh, Qing Xiao, Yinheng Li, Tianyu Ding, Kamran Ghasedi Dizaji, Suzhen Zheng, Hao Fan, Justin Wagle, Pashmina Cameron, Kazuhito Koishida
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.19957
Pdf URL: https://arxiv.org/pdf/2511.19957
Copy Paste: [[2511.19957]] AppSelectBench: Application-Level Tool Selection Benchmark(https://arxiv.org/abs/2511.19957)
Keywords: large language model
Abstract: Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented-settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at this https URL.

Title: GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion

Authors: Hichem Felouat, Hanrui Wang, Isao Echizen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19958
Pdf URL: https://arxiv.org/pdf/2511.19958
Copy Paste: [[2511.19958]] GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion(https://arxiv.org/abs/2511.19958)
Keywords: secure, security, privacy, protect, attack, robust, biometric, diffusion
Abstract: 3D face recognition offers a robust biometric solution by capturing facial geometry, providing resilience to variations in illumination, pose changes, and presentation attacks. Its strong spoof resistance makes it suitable for high-security applications, but protecting stored biometric templates remains critical. We present GFT-GCN, a privacy-preserving 3D face recognition framework that combines spectral graph learning with diffusion-based template protection. Our approach integrates the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract compact, discriminative spectral features from 3D face meshes. To secure these features, we introduce a spectral diffusion mechanism that produces irreversible, renewable, and unlinkable templates. A lightweight client-server architecture ensures that raw biometric data never leaves the client device. Experiments on the BU-3DFE and FaceScape datasets demonstrate high recognition accuracy and strong resistance to reconstruction attacks. Results show that GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication.

Title: ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models

Authors: Yujia Wang, Yuanpu Cao, Jinghui Chen
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2511.19959
Pdf URL: https://arxiv.org/pdf/2511.19959
Copy Paste: [[2511.19959]] ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models(https://arxiv.org/abs/2511.19959)
Keywords: privacy, federate, large language model
Abstract: Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, federated block coordinate descent scheme has become a popular option in training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, posing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.

Title: MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

Authors: Changho Choi, Minho Kim, Jinkyu Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19963
Pdf URL: https://arxiv.org/pdf/2511.19963
Copy Paste: [[2511.19963]] MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing(https://arxiv.org/abs/2511.19963)
Keywords: robust, diffusion
Abstract: Despite decades of progress, a truly input-size agnostic visual encoder-a fundamental characteristic of human vision-has remained elusive. We address this limitation by proposing \textbf{MambaEye}, a novel, causal sequential encoder that leverages the low complexity and causal-process based pure Mamba2 backbone. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as $1536^2$ on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.

Title: HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

Authors: Hongji Yang, Yucheng Zhou, Wencheng Han, Runzhou Tao, Zhongying Qiu, Jianfei Yang, Jianbing Shen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19965
Pdf URL: https://arxiv.org/pdf/2511.19965
Copy Paste: [[2511.19965]] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning(https://arxiv.org/abs/2511.19965)
Keywords: diffusion, generative, large language model
Abstract: Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.

Title: Stragglers Can Contribute More: Uncertainty-Aware Distillation for Asynchronous Federated Learning

Authors: Yujia Wang, Fenglong Ma, Jinghui Chen
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2511.19966
Pdf URL: https://arxiv.org/pdf/2511.19966
Copy Paste: [[2511.19966]] Stragglers Can Contribute More: Uncertainty-Aware Distillation for Asynchronous Federated Learning(https://arxiv.org/abs/2511.19966)
Keywords: robust, federate
Abstract: Asynchronous federated learning (FL) has recently gained attention for its enhanced efficiency and scalability, enabling local clients to send model updates to the server at their own pace without waiting for slower participants. However, such a design encounters significant challenges, such as the risk of outdated updates from straggler clients degrading the overall model performance and the potential bias introduced by faster clients dominating the learning process, especially under heterogeneous data distributions. Existing methods typically address only one of these issues, creating a conflict where mitigating the impact of outdated updates can exacerbate the bias created by faster clients, and vice versa. To address these challenges, we propose FedEcho, a novel framework that incorporates uncertainty-aware distillation to enhance the asynchronous FL performances under large asynchronous delays and data heterogeneity. Specifically, uncertainty-aware distillation enables the server to assess the reliability of predictions made by straggler clients, dynamically adjusting the influence of these predictions based on their estimated uncertainty. By prioritizing more certain predictions while still leveraging the diverse information from all clients, FedEcho effectively mitigates the negative impacts of outdated updates and data heterogeneity. Through extensive experiments, we demonstrate that FedEcho consistently outperforms existing asynchronous federated learning baselines, achieving robust performance without requiring access to private client data.

Title: VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

Authors: Yu Hu, Chong Cheng, Sicheng Yu, Xiaoyang Guo, Hao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19971
Pdf URL: https://arxiv.org/pdf/2511.19971
Copy Paste: [[2511.19971]] VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction(https://arxiv.org/abs/2511.19971)
Keywords: robust, transformer, segmentation
Abstract: Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D geometry, their performance drops markedly when moving objects dominate. Existing 4D approaches often rely on external priors, heavy post-optimization, or require fine-tuning on 4D datasets. In this paper, we propose VGGT4D, a training-free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT's global attention layers already implicitly encode rich, layer-wise dynamic cues. To obtain masks that decouple static and dynamic elements, we mine and amplify global dynamic cues via gram similarity and aggregate them across a temporal window. To further sharpen mask boundaries, we introduce a refinement strategy driven by projection gradient. We then integrate these precise masks into VGGT's early-stage inference, effectively mitigating motion interference in both pose estimation and geometric reconstruction. Across six datasets, our method achieves superior performance in dynamic object segmentation, camera pose estimation, and dense reconstruction. It also supports single-pass inference on sequences longer than 500 frames.

Title: EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

Authors: Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen, Aiping Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19982
Pdf URL: https://arxiv.org/pdf/2511.19982
Copy Paste: [[2511.19982]] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback(https://arxiv.org/abs/2511.19982)
Keywords: generative
Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.

Title: Rethinking Message Passing Neural Networks with Diffusion Distance-guided Stress Majorization

Authors: Haoran Zheng, Renchi Yang, Yubo Zhou, Jianliang Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.19984
Pdf URL: https://arxiv.org/pdf/2511.19984
Copy Paste: [[2511.19984]] Rethinking Message Passing Neural Networks with Diffusion Distance-guided Stress Majorization(https://arxiv.org/abs/2511.19984)
Keywords: diffusion
Abstract: Message passing neural networks (MPNNs) have emerged as go-to models for learning on graph-structured data in the past decade. Despite their effectiveness, most of such models still incur severe issues such as over-smoothing and -correlation, due to their underlying objective of minimizing the Dirichlet energy and the derived neighborhood aggregation operations. In this paper, we propose the DDSM, a new MPNN model built on an optimization framework that includes the stress majorization and orthogonal regularization for overcoming the above issues. Further, we introduce the diffusion distances for nodes into the framework to guide the new message passing operations and develop efficient algorithms for distance approximations, both backed by rigorous theoretical analyses. Our comprehensive experiments showcase that DDSM consistently and considerably outperforms 15 strong baselines on both homophilic and heterophilic graphs.

Title: $\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers

Authors: Xinyu Wang, Hanwei Wu, Qingchen Hu, Zhenghan Tai, Jingrui Tian, Lei Ding, Jijun Chi, Hailin He, Tung Sum Thomas Kwok, Yufei Cui, Sicheng Lyu, Muzhi Li, Mingze Li, Xinyue Yu, Ling Zhou, Peng Lu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2511.19987
Pdf URL: https://arxiv.org/pdf/2511.19987
Copy Paste: [[2511.19987]] $\text{R}^2\text{R}$: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers(https://arxiv.org/abs/2511.19987)
Keywords: robust
Abstract: Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.

Title: OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

Authors: Yaoli Liu, Ziheng Ouyang, Shengtao Lou, Yiren Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.19990
Pdf URL: https://arxiv.org/pdf/2511.19990
Copy Paste: [[2511.19990]] OmniRefiner: Reinforcement-Guided Local Diffusion Refinement(https://arxiv.org/abs/2511.19990)
Keywords: diffusion
Abstract: Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce \ourMthd{}, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that \ourMthd{} significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.

Title: Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test

Authors: Mihir Sahasrabudhe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.19997
Pdf URL: https://arxiv.org/pdf/2511.19997
Copy Paste: [[2511.19997]] Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test(https://arxiv.org/abs/2511.19997)
Keywords: transformer
Abstract: Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a "reversal curse," and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training-one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.

Title: A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media

Authors: Edward Ajayi, Martha Kachweka, Mawuli Deku, Emily Aiken
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2511.20001
Pdf URL: https://arxiv.org/pdf/2511.20001
Copy Paste: [[2511.20001]] A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media(https://arxiv.org/abs/2511.20001)
Keywords: robust, explainability, transformer
Abstract: Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous "split-then-balance" pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard ("Social Media Screener") designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.

Title: On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation

Authors: Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2511.20002
Pdf URL: https://arxiv.org/pdf/2511.20002
Copy Paste: [[2511.20002]] On the Feasibility of Hijacking MLLMs' Decision Chain via One Perturbation(https://arxiv.org/abs/2511.20002)
Keywords: attack, large language model
Abstract: Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake can be easily corrected, but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model's outputs toward multiple, predefined outcomes, such as simultaneously misclassifying "non-motorized lane" signs as "motorized lane" and "pedestrian" as "plastic bag". To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm, which searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability, achieving a 70% attack success rate when controlling five distinct targets using just an adversarial frame.

Title: Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network

Authors: Yuanzhe Li, Steffen Müller
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20008
Pdf URL: https://arxiv.org/pdf/2511.20008
Copy Paste: [[2511.20008]] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network(https://arxiv.org/abs/2511.20008)
Keywords: extraction, transformer
Abstract: Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. Depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature interactions. To account for the varying importance of different modalities and frames, modality attention and temporal attention are designed to selectively emphasize informative modalities and effectively capture temporal dependencies. Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, achieving superior performance compared to the baseline methods.

Title: Multi-Context Fusion Transformer for Pedestrian Crossing Intention Prediction in Urban Environments

Authors: Yuanzhe Li, Hang Zhong, Steffen Müller
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20011
Pdf URL: https://arxiv.org/pdf/2511.20011
Copy Paste: [[2511.20011]] Multi-Context Fusion Transformer for Pedestrian Crossing Intention Prediction in Urban Environments(https://arxiv.org/abs/2511.20011)
Keywords: transformer
Abstract: Pedestrian crossing intention prediction is essential for autonomous vehicles to improve pedestrian safety and reduce traffic accidents. However, accurate pedestrian intention prediction in urban environments remains challenging due to the multitude of factors affecting pedestrian behavior. In this paper, we propose a multi-context fusion Transformer (MFT) that leverages diverse numerical contextual attributes across four key dimensions, encompassing pedestrian behavior context, environmental context, pedestrian localization context and vehicle motion context, to enable accurate pedestrian intention prediction. MFT employs a progressive fusion strategy, where mutual intra-context attention enables reciprocal interactions within each context, thereby facilitating feature sequence fusion and yielding a context token as a context-specific representation. This is followed by mutual cross-context attention, which integrates features across contexts with a global CLS token serving as a compact multi-context representation. Finally, guided intra-context attention refines context tokens within each context through directed interactions, while guided cross-context attention strengthens the global CLS token to promote multi-context fusion via guided information propagation, yielding deeper and more efficient integration. Experimental results validate the superiority of MFT over state-of-the-art methods, achieving accuracy rates of 73%, 93%, and 90% on the JAADbeh, JAADall, and PIE datasets, respectively. Extensive ablation studies are further conducted to investigate the effectiveness of the network architecture and contribution of different input context. Our code is open-source: this https URL.

Title: iRadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization

Authors: Xiucheng Wang, Tingwei Yuan, Yang Cao, Nan Cheng, Ruijin Sun, Weihua Zhuang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2511.20015
Pdf URL: https://arxiv.org/pdf/2511.20015
Copy Paste: [[2511.20015]] iRadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization(https://arxiv.org/abs/2511.20015)
Keywords: diffusion, generative
Abstract: Radio maps (RMs) serve as environment-aware electromagnetic (EM) representations that connect scenario geometry and material properties to the spatial distribution of signal strength, enabling localization without costly in-situ measurements. However, constructing high-fidelity indoor RMs remains challenging due to the prohibitive latency of EM solvers and the limitations of learning-based methods, which often rely on sparse measurements or assumptions of homogeneous material, which are misaligned with the heterogeneous and multipath-rich nature of indoor environments. To overcome these challenges, we propose iRadioDiff, a sampling-free diffusion-based framework for indoor RM construction. iRadioDiff is conditioned on access point (AP) positions, and physics-informed prompt encoded by material reflection and transmission coefficients. It further incorporates multipath-critical priors, including diffraction points, strong transmission boundaries, and line-of-sight (LoS) contours, to guide the generative process via conditional channels and boundary-weighted objectives. This design enables accurate modeling of nonstationary field discontinuities and efficient construction of physically consistent RMs. Experiments demonstrate that iRadioDiff achieves state-of-the-art performance in indoor RM construction and received signal strength based indoor localization, which offers effective generalization across layouts and material configurations. Code is available at this https URL.

Title: ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction

Authors: Yuanzhe Li, Steffen Müller
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20020
Pdf URL: https://arxiv.org/pdf/2511.20020
Copy Paste: [[2511.20020]] ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction(https://arxiv.org/abs/2511.20020)
Keywords: extraction, transformer
Abstract: Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian's bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.

Title: WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving

Authors: Seungjun Yu, Seonho Lee, Namho Kim, Jaeyo Shin, Junsung Park, Wonjeong Ryu, Raehyuk Jung, Hyunjung Shim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20022
Pdf URL: https://arxiv.org/pdf/2511.20022
Copy Paste: [[2511.20022]] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving(https://arxiv.org/abs/2511.20022)
Keywords: large language model
Abstract: Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.

Title: SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM

Authors: Lin Chen, Yingjian Zhu, Qi Yang, Xin Niu, Kun Ding, Shiming Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20027
Pdf URL: https://arxiv.org/pdf/2511.20027
Copy Paste: [[2511.20027]] SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM(https://arxiv.org/abs/2511.20027)
Keywords: segmentation
Abstract: Open-vocabulary semantic segmentation (OVSS) aims to segment and recognize objects universally. Trained on extensive high-quality segmentation data, the segment anything model (SAM) has demonstrated remarkable universal segmentation capabilities, offering valuable support for OVSS. Although previous methods have made progress in leveraging SAM for OVSS, there are still some challenges: (1) SAM's tendency to over-segment and (2) hard combinations between fixed masks and labels. This paper introduces a novel mask-injected framework, SAM-MI, which effectively integrates SAM with OVSS models to address these challenges. Initially, SAM-MI employs a Text-guided Sparse Point Prompter to sample sparse prompts for SAM instead of previous dense grid-like prompts, thus significantly accelerating the mask generation process. The framework then introduces Shallow Mask Aggregation (SMAgg) to merge partial masks to mitigate the SAM's over-segmentation issue. Finally, Decoupled Mask Injection (DMI) incorporates SAM-generated masks for guidance at low-frequency and high-frequency separately, rather than directly combining them with labels. Extensive experiments on multiple benchmarks validate the superiority of SAM-MI. Notably, the proposed method achieves a 16.7% relative improvement in mIoU over Grounded-SAM on the MESS benchmark, along with a 1.6$\times$ speedup. We hope SAM-MI can serve as an alternative methodology to effectively equip the OVSS model with SAM.

Title: Cross-Contrastive Clustering for Multimodal Attributed Graphs with Dual Graph Filtering

Authors: Haoran Zheng, Renchi Yang, Hongtao Wang, Jianliang Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20030
Pdf URL: https://arxiv.org/pdf/2511.20030
Copy Paste: [[2511.20030]] Cross-Contrastive Clustering for Multimodal Attributed Graphs with Dual Graph Filtering(https://arxiv.org/abs/2511.20030)
Keywords: robust
Abstract: Multimodal Attributed Graphs (MMAGs) are an expressive data model for representing the complex interconnections among entities that associate attributes from multiple data modalities (text, images, etc.). Clustering over such data finds numerous practical applications in real scenarios, including social community detection, medical data analytics, etc. However, as revealed by our empirical studies, existing multi-view clustering solutions largely rely on the high correlation between attributes across various views and overlook the unique characteristics (e.g., low modality-wise correlation and intense feature-wise noise) of multimodal attributes output by large pre-trained language and vision models in MMAGs, leading to suboptimal clustering performance. Inspired by foregoing empirical observations and our theoretical analyses with graph signal processing, we propose the Dual Graph Filtering (DGF) scheme, which innovatively incorporates a feature-wise denoising component into node representation learning, thereby effectively overcoming the limitations of traditional graph filters adopted in the extant multi-view graph clustering approaches. On top of that, DGF includes a tri-cross contrastive training strategy that employs instance-level contrastive learning across modalities, neighborhoods, and communities for learning robust and discriminative node representations. Our comprehensive experiments on eight benchmark MMAG datasets exhibit that DGF is able to outperform a wide range of state-of-the-art baselines consistently and significantly in terms of clustering quality measured against ground-truth labels.

Title: MFM-point: Multi-scale Flow Matching for Point Cloud Generation

Authors: Petr Molodyk, Jaemoo Choi, David W. Romero, Ming-Yu Liu, Yongxin Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20041
Pdf URL: https://arxiv.org/pdf/2511.20041
Copy Paste: [[2511.20041]] MFM-point: Multi-scale Flow Matching for Point Cloud Generation(https://arxiv.org/abs/2511.20041)
Keywords: generative
Abstract: In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-point demonstrates strong results in multi-category and high-resolution generation tasks.

Title: RED-F: Reconstruction-Elimination based Dual-stream Contrastive Forecasting for Multivariate Time Series Anomaly Prediction

Authors: PengYu Chen, Xiaohou Shi, Yuan Chang, Yan Sun, Sajal K. Das
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20044
Pdf URL: https://arxiv.org/pdf/2511.20044
Copy Paste: [[2511.20044]] RED-F: Reconstruction-Elimination based Dual-stream Contrastive Forecasting for Multivariate Time Series Anomaly Prediction(https://arxiv.org/abs/2511.20044)
Keywords: robust
Abstract: The proactive prediction of anomalies (AP) in mul- tivariate time series (MTS) is a critical challenge to ensure system dependability. The difficulty lies in identifying subtle anomaly precursors concealed within normal signals. However, existing unsupervised methods, trained exclusively on normal data, demonstrate a fundamental propensity to reconstruct normal patterns. Consequently, when confronted with weak precursors, their predictions are dominated by the normal pattern, submerging the very signal required for prediction. To contend with the limitation, we propose RED-F, a Reconstruction- Elimination based Dual-stream Contrastive Forecasting frame- work, comprising the Reconstruction-Elimination Model (REM) and the Dual-stream Contrastive Forecasting Model (DFM). The REM utilizes a hybrid time-frequency mechanism to mitigate the precursor, generating a purified, normal-pattern baseline. The DFM then receives this purified baseline and the original sequence which retains the precursor as parallel inputs. At the core of our framework, RED-F employs a contrastive forecast that transforms the difficult task of absolute signal detection into a simpler, more robust task of relative trajectory comparison by computing the divergence between these two predictive streams. This contrastive mechanism serves to amplify the faint precursor signal. Furthermore, the DFM is trained with a novel Multi-Series Prediction (MSP) objective, which leverages distant future con- text to enhance its predictive sensitivity. Extensive experiments on six real-world datasets demonstrate the superior capability of RED-F in anomaly prediction tasks.

Title: PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

Authors: Simon Damm, Jonas Ricker, Henning Petzka, Asja Fischer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20068
Pdf URL: https://arxiv.org/pdf/2511.20068
Copy Paste: [[2511.20068]] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images(https://arxiv.org/abs/2511.20068)
Keywords: large language model
Abstract: Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, they allow for efficiently generating deceptively real-looking images, further increasing the need for reliable detection methods. However, to date there is a lack of work specifically targeting the detection of images generated by AR image generators. In this work, we present PRADA (Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images), a simple and interpretable approach that can reliably detect AR-generated images and attribute them to their respective source model. The key idea is to inspect the ratio of a model's conditional and unconditional probability for the autoregressive token sequence representing a given image. Whenever an image is generated by a particular model, its probability ratio shows unique characteristics which are not present for images generated by other models or real images. We exploit these characteristics for threshold-based attribution and detection by calibrating a simple, model-specific score function. Our experimental evaluation shows that PRADA is highly effective against eight class-to-image and four text-to-image models.

Title: MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

Authors: Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20072
Pdf URL: https://arxiv.org/pdf/2511.20072
Copy Paste: [[2511.20072]] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model(https://arxiv.org/abs/2511.20072)
Keywords: large language model
Abstract: Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.

Title: Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding

Authors: Jinghan Zhao, Yifei Huang, Feng Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20073
Pdf URL: https://arxiv.org/pdf/2511.20073
Copy Paste: [[2511.20073]] Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding(https://arxiv.org/abs/2511.20073)
Keywords: robust
Abstract: Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, 'task' and 'step' descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce 'states', i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to ground representations in states while associating them with steps and high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pretraining strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.

Title: More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering

Authors: Duc Anh Vu, Thong Nguyen, Cong-Duy Nguyen, Viet Anh Nguyen, Anh Tuan Luu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20086
Pdf URL: https://arxiv.org/pdf/2511.20086
Copy Paste: [[2511.20086]] More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering(https://arxiv.org/abs/2511.20086)
Keywords: large language model
Abstract: With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of all possible answers, ultimately degrading the models' reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two components: first, a reasoning generation stage, where the model is prompted to produce supportive reasonings for each answer option, and then, a reasoning-guided agreement stage, where the generated reasonings are synthesized to select the most plausible answer. Through comprehensive evaluations, BiasPrompting demonstrates significant improvements in five widely used multiple-choice question answering benchmarks. Our experiments showcase that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.

Title: Explainable Visual Anomaly Detection via Concept Bottleneck Models

Authors: Arianna Stropeni, Valentina Zaccaria, Francesco Borsatti, Davide Dalle Pezze, Manuel Barusco, Gian Antonio Susto
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20088
Pdf URL: https://arxiv.org/pdf/2511.20088
Copy Paste: [[2511.20088]] Explainable Visual Anomaly Detection via Concept Bottleneck Models(https://arxiv.org/abs/2511.20088)
Keywords: interpretability
Abstract: In recent years, Visual Anomaly Detection (VAD) has gained significant attention due to its ability to identify anomalous images using only normal images during training. Many VAD models work without supervision but are still able to provide visual explanations by highlighting the anomalous regions within an image. However, although these visual explanations can be helpful, they lack a direct and semantically meaningful interpretation for users. To address this limitation, we propose extending Concept Bottleneck Models (CBMs) to the VAD setting. By learning meaningful concepts, the network can provide human-interpretable descriptions of anomalies, offering a novel and more insightful way to explain them. Our contributions are threefold: (i) we develop a Concept Dataset to support research on CBMs for VAD; (ii) we improve the CBM architecture to generate both concept-based and visual explanations, bridging semantic and localization interpretability; and (iii) we introduce a pipeline for synthesizing artificial anomalies, preserving the VAD paradigm of minimizing dependence on rare anomalous samples. Our approach, Concept-Aware Visual Anomaly Detection (CONVAD), achieves performance comparable to classic VAD methods while providing richer, concept-driven explanations that enhance interpretability and trust in VAD systems.

Title: Exploring State-of-the-art models for Early Detection of Forest Fires

Authors: Sharjeel Ahmed, Daim Armaghan, Fatima Naweed, Umair Yousaf, Ahmad Zubair, Murtaza Taj
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20096
Pdf URL: https://arxiv.org/pdf/2511.20096
Copy Paste: [[2511.20096]] Exploring State-of-the-art models for Early Detection of Forest Fires(https://arxiv.org/abs/2511.20096)
Keywords: transformer
Abstract: There have been many recent developments in the use of Deep Learning Neural Networks for fire detection. In this paper, we explore an early warning system for detection of forest fires. Due to the lack of sizeable datasets and models tuned for this task, existing methods suffer from missed detection. In this work, we first propose a dataset for early identification of forest fires through visual analysis. Unlike existing image corpuses that contain images of wide-spread fire, our dataset consists of multiple instances of smoke plumes and fire that indicates the initiation of fire. We obtained this dataset synthetically by utilising game simulators such as Red Dead Redemption 2. We also combined our dataset with already published images to obtain a more comprehensive set. Finally, we compared image classification and localisation methods on the proposed dataset. More specifically we used YOLOv7 (You Only Look Once) and different models of detection transformer.

Title: QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression

Authors: Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
Subjects: cs.LG, cs.AR, cs.PL
Abstract URL: https://arxiv.org/abs/2511.20099
Pdf URL: https://arxiv.org/pdf/2511.20099
Copy Paste: [[2511.20099]] QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression(https://arxiv.org/abs/2511.20099)
Keywords: large language model
Abstract: Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.

Title: SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Authors: Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20102
Pdf URL: https://arxiv.org/pdf/2511.20102
Copy Paste: [[2511.20102]] SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space(https://arxiv.org/abs/2511.20102)
Keywords: large language model
Abstract: The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.

Title: The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs

Authors: Craig Dickson
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.20104
Pdf URL: https://arxiv.org/pdf/2511.20104
Copy Paste: [[2511.20104]] The Devil in the Details: Emergent Misalignment, Format and Coherence in Open-Weights LLMs(https://arxiv.org/abs/2511.20104)
Keywords: secure, robust
Abstract: Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment - a phenomenon termed "emergent misalignment" (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate if current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o's 20%. We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model's 'degrees of freedom' to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems.

Title: EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning

Authors: Xingfeng Li, Xiaohan Shi, Junjie Li, Yongwei Li, Masashi Unoki, Tomoki Toda, Masato Akagi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20106
Pdf URL: https://arxiv.org/pdf/2511.20106
Copy Paste: [[2511.20106]] EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning(https://arxiv.org/abs/2511.20106)
Keywords: robust
Abstract: This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora \textcolor{black}{that restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity}, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline codes are publicly available at this https URL.

Title: CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows

Authors: Hyeonjae Kim, Chenyue Li, Wen Deng, Mengxi Jin, Wen Huang, Mengqian Lu, Binhang Yuan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20109
Pdf URL: https://arxiv.org/pdf/2511.20109
Copy Paste: [[2511.20109]] CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows(https://arxiv.org/abs/2511.20109)
Keywords: robust
Abstract: Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility, thus, perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report with a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub-Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that our multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.

Title: LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening

Authors: Johannes Brandt, Maulik Chevli, Rickmer Braren, Georgios Kaissis, Philip Müller, Daniel Rueckert
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20116
Pdf URL: https://arxiv.org/pdf/2511.20116
Copy Paste: [[2511.20116]] LungEvaty: A Scalable, Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening(https://arxiv.org/abs/2511.20116)
Keywords: transformer
Abstract: Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.

Title: "When Data is Scarce, Prompt Smarter"... Approaches to Grammatical Error Correction in Low-Resource Settings

Authors: Somsubhra De, Harsh Kumar, Arun Prakash A
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20120
Pdf URL: https://arxiv.org/pdf/2511.20120
Copy Paste: [[2511.20120]] "When Data is Scarce, Prompt Smarter"... Approaches to Grammatical Error Correction in Low-Resource Settings(https://arxiv.org/abs/2511.20120)
Keywords: transformer, large language model
Abstract: Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task--ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.

Title: UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

Authors: Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20123
Pdf URL: https://arxiv.org/pdf/2511.20123
Copy Paste: [[2511.20123]] UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers(https://arxiv.org/abs/2511.20123)
Keywords: diffusion, transformer
Abstract: Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines largely across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.

Title: IDAP++: Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization

Authors: Aleksei Samarin, Artem Nazarenko, Egor Kotenko, Valentin Malykh, Alexander Savelev, Aleksei Toropov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20141
Pdf URL: https://arxiv.org/pdf/2511.20141
Copy Paste: [[2511.20141]] IDAP++: Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization(https://arxiv.org/abs/2511.20141)
Keywords: transformer
Abstract: This paper presents a novel approach to neural network compression that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information flow analysis. Building on the concept of tensor flow divergence, which quantifies how information is transformed across network layers, we develop a two-stage optimization process. The first stage employs iterative divergence-aware pruning to identify and remove redundant filters while preserving critical information pathways. The second stage extends this principle to higher-level architecture optimization by analyzing layer-wise contributions to information propagation and selectively eliminating entire layers that demonstrate minimal impact on network performance. The proposed method naturally adapts to diverse architectures, including convolutional networks, transformers, and hybrid designs, providing a consistent metric for comparing the structural importance across different layer types. Experimental validation across multiple modern architectures and datasets reveals that this combined approach achieves substantial model compression while maintaining competitive accuracy. The presented approach achieves parameter reduction results that are globally comparable to those of state-of-the-art solutions and outperforms them across a wide range of modern neural network architectures, from convolutional models to transformers. The results demonstrate how flow divergence serves as an effective guiding principle for both filter-level and layer-level optimization, offering practical benefits for deployment in resource-constrained environments.

Title: SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Gird-based Discontinuous NER Models

Authors: Wen-Fang Su, Hsiao-Wei Chou, Wen-Yang Lin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2511.20143
Pdf URL: https://arxiv.org/pdf/2511.20143
Copy Paste: [[2511.20143]] SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Gird-based Discontinuous NER Models(https://arxiv.org/abs/2511.20143)
Keywords: robust, extraction, segmentation
Abstract: Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.

Title: Hybrid Convolution and Frequency State Space Network for Image Compression

Authors: Haodong Pan, Hao Wei, Yusong Wang, Nanning Zheng, Caigui Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20151
Pdf URL: https://arxiv.org/pdf/2511.20151
Copy Paste: [[2511.20151]] Hybrid Convolution and Frequency State Space Network for Image Compression(https://arxiv.org/abs/2511.20151)
Keywords: transformer
Abstract: Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local high frequency details, whereas Transformers and SSMs provide strong long range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long range low frequency information. The VFSS block combines an Omni directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate distortion performance compared with recent SSM based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD rate over the VTM anchor by 18.06, 24.56 and 22.44 percent, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.

Title: Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Authors: Arnela Hadzic, Franz Thaler, Lea Bogensperger, Simon Johannes Joham, Martin Urschler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20152
Pdf URL: https://arxiv.org/pdf/2511.20152
Copy Paste: [[2511.20152]] Restora-Flow: Mask-Guided Image Restoration with Flow Matching(https://arxiv.org/abs/2511.20152)
Keywords: diffusion, generative
Abstract: Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.

Title: Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data

Authors: Xin Hong, Ying Shi, Yinhao Li, Yen-Wei Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20154
Pdf URL: https://arxiv.org/pdf/2511.20154
Copy Paste: [[2511.20154]] Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data(https://arxiv.org/abs/2511.20154)
Keywords: robust
Abstract: The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease this http URL existing imaging-based disease prediction models operate in Euclidean space, which assumes a flat representation of data and fails to fully capture the intrinsic continuity and nonlinear geometric structure of irregularly sampled longitudinal images. To address the challenge of modeling Alzheimers disease (AD) progression from irregularly sampled longitudinal structural Magnetic Resonance Imaging (sMRI) data, we propose a Riemannian manifold mapping, a Time-aware manifold Neural ordinary differential equation, and an Attention-based riemannian Gated recurrent unit (R-TNAG) framework. Our approach first projects features extracted from high-dimensional sMRI into a manifold space to preserve the intrinsic geometry of disease progression. On this representation, a time-aware Neural Ordinary Differential Equation (TNODE) models the continuous evolution of latent states between observations, while an Attention-based Riemannian Gated Recurrent Unit (ARGRU) adaptively integrates historical and current information to handle irregular intervals. This joint design improves temporal consistency and yields robust AD trajectory prediction under irregular this http URL results demonstrate that the proposed method consistently outperforms state-of-the-art models in both disease status prediction and cognitive score regression. Ablation studies verify the contributions of each module, highlighting their complementary roles in enhancing predictive accuracy. Moreover, the model exhibits stable performance across varying sequence lengths and missing data rates, indicating strong temporal generalizability. Cross-dataset validation further confirms its robustness and applicability in diverse clinical settings.

Title: SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Authors: Da Li, Ji-Ping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Shen Xi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20157
Pdf URL: https://arxiv.org/pdf/2511.20157
Copy Paste: [[2511.20157]] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery(https://arxiv.org/abs/2511.20157)
Keywords: transformer
Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: this https URL.

Title: Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

Authors: Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20158
Pdf URL: https://arxiv.org/pdf/2511.20158
Copy Paste: [[2511.20158]] Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs(https://arxiv.org/abs/2511.20158)
Keywords: large language model
Abstract: While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.

Title: On the Limits of Momentum in Decentralized and Federated Optimization

Authors: Riccardo Zaccone, Sai Praneeth Karimireddy, Carlo Masone
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20168
Pdf URL: https://arxiv.org/pdf/2511.20168
Copy Paste: [[2511.20168]] On the Limits of Momentum in Decentralized and Federated Optimization(https://arxiv.org/abs/2511.20168)
Keywords: federate
Abstract: Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than $\Theta\left(1/t\right)$ leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.

Title: Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware

Authors: Federico Paredes-Valles, Yoshitaka Miyatani, Kirk Y. W. Scheper
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20175
Pdf URL: https://arxiv.org/pdf/2511.20175
Copy Paste: [[2511.20175]] Realizing Fully-Integrated, Low-Power, Event-Based Pupil Tracking with Neuromorphic Hardware(https://arxiv.org/abs/2511.20175)
Keywords: robust
Abstract: Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.

Title: Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

Authors: Mohammad Mahdi, Yuqian Fu, Nedko Savov, Jiancheng Pan, Danda Pani Paudel, Luc Van Gool
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20186
Pdf URL: https://arxiv.org/pdf/2511.20186
Copy Paste: [[2511.20186]] Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis(https://arxiv.org/abs/2511.20186)
Keywords: generative
Abstract: Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric(Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment(EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Ego2Exo synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.

Title: In-Context Compositional Learning via Sparse Coding Transformer

Authors: Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20194
Pdf URL: https://arxiv.org/pdf/2511.20194
Copy Paste: [[2511.20194]] In-Context Compositional Learning via Sparse Coding Transformer(https://arxiv.org/abs/2511.20194)
Keywords: transformer
Abstract: Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target problems by inferring compositional rules from context examples, which are composed of basic components structured by underlying rules. However, some of these tasks remain challenging for Transformers, which are not inherently designed to handle compositional tasks and offer limited structural inductive bias. In this work, inspired by the principle of sparse coding, we propose a reformulation of the attention to enhance its capability for compositional tasks. In sparse coding, data are represented as sparse combinations of dictionary atoms with coefficients that capture their compositional rules. Specifically, we reinterpret the attention block as a mapping of inputs into outputs through projections onto two sets of learned dictionary atoms: an encoding dictionary and a decoding dictionary. The encoding dictionary decomposes the input into a set of coefficients, which represent the compositional structure of the input. To enhance structured representations, we impose sparsity on these coefficients. The sparse coefficients are then used to linearly combine the decoding dictionary atoms to generate the output. Furthermore, to assist compositional generalization tasks, we propose estimating the coefficients of the target problem as a linear combination of the coefficients obtained from the context examples. We demonstrate the effectiveness of our approach on the S-RAVEN and RAVEN datasets. For certain compositional generalization tasks, our method maintains performance even when standard Transformers fail, owing to its ability to learn and apply compositional rules.

Title: GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

Authors: Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20201
Pdf URL: https://arxiv.org/pdf/2511.20201
Copy Paste: [[2511.20201]] GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering(https://arxiv.org/abs/2511.20201)
Keywords: interpretability
Abstract: We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.

Title: Robust 3D Brain MRI Inpainting with Random Masking Augmentation

Authors: Juexin Zhang, Ying Weng, Ke Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20202
Pdf URL: https://arxiv.org/pdf/2511.20202
Copy Paste: [[2511.20202]] Robust 3D Brain MRI Inpainting with Random Masking Augmentation(https://arxiv.org/abs/2511.20202)
Keywords: secure, robust
Abstract: The ASNR-MICCAI BraTS-Inpainting Challenge was established to mitigate dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. This paper details our submission to the 2025 challenge, a novel deep learning framework for synthesizing healthy tissue in 3D scans. The core of our method is a U-Net architecture trained to inpaint synthetically corrupted regions, enhanced with a random masking augmentation strategy to improve generalization. Quantitative evaluation confirmed the efficacy of our approach, yielding an SSIM of 0.873$\pm$0.004, a PSNR of 24.996$\pm$4.694, and an MSE of 0.005$\pm$0.087 on the validation set. On the final online test set, our method achieved an SSIM of 0.919$\pm$0.088, a PSNR of 26.932$\pm$5.057, and an RMSE of 0.052$\pm$0.026. This performance secured first place in the BraTS-Inpainting 2025 challenge and surpassed the winning solutions from the 2023 and 2024 competitions on the official leaderboard.

Title: OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

Authors: Hao Yu, Jiabo Zhan, Zile Wang, Jinglin Wang, Huaisong Zhang, Hongyu Li, Xinrui Chen, Yongxian Wei, Chun Yuan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20211
Pdf URL: https://arxiv.org/pdf/2511.20211
Copy Paste: [[2511.20211]] OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation(https://arxiv.org/abs/2511.20211)
Keywords: diffusion, transformer, generative
Abstract: Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.

Title: Text-guided Controllable Diffusion for Realistic Camouflage Images Generation

Authors: Yuhang Qian, Haiyan Chen, Wentong Li, Ningzhong Liu, Jie Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20218
Pdf URL: https://arxiv.org/pdf/2511.20218
Copy Paste: [[2511.20218]] Text-guided Controllable Diffusion for Realistic Camouflage Images Generation(https://arxiv.org/abs/2511.20218)
Keywords: diffusion
Abstract: Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG's ability to produce photorealistic camouflage images.

Title: Communication-Efficient Learning for Satellite Constellations

Authors: Ruxandra-Stefania Tudose, Moritz H.W. Grüss, Grace Ra Kim, Karl H. Johansson, Nicola Bastianello
Subjects: cs.LG, eess.SY, math.OC
Abstract URL: https://arxiv.org/abs/2511.20220
Pdf URL: https://arxiv.org/pdf/2511.20220
Copy Paste: [[2511.20220]] Communication-Efficient Learning for Satellite Constellations(https://arxiv.org/abs/2511.20220)
Keywords: federate
Abstract: Satellite constellations in low-Earth orbit are now widespread, enabling positioning, Earth imaging, and communications. In this paper we address the solution of learning problems using these satellite constellations. In particular, we focus on a federated approach, where satellites collect and locally process data, with the ground station aggregating local models. We focus on designing a novel, communication-efficient algorithm that still yields accurate trained models. To this end, we employ several mechanisms to reduce the number of communications with the ground station (local training) and their size (compression). We then propose an error feedback mechanism that enhances accuracy, which yields, as a byproduct, an algorithm-agnostic error feedback scheme that can be more broadly applied. We analyze the convergence of the resulting algorithm, and compare it with the state of the art through simulations in a realistic space scenario, showcasing superior performance.

Title: Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder

Authors: Juexin Zhang, Qifeng Zhong, Ying Weng, Ke Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20221
Pdf URL: https://arxiv.org/pdf/2511.20221
Copy Paste: [[2511.20221]] Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder(https://arxiv.org/abs/2511.20221)
Keywords: secure, transformer
Abstract: The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model's performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.

Title: V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

Authors: Sen Nie, Jie Zhang, Jianxin Yan, Shiguang Shan, Xilin Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20223
Pdf URL: https://arxiv.org/pdf/2511.20223
Copy Paste: [[2511.20223]] V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs(https://arxiv.org/abs/2511.20223)
Keywords: attack, transformer
Abstract: Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V's intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available this https URL.

Title: Improving the Identification of Real-world Malware's DNS Covert Channels Using Locality Sensitive Hashing

Authors: Pascal Ruffing, Denis Petrov, Sebastian Zillien, Steffen Wendzel
Subjects: cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2511.20229
Pdf URL: https://arxiv.org/pdf/2511.20229
Copy Paste: [[2511.20229]] Improving the Identification of Real-world Malware's DNS Covert Channels Using Locality Sensitive Hashing(https://arxiv.org/abs/2511.20229)
Keywords: robust, steal
Abstract: Nowadays, malware increasingly uses DNS-based covert channels in order to evade detection and maintain stealthy communication with its command-and-control servers. While prior work has focused on detecting such activity, identifying specific malware families and their behaviors from captured network traffic remains challenging due to the variability of DNS. In this paper, we present the first application of Locality Sensitive Hashing to the detection and identification of real-world malware utilizing DNS covert channels. Our approach encodes DNS subdomain sequences into statistical similarity features that effectively capture anomalies indicative of malicious activity. Combined with a Random Forest classifier, our method achieves higher accuracy and reduced false positive rates than prior approaches, while demonstrating improved robustness and generalization to previously unseen or modified malware samples. We further demonstrate that our approach enables reliable classification of malware behavior (e.g., uploading or downloading of files), based solely on DNS subdomains.

Title: REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance

Authors: Chuyi Kong, Gao Wei, Jing Ma, Hongzhan Lin, Zhiyuan Fan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20233
Pdf URL: https://arxiv.org/pdf/2511.20233
Copy Paste: [[2511.20233]] REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance(https://arxiv.org/abs/2511.20233)
Keywords: interpretability, large language model
Abstract: The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, which is crucial for real-time use. To address these challenges, we propose REason-guided Fact-checking with Latent EXplanations REFLEX paradigm, a plug-and-play, self-refining paradigm that leverages the internal knowledge in backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, RELFEX achieves state-of-the-art performance. Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.

Title: HistoSpeckle-Net: Mutual Information-Guided Deep Learning for high-fidelity reconstruction of complex OrganAMNIST images via perturbed Multimode Fibers

Authors: Jawaria Maqbool, M. Imran Cheema
Subjects: cs.CV, physics.optics
Abstract URL: https://arxiv.org/abs/2511.20245
Pdf URL: https://arxiv.org/pdf/2511.20245
Copy Paste: [[2511.20245]] HistoSpeckle-Net: Mutual Information-Guided Deep Learning for high-fidelity reconstruction of complex OrganAMNIST images via perturbed Multimode Fibers(https://arxiv.org/abs/2511.20245)
Keywords: robust
Abstract: Existing deep learning methods in multimode fiber (MMF) imaging often focus on simpler datasets, limiting their applicability to complex, real-world imaging tasks. These models are typically data-intensive, a challenge that becomes more pronounced when dealing with diverse and complex images. In this work, we propose HistoSpeckle-Net, a deep learning architecture designed to reconstruct structurally rich medical images from MMF speckles. To build a clinically relevant dataset, we develop an optical setup that couples laser light through a spatial light modulator (SLM) into an MMF, capturing output speckle patterns corresponding to input OrganAMNIST images. Unlike previous MMF imaging approaches, which have not considered the underlying statistics of speckles and reconstructed images, we introduce a distribution-aware learning strategy. We employ a histogram-based mutual information loss to enhance model robustness and reduce reliance on large datasets. Our model includes a histogram computation unit that estimates smooth marginal and joint histograms for calculating mutual information loss. It also incorporates a unique Three-Scale Feature Refinement Module, which leads to multiscale Structural Similarity Index Measure (SSIM) loss computation. Together, these two loss functions enhance both the structural fidelity and statistical alignment of the reconstructed images. Our experiments on the complex OrganAMNIST dataset demonstrate that HistoSpeckle-Net achieves higher fidelity than baseline models such as U-Net and Pix2Pix. It gives superior performance even with limited training samples and across varying fiber bending conditions. By effectively reconstructing complex anatomical features with reduced data and under fiber perturbations, HistoSpeckle-Net brings MMF imaging closer to practical deployment in real-world clinical environments.

Title: Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation

Authors: Daniel Kienzle, Katja Ludwig, Julian Lorenz, Shin'ichi Satoh, Rainer Lienhart
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20250
Pdf URL: https://arxiv.org/pdf/2511.20250
Copy Paste: [[2511.20250]] Uplifting Table Tennis: A Robust, Real-World Application for 3D Trajectory and Spin Estimation(https://arxiv.org/abs/2511.20250)
Keywords: robust
Abstract: Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.

Title: Hey there! You are using WhatsApp: Enumerating Three Billion Accounts for Security and Privacy

Authors: Gabriel K. Gegenhuber, Philipp É. Frenzel, Maximilian Günther, Johanna Ullrich, Aljosha Judmayer
Subjects: cs.CR, cs.CY
Abstract URL: https://arxiv.org/abs/2511.20252
Pdf URL: https://arxiv.org/pdf/2511.20252
Copy Paste: [[2511.20252]] Hey there! You are using WhatsApp: Enumerating Three Billion Accounts for Security and Privacy(https://arxiv.org/abs/2511.20252)
Keywords: secure, security, privacy, defense
Abstract: WhatsApp, with 3.5 billion active accounts as of early 2025, is the world's largest instant messaging platform. Given its massive user base, WhatsApp plays a critical role in global communication. To initiate conversations, users must first discover whether their contacts are registered on the platform. This is achieved by querying WhatsApp's servers with mobile phone numbers extracted from the user's address book (if they allowed access). This architecture inherently enables phone number enumeration, as the service must allow legitimate users to query contact availability. While rate limiting is a standard defense against abuse, we revisit the problem and show that WhatsApp remains highly vulnerable to enumeration at scale. In our study, we were able to probe over a hundred million phone numbers per hour without encountering blocking or effective rate limiting. Our findings demonstrate not only the persistence but the severity of this vulnerability. We further show that nearly half of the phone numbers disclosed in the 2021 Facebook data leak are still active on WhatsApp, underlining the enduring risks associated with such exposures. Moreover, we were able to perform a census of WhatsApp users, providing a glimpse on the macroscopic insights a large messaging service is able to generate even though the messages themselves are end-to-end encrypted. Using the gathered data, we also discovered the re-use of certain X25519 keys across different devices and phone numbers, indicating either insecure (custom) implementations, or fraudulent activity. In this updated version of the paper, we also provide insights into the collaborative remediation process through which we confirmed that the underlying rate-limiting issue had been resolved.

Title: XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface

Authors: Alexander C. Jenke, Gregor Just, Claas de Boer, Martin Wagner, Sebastian Bodenstedt, Stefanie Speidel
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20254
Pdf URL: https://arxiv.org/pdf/2511.20254
Copy Paste: [[2511.20254]] XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface(https://arxiv.org/abs/2511.20254)
Keywords: extraction
Abstract: Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available.

Title: Interpretable Air Pollution Forecasting by Physics-Guided Spatiotemporal Decoupling

Authors: Zhiguo Zhang, Xiaoliang Ma, Daniel Schlesinger
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20257
Pdf URL: https://arxiv.org/pdf/2511.20257
Copy Paste: [[2511.20257]] Interpretable Air Pollution Forecasting by Physics-Guided Spatiotemporal Decoupling(https://arxiv.org/abs/2511.20257)
Keywords: interpretability
Abstract: Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model's integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.

Title: Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization

Authors: Xiaohan Wang, Zhangtao Cheng, Ting Zhong, Leiting Chen, Fan Zhou
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20258
Pdf URL: https://arxiv.org/pdf/2511.20258
Copy Paste: [[2511.20258]] Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization(https://arxiv.org/abs/2511.20258)
Keywords: robust
Abstract: Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steer convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.

Title: Advancing Image Classification with Discrete Diffusion Classification Modeling

Authors: Omer Belhasin, Shelly Golan, Ran El-Yaniv, Michael Elad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20263
Pdf URL: https://arxiv.org/pdf/2511.20263
Copy Paste: [[2511.20263]] Advancing Image Classification with Discrete Diffusion Classification Modeling(https://arxiv.org/abs/2511.20263)
Keywords: diffusion
Abstract: Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at this https URL .

Title: DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection

Authors: Amirhossein Khadivi Noghredeh, Abdollah Safari, Fatemeh Ziaeetabar, Firoozeh Haghighi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20270
Pdf URL: https://arxiv.org/pdf/2511.20270
Copy Paste: [[2511.20270]] DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection(https://arxiv.org/abs/2511.20270)
Keywords: segmentation
Abstract: Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.

Title: VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs

Authors: Tianxiang Jiang, Sheng Xia, Yicheng Xu, Linquan Wu, Xiangyu Zeng, Limin Wang, Yu Qiao, Yi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20272
Pdf URL: https://arxiv.org/pdf/2511.20272
Copy Paste: [[2511.20272]] VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLMs(https://arxiv.org/abs/2511.20272)
Keywords: large language model
Abstract: While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world's underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.

Title: Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits

Authors: Areeb Ahmad, Abhinav Joshi, Ashutosh Modi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.20273
Pdf URL: https://arxiv.org/pdf/2511.20273
Copy Paste: [[2511.20273]] Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits(https://arxiv.org/abs/2511.20273)
Keywords: interpretability, transformer
Abstract: Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph, that are previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.

Title: ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

Authors: Advik Sinha, Saurabh Atreya, Aashutosh A V, Sk Aziz Ali, Abhijit Das
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20274
Pdf URL: https://arxiv.org/pdf/2511.20274
Copy Paste: [[2511.20274]] ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis(https://arxiv.org/abs/2511.20274)
Keywords: robust
Abstract: Until recently, the general corpus of CLIP-type fundamental models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as SINGLE-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data, and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground action, object, and relations, filled by manual and automatic curation. We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetune performance on various domain-specific tasks. Our code and dataset are available at this https URL

Title: HVAdam: A Full-Dimension Adaptive Optimizer

Authors: Yiheng Zhang, Shaowu Wu, Yuanzhuo Xu, Jiajun Wu, Shang Xu, Steve Drew, Xiaoguang Niu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20277
Pdf URL: https://arxiv.org/pdf/2511.20277
Copy Paste: [[2511.20277]] HVAdam: A Full-Dimension Adaptive Optimizer(https://arxiv.org/abs/2511.20277)
Keywords: robust, diffusion, large language model
Abstract: Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity , allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.

Title: DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion

Authors: Yinghui Li, Qianyu Zhou, Di Shao, Hao Yang, Ye Zhu, Richard Dazeley, Xuequan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20278
Pdf URL: https://arxiv.org/pdf/2511.20278
Copy Paste: [[2511.20278]] DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion(https://arxiv.org/abs/2511.20278)
Keywords: transformer
Abstract: Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffer from limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of State Space Models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC will encounter several challenges: directly serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. Besides, the overlook of designs in the learning domain-agnostic representations hinders the adaptation performance. To address these issues, we propose a novel framework, DAPointMamba for DA PCC, that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.

Title: SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors

Authors: Fabian Gülhan, Emil Mededovic, Yuli Wu, Johannes Stegmaier
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20279
Pdf URL: https://arxiv.org/pdf/2511.20279
Copy Paste: [[2511.20279]] SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors(https://arxiv.org/abs/2511.20279)
Keywords: transformer
Abstract: Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.

Title: Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement

Authors: Yang Liu, Xilin Zhao, Peisong Wen, Siran Dai, Qingming Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20280
Pdf URL: https://arxiv.org/pdf/2511.20280
Copy Paste: [[2511.20280]] Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement(https://arxiv.org/abs/2511.20280)
Keywords: large language model
Abstract: Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.

Title: Can LLMs Make (Personalized) Access Control Decisions?

Authors: Friederike Groschupp, Daniele Lain, Aritra Dhar, Lara Magdalena Lazier, Srdjan Čapkun
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20284
Pdf URL: https://arxiv.org/pdf/2511.20284
Copy Paste: [[2511.20284]] Can LLMs Make (Personalized) Access Control Decisions?(https://arxiv.org/abs/2511.20284)
Keywords: security, privacy, large language model
Abstract: Precise access control decisions are crucial to the security of both traditional applications and emerging agent-based systems. Typically, these decisions are made by users during app installation or at runtime. Due to the increasing complexity and automation of systems, making these access control decisions can add a significant cognitive load on users, often overloading them and leading to suboptimal or even arbitrary access control decisions. To address this problem, we propose to leverage the processing and reasoning capabilities of large language models (LLMs) to make dynamic, context-aware decisions aligned with the user's security preferences. For this purpose, we conducted a user study, which resulted in a dataset of 307 natural-language privacy statements and 14,682 access control decisions made by users. We then compare these decisions against those made by two versions of LLMs: a general and a personalized one, for which we also gathered user feedback on 1,446 of its decisions. Our results show that in general, LLMs can reflect users' preferences well, achieving up to 86\% accuracy when compared to the decision made by the majority of users. Our study also reveals a crucial trade-off in personalizing such a system: while providing user-specific privacy preferences to the LLM generally improves agreement with individual user decisions, adhering to those preferences can also violate some security best practices. Based on our findings, we discuss design and risk considerations for implementing a practical natural-language-based access control system that balances personalization, security, and utility.

Title: APT-CGLP: Advanced Persistent Threat Hunting via Contrastive Graph-Language Pre-Training

Authors: Xuebo Qiu, Mingqi Lv, Yimei Zhang, Tieming Chen, Tiantian Zhu, Qijie Song, Shouling Ji
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.20290
Pdf URL: https://arxiv.org/pdf/2511.20290
Copy Paste: [[2511.20290]] APT-CGLP: Advanced Persistent Threat Hunting via Contrastive Graph-Language Pre-Training(https://arxiv.org/abs/2511.20290)
Keywords: attack, extraction, large language model
Abstract: Provenance-based threat hunting identifies Advanced Persistent Threats (APTs) on endpoints by correlating attack patterns described in Cyber Threat Intelligence (CTI) with provenance graphs derived from system audit logs. A fundamental challenge in this paradigm lies in the modality gap--the structural and semantic disconnect between provenance graphs and CTI reports. Prior work addresses this by framing threat hunting as a graph matching task: 1) extracting attack graphs from CTI reports, and 2) aligning them with provenance graphs. However, this pipeline incurs severe \textit{information loss} during graph extraction and demands intensive manual curation, undermining scalability and effectiveness. In this paper, we present APT-CGLP, a novel cross-modal APT hunting system via Contrastive Graph-Language Pre-training, facilitating end-to-end semantic matching between provenance graphs and CTI reports without human intervention. First, empowered by the Large Language Model (LLM), APT-CGLP mitigates data scarcity by synthesizing high-fidelity provenance graph-CTI report pairs, while simultaneously distilling actionable insights from noisy web-sourced CTIs to improve their operational utility. Second, APT-CGLP incorporates a tailored multi-objective training algorithm that synergizes contrastive learning with inter-modal masked modeling, promoting cross-modal attack semantic alignment at both coarse- and fine-grained levels. Extensive experiments on four real-world APT datasets demonstrate that APT-CGLP consistently outperforms state-of-the-art threat hunting baselines in terms of accuracy and efficiency.

Title: CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Authors: Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20302
Pdf URL: https://arxiv.org/pdf/2511.20302
Copy Paste: [[2511.20302]] CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation(https://arxiv.org/abs/2511.20302)
Keywords: segmentation
Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (\eg, spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module's importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.

Title: TReFT: Taming Rectified Flow Models For One-Step Image Translation

Authors: Shengqian Li, Ming Gao, Yi Liu, Zuzeng Lin, Feng Wang, Feng Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20307
Pdf URL: https://arxiv.org/pdf/2511.20307
Copy Paste: [[2511.20307]] TReFT: Taming Rectified Flow Models For One-Step Image Translation(https://arxiv.org/abs/2511.20307)
Keywords: diffusion
Abstract: Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output-a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to sota methods across multiple image translation datasets while enabling real-time inference.

Title: A Reality Check on SBOM-based Vulnerability Management: An Empirical Study and A Path Forward

Authors: Li Zhou, Marc Dacier, Charalambos Konstantinou
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.20313
Pdf URL: https://arxiv.org/pdf/2511.20313
Copy Paste: [[2511.20313]] A Reality Check on SBOM-based Vulnerability Management: An Empirical Study and A Path Forward(https://arxiv.org/abs/2511.20313)
Keywords: security
Abstract: The Software Bill of Materials (SBOM) is a critical tool for securing the software supply chain (SSC), but its practical utility is undermined by inaccuracies in both its generation and its application in vulnerability scanning. This paper presents a large-scale empirical study on 2,414 open-source repositories to address these issues from a practical standpoint. First, we demonstrate that using lock files with strong package managers enables the generation of accurate and consistent SBOMs, establishing a reliable foundation for security analysis. Using this high-fidelity foundation, however, we expose a more fundamental flaw in practice: downstream vulnerability scanners produce a staggering 97.5\% false positive rate. We pinpoint the primary cause as the flagging of vulnerabilities within unreachable code. We then demonstrate that function call analysis can effectively prune 63.3\% of these false alarms. Our work validates a practical, two-stage approach for SSC security: first, generate an accurate SBOM using lock files and strong package managers, and second, enrich it with function call analysis to produce actionable, low-noise vulnerability reports that alleviate developers' alert fatigue.

Title: Geometry of Decision Making in Language Models

Authors: Abhinav Joshi, Divyanshu Bhatt, Ashutosh Modi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.20315
Pdf URL: https://arxiv.org/pdf/2511.20315
Copy Paste: [[2511.20315]] Geometry of Decision Making in Language Models(https://arxiv.org/abs/2511.20315)
Keywords: transformer, large language model
Abstract: Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of \textit{intrinsic dimension} (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study, with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.

Title: IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection

Authors: Xuelin Qian, Jiaming Lu, Zixuan Wang, Wenxuan Wang, Zhongling Huang, Dingwen Zhang, Junwei Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20319
Pdf URL: https://arxiv.org/pdf/2511.20319
Copy Paste: [[2511.20319]] IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection(https://arxiv.org/abs/2511.20319)
Keywords: robust, transformer
Abstract: Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (\emph{e.g.}, day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.

Title: ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation

Authors: Onur Berk Tore, Ibrahim Samil Yalciner, Server Calap
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20335
Pdf URL: https://arxiv.org/pdf/2511.20335
Copy Paste: [[2511.20335]] ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation(https://arxiv.org/abs/2511.20335)
Keywords: robust
Abstract: Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf monitoring and product alignment. In this paper, we present a deep learning framework that predicts a 4-point parameterized homography matrix to rectify shelf images captured from arbitrary angles. Our model leverages a ConvNeXt-based backbone for enhanced feature representation and adopts normalized coordinate regression for improved stability. To address data scarcity and promote generalization, we introduce a novel augmentation strategy by modeling and sampling synthetic homographies. Our method achieves a mean corner error of 1.298 pixels on the test set. When compared with both classical computer vision and deep learning-based approaches, our method demonstrates competitive performance in both accuracy and inference speed. Together, these results establish our approach as a robust and efficient solution for realworld single-view rectification. To encourage further research in this domain, we will make our dataset, ShelfRectSet, and code publicly available

Title: The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models

Authors: Taewhoo Lee, Minju Song, Chanwoong Yoon, Jungwoo Park, Jaewoo Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20344
Pdf URL: https://arxiv.org/pdf/2511.20344
Copy Paste: [[2511.20344]] The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models(https://arxiv.org/abs/2511.20344)
Keywords: large language model
Abstract: Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. First, LLMs effectively encode the underlying relationships between analogous entities; both attributive and relational information propagate through mid-upper layers in correct cases, whereas reasoning failures reflect missing relational information within these layers. Second, unlike humans, LLMs often struggle not only when relational information is missing, but also when attempting to apply it to new entities. In such cases, strategically patching hidden representations at critical token positions can facilitate information transfer to a certain extent. Lastly, successful analogical reasoning in LLMs is marked by strong structural alignment between analogous situations, whereas failures often reflect degraded or misplaced alignment. Overall, our findings reveal that LLMs exhibit emerging but limited capabilities in encoding and applying high-level relational concepts, highlighting both parallels and gaps with human cognition.

Title: Soft Adaptive Policy Optimization

Authors: Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2511.20347
Pdf URL: https://arxiv.org/pdf/2511.20347
Copy Paste: [[2511.20347]] Soft Adaptive Policy Optimization(https://arxiv.org/abs/2511.20347)
Keywords: large language model
Abstract: Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance-a phenomenon exacerbated in Mixture-of-Experts models-leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.

Title: From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations

Authors: Zhiqing Guo, Dongdong Xi, Songlin Li, Gaobo Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20359
Pdf URL: https://arxiv.org/pdf/2511.20359
Copy Paste: [[2511.20359]] From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations(https://arxiv.org/abs/2511.20359)
Keywords: robust, extraction
Abstract: Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world this http URL contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.

Title: MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers

Authors: Audrey Pei-Hsuan Chen
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2511.20382
Pdf URL: https://arxiv.org/pdf/2511.20382
Copy Paste: [[2511.20382]] MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers(https://arxiv.org/abs/2511.20382)
Keywords: robust, transformer, generative
Abstract: Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with scArches, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.

Title: FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

Authors: Xinwan Wen, Bowen Li, Jiajun Luo, Ye Li, Zhi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20390
Pdf URL: https://arxiv.org/pdf/2511.20390
Copy Paste: [[2511.20390]] FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers(https://arxiv.org/abs/2511.20390)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs' feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet-$512^2$ show that FREE achieves up to $1.86 \times$ acceleration, and FREE (relax) further reaches $2.25 \times$ speedup while maintaining high perceptual and quantitative fidelity in generation quality.

Title: BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

Authors: Abdullah Al Sefat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20399
Pdf URL: https://arxiv.org/pdf/2511.20399
Copy Paste: [[2511.20399]] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali(https://arxiv.org/abs/2511.20399)
Keywords: robust, large language model
Abstract: Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

Title: Short-Range Oversquashing

Authors: Yaaqov Mishayev, Yonatan Sverdlov, Tal Amir, Nadav Dym
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20406
Pdf URL: https://arxiv.org/pdf/2511.20406
Copy Paste: [[2511.20406]] Short-Range Oversquashing(https://arxiv.org/abs/2511.20406)
Keywords: transformer
Abstract: Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limitation has led some researchers to advocate Graph Transformers as a better alternative, whereas others suggest that it can be mitigated within the MPNN framework, using virtual nodes or other rewiring techniques. In this work, we demonstrate that oversquashing is not limited to long-range tasks, but can also arise in short-range problems. This observation allows us to disentangle two distinct mechanisms underlying oversquashing: (1) the bottleneck phenomenon, which can arise even in low-range settings, and (2) the vanishing gradient phenomenon, which is closely associated with long-range tasks. We further show that the short-range bottleneck effect is not captured by existing explanations for oversquashing, and that adding virtual nodes does not resolve it. In contrast, transformers do succeed in such tasks, positioning them as the more compelling solution to oversquashing, compared to specialized MPNNs.

Title: Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

Authors: Bao Tang, Shuai Zhang, Yueting Zhu, Jijun Xiang, Xin Yang, Li Yu, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20410
Pdf URL: https://arxiv.org/pdf/2511.20410
Copy Paste: [[2511.20410]] Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs(https://arxiv.org/abs/2511.20410)
Keywords: diffusion
Abstract: Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model's generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: this https URL.

Title: BRIC: Bridging Kinematic Plans and Physical Control at Test Time

Authors: Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.20431
Pdf URL: https://arxiv.org/pdf/2511.20431
Copy Paste: [[2511.20431]] BRIC: Bridging Kinematic Plans and Physical Control at Test Time(https://arxiv.org/abs/2511.20431)
Keywords: diffusion
Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

Title: Object-Centric Vision Token Pruning for Vision Language Models

Authors: Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20439
Pdf URL: https://arxiv.org/pdf/2511.20439
Copy Paste: [[2511.20439]] Object-Centric Vision Token Pruning for Vision Language Models(https://arxiv.org/abs/2511.20439)
Keywords: interpretability
Abstract: In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is gauranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across any vision pruning ratios, i.e., inference efficiency, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our codes are available at this https URL.

Title: Diffusion for Fusion: Designing Stellarators with Generative AI

Authors: Misha Padidar, Teresa Huang, Andrew Giuliani, Marina Spivak
Subjects: cs.LG, physics.plasm-ph
Abstract URL: https://arxiv.org/abs/2511.20445
Pdf URL: https://arxiv.org/pdf/2511.20445
Copy Paste: [[2511.20445]] Diffusion for Fusion: Designing Stellarators with Generative AI(https://arxiv.org/abs/2511.20445)
Keywords: diffusion, generative
Abstract: Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem, stellarator design is a time-consuming process that can take hours to solve on a computing cluster. Developing fast methods for designing stellarators is crucial for advancing fusion research. Given the recent development of large datasets of optimized stellarators, machine learning approaches have emerged as a potential candidate. Motivated by this, we present an open inverse problem to the machine learning community: to rapidly generate high-quality stellarator designs which have a set of desirable characteristics. As a case study in the problem space, we train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with desirable characteristics (aspect ratio and mean rotational transform). The diffusion model is applied to design stellarators with characteristics not seen during training. We provide evaluation protocols and show that many of the generated stellarators exhibit solid performance: less than 5% deviation from quasisymmetry and the target characteristics. The modest deviation from quasisymmetry highlights an opportunity to reach the sub 1% target. Beyond the case study, we share multiple promising avenues for generative modeling to advance stellarator design.

Title: Learning to Generate Human-Human-Object Interactions from Textual Descriptions

Authors: Jeonghyeon Na, Sangwon Baik, Inhee Lee, Junyoung Lee, Hanbyul Joo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20446
Pdf URL: https://arxiv.org/pdf/2511.20446
Copy Paste: [[2511.20446]] Learning to Generate Human-Human-Object Interactions from Textual Descriptions(https://arxiv.org/abs/2511.20446)
Keywords: diffusion, generative
Abstract: The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOIs dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interaction (HOIs) and human-human interaction (HHIs) from the HHOIs, and with these data, we train an text-to-HOI and text-to-HHI model using score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual model, capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.

Title: Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks

Authors: Shreevanth Krishnaa Gopalakrishnan, Stephen Hailes
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20456
Pdf URL: https://arxiv.org/pdf/2511.20456
Copy Paste: [[2511.20456]] Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks(https://arxiv.org/abs/2511.20456)
Keywords: secure, security, attack, robust
Abstract: Machine learning has become integral to Channel State Information (CSI)-based human sensing systems and is expected to power applications such as device-free activity recognition and identity detection in future cellular and Wi-Fi generations. However, these systems rely on models whose decisions can be subtly perturbed, raising concerns for security and reliability in ubiquitous sensing. Quantifying and understanding the robustness of such models, defined as their ability to maintain accurate predictions under adversarial perturbations, is therefore critical before wireless sensing can be safely deployed in real-world environments. This work presents a systematic evaluation of the robustness of CSI deep learning models under diverse threat models (white-box, black-box/transfer, and universal perturbations) and varying degrees of attack realism. We establish a framework to compare compact temporal autoencoder models with larger deep architectures across three public datasets, quantifying how model scale, training regime, and physical constraints influence robustness. Our experiments show that smaller models, while efficient and equally performant on clean data, are markedly less robust. We further confirm that physically realizable signal-space perturbations, designed to be feasible in real wireless channels, significantly reduce attack success compared to unconstrained feature-space attacks. Adversarial training mitigates these vulnerabilities, improving mean robust accuracy with only moderate degradation in clean performance across both model classes. As wireless sensing advances towards reliable, cross-domain operation, these findings provide quantitative baselines for robustness estimation and inform design principles for secure and trustworthy human-centered sensing systems.

Title: Generation, Evaluation, and Explanation of Novelists' Styles with Single-Token Prompts

Authors: Mosab Rezaei, Mina Rajaei Moghadam, Abdul Rahman Shaikh, Hamed Alhoori, Reva Freedman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20459
Pdf URL: https://arxiv.org/pdf/2511.20459
Copy Paste: [[2511.20459]] Generation, Evaluation, and Explanation of Novelists' Styles with Single-Token Prompts(https://arxiv.org/abs/2511.20459)
Keywords: transformer, generative, large language model
Abstract: Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generative models when no paired data exist, and evaluating stylistic text without relying only on human judgment. In this work, we present a framework for both generating and evaluating sentences in the style of 19th-century novelists. Large language models are fine-tuned with minimal, single-token prompts to produce text in the voices of authors such as Dickens, Austen, Twain, Alcott, and Melville. To assess these generative models, we employ a transformer-based detector trained on authentic sentences, using it both as a classifier and as a tool for stylistic explanation. We complement this with syntactic comparisons and explainable AI methods, including attention-based and gradient-based analyses, to identify the linguistic cues that drive stylistic imitation. Our findings show that the generated text reflects the authors' distinctive patterns and that AI-based evaluation offers a reliable alternative to human assessment. All artifacts of this work are published online.

Title: STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

Authors: Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20462
Pdf URL: https://arxiv.org/pdf/2511.20462
Copy Paste: [[2511.20462]] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow(https://arxiv.org/abs/2511.20462)
Keywords: robust, diffusion, generative
Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at this https URL.

Title: Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features

Authors: Ben Hamscher, Arnold Brosch, Nicolas Binninger, Maksymilian Jan Dejna, Kira Maag
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20469
Pdf URL: https://arxiv.org/pdf/2511.20469
Copy Paste: [[2511.20469]] Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features(https://arxiv.org/abs/2511.20469)
Keywords: robust
Abstract: Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.

Title: NVIDIA Nemotron Parse 1.1

Authors: Kateryna Chumachenko, Amala Sanjay Deshmukh, Jarno Seppanen, Ilia Karmanov, Chia-Chih Chen, Lukas Voegtle, Philipp Fischer, Marek Wawrzos, Saeid Motiian, Roman Ageev, Kedi Wu, Alexandre Milesi, Maryam Moosaei, Krzysztof Pawelec, Padmavathy Subramanian, Mehrzad Samadi, Xin Yu, Celina Dear, Sarah Stoddard, Jenna Diamond, Jesse Oliver, Leanna Chraghchian, Patrick Skelly, Tom Balough, Yao Xu, Jane Polak Scowcroft, Daniel Korzekwa, Darragh Hanley, Sandip Bhaskar, Timo Roman, Karan Sapra, Andrew Tao, Bryan Catanzaro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20478
Pdf URL: https://arxiv.org/pdf/2511.20478
Copy Paste: [[2511.20478]] NVIDIA Nemotron Parse 1.1(https://arxiv.org/abs/2511.20478)
Keywords: extraction
Abstract: We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks making it a strong lightweight OCR solution. We release the model weights publicly on Huggingface, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.

Title: Ranking-Enhanced Anomaly Detection Using Active Learning-Assisted Attention Adversarial Dual AutoEncoders

Authors: Sidahmed Benabderrahmane, James Cheney, Talal Rahwan
Subjects: cs.LG, cs.AI, cs.CR, cs.NE
Abstract URL: https://arxiv.org/abs/2511.20480
Pdf URL: https://arxiv.org/pdf/2511.20480
Copy Paste: [[2511.20480]] Ranking-Enhanced Anomaly Detection Using Active Learning-Assisted Attention Adversarial Dual AutoEncoders(https://arxiv.org/abs/2511.20480)
Keywords: security, attack, steal
Abstract: Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data, which is often scarce in real-world cybersecurity environments. In this paper, we propose an innovative approach that leverages AutoEncoders for unsupervised anomaly detection, augmented by active learning to iteratively improve the detection of APT anomalies. By selectively querying an oracle for labels on uncertain or ambiguous samples, we minimize labeling costs while improving detection rates, enabling the model to improve its detection accuracy with minimal data while reducing the need for extensive manual labeling. We provide a detailed formulation of the proposed Attention Adversarial Dual AutoEncoder-based anomaly detection framework and show how the active learning loop iteratively enhances the model. The framework is evaluated on real-world imbalanced provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004\% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The results have shown significant improvements in detection rates during active learning and better performance compared to other existing approaches.

Title: MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Authors: Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos, Olivier Michielin, Michael Moor, Charlotte Bunne
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20490
Pdf URL: https://arxiv.org/pdf/2511.20490
Copy Paste: [[2511.20490]] MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology(https://arxiv.org/abs/2511.20490)
Keywords: large language model
Abstract: Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability -- frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.

Title: Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

Authors: Jakub Hoscilowicz, Artur Janicki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20494
Pdf URL: https://arxiv.org/pdf/2511.20494
Copy Paste: [[2511.20494]] Adversarial Confusion Attack: Disrupting Multimodal Large Language Models(https://arxiv.org/abs/2511.20494)
Keywords: attack, large language model
Abstract: We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.

Title: From One Attack Domain to Another: Contrastive Transfer Learning with Siamese Networks for APT Detection

Authors: Sidahmed Benabderrahmane, Talal Rahwan
Subjects: cs.LG, cs.AI, cs.CR, cs.NE
Abstract URL: https://arxiv.org/abs/2511.20500
Pdf URL: https://arxiv.org/pdf/2511.20500
Copy Paste: [[2511.20500]] From One Attack Domain to Another: Contrastive Transfer Learning with Siamese Networks for APT Detection(https://arxiv.org/abs/2511.20500)
Keywords: security, attack, robust, steal
Abstract: Advanced Persistent Threats (APT) pose a major cybersecurity challenge due to their stealth, persistence, and adaptability. Traditional machine learning detectors struggle with class imbalance, high dimensional features, and scarce real world traces. They often lack transferability-performing well in the training domain but degrading in novel attack scenarios. We propose a hybrid transfer framework that integrates Transfer Learning, Explainable AI (XAI), contrastive learning, and Siamese networks to improve cross-domain generalization. An attention-based autoencoder supports knowledge transfer across domains, while Shapley Additive exPlanations (SHAP) select stable, informative features to reduce dimensionality and computational cost. A Siamese encoder trained with a contrastive objective aligns source and target representations, increasing anomaly separability and mitigating feature drift. We evaluate on real-world traces from the DARPA Transparent Computing (TC) program and augment with synthetic attack scenarios to test robustness. Across source to target transfers, the approach delivers improved detection scores with classical and deep baselines, demonstrating a scalable, explainable, and transferable solution for APT detection.

Title: A Physics-Informed Loss Function for Boundary-Consistent and Robust Artery Segmentation in DSA Sequences

Authors: Muhammad Irfan, Nasir Rahim, Khalid Mahmood Malik
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20501
Pdf URL: https://arxiv.org/pdf/2511.20501
Copy Paste: [[2511.20501]] A Physics-Informed Loss Function for Boundary-Consistent and Robust Artery Segmentation in DSA Sequences(https://arxiv.org/abs/2511.20501)
Keywords: robust, extraction, segmentation
Abstract: Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel \textit{Physics-Informed Loss} (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at this https URL.

Title: A Single-Root, Multi-Curve, Context-Isolated, PQC-Pluggable Cryptographic Identity Primitive with Stateless Secret Rotation

Authors: Jian Sheng Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.20505
Pdf URL: https://arxiv.org/pdf/2511.20505
Copy Paste: [[2511.20505]] A Single-Root, Multi-Curve, Context-Isolated, PQC-Pluggable Cryptographic Identity Primitive with Stateless Secret Rotation(https://arxiv.org/abs/2511.20505)
Keywords: secure, security
Abstract: Cryptographic identity anchors modern decentralized systems, yet current standards like BIP-39 and BIP-32 are structurally insufficient for the demands of multi-curve, multi-domain, and post-quantum (PQC) environments. These legacy schemes rely on a monolithic identity root with no inherent context isolation, algorithm agility, or secure secret rotation. This paper introduces MSCIKDF, a single-root, multi-curve, context-isolated, PQC-pluggable cryptographic identity primitive. MSCIKDF defines a new architectural foundation where identity is derived deterministically but with cryptographically enforced separation across diverse contexts (e.g., blockchain, E2EE, KMS, IoT). It achieves strong security invariants -- such as zero-linkability, multi-curve independence, and resistance to cross-context correlation -- while offering stateless secret rotation that preserves long-term identity continuity without requiring asset migration. MSCIKDF is proposed as an infrastructure-level upgrade to deterministic identity, establishing a durable and algorithm-agnostic root of trust suitable for the next decade of distributed systems, AI agents, and PQC migration.

Title: The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

Authors: Nathan Roll, Jill Kries, Flora Jin, Catherine Wang, Ann Marie Finley, Meghan Sumner, Cory Shain, Laura Gwilliams
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20507
Pdf URL: https://arxiv.org/pdf/2511.20507
Copy Paste: [[2511.20507]] The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models(https://arxiv.org/abs/2511.20507)
Keywords: large language model
Abstract: Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model--consensus agreement vs. 0.286 for human--human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.

Title: DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning

Authors: Mihaela Hudişteanu, Edwige Cyffers, Nikita P. Kalinin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20509
Pdf URL: https://arxiv.org/pdf/2511.20509
Copy Paste: [[2511.20509]] DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning(https://arxiv.org/abs/2511.20509)
Keywords: privacy, transformer
Abstract: Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is still predominantly performed with DP-SGD, typically requiring extensive compute and hyperparameter tuning. We propose DP-MicroAdam, a memory-efficient and sparsity-aware adaptive DP optimizer. We prove that DP-MicroAdam converges in stochastic non-convex optimization at the optimal $\mathcal{O}(1/\sqrt{T})$ rate, up to privacy-dependent constants. Empirically, DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD across a range of benchmarks, including CIFAR-10, large-scale ImageNet training, and private fine-tuning of pretrained transformers. These results demonstrate that adaptive optimization can improve both performance and stability under differential privacy.

Title: DesignPref: Capturing Personal Preferences in Visual Design Generation

Authors: Yi-Hao Peng, Jeffrey P. Bigham, Jason Wu
Subjects: cs.CV, cs.AI, cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2511.20513
Pdf URL: https://arxiv.org/pdf/2511.20513
Copy Paste: [[2511.20513]] DesignPref: Capturing Personal Preferences in Visual Design Generation(https://arxiv.org/abs/2511.20513)
Keywords: diffusion, generative, large language model
Abstract: Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.

Title: HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

Authors: Xiang Wang, Zhifei Zhang, He Zhang, Zhe Lin, Yuqian Zhou, Qing Liu, Shiwei Zhang, Yijun Li, Shaoteng Liu, Haitian Zheng, Jason Kuen, Yuehuan Wang, Changxin Gao, Nong Sang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20520
Pdf URL: https://arxiv.org/pdf/2511.20520
Copy Paste: [[2511.20520]] HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation(https://arxiv.org/abs/2511.20520)
Keywords: diffusion, transformer, generative
Abstract: Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing over 40% attention sharing, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.

Title: Engel p-adic Isogeny-based Cryptography over Laurent Series: Foundations, Security, and an ESP32 Implementation

Authors: Ilias Cherkaoui, Indrakshi Dey
Subjects: cs.CR, cs.IT
Abstract URL: https://arxiv.org/abs/2511.20533
Pdf URL: https://arxiv.org/pdf/2511.20533
Copy Paste: [[2511.20533]] Engel p-adic Isogeny-based Cryptography over Laurent Series: Foundations, Security, and an ESP32 Implementation(https://arxiv.org/abs/2511.20533)
Keywords: security, attack
Abstract: Securing the Internet of Things (IoT) against quantum attacks requires public-key cryptography that (i) remains compact and (ii) runs efficiently on microcontrollers, capabilities many post-quantum (PQ) schemes lack due to large keys and heavy arithmetic. We address both constraints simultaneously with, to our knowledge, the first-ever isogeny framework that encodes super-singular elliptic-curve isogeny data via novel Engel expansions over the p-adic Laurent series. Engel coefficients compress torsion information, thereby addressing the compactness constraint, yielding public keys of ~1.1 - 16.9 kbits preserving the hallmark small sizes of isogeny systems. Engel arithmetic is local and admits fixed-precision p-adic operations, enabling micro-controller efficiency with low-memory, branch-regular kernels suitable for embedded targets.

Title: Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition

Authors: Wesley Bian, Xiaofeng Lin, Guang Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20534
Pdf URL: https://arxiv.org/pdf/2511.20534
Copy Paste: [[2511.20534]] Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition(https://arxiv.org/abs/2511.20534)
Keywords: fair
Abstract: Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.

Title: Automated Monitoring of Cultural Heritage Artifacts Using Semantic Segmentation

Authors: Andrea Ranieri, Giorgio Palmieri, Silvia Biasotti
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20541
Pdf URL: https://arxiv.org/pdf/2511.20541
Copy Paste: [[2511.20541]] Automated Monitoring of Cultural Heritage Artifacts Using Semantic Segmentation(https://arxiv.org/abs/2511.20541)
Keywords: segmentation
Abstract: This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN- based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.

Title: From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding

Authors: Farjana Sultana Mim, Shuchin Aeron, Eric Miller, Kristen Wendell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.20547
Pdf URL: https://arxiv.org/pdf/2511.20547
Copy Paste: [[2511.20547]] From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding(https://arxiv.org/abs/2511.20547)
Keywords: large language model
Abstract: Identifying discourse features in student conversations is quite important for educational researchers to recognize the curricular and pedagogical variables that cause students to engage in constructing knowledge rather than merely completing tasks. The manual analysis of student conversations to identify these discourse features is time-consuming and labor-intensive, which limits the scale and scope of studies. Leveraging natural language processing (NLP) techniques can facilitate the automatic detection of these discourse features, offering educational researchers scalable and data-driven insights. However, existing studies in NLP that focus on discourse in dialogue rarely address educational data. In this work, we address this gap by introducing an annotated educational dialogue dataset of student conversations featuring knowledge construction and task production discourse. We also establish baseline models for automatically predicting these discourse properties for each turn of talk within conversations, using pre-trained large language models GPT-3.5 and Llama-3.1. Experimental results indicate that these state-of-the-art models perform suboptimally on this task, indicating the potential for future research.

Title: Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

Authors: Guanjie Chen, Shirui Huang, Kai Liu, Jianchen Zhu, Xiaoye Qu, Peng Chen, Yu Cheng, Yifu Sun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20549
Pdf URL: https://arxiv.org/pdf/2511.20549
Copy Paste: [[2511.20549]] Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning(https://arxiv.org/abs/2511.20549)
Keywords: diffusion, generative
Abstract: Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only $2.1\%$ its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are coming soon.

Title: Effective Command-line Interface Fuzzing with Path-Aware Large Language Model Orchestration

Authors: Momoko Shiraishi, Yinzhi Cao, Takahiro Shinagawa
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.20555
Pdf URL: https://arxiv.org/pdf/2511.20555
Copy Paste: [[2511.20555]] Effective Command-line Interface Fuzzing with Path-Aware Large Language Model Orchestration(https://arxiv.org/abs/2511.20555)
Keywords: large language model
Abstract: Command-line interface (CLI) fuzzing tests programs by mutating both command-line options and input file contents, thus enabling discovery of vulnerabilities that only manifest under specific option-input combinations. Prior works of CLI fuzzing face the challenges of generating semantics-rich option strings and input files, which cannot reach deeply embedded target functions. This often leads to a misdetection of such a deep vulnerability using existing CLI fuzzing techniques. In this paper, we design a novel Path-guided, Iterative LLM-Orchestrated Testing framework, called PILOT, to fuzz CLI applications. The key insight is to provide potential call paths to target functions as context to LLM so that it can better generate CLI option strings and input files. Then, PILOT iteratively repeats the process, and provides reached functions as additional context so that target functions are reached. Our evaluation on real-world CLI applications demonstrates that PILOT achieves higher coverage than state-of-the-art fuzzing approaches and discovers 51 zero-day vulnerabilities. We responsibly disclosed all the vulnerabilities to their developers and so far 41 have been confirmed by their developers with 33 being fixed and three assigned CVE identifiers.

Title: Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

Authors: Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2511.20561
Pdf URL: https://arxiv.org/pdf/2511.20561
Copy Paste: [[2511.20561]] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward(https://arxiv.org/abs/2511.20561)
Keywords: generative
Abstract: Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at this https URL

Title: A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

Authors: Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20563
Pdf URL: https://arxiv.org/pdf/2511.20563
Copy Paste: [[2511.20563]] A Reason-then-Describe Instruction Interpreter for Controllable Video Generation(https://arxiv.org/abs/2511.20563)
Keywords: diffusion, transformer
Abstract: Diffusion Transformers have significantly improved video fidelity and temporal coherence, however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: this https URL.

Title: DINO-Tok: Adapting DINO for Visual Tokenizers

Authors: Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, Wei Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20565
Pdf URL: https://arxiv.org/pdf/2511.20565
Copy Paste: [[2511.20565]] DINO-Tok: Adapting DINO for Visual Tokenizers(https://arxiv.org/abs/2511.20565)
Keywords: generative
Abstract: Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256$\times$256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and comparable to billion-level data trained models (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at this https URL.

Title: MSTN: Fast and Efficient Multivariate Time Series Model

Authors: Sumit S Shevtekar, Chandresh K Maurya, Gourab Sil
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20577
Pdf URL: https://arxiv.org/pdf/2511.20577
Copy Paste: [[2511.20577]] MSTN: Fast and Efficient Multivariate Time Series Model(https://arxiv.org/abs/2511.20577)
Keywords: transformer
Abstract: Real-world time-series data is highly non stationary and complex in dynamics that operate across multiple timescales, ranging from fast, short-term changes to slow, long-term trends. Most existing models rely on fixed-scale structural priors, such as patch-based tokenization, fixed frequency transformations, or frozen backbone architectures. This often leads to over-regularization of temporal dynamics, which limits their ability to adaptively model the full spectrum of temporal variations and impairs their performance on unpredictable, Sudden, high-magnitude events. To address this, we introduce the Multi-scale Temporal Network (MSTN), a novel deep learning architecture founded on a hierarchical multi-scale and sequence modeling principle. The MSTN framework integrates: (i) a multi-scale convolutional encoder that constructs a hierarchical feature pyramid for local patterns (ii) a sequence modeling component for long-range temporal dependencies. We empirically validate this with BiLSTM and Transformer variants, establishing a flexible foundation for future architectural advancements. and (iii) a gated fusion mechanism augmented with squeeze-and-excitation (SE) and multi-head temporal attention (MHTA) for dynamic, context-aware feature integration. This design enables MSTN to adaptively model temporal patterns from milliseconds to long-range dependencies within a unified framework. Extensive evaluations across time-series long-horizon forecasting, imputation, classification and generalizability study demonstrate that MSTN achieves competitive state-of-the-art (SOTA) performance, showing improvements over contemporary approaches including EMTSF, LLM4TS, HiMTM, TIME-LLM, MTST, SOFTS, iTransformer, TimesNet, and PatchTST. In total, MSTN establishes new SOTA performance on 24 of 32 benchmark datasets, demonstrating its consistent performance across diverse temporal tasks.

Title: Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models

Authors: Karim Kadry, Abdallah Abdelwahed, Shoaib Goraya, Ajay Manicka, Naravich Chutisilp, Farhad Nezami, Elazer Edelman
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20587
Pdf URL: https://arxiv.org/pdf/2511.20587
Copy Paste: [[2511.20587]] Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models(https://arxiv.org/abs/2511.20587)
Keywords: diffusion
Abstract: We present Anatomica: an inference-time framework for generating multi-class anatomical voxel maps with localized geo-topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We control geometric features such as size, shape, and position through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement Anatomica for latent diffusion models, where neural field decoders partially extract substructures, enabling the efficient control of anatomical properties. Anatomica applies flexibly across diverse anatomical systems, composing constraints to control complex structures over arbitrary dimensions and coordinate systems, thereby enabling the rational design of synthetic datasets for virtual trials or machine learning workflows.

Title: Latent Diffusion Inversion Requires Understanding the Latent Space

Authors: Mingxing Rao, Bowen Qu, Daniel Moyer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2511.20592
Pdf URL: https://arxiv.org/pdf/2511.20592
Copy Paste: [[2511.20592]] Latent Diffusion Inversion Requires Understanding the Latent Space(https://arxiv.org/abs/2511.20592)
Keywords: privacy, attack, membership infer, diffusion, generative
Abstract: The recovery of training data from generative models (``model inversion'') has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacker significantly improves performance, with average AUROC gains of 2.7\% and substantial increases in TPR@1\%FPR (6.42\%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokémon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.

Title: BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

Authors: Kaiyuan Zhang, Mark Tenenholtz, Kyle Polley, Jerry Ma, Denis Yarats, Ninghui Li
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2511.20597
Pdf URL: https://arxiv.org/pdf/2511.20597
Copy Paste: [[2511.20597]] BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents(https://arxiv.org/abs/2511.20597)
Keywords: secure, security, protect, defense, attack
Abstract: The integration of artificial intelligence (AI) agents into web browsers introduces security challenges that go beyond traditional web application threat models. Prior work has identified prompt injection as a new attack vector for web agents, yet the resulting impact within real-world environments remains insufficiently understood. In this work, we examine the landscape of prompt injection attacks and synthesize a benchmark of attacks embedded in realistic HTML payloads. Our benchmark goes beyond prior work by emphasizing injections that can influence real-world actions rather than mere text outputs, and by presenting attack payloads with complexity and distractor frequency similar to what real-world agents encounter. We leverage this benchmark to conduct a comprehensive empirical evaluation of existing defenses, assessing their effectiveness across a suite of frontier AI models. We propose a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. Our work offers a blueprint for designing practical, secure web agents through a defense-in-depth approach.

Title: On Evaluating LLM Alignment by Evaluating LLMs as Judges

Authors: Yixin Liu, Pengfei Liu, Arman Cohan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20604
Pdf URL: https://arxiv.org/pdf/2511.20604
Copy Paste: [[2511.20604]] On Evaluating LLM Alignment by Evaluating LLMs as Judges(https://arxiv.org/abs/2511.20604)
Keywords: large language model
Abstract: Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.

Title: How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

Authors: Xiwen Huang, Pierre Pinson
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2511.20605
Pdf URL: https://arxiv.org/pdf/2511.20605
Copy Paste: [[2511.20605]] How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets(https://arxiv.org/abs/2511.20605)
Keywords: robust
Abstract: We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.

Title: Adaptive Hopfield Network: Rethinking Similarities in Associative Memory

Authors: Shurong Wang, Yuqi Pan, Zhuoyang Shen, Meng Zhang, Hongwei Wang, Guoqi Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20609
Pdf URL: https://arxiv.org/pdf/2511.20609
Copy Paste: [[2511.20609]] Adaptive Hopfield Network: Rethinking Similarities in Associative Memory(https://arxiv.org/abs/2511.20609)
Keywords: interpretability, generative
Abstract: Associative memory models are content-addressable memory systems fundamental to biological intelligence and are notable for their high interpretability. However, existing models evaluate the quality of retrieval based on proximity, which cannot guarantee that the retrieved pattern has the strongest association with the query, failing correctness. We reframe this problem by proposing that a query is a generative variant of a stored memory pattern, and define a variant distribution to model this subtle context-dependent generative process. Consequently, correct retrieval should return the memory pattern with the maximum a posteriori probability of being the query's origin. This perspective reveals that an ideal similarity measure should approximate the likelihood of each stored pattern generating the query in accordance with variant distribution, which is impossible for fixed and pre-defined similarities used by existing associative memories. To this end, we develop adaptive similarity, a novel mechanism that learns to approximate this insightful but unknown likelihood from samples drawn from context, aiming for correct retrieval. We theoretically prove that our proposed adaptive similarity achieves optimal correct retrieval under three canonical and widely applicable types of variants: noisy, masked, and biased. We integrate this mechanism into a novel adaptive Hopfield network (A-Hop), and empirical results show that it achieves state-of-the-art performance across diverse tasks, including memory retrieval, tabular classification, image classification, and multiple instance learning.

Title: Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

Authors: Panayiotis Danassis, Naman Goel
Subjects: cs.LG, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2511.20613
Pdf URL: https://arxiv.org/pdf/2511.20613
Copy Paste: [[2511.20613]] Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning(https://arxiv.org/abs/2511.20613)
Keywords: large language model
Abstract: The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and $\sim 40$k matches demonstrate (i) a clear superiority of human(graduate students)-coded agents: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs' ability to produce code that works competitively in the real-world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.

Title: Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

Authors: Seyede Niloofar Hosseini, Ali Mojibi, Mahdi Mohseni, Navid Arjmand, Alireza Taheri
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20615
Pdf URL: https://arxiv.org/pdf/2511.20615
Copy Paste: [[2511.20615]] Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities(https://arxiv.org/abs/2511.20615)
Keywords: transformer
Abstract: This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals each performing 204 load-reaching tasks from different load positions while adapting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We indicated that utilizing the transformer architecture, with a root-mean-square-error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. This study merits the use of neural networks that capture time series dependencies in 3D motion frames, providing a unique approach for understanding and predict motion dynamics during manual material handling activities.

Title: Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

Authors: Xinhao Liu, Jiaqi Li, Youming Deng, Ruxin Chen, Yingjia Zhang, Yifei Ma, Li Guo, Yiming Li, Jing Zhang, Chen Feng
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2511.20620
Pdf URL: https://arxiv.org/pdf/2511.20620
Copy Paste: [[2511.20620]] Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI(https://arxiv.org/abs/2511.20620)
Keywords: robust
Abstract: Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland's rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at this https URL.

Title: ShapeGen: Towards High-Quality 3D Shape Synthesis

Authors: Yangguang Li, Xianglong He, Zi-Xin Zou, Zexiang Liu, Wanli Ouyang, Ding Liang, Yan-Pei Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20624
Pdf URL: https://arxiv.org/pdf/2511.20624
Copy Paste: [[2511.20624]] ShapeGen: Towards High-Quality 3D Shape Synthesis(https://arxiv.org/abs/2511.20624)
Keywords: transformer, generative
Abstract: Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.

Title: ROOT: Robust Orthogonalized Optimizer for Neural Network Training

Authors: Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, Yunhe Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.20626
Pdf URL: https://arxiv.org/pdf/2511.20626
Copy Paste: [[2511.20626]] ROOT: Robust Orthogonalized Optimizer for Neural Network Training(https://arxiv.org/abs/2511.20626)
Keywords: robust, large language model
Abstract: The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at this https URL.

Title: MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

Authors: Chieh-Yun Chen, Zhonghao Wang, Qi Chen, Zhifan Ye, Min Shi, Yue Zhao, Yinan Zhao, Hui Qu, Wei-An Lin, Yiru Shen, Ajinkya Kale, Irfan Essa, Humphrey Shi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20629
Pdf URL: https://arxiv.org/pdf/2511.20629
Copy Paste: [[2511.20629]] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models(https://arxiv.org/abs/2511.20629)
Keywords: diffusion, generative
Abstract: Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.

Title: Quantum-Resistant Authentication Scheme for RFID Systems Using Lattice-Based Cryptography

Authors: Vaibhav Kumar, Kaiwalya Joshi, Bhavya Dixit, Gaurav S. Kasbekar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2511.20630
Pdf URL: https://arxiv.org/pdf/2511.20630
Copy Paste: [[2511.20630]] Quantum-Resistant Authentication Scheme for RFID Systems Using Lattice-Based Cryptography(https://arxiv.org/abs/2511.20630)
Keywords: secure, security, attack, robust
Abstract: We propose a novel quantum-resistant mutual authentication scheme for radio-frequency identification (RFID) systems. Our scheme uses lattice-based cryptography and, in particular, achieves quantum-resistance by leveraging the hardness of the inhomogeneous short integer solution (ISIS) problem. In contrast to prior work, which assumes that the reader-server communication channel is secure, our scheme is secure even when both the reader-server and tag-reader communication channels are insecure. Our proposed protocol provides robust security against man-in-the-middle (MITM), replay, impersonation, and reflection attacks, while also ensuring unforgeability and preserving anonymity. We present a detailed security analysis, including semi-formal analysis and formal verification using the Automated Validation of Internet Security Protocols and Applications (AVISPA) tool. In addition, we analyze the storage, computation, and communication costs of the proposed protocol and compare its security properties with those of existing protocols, demonstrating that our scheme offers strong security guarantees. To the best of our knowledge, this paper is the first quantum-resistant authentication protocol for RFID systems that comprehensively addresses the insecurity of both the reader-server and tag-reader communication channels.

Title: Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model

Authors: Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.20636
Pdf URL: https://arxiv.org/pdf/2511.20636
Copy Paste: [[2511.20636]] Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model(https://arxiv.org/abs/2511.20636)
Keywords: diffusion, transformer
Abstract: Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.

Title: Latent Collaboration in Multi-Agent Systems

Authors: Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20639
Pdf URL: https://arxiv.org/pdf/2511.20639
Copy Paste: [[2511.20639]] Latent Collaboration in Multi-Agent Systems(https://arxiv.org/abs/2511.20639)
Keywords: large language model
Abstract: Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent's internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at this https URL.

Title: MotionV2V: Editing Motion in a Video

Authors: Ryan Burgert, Charles Herrmann, Forrester Cole, Michael S Ryoo, Neal Wadhwa, Andrey Voynov, Nataniel Ruiz
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.20640
Pdf URL: https://arxiv.org/pdf/2511.20640
Copy Paste: [[2511.20640]] MotionV2V: Editing Motion in a Video(https://arxiv.org/abs/2511.20640)
Keywords: diffusion, generative
Abstract: While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a "motion edit" and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating "motion counterfactuals", video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: this https URL

Title: PixelDiT: Pixel Diffusion Transformers for Image Generation

Authors: Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20645
Pdf URL: https://arxiv.org/pdf/2511.20645
Copy Paste: [[2511.20645]] PixelDiT: Pixel Diffusion Transformers for Image Generation(https://arxiv.org/abs/2511.20645)
Keywords: diffusion, transformer, generative
Abstract: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.

Title: 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

Authors: Xiaoye Wang, Chen Tang, Xiangyu Yue, Wei-Hong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20646
Pdf URL: https://arxiv.org/pdf/2511.20646
Copy Paste: [[2511.20646]] 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding(https://arxiv.org/abs/2511.20646)
Keywords: segmentation
Abstract: This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.

Title: Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization

Authors: Tahira Kazimi, Connor Dunlop, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20647
Pdf URL: https://arxiv.org/pdf/2511.20647
Copy Paste: [[2511.20647]] Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization(https://arxiv.org/abs/2511.20647)
Keywords: diffusion
Abstract: While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplies groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.

Title: LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

Authors: Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20648
Pdf URL: https://arxiv.org/pdf/2511.20648
Copy Paste: [[2511.20648]] LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight(https://arxiv.org/abs/2511.20648)
Keywords: robust
Abstract: To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how human reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.

Title: Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Authors: Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20649
Pdf URL: https://arxiv.org/pdf/2511.20649
Copy Paste: [[2511.20649]] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout(https://arxiv.org/abs/2511.20649)
Keywords: diffusion
Abstract: Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model's 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce $\infty$-RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model's maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish $\infty$-RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that $\infty$-RoPE consistently surpasses previous autoregressive models in overall VBench scores.

Title: RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

Authors: Xuelu Feng, Yunsheng Li, Ziyu Wan, Zixuan Gao, Junsong Yuan, Dongdong Chen, Chunming Qiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.20651
Pdf URL: https://arxiv.org/pdf/2511.20651
Copy Paste: [[2511.20651]] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation(https://arxiv.org/abs/2511.20651)
Keywords: interpretability, generative
Abstract: Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt--a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism--tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.