2025-12-05

Title: Towards Contextual Sensitive Data Detection

Authors: Liang Telkamp, Madelon Hulsebos
Subjects: cs.CR, cs.AI, cs.CL, cs.CY, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2512.04120
Pdf URL: https://arxiv.org/pdf/2512.04120
Copy Paste: [[2512.04120]] Towards Contextual Sensitive Data Detection(https://arxiv.org/abs/2512.04120)
Keywords: in-context
Abstract: The emergence of open data portals necessitates more attention to protecting sensitive data before datasets get published and exchanged. While an abundance of methods for suppressing sensitive data exist, the conceptualization of sensitive data and methods to detect it focus particularly on personal data that, if disclosed, may be harmful or violate privacy. We observe the need for refining and broadening our definitions of sensitive data, and argue that the sensitivity of data depends on its context. Based on this definition, we introduce two mechanisms for contextual sensitive data detection that consider the broader context of a dataset at hand. First, we introduce type contextualization, which first detects the semantic type of particular data values, then considers the overall context of the data values within the dataset or document. Second, we introduce domain contextualization which determines sensitivity of a given dataset in the broader context based on the retrieval of relevant rules from documents that specify data sensitivity (e.g., data topic and geographic origin). Experiments with these mechanisms, assisted by large language models (LLMs), confirm that: 1) type-contextualization significantly reduces the number of false positives for type-based sensitive data detection and reaches a recall of 94% compared to 63% with commercial tools, and 2) domain-contextualization leveraging sensitivity rule retrieval is effective for context-grounded sensitive data detection in non-standard data domains such as humanitarian datasets. Evaluation with humanitarian data experts also reveals that context-grounded LLM explanations provide useful guidance in manual data auditing processes, improving consistency. We open-source mechanisms and annotated datasets for contextual sensitive data detection at this https URL.

Title: Tipping the Dominos: Topology-Aware Multi-Hop Attacks on LLM-Based Multi-Agent Systems

Authors: Ruichao Liang, Le Yin, Jing Chen, Cong Wu, Xiaoyu Zhang, Huangpeng Gu, Zijian Zhang, Yang Liu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.04129
Pdf URL: https://arxiv.org/pdf/2512.04129
Copy Paste: [[2512.04129]] Tipping the Dominos: Topology-Aware Multi-Hop Attacks on LLM-Based Multi-Agent Systems(https://arxiv.org/abs/2512.04129)
Keywords: diffusion
Abstract: LLM-based multi-agent systems (MASs) have reshaped the digital landscape with their emergent coordination and problem-solving capabilities. However, current security evaluations of MASs are still confined to limited attack scenarios, leaving their security issues unclear and likely underestimated. To fill this gap, we propose TOMA, a topology-aware multi-hop attack scheme targeting MASs. By optimizing the propagation of contamination within the MAS topology and controlling the multi-hop diffusion of adversarial payloads originating from the environment, TOMA unveils new and effective attack vectors without requiring privileged access or direct agent manipulation. Experiments demonstrate attack success rates ranging from 40% to 78% across three state-of-the-art MAS architectures: Magentic-One, LangManus, and OWL, and five representative topologies, revealing intrinsic MAS vulnerabilities that may be overlooked by existing research. Inspired by these findings, we propose a conceptual defense framework based on topology trust, and prototype experiments show its effectiveness in blocking 94.8% of adaptive and composite attacks.

Title: Decoding Large Language Diffusion Models with Foreseeing Movement

Authors: Yichuan Mo, Quan Chen, Mingjie Li, Zeming Wei, Yisen Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04135
Pdf URL: https://arxiv.org/pdf/2512.04135
Copy Paste: [[2512.04135]] Decoding Large Language Diffusion Models with Foreseeing Movement(https://arxiv.org/abs/2512.04135)
Keywords: diffusion
Abstract: Large Language Diffusion Models (LLDMs) benefit from a flexible decoding mechanism that enables parallelized inference and controllable generations over autoregressive models. Yet such flexibility introduces a critical challenge: inference performance becomes highly sensitive to the decoding order of tokens. Existing heuristic methods, however, focus mainly on local effects while overlooking long-term impacts. To address this limitation, we propose the Foreseeing Decoding Method (FDM), a novel approach that integrates both local and global considerations to unlock the full potential, employing a search-based strategy to enable effective optimization in discrete spaces. Furthermore, by analyzing the consistency of chosen tokens in the full decoding process, we develop a variant, FDM with Acceleration (FDM-A), which restricts deep exploration to critical steps identified as the exploration and balance circumstances. Extensive experiments across diverse benchmarks and model architectures validate the scalability of FDM and demonstrate the superior efficiency-performance trade-off achieved by FDM-A. Our work might potentially provide a principled step toward more powerful decoding methods for LLDMs.

Title: ReasonX: MLLM-Guided Intrinsic Image Decomposition

Authors: Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, Anna Frühstück
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04222
Pdf URL: https://arxiv.org/pdf/2512.04222
Copy Paste: [[2512.04222]] ReasonX: MLLM-Guided Intrinsic Image Decomposition(https://arxiv.org/abs/2512.04222)
Keywords: diffusion, generative
Abstract: Intrinsic image decomposition aims to separate images into physical components such as albedo, depth, normals, and illumination. While recent diffusion- and transformer-based models benefit from paired supervision from synthetic datasets, their generalization to diverse, real-world scenarios remains challenging. We propose ReasonX, a novel framework that leverages a multimodal large language model (MLLM) as a perceptual judge providing relative intrinsic comparisons, and uses these comparisons as GRPO rewards for fine-tuning intrinsic decomposition models on unlabeled, in-the-wild images. Unlike RL methods for generative models, our framework aligns conditional intrinsic predictors by rewarding agreement between the judge's relational assessments and analytically derived relations from the model's outputs. ReasonX is model-agnostic and can be applied to different intrinsic predictors. Across multiple base architectures and modalities, ReasonX yields significant improvements, including 9-25% WHDR reduction on IIW albedo and up to 46% depth accuracy gains on ETH3D, highlighting the promise of MLLM-guided comparative supervision to bridge low- and high-level vision reasoning.

Title: ActVAE: Modelling human activity schedules with a deep conditional generative approach

Authors: Fred Shone, Tim Hillel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04223
Pdf URL: https://arxiv.org/pdf/2512.04223
Copy Paste: [[2512.04223]] ActVAE: Modelling human activity schedules with a deep conditional generative approach(https://arxiv.org/abs/2512.04223)
Keywords: generative
Abstract: Modelling the complexity and diversity of human activity scheduling behaviour is inherently challenging. We demonstrate a deep conditional-generative machine learning approach for the modelling of realistic activity schedules depending on input labels such as an individual's age, employment status, or other information relevant to their scheduling. We combine (i) a structured latent generative approach, with (ii) a conditional approach, through a novel Conditional VAE architecture. This allows for the rapid generation of precise and realistic schedules for different input labels. We extensively evaluate model capabilities using a joint density estimation framework and several case studies. We additionally show that our approach has practical data and computational requirements, and can be deployed within new and existing demand modelling frameworks. We evaluate the importance of generative capability more generally, by comparing our combined approach to (i) a purely generative model without conditionality, and (ii) a purely conditional model which outputs the most likely schedule given the input labels. This comparison highlights the usefulness of explicitly modelling the randomness of complex and diverse human behaviours using deep generative approaches.

Title: MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models

Authors: Shaoheng Fang, Chaohui Yu, Fan Wang, Qixing Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04248
Pdf URL: https://arxiv.org/pdf/2512.04248
Copy Paste: [[2512.04248]] MVRoom: Controllable 3D Indoor Scene Generation with Multi-View Diffusion Models(https://arxiv.org/abs/2512.04248)
Keywords: diffusion
Abstract: We introduce MVRoom, a controllable novel view synthesis (NVS) pipeline for 3D indoor scenes that uses multi-view diffusion conditioned on a coarse 3D layout. MVRoom employs a two-stage design in which the 3D layout is used throughout to enforce multi-view consistency. The first stage employs novel representations to effectively bridge the 3D layout and consistent image-based condition signals for multi-view generation. The second stage performs image-conditioned multi-view generation, incorporating a layout-aware epipolar attention mechanism to enhance multi-view consistency during the diffusion process. Additionally, we introduce an iterative framework that generates 3D scenes with varying numbers of objects and scene complexities by recursively performing multi-view generation (MVRoom), supporting text-to-scene generation. Experimental results demonstrate that our approach achieves high-fidelity and controllable 3D scene generation for NVS, outperforming state-of-the-art baseline methods both quantitatively and qualitatively. Ablation studies further validate the effectiveness of key components within our generation pipeline.

Title: UniLight: A Unified Representation for Lighting

Authors: Zitian Zhang, Iliyan Georgiev, Michael Fischer, Yannick Hold-Geoffroy, Jean-François Lalonde, Valentin Deschaintre
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04267
Pdf URL: https://arxiv.org/pdf/2512.04267
Copy Paste: [[2512.04267]] UniLight: A Unified Representation for Lighting(https://arxiv.org/abs/2512.04267)
Keywords: diffusion
Abstract: Lighting has a strong influence on visual appearance, yet understanding and representing lighting in images remains notoriously difficult. Various lighting representations exist, such as environment maps, irradiance, spherical harmonics, or text, but they are incompatible, which limits cross-modal transfer. We thus propose UniLight, a joint latent space as lighting representation, that unifies multiple modalities within a shared embedding. Modality-specific encoders for text, images, irradiance, and environment maps are trained contrastively to align their representations, with an auxiliary spherical-harmonics prediction task reinforcing directional understanding. Our multi-modal data pipeline enables large-scale training and evaluation across three tasks: lighting-based retrieval, environment-map generation, and lighting control in diffusion-based image synthesis. Experiments show that our representation captures consistent and transferable lighting features, enabling flexible manipulation across modalities.

Title: The Initialization Determines Whether In-Context Learning Is Gradient Descent

Authors: Shifeng Xie, Rui Yuan, Simone Rossi, Thomas Hannagan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04268
Pdf URL: https://arxiv.org/pdf/2512.04268
Copy Paste: [[2512.04268]] The Initialization Determines Whether In-Context Learning Is Gradient Descent(https://arxiv.org/abs/2512.04268)
Keywords: in-context
Abstract: In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions with zero-mean Gaussian priors and zero initialization for GD. Subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference, akin to but distinct from GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating non-zero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimation of the query, referred to as the initial guess. We prove an upper bound on the number of heads needed for the ICL linear regression setup. Our experiments confirm this result and further show that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and provide experimental validation on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the case of linear regression, we consider widespread LLMs augmented with initial guess capabilities, and show that their performance is improved on a semantic similarity task.

Title: Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer

Authors: Tasmiah Haque, Srinjoy Das
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04282
Pdf URL: https://arxiv.org/pdf/2512.04282
Copy Paste: [[2512.04282]] Inference-time Stochastic Refinement of GRU-Normalizing Flow for Real-time Video Motion Transfer(https://arxiv.org/abs/2512.04282)
Keywords: generative, anomaly
Abstract: Real-time video motion transfer applications such as immersive gaming and vision-based anomaly detection require accurate yet diverse future predictions to support realistic synthesis and robust downstream decision making under uncertainty. To improve the diversity of such sequential forecasts we propose a novel inference-time refinement technique that combines Gated Recurrent Unit-Normalizing Flows (GRU-NF) with stochastic sampling methods. While GRU-NF can capture multimodal distributions through its integration of normalizing flows within a temporal forecasting framework, its deterministic transformation structure can limit expressivity. To address this, inspired by Stochastic Normalizing Flows (SNF), we introduce Markov Chain Monte Carlo (MCMC) steps during GRU-NF inference, enabling the model to explore a richer output space and better approximate the true data distribution without retraining. We validate our approach in a keypoint-based video motion transfer pipeline, where capturing temporally coherent and perceptually diverse future trajectories is essential for realistic samples and low bandwidth communication. Experiments show that our inference framework, Gated Recurrent Unit-Stochastic Normalizing Flows (GRU-SNF), outperforms GRU-NF in generating diverse outputs without sacrificing accuracy, even under longer prediction horizons. By injecting stochasticity during inference, our approach captures multimodal behavior more effectively. These results highlight the potential of integrating stochastic dynamics with flow-based sequence models for generative time series forecasting.

Title: Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint

Authors: Fan Jia, Yuhao Huang, Shih-Hsin Wang, Cristina Garcia-Cardona, Andrea L. Bertozzi, Bao Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04283
Pdf URL: https://arxiv.org/pdf/2512.04283
Copy Paste: [[2512.04283]] Plug-and-Play Image Restoration with Flow Matching: A Continuous Viewpoint(https://arxiv.org/abs/2512.04283)
Keywords: generative
Abstract: Flow matching-based generative models have been integrated into the plug-and-play image restoration framework, and the resulting plug-and-play flow matching (PnP-Flow) model has achieved some remarkable empirical success for image restoration. However, the theoretical understanding of PnP-Flow lags its empirical success. In this paper, we derive a continuous limit for PnP-Flow, resulting in a stochastic differential equation (SDE) surrogate model of PnP-Flow. The SDE model provides two particular insights to improve PnP-Flow for image restoration: (1) It enables us to quantify the error for image restoration, informing us to improve step scheduling and regularize the Lipschitz constant of the neural network-parameterized vector field for error reduction. (2) It informs us to accelerate off-the-shelf PnP-Flow models via extrapolation, resulting in a rescaled version of the proposed SDE model. We validate the efficacy of the SDE-informed improved PnP-Flow using several benchmark tasks, including image denoising, deblurring, super-resolution, and inpainting. Numerical results show that our method significantly outperforms the baseline PnP-Flow and other state-of-the-art approaches, achieving superior performance across evaluation metrics.

Title: SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Authors: Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.04292
Pdf URL: https://arxiv.org/pdf/2512.04292
Copy Paste: [[2512.04292]] SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats(https://arxiv.org/abs/2512.04292)
Keywords: foundation model
Abstract: Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.

Title: Gamma-from-Mono: Road-Relative, Metric, Self-Supervised Monocular Geometry for Vehicular Applications

Authors: Gasser Elazab, Maximilian Jansen, Michael Unterreiner, Olaf Hellwich
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04303
Pdf URL: https://arxiv.org/pdf/2512.04303
Copy Paste: [[2512.04303]] Gamma-from-Mono: Road-Relative, Metric, Self-Supervised Monocular Geometry for Vehicular Applications(https://arxiv.org/abs/2512.04303)
Keywords: self-supervised
Abstract: Accurate perception of the vehicle's 3D surroundings, including fine-scale road geometry, such as bumps, slopes, and surface irregularities, is essential for safe and comfortable vehicle control. However, conventional monocular depth estimation often oversmooths these features, losing critical information for motion planning and stability. To address this, we introduce Gamma-from-Mono (GfM), a lightweight monocular geometry estimation method that resolves the projective ambiguity in single-camera reconstruction by decoupling global and local structure. GfM predicts a dominant road surface plane together with residual variations expressed by gamma, a dimensionless measure of vertical deviation from the plane, defined as the ratio of a point's height above it to its depth from the camera, and grounded in established planar parallax geometry. With only the camera's height above ground, this representation deterministically recovers metric depth via a closed form, avoiding full extrinsic calibration and naturally prioritizing near-road detail. Its physically interpretable formulation makes it well suited for self-supervised learning, eliminating the need for large annotated datasets. Evaluated on KITTI and the Road Surface Reconstruction Dataset (RSRD), GfM achieves state-of-the-art near-field accuracy in both depth and gamma estimation while maintaining competitive global depth performance. Our lightweight 8.88M-parameter model adapts robustly across diverse camera setups and, to our knowledge, is the first self-supervised monocular approach evaluated on RSRD.
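
Note: the gamma definition in the abstract above (height above the road plane divided by depth) can be inverted for depth with a little geometry. The following is a minimal sketch under stated assumptions, not the authors' implementation. Assume the camera sits at the origin, the road plane has unit normal n pointing from the road toward the camera, the camera is cam_height above the plane, and a pixel back-projects to a ray r scaled so that the 3D point is X = Z * r. Then the point's height is h = cam_height + n.X, so gamma = h / Z = cam_height / Z + n.r, which gives Z = cam_height / (gamma - n.r). The function name, the coordinate conventions (x right, y down, z forward), and the example values are all hypothetical.

```python
import math


def depth_from_gamma(gamma, ray, normal, cam_height):
    """Recover metric depth Z from gamma = h / Z.

    Assumed conventions (not taken from the paper): camera at the origin,
    `normal` points from the road plane toward the camera, and `ray` is the
    back-projected pixel direction in camera coordinates.
    """
    rx, ry, rz = ray
    # Scale the ray so its z-component is 1, hence X = Z * ray.
    rx, ry = rx / rz, ry / rz
    nx, ny, nz = normal
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    # Dot product of the unit normal with the scaled ray.
    n_dot_r = (nx * rx + ny * ry + nz * 1.0) / norm
    denom = gamma - n_dot_r  # equals cam_height / Z for a valid point
    if denom <= 0:
        raise ValueError("gamma and ray are inconsistent with a positive depth")
    return cam_height / denom


# Hypothetical example: camera 1.5 m above a flat road, y axis pointing down.
# A point 0.2 m above the road at 10 m depth has gamma = 0.2 / 10 = 0.02
# and lies along the ray (0, 1.3 / 10, 1).
z = depth_from_gamma(0.02, (0.0, 0.13, 1.0), (0.0, -1.0, 0.0), 1.5)
print(round(z, 6))
```

In this sketch only the camera height is needed to fix the metric scale, which matches the abstract's claim in spirit; how the paper parameterizes the plane and the ray may differ.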

Title: Data-regularized Reinforcement Learning for Diffusion Models at Scale

Authors: Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, Stefano Ermon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04332
Pdf URL: https://arxiv.org/pdf/2512.04332
Copy Paste: [[2512.04332]] Data-regularized Reinforcement Learning for Diffusion Models at Scale(https://arxiv.org/abs/2512.04332)
Keywords: diffusion, generative
Abstract: Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or reduced diversity. Our analysis demonstrates that this can be attributed to the inherent limitations of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.

Title: STeP-Diff: Spatio-Temporal Physics-Informed Diffusion Models for Mobile Fine-Grained Pollution Forecasting

Authors: Nan Zhou, Weijie Hong, Huandong Wang, Jianfeng Zheng, Qiuhua Wang, Yali Song, Xiao-Ping Zhang, Yong Li, Xinlei Chen
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.04385
Pdf URL: https://arxiv.org/pdf/2512.04385
Copy Paste: [[2512.04385]] STeP-Diff: Spatio-Temporal Physics-Informed Diffusion Models for Mobile Fine-Grained Pollution Forecasting(https://arxiv.org/abs/2512.04385)
Keywords: diffusion
Abstract: Fine-grained air pollution forecasting is crucial for urban management and the development of healthy buildings. Deploying portable sensors on mobile platforms such as cars and buses offers a low-cost, easy-to-maintain, and wide-coverage data collection solution. However, due to the random and uncontrollable movement patterns of these non-dedicated mobile platforms, the resulting sensor data are often incomplete and temporally inconsistent. By exploring potential training patterns in the reverse process of diffusion models, we propose Spatio-Temporal Physics-Informed Diffusion Models (STeP-Diff). STeP-Diff leverages DeepONet to model the spatial sequence of measurements along with a PDE-informed diffusion model to forecast the spatio-temporal field from incomplete and time-varying data. Through a PDE-constrained regularization framework, the denoising process asymptotically converges to the convection-diffusion dynamics, ensuring that predictions are both grounded in real-world measurements and aligned with the fundamental physics governing pollution dispersion. To assess the performance of the system, we deployed 59 self-designed portable sensing devices in two cities, operating for 14 days to collect air pollution data. Compared to the second-best performing algorithm, our model achieved improvements of up to 89.12% in MAE, 82.30% in RMSE, and 25.00% in MAPE, with extensive evaluations demonstrating that STeP-Diff effectively captures the spatio-temporal dependencies in air pollution fields.

Title: MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving

Authors: Bin Sun, Yaoguang Cao, Yan Wang, Rui Wang, Jiachen Shang, Xiejie Feng, Jiayi Lu, Jia Shi, Shichun Yang, Xiaoyu Yan, Ziying Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04441
Pdf URL: https://arxiv.org/pdf/2512.04441
Copy Paste: [[2512.04441]] MindDrive: An All-in-One Framework Bridging World Models and Vision-Language Model for End-to-End Autonomous Driving(https://arxiv.org/abs/2512.04441)
Keywords: generative
Abstract: End-to-End autonomous driving (E2E-AD) has emerged as a new paradigm, where trajectory planning plays a crucial role. Existing studies mainly follow two directions: trajectory generation oriented, which focuses on producing high-quality trajectories with simple decision mechanisms, and trajectory selection oriented, which performs multi-dimensional evaluation to select the best trajectory yet lacks sufficient generative capability. In this work, we propose MindDrive, a harmonized framework that integrates high-quality trajectory generation with comprehensive decision reasoning. It establishes a structured reasoning paradigm of "context simulation - candidate generation - multi-objective trade-off". In particular, the proposed Future-aware Trajectory Generator (FaTG), based on a World Action Model (WaM), performs ego-conditioned "what-if" simulations to predict potential future scenes and generate foresighted trajectory candidates. Building upon this, the VLM-oriented Evaluator (VLoE) leverages the reasoning capability of a large vision-language model to conduct multi-objective evaluations across safety, comfort, and efficiency dimensions, leading to reasoned and human-aligned decision making. Extensive experiments on the NAVSIM-v1 and NAVSIM-v2 benchmarks demonstrate that MindDrive achieves state-of-the-art performance across multi-dimensional driving metrics, significantly enhancing safety, compliance, and generalization. This work provides a promising path toward interpretable and cognitively guided autonomous driving.

Title: GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis

Authors: Changjin Kim, HyeokJun Lee, YoungJoon Yoo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04456
Pdf URL: https://arxiv.org/pdf/2512.04456
Copy Paste: [[2512.04456]] GuidNoise: Single-Pair Guided Diffusion for Generalized Noise Synthesis(https://arxiv.org/abs/2512.04456)
Keywords: diffusion, generative
Abstract: Recent image denoising methods have leveraged generative modeling for real noise synthesis to address the costly acquisition of real-world noisy data. However, these generative models typically require camera metadata and extensive target-specific noisy-clean image pairs, often showing limited generalization between settings. In this paper, to mitigate these prerequisites, we propose Single-Pair Guided Diffusion for generalized noise synthesis (GuidNoise), which uses a single noisy/clean pair as the guidance, often easily obtained from within a training set. To train GuidNoise, which generates synthetic noisy images from the guidance, we introduce a guidance-aware affine feature modification (GAFM) and a noise-aware refine loss to leverage the inherent potential of diffusion models. This loss function refines the diffusion model's backward process, making the model more adept at generating realistic noise distributions. GuidNoise synthesizes high-quality noisy images under diverse noise environments without additional metadata during both training and inference. Additionally, GuidNoise enables the efficient generation of noisy-clean image pairs at inference time, making synthetic noise readily applicable for augmenting training data. This self-augmentation significantly improves denoising performance, especially in practical scenarios with lightweight models and limited training data. The code is available at this https URL.

Title: dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning

Authors: Yingzi Ma, Yulong Cao, Wenhao Ding, Shuibai Zhang, Yan Wang, Boris Ivanovic, Ming Jiang, Marco Pavone, Chaowei Xiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04459
Pdf URL: https://arxiv.org/pdf/2512.04459
Copy Paste: [[2512.04459]] dVLM-AD: Enhance Diffusion Vision-Language-Model for Driving via Controllable Reasoning(https://arxiv.org/abs/2512.04459)
Keywords: diffusion
Abstract: The autonomous driving community is increasingly focused on addressing the challenges posed by out-of-distribution (OOD) driving scenarios. A dominant research trend seeks to enhance end-to-end (E2E) driving systems by integrating vision-language models (VLMs), leveraging their rich world knowledge and reasoning abilities to improve generalization across diverse environments. However, most existing VLMs or vision-language agents (VLAs) are built upon autoregressive (AR) models. In this paper, we observe that existing AR-based VLMs -- limited by causal attention and sequential token generation -- often fail to maintain consistency and controllability between high-level reasoning and low-level planning. In contrast, recent discrete diffusion VLMs equipped with bidirectional attention exhibit superior controllability and reliability through iterative denoising. Building on these observations, we introduce dVLM-AD, a diffusion-based vision-language model that unifies perception, structured reasoning, and low-level planning for end-to-end driving. Evaluated on nuScenes and WOD-E2E, dVLM-AD yields more consistent reasoning-action pairs and achieves planning performance comparable to existing driving VLM/VLA systems despite a modest backbone, outperforming AR-based baselines with a 9 percent improvement in behavior-trajectory consistency and a 6 percent increase in RFS on long-tail WOD-E2E scenarios. These results suggest a controllable and reliable pathway for scalable end-to-end driving.

Title: UniTS: Unified Time Series Generative Model for Remote Sensing

Authors: Yuxiang Zhang, Shunlin Liang, Wenyuan Li, Han Ma, Jianglei Xu, Yichuan Ma, Jiangwei Xie, Wei Li, Mengmeng Zhang, Ran Tao, Xiang-Gen Xia
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04461
Pdf URL: https://arxiv.org/pdf/2512.04461
Copy Paste: [[2512.04461]] UniTS: Unified Time Series Generative Model for Remote Sensing(https://arxiv.org/abs/2512.04461)
Keywords: diffusion, generative
Abstract: One of the primary objectives of satellite remote sensing is to capture the complex dynamics of the Earth environment, which encompasses tasks such as reconstructing continuous cloud-free time series images, detecting land cover changes, and forecasting future surface evolution. However, existing methods typically require specialized models tailored to different tasks, lacking unified modeling of spatiotemporal features across multiple time series tasks. In this paper, we propose a Unified Time Series Generative Model (UniTS), a general framework applicable to various time series tasks, including time series reconstruction, time series cloud removal, time series semantic change detection, and time series forecasting. Based on the flow matching generative paradigm, UniTS constructs a deterministic evolution path from noise to targets under the guidance of task-specific conditions, achieving unified modeling of spatiotemporal representations for multiple tasks. The UniTS architecture consists of a diffusion transformer with spatio-temporal blocks, where we design an Adaptive Condition Injector (ACor) to enhance the model's conditional perception of multimodal inputs, enabling high-quality controllable generation. Additionally, we design a Spatiotemporal-aware Modulator (STM) to improve the ability of spatio-temporal blocks to capture complex spatiotemporal dependencies. Furthermore, we construct two high-quality multimodal time series datasets, TS-S12 and TS-S12CR, filling the gap of benchmark datasets for time series cloud removal and forecasting tasks. Extensive experiments demonstrate that UniTS exhibits exceptional generative and cognitive capabilities in both low-level and high-level time series tasks. It significantly outperforms existing methods, particularly when facing challenges such as severe cloud contamination, modality absence, and forecasting phenological variations.

Title: GraphBench: Next-generation graph learning benchmarking

Authors: Timo Stoll, Chendi Qian, Ben Finkelshtein, Ali Parviz, Darius Weber, Fabrizio Frasca, Hadar Shavit, Antoine Siraudin, Arman Mielke, Marie Anastacio, Erik Müller, Maya Bechler-Speicher, Michael Bronstein, Mikhail Galkin, Holger Hoos, Mathias Niepert, Bryan Perozzi, Jan Tönshoff, Christopher Morris
Subjects: cs.LG, cs.AI, cs.NE, stat.ML
Abstract URL: https://arxiv.org/abs/2512.04475
Pdf URL: https://arxiv.org/pdf/2512.04475
Copy Paste: [[2512.04475]] GraphBench: Next-generation graph learning benchmarking(https://arxiv.org/abs/2512.04475)
Keywords: generative
Abstract: Machine learning on graphs has recently achieved impressive progress in various domains, including molecular property prediction and chip design. However, benchmarking practices remain fragmented, often relying on narrow, task-specific datasets and inconsistent evaluation protocols, which hampers reproducibility and broader progress. To address this, we introduce GraphBench, a comprehensive benchmarking suite that spans diverse domains and prediction tasks, including node-level, edge-level, graph-level, and generative settings. GraphBench provides standardized evaluation protocols -- with consistent dataset splits and performance metrics that account for out-of-distribution generalization -- as well as a unified hyperparameter tuning framework. Additionally, we benchmark GraphBench using message-passing neural networks and graph transformer models, providing principled baselines and establishing a reference performance. See this http URL for further details.

Title: DeRA: Decoupled Representation Alignment for Video Tokenization

Authors: Pengbo Guo, Junke Wang, Zhen Xing, Chengxu Liu, Daoguo Dong, Xueming Qian, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04483
Pdf URL: https://arxiv.org/pdf/2512.04483
Copy Paste: [[2512.04483]] DeRA: Decoupled Representation Alignment for Video Tokenization(https://arxiv.org/abs/2512.04483)
Keywords: foundation model
Abstract: This paper presents DeRA, a novel 1D video tokenizer that decouples the spatial-temporal representation learning in video tokenization to achieve better training efficiency and performance. Specifically, DeRA maintains a compact 1D latent space while factorizing video encoding into appearance and motion streams, which are aligned with pretrained vision foundation models to capture the spatial semantics and temporal dynamics in videos separately. To address the gradient conflicts introduced by the heterogeneous supervision, we further propose the Symmetric Alignment-Conflict Projection (SACP) module that proactively reformulates gradients by suppressing the components along conflicting directions. Extensive experiments demonstrate that DeRA outperforms LARP, the previous state-of-the-art video tokenizer, by 25% on UCF-101 in terms of rFVD. Moreover, using DeRA for autoregressive video generation, we also achieve new state-of-the-art results on both UCF-101 class-conditional generation and K600 frame prediction.
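
Illustrative sketch (not from the paper): suppressing gradient components along conflicting directions can be illustrated with a generic projection in the style of gradient surgery (PCGrad). This is a swapped-in, well-known technique used for illustration only; the abstract does not describe how SACP is implemented, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def remove_conflict(g, other):
    """If g conflicts with other (negative dot product), drop g's component along other."""
    d = dot(g, other)
    if d >= 0.0:
        return list(g)  # no conflict: leave the gradient unchanged
    scale = d / dot(other, other)
    return [gi - scale * oi for gi, oi in zip(g, other)]

def symmetric_projection(g_a, g_b):
    """Project each gradient against the ORIGINAL other gradient, then sum them."""
    p_a = remove_conflict(g_a, g_b)
    p_b = remove_conflict(g_b, g_a)
    return [x + y for x, y in zip(p_a, p_b)]

if __name__ == "__main__":
    appearance_grad = [1.0, 0.0]   # hypothetical gradient from one alignment loss
    motion_grad = [-1.0, 1.0]      # hypothetical gradient from the other loss
    print(symmetric_projection(appearance_grad, motion_grad))
```

After projection, neither gradient has a negative component along the other, so the summed update no longer pushes directly against either objective.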

Title: Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model

Authors: Yuduo Jin, Brandon Haworth
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2512.04499
Pdf URL: https://arxiv.org/pdf/2512.04499
Copy Paste: [[2512.04499]] Back to Basics: Motion Representation Matters for Human Motion Generation Using Diffusion Model(https://arxiv.org/abs/2512.04499)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as a widely utilized and successful methodology in human motion synthesis. Task-oriented diffusion models have significantly advanced action-to-motion, text-to-motion, and audio-to-motion applications. In this paper, we investigate fundamental questions regarding motion representations and loss functions in a controlled study, and we enumerate the impacts of various decisions in the workflow of the generative motion diffusion model. To answer these questions, we conduct empirical studies based on a proxy motion diffusion model (MDM). We apply v loss as the prediction objective on MDM (vMDM), where v is the weighted sum of motion data and noise. We aim to enhance the understanding of latent data distributions and provide a foundation for improving the state of conditional motion diffusion models. First, we evaluate the six common motion representations in the literature and compare their performance in terms of quality and diversity metrics. Second, we compare the training time under various configurations to shed light on how to speed up the training process of motion diffusion models. Finally, we also conduct evaluation analysis on a large motion dataset. The results of our experiments indicate clear performance differences across motion representations in diverse datasets. Our results also demonstrate the impacts of distinct configurations on model training and suggest the importance and effectiveness of these decisions on the outcomes of motion diffusion models.
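
Illustrative sketch (not from the paper): the abstract only states that v is a weighted sum of motion data and noise. The example below assumes the standard v-parameterization with a variance-preserving schedule (alpha^2 + sigma^2 = 1), where v = alpha * eps - sigma * x0; whether vMDM uses exactly these weights is an assumption, and all names are hypothetical.

```python
import math

def noisy_sample(x0, eps, alpha, sigma):
    """Forward-noised sample x_t = alpha * x0 + sigma * eps."""
    return [alpha * a + sigma * e for a, e in zip(x0, eps)]

def v_target(x0, eps, alpha, sigma):
    """Assumed v target: v = alpha * eps - sigma * x0."""
    return [alpha * e - sigma * a for a, e in zip(x0, eps)]

def recover_x0(x_t, v, alpha, sigma):
    """Invert the parameterization: x0 = alpha * x_t - sigma * v (needs alpha^2 + sigma^2 = 1)."""
    return [alpha * xt - sigma * vi for xt, vi in zip(x_t, v)]

if __name__ == "__main__":
    alpha, sigma = math.cos(0.3), math.sin(0.3)
    x0 = [0.2, -1.0, 0.7]      # stand-in for one frame of motion features
    eps = [0.5, 0.1, -0.4]     # stand-in for sampled Gaussian noise
    x_t = noisy_sample(x0, eps, alpha, sigma)
    v = v_target(x0, eps, alpha, sigma)
    print(recover_x0(x_t, v, alpha, sigma))
```

Under this assumption a network that predicts v lets both the clean motion and the noise be recovered from x_t, which is the usual reason for choosing this objective.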

Title: UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers

Authors: Min Zhao, Bokai Yan, Xue Yang, Hongzhou Zhu, Jintao Zhang, Shilong Liu, Chongxuan Li, Jun Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04504
Pdf URL: https://arxiv.org/pdf/2512.04504
Copy Paste: [[2512.04504]] UltraImage: Rethinking Resolution Extrapolation in Image Diffusion Transformers(https://arxiv.org/abs/2512.04504)
Keywords: diffusion
Abstract: Recent image diffusion transformers achieve high-fidelity generation, but struggle to generate images beyond these scales, suffering from content repetition and quality degradation. In this work, we present UltraImage, a principled framework that addresses both issues. Through frequency-wise analysis of positional embeddings, we identify that repetition arises from the periodicity of the dominant frequency, whose period aligns with the training resolution. We introduce a recursive dominant frequency correction to constrain it within a single period after extrapolation. Furthermore, we find that quality degradation stems from diluted attention and thus propose entropy-guided adaptive attention concentration, which assigns higher focus factors to sharpen local attention for fine detail and lower ones to global attention patterns to preserve structural consistency. Experiments show that UltraImage consistently outperforms prior methods on Qwen-Image and Flux (around 4K) across three generation scenarios, reducing repetition and improving visual fidelity. Moreover, UltraImage can generate images up to 6K*6K without low-resolution guidance from a training resolution of 1328p, demonstrating its extreme extrapolation capability. Project page is available at this https URL.
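
Illustrative sketch (not from the paper): the link between diluted attention, entropy, and a focus factor can be shown with a toy softmax. The example assumes a focus factor acts as a multiplier on attention logits before the softmax; the abstract does not specify how UltraImage applies it, so this is only a hypothetical reading.

```python
import math

def softmax(logits, focus=1.0):
    """Softmax over logits scaled by a focus factor (assumed form of concentration)."""
    scaled = [focus * z for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher values mean more diluted attention."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0]
    for focus in (1.0, 2.0):
        print(focus, entropy(softmax(logits, focus)))
```

A larger factor lowers the entropy of the attention row, concentrating weight on the strongest keys, while a smaller factor leaves the pattern closer to its original spread.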

Title: DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance

Authors: Yinghui Xing, Xiaoting Su, Shizhou Zhang, Donghao Chu, Di Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04511
Pdf URL: https://arxiv.org/pdf/2512.04511
Copy Paste: [[2512.04511]] DuGI-MAE: Improving Infrared Mask Autoencoders via Dual-Domain Guidance(https://arxiv.org/abs/2512.04511)
Keywords: self-supervised, foundation model
Abstract: Infrared imaging plays a critical role in low-light and adverse weather conditions. However, due to the distinct characteristics of infrared images, existing foundation models such as Masked Autoencoder (MAE) trained on visible data perform suboptimally in infrared image interpretation tasks. To bridge this gap, an infrared foundation model known as InfMAE was developed and pre-trained on large-scale infrared datasets. Despite its effectiveness, InfMAE still faces several limitations, including the omission of informative tokens, insufficient modeling of global associations, and neglect of non-uniform noise. In this paper, we propose a Dual-domain Guided Infrared foundation model based on MAE (DuGI-MAE). First, we design a deterministic masking strategy based on token entropy, preserving only high-entropy tokens for reconstruction to enhance informativeness. Next, we introduce a Dual-Domain Guidance (DDG) module, which simultaneously captures global token relationships and adaptively filters non-uniform background noise commonly present in infrared imagery. To facilitate large-scale pretraining, we construct Inf-590K, a comprehensive infrared image dataset encompassing diverse scenes, various target types, and multiple spatial resolutions. Pretrained on Inf-590K, DuGI-MAE demonstrates strong generalization capabilities across various downstream tasks, including infrared object detection, semantic segmentation, and small target detection. Experimental results validate the superiority of the proposed method over both supervised and self-supervised comparison methods. Our code is available in the supplementary material.
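
Illustrative sketch (not from the paper): a deterministic, entropy-based token selection can be sketched as follows. It assumes each token is a list of pixel intensities in [0, 1] and that entropy is the Shannon entropy of an intensity histogram; the bin count, the value range, and the function names are all hypothetical choices, since the abstract does not define how token entropy is computed.

```python
import math
from collections import Counter

def token_entropy(values, bins=8, lo=0.0, hi=1.0):
    """Shannon entropy (bits) of a histogram of the token's values."""
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def keep_high_entropy(tokens, keep):
    """Sorted indices of the `keep` highest-entropy tokens (ties broken by index)."""
    ranked = sorted(range(len(tokens)), key=lambda i: (-token_entropy(tokens[i]), i))
    return sorted(ranked[:keep])

if __name__ == "__main__":
    tokens = [
        [0.5, 0.5, 0.5, 0.5],   # flat patch, no information
        [0.0, 0.3, 0.6, 0.9],   # varied patch
        [0.1, 0.1, 0.9, 0.9],   # two-level patch
    ]
    print(keep_high_entropy(tokens, keep=2))
```

Because the ranking is a pure function of the input, the same image always yields the same kept tokens, unlike the random masking used in a standard MAE.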

Title: EgoLCD: Egocentric Video Generation with Long Context Diffusion

Authors: Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04515
Pdf URL: https://arxiv.org/pdf/2512.04515
Copy Paste: [[2512.04515]] EgoLCD: Egocentric Video Generation with Long Context Diffusion(https://arxiv.org/abs/2512.04515)
Keywords: diffusion, generative
Abstract: Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: this https URL. Website: this https URL.

Title: VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Authors: Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04519
Pdf URL: https://arxiv.org/pdf/2512.04519
Copy Paste: [[2512.04519]] VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory(https://arxiv.org/abs/2512.04519)
Keywords: diffusion
Abstract: Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons, enabling content diversity and interactive prompt-based control, thereby establishing a scalable, memory-aware framework for long video generation.
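
Illustrative sketch (not from the paper): the linear-time claim for a state-space memory can be illustrated with the simplest possible recurrence. The example assumes a scalar linear SSM, h_t = a * h_(t-1) + b * x_t with readout y_t = c * h_t; the abstract does not give the form of VideoSSM's state-space model, so the parameters and names are hypothetical.

```python
def ssm_scan(xs, a, b, c, h0=0.0):
    """Run a scalar linear state-space recurrence over a sequence in one pass."""
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x   # fixed-size state summarizes everything seen so far
        ys.append(c * h)    # readout from the current state
    return ys

if __name__ == "__main__":
    # An impulse at the first step decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

Each step touches only the fixed-size state, so cost grows linearly with sequence length, in contrast to attention over all past frames.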

Title: Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation

Authors: Chenlin Xu, Lei Zhang, Lituan Wang, Xinyu Pu, Pengfei Ma, Guangwu Qian, Zizhou Wang, Yan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04520
Pdf URL: https://arxiv.org/pdf/2512.04520
Copy Paste: [[2512.04520]] Boundary-Aware Test-Time Adaptation for Zero-Shot Medical Image Segmentation(https://arxiv.org/abs/2512.04520)
Keywords: foundation model
Abstract: Due to the scarcity of annotated data and the substantial computational costs of models, conventional tuning methods in medical image segmentation face critical challenges. Current approaches to adapting pretrained models, including full-parameter and parameter-efficient fine-tuning, still rely heavily on task-specific training on downstream tasks. Therefore, zero-shot segmentation has gained increasing attention, especially with foundation models such as SAM demonstrating promising generalization capabilities. However, SAM still faces notable limitations on medical datasets due to domain shifts, making efficient zero-shot enhancement an urgent research goal. To address these challenges, we propose BA-TTA-SAM, a task-agnostic test-time adaptation framework that significantly enhances the zero-shot segmentation performance of SAM via test-time adaptation. This framework integrates two key mechanisms: (1) The encoder-level Gaussian prompt injection embeds Gaussian-based prompts directly into the image encoder, providing explicit guidance for initial representation learning. (2) The cross-layer boundary-aware attention alignment exploits the hierarchical feature interactions within the ViT backbone, aligning deep semantic responses with shallow boundary cues. Experiments on four datasets, including ISIC, Kvasir, BUSI, and REFUGE, show an average improvement of 12.4% in the DICE score compared with SAM's zero-shot segmentation performance. The results demonstrate that our method consistently outperforms state-of-the-art models in medical image segmentation. Our framework significantly enhances the generalization ability of SAM, without requiring any source-domain training data. Extensive experiments on publicly available medical datasets strongly demonstrate the superiority of our framework. Our code is available at this https URL.
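
Illustrative sketch (not from the paper): the DICE score reported above is the standard overlap measure 2|A ∩ B| / (|A| + |B|) between a predicted mask A and a ground-truth mask B. The example computes it for flat binary masks; returning 1.0 when both masks are empty is a convention assumed here, not something the abstract specifies.

```python
def dice_score(pred, truth):
    """Dice coefficient for two flat binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total

if __name__ == "__main__":
    pred = [1, 1, 0, 0]
    truth = [1, 0, 1, 0]
    print(dice_score(pred, truth))
```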

Title: PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

Authors: Yu-Wei Zhan, Xin Wang, Hong Chen, Tongtong Feng, Wei Feng, Ren Wang, Guangyao Li, Qing Li, Wenwu Zhu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04532
Pdf URL: https://arxiv.org/pdf/2512.04532
Copy Paste: [[2512.04532]] PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement(https://arxiv.org/abs/2512.04532)
Keywords: self-supervised
Abstract: Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics. This limitation primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding, but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also capturing physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, we incorporate a Neural Ordinary Differential Equation (Neural ODE) module, which generates differentiable physical dynamic representations. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM models the continuous evolution of object motion in a self-supervised manner. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.

Title: Refaçade: Editing Object with Given Reference Texture

Authors: Youze Huang (1), Penghui Ruan (2), Bojia Zi (3), Xianbiao Qi (4), Jianan Wang (5), Rong Xiao (4) ((1) University of Electronic Science and Technology of China, (2) The Hong Kong Polytechnic University, (3) The Chinese University of Hong Kong, (4) IntelliFusion Inc., (5) Astribot Inc.)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04534
Pdf URL: https://arxiv.org/pdf/2512.04534
Copy Paste: [[2512.04534]] Refaçade: Editing Object with Given Reference Texture(https://arxiv.org/abs/2512.04534)
Keywords: diffusion
Abstract: Recent advances in diffusion models have brought remarkable progress in image and video editing, yet some tasks remain underexplored. In this paper, we introduce a new task, Object Retexture, which transfers local textures from a reference object to a target object in images or videos. To perform this task, a straightforward solution is to use ControlNet conditioned on the source structure and the reference texture. However, this approach suffers from limited controllability for two reasons: conditioning on the raw reference image introduces unwanted structural information, and it fails to disentangle the visual texture and structure information of the source. To address this problem, we propose Refaçade, a method that consists of two key designs to achieve precise and controllable texture transfer in both images and videos. First, we employ a texture remover trained on paired textured/untextured 3D mesh renderings to remove appearance information while preserving the geometry and motion of source videos. Second, we disrupt the reference global layout using a jigsaw permutation, encouraging the model to focus on local texture statistics rather than the global layout of the object. Extensive experiments demonstrate superior visual quality, precise editing, and controllability, outperforming strong baselines in both quantitative and human evaluations. Code is available at this https URL.
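
Illustrative sketch (not from the paper): a jigsaw permutation of a reference image can be sketched as shuffling non-overlapping square tiles. The example assumes a 2D list as the image and a fixed tile size; the tile size, seeding, and function name are hypothetical, since the abstract does not give these details.

```python
import random

def jigsaw_permute(image, patch, seed=0):
    """Shuffle non-overlapping patch x patch tiles of a 2D list image."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = h // patch, w // patch
    # Cut the image into tiles in row-major order.
    tiles = [
        [row[c * patch:(c + 1) * patch] for row in image[r * patch:(r + 1) * patch]]
        for r in range(rows) for c in range(cols)
    ]
    random.Random(seed).shuffle(tiles)
    # Paste the shuffled tiles back onto an empty canvas.
    out = [[None] * w for _ in range(h)]
    for k, tile in enumerate(tiles):
        r, c = divmod(k, cols)
        for dy in range(patch):
            out[r * patch + dy][c * patch:(c + 1) * patch] = tile[dy]
    return out

if __name__ == "__main__":
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    for row in jigsaw_permute(image, patch=2, seed=1):
        print(row)
```

Pixels inside each tile stay together, so local texture statistics survive while the global layout of the object is destroyed.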

Title: X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Authors: Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04537
Pdf URL: https://arxiv.org/pdf/2512.04537
Copy Paste: [[2512.04537]] X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale(https://arxiv.org/abs/2512.04537)
Keywords: generative
Abstract: The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms onto egocentric videos, which cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline, turning community assets into 17+ hours of paired synthetic videos using Unreal Engine. We then apply our trained model to 60 hours of the Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames. Quantitative analysis and user studies confirm our method's superiority over existing baselines: 69% of users rated it best for motion consistency, and 62.1% for embodiment correctness.

Title: Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Authors: Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04559
Pdf URL: https://arxiv.org/pdf/2512.04559
Copy Paste: [[2512.04559]] Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function(https://arxiv.org/abs/2512.04559)
Keywords: diffusion
Abstract: Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.

Title: LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models

Authors: Siddharth Betala, Samuel P. Gleason, Ali Ramlaoui, Andy Xu, Georgia Channing, Daniel Levy, Clémentine Fourrier, Nikita Kazeev, Chaitanya K. Joshi, Sékou-Oumar Kaba, Félix Therrien, Alex Hernandez-Garcia, Rocío Mercado, N. M. Anoop Krishnan, Alexandre Duval
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04562
Pdf URL: https://arxiv.org/pdf/2512.04562
Copy Paste: [[2512.04562]] LeMat-GenBench: A Unified Evaluation Framework for Crystal Generative Models(https://arxiv.org/abs/2512.04562)
Keywords: generative
Abstract: Generative machine learning (ML) models hold great promise for accelerating materials discovery through the inverse design of inorganic crystals, enabling an unprecedented exploration of chemical space. Yet, the lack of standardized evaluation frameworks makes it challenging to evaluate, compare, and further develop these ML models meaningfully. In this work, we introduce LeMat-GenBench, a unified benchmark for generative models of crystalline materials, supported by a set of evaluation metrics designed to better inform model development and downstream applications. We release both an open-source evaluation suite and a public leaderboard on Hugging Face, and benchmark 12 recent generative models. Results reveal that an increase in stability leads to a decrease in novelty and diversity on average, with no model excelling across all dimensions. Altogether, LeMat-GenBench establishes a reproducible and extensible foundation for fair model comparison and aims to guide the development of more reliable, discovery-oriented generative models for crystalline materials.

Title: Exploiting ftrace's function_graph Tracer Features for Machine Learning: A Case Study on Encryption Detection

Authors: Kenan Begovic, Abdulaziz Al-Ali, Qutaibah Malluhi
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2512.04590
Pdf URL: https://arxiv.org/pdf/2512.04590
Copy Paste: [[2512.04590]] Exploiting ftrace's function_graph Tracer Features for Machine Learning: A Case Study on Encryption Detection(https://arxiv.org/abs/2512.04590)
Keywords: anomaly
Abstract: This paper proposes using the Linux kernel ftrace framework, particularly the function graph tracer, to generate informative system level data for machine learning (ML) applications. Experiments on a real world encryption detection task demonstrate the efficacy of the proposed features across several learning algorithms. The learner faces the problem of detecting encryption activities across a large dataset of files, using function call traces and graph based features. Empirical results highlight an outstanding accuracy of 99.28 on the task at hand, underscoring the efficacy of features derived from the function graph tracer. The results were further validated in an additional experiment targeting a multilabel classification problem, in which running programs were identified from trace data. This work provides comprehensive methodologies for preprocessing raw trace data and extracting graph based features, offering significant advancements in applying ML to system behavior analysis, program identification, and anomaly detection. By bridging the gap between system tracing and ML, this paper paves the way for innovative solutions in performance monitoring and security analytics.

Title: QoSDiff: An Implicit Topological Embedding Learning Framework Leveraging Denoising Diffusion and Adversarial Attention for Robust QoS Prediction

Authors: Guanchen Du, Jianlong Xu, Wei Wei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04596
Pdf URL: https://arxiv.org/pdf/2512.04596
Copy Paste: [[2512.04596]] QoSDiff: An Implicit Topological Embedding Learning Framework Leveraging Denoising Diffusion and Adversarial Attention for Robust QoS Prediction(https://arxiv.org/abs/2512.04596)
Keywords: diffusion
Abstract: Accurate Quality of Service (QoS) prediction is fundamental to service computing, providing essential data-driven guidance for service selection and ensuring superior user experiences. However, prevalent approaches, particularly Graph Neural Networks (GNNs), heavily rely on constructing explicit user-service interaction graphs. This dependency introduces severe scalability bottlenecks and limits performance when explicit connections are sparse or corrupted by noise. To address these challenges, this paper introduces QoSDiff, a novel embedding learning framework that bypasses the prerequisite of explicit graph construction. Specifically, it leverages a denoising diffusion probabilistic model to recover intrinsic latent structures from noisy initializations. To further capture high-order interactions, we propose an adversarial interaction module that integrates a bidirectional hybrid attention mechanism. This adversarial paradigm dynamically distinguishes informative patterns from noise, enabling a dual-perspective modeling of intricate user-service associations. Extensive experiments on two large-scale real-world datasets demonstrate that QoSDiff significantly outperforms state-of-the-art baselines. Notably, the results highlight the framework's superior cross-dataset generalization capability and exceptional robustness against data sparsity and observational noise.

Title: Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space

Authors: Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2512.04601
Pdf URL: https://arxiv.org/pdf/2512.04601
Copy Paste: [[2512.04601]] Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space(https://arxiv.org/abs/2512.04601)
Keywords: generative
Abstract: Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.

Title: Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

Authors: Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, Zhuzhong Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04619
Pdf URL: https://arxiv.org/pdf/2512.04619
Copy Paste: [[2512.04619]] Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence(https://arxiv.org/abs/2512.04619)
Keywords: diffusion, foundation model
Abstract: In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
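
Illustrative sketch (not from the paper): soft-argmax localization and a forward-backward consistency check can be sketched on a toy score map. The example assumes the score map holds feature similarities between a query point and every location in the target frame; the temperature, the pixel tolerance, and the function names are hypothetical.

```python
import math

def soft_argmax(score_map, temperature=1.0):
    """Expected (row, col) under a softmax over a 2D score map."""
    flat = [s / temperature for row in score_map for s in row]
    m = max(flat)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in flat]
    total = sum(weights)
    w = len(score_map[0])
    y = sum(wt * (i // w) for i, wt in enumerate(weights)) / total
    x = sum(wt * (i % w) for i, wt in enumerate(weights)) / total
    return y, x

def is_cycle_consistent(start, returned, tol=1.0):
    """Forward-backward check: tracking forward then back should land near the start."""
    return math.dist(start, returned) <= tol

if __name__ == "__main__":
    scores = [[0.0, 0.0, 0.0],
              [0.0, 0.0, 50.0]]
    print(soft_argmax(scores))
    print(is_cycle_consistent((1.0, 2.0), (1.2, 2.1)))
```

Soft-argmax gives sub-pixel, differentiable coordinates, and the cycle check offers a label-free way to reject correspondences that do not map back to their origin.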

Title: Federated Learning for Anomaly Detection in Maritime Movement Data

Authors: Anita Graser, Axel Weißenfeld, Clemens Heistracher, Melitta Dragaschnig, Peter Widhalm
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04635
Pdf URL: https://arxiv.org/pdf/2512.04635
Copy Paste: [[2512.04635]] Federated Learning for Anomaly Detection in Maritime Movement Data(https://arxiv.org/abs/2512.04635)
Keywords: anomaly
Abstract: This paper introduces M3fed, a novel solution for federated learning of movement anomaly detection models. This innovation has the potential to improve data privacy and reduce communication costs in machine learning for movement anomaly detection. We present the novel federated learning (FL) strategies employed to train M3fed, perform an example experiment with maritime AIS data, and evaluate the results with respect to communication costs and FL model quality by comparing classic centralized M3 and the new federated M3fed.

Title: Cryptanalysis of Gleeok-128

Authors: Siwei Chen, Peipei Xie, Shengyuan Xu, Xiutao Feng, Zejun Xiang, Xiangyong Zeng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.04675
Pdf URL: https://arxiv.org/pdf/2512.04675
Copy Paste: [[2512.04675]] Cryptanalysis of Gleeok-128(https://arxiv.org/abs/2512.04675)
Keywords: diffusion
Abstract: Gleeok is a family of low-latency keyed pseudorandom functions (PRFs) consisting of three parallel SPN-based permutations whose outputs are XORed to form the final value. Both Gleeok-128 and Gleeok-256 use a 256-bit key, with block sizes of 128 and 256 bits, respectively. Owing to its multi-branch structure, evaluating security margins and mounting effective key-recovery attacks present nontrivial challenges. This paper provides the first comprehensive third-party cryptanalysis of Gleeok-128. We introduce a two-stage MILP-based framework for constructing branch-wise and full-cipher differential-linear (DL) distinguishers, together with an integral-based key-recovery framework tailored to multi-branch designs. Our DL analysis yields 7-, 7-, 8-, and 4-round distinguishers for Branch 1, Branch 2, Branch 3, and Gleeok-128, respectively, with squared correlations of approximately $2^{-88.12}$, $2^{-88.12}$, $2^{-38.73}$, and $2^{-49.04}$, outperforming those in the design document except for the full PRF case. By tightening algebraic degree bounds, we further derive 9-, 9-, and 7-round integral distinguishers for the three branches and a 7-round distinguisher for the full PRF, extending the designers' results by 3, 3, and 2 rounds and by 2 rounds, respectively. These integral properties enable 7-round and 8-round key-recovery attacks in the non-full-codebook and full-codebook settings. In addition, we identify a flaw in the original linear security evaluation of Branch 3, showing that it can be distinguished over all 12 rounds with data complexity about $2^{48}$. We also propose optimized linear-layer parameters that significantly improve linear resistance without sacrificing diffusion. Our results advance the understanding of Gleeok-128 and provide general methods for analyzing multi-branch symmetric designs.

Title: Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

Authors: Yubo Huang, Hailong Guo, Fangtai Wu, Shifeng Zhang, Shijie Huang, Qijun Gan, Lin Liu, Sirui Zhao, Enhong Chen, Jiaming Liu, Steven Hoi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04677
Pdf URL: https://arxiv.org/pdf/2512.04677
Copy Paste: [[2512.04677]] Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length(https://arxiv.org/abs/2512.04677)
Keywords: diffusion
Abstract: Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.

Title: Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation

Authors: Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, Yujun Shen, Min Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04678
Pdf URL: https://arxiv.org/pdf/2512.04678
Copy Paste: [[2512.04678]] Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation(https://arxiv.org/abs/2512.04678)
Keywords: diffusion
Abstract: Efficient streaming video generation is critical for simulating interactive and dynamic worlds. Existing methods distill few-step video diffusion models with sliding window attention, using initial frames as sink tokens to maintain attention performance and reduce error accumulation. However, video frames become overly dependent on these static tokens, resulting in copied initial frames and diminished motion dynamics. To address this, we introduce Reward Forcing, a novel framework with two key designs. First, we propose EMA-Sink, which maintains fixed-size tokens initialized from initial frames and continuously updated by fusing evicted tokens via exponential moving average as they exit the sliding window. Without additional computation cost, EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying while maintaining long-horizon consistency. Second, to better distill motion dynamics from teacher models, we propose a novel Rewarded Distribution Matching Distillation (Re-DMD). Vanilla distribution matching treats every training sample equally, limiting the model's ability to prioritize dynamic content. Instead, Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model. Re-DMD significantly enhances motion quality while preserving data fidelity. We include both quantitative and qualitative experiments to show that Reward Forcing achieves state-of-the-art performance on standard benchmarks while enabling high-quality streaming video generation at 23.1 FPS on a single H100 GPU.
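
The EMA-Sink idea described above, where tokens leaving the sliding window are fused into a fixed-size sink via an exponential moving average, can be sketched as follows. This is an illustrative toy, not the paper's code: tokens are plain lists of floats, the sink is a single token, and the names `ema_update`, `stream_with_ema_sink`, and the `decay` value are assumptions.

```python
from collections import deque

def ema_update(sink, evicted, decay=0.9):
    """Fuse an evicted token into the sink with an exponential moving average."""
    return [decay * s + (1.0 - decay) * e for s, e in zip(sink, evicted)]

def stream_with_ema_sink(tokens, window_size, decay=0.9):
    """Stream tokens through a sliding window of `window_size`.

    The sink is initialized from the first token; every token that
    exits the window is folded into the sink instead of being dropped.
    Returns the final sink and the tokens still inside the window.
    """
    sink = list(tokens[0])
    window = deque()
    for tok in tokens:
        window.append(tok)
        if len(window) > window_size:
            evicted = window.popleft()  # oldest token leaves the window
            sink = ema_update(sink, evicted, decay)
    return sink, list(window)

sink, window = stream_with_ema_sink(
    [[0.0], [1.0], [2.0], [3.0]], window_size=2, decay=0.5
)
```

The sink stays the same size no matter how long the stream runs, so attention cost does not grow, while its contents drift toward recently evicted tokens rather than staying pinned to the initial frames.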

Title: TRINITY: An Evolved LLM Coordinator

Authors: Jinglue Xu, Qi Sun, Peter Schwendeman, Stefan Nielsen, Edoardo Cetin, Yujin Tang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04695
Pdf URL: https://arxiv.org/pdf/2512.04695
Copy Paste: [[2512.04695]] TRINITY: An Evolved LLM Coordinator(https://arxiv.org/abs/2512.04695)
Keywords: foundation model
Abstract: Combining diverse foundation models is promising, but weight-merging is limited by mismatched architectures and closed APIs. Trinity addresses this with a lightweight coordinator that orchestrates collaboration among large language models (LLMs). The coordinator, comprising a compact language model (approximately $0.6$B parameters) and a lightweight head (approximately $10$K parameters), is optimized with an evolutionary strategy for efficient and adaptive delegation. Trinity processes queries over multiple turns, where at each turn the coordinator assigns one of three roles (Thinker, Worker, or Verifier) to a selected LLM, effectively offloading complex skill acquisition from the coordinator itself. Experiments show that Trinity consistently outperforms individual models and existing methods across coding, math, reasoning, and domain knowledge tasks, and generalizes robustly to out-of-distribution tasks. On standard benchmarks, Trinity achieves state-of-the-art results, including a score of 86.2% on LiveCodeBench. Theoretical and empirical analyses identify two main factors behind this performance: (1) the coordinator's hidden-state representations provide rich contextualization of inputs, and (2) under high dimensionality and strict budget constraints, the separable Covariance Matrix Adaptation Evolution Strategy offers advantages over reinforcement learning, imitation learning, and random search by exploiting potential block-epsilon-separability.

Title: OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution

Authors: Xinning Chai, Zhengxue Cheng, Yuhong Zhang, Hengsheng Zhang, Yingsheng Qin, Yucai Yang, Rong Xie, Li Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04699
Pdf URL: https://arxiv.org/pdf/2512.04699
Copy Paste: [[2512.04699]] OmniScaleSR: Unleashing Scale-Controlled Diffusion Prior for Faithful and Realistic Arbitrary-Scale Image Super-Resolution(https://arxiv.org/abs/2512.04699)
Keywords: diffusion
Abstract: Arbitrary-scale super-resolution (ASSR) overcomes the limitation of traditional super-resolution (SR) methods that operate only at fixed scales (e.g., 4x), enabling a single model to handle arbitrary magnification. Most existing ASSR approaches rely on implicit neural representation (INR), but its regression-driven feature extraction and aggregation intrinsically limit the ability to synthesize fine details, leading to low realism. Recent diffusion-based realistic image super-resolution (Real-ISR) models leverage powerful pre-trained diffusion priors and show impressive results at the 4x setting. We observe that they can also achieve ASSR because the diffusion prior implicitly adapts to scale by encouraging high-realism generation. However, without explicit scale control, the diffusion process cannot be properly adjusted for different magnification levels, resulting in excessive hallucination or blurry outputs, especially under ultra-high scales. To address these issues, we propose OmniScaleSR, a diffusion-based realistic arbitrary-scale SR framework designed to achieve both high fidelity and high realism. We introduce explicit, diffusion-native scale control mechanisms that work synergistically with implicit scale adaptation, enabling scale-aware and content-aware modulation of the diffusion process. In addition, we incorporate multi-domain fidelity enhancement designs to further improve reconstruction accuracy. Extensive experiments on bicubic degradation benchmarks and real-world datasets show that OmniScaleSR surpasses state-of-the-art methods in both fidelity and perceptual realism, with particularly strong performance at large magnification factors. Code will be released at this https URL.

Title: Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild

Authors: Yigui Feng, Qinglin Wang, Haotian Mo, Yang Liu, Ke Liu, Gencheng Liu, Xinhai Chen, Siqi Shen, Songzhu Mei, Jie Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.04728
Pdf URL: https://arxiv.org/pdf/2512.04728
Copy Paste: [[2512.04728]] Measuring the Unspoken: A Disentanglement Model and Benchmark for Psychological Analysis in the Wild(https://arxiv.org/abs/2512.04728)
Keywords: generative
Abstract: Generative psychological analysis of in-the-wild conversations faces two fundamental challenges: (1) existing Vision-Language Models (VLMs) fail to resolve Articulatory-Affective Ambiguity, where visual patterns of speech mimic emotional expressions; and (2) progress is stifled by a lack of verifiable evaluation metrics capable of assessing visual grounding and reasoning depth. We propose a complete ecosystem to address these twin challenges. First, we introduce Multilevel Insight Network for Disentanglement (MIND), a novel hierarchical visual encoder that introduces a Status Judgment module to algorithmically suppress ambiguous lip features based on their temporal feature variance, achieving explicit visual disentanglement. Second, we construct ConvoInsight-DB, a new large-scale dataset with expert annotations for micro-expressions and deep psychological inference. Third, we design the Mental Reasoning Insight Rating Metric (PRISM), an automated dimensional framework that uses an expert-guided LLM to measure the multidimensional performance of large mental vision models. On our PRISM benchmark, MIND significantly outperforms all baselines, achieving a +86.95% gain in micro-expression detection over prior SOTA. Ablation studies confirm that our Status Judgment disentanglement module is the most critical component for this performance leap. Our code has been open-sourced.

Title: Order Matters: 3D Shape Generation from Sequential VR Sketches

Authors: Yizi Chen, Sidi Wu, Tianyi Xiao, Nina Wiedemann, Loic Landrieu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04761
Pdf URL: https://arxiv.org/pdf/2512.04761
Copy Paste: [[2512.04761]] Order Matters: 3D Shape Generation from Sequential VR Sketches(https://arxiv.org/abs/2512.04761)
Keywords: diffusion
Abstract: VR sketching lets users explore and iterate on ideas directly in 3D, offering a faster and more intuitive alternative to conventional CAD tools. However, existing sketch-to-shape models ignore the temporal ordering of strokes, discarding crucial cues about structure and design intent. We introduce VRSketch2Shape, the first framework and multi-category dataset for generating 3D shapes from sequential VR sketches. Our contributions are threefold: (i) an automated pipeline that generates sequential VR sketches from arbitrary shapes, (ii) a dataset of over 20k synthetic and 900 hand-drawn sketch-shape pairs across four categories, and (iii) an order-aware sketch encoder coupled with a diffusion-based 3D generator. Our approach yields higher geometric fidelity than prior work, generalizes effectively from synthetic to real sketches with minimal supervision, and performs well even on partial sketches. All data and models will be released open-source at this https URL.

Title: PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling

Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Changliang Xia, Xin Shen, Zhuohang Dang, Hangwei Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04784
Pdf URL: https://arxiv.org/pdf/2512.04784
Copy Paste: [[2512.04784]] PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling(https://arxiv.org/abs/2512.04784)
Keywords: generative
Abstract: Consistent image generation requires faithfully preserving identities, styles, and logical coherence across multiple images, which is essential for applications such as storytelling and character design. Supervised training approaches struggle with this task due to the lack of large-scale datasets capturing visual consistency and the complexity of modeling human perceptual preferences. In this paper, we argue that reinforcement learning (RL) offers a promising alternative by enabling models to learn complex and subjective visual criteria in a data-free manner. To achieve this, we introduce PaCo-RL, a comprehensive framework that combines a specialized consistency reward model with an efficient RL algorithm. The first component, PaCo-Reward, is a pairwise consistency evaluator trained on a large-scale dataset constructed via automated sub-figure pairing. It evaluates consistency through a generative, autoregressive scoring mechanism enhanced by task-aware instructions and CoT reasons. The second component, PaCo-GRPO, leverages a novel resolution-decoupled optimization strategy to substantially reduce RL cost, alongside a log-tamed multi-reward aggregation mechanism that ensures balanced and stable reward optimization. Extensive experiments across the two representative subtasks show that PaCo-Reward significantly improves alignment with human perceptions of visual consistency, and PaCo-GRPO achieves state-of-the-art consistency performance with improved training efficiency and stability. Together, these results highlight the promise of PaCo-RL as a practical and scalable solution for consistent image generation. The project page is available at this https URL.

Title: LaFiTe: A Generative Latent Field for 3D Native Texturing

Authors: Chia-Hao Chen, Zi-Xin Zou, Yan-Pei Cao, Ze Yuan, Guan Luo, Xiaojuan Qi, Ding Liang, Song-Hai Zhang, Yuan-Chen Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04786
Pdf URL: https://arxiv.org/pdf/2512.04786
Copy Paste: [[2512.04786]] LaFiTe: A Generative Latent Field for 3D Native Texturing(https://arxiv.org/abs/2512.04786)
Keywords: generative
Abstract: Generating high-fidelity, seamless textures directly on 3D surfaces, what we term 3D-native texturing, remains a fundamental open challenge, with the potential to overcome long-standing limitations of UV-based and multi-view projection methods. However, existing native approaches are constrained by the absence of a powerful and versatile latent representation, which severely limits the fidelity and generality of their generated textures. We identify this representation gap as the principal barrier to further progress. We introduce LaFiTe, a framework that addresses this challenge by learning to generate textures as a 3D generative sparse latent color field. At its core, LaFiTe employs a variational autoencoder (VAE) to encode complex surface appearance into a sparse, structured latent space, which is subsequently decoded into a continuous color field. This representation achieves unprecedented fidelity, exceeding state-of-the-art methods by >10 dB PSNR in reconstruction, by effectively disentangling texture appearance from mesh topology and UV parameterization. Building upon this strong representation, a conditional rectified-flow model synthesizes high-quality, coherent textures across diverse styles and geometries. Extensive experiments demonstrate that LaFiTe not only sets a new benchmark for 3D-native texturing but also enables flexible downstream applications such as material synthesis and texture super-resolution, paving the way for the next generation of 3D content creation workflows.

Title: LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation

Authors: Huynh Trinh Ngoc, Hoang Anh Nguyen Kim, Toan Nguyen Hai, Long Tran Quoc
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04821
Pdf URL: https://arxiv.org/pdf/2512.04821
Copy Paste: [[2512.04821]] LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation(https://arxiv.org/abs/2512.04821)
Keywords: generative
Abstract: Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.

Title: FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis

Authors: Shijie Chen, Peixi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04830
Pdf URL: https://arxiv.org/pdf/2512.04830
Copy Paste: [[2512.04830]] FreeGen: Feed-Forward Reconstruction-Generation Co-Training for Free-Viewpoint Driving Scene Synthesis(https://arxiv.org/abs/2512.04830)
Keywords: generative
Abstract: Closed-loop simulation and scalable pre-training for autonomous driving require synthesizing free-viewpoint driving scenes. However, existing datasets and generative pipelines rarely provide consistent off-trajectory observations, limiting large-scale evaluation and training. While recent generative models demonstrate strong visual realism, they struggle to jointly achieve interpolation consistency and extrapolation realism without per-scene optimization. To address this, we propose FreeGen, a feed-forward reconstruction-generation co-training framework for free-viewpoint driving scene synthesis. The reconstruction model provides stable geometric representations to ensure interpolation consistency, while the generation model performs geometry-aware enhancement to improve realism at unseen viewpoints. Through co-training, generative priors are distilled into the reconstruction model to improve off-trajectory rendering, and the refined geometry in turn offers stronger structural guidance for generation. Experiments demonstrate that FreeGen achieves state-of-the-art performance for free-viewpoint driving scene synthesis.

Title: Tokenizing Buildings: A Transformer for Layout Synthesis

Authors: Manuel Ladron de Guevara, Jinmo Rhee, Ardavan Bidgoli, Vaidas Razgaitis, Michael Bergin
Subjects: cs.CV, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04832
Pdf URL: https://arxiv.org/pdf/2512.04832
Copy Paste: [[2512.04832]] Tokenizing Buildings: A Transformer for Layout Synthesis(https://arxiv.org/abs/2512.04832)
Keywords: generative
Abstract: We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.

Title: A Novel Trust-Based DDoS Cyberattack Detection Model for Smart Business Environments

Authors: Oghenetejiri Okporokpo, Funminiyi Olajide, Nemitari Ajienka, Xiaoqi Ma
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.04855
Pdf URL: https://arxiv.org/pdf/2512.04855
Copy Paste: [[2512.04855]] A Novel Trust-Based DDoS Cyberattack Detection Model for Smart Business Environments(https://arxiv.org/abs/2512.04855)
Keywords: anomaly
Abstract: As the frequency and complexity of Distributed Denial-of-Service (DDoS) attacks continue to increase, the level of threat posed to Smart Internet of Things (SIoT) business environments has also increased. These environments generally have several interconnected SIoT systems and devices that are integral to daily operations, usually depending on cloud infrastructure and real-time data analytics, which require continuous availability and secure data exchange. Conventional detection mechanisms, while useful in static or traditional network environments, are often inadequate in responding to the needs of these dynamic and diverse SIoT networks. In this paper, we introduce a novel trust-based DDoS detection model tailored to meet the unique requirements of smart business environments. The proposed model incorporates a trust evaluation engine that continuously monitors node behaviour, calculating trust scores based on packet delivery ratio, response time, and anomaly detection. These trust metrics are then aggregated by a central trust-based repository that uses inherent trust values to identify traffic patterns indicative of DDoS attacks. By integrating both trust scores and central trust-based outputs, the trust calculation is enhanced, ensuring that threats are accurately identified and addressed in real-time. The model demonstrated a significant improvement in detection accuracy and a low false-positive rate, with enhanced scalability and adaptability under TCP SYN, Ping Flood, and UDP Flood attacks. The results show that a trust-based approach provides an effective, lightweight alternative for securing resource-constrained business IoT environments.
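
The abstract names three inputs to the per-node trust score (packet delivery ratio, response time, and an anomaly signal) but not how they are combined. The sketch below assumes a simple weighted sum; the weights, the response-time cap, the threshold, and both function names are hypothetical and are not taken from the paper.

```python
def trust_score(pdr, response_ms, anomaly,
                max_response_ms=1000.0, weights=(0.5, 0.3, 0.2)):
    """Combine three behaviour signals into a trust score in [0, 1].

    pdr         : packet delivery ratio in [0, 1]
    response_ms : observed response time in milliseconds
    anomaly     : anomaly score in [0, 1] (1 = highly anomalous)
    The weights and max_response_ms are assumed values for illustration.
    """
    # Map response time to [0, 1], where 1 means an instant response.
    timeliness = 1.0 - min(response_ms, max_response_ms) / max_response_ms
    w_pdr, w_time, w_anom = weights
    return w_pdr * pdr + w_time * timeliness + w_anom * (1.0 - anomaly)

def is_suspicious(score, threshold=0.5):
    """Flag a node whose trust falls below an assumed threshold."""
    return score < threshold

healthy = trust_score(pdr=0.98, response_ms=40.0, anomaly=0.05)
flooded = trust_score(pdr=0.20, response_ms=2000.0, anomaly=1.0)
```

A node under a flood attack typically drops packets and responds slowly, so all three terms fall together and the score separates cleanly from that of a healthy node.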

Title: Logic-Driven Cybersecurity: A Novel Framework for System Log Anomaly Detection using Answer Set Programming

Authors: Fang Li, Fei Zuo, Gopal Gupta
Subjects: cs.CR, cs.LO
Abstract URL: https://arxiv.org/abs/2512.04908
Pdf URL: https://arxiv.org/pdf/2512.04908
Copy Paste: [[2512.04908]] Logic-Driven Cybersecurity: A Novel Framework for System Log Anomaly Detection using Answer Set Programming(https://arxiv.org/abs/2512.04908)
Keywords: anomaly
Abstract: This study explores the application of Answer Set Programming (ASP) for detecting anomalies in system logs, addressing the challenges posed by evolving cyber threats. We propose a novel framework that leverages ASP's declarative nature and logical reasoning capabilities to encode complex security rules as logical predicates. Our ASP-based system was applied to a real-world Linux system log dataset, demonstrating its effectiveness in identifying various anomalies such as potential brute-force attacks, privilege escalations, frequent network connections from specific IPs, and various system-level issues. Key findings highlight ASP's strengths in handling structured log data, rule flexibility, and event correlation. The approach shows promise in providing explainable alerts from real-world data. This research contributes to computer forensics by demonstrating a logic-based paradigm for log analysis on a practical dataset, opening avenues for more nuanced and adaptive cyber intelligence systems.

Title: Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

Authors: Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, Chong Luo, Nanning Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04926
Pdf URL: https://arxiv.org/pdf/2512.04926
Copy Paste: [[2512.04926]] Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion(https://arxiv.org/abs/2512.04926)
Keywords: diffusion
Abstract: Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates the preceding semantics potentially benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise semantic and VAE-encoded texture synchronously, neglecting such ordering. Observing these, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, which is extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256x256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100x faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: this https URL.

Title: LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging

Authors: Zhijian Shu, Cheng Lin, Tao Xie, Wei Yin, Ben Li, Zhiyuan Pu, Weize Li, Yao Yao, Xun Cao, Xiaoyang Guo, Xiao-Xiao Long
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04939
Pdf URL: https://arxiv.org/pdf/2512.04939
Copy Paste: [[2512.04939]] LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging(https://arxiv.org/abs/2512.04939)
Keywords: foundation model
Abstract: 3D vision foundation models like Visual Geometry Grounded Transformer (VGGT) have advanced greatly in geometric perception. However, VGGT is time-consuming and memory-intensive for long sequences, limiting application to large-scale scenes beyond hundreds of images. To address this, we propose LiteVGGT, achieving up to 10x speedup and substantial memory reduction, enabling efficient processing of 1000-image scenes. We derive two key insights for 3D reconstruction: (1) tokens from local image regions have inherent geometric correlations, leading to high similarity and computational redundancy; (2) token similarity across adjacent network layers remains stable, allowing for reusable merge decisions. Guided by these, we design a simple yet efficient strategy, dubbed geometry-aware cached token merging. We analyze each token's geometric importance, optimizing anchor token selection to better preserve key information for reconstruction. We also cache and reuse merge indices across layers, substantially reducing latency with minimal accuracy impact. This strategy retains VGGT's core performance, enabling efficient fine-tuning and FP8 quantization for further gains. Extensive experiments validate LiteVGGT's effectiveness, scalability, and robustness. Project page: this https URL

Title: Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition

Authors: Novanto Yudistira
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.04943
Pdf URL: https://arxiv.org/pdf/2512.04943
Copy Paste: [[2512.04943]] Towards Adaptive Fusion of Multimodal Deep Networks for Human Action Recognition(https://arxiv.org/abs/2512.04943)
Keywords: self-supervised
Abstract: This study introduces a pioneering methodology for human action recognition by harnessing deep neural network techniques and adaptive fusion strategies across multiple modalities, including RGB, optical flows, audio, and depth information. Employing gating mechanisms for multimodal fusion, we aim to surpass limitations inherent in traditional unimodal recognition methods while exploring novel possibilities for diverse applications. Through an exhaustive investigation of gating mechanisms and adaptive weighting-based fusion architectures, our methodology enables the selective integration of relevant information from various modalities, thereby bolstering both accuracy and robustness in action recognition tasks. We meticulously examine various gated fusion strategies to pinpoint the most effective approach for multimodal action recognition, showcasing its superiority over conventional unimodal methods. Gating mechanisms facilitate the extraction of pivotal features, resulting in a more holistic representation of actions and substantial enhancements in recognition performance. Our evaluations across human action recognition, violence action detection, and multiple self-supervised learning tasks on benchmark datasets demonstrate promising advancements in accuracy. The significance of this research lies in its potential to revolutionize action recognition systems across diverse fields. The fusion of multimodal information promises sophisticated applications in surveillance and human-computer interaction, especially in contexts related to active assisted living.

Title: Rethinking the Use of Vision Transformers for AI-Generated Image Detection

Authors: NaHyeon Park, Kunhee Kim, Junsuk Choe, Hyunjung Shim
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2512.04969
Pdf URL: https://arxiv.org/pdf/2512.04969
Copy Paste: [[2512.04969]] Rethinking the Use of Vision Transformers for AI-Generated Image Detection(https://arxiv.org/abs/2512.04969)
Keywords: diffusion, generative
Abstract: Rich feature representations derived from CLIP-ViT have been widely utilized in AI-generated image detection. While most existing methods primarily leverage features from the final layer, we systematically analyze the contributions of layer-wise features to this task. Our study reveals that earlier layers provide more localized and generalizable features, often surpassing the performance of final-layer features in detection tasks. Moreover, we find that different layers capture distinct aspects of the data, each contributing uniquely to AI-generated image detection. Motivated by these findings, we introduce a novel adaptive method, termed MoLD, which dynamically integrates features from multiple ViT layers using a gating-based mechanism. Extensive experiments on both GAN- and diffusion-generated images demonstrate that MoLD significantly improves detection performance, enhances generalization across diverse generative models, and exhibits robustness in real-world scenarios. Finally, we illustrate the scalability and versatility of our approach by successfully applying it to other pre-trained ViTs, such as DINOv2.

Title: Efficient Generative Transformer Operators For Million-Point PDEs

Authors: Armand Kassaï Koupaï, Lise Le Boudec, Patrick Gallinari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.04974
Pdf URL: https://arxiv.org/pdf/2512.04974
Copy Paste: [[2512.04974]] Efficient Generative Transformer Operators For Million-Point PDEs(https://arxiv.org/abs/2512.04974)
Keywords: generative
Abstract: We introduce ECHO, a transformer-operator framework for generating million-point PDE trajectories. While existing neural operators (NOs) have shown promise for solving partial differential equations, they remain limited in practice due to poor scalability on dense grids, error accumulation during dynamic unrolling, and task-specific design. ECHO addresses these challenges through three key innovations. (i) It employs a hierarchical convolutional encode-decode architecture that achieves a 100x spatio-temporal compression while preserving fidelity on mesh points. (ii) It incorporates a training and adaptation strategy that enables high-resolution PDE solution generation from sparse input grids. (iii) It adopts a generative modeling paradigm that learns complete trajectory segments, mitigating long-horizon error drift. The training strategy decouples representation learning from downstream task supervision, allowing the model to tackle multiple tasks such as trajectory generation, forward and inverse problems, and interpolation. The generative model further supports both conditional and unconditional generation. We demonstrate state-of-the-art performance on million-point simulations across diverse PDE systems featuring complex geometries, high-frequency dynamics, and long-term horizons.

Title: Reflection Removal through Efficient Adaptation of Diffusion Transformers

Authors: Daniyar Zakarin, Thiemo Wandel, Anton Obukhov, Dengxin Dai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2512.05000
Pdf URL: https://arxiv.org/pdf/2512.05000
Copy Paste: [[2512.05000]] Reflection Removal through Efficient Adaptation of Diffusion Transformers(https://arxiv.org/abs/2512.05000)
Keywords: diffusion, foundation model
Abstract: We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal. Project page: this https URL

Title: Self-Supervised Learning for Transparent Object Depth Completion Using Depth from Non-Transparent Objects

Authors: Xianghui Fan, Zhaoyu Chen, Mengyang Pan, Anping Deng, Hang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05006
Pdf URL: https://arxiv.org/pdf/2512.05006
Copy Paste: [[2512.05006]] Self-Supervised Learning for Transparent Object Depth Completion Using Depth from Non-Transparent Objects(https://arxiv.org/abs/2512.05006)
Keywords: self-supervised
Abstract: The perception of transparent objects is a well-known challenge in computer vision. Conventional depth sensors have difficulty sensing the depth of transparent objects due to the refraction and reflection of light. Previous research has typically trained a neural network to complete the depth acquired by the sensor, and this method can quickly acquire accurate depth maps of transparent objects. However, previous training relies on a large amount of annotated data for supervision, and the labeling of depth maps is costly. To tackle this challenge, we propose a new self-supervised method for training depth completion networks. Our method simulates the depth deficits of transparent objects within non-transparent regions and utilizes the original depth map as ground truth for supervision. Experiments demonstrate that our method achieves performance comparable to supervised approaches, and pre-training with our method can improve model performance when training samples are scarce.
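
The self-supervision signal described here (remove depth in regions where it is known, then supervise with the original map) can be sketched as follows. This is a simplified illustration, not the paper's procedure: rectangular holes, the zero value for missing depth, the L1 loss, and the names `simulate_depth_holes` and `masked_l1` are assumptions.

```python
import numpy as np


def simulate_depth_holes(depth, rng, hole_size=4, n_holes=3):
    """Zero out random square patches of a valid depth map.

    Returns the corrupted map and a boolean mask of the removed pixels.
    """
    corrupted = depth.copy()
    mask = np.zeros(depth.shape, dtype=bool)
    h, w = depth.shape
    for _ in range(n_holes):
        r = rng.integers(0, h - hole_size + 1)  # top-left corner, hole stays in bounds
        c = rng.integers(0, w - hole_size + 1)
        corrupted[r:r + hole_size, c:c + hole_size] = 0.0
        mask[r:r + hole_size, c:c + hole_size] = True
    return corrupted, mask


def masked_l1(pred, target, mask):
    """Mean absolute error restricted to the simulated holes."""
    return float(np.abs(pred - target)[mask].mean())


# Toy usage: the original depth map acts as ground truth for the holes.
rng = np.random.default_rng(0)
depth = np.arange(64, dtype=float).reshape(8, 8) + 1.0  # strictly positive depths
corrupted, mask = simulate_depth_holes(depth, rng)
loss_before_completion = masked_l1(corrupted, depth, mask)
```

A completion network would take `corrupted` as input, and the loss would be evaluated on its prediction instead of on `corrupted` itself.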

Title: Generative Neural Video Compression via Video Diffusion Prior

Authors: Qi Mao, Hao Cheng, Tinghan Yang, Libiao Jin, Siwei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05016
Pdf URL: https://arxiv.org/pdf/2512.05016
Copy Paste: [[2512.05016]] Generative Neural Video Compression via Video Diffusion Prior(https://arxiv.org/abs/2512.05016)
Keywords: diffusion, foundation model, generative
Abstract: We present GNVC-VD, the first DiT-based generative neural video compression framework built upon an advanced video generation foundation model, where spatio-temporal latent compression and sequence-level generative refinement are unified within a single codec. Existing perceptual codecs primarily rely on pre-trained image generative priors to restore high-frequency details, but their frame-wise nature lacks temporal modeling and inevitably leads to perceptual flickering. To address this, GNVC-VD introduces a unified flow-matching latent refinement module that leverages a video diffusion transformer to jointly enhance intra- and inter-frame latents through sequence-level denoising, ensuring consistent spatio-temporal details. Instead of denoising from pure Gaussian noise as in video generation, GNVC-VD initializes refinement from decoded spatio-temporal latents and learns a correction term that adapts the diffusion prior to compression-induced degradation. A conditioning adaptor further injects compression-aware cues into intermediate DiT layers, enabling effective artifact removal while maintaining temporal coherence under extreme bitrate constraints. Extensive experiments show that GNVC-VD surpasses both traditional and learned codecs in perceptual quality and significantly reduces the flickering artifacts that persist in prior generative approaches, even below 0.01 bpp, highlighting the promise of integrating video-native generative priors into neural codecs for next-generation perceptual video compression.

Title: RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Authors: Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05025
Pdf URL: https://arxiv.org/pdf/2512.05025
Copy Paste: [[2512.05025]] RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation(https://arxiv.org/abs/2512.05025)
Keywords: foundation model
Abstract: Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or are based on sensor-specific encoders limiting generalization across heterogeneous EO modalities. To overcome these limitations we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder reconstructing masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, containing various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at this https URL.

Title: Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding

Authors: Abhigyan Bhattacharya, Hiranmoy Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05039
Pdf URL: https://arxiv.org/pdf/2512.05039
Copy Paste: [[2512.05039]] Semantic-Guided Two-Stage GAN for Face Inpainting with Hybrid Perceptual Encoding(https://arxiv.org/abs/2512.05039)
Keywords: generative
Abstract: Facial image inpainting aims to restore missing or corrupted regions in face images while preserving identity, structural consistency, and photorealistic image quality, a task particularly relevant to photo restoration. Despite many recent advances in deep generative models, existing methods struggle with large irregular masks, often producing blurry textures at the edges of the masked region, semantic inconsistencies, or unconvincing facial structures due to direct pixel-level synthesis and limited exploitation of facial priors. In this paper we propose a novel architecture that addresses these challenges through semantic-guided hierarchical synthesis. Our approach first organizes and synthesizes information based on meaning, then refines the texture. This process provides a clear view of the facial structure before detailed images are created. In the first stage, we blend two techniques: CNNs, which focus on local features, and Vision Transformers, which capture global features. This produces clear and detailed semantic layouts. In the second stage, we use a Multi-Modal Texture Generator to refine these layouts by pulling in information from different scales, ensuring the result is cohesive and consistent. The architecture naturally handles arbitrary mask configurations through dynamic attention without mask-specific training. Experiments on two datasets, CelebA-HQ and FFHQ, show that our model outperforms other state-of-the-art methods, with improvements in metrics such as LPIPS, PSNR, and SSIM. It produces visually striking results with better semantic preservation in challenging large-area inpainting situations.

Title: Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image

Authors: Yanran Zhang, Ziyi Wang, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05044
Pdf URL: https://arxiv.org/pdf/2512.05044
Copy Paste: [[2512.05044]] Joint 3D Geometry Reconstruction and Motion Generation for 4D Synthesis from a Single Image(https://arxiv.org/abs/2512.05044)
Keywords: diffusion
Abstract: Generating interactive and dynamic 4D scenes from a single static image remains a core challenge. Most existing generate-then-reconstruct and reconstruct-then-generate methods decouple geometry from motion, causing spatiotemporal inconsistencies and poor generalization. To address these, we extend the reconstruct-then-generate framework to jointly perform Motion generation and geometric Reconstruction for 4D Synthesis (MoRe4D). We first introduce TrajScene-60K, a large-scale dataset of 60,000 video samples with dense point trajectories, addressing the scarcity of high-quality 4D scene data. Based on this, we propose a diffusion-based 4D Scene Trajectory Generator (4D-STraG) to jointly generate geometrically consistent and motion-plausible 4D point trajectories. To leverage single-view priors, we design a depth-guided motion normalization strategy and a motion-aware module for effective geometry and dynamics integration. We then propose a 4D View Synthesis Module (4D-ViSM) to render videos with arbitrary camera trajectories from 4D point track representations. Experiments show that MoRe4D generates high-quality 4D scenes with multi-view consistency and rich dynamic details from a single image. Code: this https URL.

Title: Personalizing Agent Privacy Decisions via Logical Entailment

Authors: James Flemings, Ren Yi, Octavian Suciu, Kassem Fawaz, Murali Annavaram, Marco Gruteser
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2512.05065
Pdf URL: https://arxiv.org/pdf/2512.05065
Copy Paste: [[2512.05065]] Personalizing Agent Privacy Decisions via Logical Entailment(https://arxiv.org/abs/2512.05065)
Keywords: in-context
Abstract: Personal language model-based agents are becoming more widespread for completing tasks on behalf of users; however, this raises serious privacy questions regarding whether these models will appropriately disclose user data. While prior work has evaluated language models on data-sharing scenarios based on general privacy norms, we focus on personalizing language models' privacy decisions, grounding their judgments directly in prior user privacy decisions. Our findings suggest that general privacy norms are insufficient for effective personalization of privacy decisions. Furthermore, we find that eliciting privacy judgments from the model through In-context Learning (ICL) is unreliable due to misalignment with the user's prior privacy judgments and opaque reasoning traces, which make it difficult for the user to interpret the reasoning behind the model's decisions. To address these limitations, we propose ARIEL (Agentic Reasoning with Individualized Entailment Logic), a framework that jointly leverages a language model and rule-based logic for structured data-sharing reasoning. ARIEL is based on formulating personalization of data sharing as an entailment problem: whether a prior user judgment on a data-sharing request implies the same judgment for an incoming request. Our experimental evaluations on advanced models and publicly-available datasets demonstrate that ARIEL can reduce the F1 score error by 39.1% over language model-based reasoning (ICL), demonstrating that ARIEL is effective at correctly judging requests where the user would approve data sharing. Overall, our findings suggest that combining LLMs with strict logical entailment is a highly effective strategy for enabling personalized privacy judgments for agents.

Title: Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection

Authors: Mohammad Arif Rasyidi, Omar Alhussein, Sami Muhaidat, Ernesto Damiani
Subjects: cs.LG, cs.CR, quant-ph
Abstract URL: https://arxiv.org/abs/2512.05069
Pdf URL: https://arxiv.org/pdf/2512.05069
Copy Paste: [[2512.05069]] Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection(https://arxiv.org/abs/2512.05069)
Keywords: anomaly
Abstract: Unsupervised anomaly-based intrusion detection requires models that can generalize to attack patterns not observed during training. This work presents the first large-scale evaluation of hybrid quantum-classical (HQC) autoencoders for this task. We construct a unified experimental framework that iterates over key quantum design choices, including quantum-layer placement, measurement approach, variational and non-variational formulations, and latent-space regularization. Experiments across three benchmark NIDS datasets show that HQC autoencoders can match or exceed classical performance in their best configurations, although they exhibit higher sensitivity to architectural decisions. Under zero-day evaluation, well-configured HQC models provide stronger and more stable generalization than classical and supervised baselines. Simulated gate-noise experiments reveal early performance degradation, indicating the need for noise-aware HQC designs. These results provide the first data-driven characterization of HQC autoencoder behavior for network intrusion detection and outline key factors that govern their practical viability. All experiment code and configurations are available at this https URL.

Title: David vs. Goliath: Can Small Models Win Big with Agentic AI in Hardware Design?

Authors: Shashwat Shankar, Subhranshu Pandey, Innocent Dengkhw Mochahari, Bhabesh Mali, Animesh Basak Chowdhury, Sukanta Bhattacharjee, Chandan Karfa
Subjects: cs.LG, cs.AI, cs.AR, cs.SE
Abstract URL: https://arxiv.org/abs/2512.05073
Pdf URL: https://arxiv.org/pdf/2512.05073
Copy Paste: [[2512.05073]] David vs. Goliath: Can Small Models Win Big with Agentic AI in Hardware Design?(https://arxiv.org/abs/2512.05073)
Keywords: foundation model
Abstract: Large Language Model (LLM) inference demands massive compute and energy, making domain-specific tasks expensive and unsustainable. As foundation models keep scaling, we ask: Is bigger always better for hardware design? Our work tests this by evaluating Small Language Models coupled with a curated agentic AI framework on NVIDIA's Comprehensive Verilog Design Problems (CVDP) benchmark. Results show that agentic workflows, through task decomposition, iterative feedback, and correction, not only unlock near-LLM performance at a fraction of the cost but also create learning opportunities for agents, paving the way for efficient, adaptive solutions in complex design tasks.

Title: BulletTime: Decoupled Control of Time and Camera Pose for Video Generation

Authors: Yiming Wang, Qihang Zhang, Shengqu Cai, Tong Wu, Jan Ackermann, Zhengfei Kuang, Yang Zheng, Frano Rajič, Siyu Tang, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05076
Pdf URL: https://arxiv.org/pdf/2512.05076
Copy Paste: [[2512.05076]] BulletTime: Decoupled Control of Time and Camera Pose for Video Generation(https://arxiv.org/abs/2512.05076)
Keywords: diffusion
Abstract: Emerging video diffusion models achieve high visual fidelity but fundamentally couple scene dynamics with camera motion, limiting their ability to provide precise spatial and temporal control. We introduce a 4D-controllable video diffusion framework that explicitly decouples scene dynamics from camera pose, enabling fine-grained manipulation of both scene dynamics and camera viewpoint. Our framework takes continuous world-time sequences and camera trajectories as conditioning inputs, injecting them into the video diffusion model through a 4D positional encoding in the attention layer and adaptive normalizations for feature modulation. To train this model, we curate a unique dataset in which temporal and camera variations are independently parameterized; this dataset will be made public. Experiments show that our model achieves robust real-world 4D control across diverse timing patterns and camera trajectories, while preserving high generation quality and outperforming prior work in controllability. See our website for video results: this https URL

Title: Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints

Authors: Minghan Zhu, Zhiyi Wang, Qihang Sun, Maani Ghaffari, Michael Posa
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.05079
Pdf URL: https://arxiv.org/pdf/2512.05079
Copy Paste: [[2512.05079]] Object Reconstruction under Occlusion with Generative Priors and Contact-induced Constraints(https://arxiv.org/abs/2512.05079)
Keywords: generative
Abstract: Object geometry is key information for robot manipulation. Yet, object reconstruction is a challenging task because cameras only capture partial observations of objects, especially when occlusion occurs. In this paper, we leverage two extra sources of information to reduce the ambiguity of vision signals. First, generative models learn priors of the shapes of commonly seen objects, allowing us to make reasonable guesses of the unseen part of geometry. Second, contact information, which can be obtained from videos and physical interactions, provides sparse constraints on the boundary of the geometry. We combine the two sources of information through contact-guided 3D generation. The guidance formulation is inspired by drag-based editing in generative models. Experiments on synthetic and real-world data show that our approach improves the reconstruction compared to pure 3D generation and contact-based optimization.

Title: OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design

Authors: Ian Dunn, Liv Toft, Tyler Katz, Juhi Gupta, Riya Shah, Ramith Hettiarachchi, David R. Koes
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2512.05080
Pdf URL: https://arxiv.org/pdf/2512.05080
Copy Paste: [[2512.05080]] OMTRA: A Multi-Task Generative Model for Structure-Based Drug Design(https://arxiv.org/abs/2512.05080)
Keywords: generative
Abstract: Structure-based drug design (SBDD) focuses on designing small-molecule ligands that bind to specific protein pockets. Computational methods are integral in modern SBDD workflows and often make use of virtual screening methods via docking or pharmacophore search. Modern generative modeling approaches have focused on improving novel ligand discovery by enabling de novo design. In this work, we recognize that these tasks share a common structure and can therefore be represented as different instantiations of a consistent generative modeling framework. We propose a unified approach in OMTRA, a multi-modal flow matching model that flexibly performs many tasks relevant to SBDD, including some with no analogue in conventional workflows. Additionally, we curate a dataset of 500M 3D molecular conformers, complementing protein-ligand data and expanding the chemical diversity available for training. OMTRA obtains state of the art performance on pocket-conditioned de novo design and docking; however, the effects of large-scale pretraining and multi-task training are modest. All code, trained models, and dataset for reproducing this work are available at this https URL

Title: Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

Authors: Jung Yi, Wooseok Jang, Paul Hyunbin Cho, Jisu Nam, Heeji Yoon, Seungryong Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05081
Pdf URL: https://arxiv.org/pdf/2512.05081
Copy Paste: [[2512.05081]] Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression(https://arxiv.org/abs/2512.05081)
Keywords: diffusion
Abstract: Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that require no fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12x extrapolation (e.g. 5s-trained to 60s+ generation) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressively streaming long-video generation.
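
The window split behind Deep Sink (half of the attention window reserved for persistent sink tokens, the rest for recent tokens) can be sketched as an index-selection rule. This is only an illustration: treating the earliest tokens as the sinks follows the StreamingLLM convention and is an assumption here, the function name `deep_sink_indices` is hypothetical, and the RoPE phase re-alignment and Participative Compression steps are not modeled.

```python
def deep_sink_indices(num_tokens, window):
    """Return the token positions kept in the attention window.

    Half of the window is reserved for the earliest (sink) tokens and the
    remainder for the most recent tokens.
    """
    if num_tokens <= window:
        return list(range(num_tokens))  # everything still fits
    n_sink = window // 2
    n_recent = window - n_sink
    sink = list(range(n_sink))
    recent = list(range(num_tokens - n_recent, num_tokens))
    return sink + recent


# Toy usage: a window of 4 over a history of 10 tokens.
kept = deep_sink_indices(10, 4)
```

For comparison, a plain sliding window of the same size would keep only positions 6 through 9 and drop the earliest context entirely.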

Title: The Geometry of Intelligence: Deterministic Functional Topology as a Foundation for Real-World Perception

Authors: Eduardo Di Santi
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2512.05089
Pdf URL: https://arxiv.org/pdf/2512.05089
Copy Paste: [[2512.05089]] The Geometry of Intelligence: Deterministic Functional Topology as a Foundation for Real-World Perception(https://arxiv.org/abs/2512.05089)
Keywords: self-supervised
Abstract: Real-world physical processes do not generate arbitrary variability: their signals concentrate on compact and low-variability subsets of functional space. This geometric structure enables rapid generalization from a few examples in both biological and artificial systems. This work develops a deterministic functional-topological framework in which the set of valid realizations of a physical phenomenon forms a compact perceptual manifold with stable invariants and a finite Hausdorff radius. We show that the boundaries of this manifold can be discovered in a fully self-supervised manner through Monte Carlo sampling, even when the governing equations of the system are unknown. We provide theoretical guarantees, practical estimators of knowledge boundaries, and empirical validations across three domains: electromechanical railway point machines, electrochemical battery discharge curves, and physiological ECG signals. Our results demonstrate that deterministic functional topology offers a unified mathematical foundation for perception, representation, and world-model construction, explaining why biological learners and self-supervised AI models can generalize from limited observations.

Title: TV2TV: A Unified Framework for Interleaved Language and Video Generation

Authors: Xiaochuang Han, Youssef Emad, Melissa Hall, John Nguyen, Karthik Padthe, Liam Robbins, Amir Bar, Delong Chen, Michal Drozdzal, Maha Elbayad, Yushi Hu, Shang-Wen Li, Sreya Dutta Roy, Jakob Verbeek, XuDong Wang, Marjan Ghazvininejad, Luke Zettlemoyer, Emily Dinan
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2512.05103
Pdf URL: https://arxiv.org/pdf/2512.05103
Copy Paste: [[2512.05103]] TV2TV: A Unified Framework for Interleaved Language and Video Generation(https://arxiv.org/abs/2512.05103)
Keywords: generative
Abstract: Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before "acting in pixels" to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.

Title: NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation

Authors: Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister
Subjects: cs.CV, cs.GR, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2512.05106
Pdf URL: https://arxiv.org/pdf/2512.05106
Copy Paste: [[2512.05106]] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation(https://arxiv.org/abs/2512.05106)
Keywords: diffusion
Abstract: Standard diffusion corrupts data using Gaussian noise whose Fourier coefficients have random magnitudes and random phases. While effective for unconditional or text-to-image generation, corrupting phase components destroys spatial structure, making it ill-suited for tasks requiring geometric consistency, such as re-rendering, simulation enhancement, and image-to-image translation. We introduce Phase-Preserving Diffusion (φ-PD), a model-agnostic reformulation of the diffusion process that preserves input phase while randomizing magnitude, enabling structure-aligned generation without architectural changes or additional parameters. We further propose Frequency-Selective Structured (FSS) noise, which provides continuous control over structural rigidity via a single frequency-cutoff parameter. φ-PD adds no inference-time cost and is compatible with any diffusion model for images or videos. Across photorealistic and stylized re-rendering, as well as sim-to-real enhancement for driving planners, φ-PD produces controllable, spatially aligned results. When applied to the CARLA simulator, φ-PD improves CARLA-to-Waymo planner performance by 50%. The method is complementary to existing conditioning approaches and broadly applicable to image-to-image and video-to-video generation. Videos, additional examples, and code are available on our project page (this https URL).
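
The core operation named in this abstract, keeping the Fourier phase of the input while randomizing the magnitude, can be sketched in a few lines of NumPy. This is a rough illustration only: taking the replacement magnitudes from the spectrum of a Gaussian noise image is an assumption, the function name `phase_preserving_corrupt` is hypothetical, and neither the diffusion schedule nor the FSS noise is modeled.

```python
import numpy as np


def phase_preserving_corrupt(image, rng):
    """Replace the Fourier magnitude of `image` with that of Gaussian noise.

    The Fourier phase of the input is kept, so spatial structure survives.
    """
    spectrum = np.fft.fft2(image)
    phase = np.angle(spectrum)
    noise = rng.standard_normal(image.shape)
    noise_magnitude = np.abs(np.fft.fft2(noise))
    mixed = noise_magnitude * np.exp(1j * phase)
    # For a real input the mixed spectrum stays conjugate-symmetric, so the
    # imaginary part of the inverse transform is only rounding error.
    return np.fft.ifft2(mixed).real


# Toy usage on a random 8x8 "image".
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
corrupted = phase_preserving_corrupt(image, rng)
```

Standard Gaussian corruption would randomize both magnitude and phase; here only the magnitude changes, which is why the output remains spatially aligned with the input.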

Title: ShadowDraw: From Any Object to Shadow-Drawing Compositional Art

Authors: Rundong Luo, Noah Snavely, Wei-Chiu Ma
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2512.05110
Pdf URL: https://arxiv.org/pdf/2512.05110
Copy Paste: [[2512.05110]] ShadowDraw: From Any Object to Shadow-Drawing Compositional Art(https://arxiv.org/abs/2512.05110)
Keywords: generative
Abstract: We introduce ShadowDraw, a framework that transforms ordinary 3D objects into shadow-drawing compositional art. Given a 3D object, our system predicts scene parameters, including object pose and lighting, together with a partial line drawing, such that the cast shadow completes the drawing into a recognizable image. To this end, we optimize scene configurations to reveal meaningful shadows, employ shadow strokes to guide line drawing generation, and adopt automatic evaluation to enforce shadow-drawing coherence and visual quality. Experiments show that ShadowDraw produces compelling results across diverse inputs, from real-world scans and curated datasets to generative assets, and naturally extends to multi-object scenes, animations, and physical deployments. Our work provides a practical pipeline for creating shadow-drawing art and broadens the design space of computational visual art, bridging the gap between algorithmic design and artistic storytelling. Check out our project page this https URL for more results and an end-to-end real-world demonstration of our pipeline!

Title: ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Authors: Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05111
Pdf URL: https://arxiv.org/pdf/2512.05111
Copy Paste: [[2512.05111]] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning(https://arxiv.org/abs/2512.05111)
Keywords: generative
Abstract: Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.

Title: Light-X: Generative 4D Video Rendering with Camera and Illumination Control

Authors: Tianqi Liu, Zhaoxi Chen, Zihao Huang, Shaocong Xu, Saining Zhang, Chongjie Ye, Bohan Li, Zhiguo Cao, Wei Li, Hao Zhao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2512.05115
Pdf URL: https://arxiv.org/pdf/2512.05115
Copy Paste: [[2512.05115]] Light-X: Generative 4D Video Rendering with Camera and Illumination Control(https://arxiv.org/abs/2512.05115)
Keywords: generative
Abstract: Recent advances in illumination control extend image-based methods to video, yet they still face a trade-off between lighting fidelity and temporal consistency. Moving beyond relighting, a key step toward generative modeling of real-world scenes is the joint control of camera trajectory and illumination, since visual dynamics are inherently shaped by both geometry and lighting. To this end, we present Light-X, a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. 1) We propose a disentangled design that decouples geometry and lighting signals: geometry and motion are captured via dynamic point clouds projected along user-defined camera trajectories, while illumination cues are provided by a relit frame consistently projected into the same geometry. These explicit, fine-grained cues enable effective disentanglement and guide high-quality illumination. 2) To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping that synthesizes training pairs from in-the-wild monocular footage. This strategy yields a dataset covering static, dynamic, and AI-generated scenes, ensuring robust training. Extensive experiments show that Light-X outperforms baseline methods in joint camera-illumination control and surpasses prior video relighting methods under both text- and background-conditioned settings.

Title: Value Gradient Guidance for Flow Matching Alignment

Authors: Zhen Liu, Tim Z. Xiao, Carles Domingo-Enrich, Weiyang Liu, Dinghuai Zhang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2512.05116
Pdf URL: https://arxiv.org/pdf/2512.05116
Copy Paste: [[2512.05116]] Value Gradient Guidance for Flow Matching Alignment(https://arxiv.org/abs/2512.05116)
Keywords: diffusion, generative
Abstract: While methods exist for aligning flow matching models--a popular and effective class of generative models--with human preferences, existing approaches fail to achieve both adaptation efficiency and probabilistically sound prior preservation. In this work, we leverage the theory of optimal control and propose VGG-Flow, a gradient-matching-based method for finetuning pretrained flow matching models. The key idea behind this algorithm is that the optimal difference between the finetuned velocity field and the pretrained one should be matched with the gradient field of a value function. This method not only incorporates first-order information from the reward model but also benefits from heuristic initialization of the value function to enable fast adaptation. Empirically, we show on a popular text-to-image flow matching model, Stable Diffusion 3, that our method can finetune flow matching models under limited computational budgets while achieving effective and prior-preserving alignment.