2025-10-27

Title: Video-As-Prompt: Unified Semantic Control for Video Generation

Authors: Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, Qiang Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20888
Pdf URL: https://arxiv.org/pdf/2510.20888
Copy Paste: [[2510.20888]] Video-As-Prompt: Unified Semantic Control for Video Generation(https://arxiv.org/abs/2510.20888)
Keywords: diffusion, in-context
Abstract: Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.

Title: Code-enabled language models can outperform reasoning models on diverse tasks

Authors: Cedegao E. Zhang, Cédric Colas, Gabriel Poesia, Joshua B. Tenenbaum, Jacob Andreas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.20909
Pdf URL: https://arxiv.org/pdf/2510.20909
Copy Paste: [[2510.20909]] Code-enabled language models can outperform reasoning models on diverse tasks(https://arxiv.org/abs/2510.20909)
Keywords: in-context
Abstract: Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.

Title: Generative Point Tracking with Flow Matching

Authors: Mattie Tesfaldet, Adam W. Harley, Konstantinos G. Derpanis, Derek Nowrouzezahrai, Christopher Pal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.20951
Pdf URL: https://arxiv.org/pdf/2510.20951
Copy Paste: [[2510.20951]] Generative Point Tracking with Flow Matching(https://arxiv.org/abs/2510.20951)
Keywords: generative
Abstract: Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel in regressing long-term point trajectory estimates -- even through occlusions -- they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how our model's generative capabilities can be leveraged to improve point trajectory estimates by utilizing a best-first search strategy on generated samples during inference, guided by the model's own confidence of its predictions. Empirically, we evaluate GenPT against the current state of the art on the standard PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a TAP-Vid variant with additional occlusions to assess occluded point tracking performance and highlight our model's ability to capture multi-modality. GenPT is capable of capturing the multi-modality in point trajectories, which translates to state-of-the-art tracking accuracy on occluded points, while maintaining competitive tracking accuracy on visible points compared to extant discriminative point trackers.

Title: VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models

Authors: Jesimon Barreto, Carlos Caetano, André Araujo, William Robson Schwartz
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.20994
Pdf URL: https://arxiv.org/pdf/2510.20994
Copy Paste: [[2510.20994]] VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models(https://arxiv.org/abs/2510.20994)
Keywords: self-supervised, foundation model, generative
Abstract: Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA's training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at this https URL.

Title: BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies

Authors: Jiaqi Hu, Hongli Xu, Junwen Huang, Peter KT Yu, Slobodan Ilic, Benjamin Busam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21000
Pdf URL: https://arxiv.org/pdf/2510.21000
Copy Paste: [[2510.21000]] BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies(https://arxiv.org/abs/2510.21000)
Keywords: foundation model
Abstract: Accurate 6D pose estimation is essential for robotic manipulation in industrial environments. Existing pipelines typically rely on off-the-shelf object detectors followed by cropping and pose refinement, but their performance degrades under challenging conditions such as clutter, poor lighting, and complex backgrounds, making detection the critical bottleneck. In this work, we introduce a standardized and plug-in pipeline for 2D detection of unseen objects in industrial settings. Based on current SOTA baselines, our approach reduces domain shift and background artifacts through low-light image enhancement and background removal guided by open-vocabulary detection with foundation models. This design suppresses the false positives prevalent in raw SAM outputs, yielding more reliable detections for downstream pose estimation. Extensive experiments on real-world industrial bin-picking benchmarks from BOP demonstrate that our method significantly boosts detection accuracy while incurring negligible inference overhead, showing the effectiveness and practicality of the proposed method.

Title: Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation

Authors: Enshu Liu, Qian Chen, Xuefei Ning, Shengen Yan, Guohao Dai, Zinan Lin, Yu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21003
Pdf URL: https://arxiv.org/pdf/2510.21003
Copy Paste: [[2510.21003]] Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation(https://arxiv.org/abs/2510.21003)
Keywords: generative
Abstract: Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advances the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not without rely on a pre-defined mapping. We view the original AR model as a teacher model which provides the ground truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel \emph{conditional score distillation loss} to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with an minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline DD1, DD2 reduces the gap between the one-step sampling and original AR model by 67%, with up to 12.3$\times$ training speed-up simultaneously. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at this https URL.

Title: Can Current Detectors Catch Face-to-Voice Deepfake Attacks?

Authors: Nguyen Linh Bao Nguyen, Alsharif Abuadbba, Kristen Moore, Tingming Wu
Subjects: cs.CR, cs.LG, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2510.21004
Pdf URL: https://arxiv.org/pdf/2510.21004
Copy Paste: [[2510.21004]] Can Current Detectors Catch Face-to-Voice Deepfake Attacks?(https://arxiv.org/abs/2510.21004)
Keywords: generative
Abstract: The rapid advancement of generative models has enabled the creation of increasingly stealthy synthetic voices, commonly referred to as audio deepfakes. A recent technique, FOICE [USENIX'24], demonstrates a particularly alarming capability: generating a victim's voice from a single facial image, without requiring any voice sample. By exploiting correlations between facial and vocal features, FOICE produces synthetic voices realistic enough to bypass industry-standard authentication systems, including WeChat Voiceprint and Microsoft Azure. This raises serious security concerns, as facial images are far easier for adversaries to obtain than voice samples, dramatically lowering the barrier to large-scale attacks. In this work, we investigate two core research questions: (RQ1) can state-of-the-art audio deepfake detectors reliably detect FOICE-generated speech under clean and noisy conditions, and (RQ2) whether fine-tuning these detectors on FOICE data improves detection without overfitting, thereby preserving robustness to unseen voice generators such as SpeechT5. Our study makes three contributions. First, we present the first systematic evaluation of FOICE detection, showing that leading detectors consistently fail under both standard and noisy conditions. Second, we introduce targeted fine-tuning strategies that capture FOICE-specific artifacts, yielding significant accuracy improvements. Third, we assess generalization after fine-tuning, revealing trade-offs between specialization to FOICE and robustness to unseen synthesis pipelines. These findings expose fundamental weaknesses in today's defenses and motivate new architectures and training protocols for next-generation audio deepfake detection.

Title: From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD

Authors: Konstantinos Christopher Tsiolis, Alireza Mousavi-Hosseini, Murat A. Erdogdu
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.21020
Pdf URL: https://arxiv.org/pdf/2510.21020
Copy Paste: [[2510.21020]] From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD(https://arxiv.org/abs/2510.21020)
Keywords: generative
Abstract: To understand feature learning dynamics in neural networks, recent theoretical works have focused on gradient-based learning of Gaussian single-index models, where the label is a nonlinear function of a latent one-dimensional projection of the input. While the sample complexity of online SGD is determined by the information exponent of the link function, recent works improved this by performing multiple gradient steps on the same sample with different learning rates -- yielding a non-correlational update rule -- and instead are limited by the (potentially much smaller) generative exponent. However, this picture is only valid when these learning rates are sufficiently large. In this paper, we characterize the relationship between learning rate(s) and sample complexity for a broad class of gradient-based algorithms that encapsulates both correlational and non-correlational updates. We demonstrate that, in certain cases, there is a phase transition from an "information exponent regime" with small learning rate to a "generative exponent regime" with large learning rate. Our framework covers prior analyses of one-pass SGD and SGD with batch reuse, while also introducing a new layer-wise training algorithm that leverages a two-timescales approach (via different learning rates for each layer) to go beyond correlational queries without reusing samples or modifying the loss from squared error. Our theoretical study demonstrates that the choice of learning rate is as important as the design of the algorithm in achieving statistical and computational efficiency.

Title: Physically consistent and uncertainty-aware learning of spatiotemporal dynamics

Authors: Qingsong Xu, Jonathan L Bamber, Nils Thuerey, Niklas Boers, Paul Bates, Gustau Camps-Valls, Yilei Shi, Xiao Xiang Zhu
Subjects: cs.LG, cs.AI, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2510.21023
Pdf URL: https://arxiv.org/pdf/2510.21023
Copy Paste: [[2510.21023]] Physically consistent and uncertainty-aware learning of spatiotemporal dynamics(https://arxiv.org/abs/2510.21023)
Keywords: diffusion
Abstract: Accurate long-term forecasting of spatiotemporal dynamics remains a fundamental challenge across scientific and engineering domains. Existing machine learning methods often neglect governing physical laws and fail to quantify inherent uncertainties in spatiotemporal predictions. To address these challenges, we introduce a physics-consistent neural operator (PCNO) that enforces physical constraints by projecting surrogate model outputs onto function spaces satisfying predefined laws. A physics-consistent projection layer within PCNO efficiently computes mass and momentum conservation in Fourier space. Building upon deterministic predictions, we further propose a diffusion model-enhanced PCNO (DiffPCNO), which leverages a consistency model to quantify and mitigate uncertainties, thereby improving the accuracy and reliability of forecasts. PCNO and DiffPCNO achieve high-fidelity spatiotemporal predictions while preserving physical consistency and uncertainty across diverse systems and spatial resolutions, ranging from turbulent flow modeling to real-world flood/atmospheric forecasting. Our two-stage framework provides a robust and versatile approach for accurate, physically grounded, and uncertainty-aware spatiotemporal forecasting.

Title: Amortized Active Generation of Pareto Sets

Authors: Daniel M. Steinberg, Asiri Wijesinghe, Rafael Oliveira, Piotr Koniusz, Cheng Soon Ong, Edwin V. Bonilla
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.21052
Pdf URL: https://arxiv.org/pdf/2510.21052
Copy Paste: [[2510.21052]] Amortized Active Generation of Pareto Sets(https://arxiv.org/abs/2510.21052)
Keywords: generative
Abstract: We introduce active generation of Pareto sets (A-GPS), a new framework for online discrete black-box multi-objective optimization (MOO). A-GPS learns a generative model of the Pareto set that supports a-posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non-dominance relations and to condition the generative model toward high-performing regions of the search space. We also show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade-offs, A-GPS introduces preference direction vectors that encode user-specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.

Title: Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization

Authors: Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, Yanfu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21059
Pdf URL: https://arxiv.org/pdf/2510.21059
Copy Paste: [[2510.21059]] Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization(https://arxiv.org/abs/2510.21059)
Keywords: in-context
Abstract: Large language models (LLMs) excel at factual recall yet still propagate stale or incorrect knowledge. In-context knowledge editing offers a gradient-free remedy suitable for black-box APIs, but current editors rely on static demonstration sets chosen by surface-level similarity, leading to two persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of adaptivity to task difficulty. We address these issues by dynamically selecting supporting demonstrations according to their utility for the edit. We propose Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight framework that (1) trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and (2) employs a learnable threshold to prune low-value examples, shortening the prompt when the edit is easy and expanding it when the task is hard. DR-IKE performs editing without modifying model weights, relying solely on forward passes for compatibility with black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to 17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries, demonstrating scalable and adaptive knowledge editing. The code is available at this https URL .

Title: Scalable Machine Learning Analysis of Parker Solar Probe Solar Wind Data

Authors: Daniela Martin, Connor O'Brien, Valmir P Moraes Filho, Jinsu Hong, Jasmine R. Kobayashi, Evangelia Samara, Joseph Gallego
Subjects: cs.LG, astro-ph.SR, physics.space-ph
Abstract URL: https://arxiv.org/abs/2510.21066
Pdf URL: https://arxiv.org/pdf/2510.21066
Copy Paste: [[2510.21066]] Scalable Machine Learning Analysis of Parker Solar Probe Solar Wind Data(https://arxiv.org/abs/2510.21066)
Keywords: anomaly
Abstract: We present a scalable machine learning framework for analyzing Parker Solar Probe (PSP) solar wind data using distributed processing and the quantum-inspired Kernel Density Matrices (KDM) method. The PSP dataset (2018--2024) exceeds 150 GB, challenging conventional analysis approaches. Our framework leverages Dask for large-scale statistical computations and KDM to estimate univariate and bivariate distributions of key solar wind parameters, including solar wind speed, proton density, and proton thermal speed, as well as anomaly thresholds for each parameter. We reveal characteristic trends in the inner heliosphere, including increasing solar wind speed with distance from the Sun, decreasing proton density, and the inverse relationship between speed and density. Solar wind structures play a critical role in enhancing and mediating extreme space weather phenomena and can trigger geomagnetic storms; our analyses provide quantitative insights into these processes. This approach offers a tractable, interpretable, and distributed methodology for exploring complex physical datasets and facilitates reproducible analysis of large-scale in situ measurements. Processed data products and analysis tools are made publicly available to advance future studies of solar wind dynamics and space weather forecasting. The code and configuration files used in this study are publicly available to support reproducibility.

Title: ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models

Authors: Pranav Saxena, Jimmy Chiun
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2510.21069
Pdf URL: https://arxiv.org/pdf/2510.21069
Copy Paste: [[2510.21069]] ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models(https://arxiv.org/abs/2510.21069)
Keywords: foundation model
Abstract: Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose, ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner while also enabling incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach leverages VLM reasoning to generate a rich 2D scene graph, which is grounded in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context, while edges capture spatial and semantic relations with inter-object distances. Our experiments on scenes from the Replica and HM3D dataset show that ZING-3D is effective at capturing spatial and relational knowledge without the need of task-specific training.

Title: Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts

Authors: Yanguang Sun, Jiawei Lian, Jian Yang, Lei Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21114
Pdf URL: https://arxiv.org/pdf/2510.21114
Copy Paste: [[2510.21114]] Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts(https://arxiv.org/abs/2510.21114)
Keywords: foundation model
Abstract: Large-scale foundation models provide powerful feature representations for downstream object segmentation tasks. However, when adapted to specific tasks through the full-parameter fine-tuning, the enormous parameters being updated often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output expert priors required for the subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to interact and restructure between frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our \href{this https URL} {Controllable-LPMoE} approach, demonstrating excellent segmentation performance compared to 31 state-of-the-art (SOTA) methods and adaptability to multiple binary object segmentation tasks.

Title: Digital Contrast CT Pulmonary Angiography Synthesis from Non-contrast CT for Pulmonary Vascular Disease

Authors: Ying Ming (1), Yue Lin (3), Longfei Zhao (2), Gengwan Li (2), Zuopeng Tan (2), Bing Li (2), Sheng Xie (3), Wei Song (1), Qiqi Xu (2) ((1) Department of Radiology Peking Union Medical College Hospital Chinese Academy of Medical Sciences and Peking Union Medical College, (2) Research and Development Center Canon Medical Systems China, (3) Department of Radiology, China-Japan Friendship Hospital, Beijing, China)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21140
Pdf URL: https://arxiv.org/pdf/2510.21140
Copy Paste: [[2510.21140]] Digital Contrast CT Pulmonary Angiography Synthesis from Non-contrast CT for Pulmonary Vascular Disease(https://arxiv.org/abs/2510.21140)
Keywords: generative
Abstract: Computed Tomography Pulmonary Angiography (CTPA) is the reference standard for diagnosing pulmonary vascular diseases such as Pulmonary Embolism (PE) and Chronic Thromboembolic Pulmonary Hypertension (CTEPH). However, its reliance on iodinated contrast agents poses risks including nephrotoxicity and allergic reactions, particularly in high-risk patients. This study proposes a method to generate Digital Contrast CTPA (DCCTPA) from Non-Contrast CT (NCCT) scans using a cascaded synthesizer based on Cycle-Consistent Generative Adversarial Networks (CycleGAN). Totally retrospective 410 paired CTPA and NCCT scans were obtained from three centers. The model was trained and validated internally on 249 paired images. Extra dataset that comprising 161 paired images was as test set for model generalization evaluation and downstream clinical tasks validation. Compared with state-of-the-art (SOTA) methods, the proposed method achieved the best comprehensive performance by evaluating quantitative metrics (For validation, MAE: 156.28, PSNR: 20.71 and SSIM: 0.98; For test, MAE: 165.12, PSNR: 20.27 and SSIM: 0.98) and qualitative visualization, demonstrating valid vessel enhancement, superior image fidelity and structural preservation. The approach was further applied to downstream tasks of pulmonary vessel segmentation and vascular quantification. On the test set, the average Dice, clDice, and clRecall of artery and vein pulmonary segmentation was 0.70, 0.71, 0.73 and 0.70, 0.72, 0.75 respectively, all markedly improved compared with NCCT inputs.\@ Inter-class Correlation Coefficient (ICC) for vessel volume between DCCTPA and CTPA was significantly better than that between NCCT and CTPA (Average ICC : 0.81 vs 0.70), indicating effective vascular enhancement in DCCTPA, especially for small vessels.

Title: Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design

Authors: Lianghong Chen, Dongkyu Eugene Kim, Mike Domaratzki, Pingzhao Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21153
Pdf URL: https://arxiv.org/pdf/2510.21153
Copy Paste: [[2510.21153]] Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design(https://arxiv.org/abs/2510.21153)
Keywords: diffusion, generative
Abstract: Designing de novo 3D molecules with desirable properties remains a fundamental challenge in drug discovery and molecular engineering. While diffusion models have demonstrated remarkable capabilities in generating high-quality 3D molecular structures, they often struggle to effectively control complex multi-objective constraints critical for real-world applications. In this study, we propose an uncertainty-aware Reinforcement Learning (RL) framework to guide the optimization of 3D molecular diffusion models toward multiple property objectives while enhancing the overall quality of the generated molecules. Our method leverages surrogate models with predictive uncertainty estimation to dynamically shape reward functions, facilitating balance across multiple optimization objectives. We comprehensively evaluate our framework across three benchmark datasets and multiple diffusion model architectures, consistently outperforming baselines for molecular quality and property optimization. Additionally, Molecular Dynamics (MD) simulations and ADMET profiling of top generated candidates indicate promising drug-like behavior and binding stability, comparable to known Epidermal Growth Factor Receptor (EGFR) inhibitors. Our results demonstrate the strong potential of RL-guided generative diffusion models for advancing automated molecular design.

Title: Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study

Authors: Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, Hao Frank Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21160
Pdf URL: https://arxiv.org/pdf/2510.21160
Copy Paste: [[2510.21160]] Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study(https://arxiv.org/abs/2510.21160)
Keywords: foundation model, in-context
Abstract: How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.

Title: Blockwise Flow Matching: Improving Flow Matching Models For Efficient High-Quality Generation

Authors: Dogyun Park, Taehoon Lee, Minseok Joo, Hyunwoo J. Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21167
Pdf URL: https://arxiv.org/pdf/2510.21167
Copy Paste: [[2510.21167]] Blockwise Flow Matching: Improving Flow Matching Models For Efficient High-Quality Generation(https://arxiv.org/abs/2510.21167)
Keywords: generative
Abstract: Recently, Flow Matching models have pushed the boundaries of high-fidelity data generation across a wide range of domains. It typically employs a single large network to learn the entire generative trajectory from noise to data. Despite their effectiveness, this design struggles to capture distinct signal characteristics across timesteps simultaneously and incurs substantial inference costs due to the iterative evaluation of the entire model. To address these limitations, we propose Blockwise Flow Matching (BFM), a novel framework that partitions the generative trajectory into multiple temporal segments, each modeled by smaller but specialized velocity blocks. This blockwise design enables each block to specialize effectively in its designated interval, improving inference efficiency and sample quality. To further enhance generation fidelity, we introduce a Semantic Feature Guidance module that explicitly conditions velocity blocks on semantically rich features aligned with pretrained representations. Additionally, we propose a lightweight Feature Residual Approximation strategy that preserves semantic quality while significantly reducing inference cost. Extensive experiments on ImageNet 256x256 demonstrate that BFM establishes a substantially improved Pareto frontier over existing Flow Matching methods, achieving 2.1x to 4.9x accelerations in inference complexity at comparable generation performance. Code is available at this https URL.

Title: TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

Authors: Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21171
Pdf URL: https://arxiv.org/pdf/2510.21171
Copy Paste: [[2510.21171]] TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection(https://arxiv.org/abs/2510.21171)
Keywords: anomaly
Abstract: Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.

Title: PLAN: Proactive Low-Rank Allocation for Continual Learning

Authors: Xiequn Wang, Zhan Zhuang, Yu Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21188
Pdf URL: https://arxiv.org/pdf/2510.21188
Copy Paste: [[2510.21188]] PLAN: Proactive Low-Rank Allocation for Continual Learning(https://arxiv.org/abs/2510.21188)
Keywords: foundation model
Abstract: Continual learning (CL) requires models to continuously adapt to new tasks without forgetting past knowledge. In this work, we propose \underline{P}roactive \underline{L}ow-rank \underline{A}llocatio\underline{N} (PLAN), a framework that extends Low-Rank Adaptation (LoRA) to enable efficient and interference-aware fine-tuning of large pre-trained models in CL settings. PLAN proactively manages the allocation of task-specific subspaces by introducing orthogonal basis vectors for each task and optimizing them through a perturbation-based strategy that minimizes conflicts with previously learned parameters. Furthermore, PLAN incorporates a novel selection mechanism that identifies and assigns basis vectors with minimal sensitivity to interference, reducing the risk of degrading past knowledge while maintaining efficient adaptation to new tasks. Empirical results on standard CL benchmarks demonstrate that PLAN consistently outperforms existing methods, establishing a new state-of-the-art for continual learning with foundation models.

Title: 3rd Place Solution to ICCV LargeFineFoodAI Retrieval

Authors: Yang Zhong, Zhiming Wang, Zhaoyang Li, Jinyu Ma, Xiang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21198
Pdf URL: https://arxiv.org/pdf/2510.21198
Copy Paste: [[2510.21198]] 3rd Place Solution to ICCV LargeFineFoodAI Retrieval(https://arxiv.org/abs/2510.21198)
Keywords: diffusion
Abstract: This paper introduces the 3rd place solution to the ICCV LargeFineFoodAI Retrieval Competition on Kaggle. Four basic models are independently trained with the weighted sum of ArcFace and Circle loss, then TTA and Ensemble are successively applied to improve feature representation ability. In addition, a new reranking method for retrieval is proposed based on diffusion and k-reciprocal reranking. Finally, our method scored 0.81219 and 0.81191 mAP@100 on the public and private leaderboard, respectively.

Title: Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models

Authors: Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, Bernie Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21204
Pdf URL: https://arxiv.org/pdf/2510.21204
Copy Paste: [[2510.21204]] Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models(https://arxiv.org/abs/2510.21204)
Keywords: foundation model, in-context
Abstract: Since the seminal work of TabPFN, research on tabular foundation models (TFMs) based on in-context learning (ICL) has challenged long-standing paradigms in machine learning. Without seeing any real-world data, models pretrained on purely synthetic datasets generalize remarkably well across diverse datasets, often using only a moderate number of in-context examples. This shifts the focus in tabular machine learning from model architecture design to the design of synthetic datasets, or, more precisely, to the prior distributions that generate them. Yet the guiding principles for prior design remain poorly understood. This work marks the first attempt to address the gap. We systematically investigate and identify key properties of synthetic priors that allow pretrained TFMs to generalize well. Based on these insights, we introduce Mitra, a TFM trained on a curated mixture of synthetic priors selected for their diversity, distinctiveness, and performance on real-world tabular data. Mitra consistently outperforms state-of-the-art TFMs, such as TabPFNv2 and TabICL, across both classification and regression benchmarks, with better sample efficiency.

Title: On the flow matching interpretability

Authors: Francesco Pivi, Simone Gazza, Davide Evangelista, Roberto Amadini, Maurizio Gabbrielli
Subjects: cs.LG, math.ST, physics.app-ph, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2510.21210
Pdf URL: https://arxiv.org/pdf/2510.21210
Copy Paste: [[2510.21210]] On the flow matching interpretability(https://arxiv.org/abs/2510.21210)
Keywords: diffusion, generative
Abstract: Generative models based on flow matching have demonstrated remarkable success in various domains, yet they suffer from a fundamental limitation: the lack of interpretability in their intermediate generation steps. In fact these models learn to transform noise into data through a series of vector field updates, however the meaning of each step remains opaque. We address this problem by proposing a general framework constraining each flow step to be sampled from a known physical distribution. Flow trajectories are mapped to (and constrained to traverse) the equilibrium states of the simulated physical process. We implement this approach through the 2D Ising model in such a way that flow steps become thermal equilibrium points along a parametric cooling schedule. Our proposed architecture includes an encoder that maps discrete Ising configurations into a continuous latent space, a flow-matching network that performs temperature-driven diffusion, and a projector that returns to discrete Ising states while preserving physical constraints. We validate this framework across multiple lattice sizes, showing that it preserves physical fidelity while outperforming Monte Carlo generation in speed as the lattice size increases. In contrast with standard flow matching, each vector field represents a meaningful stepwise transition in the 2D Ising model's latent space. This demonstrates that embedding physical semantics into generative flows transforms opaque neural trajectories into interpretable physical processes.

Title: Model Merging with Functional Dual Anchors

Authors: Kexuan Shi, Yandong Wen, Weiyang Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21223
Pdf URL: https://arxiv.org/pdf/2510.21223
Copy Paste: [[2510.21223]] Model Merging with Functional Dual Anchors(https://arxiv.org/abs/2510.21223)
Keywords: foundation model
Abstract: Model merging is an efficient post-training strategy for integrating knowledge from multiple finetuned checkpoints of a shared foundation model. Existing methods operate in the parameter space, combining task vectors to mitigate conflicts, but remain constrained by parameter inconsistencies. We propose Functional Dual Anchors (FDAs), a framework that instead models the input-representation space. FDAs are synthetic inputs whose induced gradients align with task vectors, capturing task-specific functional shifts relative to the pretrained model. This perspective bridges joint multi-task training and post-hoc merging, offering both robustness and flexibility. We further introduce a principled initialization scheme and show that FDAs are complementary to parameter-space model merging. Comprehensive experiments demonstrate the effectiveness of FDAs in model merging.

Title: Improved Training Technique for Shortcut Models

Authors: Anh Nguyen, Viet Nguyen, Duc Vu, Trung Dao, Chi Tran, Toan Tran, Anh Tran
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21250
Pdf URL: https://arxiv.org/pdf/2510.21250
Copy Paste: [[2510.21250]] Improved Training Technique for Shortcut Models(https://arxiv.org/abs/2510.21250)
Keywords: generative
Abstract: Shortcut models represent a promising, non-adversarial paradigm for generative modeling, uniquely supporting one-step, few-step, and multi-step sampling from a single trained network. However, their widespread adoption has been stymied by critical performance bottlenecks. This paper tackles the five core issues that held shortcut models back: (1) the hidden flaw of compounding guidance, which we are the first to formalize, causing severe image artifacts; (2) inflexible fixed guidance that restricts inference-time control; (3) a pervasive frequency bias driven by a reliance on low-level distances in the direct domain, which biases reconstructions toward low frequencies; (4) divergent self-consistency arising from a conflict with EMA training; and (5) curvy flow trajectories that impede convergence. To address these challenges, we introduce iSM, a unified training framework that systematically resolves each limitation. Our framework is built on four key improvements: Intrinsic Guidance provides explicit, dynamic control over guidance strength, resolving both compounding guidance and inflexibility. A Multi-Level Wavelet Loss mitigates frequency bias to restore high-frequency details. Scaling Optimal Transport (sOT) reduces training variance and learns straighter, more stable generative paths. Finally, a Twin EMA strategy reconciles training stability with self-consistency. Extensive experiments on ImageNet 256 x 256 demonstrate that our approach yields substantial FID improvements over baseline shortcut models across one-step, few-step, and multi-step generation, making shortcut models a viable and competitive class of generative models.

Title: Correlation Dimension of Auto-Regressive Large Language Models

Authors: Xin Du, Kumiko Tanaka-Ishii
Subjects: cs.CL, cs.AI, nlin.CD
Abstract URL: https://arxiv.org/abs/2510.21258
Pdf URL: https://arxiv.org/pdf/2510.21258
Copy Paste: [[2510.21258]] Correlation Dimension of Auto-Regressive Large Language Models(https://arxiv.org/abs/2510.21258)
Keywords: generative
Abstract: Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors -- such as repetition and incoherence -- even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model's tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. The method is computationally efficient, robust to model quantization (down to 4-bit precision), broadly applicable across autoregressive architectures (e.g., Transformer and Mamba), and provides fresh insight into the generative dynamics of LLMs.

Title: Topology Sculptor, Shape Refiner: Discrete Diffusion Model for High-Fidelity 3D Meshes Generation

Authors: Kaiyu Song, Hanjiang Lai, Yaqing Zhang, Chuangjian Cai, Yan Pan Kun Yue, Jian Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21264
Pdf URL: https://arxiv.org/pdf/2510.21264
Copy Paste: [[2510.21264]] Topology Sculptor, Shape Refiner: Discrete Diffusion Model for High-Fidelity 3D Meshes Generation(https://arxiv.org/abs/2510.21264)
Keywords: diffusion
Abstract: In this paper, we introduce Topology Sculptor, Shape Refiner (TSSR), a novel method for generating high-quality, artist-style 3D meshes based on Discrete Diffusion Models (DDMs). Our primary motivation for TSSR is to achieve highly accurate token prediction while enabling parallel generation, a significant advantage over sequential autoregressive methods. By allowing TSSR to "see" all mesh tokens concurrently, we unlock a new level of efficiency and control. We leverage this parallel generation capability through three key innovations: 1) Decoupled Training and Hybrid Inference, which distinctly separates the DDM-based generation into a topology sculpting stage and a subsequent shape refinement stage. This strategic decoupling enables TSSR to effectively capture both intricate local topology and overarching global shape. 2) An Improved Hourglass Architecture, featuring bidirectional attention enriched by face-vertex-sequence level Rotational Positional Embeddings (RoPE), thereby capturing richer contextual information across the mesh structure. 3) A novel Connection Loss, which acts as a topological constraint to further enhance the realism and fidelity of the generated meshes. Extensive experiments on complex datasets demonstrate that TSSR generates high-quality 3D artist-style meshes, capable of achieving up to 10,000 faces at a remarkable spatial resolution of $1024^3$. The code will be released at: this https URL.

Title: An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination

Authors: Sukanya Patra, Souhaib Ben Taieb
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21296
Pdf URL: https://arxiv.org/pdf/2510.21296
Copy Paste: [[2510.21296]] An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination(https://arxiv.org/abs/2510.21296)
Keywords: foundation model, anomaly
Abstract: Unsupervised anomaly detection (AD) methods typically assume clean training data, yet real-world datasets often contain undetected or mislabeled anomalies, leading to significant performance degradation. Existing solutions require access to the training pipelines, data or prior knowledge of the proportions of anomalies in the data, limiting their real-world applicability. To address this challenge, we propose EPHAD, a simple yet effective test-time adaptation framework that updates the outputs of AD models trained on contaminated datasets using evidence gathered at test time. Our approach integrates the prior knowledge captured by the AD model trained on contaminated datasets with evidence derived from multimodal foundation models like Contrastive Language-Image Pre-training (CLIP), classical AD methods like the Latent Outlier Factor or domain-specific knowledge. We illustrate the intuition behind EPHAD using a synthetic toy example and validate its effectiveness through comprehensive experiments across eight visual AD datasets, twenty-six tabular AD datasets, and a real-world industrial AD dataset. Additionally, we conduct an ablation study to analyse hyperparameter influence and robustness to varying contamination levels, demonstrating the versatility and robustness of EPHAD across diverse AD models and evidence pairs. To ensure reproducibility, our code is publicly available at this https URL.

Title: Efficient semantic uncertainty quantification in language models via diversity-steered sampling

Authors: Ji Won Park, Kyunghyun Cho
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21310
Pdf URL: https://arxiv.org/pdf/2510.21310
Copy Paste: [[2510.21310]] Efficient semantic uncertainty quantification in language models via diversity-steered sampling(https://arxiv.org/abs/2510.21310)
Keywords: diffusion
Abstract: Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model's proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.

Title: VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Authors: Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21323
Pdf URL: https://arxiv.org/pdf/2510.21323
Copy Paste: [[2510.21323]] VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set(https://arxiv.org/abs/2510.21323)
Keywords: self-supervised
Abstract: The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Codes are available at this https URL.

Title: Morphologically Intelligent Perturbation Prediction with FORM

Authors: Reed Naidoo, Matt De Vries, Olga Fourkioti, Vicky Bousgouni, Mar Arias-Garcia, Maria Portillo-Malumbres, Chris Bakal
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21337
Pdf URL: https://arxiv.org/pdf/2510.21337
Copy Paste: [[2510.21337]] Morphologically Intelligent Perturbation Prediction with FORM(https://arxiv.org/abs/2510.21337)
Keywords: diffusion
Abstract: Understanding how cells respond to external stimuli is a central challenge in biomedical research and drug development. Current computational frameworks for modelling cellular responses remain restricted to two-dimensional representations, limiting their capacity to capture the complexity of cell morphology under perturbation. This dimensional constraint poses a critical bottleneck for the development of accurate virtual cell models. Here, we present FORM, a machine learning framework for predicting perturbation-induced changes in three-dimensional cellular structure. FORM consists of two components: a morphology encoder, trained end-to-end via a novel multi-channel VQGAN to learn compact 3D representations of cell shape, and a diffusion-based perturbation trajectory module that captures how morphology evolves across perturbation conditions. Trained on a large-scale dataset of over 65,000 multi-fluorescence 3D cell volumes spanning diverse chemical and genetic perturbations, FORM supports both unconditional morphology synthesis and conditional simulation of perturbed cell states. Beyond generation, FORM can predict downstream signalling activity, simulate combinatorial perturbation effects, and model morphodynamic transitions between states of unseen perturbations. To evaluate performance, we introduce MorphoEval, a benchmarking suite that quantifies perturbation-induced morphological changes in structural, statistical, and biological dimensions. Together, FORM and MorphoEval work toward the realisation of the 3D virtual cell by linking morphology, perturbation, and function through high-resolution predictive simulation.

Title: Compositional Monte Carlo Tree Diffusion for Extendable Planning

Authors: Jaesik Yoon, Hyeonseo Cho, Sungjin Ahn
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21361
Pdf URL: https://arxiv.org/pdf/2510.21361
Copy Paste: [[2510.21361]] Compositional Monte Carlo Tree Diffusion for Extendable Planning(https://arxiv.org/abs/2510.21361)
Keywords: diffusion
Abstract: Monte Carlo Tree Diffusion (MCTD) integrates diffusion models with structured tree search to enable effective trajectory exploration through stepwise reasoning. However, MCTD remains fundamentally limited by training trajectory lengths. While periodic replanning allows plan concatenation for longer plan generation, the planning process remains locally confined, as MCTD searches within individual trajectories without access to global context. We propose Compositional Monte Carlo Tree Diffusion (C-MCTD), a framework that elevates planning from individual trajectory optimization to reasoning over complete plan compositions. C-MCTD introduces three complementary components: (1) Online Composer, which performs globally-aware planning by searching across entire plan compositions; (2) Distributed Composer, which reduces search complexity through parallel exploration from multiple starting points; and (3) Preplan Composer, which accelerates inference by leveraging cached plan graphs.

Title: FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models

Authors: Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.21363
Pdf URL: https://arxiv.org/pdf/2510.21363
Copy Paste: [[2510.21363]] FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models(https://arxiv.org/abs/2510.21363)
Keywords: diffusion
Abstract: Text-to-image diffusion models, such as Stable Diffusion, have demonstrated remarkable capabilities in generating high-quality and diverse images from natural language prompts. However, recent studies reveal that these models often replicate and amplify societal biases, particularly along demographic attributes like gender and race. In this paper, we introduce FairImagen (this https URL), a post-hoc debiasing framework that operates on prompt embeddings to mitigate such biases without retraining or modifying the underlying diffusion model. Our method integrates Fair Principal Component Analysis to project CLIP-based input embeddings into a subspace that minimizes group-specific information while preserving semantic content. We further enhance debiasing effectiveness through empirical noise injection and propose a unified cross-demographic projection method that enables simultaneous debiasing across multiple demographic attributes. Extensive experiments across gender, race, and intersectional settings demonstrate that FairImagen significantly improves fairness with a moderate trade-off in image quality and prompt fidelity. Our framework outperforms existing post-hoc methods and offers a simple, scalable, and model-agnostic solution for equitable text-to-image generation.

Title: BADiff: Bandwidth Adaptive Diffusion Model

Authors: Xi Zhang, Hanwei Zhu, Yan Zhong, Jiamang Wang, Weisi Lin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21366
Pdf URL: https://arxiv.org/pdf/2510.21366
Copy Paste: [[2510.21366]] BADiff: Bandwidth Adaptive Diffusion Model(https://arxiv.org/abs/2510.21366)
Keywords: diffusion
Abstract: In this work, we propose a novel framework to enable diffusion models to adapt their generation quality based on real-time network bandwidth constraints. Traditional diffusion models produce high-fidelity images by performing a fixed number of denoising steps, regardless of downstream transmission limitations. However, in practical cloud-to-device scenarios, limited bandwidth often necessitates heavy compression, leading to loss of fine textures and wasted computation. To address this, we introduce a joint end-to-end training strategy where the diffusion model is conditioned on a target quality level derived from the available bandwidth. During training, the model learns to adaptively modulate the denoising process, enabling early-stop sampling that maintains perceptual quality appropriate to the target transmission condition. Our method requires minimal architectural changes and leverages a lightweight quality embedding to guide the denoising trajectory. Experimental results demonstrate that our approach significantly improves the visual fidelity of bandwidth-adapted generations compared to naive early-stopping, offering a promising solution for efficient image delivery in bandwidth-constrained environments. Code is available at: this https URL.

Title: TerraGen: A Unified Multi-Task Layout Generation Framework for Remote Sensing Data Augmentation

Authors: Datao Tang, Hao Wang, Yudeng Xin, Hui Qiao, Dongsheng Jiang, Yin Li, Zhiheng Yu, Xiangyong Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21391
Pdf URL: https://arxiv.org/pdf/2510.21391
Copy Paste: [[2510.21391]] TerraGen: A Unified Multi-Task Layout Generation Framework for Remote Sensing Data Augmentation(https://arxiv.org/abs/2510.21391)
Keywords: generative
Abstract: Remote sensing vision tasks require extensive labeled data across multiple, interconnected domains. However, current generative data augmentation frameworks are task-isolated, i.e., each vision task requires training an independent generative model, and ignores the modeling of geographical information and spatial constraints. To address these issues, we propose \textbf{TerraGen}, a unified layout-to-image generation framework that enables flexible, spatially controllable synthesis of remote sensing imagery for various high-level vision tasks, e.g., detection, segmentation, and extraction. Specifically, TerraGen introduces a geographic-spatial layout encoder that unifies bounding box and segmentation mask inputs, combined with a multi-scale injection scheme and mask-weighted loss to explicitly encode spatial constraints, from global structures to fine details. Also, we construct the first large-scale multi-task remote sensing layout generation dataset containing 45k images and establish a standardized evaluation protocol for this task. Experimental results show that our TerraGen can achieve the best generation image quality across diverse tasks. Additionally, TerraGen can be used as a universal data-augmentation generator, enhancing downstream task performance significantly and demonstrating robust cross-task generalisation in both full-data and few-shot scenarios.

Title: Large Language Models as Model Organisms for Human Associative Learning

Authors: Camila Kolling, Vy Ai Vo, Mariya Toneva
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21408
Pdf URL: https://arxiv.org/pdf/2510.21408
Copy Paste: [[2510.21408]] Large Language Models as Model Organisms for Human Associative Learning(https://arxiv.org/abs/2510.21408)
Keywords: in-context
Abstract: Associative learning--forming links between co-occurring items--is fundamental to human cognition, reshaping internal representations in complex ways. Testing hypotheses on how representational changes occur in biological systems is challenging, but large language models (LLMs) offer a scalable alternative. Building on LLMs' in-context learning, we adapt a cognitive neuroscience associative learning paradigm and investigate how representations evolve across six models. Our initial findings reveal a non-monotonic pattern consistent with the Non-Monotonic Plasticity Hypothesis, with moderately similar items differentiating after learning. Leveraging the controllability of LLMs, we further show that this differentiation is modulated by the overlap of associated items with the broader vocabulary--a factor we term vocabulary interference, capturing how new associations compete with prior knowledge. We find that higher vocabulary interference amplifies differentiation, suggesting that representational change is influenced by both item similarity and global competition. Our findings position LLMs not only as powerful tools for studying representational dynamics in human-like learning systems, but also as accessible and general computational models for generating new hypotheses about the principles underlying memory reorganization in the brain.

Title: Self-diffusion for Solving Inverse Problems

Authors: Guanxiong Luo, Shoujin Huang, Yanlong Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21417
Pdf URL: https://arxiv.org/pdf/2510.21417
Copy Paste: [[2510.21417]] Self-diffusion for Solving Inverse Problems(https://arxiv.org/abs/2510.21417)
Keywords: diffusion, generative
Abstract: We propose self-diffusion, a novel framework for solving inverse problems without relying on pretrained generative models. Traditional diffusion-based approaches require training a model on a clean dataset to learn to reverse the forward noising process. This model is then used to sample clean solutions -- corresponding to posterior sampling from a Bayesian perspective -- that are consistent with the observed data under a specific task. In contrast, self-diffusion introduces a self-contained iterative process that alternates between noising and denoising steps to progressively refine its estimate of the solution. At each step of self-diffusion, noise is added to the current estimate, and a self-denoiser, which is a single untrained convolutional network randomly initialized from scratch, is continuously trained for certain iterations via a data fidelity loss to predict the solution from the noisy estimate. Essentially, self-diffusion exploits the spectral bias of neural networks and modulates it through a scheduled noise process. Without relying on pretrained score functions or external denoisers, this approach still remains adaptive to arbitrary forward operators and noisy observations, making it highly flexible and broadly applicable. We demonstrate the effectiveness of our approach on a variety of linear inverse problems, showing that self-diffusion achieves competitive or superior performance compared to other methods.

Title: Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings

Authors: Abderrazek Abid, Thanh-Cong Ho, Fakhri Karray
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21424
Pdf URL: https://arxiv.org/pdf/2510.21424
Copy Paste: [[2510.21424]] Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings(https://arxiv.org/abs/2510.21424)
Keywords: generative
Abstract: As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools in various healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in the difficulty of evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption data set and propose comprehensive evaluation methods to evaluate VLMs in HAR. Through comparative experiments with state-of-the-art deep learning models, our findings demonstrate that VLMs achieve comparable performance and, in some cases, even surpass conventional approaches in terms of accuracy. This work contributes a strong benchmark and opens new possibilities for the integration of VLMs into intelligent healthcare systems.

Title: ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents

Authors: Honghua Chen, Yushi Lan, Yongwei Chen, Xingang Pan
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.21432
Pdf URL: https://arxiv.org/pdf/2510.21432
Copy Paste: [[2510.21432]] ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents(https://arxiv.org/abs/2510.21432)
Keywords: diffusion, generative
Abstract: We propose ArtiLatent, a generative framework that synthesizes human-made 3D objects with fine-grained geometry, accurate articulation, and realistic appearance. Our approach jointly models part geometry and articulation dynamics by embedding sparse voxel representations and associated articulation properties, including joint type, axis, origin, range, and part category, into a unified latent space via a variational autoencoder. A latent diffusion model is then trained over this space to enable diverse yet physically plausible sampling. To reconstruct photorealistic 3D shapes, we introduce an articulation-aware Gaussian decoder that accounts for articulation-dependent visibility changes (e.g., revealing the interior of a drawer when opened). By conditioning appearance decoding on articulation state, our method assigns plausible texture features to regions that are typically occluded in static poses, significantly improving visual realism across articulation configurations. Extensive experiments on furniture-like objects from PartNet-Mobility and ACD datasets demonstrate that ArtiLatent outperforms existing approaches in geometric consistency and appearance fidelity. Our framework provides a scalable solution for articulated 3D object synthesis and manipulation.

Title: REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring

Authors: Thanh Cong Ho, Farah Kharrat, Abderrazek Abid, Fakhri Karray
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21445
Pdf URL: https://arxiv.org/pdf/2510.21445
Copy Paste: [[2510.21445]] REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring(https://arxiv.org/abs/2510.21445)
Keywords: anomaly
Abstract: With the widespread adoption of wearable devices in our daily lives, the demand and appeal for remote patient monitoring have significantly increased. Most research in this field has concentrated on collecting sensor data, visualizing it, and analyzing it to detect anomalies in specific diseases such as diabetes, heart disease and depression. However, this domain has a notable gap in the aspect of human-machine interaction. This paper proposes REMONI, an autonomous REmote health MONItoring system that integrates multimodal large language models (MLLMs), the Internet of Things (IoT), and wearable devices. The system automatically and continuously collects vital signs, accelerometer data from a special wearable (such as a smartwatch), and visual data in patient video clips collected from cameras. This data is processed by an anomaly detection module, which includes a fall detection model and algorithms to identify and alert caregivers of the patient's emergency conditions. A distinctive feature of our proposed system is the natural language processing component, developed with MLLMs capable of detecting and recognizing a patient's activity and emotion while responding to healthcare worker's inquiries. Additionally, prompt engineering is employed to integrate all patient information seamlessly. As a result, doctors and nurses can access real-time vital signs and the patient's current state and mood by interacting with an intelligent agent through a user-friendly web application. Our experiments demonstrate that our system is implementable and scalable for real-life scenarios, potentially reducing the workload of medical professionals and healthcare costs. A full-fledged prototype illustrating the functionalities of the system has been developed and being tested to demonstrate the robustness of its various capabilities.

Title: MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection

Authors: Shengtian Yang, Yue Feng, Yingshi Liu, Jingrou Zhang, Jie Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21449
Pdf URL: https://arxiv.org/pdf/2510.21449
Copy Paste: [[2510.21449]] MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection(https://arxiv.org/abs/2510.21449)
Keywords: anomaly
Abstract: Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Recently, offline VAD has garnered substantial research attention, which has been invigorated by the progress in large language models (LLMs) and vision-language models (VLMs), offering the potential for a more nuanced understanding of anomalies. However, online VAD has seldom received attention due to real-time constraints and computational intensity. In this paper, we introduce a novel Memory-based online scoring queue scheme for Training-free VAD (MoniTor), to address the inherent complexities in online VAD. Specifically, MoniTor applies a streaming input to VLMs, leveraging the capabilities of pre-trained large-scale models. To capture temporal dependencies more effectively, we incorporate a novel prediction mechanism inspired by Long Short-Term Memory (LSTM) networks. This ensures the model can effectively model past states and leverage previous predictions to identify anomalous behaviors. Thereby, it better understands the current frame. Moreover, we design a scoring queue and an anomaly prior to dynamically store recent scores and cover all anomalies in the monitoring scenario, providing guidance for LLMs to distinguish between normal and abnormal behaviors over time. We evaluate MoniTor on two large datasets (i.e., UCF-Crime and XD-Violence) containing various surveillance and real-world scenarios. The results demonstrate that MoniTor outperforms state-of-the-art methods and is competitive with weakly supervised methods without training. Code is available at this https URL.

Title: VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance

Authors: Ming Xie, Junqiu Yu, Qiaole Dong, Xiangyang Xue, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21461
Pdf URL: https://arxiv.org/pdf/2510.21461
Copy Paste: [[2510.21461]] VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance(https://arxiv.org/abs/2510.21461)
Keywords: generative
Abstract: Recent video inpainting methods often employ image-to-video (I2V) priors to model temporal consistency across masked frames. While effective in moderate cases, these methods struggle under severe content degradation and tend to overlook spatiotemporal stability, resulting in insufficient control over the latter parts of the video. To address these limitations, we decouple video inpainting into two sub-tasks: multi-frame consistent image inpainting and masked area motion propagation. We propose VidSplice, a novel framework that introduces spaced-frame priors to guide the inpainting process with spatiotemporal cues. To enhance spatial coherence, we design a CoSpliced Module to perform first-frame propagation strategy that diffuses the initial frame content into subsequent reference frames through a splicing mechanism. Additionally, we introduce a delicate context controller module that encodes coherent priors after frame duplication and injects the spliced video into the I2V generative backbone, effectively constraining content distortion during generation. Extensive evaluations demonstrate that VidSplice achieves competitive performance across diverse video inpainting scenarios. Moreover, its design significantly improves both foreground alignment and motion stability, outperforming existing approaches.

Title: MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization

Authors: Chenglong Wang, Yang Gan, Hang Zhou, Chi Hu, Yongyu Mu, Kai Song, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21473
Pdf URL: https://arxiv.org/pdf/2510.21473
Copy Paste: [[2510.21473]] MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization(https://arxiv.org/abs/2510.21473)
Keywords: diffusion
Abstract: Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture the token correlation. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process. More specifically, our MRO approach leverages test-time scaling, reject sampling, and reinforcement learning to directly optimize the token correlation with multiple elaborate rewards. Additionally, we introduce group step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.

Title: ITC-RWKV: Interactive Tissue-Cell Modeling with Recurrent Key-Value Aggregation for Histopathological Subtyping

Authors: Yating Huang, Qijun Yang, Lintao Xiang, Hujun Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21479
Pdf URL: https://arxiv.org/pdf/2510.21479
Copy Paste: [[2510.21479]] ITC-RWKV: Interactive Tissue-Cell Modeling with Recurrent Key-Value Aggregation for Histopathological Subtyping(https://arxiv.org/abs/2510.21479)
Keywords: foundation model
Abstract: Accurate interpretation of histopathological images demands integration of information across spatial and semantic scales, from nuclear morphology and cellular textures to global tissue organization and disease-specific patterns. Although recent foundation models in pathology have shown strong capabilities in capturing global tissue context, their omission of cell-level feature modeling remains a key limitation for fine-grained tasks such as cancer subtype classification. To address this, we propose a dual-stream architecture that models the interplay between macroscale tissue features and aggregated cellular representations. To efficiently aggregate information from large cell sets, we propose a receptance-weighted key-value aggregation model, a recurrent transformer that captures inter-cell dependencies with linear complexity. Furthermore, we introduce a bidirectional tissue-cell interaction module to enable mutual attention between localized cellular cues and their surrounding tissue environment. Experiments on four histopathological subtype classification benchmarks show that the proposed method outperforms existing models, demonstrating the critical role of cell-level aggregation and tissue-cell interaction in fine-grained computational pathology.

Title: Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations

Authors: Kaibo Wang, Jianda Mao, Tong Wu, Yang Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21512
Pdf URL: https://arxiv.org/pdf/2510.21512
Copy Paste: [[2510.21512]] Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations(https://arxiv.org/abs/2510.21512)
Keywords: diffusion
Abstract: Classifier-Free Guidance (CFG) is an essential component of text-to-image diffusion models, and understanding and advancing its operational mechanisms remains a central focus of research. Existing approaches stem from divergent theoretical interpretations, thereby limiting the design space and obscuring key design choices. To address this, we propose a unified perspective that reframes conditional guidance as fixed point iterations, seeking to identify a golden path where latents produce consistent outputs under both conditional and unconditional generation. We demonstrate that CFG and its variants constitute a special case of single-step short-interval iteration, which is theoretically proven to exhibit inefficiency. To this end, we introduce Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems in early diffusion stages with increased iterations. Extensive experiments across diverse datasets and model architectures validate the superiority of FSG over state-of-the-art methods in both image quality and computational efficiency. Our work offers novel perspectives for conditional guidance and unlocks the potential of adaptive design.

Title: Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Authors: Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21518
Pdf URL: https://arxiv.org/pdf/2510.21518
Copy Paste: [[2510.21518]] Head Pursuit: Probing Attention Specialization in Multimodal Transformers(https://arxiv.org/abs/2510.21518)
Keywords: generative
Abstract: Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.

Title: Surrogate-based quantification of policy uncertainty in generative flow networks

Authors: Ramón Nartallo-Kaluarachchi, Robert Manson-Sawko, Shashanka Ubaru, Dongsung Huh, Małgorzata J Zimoń, Lior Horesh, Yoshua Bengio
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.21523
Pdf URL: https://arxiv.org/pdf/2510.21523
Copy Paste: [[2510.21523]] Surrogate-based quantification of policy uncertainty in generative flow networks(https://arxiv.org/abs/2510.21523)
Keywords: generative
Abstract: Generative flow networks are able to sample, via sequential construction, high-reward, complex objects according to a reward function. However, such reward functions are often estimated approximately from noisy data, leading to epistemic uncertainty in the learnt policy. We present an approach to quantify this uncertainty by constructing a surrogate model composed of a polynomial chaos expansion, fit on a small ensemble of trained flow networks. This model learns the relationship between reward functions, parametrised in a low-dimensional space, and the probability distributions over actions at each step along a trajectory of the flow network. The surrogate model can then be used for inexpensive Monte Carlo sampling to estimate the uncertainty in the policy given uncertain rewards. We illustrate the performance of our approach on a discrete and continuous grid-world, symbolic regression, and a Bayesian structure learning task.

Title: FrameShield: Adversarially Robust Video Anomaly Detection

Authors: Mojtaba Nafez, Mobina Poulaei, Nikan Vasei, Bardia Soltani Moakhar, Mohammad Sabokrou, MohammadHossein Rohban
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21532
Pdf URL: https://arxiv.org/pdf/2510.21532
Copy Paste: [[2510.21532]] FrameShield: Adversarially Robust Video Anomaly Detection(https://arxiv.org/abs/2510.21532)
Keywords: anomaly
Abstract: Weakly Supervised Video Anomaly Detection (WSVAD) has achieved notable advancements, yet existing models remain vulnerable to adversarial attacks, limiting their reliability. Due to the inherent constraints of weak supervision, where only video-level labels are provided despite the need for frame-level predictions, traditional adversarial defense mechanisms, such as adversarial training, are not effective since video-level adversarial perturbations are typically weak and inadequate. To address this limitation, pseudo-labels generated directly from the model can enable frame-level adversarial training; however, these pseudo-labels are inherently noisy, significantly degrading performance. We therefore introduce a novel Pseudo-Anomaly Generation method called Spatiotemporal Region Distortion (SRD), which creates synthetic anomalies by applying severe augmentations to localized regions in normal videos while preserving temporal consistency. Integrating these precisely annotated synthetic anomalies with the noisy pseudo-labels substantially reduces label noise, enabling effective adversarial training. Extensive experiments demonstrate that our method significantly enhances the robustness of WSVAD models against adversarial attacks, outperforming state-of-the-art methods by an average of 71.0\% in overall AUROC performance across multiple benchmarks. The implementation and code are publicly available at this https URL.

Title: Document Understanding, Measurement, and Manipulation Using Category Theory

Authors: Jared Claypoole, Yunye Gong, Noson S. Yanofsky, Ajay Divakaran
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21553
Pdf URL: https://arxiv.org/pdf/2510.21553
Copy Paste: [[2510.21553]] Document Understanding, Measurement, and Manipulation Using Category Theory(https://arxiv.org/abs/2510.21553)
Keywords: self-supervised
Abstract: We apply category theory to extract multimodal document structure which leads us to develop information theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. We first develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure to divide the information contained in one or more documents into non-overlapping pieces. The structures extracted in the first and second steps lead us to develop methods to measure and enumerate the information contained in a document. We also build on those steps to develop new summarization techniques, as well as to develop a solution to a new problem viz. exegesis resulting in an extension of the original document. Our question-answer pair methodology enables a novel rate distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of our overall mathematical framework. Finally, we develop a novel self-supervised method using RLVR to improve large pretrained models using consistency constraints such as composability and closure under certain operations that stem naturally from our category theoretic framework.

Title: Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

Authors: Kellen Parker van Dam, Abishek Stephen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.21584
Pdf URL: https://arxiv.org/pdf/2510.21584
Copy Paste: [[2510.21584]] Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist(https://arxiv.org/abs/2510.21584)
Keywords: anomaly
Abstract: Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.

Title: REVE: A Foundation Model for EEG -- Adapting to Any Setup with Large-Scale Pretraining on 25,000 Subjects

Authors: Yassine El Ouahidi, Jonathan Lys, Philipp Thölke, Nicolas Farrugia, Bastien Pasdeloup, Vincent Gripon, Karim Jerbi, Giulia Lioi
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2510.21585
Pdf URL: https://arxiv.org/pdf/2510.21585
Copy Paste: [[2510.21585]] REVE: A Foundation Model for EEG -- Adapting to Any Setup with Large-Scale Pretraining on 25,000 Subjects(https://arxiv.org/abs/2510.21585)
Keywords: foundation model
Abstract: Foundation models have transformed AI by reducing reliance on task-specific data through large-scale pretraining. While successful in language and vision, their adoption in EEG has lagged due to the heterogeneity of public datasets, which are collected under varying protocols, devices, and electrode configurations. Existing EEG foundation models struggle to generalize across these variations, often restricting pretraining to a single setup, resulting in suboptimal performance, in particular under linear probing. We present REVE (Representation for EEG with Versatile Embeddings), a pretrained model explicitly designed to generalize across diverse EEG signals. REVE introduces a novel 4D positional encoding scheme that enables it to process signals of arbitrary length and electrode arrangement. Using a masked autoencoding objective, we pretrain REVE on over 60,000 hours of EEG data from 92 datasets spanning 25,000 subjects, representing the largest EEG pretraining effort to date. REVE achieves state-of-the-art results on 10 downstream EEG tasks, including motor imagery classification, seizure detection, sleep staging, cognitive load estimation, and emotion recognition. With little to no fine-tuning, it demonstrates strong generalization, and nuanced spatio-temporal modeling. We release code, pretrained weights, and tutorials to support standardized EEG research and accelerate progress in clinical neuroscience.

Title: Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

Authors: Minxing Luo, Linlong Fan, Wang Qiushi, Ge Wu, Yiyan Luo, Yuhang Yu, Jinwei Chen, Yaxing Wang, Qingnan Fan, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21590
Pdf URL: https://arxiv.org/pdf/2510.21590
Copy Paste: [[2510.21590]] Restore Text First, Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance(https://arxiv.org/abs/2510.21590)
Keywords: generative
Abstract: Current generative super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce \textbf{TIGER} (\textbf{T}ext-\textbf{I}mage \textbf{G}uided sup\textbf{E}r-\textbf{R}esolution), a novel two-stage framework that breaks this trade-off through a \textit{"text-first, image-later"} paradigm. \textbf{TIGER} explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and then uses them to guide subsequent full-image super-resolution. This glyph-to-image guidance ensures both high fidelity and visual consistency. To support comprehensive training and evaluation, we also contribute the \textbf{UltraZoom-ST} (UltraZoom-Scene Text), the first scene text dataset with extreme zoom (\textbf{$\times$14.29}). Extensive experiments show that \textbf{TIGER} achieves \textbf{state-of-the-art} performance, enhancing readability while preserving overall image quality.

Title: S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

Authors: Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21605
Pdf URL: https://arxiv.org/pdf/2510.21605
Copy Paste: [[2510.21605]] S3OD: Towards Generalizable Salient Object Detection with Synthetic Data(https://arxiv.org/abs/2510.21605)
Keywords: diffusion
Abstract: Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.

Title: Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds

Authors: Oscar Davis, Michael S. Albergo, Nicholas M. Boffi, Michael M. Bronstein, Avishek Joey Bose
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21608
Pdf URL: https://arxiv.org/pdf/2510.21608
Copy Paste: [[2510.21608]] Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds(https://arxiv.org/abs/2510.21608)
Keywords: diffusion, generative
Abstract: Geometric data and purpose-built generative models on them have become ubiquitous in high-impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference -- requiring many steps of complex numerical simulation -- as they are derived from dynamical measure transport frameworks such as diffusion and flow-matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few-step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self-distillation-based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few-step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state-of-the-art sample quality for single- and few-step evaluations, and superior or competitive log-likelihoods using the implicit probability flow.

Title: Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations

Authors: Jens E. d'Hondt, Wieger R. Punter, Odysseas Papapetrou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21610
Pdf URL: https://arxiv.org/pdf/2510.21610
Copy Paste: [[2510.21610]] Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations(https://arxiv.org/abs/2510.21610)
Keywords: generative
Abstract: The increasing need for data privacy and the demand for robust machine learning models have fueled the development of synthetic data generation techniques. However, current methods often succeed in replicating simple summary statistics but fail to preserve both the pairwise and higher-order correlation structure of the data that define the complex, multi-variable interactions inherent in real-world systems. This limitation can lead to synthetic data that is superficially realistic but fails when used for sophisticated modeling tasks. In this white paper, we introduce Generative Correlation Manifolds (GCM), a computationally efficient method for generating synthetic data. The technique uses Cholesky decomposition of a target correlation matrix to produce datasets that, by mathematical proof, preserve the entire correlation structure -- from simple pairwise relationships to higher-order interactions -- of the source dataset. We argue that this method provides a new approach to synthetic data generation with potential applications in privacy-preserving data sharing, robust model training, and simulation.

Title: Epipolar Geometry Improves Video Generation Models

Authors: Orest Kupyn, Fabian Manhardt, Federico Tombari, Christian Rupprecht
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21615
Pdf URL: https://arxiv.org/pdf/2510.21615
Copy Paste: [[2510.21615]] Epipolar Geometry Improves Video Generation Models(https://arxiv.org/abs/2510.21615)
Keywords: diffusion
Abstract: Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite massive training data, these models fail to capture fundamental geometric principles underlying visual content. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable camera trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics, which produce noisy targets that compromise alignment quality. Training on static scenes with dynamic cameras ensures high-quality measurements while the model generalizes effectively to diverse dynamic content. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos without compromising visual quality.

Title: DEEDEE: Fast and Scalable Out-of-Distribution Dynamics Detection

Authors: Tala Aljaafari, Varun Kanade, Philip Torr, Christian Schroeder de Witt
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21638
Pdf URL: https://arxiv.org/pdf/2510.21638
Copy Paste: [[2510.21638]] DEEDEE: Fast and Scalable Out-of-Distribution Dynamics Detection(https://arxiv.org/abs/2510.21638)
Keywords: anomaly
Abstract: Deploying reinforcement learning (RL) in safety-critical settings is constrained by brittleness under distribution shift. We study out-of-distribution (OOD) detection for RL time series and introduce DEEDEE, a two-statistic detector that revisits representation-heavy pipelines with a minimal alternative. DEEDEE uses only an episodewise mean and an RBF kernel similarity to a training summary, capturing complementary global and local deviations. Despite its simplicity, DEEDEE matches or surpasses contemporary detectors across standard RL OOD suites, delivering a 600-fold reduction in compute (FLOPs / wall-time) and an average 5% absolute accuracy gain over strong baselines. Conceptually, our results indicate that diverse anomaly types often imprint on RL trajectories through a small set of low-order statistics, suggesting a compact foundation for OOD detection in complex environments.

Title: Self-Supervised Learning of Synapse Types from EM Images

Authors: Aarav Shetty, Gary B Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21663
Pdf URL: https://arxiv.org/pdf/2510.21663
Copy Paste: [[2510.21663]] Self-Supervised Learning of Synapse Types from EM Images(https://arxiv.org/abs/2510.21663)
Keywords: self-supervised
Abstract: Separating synapses into different classes based on their appearance in EM images has many applications in biology. Examples may include assigning a neurotransmitter to a particular class, or separating synapses whose strength can be modulated from those whose strength is fixed. Traditionally, this has been done in a supervised manner, giving the classification algorithm examples of the different classes. Here we instead separate synapses into classes based only on the observation that nearby synapses in the same neuron are likely more similar than synapses chosen randomly from different cells. We apply our methodology to data from {\it Drosophila}. Our approach has the advantage that the number of synapse types does not need to be known in advance. It may also provide a principled way to select ground-truth that spans the range of synapse structure.

Title: Foundation Models in Dermatopathology: Skin Tissue Classification

Authors: Riya Gupta, Yiwei Zong, Dennis H. Murphree
Subjects: cs.CV, q-bio.QM
Abstract URL: https://arxiv.org/abs/2510.21664
Pdf URL: https://arxiv.org/pdf/2510.21664
Copy Paste: [[2510.21664]] Foundation Models in Dermatopathology: Skin Tissue Classification(https://arxiv.org/abs/2510.21664)
Keywords: foundation model
Abstract: The rapid generation of whole-slide images (WSIs) in dermatopathology necessitates automated methods for efficient processing and accurate classification. This study evaluates the performance of two foundation models, UNI and Virchow2, as feature extractors for classifying WSIs into three diagnostic categories: melanocytic, basaloid, and squamous lesions. Patch-level embeddings were aggregated into slide-level features using a mean-aggregation strategy and subsequently used to train multiple machine learning classifiers, including logistic regression, gradient-boosted trees, and random forest models. Performance was assessed using precision, recall, true positive rate, false positive rate, and the area under the receiver operating characteristic curve (AUROC) on the test set. Results demonstrate that patch-level features extracted using Virchow2 outperformed those extracted via UNI across most slide-level classifiers, with logistic regression achieving the highest accuracy (90%) for Virchow2, though the difference was not statistically significant. The study also explored data augmentation techniques and image normalization to enhance model robustness and generalizability. The mean-aggregation approach provided reliable slide-level feature representations. All experimental results and metrics were tracked and visualized using this http URL, facilitating reproducibility and interpretability. This research highlights the potential of foundation models for automated WSI classification, providing a scalable and effective approach for dermatopathological diagnosis while paving the way for future advancements in slide-level representation learning.

Title: WorldGrow: Generating Infinite 3D World

Authors: Sikuang Li, Chen Yang, Jiemin Fang, Taoran Yi, Jia Lu, Jiazhong Cen, Lingxi Xie, Wei Shen, Qi Tian
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.21682
Pdf URL: https://arxiv.org/pdf/2510.21682
Copy Paste: [[2510.21682]] WorldGrow: Generating Infinite 3D World(https://arxiv.org/abs/2510.21682)
Keywords: foundation model
Abstract: We tackle the challenge of generating the infinitely extendable 3D world -- large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.

Title: BachVid: Training-Free Video Generation with Consistent Background and Character

Authors: Han Yan, Xibin Song, Yifu Wang, Hongdong Li, Pan Ji, Chao Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21696
Pdf URL: https://arxiv.org/pdf/2510.21696
Copy Paste: [[2510.21696]] BachVid: Training-Free Video Generation with Consistent Background and Character(https://arxiv.org/abs/2510.21696)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a significant challenge. Existing methods typically rely on reference images or extensive training, and often only address character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching the intermediate variables, and then inject these cached variables into corresponding positions in newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos without requiring additional training, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.

Title: Visual Diffusion Models are Geometric Solvers

Authors: Nir Goren, Shai Yehezkel, Omer Dahary, Andrey Voynov, Or Patashnik, Daniel Cohen-Or
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21697
Pdf URL: https://arxiv.org/pdf/2510.21697
Copy Paste: [[2510.21697]] Visual Diffusion Models are Geometric Solvers(https://arxiv.org/abs/2510.21697)
Keywords: diffusion, generative
Abstract: In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Simple Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.