2025-10-28

Title: A Feature Engineering Approach for Business Impact-Oriented Failure Detection in Distributed Instant Payment Systems

Authors: Lorenzo Porcelli
Subjects: cs.LG, cs.AI, cs.DC
Abstract URL: https://arxiv.org/abs/2510.21710
Pdf URL: https://arxiv.org/pdf/2510.21710
Copy Paste: [[2510.21710]] A Feature Engineering Approach for Business Impact-Oriented Failure Detection in Distributed Instant Payment Systems(https://arxiv.org/abs/2510.21710)
Keywords: anomaly
Abstract: Instant payment infrastructures have stringent performance requirements, processing millions of transactions daily with zero-downtime expectations. Traditional monitoring approaches fail to bridge the gap between technical infrastructure metrics and business process visibility. We introduce a novel feature engineering approach based on processing times computed between consecutive ISO 20022 message exchanges, creating a compact representation of system state. By applying anomaly detection to these features, we enable early failure detection and localization, allowing incident classification. Experimental evaluation on the TARGET Instant Payment Settlement (TIPS) system, using both real-world incidents and controlled simulations, demonstrates the approach's effectiveness in detecting diverse anomaly patterns and provides inherently interpretable explanations that enable operators to understand the business impact. By mapping features to distinct processing phases, the resulting framework differentiates between internal and external payment system issues, significantly reduces investigation time, and bridges observability gaps in distributed systems where transaction state is fragmented across multiple entities.

Title: Proportion and Perspective Control for Flow-Based Image Generation

Authors: Julien Boudier, Hugo Caselles-Dupré
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21763
Pdf URL: https://arxiv.org/pdf/2510.21763
Copy Paste: [[2510.21763]] Proportion and Perspective Control for Flow-Based Image Generation(https://arxiv.org/abs/2510.21763)
Keywords: diffusion
Abstract: While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: this https URL

Title: H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows

Authors: Harry Zhang, Luca Carlone
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21769
Pdf URL: https://arxiv.org/pdf/2510.21769
Copy Paste: [[2510.21769]] H2OFlow: Grounding Human-Object Affordances with 3D Generative Models and Dense Diffused Flows(https://arxiv.org/abs/2510.21769)
Keywords: diffusion, generative
Abstract: Understanding how humans interact with the surrounding environment, and specifically reasoning about object interactions and affordances, is a critical challenge in computer vision, robotics, and AI. Current approaches often depend on labor-intensive, hand-labeled datasets capturing real-world or simulated human-object interaction (HOI) tasks, which are costly and time-consuming to produce. Furthermore, most existing methods for 3D affordance understanding are limited to contact-based analysis, neglecting other essential aspects of human-object interactions, such as orientation (\eg, humans might have a preferential orientation with respect certain objects, such as a TV) and spatial occupancy (\eg, humans are more likely to occupy certain regions around an object, like the front of a microwave rather than its back). To address these limitations, we introduce \emph{H2OFlow}, a novel framework that comprehensively learns 3D HOI affordances -- encompassing contact, orientation, and spatial occupancy -- using only synthetic data generated from 3D generative models. H2OFlow employs a dense 3D-flow-based representation, learned through a dense diffusion process operating on point clouds. This learned flow enables the discovery of rich 3D affordances without the need for human annotations. Through extensive quantitative and qualitative evaluations, we demonstrate that H2OFlow generalizes effectively to real-world objects and surpasses prior methods that rely on manual annotations or mesh-based representations in modeling 3D affordance.

Title: Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models

Authors: Guo Li, Yuyang Yu, Xuemiao Xu
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.21783
Pdf URL: https://arxiv.org/pdf/2510.21783
Copy Paste: [[2510.21783]] Noise Aggregation Analysis Driven by Small-Noise Injection: Efficient Membership Inference for Diffusion Models(https://arxiv.org/abs/2510.21783)
Keywords: diffusion
Abstract: Diffusion models have demonstrated powerful performance in generating high-quality images. A typical example is text-to-image generator like Stable Diffusion. However, their widespread use also poses potential privacy risks. A key concern is membership inference attacks, which attempt to determine whether a particular data sample was used in the model training process. We propose an efficient membership inference attack method against diffusion models. This method is based on the injection of slight noise and the evaluation of the aggregation degree of the noise distribution. The intuition is that the noise prediction patterns of diffusion models for training set samples and non-training set samples exhibit distinguishable this http URL, we suppose that member images exhibit higher aggregation of predicted noise around a certain time step of the diffusion process. In contrast, the predicted noises of non-member images exhibit a more discrete characteristic around the certain time step. Compared with other existing methods, our proposed method requires fewer visits to the target diffusion model. We inject slight noise into the image under test and then determine its membership by analyzing the aggregation degree of the noise distribution predicted by the model. Empirical findings indicate that our method achieves superior performance across multiple datasets. At the same time, our method can also show better attack effects in ASR and AUC when facing large-scale text-to-image diffusion models, proving the scalability of our method.

Title: Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making

Authors: Larkin Liu, Jalal Etesami
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21788
Pdf URL: https://arxiv.org/pdf/2510.21788
Copy Paste: [[2510.21788]] Online Mixture of Experts: No-Regret Learning for Optimal Collective Decision-Making(https://arxiv.org/abs/2510.21788)
Keywords: generative
Abstract: We explore the use of expert-guided bandit learning, which we refer to as online mixture-of-experts (OMoE). In this setting, given a context, a candidate committee of experts must determine how to aggregate their outputs to achieve optimal results in terms of aggregate accuracy. We propose two algorithms to address this problem. The first algorithm combines aggregate voting with UCB-driven successive elimination, efficiently pruning suboptimal exploration actions. The second algorithm employs an online weighted-majority-voting mechanism, leveraging the respective voting power of each expert proportional to their predictive power. We derive theoretical guarantees for the regret properties in the bandit setting under ideal circumstances, and empirical results are provided accordingly. As a modern study on applications, these methods are applied to the online fine-tuning of a set of expert large language models (LLMs), where after each response, the generative LLM dynamically reweighs its set of experts and/or selects the optimal committee of experts to generate the most accurate response. Our results introduce new methodologies and no-regret guarantees for combining multiple experts to improve on the performance of the an aggregate model overall.

Title: Exploring the design space of diffusion and flow models for data fusion

Authors: Niraj Chaudhari, Manmeet Singh, Naveen Sudharsan, Amit Kumar Srivastava, Harsh Kamath, Dushyant Mahajan, Ayan Paul
Subjects: cs.CV, physics.ins-det
Abstract URL: https://arxiv.org/abs/2510.21791
Pdf URL: https://arxiv.org/pdf/2510.21791
Copy Paste: [[2510.21791]] Exploring the design space of diffusion and flow models for data fusion(https://arxiv.org/abs/2510.21791)
Keywords: diffusion, generative
Abstract: Data fusion is an essential task in various domains, enabling the integration of multi-source information to enhance data quality and insights. One key application is in satellite remote sensing, where fusing multi-sensor observations can improve spatial and temporal resolution. In this study, we explore the design space of diffusion and flow models for data fusion, focusing on the integration of Defense Meteorological Satellite Program's Operational Linescan System (DMSP-OLS) and Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime lights data. Our approach leverages a diverse set of 2D image-to-image generative models, including UNET, diffusion, and flow modeling architectures. We evaluate the effectiveness of these architectures in satellite remote sensing data fusion, identifying diffusion models based on UNet as particularly adept at preserving fine-grained spatial details and generating high-fidelity fused images. We also provide guidance on the selection of noise schedulers in diffusion-based models, highlighting the trade-offs between iterative solvers for faster inference and discrete schedulers for higher-quality reconstructions. Additionally, we explore quantization techniques to optimize memory efficiency and computational cost without compromising performance. Our findings offer practical insights into selecting the most effective diffusion and flow model architectures for data fusion tasks, particularly in remote sensing applications, and provide recommendations for leveraging noise scheduling strategies to enhance fusion quality.

Title: Variance-Reduction Guidance: Sampling Trajectory Optimization for Diffusion Models

Authors: Shifeng Xu, Yanzhu Liu, Adams Wai-Kin Kong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21792
Pdf URL: https://arxiv.org/pdf/2510.21792
Copy Paste: [[2510.21792]] Variance-Reduction Guidance: Sampling Trajectory Optimization for Diffusion Models(https://arxiv.org/abs/2510.21792)
Keywords: diffusion, generative
Abstract: Diffusion models have become emerging generative models. Their sampling process involves multiple steps, and in each step the models predict the noise from a noisy sample. When the models make prediction, the output deviates from the ground truth, and we call such a deviation as \textit{prediction error}. The prediction error accumulates over the sampling process and deteriorates generation quality. This paper introduces a novel technique for statistically measuring the prediction error and proposes the Variance-Reduction Guidance (VRG) method to mitigate this error. VRG does not require model fine-tuning or modification. Given a predefined sampling trajectory, it searches for a new trajectory which has the same number of sampling steps but produces higher quality results. VRG is applicable to both conditional and unconditional generation. Experiments on various datasets and baselines demonstrate that VRG can significantly improve the generation quality of diffusion models. Source code is available at this https URL.

Title: 2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection

Authors: Usman Ali, Ali Zia, Abdul Rehman, Umer Ramzan, Zohaib Hassan, Talha Sattar, Jing Wang, Wei Xiang
Subjects: cs.CV, cs.AI, eess.IV
Abstract URL: https://arxiv.org/abs/2510.21793
Pdf URL: https://arxiv.org/pdf/2510.21793
Copy Paste: [[2510.21793]] 2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection(https://arxiv.org/abs/2510.21793)
Keywords: anomaly
Abstract: Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at this https URL

Title: Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention

Authors: Yinbo Sun, Yuchen Fang, Zhibo Zhu, Jia Li, Yu Liu, Qiwen Deng, Jun Zhou, Hang Yu, Xingyu Lu, Lintao Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21795
Pdf URL: https://arxiv.org/pdf/2510.21795
Copy Paste: [[2510.21795]] Xihe: Scalable Zero-Shot Time Series Learner Via Hierarchical Interleaved Block Attention(https://arxiv.org/abs/2510.21795)
Keywords: foundation model
Abstract: The rapid advancement of time series foundation models (TSFMs) has been propelled by migrating architectures from language models. While existing TSFMs demonstrate impressive performance, their direct adoption of cross-domain architectures constrains effective capture of multiscale temporal dependencies inherent to time series data. This limitation becomes particularly pronounced during zero-shot transfer across datasets with divergent underlying patterns and sampling strategies. To address these challenges, we propose Hierarchical Interleaved Block Attention (HIBA) which employs hierarchical inter- and intra-block sparse attention to effectively capture multi-scale dependencies. Intra-block attention facilitates local information exchange, and inter-block attention operates across blocks to capture global temporal pattern interaction and dynamic evolution. Leveraging the HIBA architecture, we introduce Xihe, a scalable TSFM family spanning from an ultra-efficient 9.5M parameter configuration to high-capacity 1.5B variant. Evaluated on the comprehensive GIFT-Eval benchmark, our most compact Xihe-tiny model (9.5M) surpasses the majority of contemporary TSFMs, demonstrating remarkable parameter efficiency. More impressively, Xihe-max (1.5B) establishes new state-of-the-art zero-shot performance, surpassing previous best results by a substantial margin. This consistent performance excellence across the entire parameter spectrum provides compelling evidence for the exceptional generalization capabilities and architectural superiority of HIBA.

Title: It Takes Two to Tango: Two Parallel Samplers Improve Quality in Diffusion Models for Limited Steps

Authors: Pedro Cisneros-Velarde
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.21802
Pdf URL: https://arxiv.org/pdf/2510.21802
Copy Paste: [[2510.21802]] It Takes Two to Tango: Two Parallel Samplers Improve Quality in Diffusion Models for Limited Steps(https://arxiv.org/abs/2510.21802)
Keywords: diffusion
Abstract: We consider the situation where we have a limited number of denoising steps, i.e., of evaluations of a diffusion model. We show that two parallel processors or samplers under such limitation can improve the quality of the sampled image. Particularly, the two samplers make denoising steps at successive times, and their information is appropriately integrated in the latent image. Remarkably, our method is simple both conceptually and to implement: it is plug-&-play, model agnostic, and does not require any additional fine-tuning or external models. We test our method with both automated and human evaluations for different diffusion models. We also show that a naive integration of the information from the two samplers lowers sample quality. Finally, we find that adding more parallel samplers does not necessarily improve sample quality.

Title: SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling

Authors: Samuel J. Barrett, Docko Sow
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21813
Pdf URL: https://arxiv.org/pdf/2510.21813
Copy Paste: [[2510.21813]] SITS-DECO: A Generative Decoder Is All You Need For Multitask Satellite Image Time Series Modelling(https://arxiv.org/abs/2510.21813)
Keywords: self-supervised, foundation model, generative
Abstract: Earth Observation (EO) Foundation Modelling (FM) holds great promise for simplifying and improving the use of EO data for diverse real-world tasks. However, most existing models require additional adaptation before they can be used and are structured rigidly around particular data sources or training approaches. To address this, we take inspiration from large language models, where diverse tasks, both pre-training and downstream, are implicitly captured through next-token prediction over unified token sequences, leveraging the structure and diversity of the training data. We introduce SITS-DECO (Satellite Image Time Series-DECoder Only), a proof-of-concept generative model that applies this unified-sequence framing to EO data. Using a simple GPT-style decoder-only architecture, and demonstrate its ability to perform useful EO tasks (pixel-wise, multi-temporal, multi-modal crop-type classification) in a purely generative framework. Through symbolic prompting, we show that the model can perform multiple supervised and self-supervised tasks within a single unified architecture, without task- or modality-specific adaptation. Despite its simplicity and lack of spatial context, SITS-DECO outperforms much larger EO foundation models on crop-type classification (PASTIS-R) demonstrating that dense temporal sequence modelling is a critical missing ingredient in the current paradigm. This work exemplifies a data-centric modelling paradigm in which capability arises from the diversity and structure of the training data rather than from architectural complexity. SITS-DECO provides a lightweight, practical route to multi-modal, multi-task EO modelling, and a conceptual bridge toward future generative EO foundation models.

Title: Wavelet-based GAN Fingerprint Detection using ResNet50

Authors: Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21822
Pdf URL: https://arxiv.org/pdf/2510.21822
Copy Paste: [[2510.21822]] Wavelet-based GAN Fingerprint Detection using ResNet50(https://arxiv.org/abs/2510.21822)
Keywords: generative
Abstract: Identifying images generated by Generative Adversarial Networks (GANs) has become a significant challenge in digital image forensics. This research presents a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classification layer to differentiate the StyleGAN-generated images from real ones. Haar and Daubechies wavelet filters are applied to convert the input images into multi-resolution representations, which will then be fed to a ResNet50 network for classification, capitalizing on subtle artifacts left by the generative process. Moreover, the wavelet-based models are compared to an identical ResNet50 model trained on spatial data. The Haar and Daubechies preprocessed models achieved a greater accuracy of 93.8 percent and 95.1 percent, much higher than the model developed in the spatial domain (accuracy rate of 81.5 percent). The Daubechies-based model outperforms Haar, showing that adding layers of descriptive frequency patterns can lead to even greater distinguishing power. These results indicate that the GAN-generated images have unique wavelet-domain artifacts or "fingerprints." The method proposed illustrates the effectiveness of wavelet-domain analysis to detect GAN images and emphasizes the potential of further developing the capabilities of future deepfake detection systems.

Title: A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis

Authors: Yi Yin, Yuntao Shou, Zao Dai, Yun Peng, Tao Meng, Wei Ai, Keqin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21829
Pdf URL: https://arxiv.org/pdf/2510.21829
Copy Paste: [[2510.21829]] A Flow Model with Low-Rank Transformers for Incomplete Multimodal Survival Analysis(https://arxiv.org/abs/2510.21829)
Keywords: generative
Abstract: In recent years, multimodal medical data-based survival analysis has attracted much attention. However, real-world datasets often suffer from the problem of incomplete modality, where some patient modality information is missing due to acquisition limitations or system failures. Existing methods typically infer missing modalities directly from observed ones using deep neural networks, but they often ignore the distributional discrepancy across modalities, resulting in inconsistent and unreliable modality reconstruction. To address these challenges, we propose a novel framework that combines a low-rank Transformer with a flow-based generative model for robust and flexible multimodal survival prediction. Specifically, we first formulate the concerned problem as incomplete multimodal survival analysis using the multi-instance representation of whole slide images (WSIs) and genomic profiles. To realize incomplete multimodal survival analysis, we propose a class-specific flow for cross-modal distribution alignment. Under the condition of class labels, we model and transform the cross-modal distribution. By virtue of the reversible structure and accurate density modeling capabilities of the normalizing flow model, the model can effectively construct a distribution-consistent latent space of the missing modality, thereby improving the consistency between the reconstructed data and the true distribution. Finally, we design a lightweight Transformer architecture to model intra-modal dependencies while alleviating the overfitting problem in high-dimensional modality fusion by virtue of the low-rank Transformer. Extensive experiments have demonstrated that our method not only achieves state-of-the-art performance under complete modality settings, but also maintains robust and superior accuracy under the incomplete modalities scenario.

Title: A Multimodal, Multitask System for Generating E Commerce Text Listings from Images

Authors: Nayan Kumar Singh
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.21835
Pdf URL: https://arxiv.org/pdf/2510.21835
Copy Paste: [[2510.21835]] A Multimodal, Multitask System for Generating E Commerce Text Listings from Images(https://arxiv.org/abs/2510.21835)
Keywords: generative
Abstract: Manually generating catchy descriptions and names is labor intensive and a slow process for retailers. Although generative AI provides an automation solution in form of Vision to Language Models (VLM), the current VLMs are prone to factual "hallucinations". Siloed, single task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end to end, multi task system that generates factually grounded textual listings from a single image. The contributions of this study are two proposals for the model architecture. First, application of multi task learning approach for fine tuning a vision encoder where a single vision backbone is jointly trained on attribute prediction such as color, hemline and neck style and price regression. Second, introduction of a hierarchical generation process where the model's own predicted attributes are embedded in a prompt and fed to the text decoder to improve factual consistency. The experiments demonstrate the superiority of this architecture. The multi tasking approach outperforms both the independent price regression, with a 3.6% better R2 Value and attribute classification, with a 6.6% improvement F1 score. Critically, the hierarchical generation process proves highly effective, slashing the factual hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non hierarchical ablation. The hierarchical approach also reduces the latency of the autoregressive text generation process by a factor of 3.5 when compared to direct vision to language model of similar size. One minor caveat is that the model does perform 3.5% worse than direct vision-to-language model on ROUGE-L score.

Title: Improving the Physics of Video Generation with VJEPA-2 Reward Signal

Authors: Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.21840
Pdf URL: https://arxiv.org/pdf/2510.21840
Copy Paste: [[2510.21840]] Improving the Physics of Video Generation with VJEPA-2 Reward Signal(https://arxiv.org/abs/2510.21840)
Keywords: generative
Abstract: This is a short technical report describing the winning entry of the PhysicsIQ Challenge, presented at the Perception Test Workshop at ICCV 2025. State-of-the-art video generative models exhibit severely limited physical understanding, and often produce implausible videos. The Physics IQ benchmark has shown that visual realism does not imply physics understanding. Yet, intuitive physics understanding has shown to emerge from SSL pretraining on natural videos. In this report, we investigate whether we can leverage SSL-based video world models to improve the physics plausibility of video generative models. In particular, we build ontop of the state-of-the-art video generative model MAGI-1 and couple it with the recently introduced Video Joint Embedding Predictive Architecture 2 (VJEPA-2) to guide the generation process. We show that by leveraging VJEPA-2 as reward signal, we can improve the physics plausibility of state-of-the-art video generative models by ~6%.

Title: KARIPAP: Quantum-Inspired Tensor Network Compression of Large Language Models Using Infinite Projected Entangled Pair States and Tensor Renormalization Group

Authors: Azree Nazri
Subjects: cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2510.21844
Pdf URL: https://arxiv.org/pdf/2510.21844
Copy Paste: [[2510.21844]] KARIPAP: Quantum-Inspired Tensor Network Compression of Large Language Models Using Infinite Projected Entangled Pair States and Tensor Renormalization Group(https://arxiv.org/abs/2510.21844)
Keywords: generative
Abstract: Large Language Models (LLMs) like ChatGPT and LLaMA drive rapid progress in generative AI, yet their huge parameter scales create severe computational and environmental burdens. High training costs, energy use, and limited device deployment hinder accessibility. Existing compression - pruning, distillation, low-rank, and quantization - reduces size but ignores complex inter-layer correlations. We propose KARIPAP, a quantum-inspired tensor network compression using Infinite Projected Entangled Pair States (iPEPS) and Tensor Renormalization Group (TRG) contraction. Unlike 1D Matrix Product States, iPEPS captures multi-directional entanglement in attention and deep transformer layers. TRG ensures polynomial-time contraction, making tensorization feasible while preserving key correlation geometry. Experiments on LLaMA-2 7B show up to 93% memory and 70% parameter reduction, with 50% faster training, 25% faster inference, and only 2-3% accuracy loss. Layer-wise entanglement profiling reveals redundancy in deeper layers, confirming their suitability for tensor factorization. KARIPAP demonstrates that modern LLMs occupy low-dimensional entanglement manifolds, enabling scalable, energy-efficient, and quantum-aware AI architectures.

Title: SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization

Authors: Kaiyi Xu, Junchao Gong, Wenlong Zhang, Ben Fei, Lei Bai, Wanli Ouyang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.21847
Pdf URL: https://arxiv.org/pdf/2510.21847
Copy Paste: [[2510.21847]] SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization(https://arxiv.org/abs/2510.21847)
Keywords: diffusion, generative
Abstract: Precipitation nowcasting based on radar echoes plays a crucial role in monitoring extreme weather and supporting disaster prevention. Although deep learning approaches have achieved significant progress, they still face notable limitations. For example, deterministic models tend to produce over-smoothed predictions, which struggle to capture extreme events and fine-scale precipitation patterns. Probabilistic generative models, due to their inherent randomness, often show fluctuating performance across different metrics and rarely achieve consistently optimal results. Furthermore, precipitation nowcasting is typically evaluated using multiple metrics, some of which are inherently conflicting. For instance, there is often a trade-off between the Critical Success Index (CSI) and the False Alarm Ratio (FAR), making it challenging for existing models to deliver forecasts that perform well on both metrics simultaneously. To address these challenges, we introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. Specifically, we propose SynCast, a method that employs the two-stage post-training framework of Diffusion Sequential Preference Optimization (Diffusion-SPO), to progressively align conflicting metrics and consistently achieve superior performance. In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms. Building on this foundation, the second stage further optimizes CSI with constraints that preserve FAR alignment, thereby achieving synergistic improvements across these conflicting metrics.

Title: Poisson Flow Consistency Training

Authors: Anthony Zhang, Mahmut Gokmen, Dennis Hein, Rongjun Ge, Wenjun Xia, Ge Wang, Jin Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21857
Pdf URL: https://arxiv.org/pdf/2510.21857
Copy Paste: [[2510.21857]] Poisson Flow Consistency Training(https://arxiv.org/abs/2510.21857)
Keywords: generative
Abstract: The Poisson Flow Consistency Model (PFCM) is a consistency-style model based on the robust Poisson Flow Generative Model++ (PFGM++) which has achieved success in unconditional image generation and CT image denoising. Yet the PFCM can only be trained in distillation which limits the potential of the PFCM in many data modalities. The objective of this research was to create a method to train the PFCM in isolation called Poisson Flow Consistency Training (PFCT). The perturbation kernel was leveraged to remove the pretrained PFGM++, and the sinusoidal discretization schedule and Beta noise distribution were introduced in order to facilitate adaptability and improve sample quality. The model was applied to the task of low dose computed tomography image denoising and improved the low dose image in terms of LPIPS and SSIM. It also displayed similar denoising effectiveness as models like the Consistency Model. PFCT is established as a valid method of training the PFCM from its effectiveness in denoising CT images, showing potential with competitive results to other generative models. Further study is needed in the precise optimization of PFCT and in its applicability to other generative modeling tasks. The framework of PFCT creates more flexibility for the ways in which a PFCM can be created and can be applied to the field of generative modeling.

Title: The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems

Authors: Bentley DeVilling (Course Correct Labs, Independent Research Group)
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.21861
Pdf URL: https://arxiv.org/pdf/2510.21861
Copy Paste: [[2510.21861]] The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems(https://arxiv.org/abs/2510.21861)
Keywords: generative
Abstract: Large language models are often described as capable of reflective reasoning, yet recursive self-evaluation without external feedback frequently yields reformulation rather than progress. We test this prediction in a cross-provider study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini, Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families (arithmetic, code, explanation, reflection), each iterated ten times under two conditions: ungrounded self-critique and a minimal grounding intervention (a single verification step at iteration three). Mean informational change (delta I, measured via normalized edit distance) declined by 55% from early (0.193) to late (0.087) iterations in ungrounded runs, with consistent patterns across all three providers. Grounded runs showed a +28% rebound in informational change immediately after the intervention and sustained non-zero variance thereafter. Complementary measures-n-gram novelty, embedding drift, and character-level entropy-converged on the same pattern: reflection without contact tends toward informational closure. We interpret this as evidence for a structural limit on self-correction in generative reasoning: without an exchange of information with an independent verifier or environment, recursive inference approaches an attractor state of epistemic stasis. Minimal grounding functions as dissipative coupling, reintroducing informational flux. The cross-architecture consistency suggests the mirror loop arises from shared autoregressive training objectives rather than provider-specific alignment schemes. The results delineate when reflection is performative rather than epistemic and motivate design principles for grounded, cooperative reasoning. Materials and code are publicly available.

Title: Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications

Authors: Shamim Yazdani, Akansha Singh, Nripsuta Saxena, Zichong Wang, Avash Palikhe, Deng Pan, Umapada Pal, Jie Yang, Wenbin Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.21887
Pdf URL: https://arxiv.org/pdf/2510.21887
Copy Paste: [[2510.21887]] Generative AI in Depth: A Survey of Recent Advances, Model Variants, and Real-World Applications(https://arxiv.org/abs/2510.21887)
Keywords: diffusion, generative
Abstract: In recent years, deep learning based generative models, particularly Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models (DMs), have been instrumental in in generating diverse, high-quality content across various domains, such as image and video synthesis. This capability has led to widespread adoption of these models and has captured strong public interest. As they continue to advance at a rapid pace, the growing volume of research, expanding application areas, and unresolved technical challenges make it increasingly difficult to stay current. To address this need, this survey introduces a comprehensive taxonomy that organizes the literature and provides a cohesive framework for understanding the development of GANs, VAEs, and DMs, including their many variants and combined approaches. We highlight key innovations that have improved the quality, diversity, and controllability of generated outputs, reflecting the expanding potential of generative artificial intelligence. In addition to summarizing technical progress, we examine rising ethical concerns, including the risks of misuse and the broader societal impact of synthetic media. Finally, we outline persistent challenges and propose future research directions, offering a structured and forward looking perspective for researchers in this fast evolving field.

Title: The Principles of Diffusion Models

Authors: Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon
Subjects: cs.LG, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2510.21890
Pdf URL: https://arxiv.org/pdf/2510.21890
Copy Paste: [[2510.21890]] The Principles of Diffusion Models(https://arxiv.org/abs/2510.21890)
Keywords: diffusion
Abstract: This monograph presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the monograph discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.

Title: AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

Authors: Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2510.21935
Pdf URL: https://arxiv.org/pdf/2510.21935
Copy Paste: [[2510.21935]] AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing(https://arxiv.org/abs/2510.21935)
Keywords: anomaly
Abstract: Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using a contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.

Title: Towards Low-Latency and Adaptive Ransomware Detection Using Contrastive Learning

Authors: Zhixin Pan, Ziyu Shu, Amberbir Alemayoh
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2510.21957
Pdf URL: https://arxiv.org/pdf/2510.21957
Copy Paste: [[2510.21957]] Towards Low-Latency and Adaptive Ransomware Detection Using Contrastive Learning(https://arxiv.org/abs/2510.21957)
Keywords: self-supervised
Abstract: Ransomware has become a critical threat to cybersecurity due to its rapid evolution, the necessity for early detection, and growing diversity, posing significant challenges to traditional detection methods. While AI-based approaches had been proposed by prior works to assist ransomware detection, existing methods suffer from three major limitations, ad-hoc feature dependencies, delayed response, and limited adaptability to unseen variants. In this paper, we propose a framework that integrates self-supervised contrastive learning with neural architecture search (NAS) to address these challenges. Specifically, this paper offers three important contributions. (1) We design a contrastive learning framework that incorporates hardware performance counters (HPC) to analyze the runtime behavior of target ransomware. (2) We introduce a customized loss function that encourages early-stage detection of malicious activity, and significantly reduces the detection latency. (3) We deploy a neural architecture search (NAS) framework to automatically construct adaptive model architectures, allowing the detector to flexibly align with unseen ransomware variants. Experimental results show that our proposed method achieves significant improvements in both detection accuracy (up to 16.1%) and response time (up to 6x) compared to existing approaches while maintaining robustness under evasive attacks.

Title: Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing

Authors: Iskander Azangulov, Teodora Pandeva, Niranjani Prasad, Javier Zazo, Sushrut Karmalkar
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2510.21961
Pdf URL: https://arxiv.org/pdf/2510.21961
Copy Paste: [[2510.21961]] Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing(https://arxiv.org/abs/2510.21961)
Keywords: diffusion
Abstract: Masked diffusion models (MDMs) offer a compelling alternative to autoregressive models (ARMs) for discrete text generation because they enable parallel token sampling, rather than sequential, left-to-right generation. This means potentially much faster inference. However, effective parallel sampling faces two competing requirements: (i) simultaneously updated tokens must be conditionally independent, and (ii) updates should prioritise high-confidence predictions. These goals conflict because high-confidence predictions often cluster and depend on each other, opportunities for parallel updates. We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our method identifies token dependencies and removes lower-confidence tokens from conflicting groups. This produces sets of indices for unmasking that satisfy both independence and confidence criteria. Our approach ensures improved parallel unmasking through approximate conditional independence testing. Our experiments show that PUNT delivers a superior trade-off between accuracy and compute when compared to other strong training-free baselines, especially for generation of longer sequences. On the IFEval benchmark, it achieves up to 16\% higher accuracy over baseline methods, including sequential generation (one-by-one). These gains hold across different values of hyperparameters, mitigating the need for brittle hyperparameter tuning. Moreover, we observe that PUNT induces an emergent hierarchical generation strategy, where the model first establishes high-level paragraph structure before local refinement, suggesting a planning-like generation process that contributes to strong alignment performance.

Title: Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

Authors: Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.21986
Pdf URL: https://arxiv.org/pdf/2510.21986
Copy Paste: [[2510.21986]] Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers(https://arxiv.org/abs/2510.21986)
Keywords: diffusion, generative
Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse--Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train--inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.

Title: LiteDiff

Authors: Ruchir Namjoshi, Nagasai Thadishetty, Vignesh Kumar, Hemanth Venkateshwara
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22004
Pdf URL: https://arxiv.org/pdf/2510.22004
Copy Paste: [[2510.22004]] LiteDiff(https://arxiv.org/abs/2510.22004)
Keywords: diffusion
Abstract: In recent years, diffusion models have demonstrated remarkable success in high-fidelity image synthesis. However, fine-tuning these models for specialized domains, such as medical imaging, remains challenging due to limited domain-specific data and the high computational cost of full model adaptation. In this paper, we introduce Lite-Diff (Lightweight Diffusion Model Adaptation), a novel finetuning approach that integrates lightweight adaptation layers into a frozen diffusion U-Net while enhancing training with a latent morphological autoencoder (for domain-specific latent consistency) and a pixel level discriminator(for adversarial alignment). By freezing weights of the base model and optimizing only small residual adapter modules, LiteDiff significantly reduces the computational overhead and mitigates overfitting, even in minimal-data settings. Additionally, we conduct ablation studies to analyze the effects of selectively integrating adaptation layers in different U-Net blocks, revealing an optimal balance between efficiency and performance. Experiments on three chest X-ray datasets - (1) Kaggle Chest X-Ray Pneumonia, (2) NIH Chest X-ray14 and (3) VinBigData Chest X_ray demonstrate that LiteDiff achieves superior adaptation efficiency compared to naive full fine-tuning. Our framework provides a promising direction for transfer learning in diffusion models, facilitating their deployment in diverse low data domains.

Title: FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing

Authors: Or Ronai, Vladimir Kulikov, Tomer Michaeli
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2510.22010
Pdf URL: https://arxiv.org/pdf/2510.22010
Copy Paste: [[2510.22010]] FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing(https://arxiv.org/abs/2510.22010)
Keywords: diffusion
Abstract: The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. Here we introduce FlowOpt - a zero-order (gradient-free) optimization framework that treats the entire flow process as a black box, enabling optimization through the whole sampling path without backpropagation through the model. Our method is both highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt's step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate how FlowOpt can be used for image editing, showcasing two options: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to a target text prompt. In both cases, FlowOpt achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods. Code and examples are available on the project's webpage.

Title: Linearized Optimal Transport for Analysis of High-Dimensional Point-Cloud and Single-Cell Data

Authors: Tianxiang Wang, Yingtong Ke, Dhananjay Bhaskar, Smita Krishnaswamy, Alexander Cloninger
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22033
Pdf URL: https://arxiv.org/pdf/2510.22033
Copy Paste: [[2510.22033]] Linearized Optimal Transport for Analysis of High-Dimensional Point-Cloud and Single-Cell Data(https://arxiv.org/abs/2510.22033)
Keywords: generative
Abstract: Single-cell technologies generate high-dimensional point clouds of cells, enabling detailed characterization of complex patient states and treatment responses. Yet each patient is represented by an irregular point cloud rather than a simple vector, making it difficult to directly quantify and compare biological differences between individuals. Nonlinear methods such as kernels and neural networks achieve predictive accuracy but act as black boxes, offering little biological interpretability. To address these limitations, we adapt the Linear Optimal Transport (LOT) framework to this setting, embedding irregular point clouds into a fixed-dimensional Euclidean space while preserving distributional structure. This embedding provides a principled linear representation that preserves optimal transport geometry while enabling downstream analysis. It also forms a registration between any two patients, enabling direct comparison of their cellular distributions. Within this space, LOT enables: (i) \textbf{accurate and interpretable classification} of COVID-19 patient states, where classifier weights map back to specific markers and spatial regions driving predictions; and (ii) \textbf{synthetic data generation} for patient-derived organoids, exploiting the linearity of the LOT embedding. LOT barycenters yield averaged cellular profiles representing combined conditions or samples, supporting drug interaction testing. Together, these results establish LOT as a unified framework that bridges predictive performance, interpretability, and generative modeling. By transforming heterogeneous point clouds into structured embeddings directly traceable to the original data, LOT opens new opportunities for understanding immune variation and treatment effects in high-dimensional biological systems.

Title: Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning

Authors: Mohammad Ali Etemadi Naeen, Hoda Mohammadzade, Saeed Bagheri Shouraki
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22056
Pdf URL: https://arxiv.org/pdf/2510.22056
Copy Paste: [[2510.22056]] Human-Centric Anomaly Detection in Surveillance Videos Using YOLO-World and Spatio-Temporal Deep Learning(https://arxiv.org/abs/2510.22056)
Keywords: anomaly
Abstract: Anomaly detection in surveillance videos remains a challenging task due to the diversity of abnormal events, class imbalance, and scene-dependent visual clutter. To address these issues, we propose a robust deep learning framework that integrates human-centric preprocessing with spatio-temporal modeling for multi-class anomaly classification. Our pipeline begins by applying YOLO-World - an open-vocabulary vision-language detector - to identify human instances in raw video clips, followed by ByteTrack for consistent identity-aware tracking. Background regions outside detected bounding boxes are suppressed via Gaussian blurring, effectively reducing scene-specific distractions and focusing the model on behaviorally relevant foreground content. The refined frames are then processed by an ImageNet-pretrained InceptionV3 network for spatial feature extraction, and temporal dynamics are captured using a bidirectional LSTM (BiLSTM) for sequence-level classification. Evaluated on a five-class subset of the UCF-Crime dataset (Normal, Burglary, Fighting, Arson, Explosion), our method achieves a mean test accuracy of 92.41% across three independent trials, with per-class F1-scores consistently exceeding 0.85. Comprehensive evaluation metrics - including confusion matrices, ROC curves, and macro/weighted averages - demonstrate strong generalization and resilience to class imbalance. The results confirm that foreground-focused preprocessing significantly enhances anomaly discrimination in real-world surveillance scenarios.

Title: Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

Authors: Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22067
Pdf URL: https://arxiv.org/pdf/2510.22067
Copy Paste: [[2510.22067]] Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation(https://arxiv.org/abs/2510.22067)
Keywords: generative
Abstract: Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.

Title: MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification

Authors: Luca Caldera, Giacomo Bottacini, Lara Cavinato
Subjects: cs.LG, cs.CV, eess.IV, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22070
Pdf URL: https://arxiv.org/pdf/2510.22070
Copy Paste: [[2510.22070]] MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification(https://arxiv.org/abs/2510.22070)
Keywords: generative
Abstract: Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, without task alignment, fails to provide a robust foundation for clinical use. We propose MAGIC-Flow, a conditional multiscale normalizing flow architecture that performs generation and classification within a single modular framework. The model is built as a hierarchy of invertible and differentiable bijections, where the Jacobian determinant factorizes across sub-transformations. We show how this ensures exact likelihood computation and stable optimization, while invertibility enables explicit visualization of sample likelihoods, providing an interpretable lens into the model's reasoning. By conditioning on class labels, MAGIC-Flow supports controllable sample synthesis and principled class-probability estimation, effectively aiding both generative and discriminative objectives. We evaluate MAGIC-Flow against top baselines using metrics for similarity, fidelity, and diversity. Across multiple datasets, it addresses generation and classification under scanner noise, and modality-specific synthesis and identification. Results show MAGIC-Flow creates realistic, diverse samples and improves classification. MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, with direct benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.

Title: Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation

Authors: Bailey Trang, Parham Saremi, Alan Q. Wang, Fangrui Huang, Zahra TehraniNasab, Amar Kumar, Tal Arbel, Li Fei-Fei, Ehsan Adeli
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22107
Pdf URL: https://arxiv.org/pdf/2510.22107
Copy Paste: [[2510.22107]] Discovering Latent Graphs with GFlowNets for Diverse Conditional Image Generation(https://arxiv.org/abs/2510.22107)
Keywords: generative
Abstract: Capturing diversity is crucial in conditional and prompt-based image generation, particularly when conditions contain uncertainty that can lead to multiple plausible outputs. To generate diverse images reflecting this diversity, traditional methods often modify random seeds, making it difficult to discern meaningful differences between samples, or diversify the input prompt, which is limited in verbally interpretable diversity. We propose Rainbow, a novel conditional image generation framework, applicable to any pretrained conditional generative model, that addresses inherent condition/prompt uncertainty and generates diverse plausible images. Rainbow is based on a simple yet effective idea: decomposing the input condition into diverse latent representations, each capturing an aspect of the uncertainty and generating a distinct image. First, we integrate a latent graph, parameterized by Generative Flow Networks (GFlowNets), into the prompt representation computation. Second, leveraging GFlowNets' advanced graph sampling capabilities to capture uncertainty and output diverse trajectories over the graph, we produce multiple trajectories that collectively represent the input condition, leading to diverse condition representations and corresponding output images. Evaluations on natural image and medical image datasets demonstrate Rainbow's improvement in both diversity and fidelity across image synthesis, image generation, and counterfactual generation tasks.

Title: GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

Authors: Karim Elmaaroufi, Liheng Lai, Justin Svegliato, Yutong Bai, Sanjit A. Seshia, Matei Zaharia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22118
Pdf URL: https://arxiv.org/pdf/2510.22118
Copy Paste: [[2510.22118]] GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation(https://arxiv.org/abs/2510.22118)
Keywords: generative
Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning\textemdash{}a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6\% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of higher quality than existing tools that produce similar datasets as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs creating questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16\% human-validated accuracy\textemdash{}compared to 57.6\% on a dataset generated by recent work. % or recent work Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5\% on BDD and 37.9\% on NuImages for Llama 3.2B 11B, and when trained on all questions types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found on our \href{this https URL}{project page}.

Title: CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding

Authors: Lihuang Fang, Xiao Hu, Yuchen Zou, Hong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22119
Pdf URL: https://arxiv.org/pdf/2510.22119
Copy Paste: [[2510.22119]] CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding(https://arxiv.org/abs/2510.22119)
Keywords: foundation model
Abstract: Deep stereo matching has advanced significantly on benchmark datasets through fine-tuning but falls short of the zero-shot generalization seen in foundation models in other vision tasks. We introduce CogStereo, a novel framework that addresses challenging regions, such as occlusions or weak textures, without relying on dataset-specific priors. CogStereo embeds implicit spatial cognition into the refinement process by using monocular depth features as priors, capturing holistic scene understanding beyond local correspondences. This approach ensures structurally coherent disparity estimation, even in areas where geometry alone is inadequate. CogStereo employs a dual-conditional refinement mechanism that combines pixel-wise uncertainty with cognition-guided features for consistent global correction of mismatches. Extensive experiments on Scene Flow, KITTI, Middlebury, ETH3D, EuRoc, and real-world demonstrate that CogStereo not only achieves state-of-the-art results but also excels in cross-domain generalization, shifting stereo vision towards a cognition-driven approach.

Title: LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction

Authors: Yuhang Gao, Xiang Xiang, Sheng Zhong, Guoyou Wang
Subjects: cs.CV, cs.CL, cs.LG, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2510.22141
Pdf URL: https://arxiv.org/pdf/2510.22141
Copy Paste: [[2510.22141]] LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction(https://arxiv.org/abs/2510.22141)
Keywords: self-supervised
Abstract: Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene understanding. We propose LOC, a general language-guided framework adaptable to various occupancy networks, supporting both supervised and self-supervised learning paradigms. For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain comprehensive voxel representations. To mitigate feature over-homogenization caused by direct high-dimensional feature distillation, we introduce Densely Contrastive Learning (DCL). DCL leverages dense voxel semantic information and predefined textual prompts. This efficiently enhances open-set recognition without dense pixel-level supervision, and our framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate the method's superior performance, achieving high-precision predictions for known classes and distinguishing unknown classes without additional training data.

Title: Attention Residual Fusion Network with Contrast for Source-free Domain Adaptation

Authors: Renrong Shao, Wei Zhang, Jun Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22142
Pdf URL: https://arxiv.org/pdf/2510.22142
Copy Paste: [[2510.22142]] Attention Residual Fusion Network with Contrast for Source-free Domain Adaptation(https://arxiv.org/abs/2510.22142)
Keywords: self-supervised
Abstract: Source-free domain adaptation (SFDA) involves training a model on source domain and then applying it to a related target domain without access to the source data and labels during adaptation. The complexity of scene information and lack of the source domain make SFDA a difficult task. Recent studies have shown promising results, but many approaches to domain adaptation concentrate on domain shift and neglect the effects of negative transfer, which may impede enhancements of model performance during adaptation. n this paper, addressing this issue, we propose a novel framework of Attention Residual Fusion Network (ARFNet) based on contrast learning for SFDA to alleviate negative transfer and domain shift during the progress of adaptation, in which attention residual fusion, global-local attention contrast, and dynamic centroid evaluation are exploited. Concretely, the attention mechanism is first exploited to capture the discriminative region of the target object. Then, in each block, attention features are decomposed into spatial-wise and channel-wise attentions to achieve the cross-layer attention residual fusion progressively and self-distillation. During adaptation progress, we contrast global and local representations to improve the perceptual capabilities of different categories, which enables the model to discriminate variations between inner-class and intra-class. Finally, a dynamic centroid evaluation strategy is exploited to evaluate the trustworthy centroids and labels for self-supervised self-distillation, which aims to accurately approximate the center of the source domain and pseudo-labels to mitigate domain shift. To validate the efficacy, we execute comprehensive experiments on five benchmarks of varying scales. Experimental outcomes indicate that our method surpasses other techniques, attaining superior performance across SFDA benchmarks.

Title: I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions

Authors: Shuhong Liu, Lin Gu, Ziteng Cui, Xuangeng Chu, Tatsuya Harada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22161
Pdf URL: https://arxiv.org/pdf/2510.22161
Copy Paste: [[2510.22161]] I2-NeRF: Learning Neural Radiance Fields Under Physically-Grounded Media Interactions(https://arxiv.org/abs/2510.22161)
Keywords: generative
Abstract: Participating in efforts to endow generative AI with the 3D physical world perception, we propose I2-NeRF, a novel neural radiance field framework that enhances isometric and isotropic metric perception under media degradation. While existing NeRF models predominantly rely on object-centric sampling, I2-NeRF introduces a reverse-stratified upsampling strategy to achieve near-uniform sampling across 3D space, thereby preserving isometry. We further present a general radiative formulation for media degradation that unifies emission, absorption, and scattering into a particle model governed by the Beer-Lambert attenuation law. By composing the direct and media-induced in-scatter radiance, this formulation extends naturally to complex media environments such as underwater, haze, and even low-light scenes. By treating light propagation uniformly in both vertical and horizontal directions, I2-NeRF enables isotropic metric perception and can even estimate medium properties such as water depth. Experiments on real-world datasets demonstrate that our method significantly improves both reconstruction fidelity and physical plausibility compared to existing approaches.

Title: TPPR: APT Tactic / Technique Pattern Guided Attack Path Reasoning for Attack Investigation

Authors: Qi Sheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2510.22191
Pdf URL: https://arxiv.org/pdf/2510.22191
Copy Paste: [[2510.22191]] TPPR: APT Tactic / Technique Pattern Guided Attack Path Reasoning for Attack Investigation(https://arxiv.org/abs/2510.22191)
Keywords: anomaly
Abstract: Provenance analysis based on system audit data has emerged as a fundamental approach for investigating Advanced Persistent Threat (APT) attacks. Due to the high concealment and long-term persistence of APT attacks, they are only represented as a minimal part of the critical path in the provenance graph. While existing techniques employ behavioral pattern matching and data flow feature matching to uncover latent associations in attack sequences through provenance graph path reasoning, their inability to establish effective attack context associations often leads to the conflation of benign system operations with real attack entities, that fail to accurately characterize real APT behaviors. We observe that while the causality of entities in the provenance graph exhibit substantial complexity, attackers often follow specific attack patterns-specifically, clear combinations of tactics and techniques to achieve their goals. Based on these insights, we propose TPPR, a novel framework that first extracts anomaly subgraphs through abnormal node detection, TTP-annotation and graph pruning, then performs attack path reasoning using mined TTP sequential pattern, and finally reconstructs attack scenarios through confidence-based path scoring and merging. Extensive evaluation on real enterprise logs (more than 100 million events) and DARPA TC dataset demonstrates TPPR's capability to achieve 99.9% graph simplification (700,000 to 20 edges) while preserving 91% of critical attack nodes, outperforming state-of-the-art solutions (SPARSE, DepImpact) by 63.1% and 67.9% in reconstruction precision while maintaining attack scenario integrity.

Title: Scaling Non-Parametric Sampling with Representation

Authors: Vincent Lu, Aaron Truong, Zeyu Yun, Yubei Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22196
Pdf URL: https://arxiv.org/pdf/2510.22196
Copy Paste: [[2510.22196]] Scaling Non-Parametric Sampling with Representation(https://arxiv.org/abs/2510.22196)
Keywords: generative
Abstract: Scaling and architectural advances have produced strikingly photorealistic image generative models, yet their mechanisms still remain opaque. Rather than advancing scaling, our goal is to strip away complicated engineering tricks and propose a simple, non-parametric generative model. Our design is grounded in three principles of natural images-(i) spatial non-stationarity, (ii) low-level regularities, and (iii) high-level semantics-and defines each pixel's distribution from its local context window. Despite its minimal architecture and no training, the model produces high-fidelity samples on MNIST and visually compelling CIFAR-10 images. This combination of simplicity and strong empirical performance points toward a minimal theory of natural-image structure. The model's white-box nature also allows us to have a mechanistic understanding of how the model generalizes and generates diverse images. We study it by tracing each generated pixel back to its source images. These analyses reveal a simple, compositional procedure for "part-whole generalization", suggesting a hypothesis for how large neural network generative models learn to generalize.

Title: LongCat-Video Technical Report

Authors: Meituan LongCat Team: Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22200
Pdf URL: https://arxiv.org/pdf/2510.22200
Copy Paste: [[2510.22200]] LongCat-Video Technical Report(https://arxiv.org/abs/2510.22200)
Keywords: diffusion
Abstract: Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.

Title: Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation

Authors: Jeongin Kim, Wonho Bae, YouLee Han, Giyeong Oh, Youngjae Yu, Danica J. Sutherland, Junhyug Noh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22229
Pdf URL: https://arxiv.org/pdf/2510.22229
Copy Paste: [[2510.22229]] Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation(https://arxiv.org/abs/2510.22229)
Keywords: diffusion
Abstract: Semantic segmentation demands dense pixel-level annotations, which can be prohibitively expensive - especially under extremely constrained labeling budgets. In this paper, we address the problem of low-budget active learning for semantic segmentation by proposing a novel two-stage selection pipeline. Our approach leverages a pre-trained diffusion model to extract rich multi-scale features that capture both global structure and fine details. In the first stage, we perform a hierarchical, representation-based candidate selection by first choosing a small subset of representative pixels per image using MaxHerding, and then refining these into a diverse global pool. In the second stage, we compute an entropy-augmented disagreement score (eDALD) over noisy multi-scale diffusion features to capture both epistemic uncertainty and prediction confidence, selecting the most informative pixels for annotation. This decoupling of diversity and uncertainty lets us achieve high segmentation accuracy with only a tiny fraction of labeled pixels. Extensive experiments on four benchmarks (CamVid, ADE-Bed, Cityscapes, and Pascal-Context) demonstrate that our method significantly outperforms existing baselines under extreme pixel-budget regimes. Our code is available at this https URL.

Title: DiffusionLane: Diffusion Model for Lane Detection

Authors: Kunyang Zhou, Yeqin Shao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22236
Pdf URL: https://arxiv.org/pdf/2510.22236
Copy Paste: [[2510.22236]] DiffusionLane: Diffusion Model for Lane Detection(https://arxiv.org/abs/2510.22236)
Keywords: diffusion
Abstract: In this paper, we present a novel diffusion-based model for lane detection, called DiffusionLane, which treats the lane detection task as a denoising diffusion process in the parameter space of the lane. Firstly, we add the Gaussian noise to the parameters (the starting point and the angle) of ground truth lanes to obtain noisy lane anchors, and the model learns to refine the noisy lane anchors in a progressive way to obtain the target lanes. Secondly, we propose a hybrid decoding strategy to address the poor feature representation of the encoder, resulting from the noisy lane anchors. Specifically, we design a hybrid diffusion decoder to combine global-level and local-level decoders for high-quality lane anchors. Then, to improve the feature representation of the encoder, we employ an auxiliary head in the training stage to adopt the learnable lane anchors for enriching the supervision on the encoder. Experimental results on four benchmarks, Carlane, Tusimple, CULane, and LLAMAS, show that DiffusionLane possesses a strong generalization ability and promising detection performance compared to the previous state-of-the-art methods. For example, DiffusionLane with ResNet18 surpasses the existing methods by at least 1\% accuracy on the domain adaptation dataset Carlane. Besides, DiffusionLane with MobileNetV4 gets 81.32\% F1 score on CULane, 96.89\% accuracy on Tusimple with ResNet34, and 97.59\% F1 score on LLAMAS with ResNet101. Code will be available at this https URL.

Title: LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis

Authors: Berkay Döner, Thorir Mar Ingolfsson, Luca Benini, Yawei Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22257
Pdf URL: https://arxiv.org/pdf/2510.22257
Copy Paste: [[2510.22257]] LUNA: Efficient and Topology-Agnostic Foundation Model for EEG Signal Analysis(https://arxiv.org/abs/2510.22257)
Keywords: self-supervised, foundation model
Abstract: Electroencephalography (EEG) offers a non-invasive lens into human brain activity, but building large-scale models is hampered by topological heterogeneity: each public EEG data defines its own electrode layout, limiting generalization. We introduce LUNA (Latent Unified Network Architecture), a self-supervised foundation model that reconciles disparate electrode geometries while scaling linearly -- not quadratically -- with channel count. LUNA compresses multi-channel EEG into a fixed-size, topology-agnostic latent space via learned queries and cross-attention. Downstream transformer blocks then operate exclusively on this latent representation using patch-wise temporal self-attention, decoupling computation from electrode count. Pre-trained on TUEG and Siena (over 21,000 hours of raw EEG across diverse montages) using a masked-patch reconstruction objective, LUNA transfers effectively to four downstream tasks: abnormality detection, artifact rejection, slowing classification, and emotion recognition. It demonstrates highly competitive performance across several benchmarks, achieving state-of-the-art results on TUAR and TUSL, e.g., 0.921 AUROC on TUAR, while reducing FLOPs by 300x and trimming GPU memory use by up to 10x. Critically, these gains are consistent across all evaluated electrode configurations. Code is available at this https URL

Title: Adapting Noise-Driven PUF and AI for Secure WBG ICS: A Proof-of-Concept Study

Authors: Devon A. Kelly, Christiana Chamon
Subjects: cs.CR, cs.LG, eess.SY, physics.app-ph
Abstract URL: https://arxiv.org/abs/2510.22283
Pdf URL: https://arxiv.org/pdf/2510.22283
Copy Paste: [[2510.22283]] Adapting Noise-Driven PUF and AI for Secure WBG ICS: A Proof-of-Concept Study(https://arxiv.org/abs/2510.22283)
Keywords: anomaly
Abstract: Wide-bandgap (WBG) technologies offer unprecedented improvements in power system efficiency, size, and performance, but also introduce unique sensor corruption and cybersecurity risks in industrial control systems (ICS), particularly due to high-frequency noise and sophisticated cyber-physical threats. This proof-of-concept (PoC) study demonstrates the adaptation of a noise-driven physically unclonable function (PUF) and machine learning (ML)-assisted anomaly detection framework to the demanding environment of WBG-based ICS sensor pathways. By extracting entropy from unavoidable WBG switching noise (up to 100 kHz) as a PUF source, and simultaneously using this noise as a real-time threat indicator, the proposed system unites hardware-level authentication and anomaly detection. Our approach integrates hybrid machine learning (ML) models with adaptive Bayesian filtering, providing robust and low-latency detection capabilities resilient to both natural electromagnetic interference (EMI) and active adversarial manipulation. Through detailed simulations of WBG modules under benign and attack scenarios--including EMI injection, signal tampering, and node impersonation--we achieve 95% detection accuracy and sub-millisecond processing latency. These results demonstrate the feasibility of physics-driven, dual-use noise exploitation as a scalable ICS defense primitive. Our findings lay the groundwork for next-generation security strategies that leverage inherent device characteristics, bridging hardware and artificial intelligence (AI) for enhanced protection of critical ICS infrastructure.

Title: Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER

Authors: Andrei Baroian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22285
Pdf URL: https://arxiv.org/pdf/2510.22285
Copy Paste: [[2510.22285]] Supervised Fine-Tuning or In-Context Learning? Evaluating LLMs for Clinical NER(https://arxiv.org/abs/2510.22285)
Keywords: in-context
Abstract: We study clinical Named Entity Recognition (NER) on the CADEC corpus and compare three families of approaches: (i) BERT-style encoders (BERT Base, BioClinicalBERT, RoBERTa-large), (ii) GPT-4o used with few-shot in-context learning (ICL) under simple vs.\ complex prompts, and (iii) GPT-4o with supervised fine-tuning (SFT). All models are evaluated on standard NER metrics over CADEC's five entity types (ADR, Drug, Disease, Symptom, Finding). RoBERTa-large and BioClinicalBERT offer limited improvements over BERT Base, showing the limit of these family of models. Among LLM settings, simple ICL outperforms a longer, instruction-heavy prompt, and SFT achieves the strongest overall performance (F1 $\approx$ 87.1%), albeit with higher cost. We find that the LLM achieve higher accuracy on simplified tasks, restricting classification to two labels.

Title: AnyECG-Lab: An Exploration Study of Fine-tuning an ECG Foundation Model to Estimate Laboratory Values from Single-Lead ECG Signals

Authors: Yujie Xiao, Gongzhen Tang, Wenhui Liu, Jun Li, Guangkun Nie, Zhuoran Kan, Deyun Zhang, Qinghao Zhao, Shenda Hong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22301
Pdf URL: https://arxiv.org/pdf/2510.22301
Copy Paste: [[2510.22301]] AnyECG-Lab: An Exploration Study of Fine-tuning an ECG Foundation Model to Estimate Laboratory Values from Single-Lead ECG Signals(https://arxiv.org/abs/2510.22301)
Keywords: foundation model
Abstract: Timely access to laboratory values is critical for clinical decision-making, yet current approaches rely on invasive venous sampling and are intrinsically delayed. Electrocardiography (ECG), as a non-invasive and widely available signal, offers a promising modality for rapid laboratory estimation. Recent progress in deep learning has enabled the extraction of latent hematological signatures from ECGs. However, existing models are constrained by low signal-to-noise ratios, substantial inter-individual variability, limited data diversity, and suboptimal generalization, especially when adapted to low-lead wearable devices. In this work, we conduct an exploratory study leveraging transfer learning to fine-tune ECGFounder, a large-scale pre-trained ECG foundation model, on the Multimodal Clinical Monitoring in the Emergency Department (MC-MED) dataset from Stanford. We generated a corpus of more than 20 million standardized ten-second ECG segments to enhance sensitivity to subtle biochemical correlates. On internal validation, the model demonstrated strong predictive performance (area under the curve above 0.65) for thirty-three laboratory indicators, moderate performance (between 0.55 and 0.65) for fifty-nine indicators, and limited performance (below 0.55) for sixteen indicators. This study provides an efficient artificial-intelligence driven solution and establishes the feasibility scope for real-time, non-invasive estimation of laboratory values.

Title: GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping

Authors: Jing Wang, Jiajun Liang, Jie Liu, Henglin Liu, Gongye Liu, Jun Zheng, Wanyuan Pang, Ao Ma, Zhenyu Xie, Xintao Wang, Meng Wang, Pengfei Wan, Xiaodan Liang
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22319
Pdf URL: https://arxiv.org/pdf/2510.22319
Copy Paste: [[2510.22319]] GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping(https://arxiv.org/abs/2510.22319)
Keywords: diffusion
Abstract: Recently, GRPO-based reinforcement learning has shown remarkable progress in optimizing flow-matching models, effectively improving their alignment with task-specific rewards. Within these frameworks, the policy update relies on importance-ratio clipping to constrain overconfident positive and negative gradients. However, in practice, we observe a systematic shift in the importance-ratio distribution-its mean falls below 1 and its variance differs substantially across timesteps. This left-shifted and inconsistent distribution prevents positive-advantage samples from entering the clipped region, causing the mechanism to fail in constraining overconfident positive updates. As a result, the policy model inevitably enters an implicit over-optimization stage-while the proxy reward continues to increase, essential metrics such as image quality and text-prompt alignment deteriorate sharply, ultimately making the learned policy impractical for real-world use. To address this issue, we introduce GRPO-Guard, a simple yet effective enhancement to existing GRPO frameworks. Our method incorporates ratio normalization, which restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates across denoising timesteps. In addition, a gradient reweighting strategy equalizes policy gradients over noise conditions, preventing excessive updates from particular timestep regions. Together, these designs act as a regulated clipping mechanism, stabilizing optimization and substantially mitigating implicit over-optimization without relying on heavy KL regularization. Extensive experiments on multiple diffusion backbones (e.g., SD3.5M, Flux.1-dev) and diverse proxy tasks demonstrate that GRPO-Guard significantly reduces over-optimization while maintaining or even improving generation quality.

Title: Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning

Authors: Ali Javidani, Babak Nadjar Araabi, Mohammad Amin Sadeghi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22322
Pdf URL: https://arxiv.org/pdf/2510.22322
Copy Paste: [[2510.22322]] Beyond Augmentation: Leveraging Inter-Instance Relation in Self-Supervised Representation Learning(https://arxiv.org/abs/2510.22322)
Keywords: self-supervised
Abstract: This paper introduces a novel approach that integrates graph theory into self-supervised representation learning. Traditional methods focus on intra-instance variations generated by applying augmentations. However, they often overlook important inter-instance relationships. While our method retains the intra-instance property, it further captures inter-instance relationships by constructing k-nearest neighbor (KNN) graphs for both teacher and student streams during pretraining. In these graphs, nodes represent samples along with their latent representations. Edges encode the similarity between instances. Following pretraining, a representation refinement phase is performed. In this phase, Graph Neural Networks (GNNs) propagate messages not only among immediate neighbors but also across multiple hops, thereby enabling broader contextual integration. Experimental results on CIFAR-10, ImageNet-100, and ImageNet-1K demonstrate accuracy improvements of 7.3%, 3.2%, and 1.0%, respectively, over state-of-the-art methods. These results highlight the effectiveness of the proposed graph based mechanism. The code is publicly available at this https URL.

Title: Monitoring State Transitions in Markovian Systems with Sampling Cost

Authors: Kumar Saurav, Ness B. Shroff, Yingbin Liang
Subjects: cs.LG, cs.IT, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22327
Pdf URL: https://arxiv.org/pdf/2510.22327
Copy Paste: [[2510.22327]] Monitoring State Transitions in Markovian Systems with Sampling Cost(https://arxiv.org/abs/2510.22327)
Keywords: foundation model
Abstract: We consider a node-monitor pair, where the node's state varies with time. The monitor needs to track the node's state at all times; however, there is a fixed cost for each state query. So the monitor may instead predict the state using time-series forecasting methods, including time-series foundation models (TSFMs), and query only when prediction uncertainty is high. Since query decisions influence prediction accuracy, determining when to query is nontrivial. A natural approach is a greedy policy that predicts when the expected prediction loss is below the query cost and queries otherwise. We analyze this policy in a Markovian setting, where the optimal (OPT) strategy is a state-dependent threshold policy minimizing the time-averaged sum of query cost and prediction losses. We show that, in general, the greedy policy is suboptimal and can have an unbounded competitive ratio, but under common conditions such as identically distributed transition probabilities, it performs close to OPT. For the case of unknown transition probabilities, we further propose a projected stochastic gradient descent (PSGD)-based learning variant of the greedy policy, which achieves a favorable predict-query tradeoff with improved computational efficiency compared to OPT.

Title: Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

Authors: Xu Zhang, Ruijie Quan, Wenguan Wang, Yi Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22335
Pdf URL: https://arxiv.org/pdf/2510.22335
Copy Paste: [[2510.22335]] Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction(https://arxiv.org/abs/2510.22335)
Keywords: diffusion
Abstract: Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single high-level embedding, using it as fixed guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively-aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67x faster inference, and more deterministic results than the diffusion-based baselines.

Title: GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation

Authors: Phillip Mueller, Talip Uenlue, Sebastian Schmidt, Marcel Kollovieh, Jiajie Fan, Stephan Guennemann, Lars Mikelsons
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22337
Pdf URL: https://arxiv.org/pdf/2510.22337
Copy Paste: [[2510.22337]] GeoDiffusion: A Training-Free Framework for Accurate 3D Geometric Conditioning in Image Generation(https://arxiv.org/abs/2510.22337)
Keywords: diffusion, generative
Abstract: Precise geometric control in image generation is essential for engineering \& product design and creative industries to control 3D object features accurately in image space. Traditional 3D editing approaches are time-consuming and demand specialized skills, while current image-based generative methods lack accuracy in geometric conditioning. To address these challenges, we propose GeoDiffusion, a training-free framework for accurate and efficient geometric conditioning of 3D features in image generation. GeoDiffusion employs a class-specific 3D object as a geometric prior to define keypoints and parametric correlations in 3D space. We ensure viewpoint consistency through a rendered image of a reference 3D object, followed by style transfer to meet user-defined appearance specifications. At the core of our framework is GeoDrag, improving accuracy and speed of drag-based image editing on geometry guidance tasks and general instructions on DragBench. Our results demonstrate that GeoDiffusion enables precise geometric modifications across various iterative design workflows.

Title: EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model

Authors: Changhao Zhang, Matthew J. Clarkson, Mobarak I. Hoque
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22359
Pdf URL: https://arxiv.org/pdf/2510.22359
Copy Paste: [[2510.22359]] EndoSfM3D: Learning to 3D Reconstruct Any Endoscopic Surgery Scene using Self-supervised Foundation Model(https://arxiv.org/abs/2510.22359)
Keywords: self-supervised, foundation model
Abstract: 3D reconstruction of endoscopic surgery scenes plays a vital role in enhancing scene perception, enabling AR visualization, and supporting context-aware decision-making in image-guided surgery. A critical yet challenging step in this process is the accurate estimation of the endoscope's intrinsic parameters. In real surgical settings, intrinsic calibration is hindered by sterility constraints and the use of specialized endoscopes with continuous zoom and telescope rotation. Most existing methods for endoscopic 3D reconstruction do not estimate intrinsic parameters, limiting their effectiveness for accurate and reliable reconstruction. In this paper, we integrate intrinsic parameter estimation into a self-supervised monocular depth estimation framework by adapting the Depth Anything V2 (DA2) model for joint depth, pose, and intrinsics prediction. We introduce an attention-based pose network and a Weight-Decomposed Low-Rank Adaptation (DoRA) strategy for efficient fine-tuning of DA2. Our method is validated on the SCARED and C3VD public datasets, demonstrating superior performance compared to recent state-of-the-art approaches in self-supervised monocular depth estimation and 3D reconstruction. Code and model weights can be found in project repository: this https URL.

Title: T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models

Authors: Jindong Yang, Han Fang, Weiming Zhang, Nenghai Yu, Kejiang Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22366
Pdf URL: https://arxiv.org/pdf/2510.22366
Copy Paste: [[2510.22366]] T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models(https://arxiv.org/abs/2510.22366)
Keywords: diffusion, generative
Abstract: Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking for diffusion models, particularly Noise-as-Watermark (NaW) methods, encode watermark as specific standard Gaussian noise vector for image generation, embedding the infomation seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity. Our code is available at \href{this https URL}{this https URL}.

Title: Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

Authors: Vijay Veerabadran, Fanyi Xiao, Nitin Kamra, Pedro Matias, Joy Chen, Caley Drooff, Brett D Roads, Riley Williams, Ethan Henderson, Xuanyi Zhao, Kevin Carlberg, Joseph Tighe, Karl Ridgeway
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22443
Pdf URL: https://arxiv.org/pdf/2510.22443
Copy Paste: [[2510.22443]] Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents(https://arxiv.org/abs/2510.22443)
Keywords: generative
Abstract: There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) who take assistive actions toward a user's goal/query (e.g. "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.

Title: DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss

Authors: Jing Yang, Yufeng Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22473
Pdf URL: https://arxiv.org/pdf/2510.22473
Copy Paste: [[2510.22473]] DynaPose4D: High-Quality 4D Dynamic Content Generation via Pose Alignment Loss(https://arxiv.org/abs/2510.22473)
Keywords: generative
Abstract: Recent advancements in 2D and 3D generative models have expanded the capabilities of computer vision. However, generating high-quality 4D dynamic content from a single static image remains a significant challenge. Traditional methods have limitations in modeling temporal dependencies and accurately capturing dynamic geometry changes, especially when considering variations in camera perspective. To address this issue, we propose DynaPose4D, an innovative solution that integrates 4D Gaussian Splatting (4DGS) techniques with Category-Agnostic Pose Estimation (CAPE) technology. This framework uses 3D Gaussian Splatting to construct a 3D model from single images, then predicts multi-view pose keypoints based on one-shot support from a chosen view, leveraging supervisory signals to enhance motion consistency. Experimental results show that DynaPose4D achieves excellent coherence, consistency, and fluidity in dynamic motion generation. These findings not only validate the efficacy of the DynaPose4D framework but also indicate its potential applications in the domains of computer vision and animation production.

Title: Accelerating Materials Design via LLM-Guided Evolutionary Search

Authors: Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, cs.NE
Abstract URL: https://arxiv.org/abs/2510.22503
Pdf URL: https://arxiv.org/pdf/2510.22503
Copy Paste: [[2510.22503]] Accelerating Materials Design via LLM-Guided Evolutionary Search(https://arxiv.org/abs/2510.22503)
Keywords: generative
Abstract: Materials discovery requires navigating vast chemical and structural spaces while satisfying multiple, often conflicting, objectives. We present LLM-guided Evolution for MAterials design (LLEMA), a unified framework that couples the scientific knowledge embedded in large language models with chemistry-informed evolutionary rules and memory-based refinement. At each iteration, an LLM proposes crystallographically specified candidates under explicit property constraints; a surrogate-augmented oracle estimates physicochemical properties; and a multi-objective scorer updates success/failure memories to guide subsequent generations. Evaluated on 14 realistic tasks spanning electronics, energy, coatings, optics, and aerospace, LLEMA discovers candidates that are chemically plausible, thermodynamically stable, and property-aligned, achieving higher hit-rates and stronger Pareto fronts than generative and LLM-only baselines. Ablation studies confirm the importance of rule-guided generation, memory-based refinement, and surrogate prediction. By enforcing synthesizability and multi-objective trade-offs, LLEMA delivers a principled pathway to accelerate practical materials discovery. Code: this https URL

Title: CANDI: Hybrid Discrete-Continuous Diffusion Models

Authors: Patrick Pynadath, Jiaxin Shi, Ruqi Zhang
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22510
Pdf URL: https://arxiv.org/pdf/2510.22510
Copy Paste: [[2510.22510]] CANDI: Hybrid Discrete-Continuous Diffusion Models(https://arxiv.org/abs/2510.22510)
Keywords: diffusion
Abstract: While continuous diffusion has shown remarkable success in continuous domains such as image generation, its direct application to discrete data has underperformed compared to purely discrete formulations. This gap is counterintuitive, given that continuous diffusion learns score functions that enable joint evolution across multiple positions. To understand this gap, we introduce token identifiability as an analytical framework for understanding how Gaussian noise corrupts discrete data through two mechanisms: discrete identity corruption and continuous rank degradation. We reveal that these mechanisms scale differently with vocabulary size, creating a temporal dissonance: at noise levels where discrete corruption preserves enough structure for conditional learning, continuous denoising is trivial; at noise levels where continuous denoising is meaningful, discrete corruption destroys nearly all conditional structure. To solve this, we propose CANDI (Continuous ANd DIscrete diffusion), a hybrid framework that decouples discrete and continuous corruption, enabling simultaneous learning of both conditional structure and continuous geometry. We empirically validate the temporal dissonance phenomenon and demonstrate that CANDI successfully avoids it. This unlocks the benefits of continuous diffusion for discrete spaces: on controlled generation, CANDI enables classifier-based guidance with off-the-shelf classifiers through simple gradient addition; on text generation, CANDI outperforms masked diffusion at low NFE, demonstrating the value of learning continuous gradients for discrete spaces.

Title: SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning

Authors: Chen Chen, Majid Abdolshah, Violetta Shevchenko, Hongdong Li, Chang Xu, Pulak Purkait
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22534
Pdf URL: https://arxiv.org/pdf/2510.22534
Copy Paste: [[2510.22534]] SRSR: Enhancing Semantic Accuracy in Real-World Image Super-Resolution with Spatially Re-Focused Text-Conditioning(https://arxiv.org/abs/2510.22534)
Keywords: diffusion
Abstract: Existing diffusion-based super-resolution approaches often exhibit semantic ambiguities due to inaccuracies and incompleteness in their text conditioning, coupled with the inherent tendency for cross-attention to divert towards irrelevant pixels. These limitations can lead to semantic misalignment and hallucinated details in the generated high-resolution outputs. To address these, we propose a novel, plug-and-play spatially re-focused super-resolution (SRSR) framework that consists of two core components: first, we introduce Spatially Re-focused Cross-Attention (SRCA), which refines text conditioning at inference time by applying visually-grounded segmentation masks to guide cross-attention. Second, we introduce a Spatially Targeted Classifier-Free Guidance (STCFG) mechanism that selectively bypasses text influences on ungrounded pixels to prevent hallucinations. Extensive experiments on both synthetic and real-world datasets demonstrate that SRSR consistently outperforms seven state-of-the-art baselines in standard fidelity metrics (PSNR and SSIM) across all datasets, and in perceptual quality measures (LPIPS and DISTS) on two real-world benchmarks, underscoring its effectiveness in achieving both high semantic fidelity and perceptual quality in super-resolution.

Title: FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning

Authors: Yuyang Ding, Chi Zhang, Juntao Li, Haibin Lin, Xin Liu, Min Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22543
Pdf URL: https://arxiv.org/pdf/2510.22543
Copy Paste: [[2510.22543]] FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning(https://arxiv.org/abs/2510.22543)
Keywords: generative
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models (LLMs). In this context, models explore reasoning trajectories and exploit rollouts with correct answers as positive signals for policy optimization. However, these rollouts might involve flawed patterns such as answer-guessing and jump-in-reasoning. Such flawed-positive rollouts are rewarded identically to fully correct ones, causing policy models to internalize these unreliable reasoning patterns. In this work, we first conduct a systematic study of flawed-positive rollouts in RL and find that they enable rapid capability gains during the early optimization stage, while constraining reasoning capability later by reinforcing unreliable patterns. Building on these insights, we propose Flawed-Aware Policy Optimization (FAPO), which presents a parameter-free reward penalty for flawed-positive rollouts, enabling the policy to leverage them as useful shortcuts in the warm-up stage, securing stable early gains, while gradually shifting optimization toward reliable reasoning in the later refinement stage. To accurately and comprehensively detect flawed-positive rollouts, we introduce a generative reward model (GenRM) with a process-level reward that precisely localizes reasoning errors. Experiments show that FAPO is effective in broad domains, improving outcome correctness, process reliability, and training stability without increasing the token budget.

Title: DDTR: Diffusion Denoising Trace Recovery

Authors: Maximilian Matyash, Avigdor Gal, Arik Senderovich
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22553
Pdf URL: https://arxiv.org/pdf/2510.22553
Copy Paste: [[2510.22553]] DDTR: Diffusion Denoising Trace Recovery(https://arxiv.org/abs/2510.22553)
Keywords: diffusion
Abstract: With recent technological advances, process logs, which were traditionally deterministic in nature, are being captured from non-deterministic sources, such as uncertain sensors or machine learning models (that predict activities using cameras). In the presence of stochastically-known logs, logs that contain probabilistic information, the need for stochastic trace recovery increases, to offer reliable means of understanding the processes that govern such systems. We design a novel deep learning approach for stochastic trace recovery, based on Diffusion Denoising Probabilistic Models (DDPM), which makes use of process knowledge (either implicitly by discovering a model or explicitly by injecting process knowledge in the training phase) to recover traces by denoising. We conduct an empirical evaluation demonstrating state-of-the-art performance with up to a 25% improvement over existing methods, along with increased robustness under high noise levels.

Title: From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy

Authors: Feng He, Guodong Tan, Qiankun Li, Jun Yu, Quan Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22577
Pdf URL: https://arxiv.org/pdf/2510.22577
Copy Paste: [[2510.22577]] From Pixels to Views: Learning Angular-Aware and Physics-Consistent Representations for Light Field Microscopy(https://arxiv.org/abs/2510.22577)
Keywords: self-supervised
Abstract: Light field microscopy (LFM) has become an emerging tool in neuroscience for large-scale neural imaging in vivo, notable for its single-exposure volumetric imaging, broad field of view, and high temporal resolution. However, learning-based 3D reconstruction in XLFM remains underdeveloped due to two core challenges: the absence of standardized datasets and the lack of methods that can efficiently model its angular-spatial structure while remaining physically grounded. We address these challenges by introducing three key contributions. First, we construct the XLFM-Zebrafish benchmark, a large-scale dataset and evaluation suite for XLFM reconstruction. Second, we propose Masked View Modeling for Light Fields (MVN-LF), a self-supervised task that learns angular priors by predicting occluded views, improving data efficiency. Third, we formulate the Optical Rendering Consistency Loss (ORC Loss), a differentiable rendering constraint that enforces alignment between predicted volumes and their PSF-based forward projections. On the XLFM-Zebrafish benchmark, our method improves PSNR by 7.7% over state-of-the-art baselines.

Title: Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems

Authors: Kaushal Kumar Maurya, Ekaterina Kochmar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22581
Pdf URL: https://arxiv.org/pdf/2510.22581
Copy Paste: [[2510.22581]] Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems(https://arxiv.org/abs/2510.22581)
Keywords: generative
Abstract: The interdisciplinary research domain of Artificial Intelligence in Education (AIED) has a long history of developing Intelligent Tutoring Systems (ITSs) by integrating insights from technological advancements, educational theories, and cognitive psychology. The remarkable success of generative AI (GenAI) models has accelerated the development of large language model (LLM)-powered ITSs, which have potential to imitate human-like, pedagogically rich, and cognitively demanding tutoring. However, the progress and impact of these systems remain largely untraceable due to the absence of reliable, universally accepted, and pedagogy-driven evaluation frameworks and benchmarks. Most existing educational dialogue-based ITS evaluations rely on subjective protocols and non-standardized benchmarks, leading to inconsistencies and limited generalizability. In this work, we take a step back from mainstream ITS development and provide comprehensive state-of-the-art evaluation practices, highlighting associated challenges through real-world case studies from careful and caring AIED research. Finally, building on insights from previous interdisciplinary AIED research, we propose three practical, feasible, and theoretically grounded research directions, rooted in learning science principles and aimed at establishing fair, unified, and scalable evaluation methodologies for ITSs.

Title: PSScreen V2: Partially Supervised Multiple Retinal Disease Screening

Authors: Boyi Zheng, Yalin Zheng, Hrvoje Bogunović, Qing Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22589
Pdf URL: https://arxiv.org/pdf/2510.22589
Copy Paste: [[2510.22589]] PSScreen V2: Partially Supervised Multiple Retinal Disease Screening(https://arxiv.org/abs/2510.22589)
Keywords: foundation model
Abstract: In this work, we propose PSScreen V2, a partially supervised self-training framework for multiple retinal disease screening. Unlike previous methods that rely on fully labelled or single-domain datasets, PSScreen V2 is designed to learn from multiple partially labelled datasets with different distributions, addressing both label absence and domain shift challenges. To this end, PSScreen V2 adopts a three-branch architecture with one teacher and two student networks. The teacher branch generates pseudo labels from weakly augmented images to address missing labels, while the two student branches introduce novel feature augmentation strategies: Low-Frequency Dropout (LF-Dropout), which enhances domain robustness by randomly discarding domain-related low-frequency components, and Low-Frequency Uncertainty (LF-Uncert), which estimates uncertain domain variability via adversarially learned Gaussian perturbations of low-frequency statistics. Extensive experiments on multiple in-domain and out-of-domain fundus datasets demonstrate that PSScreen V2 achieves state-of-the-art performance and superior domain generalization ability. Furthermore, compatibility tests with diverse backbones, including the vision foundation model DINOv2, as well as evaluations on chest X-ray datasets, highlight the universality and adaptability of the proposed framework. The codes are available at this https URL.

Title: Projection Embedded Diffusion Bridge for CT Reconstruction from Incomplete Data

Authors: Yuang Wang, Pengfei Jin, Siyeop Yoon, Matthew Tivnan, Shaoyang Zhang, Li Zhang, Quanzheng Li, Zhiqiang Chen, Dufan Wu
Subjects: cs.CV, physics.med-ph
Abstract URL: https://arxiv.org/abs/2510.22605
Pdf URL: https://arxiv.org/pdf/2510.22605
Copy Paste: [[2510.22605]] Projection Embedded Diffusion Bridge for CT Reconstruction from Incomplete Data(https://arxiv.org/abs/2510.22605)
Keywords: diffusion
Abstract: Reconstructing CT images from incomplete projection data remains challenging due to the ill-posed nature of the problem. Diffusion bridge models have recently shown promise in restoring clean images from their corresponding Filtered Back Projection (FBP) reconstructions, but incorporating data consistency into these models remains largely underexplored. Incorporating data consistency can improve reconstruction fidelity by aligning the reconstructed image with the observed projection data, and can enhance detail recovery by integrating structural information contained in the projections. In this work, we propose the Projection Embedded Diffusion Bridge (PEDB). PEDB introduces a novel reverse stochastic differential equation (SDE) to sample from the distribution of clean images conditioned on both the FBP reconstruction and the incomplete projection data. By explicitly conditioning on the projection data in sampling the clean images, PEDB naturally incorporates data consistency. We embed the projection data into the score function of the reverse SDE. Under certain assumptions, we derive a tractable expression for the posterior score. In addition, we introduce a free parameter to control the level of stochasticity in the reverse process. We also design a discretization scheme for the reverse SDE to mitigate discretization error. Extensive experiments demonstrate that PEDB achieves strong performance in CT reconstruction from three types of incomplete data, including sparse-view, limited-angle, and truncated projections. For each of these types, PEDB outperforms evaluated state-of-the-art diffusion bridge models across standard, noisy, and domain-shift evaluations.

Title: SWAN: Self-supervised Wavelet Neural Network for Hyperspectral Image Unmixing

Authors: Yassh Ramchandani, Vijayashekhar S S, Jignesh S. Bhatt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22607
Pdf URL: https://arxiv.org/pdf/2510.22607
Copy Paste: [[2510.22607]] SWAN: Self-supervised Wavelet Neural Network for Hyperspectral Image Unmixing(https://arxiv.org/abs/2510.22607)
Keywords: self-supervised
Abstract: In this article, we present SWAN: a three-stage, self-supervised wavelet neural network for joint estimation of endmembers and abundances from hyperspectral imagery. The contiguous and overlapping hyperspectral band images are first expanded to Biorthogonal wavelet basis space that provides sparse, distributed, and multi-scale representations. The idea is to exploit latent symmetries from thus obtained invariant and covariant features using a self-supervised learning paradigm. The first stage, SWANencoder maps the input wavelet coefficients to a compact lower-dimensional latent space. The second stage, SWANdecoder uses the derived latent representation to reconstruct the input wavelet coefficients. Interestingly, the third stage SWANforward learns the underlying physics of the hyperspectral image. A three-stage combined loss function is formulated in the image acquisition domain that eliminates the need for ground truth and enables self-supervised training. Adam is employed for optimizing the proposed loss function, while Sigmoid with a dropout of 0.3 is incorporated to avoid possible overfitting. Kernel regularizers bound the magnitudes and preserve spatial variations in the estimated endmember coefficients. The output of SWANencoder represents estimated abundance maps during inference, while weights of SWANdecoder are retrieved to extract endmembers. Experiments are conducted on two benchmark synthetic data sets with different signal-to-noise ratios as well as on three real benchmark hyperspectral data sets while comparing the results with several state-of-the-art neural network-based unmixing methods. The qualitative, quantitative, and ablation results show performance enhancement by learning a resilient unmixing function as well as promoting self-supervision and compact network parameters for practical applications.

Title: CLEANet: Robust and Efficient Anomaly Detection in Contaminated Multivariate Time Series

Authors: Songhan Zhang, Yuanhao Lai, Pengfei Zheng, Boxi Yu, Xiaoying Tang, Qiuai Fu, Pinjia He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22619
Pdf URL: https://arxiv.org/pdf/2510.22619
Copy Paste: [[2510.22619]] CLEANet: Robust and Efficient Anomaly Detection in Contaminated Multivariate Time Series(https://arxiv.org/abs/2510.22619)
Keywords: anomaly
Abstract: Multivariate time series (MTS) anomaly detection is essential for maintaining the reliability of industrial systems, yet real-world deployment is hindered by two critical challenges: training data contamination (noises and hidden anomalies) and inefficient model inference. Existing unsupervised methods assume clean training data, but contamination distorts learned patterns and degrades detection accuracy. Meanwhile, complex deep models often overfit to contamination and suffer from high latency, limiting practical use. To address these challenges, we propose CLEANet, a robust and efficient anomaly detection framework in contaminated multivariate time series. CLEANet introduces a Contamination-Resilient Training Framework (CRTF) that mitigates the impact of corrupted samples through an adaptive reconstruction weighting strategy combined with clustering-guided contrastive learning, thereby enhancing robustness. To further avoid overfitting on contaminated data and improve computational efficiency, we design a lightweight conjugate MLP that disentangles temporal and cross-feature dependencies. Across five public datasets, CLEANet achieves up to 73.04% higher F1 and 81.28% lower runtime compared with ten state-of-the-art baselines. Furthermore, integrating CRTF into three advanced models yields an average 5.35% F1 gain, confirming its strong generalizability.

Title: DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection

Authors: Kangran Zhao, Yupeng Chen, Xiaoyu Zhang, Yize Chen, Weinan Guan, Baicheng Chen, Chengzhe Sun, Soumyya Kanti Datta, Qingshan Liu, Siwei Lyu, Baoyuan Wu
Subjects: cs.CR, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2510.22622
Pdf URL: https://arxiv.org/pdf/2510.22622
Copy Paste: [[2510.22622]] DeepfakeBench-MM: A Comprehensive Benchmark for Multimodal Deepfake Detection(https://arxiv.org/abs/2510.22622)
Keywords: generative
Abstract: The misuse of advanced generative AI models has resulted in the widespread proliferation of falsified data, particularly forged human-centric audiovisual content, which poses substantial societal risks (e.g., financial fraud and social instability). In response to this growing threat, several works have preliminarily explored countermeasures. However, the lack of sufficient and diverse training data, along with the absence of a standardized benchmark, hinder deeper exploration. To address this challenge, we first build Mega-MMDF, a large-scale, diverse, and high-quality dataset for multimodal deepfake detection. Specifically, we employ 21 forgery pipelines through the combination of 10 audio forgery methods, 12 visual forgery methods, and 6 audio-driven face reenactment methods. Mega-MMDF currently contains 0.1 million real samples and 1.1 million forged samples, making it one of the largest and most diverse multimodal deepfake datasets, with plans for continuous expansion. Building on it, we present DeepfakeBench-MM, the first unified benchmark for multimodal deepfake detection. It establishes standardized protocols across the entire detection pipeline and serves as a versatile platform for evaluating existing methods as well as exploring novel approaches. DeepfakeBench-MM currently supports 5 datasets and 11 multimodal deepfake detectors. Furthermore, our comprehensive evaluations and in-depth analyses uncover several key findings from multiple perspectives (e.g., augmentation, stacked forgery). We believe that DeepfakeBench-MM, together with our large-scale Mega-MMDF, will serve as foundational infrastructures for advancing multimodal deepfake detection.

Title: Self-Attention Decomposition For Training Free Diffusion Editing

Authors: Tharun Anand, Mohammad Hassan Vali, Arno Solin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22650
Pdf URL: https://arxiv.org/pdf/2510.22650
Copy Paste: [[2510.22650]] Self-Attention Decomposition For Training Free Diffusion Editing(https://arxiv.org/abs/2510.22650)
Keywords: diffusion
Abstract: Diffusion models achieve remarkable fidelity in image synthesis, yet precise control over their outputs for targeted editing remains challenging. A key step toward controllability is to identify interpretable directions in the model's latent representations that correspond to semantic attributes. Existing approaches for finding interpretable directions typically rely on sampling large sets of images or training auxiliary networks, which limits efficiency. We propose an analytical method that derives semantic editing directions directly from the pretrained parameters of diffusion models, requiring neither additional data nor fine-tuning. Our insight is that self-attention weight matrices encode rich structural information about the data distribution learned during training. By computing the eigenvectors of these weight matrices, we obtain robust and interpretable editing directions. Experiments demonstrate that our method produces high-quality edits across multiple datasets while reducing editing time significantly by 60% over current benchmarks.

Title: Variational Polya Tree

Authors: Lu Xu, Tsai Hor Chan, Kwok Fai Lam, Lequan Yu, Guosheng Yin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22651
Pdf URL: https://arxiv.org/pdf/2510.22651
Copy Paste: [[2510.22651]] Variational Polya Tree(https://arxiv.org/abs/2510.22651)
Keywords: generative
Abstract: Density estimation is essential for generative modeling, particularly with the rise of modern neural networks. While existing methods capture complex data distributions, they often lack interpretability and uncertainty quantification. Bayesian nonparametric methods, especially the \polya tree, offer a robust framework that addresses these issues by accurately capturing function behavior over small intervals. Traditional techniques like Markov chain Monte Carlo (MCMC) face high computational complexity and scalability limitations, hindering the use of Bayesian nonparametric methods in deep learning. To tackle this, we introduce the variational \polya tree (VPT) model, which employs stochastic variational inference to compute posterior distributions. This model provides a flexible, nonparametric Bayesian prior that captures latent densities and works well with stochastic gradient optimization. We also leverage the joint distribution likelihood for a more precise variational posterior approximation than traditional mean-field methods. We evaluate the model performance on both real data and images, and demonstrate its competitiveness with other state-of-the-art deep density estimation methods. We also explore its ability in enhancing interpretability and uncertainty quantification. Code is available at this https URL.

Title: Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections

Authors: Berken Utku Demirel, Christian Holz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22655
Pdf URL: https://arxiv.org/pdf/2510.22655
Copy Paste: [[2510.22655]] Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections(https://arxiv.org/abs/2510.22655)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data. Most SSL approaches rely on strong, well-established, handcrafted data augmentations to generate diverse views for representation learning. However, designing such augmentations requires domain-specific knowledge and implicitly imposes representational invariances on the model, which can limit generalization. In this work, we propose an unsupervised representation learning method that replaces augmentations by generating views using orthonormal bases and overcomplete frames. We show that embeddings learned from orthonormal and overcomplete spaces reside on distinct manifolds, shaped by the geometric biases introduced by representing samples in different spaces. By jointly leveraging the complementary geometry of these distinct manifolds, our approach achieves superior performance without artificially increasing data diversity through strong augmentations. We demonstrate the effectiveness of our method on nine datasets across five temporal sequence tasks, where signal-specific characteristics make data augmentations particularly challenging. Without relying on augmentation-induced diversity, our method achieves performance gains of up to 15--20\% over existing self-supervised approaches. Source code: this https URL

Title: Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion

Authors: Zilong Wang, Qingtian Zeng, Hua Duan, Cheng Cheng, Minghao Zou, Ziyang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22656
Pdf URL: https://arxiv.org/pdf/2510.22656
Copy Paste: [[2510.22656]] Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion(https://arxiv.org/abs/2510.22656)
Keywords: diffusion
Abstract: Few-shot Knowledge Graph Completion (FKGC) infers missing triples from limited support samples, tackling long-tail distribution challenges. Existing methods, however, struggle to capture complex relational patterns and mitigate data sparsity. To address these challenges, we propose a novel FKGC framework for conjugate relation modeling (CR-FKGC). Specifically, it employs a neighborhood aggregation encoder to integrate higher-order neighbor information, a conjugate relation learner combining an implicit conditional diffusion relation module with a stable relation module to capture stable semantics and uncertainty offsets, and a manifold conjugate decoder for efficient evaluation and inference of missing triples in manifold space. Experiments on three benchmarks demonstrate that our method achieves superior performance over state-of-the-art methods.

Title: SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery

Authors: Qiwei Ma, Zhiyu Wang, Wang Liu, Xukun Lu, Bin Deng, Puhong Duan, Xudong Kang, Shutao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22665
Pdf URL: https://arxiv.org/pdf/2510.22665
Copy Paste: [[2510.22665]] SARCLIP: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery(https://arxiv.org/abs/2510.22665)
Keywords: self-supervised, foundation model
Abstract: Synthetic Aperture Radar (SAR) has emerged as a crucial imaging modality due to its all-weather capabilities. While recent advancements in self-supervised learning and Masked Image Modeling (MIM) have paved the way for SAR foundation models, these approaches primarily focus on low-level visual features, often overlooking multimodal alignment and zero-shot target recognition within SAR imagery. To address this limitation, we construct SARCLIP-1M, a large-scale vision language dataset comprising over one million text-image pairs aggregated from existing datasets. We further introduce SARCLIP, the first vision language foundation model tailored for the SAR domain. Our SARCLIP model is trained using a contrastive vision language learning approach by domain transferring strategy, enabling it to bridge the gap between SAR imagery and textual descriptions. Extensive experiments on image-text retrieval and zero-shot classification tasks demonstrate the superior performance of SARCLIP in feature extraction and interpretation, significantly outperforming state-of-the-art foundation models and advancing the semantic understanding of SAR imagery. The code and datasets will be released soon.

Title: FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning

Authors: Shan Zhong, Shutong Ding, He Diao, Xiangyu Wang, Kah Chan Teh, Bei Peng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22686
Pdf URL: https://arxiv.org/pdf/2510.22686
Copy Paste: [[2510.22686]] FlowCritic: Bridging Value Estimation with Flow Matching in Reinforcement Learning(https://arxiv.org/abs/2510.22686)
Keywords: generative
Abstract: Reliable value estimation serves as the cornerstone of reinforcement learning (RL) by evaluating long-term returns and guiding policy improvement, significantly influencing the convergence speed and final performance. Existing works improve the reliability of value function estimation via multi-critic ensembles and distributional RL, yet the former merely combines multi point estimation without capturing distributional information, whereas the latter relies on discretization or quantile regression, limiting the expressiveness of complex value distributions. Inspired by flow matching's success in generative modeling, we propose a generative paradigm for value estimation, named FlowCritic. Departing from conventional regression for deterministic value prediction, FlowCritic leverages flow matching to model value distributions and generate samples for value estimation.

Title: VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree

Authors: Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22693
Pdf URL: https://arxiv.org/pdf/2510.22693
Copy Paste: [[2510.22693]] VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree(https://arxiv.org/abs/2510.22693)
Keywords: anomaly
Abstract: Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularityaware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at this https URL.

Title: WaveMAE: Wavelet decomposition Masked Auto-Encoder for Remote Sensing

Authors: Vittorio Bernuzzi, Leonardo Rossi, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22697
Pdf URL: https://arxiv.org/pdf/2510.22697
Copy Paste: [[2510.22697]] WaveMAE: Wavelet decomposition Masked Auto-Encoder for Remote Sensing(https://arxiv.org/abs/2510.22697)
Keywords: self-supervised, foundation model
Abstract: Self-supervised learning (SSL) has recently emerged as a key strategy for building foundation models in remote sensing, where the scarcity of annotated data limits the applicability of fully supervised approaches. In this work, we introduce WaveMAE, a masked autoencoding framework tailored for multispectral satellite imagery. Unlike conventional pixel-based reconstruction, WaveMAE leverages a multi-level Discrete Wavelet Transform (DWT) to disentangle frequency components and guide the encoder toward learning scale-aware high-frequency representations. We further propose a Geo-conditioned Positional Encoding (GPE), which incorporates geographical priors via Spherical Harmonics, encouraging embeddings that respect both semantic and geospatial structure. To ensure fairness in evaluation, all methods are pretrained on the same dataset (fMoW-S2) and systematically evaluated on the diverse downstream tasks of the PANGAEA benchmark, spanning semantic segmentation, regression, change detection, and multilabel classification. Extensive experiments demonstrate that WaveMAE achieves consistent improvements over prior state-of-the-art approaches, with substantial gains on segmentation and regression benchmarks. The effectiveness of WaveMAE pretraining is further demonstrated by showing that even a lightweight variant, containing only 26.4% of the parameters, achieves state-of-the-art performance. Our results establish WaveMAE as a strong and geographically informed foundation model for multispectral remote sensing imagery.

Title: Cross-view Localization and Synthesis - Datasets, Challenges and Opportunities

Authors: Ningli Xu, Rongjun Qin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22736
Pdf URL: https://arxiv.org/pdf/2510.22736
Copy Paste: [[2510.22736]] Cross-view Localization and Synthesis - Datasets, Challenges and Opportunities(https://arxiv.org/abs/2510.22736)
Keywords: diffusion, generative
Abstract: Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are widely embedded in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem where ground-level features are matched with tiled overhead images feature, extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views based on information from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. We also include the project page via this https URL.

Title: Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models

Authors: Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Zoran Tiganj
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22752
Pdf URL: https://arxiv.org/pdf/2510.22752
Copy Paste: [[2510.22752]] Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models(https://arxiv.org/abs/2510.22752)
Keywords: in-context
Abstract: In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, where the retrieval of specific events is enabled by separating events that happened at different times, this work probes the ability of various pretrained LLMs, including transformer and state-space models, to differentiate and retrieve temporally separated events. Specifically, we prompted models with sequences containing multiple presentations of the same token, which reappears at the sequence end. By fixing the positions of these repeated tokens and permuting all others, we removed semantic confounds and isolated temporal effects on next-token prediction. Across diverse sequences, models consistently placed the highest probabilities on tokens following a repeated token, but with a notable bias for those nearest the beginning or end of the input. An ablation experiment linked this phenomenon in transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap further demonstrated that memories embedded in the middle of a prompt are retrieved less reliably. Despite architectural differences, state-space and transformer models showed comparable temporal biases. Our findings deepen the understanding of temporal biases in in-context learning and offer an illustration of how these biases can enable temporal separation and episodic retrieval.

Title: Distributionally Robust Optimization via Diffusion Ambiguity Modeling

Authors: Jiaqi Wen, Jianyi Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22757
Pdf URL: https://arxiv.org/pdf/2510.22757
Copy Paste: [[2510.22757]] Distributionally Robust Optimization via Diffusion Ambiguity Modeling(https://arxiv.org/abs/2510.22757)
Keywords: diffusion
Abstract: This paper studies Distributionally Robust Optimization (DRO), a fundamental framework for enhancing the robustness and generalization of statistical learning and optimization. An effective ambiguity set for DRO must involve distributions that remain consistent with the nominal distribution while being diverse enough to account for a variety of potential scenarios. Moreover, it should lead to tractable DRO solutions. To this end, we propose a diffusion-based ambiguity set design that captures various adversarial distributions beyond the nominal support space while maintaining consistency with the nominal distribution. Building on this ambiguity modeling, we propose Diffusion-based DRO (D-DRO), a tractable DRO algorithm that solves the inner maximization over the parameterized diffusion model space. We formally establish the stationary convergence performance of D-DRO and empirically demonstrate its superior Out-of-Distribution (OOD) generalization performance in a ML prediction task.

Title: A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring)

Authors: Christopher J. Hazard, Michael Resnick, Jacob Beel, Jack Xia, Cade Mack, Dominic Glennie, Matthew Fulp, David Maze, Andrew Bassett, Martin Koistinen
Subjects: cs.LG, cs.AI, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22809
Pdf URL: https://arxiv.org/pdf/2510.22809
Copy Paste: [[2510.22809]] A Theory of the Mechanics of Information: Generalization Through Measurement of Uncertainty (Learning is Measuring)(https://arxiv.org/abs/2510.22809)
Keywords: generative, anomaly
Abstract: Traditional machine learning relies on explicit models and domain assumptions, limiting flexibility and interpretability. We introduce a model-free framework using surprisal (information theoretic uncertainty) to directly analyze and perform inferences from raw data, eliminating distribution modeling, reducing bias, and enabling efficient updates including direct edits and deletion of training data. By quantifying relevance through uncertainty, the approach enables generalizable inference across tasks including generative inference, causal discovery, anomaly detection, and time series forecasting. It emphasizes traceability, interpretability, and data-driven decision making, offering a unified, human-understandable framework for machine learning, and achieves at or near state-of-the-art performance across most common machine learning tasks. The mathematical foundations create a ``physics'' of information, which enable these techniques to apply effectively to a wide variety of complex data types, including missing data. Empirical results indicate that this may be a viable alternative path to neural networks with regard to scalable machine learning and artificial intelligence that can maintain human understandability of the underlying mechanics.

Title: MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control

Authors: Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, Josef Kittler
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22810
Pdf URL: https://arxiv.org/pdf/2510.22810
Copy Paste: [[2510.22810]] MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control(https://arxiv.org/abs/2510.22810)
Keywords: diffusion
Abstract: Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.

Title: Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays

Authors: Haowei Hua (1), Hong Jiao (2), Xinyi Wang (3) ((1) Princeton University, (2) University of Maryland, College Park, (3) University of Maryland, College Park & Beijing Normal University)
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.22830
Pdf URL: https://arxiv.org/pdf/2510.22830
Copy Paste: [[2510.22830]] Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays(https://arxiv.org/abs/2510.22830)
Keywords: generative
Abstract: BERT and its variants are extensively explored for automated scoring. However, a limit of 512 tokens for these encoder-based models showed the deficiency in automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed great improvement of scoring accuracy with QWK increased from 0.822 to 0.8878 for the Learning Agency Lab Automated Essay Scoring 2.0 dataset.

Title: Clustering by Denoising: Latent plug-and-play diffusion for single-cell data

Authors: Dominik Meier, Shixing Yu, Sagnik Nandy, Promit Ghosal, Kyra Gan
Subjects: cs.LG, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22835
Pdf URL: https://arxiv.org/pdf/2510.22835
Copy Paste: [[2510.22835]] Clustering by Denoising: Latent plug-and-play diffusion for single-cell data(https://arxiv.org/abs/2510.22835)
Keywords: diffusion
Abstract: Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique "input-space steering" ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.

Title: Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models

Authors: Aya Nakayama, Brian Wong, Yuji Nishimura, Kaito Tanaka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22838
Pdf URL: https://arxiv.org/pdf/2510.22838
Copy Paste: [[2510.22838]] Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models(https://arxiv.org/abs/2510.22838)
Keywords: in-context
Abstract: The "style trap" poses a significant challenge for Large Vision-Language Models (LVLMs), hindering robust semantic understanding across diverse visual styles, especially in in-context learning (ICL). Existing methods often fail to effectively decouple style from content, hindering generalization. To address this, we propose the Semantic-Preserving Cross-Style Visual Reasoner (SP-CSVR), a novel framework for stable semantic understanding and adaptive cross-style visual reasoning. SP-CSVR integrates a Cross-Style Feature Encoder (CSFE) for style-content disentanglement, a Semantic-Aligned In-Context Decoder (SAICD) for efficient few-shot style adaptation, and an Adaptive Semantic Consistency Module (ASCM) employing multi-task contrastive learning to enforce cross-style semantic invariance. Extensive experiments on a challenging multi-style dataset demonstrate SP-CSVR's state-of-the-art performance across visual captioning, visual question answering, and in-context style adaptation. Comprehensive evaluations, including ablation studies and generalization analysis, confirm SP-CSVR's efficacy in enhancing robustness, generalization, and efficiency across diverse visual styles.

Title: Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models

Authors: Lexiang Xiong, Chengyu Liu, Jingwen Ye, Yan Liu, Yuecong Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22851
Pdf URL: https://arxiv.org/pdf/2510.22851
Copy Paste: [[2510.22851]] Semantic Surgery: Zero-Shot Concept Erasure in Diffusion Models(https://arxiv.org/abs/2510.22851)
Keywords: diffusion, generative
Abstract: Concept erasure in text-to-image diffusion models is crucial for mitigating harmful content, yet existing methods often compromise generative quality. We introduce Semantic Surgery, a novel training-free, zero-shot framework for concept erasure that operates directly on text embeddings before the diffusion process. It dynamically estimates the presence of target concepts in a prompt and performs a calibrated vector subtraction to neutralize their influence at the source, enhancing both erasure completeness and locality. The framework includes a Co-Occurrence Encoding module for robust multi-concept erasure and a visual feedback loop to address latent concept persistence. As a training-free method, Semantic Surgery adapts dynamically to each prompt, ensuring precise interventions. Extensive experiments on object, explicit content, artistic style, and multi-celebrity erasure tasks show our method significantly outperforms state-of-the-art approaches. We achieve superior completeness and robustness while preserving locality and image quality (e.g., 93.58 H-score in object erasure, reducing explicit content to just 1 instance, and 8.09 H_a in style erasure with no quality degradation). This robustness also allows our framework to function as a built-in threat detection system, offering a practical solution for safer text-to-image generation.

Title: Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

Authors: Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.22852
Pdf URL: https://arxiv.org/pdf/2510.22852
Copy Paste: [[2510.22852]] Encoder-Decoder Diffusion Language Models for Efficient Training and Inference(https://arxiv.org/abs/2510.22852)
Keywords: diffusion
Abstract: Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network at every denoising step and incur high computational cost. Our key insight is that discrete diffusion models perform two types of computation: 1) representing clean tokens and 2) denoising corrupted tokens, which enables us to use separate modules for each task. We propose an encoder-decoder architecture to accelerate discrete diffusion inference, which relies on an encoder to represent clean tokens and a lightweight decoder to iteratively refine a noised sequence. We also show that this architecture enables faster training of block diffusion models, which partition sequences into blocks for better quality and are commonly used in diffusion language model inference. We introduce a framework for Efficient Encoder-Decoder Diffusion (E2D2), consisting of an architecture with specialized training and sampling algorithms, and we show that E2D2 achieves superior trade-offs between generation quality and inference throughput on summarization, translation, and mathematical reasoning tasks. We provide the code, model weights, and blog post on the project page: this https URL

Title: A Comprehensive Dataset for Human vs. AI Generated Text Detection

Authors: Rajarshi Roy, Nasrin Imanpour, Ashhar Aziz, Shashwat Bajpai, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amitava Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.22874
Pdf URL: https://arxiv.org/pdf/2510.22874
Copy Paste: [[2510.22874]] A Comprehensive Dataset for Human vs. AI Generated Text Detection(https://arxiv.org/abs/2510.22874)
Keywords: generative
Abstract: The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising over 58,000 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35\%, and attributing AI texts to their generating models with an accuracy of 8.92\%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: this https URL.

Title: Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling

Authors: Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22878
Pdf URL: https://arxiv.org/pdf/2510.22878
Copy Paste: [[2510.22878]] Limits of Generative Pre-Training in Structured EMR Trajectories with Irregular Sampling(https://arxiv.org/abs/2510.22878)
Keywords: foundation model, generative
Abstract: Foundation models refer to architectures trained on vast datasets using autoregressive pre-training from natural language processing to capture intricate patterns and motifs. They were originally developed to transfer such learned knowledge to downstream predictive tasks. Recently, however, some studies repurpose these learned representations for phenotype discovery without rigorous validation, risking superficially realistic but clinically incoherent embeddings. To test this mismatch, we trained two autoregressive models -- a sequence-to-sequence LSTM and a reduced Transformer -- on longitudinal ART for HIV and Acute Hypotension datasets. Controlled irregularity was added during training via random inter-visit gaps, while test sequences stayed complete. Patient-trajectory synthesis evaluated distributional and correlational fidelity. Both reproduced feature distributions but failed to preserve cross-feature structure -- showing that generative pre-training yields local realism but limited clinical coherence. These results highlight the need for domain-specific evaluation and support trajectory synthesis as a practical probe before fine-tuning or deployment.

Title: On the Anisotropy of Score-Based Generative Models

Authors: Andreas Floros, Seyed-Mohsen Moosavi-Dezfooli, Pier Luigi Dragotti
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.22899
Pdf URL: https://arxiv.org/pdf/2510.22899
Copy Paste: [[2510.22899]] On the Anisotropy of Score-Based Generative Models(https://arxiv.org/abs/2510.22899)
Keywords: generative
Abstract: We investigate the role of network architecture in shaping the inductive biases of modern score-based generative models. To this end, we introduce the Score Anisotropy Directions (SADs), architecture-dependent directions that reveal how different networks preferentially capture data structure. Our analysis shows that SADs form adaptive bases aligned with the architecture's output geometry, providing a principled way to predict generalization ability in score models prior to training. Through both synthetic data and standard image benchmarks, we demonstrate that SADs reliably capture fine-grained model behavior and correlate with downstream performance, as measured by Wasserstein metrics. Our work offers a new lens for explaining and predicting directional biases of generative models.

Title: Simple Denoising Diffusion Language Models

Authors: Huaisheng Zhu, Zhengyu Chen, Shijie Zhou, Zhihui Xie, Yige Yuan, Zhimeng Guo, Siyuan Xu, Hangfan Zhang, Vasant Honavar, Teng Xiao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22926
Pdf URL: https://arxiv.org/pdf/2510.22926
Copy Paste: [[2510.22926]] Simple Denoising Diffusion Language Models(https://arxiv.org/abs/2510.22926)
Keywords: diffusion, self-supervised
Abstract: Diffusion models have recently been extended to language generation through Masked Diffusion Language Models (MDLMs), which achieve performance competitive with strong autoregressive models. However, MDLMs tend to degrade in the few-step regime and cannot directly adopt existing few-step distillation methods designed for continuous diffusion models, as they lack the intrinsic property of mapping from noise to data. Recent Uniform-state Diffusion Models (USDMs), initialized from a uniform prior, alleviate some limitations but still suffer from complex loss formulations that hinder scalability. In this work, we propose a simplified denoising-based loss for USDMs that optimizes only noise-replaced tokens, stabilizing training and matching ELBO-level performance. Furthermore, by framing denoising as self-supervised learning, we introduce a simple modification to our denoising loss with contrastive-inspired negative gradients, which is practical and yield additional improvements in generation quality.

Title: Diffuse to Detect: A Generalizable Framework for Anomaly Detection with Diffusion Models Applications to UAVs and Beyond

Authors: Mingze Gong, Juan Du, Jianbang You
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.22928
Pdf URL: https://arxiv.org/pdf/2510.22928
Copy Paste: [[2510.22928]] Diffuse to Detect: A Generalizable Framework for Anomaly Detection with Diffusion Models Applications to UAVs and Beyond(https://arxiv.org/abs/2510.22928)
Keywords: diffusion, generative, anomaly
Abstract: Anomaly detection in complex, high-dimensional data, such as UAV sensor readings, is essential for operational safety but challenging for existing methods due to their limited sensitivity, scalability, and inability to capture intricate dependencies. We propose the Diffuse to Detect (DTD) framework, a novel approach that innovatively adapts diffusion models for anomaly detection, diverging from their conventional use in generative tasks with high inference time. By comparison, DTD employs a single-step diffusion process to predict noise patterns, enabling rapid and precise identification of anomalies without reconstruction errors. This approach is grounded in robust theoretical foundations that link noise prediction to the data distribution's score function, ensuring reliable deviation detection. By integrating Graph Neural Networks to model sensor relationships as dynamic graphs, DTD effectively captures spatial (inter-sensor) and temporal anomalies. Its two-branch architecture, with parametric neural network-based energy scoring for scalability and nonparametric statistical methods for interpretability, provides flexible trade-offs between computational efficiency and transparency. Extensive evaluations on UAV sensor data, multivariate time series, and images demonstrate DTD's superior performance over existing methods, underscoring its generality across diverse data modalities. This versatility, combined with its adaptability, positions DTD as a transformative solution for safety-critical applications, including industrial monitoring and beyond.

Title: Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges

Authors: Liling Yang, Ning Chen, Jun Yue, Yidan Liu, Jiayi Ma, Pedram Ghamisi, Antonio Plaza, Leyuan Fang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22964
Pdf URL: https://arxiv.org/pdf/2510.22964
Copy Paste: [[2510.22964]] Survey of Multimodal Geospatial Foundation Models: Techniques, Applications, and Challenges(https://arxiv.org/abs/2510.22964)
Keywords: foundation model
Abstract: Foundation models have transformed natural language processing and computer vision, and their impact is now reshaping remote sensing image analysis. With powerful generalization and transfer learning capabilities, they align naturally with the multimodal, multi-resolution, and multi-temporal characteristics of remote sensing data. To address unique challenges in the field, multimodal geospatial foundation models (GFMs) have emerged as a dedicated research frontier. This survey delivers a comprehensive review of multimodal GFMs from a modality-driven perspective, covering five core visual and vision-language modalities. We examine how differences in imaging physics and data representation shape interaction design, and we analyze key techniques for alignment, integration, and knowledge transfer to tackle modality heterogeneity, distribution shifts, and semantic gaps. Advances in training paradigms, architectures, and task-specific adaptation strategies are systematically assessed alongside a wealth of emerging benchmarks. Representative multimodal visual and vision-language GFMs are evaluated across ten downstream tasks, with insights into their architectures, performance, and application scenarios. Real-world case studies, spanning land cover mapping, agricultural monitoring, disaster response, climate studies, and geospatial intelligence, demonstrate the practical potential of GFMs. Finally, we outline pressing challenges in domain generalization, interpretability, efficiency, and privacy, and chart promising avenues for future research.

Title: VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

Authors: Zhangkai Wu, Xuhui Fan, Zhongyuan Xie, Kaize Shi, Longbing Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22970
Pdf URL: https://arxiv.org/pdf/2510.22970
Copy Paste: [[2510.22970]] VALA: Learning Latent Anchors for Training-Free and Temporally Consistent(https://arxiv.org/abs/2510.22970)
Keywords: diffusion
Abstract: Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose~\textbf{VALA} (\textbf{V}ariational \textbf{A}lignment for \textbf{L}atent \textbf{A}nchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA propose a variational framework with a contrastive learning objective. Therefore, it can transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.

Title: Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

Authors: Bohan Li, Xin Jin, Hu Zhu, Hongsi Liu, Ruikai Li, Jiazhe Guo, Kaiwen Cai, Chao Ma, Yueming Jin, Hao Zhao, Xiaokang Yang, Wenjun Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.22973
Pdf URL: https://arxiv.org/pdf/2510.22973
Copy Paste: [[2510.22973]] Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method(https://arxiv.org/abs/2510.22973)
Keywords: generative
Abstract: Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance heavily depends on annotated occupancy data, which still remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also autonomous driving downstream applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validates its practical value in downstream tasks. Repo: this https URL

Title: Can Language Models Compose Skills In-Context?

Authors: Zidong Liu, Zhuoyan Xu, Zhenmei Shi, Yingyu Liang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2510.22993
Pdf URL: https://arxiv.org/pdf/2510.22993
Copy Paste: [[2510.22993]] Can Language Models Compose Skills In-Context?(https://arxiv.org/abs/2510.22993)
Keywords: in-context
Abstract: Composing basic skills from simple tasks to accomplish composite tasks is crucial for modern intelligent systems. We investigate the in-context composition ability of language models to perform composite tasks that combine basic skills demonstrated in in-context examples. This is more challenging than the standard setting, where skills and their composition can be learned in training. We conduct systematic experiments on various representative open-source language models, utilizing linguistic and logical tasks designed to probe composition abilities. The results reveal that simple task examples can have a surprising negative impact on the performance, because the models generally struggle to recognize and assemble the skills correctly, even with Chain-of-Thought examples. Theoretical analysis further shows that it is crucial to align examples with the corresponding steps in the composition. This inspires a method for the probing tasks, whose improved performance provides positive support for our insights.

Title: Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures

Authors: Shenran Wang, Timothy Tin-Long Tse, Jian Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23006
Pdf URL: https://arxiv.org/pdf/2510.23006
Copy Paste: [[2510.23006]] Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures(https://arxiv.org/abs/2510.23006)
Keywords: in-context
Abstract: We perform in-depth evaluations of in-context learning (ICL) on state-of-the-art transformer, state-space, and hybrid large language models over two categories of knowledge-based ICL tasks. Using a combination of behavioral probing and intervention-based methods, we have discovered that, while LLMs of different architectures can behave similarly in task performance, their internals could remain different. We discover that function vectors (FVs) responsible for ICL are primarily located in the self-attention and Mamba layers, and speculate that Mamba2 uses a different mechanism from FVs to perform ICL. FVs are more important for ICL involving parametric knowledge retrieval, but not for contextual knowledge understanding. Our work contributes to a more nuanced understanding across architectures and task types. Methodologically, our approach also highlights the importance of combining both behavioural and mechanistic analyses to investigate LLM capabilities.

Title: M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark

Authors: Huixuan Zhang, Xiaojun Wan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.23020
Pdf URL: https://arxiv.org/pdf/2510.23020
Copy Paste: [[2510.23020]] M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark(https://arxiv.org/abs/2510.23020)
Keywords: diffusion
Abstract: Text-to-image models are known to struggle with generating images that perfectly align with textual prompts. Several previous studies have focused on evaluating image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, especially overlooking the difficulty of prompts with multiple different instances belonging to the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. \footnote{Our code and data has been released in supplementary material and will be made publicly available after the paper is accepted.}

Title: UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization

Authors: Huixuan Zhang, Xiaojun Wan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2510.23023
Pdf URL: https://arxiv.org/pdf/2510.23023
Copy Paste: [[2510.23023]] UniAIDet: A Unified and Universal Benchmark for AI-Generated Image Content Detection and Localization(https://arxiv.org/abs/2510.23023)
Keywords: generative
Abstract: With the rapid proliferation of image generative models, the authenticity of digital images has become a significant concern. While existing studies have proposed various methods for detecting AI-generated content, current benchmarks are limited in their coverage of diverse generative models and image categories, often overlooking end-to-end image editing and artistic images. To address these limitations, we introduce UniAIDet, a unified and comprehensive benchmark that includes both photographic and artistic images. UniAIDet covers a wide range of generative models, including text-to-image, image-to-image, image inpainting, image editing, and deepfake models. Using UniAIDet, we conduct a comprehensive evaluation of various detection methods and answer three key research questions regarding generalization capability and the relation between detection and localization. Our benchmark and analysis provide a robust foundation for future research.

Title: Nested AutoRegressive Models

Authors: Hongyu Wu, Xuhui Fan, Zhangkai Wu, Longbing Cao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23028
Pdf URL: https://arxiv.org/pdf/2510.23028
Copy Paste: [[2510.23028]] Nested AutoRegressive Models(https://arxiv.org/abs/2510.23028)
Keywords: diffusion
Abstract: AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose a Nested AutoRegressive~(NestAR) model, which proposes nested AutoRegressive architectures in generating images. NestAR designs multi-scale modules in a hierarchical order. These different scaled modules are constructed in an AR architecture, where one larger-scale module is conditioned on outputs from its previous smaller-scale module. Within each module, NestAR uses another AR structure to generate ``patches'' of tokens. The proposed nested AR architecture reduces the overall complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ in generating $n$ image tokens, as well as increases image diversities. NestAR further incorporates flow matching loss to use continuous tokens, and develops objectives to coordinate these multi-scale modules in model training. NestAR achieves competitive image generation performance while significantly lowering computational cost.

Title: A high-capacity linguistic steganography based on entropy-driven rank-token mapping

Authors: Jun Jiang, Weiming Zhang, Nenghai Yu, Kejiang Chen
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23035
Pdf URL: https://arxiv.org/pdf/2510.23035
Copy Paste: [[2510.23035]] A high-capacity linguistic steganography based on entropy-driven rank-token mapping(https://arxiv.org/abs/2510.23035)
Keywords: generative
Abstract: Linguistic steganography enables covert communication through embedding secret messages into innocuous texts; however, current methods face critical limitations in payload capacity and security. Traditional modification-based methods introduce detectable anomalies, while retrieval-based strategies suffer from low embedding capacity. Modern generative steganography leverages language models to generate natural stego text but struggles with limited entropy in token predictions, further constraining capacity. To address these issues, we propose an entropy-driven framework called RTMStega that integrates rank-based adaptive coding and context-aware decompression with normalized entropy. By mapping secret messages to token probability ranks and dynamically adjusting sampling via context-aware entropy-based adjustments, RTMStega achieves a balance between payload capacity and imperceptibility. Experiments across diverse datasets and models demonstrate that RTMStega triples the payload capacity of mainstream generative steganography, reduces processing time by over 50%, and maintains high text quality, offering a trustworthy solution for secure and efficient covert communication.

Title: LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation

Authors: Subhojyoti Khastagir, Kishalay Das, Pawan Goyal, Seung-Cheol Lee, Satadeep Bhattacharjee, Niloy Ganguly
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23040
Pdf URL: https://arxiv.org/pdf/2510.23040
Copy Paste: [[2510.23040]] LLM Meets Diffusion: A Hybrid Framework for Crystal Material Generation(https://arxiv.org/abs/2510.23040)
Keywords: diffusion, generative
Abstract: Recent advances in generative modeling have shown significant promise in designing novel periodic crystal structures. Existing approaches typically rely on either large language models (LLMs) or equivariant denoising models, each with complementary strengths: LLMs excel at handling discrete atomic types but often struggle with continuous features such as atomic positions and lattice parameters, while denoising models are effective at modeling continuous variables but encounter difficulties in generating accurate atomic compositions. To bridge this gap, we propose CrysLLMGen, a hybrid framework that integrates an LLM with a diffusion model to leverage their complementary strengths for crystal material generation. During sampling, CrysLLMGen first employs a fine-tuned LLM to produce an intermediate representation of atom types, atomic coordinates, and lattice structure. While retaining the predicted atom types, it passes the atomic coordinates and lattice structure to a pre-trained equivariant diffusion model for refinement. Our framework outperforms state-of-the-art generative models across several benchmark tasks and datasets. Specifically, CrysLLMGen not only achieves a balanced performance in terms of structural and compositional validity but also generates more stable and novel materials compared to LLM-based and denoisingbased models Furthermore, CrysLLMGen exhibits strong conditional generation capabilities, effectively producing materials that satisfy user-defined constraints. Code is available at this https URL

Title: A Survey on LLM Mid-training

Authors: Chengying Tu, Xuemiao Zhang, Rongxiang Weng, Rumei Li, Chen Zhang, Yang Bai, Hongfei Yan, Jingang Wang, Xunliang Cai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23081
Pdf URL: https://arxiv.org/pdf/2510.23081
Copy Paste: [[2510.23081]] A Survey on LLM Mid-training(https://arxiv.org/abs/2510.23081)
Keywords: foundation model
Abstract: Recent advances in foundation models have highlighted the significant benefits of multi-stage training, with a particular emphasis on the emergence of mid-training as a vital stage that bridges pre-training and post-training. Mid-training is distinguished by its use of intermediate data and computational resources, systematically enhancing specified capabilities such as mathematics, coding, reasoning, and long-context extension, while maintaining foundational competencies. This survey provides a formal definition of mid-training for large language models (LLMs) and investigates optimization frameworks that encompass data curation, training strategies, and model architecture optimization. We analyze mainstream model implementations in the context of objective-driven interventions, illustrating how mid-training serves as a distinct and critical stage in the progressive development of LLM capabilities. By clarifying the unique contributions of mid-training, this survey offers a comprehensive taxonomy and actionable insights, supporting future research and innovation in the advancement of LLMs.

Title: Sampling from Energy distributions with Target Concrete Score Identity

Authors: Sergei Kholkin, Francisco Vargas, Alexander Korotin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.23106
Pdf URL: https://arxiv.org/pdf/2510.23106
Copy Paste: [[2510.23106]] Sampling from Energy distributions with Target Concrete Score Identity(https://arxiv.org/abs/2510.23106)
Keywords: diffusion
Abstract: We introduce the Target Concrete Score Identity Sampler (TCSIS), a method for sampling from unnormalized densities on discrete state spaces by learning the reverse dynamics of a Continuous-Time Markov Chain (CTMC). Our approach builds on a forward in time CTMC with a uniform noising kernel and relies on the proposed Target Concrete Score Identity, which relates the concrete score, the ratio of marginal probabilities of two states, to a ratio of expectations of Boltzmann factors under the forward uniform diffusion kernel. This formulation enables Monte Carlo estimation of the concrete score without requiring samples from the target distribution or computation of the partition function. We approximate the concrete score with a neural network and propose two algorithms: Self-Normalized TCSIS and Unbiased TCSIS. Finally, we demonstrate the effectiveness of TCSIS on problems from statistical physics.

Title: Residual Diffusion Bridge Model for Image Restoration

Authors: Hebaixu Wang, Jing Zhang, Haoyang Chen, Haonan Guo, Di Wang, Jiayi Ma, Bo Du
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23116
Pdf URL: https://arxiv.org/pdf/2510.23116
Copy Paste: [[2510.23116]] Residual Diffusion Bridge Model for Image Restoration(https://arxiv.org/abs/2510.23116)
Keywords: diffusion
Abstract: Diffusion bridge models establish probabilistic paths between arbitrary paired distributions and exhibit great potential for universal image restoration. Most existing methods merely treat them as simple variants of stochastic interpolants, lacking a unified analytical perspective. Besides, they indiscriminately reconstruct images through global noise injection and removal, inevitably distorting undegraded regions due to imperfect reconstruction. To address these challenges, we propose the Residual Diffusion Bridge Model (RDBM). Specifically, we theoretically reformulate the stochastic differential equations of generalized diffusion bridge and derive the analytical formulas of its forward and reverse processes. Crucially, we leverage the residuals from given distributions to modulate the noise injection and removal, enabling adaptive restoration of degraded regions while preserving intact others. Moreover, we unravel the fundamental mathematical essence of existing bridge models, all of which are special cases of RDBM and empirically demonstrate the optimality of our proposed models. Extensive experiments are conducted to demonstrate the state-of-the-art performance of our method both qualitatively and quantitatively across diverse image restoration tasks. Code is publicly available at this https URL.

Title: Implicit Modeling for Transferability Estimation of Vision Foundation Models

Authors: Yaoyan Zheng, Huiqun Wang, Nan Zhou, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23145
Pdf URL: https://arxiv.org/pdf/2510.23145
Copy Paste: [[2510.23145]] Implicit Modeling for Transferability Estimation of Vision Foundation Models(https://arxiv.org/abs/2510.23145)
Keywords: foundation model
Abstract: Transferability estimation identifies the best pre-trained models for downstream tasks without incurring the high computational cost of full fine-tuning. This capability facilitates deployment and advances the pre-training and fine-tuning paradigm. However, existing methods often struggle to accurately assess transferability for emerging pre-trained models with diverse architectures, training strategies, and task alignments. In this work, we propose Implicit Transferability Modeling (ITM), a novel framework that implicitly models each model's intrinsic transferability, coupled with a Divide-and-Conquer Variational Approximation (DVA) strategy to efficiently approximate embedding space evolution. This design enables generalization across a broader range of models and downstream tasks. Extensive experiments on a comprehensive benchmark--spanning extensive training regimes and a wider variety of model types--demonstrate that ITM consistently outperforms existing methods in terms of stability, effectiveness, and efficiency.

Title: Finding 3D Scene Analogies with Multimodal Foundation Models

Authors: Junho Kim, Young Min Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23184
Pdf URL: https://arxiv.org/pdf/2510.23184
Copy Paste: [[2510.23184]] Finding 3D Scene Analogies with Multimodal Foundation Models(https://arxiv.org/abs/2510.23184)
Keywords: foundation model
Abstract: Connecting current observations with prior experiences helps robots adapt and plan in new, unseen 3D environments. Recently, 3D scene analogies have been proposed to connect two 3D scenes, which are smooth maps that align scene regions with common spatial relationships. These maps enable detailed transfer of trajectories or waypoints, potentially supporting demonstration transfer for imitation learning or task plan transfer across scenes. However, existing methods for the task require additional training and fixed object vocabularies. In this work, we propose to use multimodal foundation models for finding 3D scene analogies in a zero-shot, open-vocabulary setting. Central to our approach is a hybrid neural representation of scenes that consists of a sparse graph based on vision-language model features and a feature field derived from 3D shape foundation models. 3D scene analogies are then found in a coarse-to-fine manner, by first aligning the graph and refining the correspondence with feature fields. Our method can establish accurate correspondences between complex scenes, and we showcase applications in trajectory and waypoint transfer.

Title: Evaluation of Vision-LLMs in Surveillance Video

Authors: Pascal Benschop, Cristian Meo, Justin Dauwels, Jelte P. Mense
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23190
Pdf URL: https://arxiv.org/pdf/2510.23190
Copy Paste: [[2510.23190]] Evaluation of Vision-LLMs in Surveillance Video(https://arxiv.org/abs/2510.23190)
Keywords: anomaly
Abstract: The widespread use of cameras in our society has created an overwhelming amount of video data, far exceeding the capacity for human monitoring. This presents a critical challenge for public safety and security, as the timely detection of anomalous or criminal events is crucial for effective response and prevention. The ability for an embodied agent to recognize unexpected events is fundamentally tied to its capacity for spatial reasoning. This paper investigates the spatial reasoning of vision-language models (VLMs) by framing anomalous action recognition as a zero-shot, language-grounded task, addressing the embodied perception challenge of interpreting dynamic 3D scenes from sparse 2D video. Specifically, we investigate whether small, pre-trained vision--LLMs can act as spatially-grounded, zero-shot anomaly detectors by converting video into text descriptions and scoring labels via textual entailment. We evaluate four open models on UCF-Crime and RWF-2000 under prompting and privacy-preserving conditions. Few-shot exemplars can improve accuracy for some models, but may increase false positives, and privacy filters -- especially full-body GAN transforms -- introduce inconsistencies that degrade accuracy. These results chart where current vision--LLMs succeed (simple, spatially salient events) and where they falter (noisy spatial cues, identity obfuscation). Looking forward, we outline concrete paths to strengthen spatial grounding without task-specific training: structure-aware prompts, lightweight spatial memory across clips, scene-graph or 3D-pose priors during description, and privacy methods that preserve action-relevant geometry. This positions zero-shot, language-grounded pipelines as adaptable building blocks for embodied, real-world video understanding. Our implementation for evaluating VLMs is publicly available at: this https URL

Title: Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?

Authors: Tawsif Tashwar Dipto, Azmol Hossain, Rubayet Sabbir Faruque, Md. Rezuwan Hassan, Kanij Fatema, Tanmoy Shome, Ruwad Naswan, Md.Foriduzzaman Zihad, Mohaymen Ul Anam, Nazia Tasnim, Hasan Mahmud, Md Kamrul Hasan, Md. Mehedi Hasan Shawon, Farig Sadeque, Tahsin Reasat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23252
Pdf URL: https://arxiv.org/pdf/2510.23252
Copy Paste: [[2510.23252]] Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?(https://arxiv.org/abs/2510.23252)
Keywords: foundation model
Abstract: Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations but dialect specific model training alleviates the issue. Our dataset also serves as a out of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available

Title: Privacy-Preserving Semantic Communication over Wiretap Channels with Learnable Differential Privacy

Authors: Weixuan Chen, Qianqian Yang, Shuo Shao, Shunpu Tang, Zhiguo Shi, Shui Yu
Subjects: cs.CR, eess.IV
Abstract URL: https://arxiv.org/abs/2510.23274
Pdf URL: https://arxiv.org/pdf/2510.23274
Copy Paste: [[2510.23274]] Privacy-Preserving Semantic Communication over Wiretap Channels with Learnable Differential Privacy(https://arxiv.org/abs/2510.23274)
Keywords: generative
Abstract: While semantic communication (SemCom) improves transmission efficiency by focusing on task-relevant information, it also raises critical privacy concerns. Many existing secure SemCom approaches rely on restrictive or impractical assumptions, such as favorable channel conditions for the legitimate user or prior knowledge of the eavesdropper's model. To address these limitations, this paper proposes a novel secure SemCom framework for image transmission over wiretap channels, leveraging differential privacy (DP) to provide approximate privacy guarantees. Specifically, our approach first extracts disentangled semantic representations from source images using generative adversarial network (GAN) inversion method, and then selectively perturbs private semantic representations with approximate DP noise. Distinct from conventional DP-based protection methods, we introduce DP noise with learnable pattern, instead of traditional white Gaussian or Laplace noise, achieved through adversarial training of neural networks (NNs). This design mitigates the inherent non-invertibility of DP while effectively protecting private information. Moreover, it enables explicitly controllable security levels by adjusting the privacy budget according to specific security requirements, which is not achieved in most existing secure SemCom approaches. Experimental results demonstrate that, compared with the previous DP-based method and direct transmission, the proposed method significantly degrades the reconstruction quality for the eavesdropper, while introducing only slight degradation in task performance. Under comparable security levels, our approach achieves an LPIPS advantage of 0.06-0.29 and an FPPSR advantage of 0.10-0.86 for the legitimate user compared with the previous DP-based method.

Title: Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling

Authors: Ruoyu Wang, Beier Zhu, Junzhi Li, Liangyu Yuan, Chi Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23285
Pdf URL: https://arxiv.org/pdf/2510.23285
Copy Paste: [[2510.23285]] Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling(https://arxiv.org/abs/2510.23285)
Keywords: diffusion, generative
Abstract: Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of 4.18 on CIFAR-10, 8.05 on FFHQ and 6.96 on LSUN Bedroom. Codes are available in this https URL.

Title: ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation

Authors: Jiahao Chang, Chongjie Ye, Yushuang Wu, Yuantao Chen, Yidan Zhang, Zhongjin Luo, Chenghong Li, Yihao Zhi, Xiaoguang Han
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23306
Pdf URL: https://arxiv.org/pdf/2510.23306
Copy Paste: [[2510.23306]] ReconViaGen: Towards Accurate Multi-view 3D Object Reconstruction via Generation(https://arxiv.org/abs/2510.23306)
Keywords: diffusion, generative
Abstract: Existing multi-view 3D object reconstruction methods heavily rely on sufficient overlap between input views, where occlusions and sparse coverage in practice frequently yield severe reconstruction incompleteness. Recent advancements in diffusion-based 3D generative techniques offer the potential to address these limitations by leveraging learned generative priors to hallucinate invisible parts of objects, thereby generating plausible 3D structures. However, the stochastic nature of the inference process limits the accuracy and reliability of generation results, preventing existing reconstruction frameworks from integrating such 3D generative priors. In this work, we comprehensively analyze the reasons why diffusion-based 3D generative methods fail to achieve high consistency, including (a) the insufficiency in constructing and leveraging cross-view connections when extracting multi-view image features as conditions, and (b) the poor controllability of iterative denoising during local detail generation, which easily leads to plausible but inconsistent fine geometric and texture details with inputs. Accordingly, we propose ReconViaGen to innovatively integrate reconstruction priors into the generative framework and devise several strategies that effectively address these issues. Extensive experiments demonstrate that our ReconViaGen can reconstruct complete and accurate 3D models consistent with input views in both global structure and local this http URL page: this https URL.

Title: Multitask Multimodal Self-Supervised Learning for Medical Images

Authors: Cristian Simionescu
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.23325
Pdf URL: https://arxiv.org/pdf/2510.23325
Copy Paste: [[2510.23325]] Multitask Multimodal Self-Supervised Learning for Medical Images(https://arxiv.org/abs/2510.23325)
Keywords: self-supervised
Abstract: This thesis works to address a pivotal challenge in medical image analysis: the reliance on extensive labeled datasets, which are often limited due to the need for expert annotation and constrained by privacy and legal issues. By focusing on the development of self-supervised learning techniques and domain adaptation methods, this research aims to circumvent these limitations, presenting a novel approach to enhance the utility and efficacy of deep learning in medical imaging. Central to this thesis is the development of the Medformer, an innovative neural network architecture designed for multitask learning and deep domain adaptation. This model is adept at pre-training on diverse medical image datasets, handling varying sizes and modalities, and is equipped with a dynamic input-output adaptation mechanism. This enables efficient processing and integration of a wide range of medical image types, from 2D X-rays to complex 3D MRIs, thus mitigating the dependency on large labeled datasets. Further, the thesis explores the current state of self-supervised learning in medical imaging. It introduces novel pretext tasks that are capable of extracting meaningful information from unlabeled data, significantly advancing the model's interpretative abilities. This approach is validated through rigorous experimentation, including the use of the MedMNIST dataset, demonstrating the model's proficiency in learning generalized features applicable to various downstream tasks. In summary, this thesis contributes to the advancement of medical image analysis by offering a scalable, adaptable framework that reduces reliance on labeled data. It paves the way for more accurate, efficient diagnostic tools in healthcare, signifying a major step forward in the application of deep learning in medical imaging.

Title: GRAD: Real-Time Gated Recurrent Anomaly Detection in Autonomous Vehicle Sensors Using Reinforced EMA and Multi-Stage Sliding Window Techniques

Authors: Mohammad Hossein Jafari Naeimi, Ali Norouzi, Athena Abdi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.23327
Pdf URL: https://arxiv.org/pdf/2510.23327
Copy Paste: [[2510.23327]] GRAD: Real-Time Gated Recurrent Anomaly Detection in Autonomous Vehicle Sensors Using Reinforced EMA and Multi-Stage Sliding Window Techniques(https://arxiv.org/abs/2510.23327)
Keywords: anomaly
Abstract: This paper introduces GRAD, a real-time anomaly detection method for autonomous vehicle sensors that integrates statistical analysis and deep learning to ensure the reliability of sensor data. The proposed approach combines the Reinforced Exponential Moving Average (REMA), which adapts smoothing factors and thresholding for outlier detection, with the Multi-Stage Sliding Window (MS-SW) technique for capturing both short- and long-term patterns. These features are processed using a lightweight Gated Recurrent Unit (GRU) model, which detects and classifies anomalies based on bias types, while a recovery module restores damaged sensor data to ensure continuous system operation. GRAD has a lightweight architecture consisting of two layers of GRU with a limited number of neurons that make it appropriate for real-time applications while maintaining high detection accuracy. The GRAD framework achieved remarkable performance in anomaly detection and classification. The model demonstrated an overall F1-score of 97.6% for abnormal data and 99.4% for normal data, signifying its high accuracy in distinguishing between normal and anomalous sensor data. Regarding the anomaly classification, GRAD successfully categorized different anomaly types with high precision, enabling the recovery module to accurately restore damaged sensor data. Relative to analogous studies, GRAD surpasses current models by attaining a balance between elevated detection accuracy and diminished computational expense. These results demonstrate GRAD's potential as a reliable and efficient solution for real-time anomaly detection in autonomous vehicle systems, guaranteeing safe vehicle operation with minimal computational overhead.

Title: ZeroFlood: A Geospatial Foundation Model for Data-Efficient Flood Susceptibility Mapping

Authors: Hyeongkyun Kim, Orestis Oikonomou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23364
Pdf URL: https://arxiv.org/pdf/2510.23364
Copy Paste: [[2510.23364]] ZeroFlood: A Geospatial Foundation Model for Data-Efficient Flood Susceptibility Mapping(https://arxiv.org/abs/2510.23364)
Keywords: foundation model
Abstract: Flood susceptibility mapping (FSM) is vital for disaster prevention but remains challenging in data-scarce regions where hydrodynamic models require dense geophysical inputs. This work introduces ZeroFlood, a geospatial foundation model framework for data-efficient FSM. The approach fine-tunes Geospatial Foundation Models (GFMs) with Thinking-in-Modality (TiM) reasoning, enabling flood prediction from basic Earth observation data such as Sentinel-1 or Sentinel-2 imagery. Using paired EO and simulated flood maps from data-rich regions, ZeroFlood bridges data availability gaps through cross-modal representation learning. Experiments with TerraMind and Prithvi GFMs show that TiM enhances model robustness, with the TerraMind-Large configuration achieving an F1 score of 67.21. The results demonstrate the feasibility of foundation-model-based FSM as a scalable and data-efficient solution for flood risk management.

Title: An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping

Authors: Songxi Yang, Tang Sui, Qunying Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23382
Pdf URL: https://arxiv.org/pdf/2510.23382
Copy Paste: [[2510.23382]] An Efficient Remote Sensing Super Resolution Method Exploring Diffusion Priors and Multi-Modal Constraints for Crop Type Mapping(https://arxiv.org/abs/2510.23382)
Keywords: diffusion, generative
Abstract: Super resolution offers a way to harness medium even lowresolution but historically valuable remote sensing image archives. Generative models, especially diffusion models, have recently been applied to remote sensing super resolution (RSSR), yet several challenges exist. First, diffusion models are effective but require expensive training from scratch resources and have slow inference speeds. Second, current methods have limited utilization of auxiliary information as real-world constraints to reconstruct scientifically realistic images. Finally, most current methods lack evaluation on downstream tasks. In this study, we present a efficient LSSR framework for RSSR, supported by a new multimodal dataset of paired 30 m Landsat 8 and 10 m Sentinel 2 imagery. Built on frozen pretrained Stable Diffusion, LSSR integrates crossmodal attention with auxiliary knowledge (Digital Elevation Model, land cover, month) and Synthetic Aperture Radar guidance, enhanced by adapters and a tailored Fourier NDVI loss to balance spatial details and spectral fidelity. Extensive experiments demonstrate that LSSR significantly improves crop boundary delineation and recovery, achieving state-of-the-art performance with Peak Signal-to-Noise Ratio/Structural Similarity Index Measure of 32.63/0.84 (RGB) and 23.99/0.78 (IR), and the lowest NDVI Mean Squared Error (0.042), while maintaining efficient inference (0.39 sec/image). Moreover, LSSR transfers effectively to NASA Harmonized Landsat and Sentinel (HLS) super resolution, yielding more reliable crop classification (F1: 0.86) than Sentinel-2 (F1: 0.85). These results highlight the potential of RSSR to advance precision agriculture.

Title: Symmetria: A Synthetic Dataset for Learning in Point Clouds

Authors: Ivan Sipiran, Gustavo Santelices, Lucas Oyarzún, Andrea Ranieri, Chiara Romanengo, Silvia Biasotti, Bianca Falcidieno
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23414
Pdf URL: https://arxiv.org/pdf/2510.23414
Copy Paste: [[2510.23414]] Symmetria: A Synthetic Dataset for Learning in Point Clouds(https://arxiv.org/abs/2510.23414)
Keywords: self-supervised
Abstract: Unlike image or text domains that benefit from an abundance of large-scale datasets, point cloud learning techniques frequently encounter limitations due to the scarcity of extensive datasets. To overcome this limitation, we present Symmetria, a formula-driven dataset that can be generated at any arbitrary scale. By construction, it ensures the absolute availability of precise ground truth, promotes data-efficient experimentation by requiring fewer samples, enables broad generalization across diverse geometric settings, and offers easy extensibility to new tasks and modalities. Using the concept of symmetry, we create shapes with known structure and high variability, enabling neural networks to learn point cloud features effectively. Our results demonstrate that this dataset is highly effective for point cloud self-supervised pre-training, yielding models with strong performance in downstream tasks such as classification and segmentation, which also show good few-shot learning capabilities. Additionally, our dataset can support fine-tuning models to classify real-world objects, highlighting our approach's practical utility and application. We also introduce a challenging task for symmetry detection and provide a benchmark for baseline comparisons. A significant advantage of our approach is the public availability of the dataset, the accompanying code, and the ability to generate very large collections, promoting further research and innovation in point cloud learning.

Title: Towards Generalisable Foundation Models for 3D Brain MRI

Authors: Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23415
Pdf URL: https://arxiv.org/pdf/2510.23415
Copy Paste: [[2510.23415]] Towards Generalisable Foundation Models for 3D Brain MRI(https://arxiv.org/abs/2510.23415)
Keywords: self-supervised, foundation model
Abstract: Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.

Title: CURVETE: Curriculum Learning and Progressive Self-supervised Training for Medical Image Classification

Authors: Asmaa Abbas, Mohamed Gaber, Mohammed M. Abdelsamea
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23442
Pdf URL: https://arxiv.org/pdf/2510.23442
Copy Paste: [[2510.23442]] CURVETE: Curriculum Learning and Progressive Self-supervised Training for Medical Image Classification(https://arxiv.org/abs/2510.23442)
Keywords: self-supervised
Abstract: Identifying high-quality and easily accessible annotated samples poses a notable challenge in medical image analysis. Transfer learning techniques, leveraging pre-training data, offer a flexible solution to this issue. However, the impact of fine-tuning diminishes when the dataset exhibits an irregular distribution between classes. This paper introduces a novel deep convolutional neural network, named Curriculum Learning and Progressive Self-supervised Training (CURVETE). CURVETE addresses challenges related to limited samples, enhances model generalisability, and improves overall classification performance. It achieves this by employing a curriculum learning strategy based on the granularity of sample decomposition during the training of generic unlabelled samples. Moreover, CURVETE address the challenge of irregular class distribution by incorporating a class decomposition approach in the downstream task. The proposed method undergoes evaluation on three distinct medical image datasets: brain tumour, digital knee x-ray, and Mini-DDSM datasets. We investigate the classification performance using a generic self-supervised sample decomposition approach with and without the curriculum learning component in training the pretext task. Experimental results demonstrate that the CURVETE model achieves superior performance on test sets with an accuracy of 96.60% on the brain tumour dataset, 75.60% on the digital knee x-ray dataset, and 93.35% on the Mini-DDSM dataset using the baseline ResNet-50. Furthermore, with the baseline DenseNet-121, it achieved accuracies of 95.77%, 80.36%, and 93.22% on the brain tumour, digital knee x-ray, and Mini-DDSM datasets, respectively, outperforming other training strategies.

Title: Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences

Authors: Zhuoran Jin, Hongbang Yuan, Kejian Zhu, Jiachun Li, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.23451
Pdf URL: https://arxiv.org/pdf/2510.23451
Copy Paste: [[2510.23451]] Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences(https://arxiv.org/abs/2510.23451)
Keywords: generative
Abstract: Reward models (RMs) play a critical role in aligning AI behaviors with human preferences, yet they face two fundamental challenges: (1) Modality Imbalance, where most RMs are mainly focused on text and image modalities, offering limited support for video, audio, and other modalities; and (2) Preference Rigidity, where training on fixed binary preference pairs fails to capture the complexity and diversity of personalized preferences. To address the above challenges, we propose Omni-Reward, a step toward generalist omni-modal reward modeling with support for free-form preferences, consisting of: (1) Evaluation: We introduce Omni-RewardBench, the first omni-modal RM benchmark with free-form preferences, covering nine tasks across five modalities including text, image, video, audio, and 3D; (2) Data: We construct Omni-RewardData, a multimodal preference dataset comprising 248K general preference pairs and 69K instruction-tuning pairs for training generalist omni-modal RMs; (3) Model: We propose Omni-RewardModel, which includes both discriminative and generative RMs, and achieves strong performance on Omni-RewardBench as well as other widely used reward modeling benchmarks.

Title: Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks

Authors: Yuhan Yang, Xingbo Fu, Jundong Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.23469
Pdf URL: https://arxiv.org/pdf/2510.23469
Copy Paste: [[2510.23469]] Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks(https://arxiv.org/abs/2510.23469)
Keywords: self-supervised
Abstract: In recent years, pre-training Graph Neural Networks (GNNs) through self-supervised learning on unlabeled graph data has emerged as a widely adopted paradigm in graph learning. Although the paradigm is effective for pre-training powerful GNN models, the objective gap often exists between pre-training and downstream tasks. To bridge this gap, graph prompting adapts pre-trained GNN models to specific downstream tasks with extra learnable prompts while keeping the pre-trained GNN models frozen. As recent graph prompting methods largely focus on enhancing model utility on downstream tasks, they often overlook fairness concerns when designing prompts for adaptation. In fact, pre-trained GNN models will produce discriminative node representations across demographic subgroups, as downstream graph data inherently contains biases in both node attributes and graph structures. To address this issue, we propose an Adaptive Dual Prompting (ADPrompt) framework that enhances fairness for adapting pre-trained GNN models to downstream tasks. To mitigate attribute bias, we design an Adaptive Feature Rectification module that learns customized attribute prompts to suppress sensitive information at the input layer, reducing bias at the source. Afterward, we propose an Adaptive Message Calibration module that generates structure prompts at each layer, which adjust the message from neighboring nodes to enable dynamic and soft calibration of the information flow. Finally, ADPrompt jointly optimizes the two prompting modules to adapt the pre-trained GNN while enhancing fairness. We conduct extensive experiments on four datasets with four pre-training strategies to evaluate the performance of ADPrompt. The results demonstrate that our proposed ADPrompt outperforms seven baseline methods on node classification tasks.

Title: T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning

Authors: Julie Mordacq, David Loiseaux, Vicky Kalogeiton, Steve Oudot
Subjects: cs.LG, cs.CG, cs.CV
Abstract URL: https://arxiv.org/abs/2510.23484
Pdf URL: https://arxiv.org/pdf/2510.23484
Copy Paste: [[2510.23484]] T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning(https://arxiv.org/abs/2510.23484)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data, often by enforcing invariance to input transformations such as rotations or blurring. Recent studies have highlighted two pivotal properties for effective representations: (i) avoiding dimensional collapse-where the learned features occupy only a low-dimensional subspace, and (ii) enhancing uniformity of the induced distribution. In this work, we introduce T-REGS, a simple regularization framework for SSL based on the length of the Minimum Spanning Tree (MST) over the learned representation. We provide theoretical analysis demonstrating that T-REGS simultaneously mitigates dimensional collapse and promotes distribution uniformity on arbitrary compact Riemannian manifolds. Several experiments on synthetic data and on classical SSL benchmarks validate the effectiveness of our approach at enhancing representation quality.

Title: Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap

Authors: Elisabeth Jüttner, Leona Krath, Stefan Korfhage, Hannah Dröge, Matthias B. Hullin, Markus Plack
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2510.23494
Pdf URL: https://arxiv.org/pdf/2510.23494
Copy Paste: [[2510.23494]] Yesnt: Are Diffusion Relighting Models Ready for Capture Stage Compositing? A Hybrid Alternative to Bridge the Gap(https://arxiv.org/abs/2510.23494)
Keywords: diffusion
Abstract: Volumetric video relighting is essential for bringing captured performances into virtual worlds, but current approaches struggle to deliver temporally stable, production-ready results. Diffusion-based intrinsic decomposition methods show promise for single frames, yet suffer from stochastic noise and instability when extended to sequences, while video diffusion models remain constrained by memory and scale. We propose a hybrid relighting framework that combines diffusion-derived material priors with temporal regularization and physically motivated rendering. Our method aggregates multiple stochastic estimates of per-frame material properties into temporally consistent shading components, using optical-flow-guided regularization. For indirect effects such as shadows and reflections, we extract a mesh proxy from Gaussian Opacity Fields and render it within a standard graphics pipeline. Experiments on real and synthetic captures show that this hybrid strategy achieves substantially more stable relighting across sequences than diffusion-only baselines, while scaling beyond the clip lengths feasible for video diffusion. These results indicate that hybrid approaches, which balance learned priors with physically grounded constraints, are a practical step toward production-ready volumetric video relighting.

Title: Mixed Precision Training of Neural ODEs

Authors: Elena Celledoni, Brynjulf Owren, Lars Ruthotto, Tianjiao Nicole Yang
Subjects: cs.LG, cs.AI, math.NA
Abstract URL: https://arxiv.org/abs/2510.23498
Pdf URL: https://arxiv.org/pdf/2510.23498
Copy Paste: [[2510.23498]] Mixed Precision Training of Neural ODEs(https://arxiv.org/abs/2510.23498)
Keywords: generative
Abstract: Exploiting low-precision computations has become a standard strategy in deep learning to address the growing computational costs imposed by ever larger models and datasets. However, naively performing all computations in low precision can lead to roundoff errors and instabilities. Therefore, mixed precision training schemes usually store the weights in high precision and use low-precision computations only for whitelisted operations. Despite their success, these principles are currently not reliable for training continuous-time architectures such as neural ordinary differential equations (Neural ODEs). This paper presents a mixed precision training framework for neural ODEs, combining explicit ODE solvers with a custom backpropagation scheme, and demonstrates its effectiveness across a range of learning tasks. Our scheme uses low-precision computations for evaluating the velocity, parameterized by the neural network, and for storing intermediate states, while stability is provided by a custom dynamic adjoint scaling and by accumulating the solution and gradients in higher precision. These contributions address two key challenges in training neural ODE: the computational cost of repeated network evaluations and the growth of memory requirements with the number of time steps or layers. Along with the paper, we publish our extendable, open-source PyTorch package rampde, whose syntax resembles that of leading packages to provide a drop-in replacement in existing codes. We demonstrate the reliability and effectiveness of our scheme using challenging test cases and on neural ODE applications in image classification and generative models, achieving approximately 50% memory reduction and up to 2x speedup while maintaining accuracy comparable to single-precision training.

Title: Towards Deep Physics-Informed Kolmogorov-Arnold Networks

Authors: Spyros Rigas, Fotios Anagnostopoulos, Michalis Papachristou, Georgios Alexandridis
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2510.23501
Pdf URL: https://arxiv.org/pdf/2510.23501
Copy Paste: [[2510.23501]] Towards Deep Physics-Informed Kolmogorov-Arnold Networks(https://arxiv.org/abs/2510.23501)
Keywords: diffusion
Abstract: Since their introduction, Kolmogorov-Arnold Networks (KANs) have been successfully applied across several domains, with physics-informed machine learning (PIML) emerging as one of the areas where they have thrived. In the PIML setting, Chebyshev-based physics-informed KANs (cPIKANs) have become the standard due to their computational efficiency. However, like their multilayer perceptron-based counterparts, cPIKANs face significant challenges when scaled to depth, leading to training instabilities that limit their applicability to several PDE problems. To address this, we propose a basis-agnostic, Glorot-like initialization scheme that preserves activation variance and yields substantial improvements in stability and accuracy over the default initialization of cPIKANs. Inspired by the PirateNet architecture, we further introduce Residual-Gated Adaptive KANs (RGA KANs), designed to mitigate divergence in deep cPIKANs where initialization alone is not sufficient. Through empirical tests and information bottleneck analysis, we show that RGA KANs successfully traverse all training phases, unlike baseline cPIKANs, which stagnate in the diffusion phase in specific PDE settings. Evaluations on seven standard forward PDE benchmarks under a fixed training pipeline with adaptive components demonstrate that RGA KANs consistently outperform parameter-matched cPIKANs and PirateNets - often by several orders of magnitude - while remaining stable in settings where the others diverge.

Title: FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time

Authors: Yaoli Liu, Yao-Xiang Ding, Kun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23515
Pdf URL: https://arxiv.org/pdf/2510.23515
Copy Paste: [[2510.23515]] FreeFuse: Multi-Subject LoRA Fusion via Auto Masking at Test Time(https://arxiv.org/abs/2510.23515)
Keywords: diffusion
Abstract: This paper proposes FreeFuse, a novel training-free approach for multi-subject text-to-image generation through automatic fusion of multiple subject LoRAs. In contrast to existing methods that either focus on pre-inference LoRA weight merging or rely on segmentation models and complex techniques like noise blending to isolate LoRA outputs, our key insight is that context-aware dynamic subject masks can be automatically derived from cross-attention layer weights. Mathematical analysis shows that directly applying these masks to LoRA outputs during inference well approximates the case where the subject LoRA is integrated into the diffusion model and used individually for the masked region. FreeFuse demonstrates superior practicality and efficiency as it requires no additional training, no modification to LoRAs, no auxiliary models, and no user-defined prompt templates or region specifications. Alternatively, it only requires users to provide the LoRA activation words for seamless integration into standard workflows. Extensive experiments validate that FreeFuse outperforms existing approaches in both generation quality and usability under the multi-subject generation tasks. The project page is at this https URL

Title: More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models

Authors: Hongkai Lin, Dingkang Liang, Mingyang Du, Xin Zhou, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23574
Pdf URL: https://arxiv.org/pdf/2510.23574
Copy Paste: [[2510.23574]] More Than Generation: Unifying Generation and Depth Estimation via Text-to-Image Diffusion Models(https://arxiv.org/abs/2510.23574)
Keywords: diffusion, generative
Abstract: Generative depth estimation methods leverage the rich visual priors stored in pre-trained text-to-image diffusion models, demonstrating astonishing zero-shot capability. However, parameter updates during training lead to catastrophic degra- dation in the image generation capability of the pre-trained model. We introduce MERGE, a unified model for image generation and depth estimation, starting from a fixed pre-trained text-to-image model. MERGE demonstrates that the pre-trained text-to-image model can do more than image generation, but also expand to depth estimation effortlessly. Specifically, MERGE introduces a play- and-plug framework that enables seamless switching between image generation and depth estimation modes through simple and pluggable converters. Meanwhile, we propose a Group Reuse Mechanism to encourage parameter reuse and im- prove the utilization of the additional learnable parameters. MERGE unleashes the powerful depth estimation capability of the pre-trained text-to-image model while preserving its original image generation ability. Compared to other unified models for image generation and depth estimation, MERGE achieves state-of- the-art performance across multiple depth estimation benchmarks. The code will be made available at this https URL

Title: FARMER: Flow AutoRegressive Transformer over Pixels

Authors: Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23588
Pdf URL: https://arxiv.org/pdf/2510.23588
Copy Paste: [[2510.23588]] FARMER: Flow AutoRegressive Transformer over Pixels(https://arxiv.org/abs/2510.23588)
Keywords: self-supervised, generative
Abstract: Directly modeling the explicit likelihood of the raw data distribution is key topic in the machine learning area, which achieves the scaling successes in Large Language Models by autoregressive modeling. However, continuous AR modeling over visual pixel data suffer from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference speed and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.

Title: Think Twice: Branch-and-Rethink Reasoning Reward Model

Authors: Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23596
Pdf URL: https://arxiv.org/pdf/2510.23596
Copy Paste: [[2510.23596]] Think Twice: Branch-and-Rethink Reasoning Reward Model(https://arxiv.org/abs/2510.23596)
Keywords: diffusion
Abstract: Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-oncescoringintofocused, second-lookreasoning, BR-RMreducesjudgmentdiffusionandimproves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains. The code and the model will be released soon.

Title: Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling

Authors: Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski
Subjects: cs.CV, cs.AI, cs.GR, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2510.23605
Pdf URL: https://arxiv.org/pdf/2510.23605
Copy Paste: [[2510.23605]] Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling(https://arxiv.org/abs/2510.23605)
Keywords: generative
Abstract: Current 3D/4D generation methods are usually optimized for photorealism, efficiency, and aesthetics. However, they often fail to preserve the semantic identity of the subject across different viewpoints. Adapting generation methods with one or few images of a specific subject (also known as Personalization or Subject-driven generation) allows generating visual content that align with the identity of the subject. However, personalized 3D/4D generation is still largely underexplored. In this work, we introduce TIRE (Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation. It takes an initial 3D asset produced by an existing 3D generative model as input and uses video tracking to identify the regions that need to be modified. Then, we adopt a subject-driven 2D inpainting model for progressively infilling the identified regions. Finally, we resplat the modified 2D multi-view observations back to 3D while still maintaining consistency. Extensive experiments demonstrate that our approach significantly improves identity preservation in 3D/4D generation compared to state-of-the-art methods. Our project website is available at this https URL.

Title: Variational Masked Diffusion Models

Authors: Yichi Zhang, Alex Schwing, Zhizhen Zhao
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.23606
Pdf URL: https://arxiv.org/pdf/2510.23606
Copy Paste: [[2510.23606]] Variational Masked Diffusion Models(https://arxiv.org/abs/2510.23606)
Keywords: diffusion, generative
Abstract: Masked diffusion models have recently emerged as a flexible framework for discrete generative modeling. However, a key limitation of standard masked diffusion is its inability to effectively capture dependencies among tokens that are predicted concurrently, leading to degraded generation quality when dependencies among tokens are important. To explicitly model dependencies among tokens, we propose Variational Masked Diffusion (VMD), a framework that introduces latent variables into the masked diffusion process. Through controlled experiments on synthetic datasets, we demonstrate that VMD successfully learns dependencies that conventional masked diffusion fails to capture. We further validate the effectiveness of our approach on Sudoku puzzles and text datasets, where learning of dependencies among tokens improves global consistency. Across these domains, VMD enhances both generation quality and dependency awareness, highlighting the value of integrating variational inference into masked diffusion. Our code is available at: this https URL.

Title: Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Authors: Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.23607
Pdf URL: https://arxiv.org/pdf/2510.23607
Copy Paste: [[2510.23607]] Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations(https://arxiv.org/abs/2510.23607)
Keywords: self-supervised
Abstract: Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.