2025-10-16

Title: A\textsuperscript{2}FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

Authors: Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12838
Pdf URL: https://arxiv.org/pdf/2510.12838
Copy Paste: [[2510.12838]] A\textsuperscript{2}FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning(https://arxiv.org/abs/2510.12838)
Keywords: foundation model
Abstract: Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A\textsuperscript{2}FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode-instant-that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A\textsuperscript{2}FM achieves 13.4\% on BrowseComp, 70.4\% on AIME25, and 16.7\% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only \$0.00487 per correct answer-cutting cost by 45.2\% relative to reasoning and 33.5\% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.

Title: An Investigation of Memorization Risk in Healthcare Foundation Models

Authors: Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour, Walter Gerych, Marzyeh Ghassemi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.12950
Pdf URL: https://arxiv.org/pdf/2510.12950
Copy Paste: [[2510.12950]] An Investigation of Memorization Risk in Healthcare Foundation Models(https://arxiv.org/abs/2510.12950)
Keywords: foundation model, generative
Abstract: Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.

Title: Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation

Authors: Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
Subjects: cs.CV, cs.AI, cs.IR, cs.MM
Abstract URL: https://arxiv.org/abs/2510.12953
Pdf URL: https://arxiv.org/pdf/2510.12953
Copy Paste: [[2510.12953]] Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation(https://arxiv.org/abs/2510.12953)
Keywords: foundation model, anomaly
Abstract: Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model's inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: this https URL.

Title: CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models

Authors: Denis Rychkovskiy (DZRobo, Independent Researcher), GPT-5 (AI collaborator, OpenAI)
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.12954
Pdf URL: https://arxiv.org/pdf/2510.12954
Copy Paste: [[2510.12954]] CADE 2.5 - ZeResFDG: Frequency-Decoupled, Rescaled and Zero-Projected Guidance for SD/SDXL Latent Diffusion Models(https://arxiv.org/abs/2510.12954)
Keywords: diffusion
Abstract: We introduce CADE 2.5 (Comfy Adaptive Detail Enhancer), a sampler-level guidance stack for SD/SDXL latent diffusion models. The central module, ZeResFDG, unifies (i) frequency-decoupled guidance that reweights low- and high-frequency components of the guidance signal, (ii) energy rescaling that matches the per-sample magnitude of the guided prediction to the positive branch, and (iii) zero-projection that removes the component parallel to the unconditional direction. A lightweight spectral EMA with hysteresis switches between a conservative and a detail-seeking mode as structure crystallizes during sampling. Across SD/SDXL samplers, ZeResFDG improves sharpness, prompt adherence, and artifact control at moderate guidance scales without any retraining. In addition, we employ a training-free inference-time stabilizer, QSilk Micrograin Stabilizer (quantile clamp + depth/edge-gated micro-detail injection), which improves robustness and yields natural high-frequency micro-texture at high resolutions with negligible overhead. For completeness we note that the same rule is compatible with alternative parameterizations (e.g., velocity), which we briefly discuss in the Appendix; however, this paper focuses on SD/SDXL latent diffusion models.

Title: A Connection Between Score Matching and Local Intrinsic Dimension

Authors: Eric Yeats, Aaron Jacobson, Darryl Hannan, Yiran Jia, Timothy Doster, Henry Kvinge, Scott Mahan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.12975
Pdf URL: https://arxiv.org/pdf/2510.12975
Copy Paste: [[2510.12975]] A Connection Between Score Matching and Local Intrinsic Dimension(https://arxiv.org/abs/2510.12975)
Keywords: diffusion
Abstract: The local intrinsic dimension (LID) of data is a fundamental quantity in signal processing and learning theory, but quantifying the LID of high-dimensional, complex data has been a historically challenging task. Recent works have discovered that diffusion models capture the LID of data through the spectra of their score estimates and through the rate of change of their density estimates under various noise perturbations. While these methods can accurately quantify LID, they require either many forward passes of the diffusion model or use of gradient computation, limiting their applicability in compute- and memory-constrained scenarios. We show that the LID is a lower bound on the denoising score matching loss, motivating use of the denoising score matching loss as a LID estimator. Moreover, we show that the equivalent implicit score matching loss also approximates LID via the normal dimension and is closely related to a recent LID estimator, FLIPD. Our experiments on a manifold benchmark and with Stable Diffusion 3.5 indicate that the denoising score matching loss is a highly competitive and scalable LID estimator, achieving superior accuracy and memory footprint under increasing problem size and quantization level.

Title: Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check

Authors: Sungjun Cho, Dasol Hwang, Frederic Sala, Sangheum Hwang, Kyunghyun Cho, Sungmin Cha
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.12981
Pdf URL: https://arxiv.org/pdf/2510.12981
Copy Paste: [[2510.12981]] Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check(https://arxiv.org/abs/2510.12981)
Keywords: diffusion, generative
Abstract: Current unlearning metrics for generative models evaluate success based on reference responses or classifier outputs rather than assessing the core objective: whether the unlearned model behaves indistinguishably from a model that never saw the unwanted data. This reference-specific approach creates systematic blind spots, allowing models to appear successful while retaining unwanted knowledge accessible through alternative prompts or attacks. We address these limitations by proposing Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models by comparing bidirectional likelihood assignments over generated samples. Unlike existing approaches that rely on predetermined references, FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. Our experiments on the TOFU benchmark for LLM unlearning and the UnlearnCanvas benchmark for text-to-image diffusion model unlearning reveal that methods achieving near-optimal scores on traditional metrics fail to achieve distributional equivalence, with many becoming more distant from the gold standard than before unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.

Title: Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment

Authors: Joshua R. Tempelman, Adam J. Wachtor, Eric B. Flynn
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2510.13023
Pdf URL: https://arxiv.org/pdf/2510.13023
Copy Paste: [[2510.13023]] Machine Learning-Based Ultrasonic Weld Characterization Using Hierarchical Wave Modeling and Diffusion-Driven Distribution Alignment(https://arxiv.org/abs/2510.13023)
Keywords: diffusion
Abstract: Automated ultrasonic weld inspection remains a significant challenge in the nondestructive evaluation (NDE) community to factors such as limited training data (due to the complexity of curating experimental specimens or high-fidelity simulations) and environmental volatility of many industrial settings (resulting in the corruption of on-the-fly measurements). Thus, an end-to-end machine learning (ML) workflow for acoustic weld inspection in realistic (i.e., industrial) settings has remained an elusive goal. This work addresses the challenges of data curation and signal corruption by proposing workflow consisting of a reduced-order modeling scheme, diffusion based distribution alignment, and U-Net-based segmentation and inversion. A reduced-order Helmholtz model based on Lamb wave theory is used to generate a comprehensive dataset over varying weld heterogeneity and crack defects. The relatively inexpensive low-order solutions provide a robust training dateset for inversion models which are refined through a transfer learning stage using a limited set of full 3D elastodynamic simulations. To handle out-of-distribution (OOD) real-world measurements with varying and unpredictable noise distributions, i.e., Laser Doppler Vibrometry scans, guided diffusion produces in-distribution representations of OOD experimental LDV scans which are subsequently processed by the inversion models. This integrated framework provides an end-to-end solution for automated weld inspection on real data.

Title: SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion

Authors: Jungbin Cho, Minsu Kim, Jisoo Kim, Ce Zheng, Laszlo A. Jeni, Ming-Hsuan Yang, Youngjae Yu, Seonjoo Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13044
Pdf URL: https://arxiv.org/pdf/2510.13044
Copy Paste: [[2510.13044]] SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion(https://arxiv.org/abs/2510.13044)
Keywords: diffusion
Abstract: Human motion is inherently diverse and semantically rich, while also shaped by the surrounding scene. However, existing motion generation approaches address either motion semantics or scene-awareness in isolation, since constructing large-scale datasets with both rich text--motion coverage and precise scene interactions is extremely challenging. In this work, we introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models by leveraging disjoint scene--motion and text--motion datasets through two adaptation stages: inbetweening and scene-aware inbetweening. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets and thereby inject scene-awareness to text-to-motion models. In the first stage, we introduce keyframing layers that modulate motion latents for inbetweening while preserving the latent manifold. In the second stage, we add a scene-conditioning layer that injects scene geometry by adaptively querying local context through cross-attention. Experimental results show that SceneAdapt effectively injects scene awareness into text-to-motion models, and we further analyze the mechanisms through which this awareness emerges. Code and models will be released.

Title: True Self-Supervised Novel View Synthesis is Transferable

Authors: Thomas W. Mitchel, Hyunwoo Ryu, Vincent Sitzmann
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13063
Pdf URL: https://arxiv.org/pdf/2510.13063
Copy Paste: [[2510.13063]] True Self-Supervised Novel View Synthesis is Transferable(https://arxiv.org/abs/2510.13063)
Keywords: self-supervised
Abstract: In this paper, we identify that the key criterion for determining whether a model is truly capable of novel view synthesis (NVS) is transferability: Whether any pose representation extracted from one video sequence can be used to re-render the same camera trajectory in another. We analyze prior work on self-supervised NVS and find that their predicted poses do not transfer: The same set of poses lead to different camera trajectories in different 3D scenes. Here, we present XFactor, the first geometry-free self-supervised model capable of true NVS. XFactor combines pair-wise pose estimation with a simple augmentation scheme of the inputs and outputs that jointly enables disentangling camera pose from scene content and facilitates geometric reasoning. Remarkably, we show that XFactor achieves transferability with unconstrained latent pose variables, without any 3D inductive biases or concepts from multi-view geometry -- such as an explicit parameterization of poses as elements of SE(3). We introduce a new metric to quantify transferability, and through large-scale experiments, we demonstrate that XFactor significantly outperforms prior pose-free NVS transformers, and show that latent poses are highly correlated with real-world poses through probing experiments.

Title: NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models

Authors: Konstantinos Barmpas, Na Lee, Alexandros Koliousis, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou
Subjects: cs.LG, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2510.13068
Pdf URL: https://arxiv.org/pdf/2510.13068
Copy Paste: [[2510.13068]] NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models(https://arxiv.org/abs/2510.13068)
Keywords: foundation model, generative
Abstract: Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.

Title: Counting Hallucinations in Diffusion Models

Authors: Shuai Fu, Jian Zhou, Qi Chen, Huang Jing, Huy Anh Nguyen, Xiaohan Liu, Zhixiong Zeng, Lin Ma, Quanshi Zhang, Qi Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13080
Pdf URL: https://arxiv.org/pdf/2510.13080
Copy Paste: [[2510.13080]] Counting Hallucinations in Diffusion Models(https://arxiv.org/abs/2510.13080)
Keywords: diffusion, generative
Abstract: Diffusion probabilistic models (DPMs) have demonstrated remarkable progress in generative tasks, such as image and video synthesis. However, they still often produce hallucinated samples (hallucinations) that conflict with real-world knowledge, such as generating an implausible duplicate cup floating beside another cup. Despite their prevalence, the lack of feasible methodologies for systematically quantifying such hallucinations hinders progress in addressing this challenge and obscures potential pathways for designing next-generation generative models under factual constraints. In this work, we bridge this gap by focusing on a specific form of hallucination, which we term counting hallucination, referring to the generation of an incorrect number of instances or structured objects, such as a hand image with six fingers, despite such patterns being absent from the training data. To this end, we construct a dataset suite CountHalluSet, with well-defined counting criteria, comprising ToyShape, SimObject, and RealHand. Using these datasets, we develop a standardized evaluation protocol for quantifying counting hallucinations, and systematically examine how different sampling conditions in DPMs, including solver type, ODE solver order, sampling steps, and initial noise, affect counting hallucination levels. Furthermore, we analyze their correlation with common evaluation metrics such as FID, revealing that this widely used image quality metric fails to capture counting hallucinations consistently. This work aims to take the first step toward systematically quantifying hallucinations in diffusion models and offer new insights into the investigation of hallucination phenomena in image generation.

Title: Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation

Authors: Yi Zuo, Zitao Wang, Lingling Li, Xu Liu, Fang Liu, Licheng Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13084
Pdf URL: https://arxiv.org/pdf/2510.13084
Copy Paste: [[2510.13084]] Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation(https://arxiv.org/abs/2510.13084)
Keywords: diffusion
Abstract: Text-to-image (T2I) diffusion models have recently demonstrated significant progress in video editing. However, existing video editing methods are severely limited by their high computational overhead and memory consumption. Furthermore, these approaches often sacrifice visual fidelity, leading to undesirable temporal inconsistencies and artifacts such as blurring and pronounced mosaic-like patterns. We propose Edit-Your-Interest, a lightweight, text-driven, zero-shot video editing method. Edit-Your-Interest introduces a spatio-temporal feature memory to cache features from previous frames, significantly reducing computational overhead compared to full-sequence spatio-temporal modeling approaches. Specifically, we first introduce a Spatio-Temporal Feature Memory bank (SFM), which is designed to efficiently cache and retain the crucial image tokens processed by spatial attention. Second, we propose the Feature Most-Similar Propagation (FMP) method. FMP propagates the most relevant tokens from previous frames to subsequent ones, preserving temporal consistency. Finally, we introduce an SFM update algorithm that continuously refreshes the cached features, ensuring their long-term relevance and effectiveness throughout the video sequence. Furthermore, we leverage cross-attention maps to automatically extract masks for the instances of interest. These masks are seamlessly integrated into the diffusion denoising process, enabling fine-grained control over target objects and allowing Edit-Your-Interest to perform highly accurate edits while robustly preserving the background integrity. Extensive experiments decisively demonstrate that the proposed Edit-Your-Interest outperforms state-of-the-art methods in both efficiency and visual fidelity, validating its superior effectiveness and practicality.

Title: On the Reasoning Abilities of Masked Diffusion Language Models

Authors: Anej Svete, Ashish Sabharwal
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.13117
Pdf URL: https://arxiv.org/pdf/2510.13117
Copy Paste: [[2510.13117]] On the Reasoning Abilities of Masked Diffusion Language Models(https://arxiv.org/abs/2510.13117)
Keywords: diffusion
Abstract: Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent to their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.

Title: DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization

Authors: Meng Yang, Kecheng Chen, Wei Luo, Xianjie Chen, Yong Jia, Mingyue Wang, Fanqiang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13160
Pdf URL: https://arxiv.org/pdf/2510.13160
Copy Paste: [[2510.13160]] DP-TTA: Test-time Adaptation for Transient Electromagnetic Signal Denoising via Dictionary-driven Prior Regularization(https://arxiv.org/abs/2510.13160)
Keywords: self-supervised
Abstract: Transient Electromagnetic (TEM) method is widely used in various geophysical applications, providing valuable insights into subsurface properties. However, time-domain TEM signals are often submerged in various types of noise. While recent deep learning-based denoising models have shown strong performance, these models are mostly trained on simulated or single real-world scenario data, overlooking the significant differences in noise characteristics from different geographical regions. Intuitively, models trained in one environment often struggle to perform well in new settings due to differences in geological conditions, equipment, and external interference, leading to reduced denoising performance. To this end, we propose the Dictionary-driven Prior Regularization Test-time Adaptation (DP-TTA). Our key insight is that TEM signals possess intrinsic physical characteristics, such as exponential decay and smoothness, which remain consistent across different regions regardless of external conditions. These intrinsic characteristics serve as ideal prior knowledge for guiding the TTA strategy, which helps the pre-trained model dynamically adjust parameters by utilizing self-supervised losses, improving denoising performance in new scenarios. To implement this, we customized a network, named DTEMDNet. Specifically, we first use dictionary learning to encode these intrinsic characteristics as a dictionary-driven prior, which is integrated into the model during training. At the testing stage, this prior guides the model to adapt dynamically to new environments by minimizing self-supervised losses derived from the dictionary-driven consistency and the signal one-order variation. Extensive experimental results demonstrate that the proposed method achieves much better performance than existing TEM denoising methods and TTA methods.

Title: Text Anomaly Detection with Simplified Isolation Kernel

Authors: Yang Cao, Sikun Yang, Yujiu Yang, Lianyong Qi, Ming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13197
Pdf URL: https://arxiv.org/pdf/2510.13197
Copy Paste: [[2510.13197]] Text Anomaly Detection with Simplified Isolation Kernel(https://arxiv.org/abs/2510.13197)
Keywords: anomaly
Abstract: Two-step approaches combining pre-trained large language model embeddings and anomaly detectors demonstrate strong performance in text anomaly detection by leveraging rich semantic representations. However, high-dimensional dense embeddings extracted by large language models pose challenges due to substantial memory requirements and high computation time. To address this challenge, we introduce the Simplified Isolation Kernel (SIK), which maps high-dimensional dense embeddings to lower-dimensional sparse representations while preserving crucial anomaly characteristics. SIK has linear time complexity and significantly reduces space complexity through its innovative boundary-focused feature mapping. Experiments across 7 datasets demonstrate that SIK achieves better detection performance than 11 state-of-the-art (SOTA) anomaly detection algorithms while maintaining computational efficiency and low memory cost. All code and demonstrations are available at this https URL.

Title: CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection

Authors: Amirhossein Mozafari, Kourosh Hashemi, Erfan Shafagh, Soroush Motamedi, Azar Taheri Tayebi, Mohammad A. Tayebi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13205
Pdf URL: https://arxiv.org/pdf/2510.13205
Copy Paste: [[2510.13205]] CleverCatch: A Knowledge-Guided Weak Supervision Model for Fraud Detection(https://arxiv.org/abs/2510.13205)
Keywords: anomaly
Abstract: Healthcare fraud detection remains a critical challenge due to limited availability of labeled data, constantly evolving fraud tactics, and the high dimensionality of medical records. Traditional supervised methods are challenged by extreme label scarcity, while purely unsupervised approaches often fail to capture clinically meaningful anomalies. In this work, we introduce CleverCatch, a knowledge-guided weak supervision model designed to detect fraudulent prescription behaviors with improved accuracy and interpretability. Our approach integrates structured domain expertise into a neural architecture that aligns rules and data samples within a shared embedding space. By training encoders jointly on synthetic data representing both compliance and violation, CleverCatch learns soft rule embeddings that generalize to complex, real-world datasets. This hybrid design enables data-driven learning to be enhanced by domain-informed constraints, bridging the gap between expert heuristics and machine learning. Experiments on the large-scale real-world dataset demonstrate that CleverCatch outperforms four state-of-the-art anomaly detection baselines, yielding average improvements of 1.3\% in AUC and 3.4\% in recall. Our ablation study further highlights the complementary role of expert rules, confirming the adaptability of the framework. The results suggest that embedding expert rules into the learning process not only improves detection accuracy but also increases transparency, offering an interpretable approach for high-stakes domains such as healthcare fraud detection.

Title: Prompt-based Adaptation in Large-scale Vision Models: A Survey

Authors: Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, Cheng Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13219
Pdf URL: https://arxiv.org/pdf/2510.13219
Copy Paste: [[2510.13219]] Prompt-based Adaptation in Large-scale Vision Models: A Survey(https://arxiv.org/abs/2510.13219)
Keywords: generative
Abstract: In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the ``pretrain-then-finetune'' paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity -- pixel-level and token-level. Beyond the core methodologies, we examine PA's integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, we are the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all area to understand and explore the evolving landscape of PA-related research.

Title: CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

Authors: Li Liang, Bo Miao, Xinyu Wang, Naveed Akhtar, Jordan Vice, Ajmal Mian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13245
Pdf URL: https://arxiv.org/pdf/2510.13245
Copy Paste: [[2510.13245]] CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation(https://arxiv.org/abs/2510.13245)
Keywords: diffusion
Abstract: Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at this https URL

Title: End-to-End Multi-Modal Diffusion Mamba

Authors: Chunhao Lu, Qiang Lu, Meichen Dong, Jake Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13253
Pdf URL: https://arxiv.org/pdf/2510.13253
Copy Paste: [[2510.13253]] End-to-End Multi-Modal Diffusion Mamba(https://arxiv.org/abs/2510.13253)
Keywords: diffusion
Abstract: Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (MonoFormer, LlamaGen, and Chameleon etc.) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.

Title: Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems

Authors: Xuxin Cheng, Ke Zeng, Zhiquan Cao, Linyi Dai, Wenxuan Gao, Fei Han, Ai Jian, Feng Hong, Wenxing Hu, Zihe Huang, Dejian Kong, Jia Leng, Zhuoyuan Liao, Pei Liu, Jiaye Lin, Xing Ma, Jingqing Ruan, Jiaxing Song, Xiaoyu Tan, Ruixuan Xiao, Wenhui Yu, Wenyu Zhan, Haoxing Zhang, Chao Zhou, Hao Zhou, Shaodong Zheng, Ruinian Chen, Siyuan Chen, Ziyang Chen, Yiwen Dong, Yaoyou Fan, Yangyi Fang, Yang Gan, Shiguang Guo, Qi He, Chaowen Hu, Binghui Li, Dailin Li, Xiangyu Li, Yan Li, Chengjian Liu, Xiangfeng Liu, Jiahui Lv, Qiao Ma, Jiang Pan, Cong Qin, Chenxing Sun, Wen Sun, Zhonghui Wang, Abudukelimu Wuerkaixi, Xin Yang, Fangyi Yuan, Yawen Zhu, Tianyi Zhai, Jie Zhang, Runlai Zhang, Yao Xu, Yiran Zhao, Yifan Wang, Xunliang Cai, Yangen Hu, Cao Liu, Lu Pan, Xiaoli Wang, Bo Xiao, Wenyuan Yao, Qianlin Zhou, Benchang Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13291
Pdf URL: https://arxiv.org/pdf/2510.13291
Copy Paste: [[2510.13291]] Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems(https://arxiv.org/abs/2510.13291)
Keywords: generative
Abstract: Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high-quality data for cold-start training is difficult, hindering self-evolution and raising labor costs. (2) Multi-turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low-cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi-agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open-domain nature of multi-turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi-agent architectures, WOWService enables autonomous task management and collaborative problem-solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation. Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) -27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.

Title: Federated Conditional Conformal Prediction via Generative Models

Authors: Rui Xu, Sihong Xie
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13297
Pdf URL: https://arxiv.org/pdf/2510.13297
Copy Paste: [[2510.13297]] Federated Conditional Conformal Prediction via Generative Models(https://arxiv.org/abs/2510.13297)
Keywords: diffusion, generative
Abstract: Conformal Prediction (CP) provides distribution-free uncertainty quantification by constructing prediction sets that guarantee coverage of the true labels. This reliability makes CP valuable for high-stakes federated learning scenarios such as multi-center healthcare. However, standard CP assumes i.i.d. data, which is violated in federated settings where client distributions differ substantially. Existing federated CP methods address this by maintaining marginal coverage on each client, but such guarantees often fail to reflect input-conditional uncertainty. In this work, we propose Federated Conditional Conformal Prediction (Fed-CCP) via generative models, which aims for conditional coverage that adapts to local data heterogeneity. Fed-CCP leverages generative models, such as normalizing flows or diffusion models, to approximate conditional data distributions without requiring the sharing of raw data. This enables each client to locally calibrate conformal scores that reflect its unique uncertainty, while preserving global consistency through federated aggregation. Experiments on real datasets demonstrate that Fed-CCP achieves more adaptive prediction sets.

Title: Km-scale dynamical downscaling through conformalized latent diffusion models

Authors: Alessandro Brusaferri, Andrea Ballarino
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13301
Pdf URL: https://arxiv.org/pdf/2510.13301
Copy Paste: [[2510.13301]] Km-scale dynamical downscaling through conformalized latent diffusion models(https://arxiv.org/abs/2510.13301)
Keywords: diffusion, generative
Abstract: Dynamical downscaling is crucial for deriving high-resolution meteorological fields from coarse-scale simulations, enabling detailed analysis for critical applications such as weather forecasting and renewable energy modeling. Generative Diffusion models (DMs) have recently emerged as powerful data-driven tools for this task, offering reconstruction fidelity and more scalable sampling supporting uncertainty quantification. However, DMs lack finite-sample guarantees against overconfident predictions, resulting in miscalibrated grid-point-level uncertainty estimates hindering their reliability in operational contexts. In this work, we tackle this issue by augmenting the downscaling pipeline with a conformal prediction framework. Specifically, the DM's samples are post-processed to derive conditional quantile estimates, incorporated into a conformalized quantile regression procedure targeting locally adaptive prediction intervals with finite-sample marginal validity. The proposed approach is evaluated on ERA5 reanalysis data over Italy, downscaled to a 2-km grid. Results demonstrate grid-point-level uncertainty estimates with markedly improved coverage and stable probabilistic scores relative to the DM baseline, highlighting the potential of conformalized generative models for more trustworthy probabilistic downscaling to high-resolution meteorological fields.

Title: LLM one-shot style transfer for Authorship Attribution and Verification

Authors: Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13302
Pdf URL: https://arxiv.org/pdf/2510.13302
Copy Paste: [[2510.13302]] LLM one-shot style transfer for Authorship Attribution and Verification(https://arxiv.org/abs/2510.13302)
Keywords: in-context
Abstract: Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation; enabling flexible trade-offs between computational cost and accuracy.

Title: Isolation-based Spherical Ensemble Representations for Anomaly Detection

Authors: Yang Cao, Sikun Yang, Hao Tian, Kai He, Lianyong Qi, Ming Liu, Yujiu Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13311
Pdf URL: https://arxiv.org/pdf/2510.13311
Copy Paste: [[2510.13311]] Isolation-based Spherical Ensemble Representations for Anomaly Detection(https://arxiv.org/abs/2510.13311)
Keywords: anomaly
Abstract: Anomaly detection is a critical task in data mining and management with applications spanning fraud detection, network security, and log monitoring. Despite extensive research, existing unsupervised anomaly detection methods still face fundamental challenges including conflicting distributional assumptions, computational inefficiency, and difficulty handling different anomaly types. To address these problems, we propose ISER (Isolation-based Spherical Ensemble Representations) that extends existing isolation-based methods by using hypersphere radii as proxies for local density characteristics while maintaining linear time and constant space complexity. ISER constructs ensemble representations where hypersphere radii encode density information: smaller radii indicate dense regions while larger radii correspond to sparse areas. We introduce a novel similarity-based scoring method that measures pattern consistency by comparing ensemble representations against a theoretical anomaly reference pattern. Additionally, we enhance the performance of Isolation Forest by using ISER and adapting the scoring function to address axis-parallel bias and local anomaly detection limitations. Comprehensive experiments on 22 real-world datasets demonstrate ISER's superior performance over 11 baseline methods.

Title: Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models

Authors: Hong-Kai Zheng, Piji Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13331
Pdf URL: https://arxiv.org/pdf/2510.13331
Copy Paste: [[2510.13331]] Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models(https://arxiv.org/abs/2510.13331)
Keywords: self-supervised
Abstract: Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised learning through reconstruction tasks to represent continuous vectors using the closest vectors in a codebook. However, issues such as codebook collapse persist in the VQ model. To address these issues, existing approaches employ implicit static codebooks or jointly optimize the entire codebook, but these methods constrain the codebook's learning capability, leading to reduced reconstruction quality. In this paper, we propose Group-VQ, which performs group-wise optimization on the codebook. Each group is optimized independently, with joint optimization performed within groups. This approach improves the trade-off between codebook utilization and reconstruction performance. Additionally, we introduce a training-free codebook resampling method, allowing post-training adjustment of the codebook size. In image reconstruction experiments under various settings, Group-VQ demonstrates improved performance on reconstruction metrics. And the post-training codebook sampling method achieves the desired flexibility in adjusting the codebook size.

Title: Document Intelligence in the Era of Large Language Models: A Survey

Authors: Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13366
Pdf URL: https://arxiv.org/pdf/2510.13366
Copy Paste: [[2510.13366]] Document Intelligence in the Era of Large Language Models: A Survey(https://arxiv.org/abs/2510.13366)
Keywords: foundation model
Abstract: Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.

Title: Contrastive Learning-Based Dependency Modeling for Anomaly Detection in Cloud Services

Authors: Yue Xing, Yingnan Deng, Heyao Liu, Ming Wang, Yun Zi, Xiaoxuan Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13368
Pdf URL: https://arxiv.org/pdf/2510.13368
Copy Paste: [[2510.13368]] Contrastive Learning-Based Dependency Modeling for Anomaly Detection in Cloud Services(https://arxiv.org/abs/2510.13368)
Keywords: anomaly
Abstract: This paper addresses the challenges of complex dependencies and diverse anomaly patterns in cloud service environments by proposing a dependency modeling and anomaly detection method that integrates contrastive learning. The method abstracts service interactions into a dependency graph, extracts temporal and structural features through embedding functions, and employs a graph convolution mechanism to aggregate neighborhood information for context-aware service representations. A contrastive learning framework is then introduced, constructing positive and negative sample pairs to enhance the separability of normal and abnormal patterns in the representation space. Furthermore, a temporal consistency constraint is designed to maintain representation stability across time steps and reduce the impact of short-term fluctuations and noise. The overall optimization combines contrastive loss and temporal consistency loss to ensure stable and reliable detection across multi-dimensional features. Experiments on public datasets systematically evaluate the method from hyperparameter, environmental, and data sensitivity perspectives. Results show that the proposed approach significantly outperforms existing methods on key metrics such as Precision, Recall, F1-Score, and AUC, while maintaining robustness under conditions of sparse labeling, monitoring noise, and traffic fluctuations. This study verifies the effectiveness of integrating dependency modeling with contrastive learning, provides a complete technical solution for cloud service anomaly detection, and demonstrates strong adaptability and stability in complex environments.

Title: Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment

Authors: Feng-Qi Cui, Yu-Tong Guo, Tianyue Zheng, Jinyang Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13390
Pdf URL: https://arxiv.org/pdf/2510.13390
Copy Paste: [[2510.13390]] Generalizing WiFi Gesture Recognition via Large-Model-Aware Semantic Distillation and Alignment(https://arxiv.org/abs/2510.13390)
Keywords: foundation model
Abstract: WiFi-based gesture recognition has emerged as a promising RF sensing paradigm for enabling non-contact and privacy-preserving human-computer interaction in AIoT environments. However, existing methods often suffer from limited generalization and semantic expressiveness due to the domain-sensitive nature of Channel State Information and the lack of high-level gesture abstraction. To address these challenges, we propose a novel generalization framework, termed Large-Model-Aware Semantic Distillation and Alignment (GLSDA), which leverages the semantic prior of pre-trained large foundation models to enhance gesture representation learning in both in-domain and cross-domain scenarios. Specifically, we first design a dual-path CSI encoding pipeline that captures geometric and dynamic gesture patterns via CSI-Ratio phase sequences and Doppler spectrograms. These representations are then fed into a Multiscale Semantic Encoder, which learns robust temporal embeddings and aligns them with gesture semantics through cross-modal attention mechanisms. To further enhance category discrimination, we introduce a Semantic-Aware Soft Supervision scheme that encodes inter-class correlations and reduces label ambiguity, especially for semantically similar gestures. Finally, we develop a Robust Dual-Distillation strategy to compress the aligned model into a lightweight student network, jointly distilling intermediate features and semantic-informed soft labels from the teacher model. Extensive experiments on the Widar3.0 benchmark show that GLSDA consistently outperforms state-of-the-art methods in both in-domain and cross-domain gesture recognition tasks, while significantly reducing model size and inference latency. Our method offers a scalable and deployable solution for generalized RF-based gesture interfaces in real-world AIoT applications.

Title: Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

Authors: Agnese Lombardi, Alessandro Lenci
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13395
Pdf URL: https://arxiv.org/pdf/2510.13395
Copy Paste: [[2510.13395]] Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models(https://arxiv.org/abs/2510.13395)
Keywords: generative
Abstract: Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal effects from agent actions, exposing difficulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.

Title: Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation

Authors: Yifu Luo, Xinhao Hu, Keyu Fan, Haoyuan Sun, Zeyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13418
Pdf URL: https://arxiv.org/pdf/2510.13418
Copy Paste: [[2510.13418]] Reinforcement Learning Meets Masked Generative Models: Mask-GRPO for Text-to-Image Generation(https://arxiv.org/abs/2510.13418)
Keywords: diffusion, generative
Abstract: Reinforcement learning (RL) has garnered increasing attention in text-to-image (T2I) generation. However, most existing RL approaches are tailored to either diffusion models or autoregressive models, overlooking an important alternative: masked generative models. In this work, we propose Mask-GRPO, the first method to incorporate Group Relative Policy Optimization (GRPO)-based RL into this overlooked paradigm. Our core insight is to redefine the transition probability, which is different from current approaches, and formulate the unmasking process as a multi-step decision-making problem. To further enhance our method, we explore several useful strategies, including removing the KL constraint, applying the reduction strategy, and filtering out low-quality samples. Using Mask-GRPO, we improve a base model, Show-o, with substantial improvements on standard T2I benchmarks and preference alignment, outperforming existing state-of-the-art approaches. The code is available on this https URL

Title: Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter

Authors: Jianhui Zhang, Sheng Cheng, Qirui Sun, Jia Liu, Wang Luyang, Chaoyu Feng, Chen Fang, Lei Lei, Jue Wang, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13419
Pdf URL: https://arxiv.org/pdf/2510.13419
Copy Paste: [[2510.13419]] Ultra High-Resolution Image Inpainting with Patch-Based Content Consistency Adapter(https://arxiv.org/abs/2510.13419)
Keywords: diffusion
Abstract: In this work, we present Patch-Adapter, an effective framework for high-resolution text-guided image inpainting. Unlike existing methods limited to lower resolutions, our approach achieves 4K+ resolution while maintaining precise content consistency and prompt alignment, two critical challenges in image inpainting that intensify with increasing resolution and texture complexity. Patch-Adapter leverages a two-stage adapter architecture to scale the diffusion model's resolution from 1K to 4K+ without requiring structural overhauls: (1) Dual Context Adapter learns coherence between masked and unmasked regions at reduced resolutions to establish global structural consistency; and (2) Reference Patch Adapter implements a patch-level attention mechanism for full-resolution inpainting, preserving local detail fidelity through adaptive feature fusion. This dual-stage architecture uniquely addresses the scalability gap in high-resolution inpainting by decoupling global semantics from localized refinement. Experiments demonstrate that Patch-Adapter not only resolves artifacts common in large-scale inpainting but also achieves state-of-the-art performance on the OpenImages and Photo-Concept-Bucket datasets, outperforming existing methods in both perceptual quality and text-prompt adherence.

Title: Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings

Authors: Riddhish Thakare, Kingdom Mutala Akugri
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13622
Pdf URL: https://arxiv.org/pdf/2510.13622
Copy Paste: [[2510.13622]] Manifold Decoders: A Framework for Generative Modeling from Nonlinear Embeddings(https://arxiv.org/abs/2510.13622)
Keywords: diffusion, generative
Abstract: Classical nonlinear dimensionality reduction (NLDR) techniques like t-SNE, Isomap, and LLE excel at creating low-dimensional embeddings for data visualization but fundamentally lack the ability to map these embeddings back to the original high-dimensional space. This one-way transformation limits their use in generative applications. This paper addresses this critical gap by introducing a system- atic framework for constructing neural decoder architectures for prominent NLDR methods, enabling bidirectional mapping for the first time. We extend this framework by implementing a diffusion-based generative process that operates directly within these learned manifold spaces. Through experiments on the CelebA dataset, we evaluate the reconstruction and generative performance of our approach against autoencoder and standard diffusion model baselines. Our findings reveal a fundamental trade- off: while the decoders successfully reconstruct data, their quality is surpassed by end-to-end optimized autoencoders. Moreover, manifold-constrained diffusion yields poor-quality samples, suggesting that the discrete and sparse nature of classical NLDR embeddings is ill-suited for the continuous inter- polation required by generative models. This work highlights the inherent challenges in retrofitting generative capabilities onto NLDR methods designed primarily for visualization and analysis.

Title: AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset

Authors: Amjid Ali, Zulfiqar Ahmad Khan, Altaf Hussain, Muhammad Munsif, Adnan Hussain, Sung Wook Baik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13630
Pdf URL: https://arxiv.org/pdf/2510.13630
Copy Paste: [[2510.13630]] AVAR-Net: A Lightweight Audio-Visual Anomaly Recognition Framework with a Benchmark Dataset(https://arxiv.org/abs/2510.13630)
Keywords: anomaly
Abstract: Anomaly recognition plays a vital role in surveillance, transportation, healthcare, and public safety. However, most existing approaches rely solely on visual data, making them unreliable under challenging conditions such as occlusion, low illumination, and adverse weather. Moreover, the absence of large-scale synchronized audio-visual datasets has hindered progress in multimodal anomaly recognition. To address these limitations, this study presents AVAR-Net, a lightweight and efficient audio-visual anomaly recognition framework designed for real-world environments. AVAR-Net consists of four main modules: an audio feature extractor, a video feature extractor, fusion strategy, and a sequential pattern learning network that models cross-modal relationships for anomaly recognition. Specifically, the Wav2Vec2 model extracts robust temporal features from raw audio, while MobileViT captures both local and global visual representations from video frames. An early fusion mechanism combines these modalities, and a Multi-Stage Temporal Convolutional Network (MTCN) model that learns long-range temporal dependencies within the fused representation, enabling robust spatiotemporal reasoning. A novel Visual-Audio Anomaly Recognition (VAAR) dataset, is also introduced, serving as a medium-scale benchmark containing 3,000 real-world videos with synchronized audio across ten diverse anomaly classes. Experimental evaluations demonstrate that AVAR-Net achieves 89.29% accuracy on VAAR and 88.56% Average Precision on the XD-Violence dataset, improving Average Precision by 2.8% over existing state-of-the-art methods. These results highlight the effectiveness, efficiency, and generalization capability of the proposed framework, as well as the utility of VAAR as a benchmark for advancing multimodal anomaly recognition research.

Title: Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection

Authors: Akib Mohammed Khan, Bartosz Krawczyk
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13643
Pdf URL: https://arxiv.org/pdf/2510.13643
Copy Paste: [[2510.13643]] Towards Adversarial Robustness and Uncertainty Quantification in DINOv2-based Few-Shot Anomaly Detection(https://arxiv.org/abs/2510.13643)
Keywords: foundation model, anomaly
Abstract: Foundation models such as DINOv2 have shown strong performance in few-shot anomaly detection, yet two key questions remain unexamined: (i) how susceptible are these detectors to adversarial perturbations; and (ii) how well do their anomaly scores reflect calibrated uncertainty? Building on AnomalyDINO, a training-free deep nearest-neighbor detector over DINOv2 features, we present one of the first systematic studies of adversarial attacks and uncertainty estimation in this setting. To enable white-box gradient attacks while preserving test-time behavior, we attach a lightweight linear head to frozen DINOv2 features only for crafting perturbations. Using this heuristic, we evaluate the impact of FGSM across the MVTec-AD and VisA datasets and observe consistent drops in F1, AUROC, AP, and G-mean, indicating that imperceptible perturbations can flip nearest-neighbor relations in feature space to induce confident misclassification. Complementing robustness, we probe reliability and find that raw anomaly scores are poorly calibrated, revealing a gap between confidence and correctness that limits safety-critical use. As a simple, strong baseline toward trustworthiness, we apply post-hoc Platt scaling to the anomaly scores for uncertainty estimation. The resulting calibrated posteriors yield significantly higher predictive entropy on adversarially perturbed inputs than on clean ones, enabling a practical flagging mechanism for attack detection while reducing calibration error (ECE). Our findings surface concrete vulnerabilities in DINOv2-based few-shot anomaly detectors and establish an evaluation protocol and baseline for robust, uncertainty-aware anomaly detection. We argue that adversarial robustness and principled uncertainty quantification are not optional add-ons but essential capabilities if anomaly detection systems are to be trustworthy and ready for real-world deployment.

Title: Local-Global Context-Aware and Structure-Preserving Image Super-Resolution

Authors: Sanchar Palit, Subhasis Chaudhuri, Biplab Banerjee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13649
Pdf URL: https://arxiv.org/pdf/2510.13649
Copy Paste: [[2510.13649]] Local-Global Context-Aware and Structure-Preserving Image Super-Resolution(https://arxiv.org/abs/2510.13649)
Keywords: diffusion
Abstract: Diffusion models have recently achieved significant success in various image manipulation tasks, including image super-resolution and perceptual quality enhancement. Pretrained text-to-image models, such as Stable Diffusion, have exhibited strong capabilities in synthesizing realistic image content, which makes them particularly attractive for addressing super-resolution tasks. While some existing approaches leverage these models to achieve state-of-the-art results, they often struggle when applied to diverse and highly degraded images, leading to noise amplification or incorrect content generation. To address these limitations, we propose a contextually precise image super-resolution framework that effectively maintains both local and global pixel relationships through Local-Global Context-Aware Attention, enabling the generation of high-quality images. Furthermore, we propose a distribution- and perceptual-aligned conditioning mechanism in the pixel space to enhance perceptual fidelity. This mechanism captures fine-grained pixel-level representations while progressively preserving and refining structural information, transitioning from local content details to the global structural composition. During inference, our method generates high-quality images that are structurally consistent with the original content, mitigating artifacts and ensuring realistic detail restoration. Extensive experiments on multiple super-resolution benchmarks demonstrate the effectiveness of our approach in producing high-fidelity, perceptually accurate reconstructions.

Title: EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection

Authors: Huaizhi Qu, Ruichen Zhang, Shuqing Luo, Luchao Qi, Zhihao Zhang, Xiaoming Liu, Roni Sengupta, Tianlong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13652
Pdf URL: https://arxiv.org/pdf/2510.13652
Copy Paste: [[2510.13652]] EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection(https://arxiv.org/abs/2510.13652)
Keywords: foundation model
Abstract: Recent advances in foundation models have driven remarkable progress in image editing, yet their extension to 3D editing remains underexplored. A natural approach is to replace the image editing modules in existing workflows with foundation models. However, their heavy computational demands and the restrictions and costs of closed-source APIs make plugging these models into existing iterative editing strategies impractical. To address this limitation, we propose EditCast3D, a pipeline that employs video generation foundation models to propagate edits from a single first frame across the entire dataset prior to reconstruction. While editing propagation enables dataset-level editing via video models, its consistency remains suboptimal for 3D reconstruction, where multi-view alignment is essential. To overcome this, EditCast3D introduces a view selection strategy that explicitly identifies consistent and reconstruction-friendly views and adopts feedforward reconstruction without requiring costly refinement. In combination, the pipeline both minimizes reliance on expensive image editing and mitigates prompt ambiguities that arise when applying foundation models independently across images. We evaluate EditCast3D on commonly used 3D editing datasets and compare it against state-of-the-art 3D editing baselines, demonstrating superior editing quality and high efficiency. These results establish EditCast3D as a scalable and general paradigm for integrating foundation models into 3D editing pipelines. The code is available at this https URL

Title: Time Series Foundation Models: Benchmarking Challenges and Requirements

Authors: Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Oliver Müller
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13654
Pdf URL: https://arxiv.org/pdf/2510.13654
Copy Paste: [[2510.13654]] Time Series Foundation Models: Benchmarking Challenges and Requirements(https://arxiv.org/abs/2510.13654)
Keywords: foundation model
Abstract: Time Series Foundation Models (TSFMs) represent a new paradigm for time series forecasting, offering zero-shot forecasting capabilities without the need for domain-specific pre-training or fine-tuning. However, as with Large Language Models (LLMs), evaluating TSFMs is tricky, as with ever more extensive training sets, it becomes more and more challenging to ensure the integrity of benchmarking data. Our investigation of existing TSFM evaluation highlights multiple challenges, ranging from the representativeness of the benchmark datasets, over the lack of spatiotemporal evaluation, to risks of information leakage due to overlapping and obscure datasets, and the memorization of global patterns caused by external shocks like economic crises or pandemics. Our findings reveal widespread confusion regarding data partitions, risking inflated performance estimates and incorrect transfer of global knowledge to local time series. We argue for the development of robust evaluation methodologies to prevent pitfalls already observed in LLM and classical time series benchmarking, and call upon the research community to design new, principled approaches, such as evaluations on truly out-of-sample future data, to safeguard the integrity of TSFM assessment.

Title: Axial Neural Networks for Dimension-Free Foundation Models

Authors: Hyunsu Kim, Jonggeon Park, Joan Bruna, Hongseok Yang, Juho Lee
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13665
Pdf URL: https://arxiv.org/pdf/2510.13665
Copy Paste: [[2510.13665]] Axial Neural Networks for Dimension-Free Foundation Models(https://arxiv.org/abs/2510.13665)
Keywords: foundation model, in-context
Abstract: The advent of foundation models in AI has significantly advanced general-purpose learning, enabling remarkable capabilities in zero-shot inference and in-context learning. However, training such models on physics data, including solutions to partial differential equations (PDEs), poses a unique challenge due to varying dimensionalities across different systems. Traditional approaches either fix a maximum dimension or employ separate encoders for different dimensionalities, resulting in inefficiencies. To address this, we propose a dimension-agnostic neural network architecture, the Axial Neural Network (XNN), inspired by parameter-sharing structures such as Deep Sets and Graph Neural Networks. XNN generalizes across varying tensor dimensions while maintaining computational efficiency. We convert existing PDE foundation models into axial neural networks and evaluate their performance across three training scenarios: training from scratch, pretraining on multiple PDEs, and fine-tuning on a single PDE. Our experiments show that XNNs perform competitively with original models and exhibit superior generalization to unseen dimensions, highlighting the importance of multidimensional pretraining for foundation models.

Title: CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas

Authors: Zian Li, Muhan Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13669
Pdf URL: https://arxiv.org/pdf/2510.13669
Copy Paste: [[2510.13669]] CanvasMAR: Improving Masked Autoregressive Video Generation With Canvas(https://arxiv.org/abs/2510.13669)
Keywords: diffusion
Abstract: Masked autoregressive models (MAR) have recently emerged as a powerful paradigm for image and video generation, combining the flexibility of masked modeling with the potential of continuous tokenizer. However, video MAR models suffer from two major limitations: the slow-start problem, caused by the lack of a structured global prior at early sampling stages, and error accumulation across the autoregression in both spatial and temporal dimensions. In this work, we propose CanvasMAR, a novel video MAR model that mitigates these issues by introducing a canvas mechanism--a blurred, global prediction of the next frame, used as the starting point for masked generation. The canvas provides global structure early in sampling, enabling faster and more coherent frame synthesis. Furthermore, we introduce compositional classifier-free guidance that jointly enlarges spatial (canvas) and temporal conditioning, and employ noise-based canvas augmentation to enhance robustness. Experiments on the BAIR and Kinetics-600 benchmarks demonstrate that CanvasMAR produces high-quality videos with fewer autoregressive steps. Our approach achieves remarkable performance among autoregressive models on Kinetics-600 dataset and rivals diffusion-based methods.

Title: FlashWorld: High-quality 3D Scene Generation within Seconds

Authors: Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13678
Pdf URL: https://arxiv.org/pdf/2510.13678
Copy Paste: [[2510.13678]] FlashWorld: High-quality 3D Scene Generation within Seconds(https://arxiv.org/abs/2510.13678)
Keywords: diffusion, generative
Abstract: We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100$\times$ faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented method typically suffers poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.

Title: Generating healthy counterfactuals with denoising diffusion bridge models

Authors: Ana Lawry Aguila, Peirong Liu, Marina Crespo Aguirre, Juan Eugenio Iglesias
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13684
Pdf URL: https://arxiv.org/pdf/2510.13684
Copy Paste: [[2510.13684]] Generating healthy counterfactuals with denoising diffusion bridge models(https://arxiv.org/abs/2510.13684)
Keywords: diffusion, generative, anomaly
Abstract: Generating healthy counterfactuals from pathological images holds significant promise in medical imaging, e.g., in anomaly detection or for application of analysis tools that are designed for healthy scans. These counterfactuals should represent what a patient's scan would plausibly look like in the absence of pathology, preserving individual anatomical characteristics while modifying only the pathological regions. Denoising diffusion probabilistic models (DDPMs) have become popular methods for generating healthy counterfactuals of pathology data. Typically, this involves training on solely healthy data with the assumption that a partial denoising process will be unable to model disease regions and will instead reconstruct a closely matched healthy counterpart. More recent methods have incorporated synthetic pathological images to better guide the diffusion process. However, it remains challenging to guide the generative process in a way that effectively balances the removal of anomalies with the retention of subject-specific features. To solve this problem, we propose a novel application of denoising diffusion bridge models (DDBMs) - which, unlike DDPMs, condition the diffusion process not only on the initial point (i.e., the healthy image), but also on the final point (i.e., a corresponding synthetically generated pathological image). Treating the pathological image as a structurally informative prior enables us to generate counterfactuals that closely match the patient's anatomy while selectively removing pathology. The results show that our DDBM outperforms previously proposed diffusion models and fully supervised approaches at segmentation and anomaly detection tasks.

Title: MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

Authors: Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13702
Pdf URL: https://arxiv.org/pdf/2510.13702
Copy Paste: [[2510.13702]] MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion(https://arxiv.org/abs/2510.13702)
Keywords: diffusion, generative
Abstract: Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with geometric consistency, whereas customization models lack explicit viewpoint control, making them challenging to unify. Motivated by these gaps, we introduce a novel task, multi-view customization, which aims to jointly achieve multi-view camera pose control and customization. Due to the scarcity of training data in customization, existing multi-view generation models, which inherently rely on large-scale datasets, struggle to generalize to diverse prompts. To address this, we propose MVCustom, a novel diffusion-based framework explicitly designed to achieve both multi-view consistency and customization fidelity. In the training stage, MVCustom learns the subject's identity and geometry using a feature-field representation, incorporating the text-to-video diffusion backbone enhanced with dense spatio-temporal attention, which leverages temporal coherence for multi-view consistency. In the inference stage, we introduce two novel techniques: depth-aware feature rendering explicitly enforces geometric consistency, and consistent-aware latent completion ensures accurate perspective alignment of the customized subject and surrounding backgrounds. Extensive experiments demonstrate that MVCustom is the only framework that simultaneously achieves faithful multi-view generation and customization.

Title: NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching

Authors: Run Luo, Xiaobo Xia, Lu Wang, Longze Chen, Renke Shan, Jing Luo, Min Yang, Tat-Seng Chua
Subjects: cs.CL, cs.AI, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2510.13721
Pdf URL: https://arxiv.org/pdf/2510.13721
Copy Paste: [[2510.13721]] NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching(https://arxiv.org/abs/2510.13721)
Keywords: foundation model
Abstract: Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks within unified frameworks separately, their redundant, non-integrated designs limit their applicability to broader scenarios, such as cross-modal this http URL this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks, while outperforming prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.

Title: Assessing the Geographic Generalization and Physical Consistency of Generative Models for Climate Downscaling

Authors: Carlo Saccardi, Maximilian Pierzyna, Haitz Sáez de Ocáriz Borde, Simone Monaco, Cristian Meo, Pietro Liò, Rudolf Saathof, Geethu Joseph, Justin Dauwels
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2510.13722
Pdf URL: https://arxiv.org/pdf/2510.13722
Copy Paste: [[2510.13722]] Assessing the Geographic Generalization and Physical Consistency of Generative Models for Climate Downscaling(https://arxiv.org/abs/2510.13722)
Keywords: generative
Abstract: Kilometer-scale weather data is crucial for real-world applications but remains computationally intensive to produce using traditional weather simulations. An emerging solution is to use deep learning models, which offer a faster alternative for climate downscaling. However, their reliability is still in question, as they are often evaluated using standard machine learning metrics rather than insights from atmospheric and weather physics. This paper benchmarks recent state-of-the-art deep learning models and introduces physics-inspired diagnostics to evaluate their performance and reliability, with a particular focus on geographic generalization and physical consistency. Our experiments show that, despite the seemingly strong performance of models such as CorrDiff, when trained on a limited set of European geographies (e.g., central Europe), they struggle to generalize to other regions such as Iberia, Morocco in the south, or Scandinavia in the north. They also fail to accurately capture second-order variables such as divergence and vorticity derived from predicted velocity fields. These deficiencies appear even in in-distribution geographies, indicating challenges in producing physically consistent predictions. We propose a simple initial solution: introducing a power spectral density loss function that empirically improves geographic generalization by encouraging the reconstruction of small-scale physical structures. The code for reproducing the experimental results can be found at this https URL

Title: Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis

Authors: Zhenxuan Zhang, Peiyuan Jing, Zi Wang, Ula Briski, Coraline Beitone, Yue Yang, Yinzhe Wu, Fanwen Wang, Liutao Yang, Jiahao Huang, Zhifan Gao, Zhaolin Chen, Kh Tohidul Islam, Guang Yang, Peter J. Lally
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13735
Pdf URL: https://arxiv.org/pdf/2510.13735
Copy Paste: [[2510.13735]] Cyclic Self-Supervised Diffusion for Ultra Low-field to High-field MRI Synthesis(https://arxiv.org/abs/2510.13735)
Keywords: diffusion, self-supervised, generative
Abstract: Synthesizing high-quality images from low-field MRI holds significant potential. Low-field MRI is cheaper, more accessible, and safer, but suffers from low resolution and poor signal-to-noise ratio. This synthesis process can reduce reliance on costly acquisitions and expand data availability. However, synthesizing high-field MRI still suffers from a clinical fidelity gap. There is a need to preserve anatomical fidelity, enhance fine-grained structural details, and bridge domain gaps in image contrast. To address these issues, we propose a \emph{cyclic self-supervised diffusion (CSS-Diff)} framework for high-field MRI synthesis from real low-field MRI data. Our core idea is to reformulate diffusion-based synthesis under a cycle-consistent constraint. It enforces anatomical preservation throughout the generative process rather than just relying on paired pixel-level supervision. The CSS-Diff framework further incorporates two novel processes. The slice-wise gap perception network aligns inter-slice inconsistencies via contrastive learning. The local structure correction network enhances local feature restoration through self-reconstruction of masked and perturbed patches. Extensive experiments on cross-field synthesis tasks demonstrate the effectiveness of our method, achieving state-of-the-art performance (e.g., 31.80 $\pm$ 2.70 dB in PSNR, 0.943 $\pm$ 0.102 in SSIM, and 0.0864 $\pm$ 0.0689 in LPIPS). Beyond pixel-wise fidelity, our method also preserves fine-grained anatomical structures compared with the original low-field MRI (e.g., left cerebral white matter error drops from 12.1$\%$ to 2.1$\%$, cortex from 4.2$\%$ to 3.7$\%$). To conclude, our CSS-Diff can synthesize images that are both quantitatively reliable and anatomically consistent.

Title: UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy

Authors: Tianshuo Xu, Kai Wang, Zhifei Chen, Leyi Wu, Tianshui Wen, Fei Chao, Ying-Cong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13745
Pdf URL: https://arxiv.org/pdf/2510.13745
Copy Paste: [[2510.13745]] UniCalli: A Unified Diffusion Framework for Column-Level Generation and Recognition of Chinese Calligraphy(https://arxiv.org/abs/2510.13745)
Keywords: diffusion, generative
Abstract: Computational replication of Chinese calligraphy remains challenging. Existing methods falter, either creating high-quality isolated characters while ignoring page-level aesthetics like ligatures and spacing, or attempting page synthesis at the expense of calligraphic correctness. We introduce \textbf{UniCalli}, a unified diffusion framework for column-level recognition and generation. Training both tasks jointly is deliberate: recognition constrains the generator to preserve character structure, while generation provides style and layout priors. This synergy fosters concept-level abstractions that improve both tasks, especially in limited-data regimes. We curated a dataset of over 8,000 digitized pieces, with ~4,000 densely annotated. UniCalli employs asymmetric noising and a rasterized box map for spatial priors, trained on a mix of synthetic, labeled, and unlabeled data. The model achieves state-of-the-art generative quality with superior ligature continuity and layout fidelity, alongside stronger recognition. The framework successfully extends to other ancient scripts, including Oracle bone inscriptions and Egyptian hieroglyphs. Code and data can be viewed in \href{this https URL}{this URL}.

Title: Scaling Vision Transformers for Functional MRI with Flat Maps

Authors: Connor Lane, Daniel Z. Kaplan, Tanishq Mathew Abraham, Paul S. Scotti
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2510.13768
Pdf URL: https://arxiv.org/pdf/2510.13768
Copy Paste: [[2510.13768]] Scaling Vision Transformers for Functional MRI with Flat Maps(https://arxiv.org/abs/2510.13768)
Keywords: foundation model
Abstract: A key question for adapting modern deep learning architectures to functional MRI (fMRI) is how to represent the data for model input. To bridge the modality gap between fMRI and natural images, we transform the 4D volumetric fMRI data into videos of 2D fMRI activity flat maps. We train Vision Transformers on 2.3K hours of fMRI flat map videos from the Human Connectome Project using the spatiotemporal masked autoencoder (MAE) framework. We observe that masked fMRI modeling performance improves with dataset size according to a strict power scaling law. Downstream classification benchmarks show that our model learns rich representations supporting both fine-grained state decoding across subjects, as well as subject-specific trait decoding across changes in brain state. This work is part of an ongoing open science project to build foundation models for fMRI data. Our code and datasets are available at this https URL.

Title: UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

Authors: Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2510.13774
Pdf URL: https://arxiv.org/pdf/2510.13774
Copy Paste: [[2510.13774]] UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations(https://arxiv.org/abs/2510.13774)
Keywords: foundation model
Abstract: Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at this https URL.

Title: Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation

Authors: Seyed Mohammad Mousavi, Morteza Analoui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2510.13787
Pdf URL: https://arxiv.org/pdf/2510.13787
Copy Paste: [[2510.13787]] Adaptive Visual Conditioning for Semantic Consistency in Diffusion-Based Story Continuation(https://arxiv.org/abs/2510.13787)
Keywords: diffusion
Abstract: Story continuation focuses on generating the next image in a narrative sequence so that it remains coherent with both the ongoing text description and the previously observed images. A central challenge in this setting lies in utilizing prior visual context effectively, while ensuring semantic alignment with the current textual input. In this work, we introduce AVC (Adaptive Visual Conditioning), a framework for diffusion-based story continuation. AVC employs the CLIP model to retrieve the most semantically aligned image from previous frames. Crucially, when no sufficiently relevant image is found, AVC adaptively restricts the influence of prior visuals to only the early stages of the diffusion process. This enables the model to exploit visual context when beneficial, while avoiding the injection of misleading or irrelevant information. Furthermore, we improve data quality by re-captioning a noisy dataset using large language models, thereby strengthening textual supervision and semantic alignment. Quantitative results and human evaluations demonstrate that AVC achieves superior coherence, semantic consistency, and visual fidelity compared to strong baselines, particularly in challenging cases where prior visuals conflict with the current input.

Title: NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models

Authors: Nir Goren, Oren Katzir, Abhinav Nakarmi, Eyal Ronen, Mahmood Sharif, Or Patashnik
Subjects: cs.CV, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13793
Pdf URL: https://arxiv.org/pdf/2510.13793
Copy Paste: [[2510.13793]] NoisePrints: Distortion-Free Watermarks for Authorship in Private Diffusion Models(https://arxiv.org/abs/2510.13793)
Keywords: diffusion
Abstract: With the rapid adoption of diffusion models for visual content generation, proving authorship and protecting copyright have become critical. This challenge is particularly important when model owners keep their models private and may be unwilling or unable to handle authorship issues, making third-party verification essential. A natural solution is to embed watermarks for later verification. However, existing methods require access to model weights and rely on computationally heavy procedures, rendering them impractical and non-scalable. To address these challenges, we propose , a lightweight watermarking scheme that utilizes the random seed used to initialize the diffusion process as a proof of authorship without modifying the generation process. Our key observation is that the initial noise derived from a seed is highly correlated with the generated visual content. By incorporating a hash function into the noise sampling process, we further ensure that recovering a valid seed from the content is infeasible. We also show that sampling an alternative seed that passes verification is infeasible, and demonstrate the robustness of our method under various manipulations. Finally, we show how to use cryptographic zero-knowledge proofs to prove ownership without revealing the seed. By keeping the seed secret, we increase the difficulty of watermark removal. In our experiments, we validate NoisePrints on multiple state-of-the-art diffusion models for images and videos, demonstrating efficient verification using only the seed and output, without requiring access to model weights.

Title: BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

Authors: Jia-Chen Gu, Junyi Zhang, Di Wu, Yuankai Li, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13799
Pdf URL: https://arxiv.org/pdf/2510.13799
Copy Paste: [[2510.13799]] BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning(https://arxiv.org/abs/2510.13799)
Keywords: in-context
Abstract: As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9x, while requiring only 23% of its computational overhead.

Title: Generative Universal Verifier as Multimodal Meta-Reasoner

Authors: Xinchen Zhang, Xiaoying Zhang, Youbin Wu, Yanbin Cao, Renrui Zhang, Ruihang Chu, Ling Yang, Yujiu Yang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2510.13804
Pdf URL: https://arxiv.org/pdf/2510.13804
Copy Paste: [[2510.13804]] Generative Universal Verifier as Multimodal Meta-Reasoner(https://arxiv.org/abs/2510.13804)
Keywords: generative
Abstract: We introduce Generative Universal Verifier, a novel concept and plugin designed for next-generation multimodal reasoning in vision-language models and unified multimodal models, providing the fundamental capability of reflection and refinement on visual outcomes during the reasoning and generation process. This work makes three main contributions: (1) We build ViVerBench, a comprehensive benchmark spanning 16 categories of critical tasks for evaluating visual outcomes in multimodal reasoning. Results show that existing VLMs consistently underperform across these tasks, underscoring a substantial gap from human-level capability in reliable visual verification. (2) We design two automated pipelines to construct large-scale visual verification data and train OmniVerifier-7B, the first omni-capable generative verifier trained for universal visual verification and achieves notable gains on ViVerBench(+8.3). Through training, we identify three atomic capabilities in visual verification and demonstrate how they generalize and interact synergistically. (3) We propose OmniVerifier-TTS, a sequential test-time scaling paradigm that leverages the universal verifier to bridge image generation and editing within unified models, enhancing the upper bound of generative ability through iterative fine-grained optimization. Beyond generation, we extend universal verifier to broader world-modeling interleaved reasoning scenarios. Empirically, OmniVerifier-TTS achieves improvements on T2I-ReasonBench(+3.7), and GenEval++(+4.3), outperforming existing parallel test-time scaling methods, such as Best-of-N. By endowing multimodal reasoning with reliable visual verification, OmniVerifier advances both reliable reflection during generation and scalable test-time refinement, marking a step toward more trustworthy and controllable next-generation reasoning systems.