2025-11-05

Title: Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound

Authors: Edoardo Conti, Riccardo Rosati, Lorenzo Federici, Adriano Mancini, Maria Chiara Fiorentin
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2511.01915
Pdf URL: https://arxiv.org/pdf/2511.01915
Copy Paste: [[2511.01915]] Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound(https://arxiv.org/abs/2511.01915)
Keywords: self-supervised, foundation model
Abstract: Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on fetal brain standard planes--transthalamic (TT), transventricular (TV), and transcerebellar (TC)--which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment. Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights. Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV. Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.

Title: Dynamic Population Distribution Aware Human Trajectory Generation with Diffusion Model

Authors: Qingyue Long, Can Rong, Tong Li, Yong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.01929
Pdf URL: https://arxiv.org/pdf/2511.01929
Copy Paste: [[2511.01929]] Dynamic Population Distribution Aware Human Trajectory Generation with Diffusion Model(https://arxiv.org/abs/2511.01929)
Keywords: diffusion
Abstract: Human trajectory data is crucial in urban planning, traffic engineering, and public health. However, directly using real-world trajectory data often faces challenges such as privacy concerns, data acquisition costs, and data quality. A practical solution to these challenges is trajectory generation, a method developed to simulate human mobility behaviors. Existing trajectory generation methods mainly focus on capturing individual movement patterns but often overlook the influence of population distribution on trajectory generation. In reality, dynamic population distribution reflects changes in population density across different regions, significantly impacting individual mobility behavior. Thus, we propose a novel trajectory generation framework based on a diffusion model, which integrates the dynamic population distribution constraints to guide high-fidelity generation outcomes. Specifically, we construct a spatial graph to enhance the spatial correlation of trajectories. Then, we design a dynamic population distribution aware denoising network to capture the spatiotemporal dependencies of human mobility behavior as well as the impact of population distribution in the denoising process. Extensive experiments show that the trajectories generated by our model can resemble real-world trajectories in terms of some critical statistical metrics, outperforming state-of-the-art algorithms by over 54%.

Title: Locally-Supervised Global Image Restoration

Authors: Benjamin Walder, Daniel Toader, Robert Nuster, Günther Paltauf, Peter Burgholzer, Gregor Langer, Lukas Krainer, Markus Haltmeier
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2511.01998
Pdf URL: https://arxiv.org/pdf/2511.01998
Copy Paste: [[2511.01998]] Locally-Supervised Global Image Restoration(https://arxiv.org/abs/2511.01998)
Keywords: self-supervised
Abstract: We address the problem of image reconstruction from incomplete measurements, encompassing both upsampling and inpainting, within a learning-based framework. Conventional supervised approaches require fully sampled ground truth data, while self-supervised methods allow incomplete ground truth but typically rely on random sampling that, in expectation, covers the entire image. In contrast, we consider fixed, deterministic sampling patterns with inherently incomplete coverage, even in expectation. To overcome this limitation, we exploit multiple invariances of the underlying image distribution, which theoretically allows us to achieve the same reconstruction performance as fully supervised approaches. We validate our method on optical-resolution image upsampling in photoacoustic microscopy (PAM), demonstrating competitive or superior results while requiring substantially less ground truth data.

Title: Quantum-Enhanced Generative Models for Rare Event Prediction

Authors: M.Z. Haider, M.U. Ghouri, Tayyaba Noreen, M. Salman
Subjects: cs.LG, cs.AI, cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2511.02042
Pdf URL: https://arxiv.org/pdf/2511.02042
Copy Paste: [[2511.02042]] Quantum-Enhanced Generative Models for Rare Event Prediction(https://arxiv.org/abs/2511.02042)
Keywords: diffusion, generative
Abstract: Rare events such as financial crashes, climate extremes, and biological anomalies are notoriously difficult to model due to their scarcity and heavy-tailed distributions. Classical deep generative models often struggle to capture these rare occurrences, either collapsing low-probability modes or producing poorly calibrated uncertainty estimates. In this work, we propose the Quantum-Enhanced Generative Model (QEGM), a hybrid classical-quantum framework that integrates deep latent-variable models with variational quantum circuits. The framework introduces two key innovations: (1) a hybrid loss function that jointly optimizes reconstruction fidelity and tail-aware likelihood, and (2) quantum randomness-driven noise injection to enhance sample diversity and mitigate mode collapse. Training proceeds via a hybrid loop where classical parameters are updated through backpropagation while quantum parameters are optimized using parameter-shift gradients. We evaluate QEGM on synthetic Gaussian mixtures and real-world datasets spanning finance, climate, and protein structure. Results demonstrate that QEGM reduces tail KL divergence by up to 50 percent compared to state-of-the-art baselines (GAN, VAE, Diffusion), while improving rare-event recall and coverage calibration. These findings highlight the potential of QEGM as a principled approach for rare-event prediction, offering robustness beyond what is achievable with purely classical methods.

Title: Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis

Authors: Soham Joshi, Shwet Kamal Mishra, Viswanath Gopalakrishnan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02046
Pdf URL: https://arxiv.org/pdf/2511.02046
Copy Paste: [[2511.02046]] Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis(https://arxiv.org/abs/2511.02046)
Keywords: foundation model
Abstract: Creation of large-scale databases for Visual Question Answering tasks pertaining to the text data in a scene (text-VQA) involves skilful human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, it is the need of the hour to establish an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene-text from a given image. We propose a pipeline for automated synthesis for text-VQA dataset that can produce faithful QA pairs, and which scales up with the availability of scene text data. Our proposed method harnesses the capabilities of multiple models and algorithms involving OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are streamlined into a cohesive pipeline to automate the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset comprising around 72K QA pairs based on around 44K images.

Title: Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models

Authors: Jucheng Shen, Yeonju Ro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.02077
Pdf URL: https://arxiv.org/pdf/2511.02077
Copy Paste: [[2511.02077]] Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models(https://arxiv.org/abs/2511.02077)
Keywords: diffusion
Abstract: Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.

Title: Watermarking Discrete Diffusion Language Models

Authors: Avi Bagchi, Akhil Bhimaraju, Moulik Choraria, Daniel Alabi, Lav R. Varshney
Subjects: cs.CR, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.02083
Pdf URL: https://arxiv.org/pdf/2511.02083
Copy Paste: [[2511.02083]] Watermarking Discrete Diffusion Language Models(https://arxiv.org/abs/2511.02083)
Keywords: diffusion
Abstract: Watermarking has emerged as a promising technique to track AI-generated content and differentiate it from authentic human creations. While prior work extensively studies watermarking for autoregressive large language models (LLMs) and image diffusion models, none address discrete diffusion language models, which are becoming popular due to their high inference throughput. In this paper, we introduce the first watermarking method for discrete diffusion models by applying the distribution-preserving Gumbel-max trick at every diffusion step and seeding the randomness with the sequence index to enable reliable detection. We experimentally demonstrate that our scheme is reliably detectable on state-of-the-art diffusion language models and analytically prove that it is distortion-free with an exponentially decaying probability of false detection in the token sequence length.

Title: Energy Loss Functions for Physical Systems

Authors: Sékou-Oumar Kaba, Kusha Sareen, Daniel Levy, Siamak Ravanbakhsh
Subjects: cs.LG, cs.AI, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2511.02087
Pdf URL: https://arxiv.org/pdf/2511.02087
Copy Paste: [[2511.02087]] Energy Loss Functions for Physical Systems(https://arxiv.org/abs/2511.02087)
Keywords: generative
Abstract: Effectively leveraging prior knowledge of a system's physics is crucial for applications of machine learning to scientific domains. Previous approaches mostly focused on incorporating physical insights at the architectural level. In this paper, we propose a framework to leverage physical information directly into the loss function for prediction and generative modeling tasks on systems like molecules and spins. We derive energy loss functions assuming that each data sample is in thermal equilibrium with respect to an approximate energy landscape. By using the reverse KL divergence with a Boltzmann distribution around the data, we obtain the loss as an energy difference between the data and the model predictions. This perspective also recasts traditional objectives like MSE as energy-based, but with a physically meaningless energy. In contrast, our formulation yields physically grounded loss functions with gradients that better align with valid configurations, while being architecture-agnostic and computationally efficient. The energy loss functions also inherently respect physical symmetries. We demonstrate our approach on molecular generation and spin ground-state prediction and report significant improvements over baselines.

Title: Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling

Authors: Lancelot Da Costa, Sanjeev Namjoshi, Mohammed Abbas Ansari, Bernhard Schölkopf
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02091
Pdf URL: https://arxiv.org/pdf/2511.02091
Copy Paste: [[2511.02091]] Natural Building Blocks for Structured World Models: Theory, Evidence, and Scaling(https://arxiv.org/abs/2511.02091)
Keywords: generative
Abstract: The field of world modeling is fragmented, with researchers developing bespoke architectures that rarely build upon each other. We propose a framework that specifies the natural building blocks for structured world models based on the fundamental stochastic processes that any world model must capture: discrete processes (logic, symbols) and continuous processes (physics, dynamics); the world model is then defined by the hierarchical composition of these building blocks. We examine Hidden Markov Models (HMMs) and switching linear dynamical systems (sLDS) as natural building blocks for discrete and continuous modeling--which become partially-observable Markov decision processes (POMDPs) and controlled sLDS when augmented with actions. This modular approach supports both passive modeling (generation, forecasting) and active control (planning, decision-making) within the same architecture. We avoid the combinatorial explosion of traditional structure learning by largely fixing the causal architecture and searching over only four depth parameters. We review practical expressiveness through multimodal generative modeling (passive) and planning from pixels (active), with performance competitive to neural approaches while maintaining interpretability. The core outstanding challenge is scalable joint structure-parameter learning; current methods finesse this by cleverly growing structure and parameters incrementally, but are limited in their scalability. If solved, these natural building blocks could provide foundational infrastructure for world modeling, analogous to how standardized layers enabled progress in deep learning.

Title: Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers

Authors: Zhengjie Zhang, Xiaoxie Mao, Qihao Guo, Shaoting Zhang, Qi Huang, Mu Zhou, Fang Xie, Mianxin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02206
Pdf URL: https://arxiv.org/pdf/2511.02206
Copy Paste: [[2511.02206]] Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers(https://arxiv.org/abs/2511.02206)
Keywords: generative
Abstract: Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta positron emission tomography (Abeta-PET), which is limited by high cost and limited accessibility. This study explores whether Abeta-PET spatial patterns can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566 participants. A language-enhanced generative model, driven by a large language model (LLM) and multimodal information fusion, was developed to synthesize PET images. Synthesized images were evaluated for image quality, diagnostic consistency, and clinical applicability within a fully automated diagnostic pipeline. Findings: The synthetic PET images closely resemble real PET scans in both structural details (SSIM = 0.920 +/- 0.003) and regional patterns (Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic PET, we developed a fully automatic AD diagnostic pipeline integrating PET synthesis and classification. The synthetic PET-based model (AUC = 0.78) outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while combining synthetic PET and BBMs further improved performance (AUC = 0.79). Ablation analysis supports the advantages of LLM integration and prompt engineering. Interpretation: Our language-enhanced generative model synthesizes realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial pattern assessment and improving the diagnostic workflow for Alzheimer's disease.

Title: Can Foundation Models Revolutionize Mobile AR Sparse Sensing?

Authors: Yiqin Zhao, Tian Guo
Subjects: cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/2511.02215
Pdf URL: https://arxiv.org/pdf/2511.02215
Copy Paste: [[2511.02215]] Can Foundation Models Revolutionize Mobile AR Sparse Sensing?(https://arxiv.org/abs/2511.02215)
Keywords: foundation model
Abstract: Mobile sensing systems have long faced a fundamental trade-off between sensing quality and efficiency due to constraints in computation, power, and other limitations. Sparse sensing, which aims to acquire and process only a subset of sensor data, has been a key strategy for maintaining performance under such constraints. However, existing sparse sensing methods often suffer from reduced accuracy, as missing information across space and time introduces uncertainty into many sensing systems. In this work, we investigate whether foundation models can change the landscape of mobile sparse sensing. Using real-world mobile AR data, our evaluations demonstrate that foundation models offer significant improvements in geometry-aware image warping, a central technique for enabling accurate reuse of cross-frame information. Furthermore, our study demonstrates the scalability of foundation model-based sparse sensing and shows its leading performance in 3D scene reconstruction. Collectively, our study reveals critical aspects of the promises and the open challenges of integrating foundation models into mobile sparse sensing systems.

Title: Federated Quantum Kernel Learning for Anomaly Detection in Multivariate IoT Time-Series

Authors: Kuan-Cheng Chen, Samuel Yen-Chi Chen, Chen-Yu Liu, Kin K. Leung
Subjects: cs.LG, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2511.02301
Pdf URL: https://arxiv.org/pdf/2511.02301
Copy Paste: [[2511.02301]] Federated Quantum Kernel Learning for Anomaly Detection in Multivariate IoT Time-Series(https://arxiv.org/abs/2511.02301)
Keywords: anomaly
Abstract: The rapid growth of industrial Internet of Things (IIoT) systems has created new challenges for anomaly detection in high-dimensional, multivariate time-series, where privacy, scalability, and communication efficiency are critical. Classical federated learning approaches mitigate privacy concerns by enabling decentralized training, but they often struggle with highly non-linear decision boundaries and imbalanced anomaly distributions. To address this gap, we propose a Federated Quantum Kernel Learning (FQKL) framework that integrates quantum feature maps with federated aggregation to enable distributed, privacy-preserving anomaly detection across heterogeneous IoT networks. In our design, quantum edge nodes locally compute compressed kernel statistics using parameterized quantum circuits and share only these summaries with a central server, which constructs a global Gram matrix and trains a decision function (e.g., Fed-QSVM). Experimental results on synthetic IIoT benchmarks demonstrate that FQKL achieves superior generalization in capturing complex temporal correlations compared to classical federated baselines, while significantly reducing communication overhead. This work highlights the promise of quantum kernels in federated settings, advancing the path toward scalable, robust, and quantum-enhanced intelligence for next-generation IoT infrastructures.

Title: Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation

Authors: Wongyu Kim, Hochang Lee, Sanghak Lee, Yoonsung Kim, Jaehyun Park
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2511.02358
Pdf URL: https://arxiv.org/pdf/2511.02358
Copy Paste: [[2511.02358]] Let Multimodal Embedders Learn When to Augment Query via Adaptive Query Augmentation(https://arxiv.org/abs/2511.02358)
Keywords: generative
Abstract: Query augmentation makes queries more meaningful by appending further information to the queries to find relevant documents. Current studies have proposed Large Language Model (LLM)-based embedders, which learn representation for embedding and generation for query augmentation in a multi-task manner by leveraging the generative capabilities of LLM. During inference, these jointly trained embedders have conducted query augmentation followed by embedding, showing effective results. However, augmenting every query leads to substantial embedding latency and query augmentation can be detrimental to performance for some queries. Also, previous methods have not been explored in multimodal environments. To tackle these problems, we propose M-Solomon, a universal multimodal embedder that can adaptively determine when to augment queries. Our approach first divides the queries of the training datasets into two groups at the dataset level. One includes queries that require augmentation and the other includes queries that do not. Then, we introduces a synthesis process that generates appropriate augmentations for queries that require them by leveraging a powerful Multimodal LLM (MLLM). Next, we present adaptive query augmentation. Through this step, M-Solomon can conduct query augmentation only when necessary by learning to generate synthetic augmentations with the prefix /augment for queries that demand them and to generate the simple string /embed for others. Experimental results showed that M-Solomon not only surpassed the baseline without augmentation by a large margin but also outperformed the baseline that always used augmentation, providing much faster embedding latency.

Title: CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

Authors: Jizheng Ma, Xiaofei Zhou, Yanlong Song, Han Yan
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2511.02360
Pdf URL: https://arxiv.org/pdf/2511.02360
Copy Paste: [[2511.02360]] CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning(https://arxiv.org/abs/2511.02360)
Keywords: diffusion
Abstract: In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language model that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.

Title: Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds

Authors: Leon Schwarzer, Matthias Zeller, Daniel Casado Herraez, Simon Dierl, Michael Heidingsfeld, Cyrill Stachniss
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2511.02395
Pdf URL: https://arxiv.org/pdf/2511.02395
Copy Paste: [[2511.02395]] Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds(https://arxiv.org/abs/2511.02395)
Keywords: self-supervised
Abstract: Moving object segmentation is a crucial task for safe and reliable autonomous mobile systems like self-driving cars, improving the reliability and robustness of subsequent tasks like SLAM or path planning. While the segmentation of camera or LiDAR data is widely researched and achieves great results, it often introduces an increased latency by requiring the accumulation of temporal sequences to gain the necessary temporal context. Radar sensors overcome this problem with their ability to provide a direct measurement of a point's Doppler velocity, which can be exploited for single-scan moving object segmentation. However, radar point clouds are often sparse and noisy, making data annotation for use in supervised learning very tedious, time-consuming, and cost-intensive. To overcome this problem, we address the task of self-supervised moving object segmentation of sparse and noisy radar point clouds. We follow a two-step approach of contrastive self-supervised representation learning with subsequent supervised fine-tuning using limited amounts of annotated data. We propose a novel clustering-based contrastive loss function with cluster refinement based on dynamic points removal to pretrain the network to produce motion-aware representations of the radar data. Our method improves label efficiency after fine-tuning, effectively boosting state-of-the-art performance by self-supervised pretraining.

Title: Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs

Authors: Arya Shah, Vaibhav Tripathi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02404
Pdf URL: https://arxiv.org/pdf/2511.02404
Copy Paste: [[2511.02404]] Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs(https://arxiv.org/abs/2511.02404)
Keywords: self-supervised
Abstract: Cats and humans differ in ocular anatomy. Most notably, Felis Catus (domestic cats) have vertically elongated pupils linked to ambush predation; yet, how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF $\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at block14), revealing depth-dependent divergences between similarity and representational geometry. CNNs remain strong baselines but below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. Results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.

Title: KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image

Authors: Teerapong Panboonyuen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02462
Pdf URL: https://arxiv.org/pdf/2511.02462
Copy Paste: [[2511.02462]] KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image(https://arxiv.org/abs/2511.02462)
Keywords: diffusion
Abstract: Satellite image inpainting is a crucial task in remote sensing, where accurately restoring missing or occluded regions is essential for robust image analysis. In this paper, we propose KAO, a novel framework that utilizes Kernel-Adaptive Optimization within diffusion models for satellite image inpainting. KAO is specifically designed to address the challenges posed by very high-resolution (VHR) satellite datasets, such as DeepGlobe and the Massachusetts Roads Dataset. Unlike existing methods that rely on preconditioned models requiring extensive retraining or postconditioned models with significant computational overhead, KAO introduces a Latent Space Conditioning approach, optimizing a compact latent space to achieve efficient and accurate inpainting. Furthermore, we incorporate Explicit Propagation into the diffusion process, facilitating forward-backward fusion, which improves the stability and precision of the method. Experimental results demonstrate that KAO sets a new benchmark for VHR satellite image restoration, providing a scalable, high-performance solution that balances the efficiency of preconditioned models with the flexibility of postconditioned models.

Title: Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization

Authors: Tao Liu, Kan Ren, Qian Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02489
Pdf URL: https://arxiv.org/pdf/2511.02489
Copy Paste: [[2511.02489]] Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization(https://arxiv.org/abs/2511.02489)
Keywords: generative
Abstract: With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: this https URL.

Title: DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding

Authors: Zixuan Liu, Siavash H. Khajavi, Guangkai Jiang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2511.02495
Pdf URL: https://arxiv.org/pdf/2511.02495
Copy Paste: [[2511.02495]] DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding(https://arxiv.org/abs/2511.02495)
Keywords: diffusion
Abstract: Recent advances in multi-modal models have demonstrated strong performance in tasks such as image generation and reasoning. However, applying these models to the fire domain remains challenging due to the lack of publicly available datasets with high-quality fire domain annotations. To address this gap, we introduce DetectiumFire, a large-scale, multi-modal dataset comprising of 22.5k high-resolution fire-related images and 2.5k real-world fire-related videos covering a wide range of fire types, environments, and risk levels. The data are annotated with both traditional computer vision labels (e.g., bounding boxes) and detailed textual prompts describing the scene, enabling applications such as synthetic data generation and fire risk reasoning. DetectiumFire offers clear advantages over existing benchmarks in scale, diversity, and data quality, significantly reducing redundancy and enhancing coverage of real-world scenarios. We validate the utility of DetectiumFire across multiple tasks, including object detection, diffusion-based image generation, and vision-language reasoning. Our results highlight the potential of this dataset to advance fire-related research and support the development of intelligent safety systems. We release DetectiumFire to promote broader exploration of fire understanding in the AI community. The dataset is available at this https URL

Title: Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes

Authors: Robinson Umeike, Neil Getty, Yin Xiangyu, Yi Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02503
Pdf URL: https://arxiv.org/pdf/2511.02503
Copy Paste: [[2511.02503]] Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes(https://arxiv.org/abs/2511.02503)
Keywords: foundation model, in-context
Abstract: The automation of workflows in advanced microscopy is a key goal where foundation models like Language Models (LLMs) and Vision-Language Models (VLMs) show great potential. However, adapting these general-purpose models for specialized scientific tasks is critical, and the optimal domain adaptation strategy is often unclear. To address this, we introduce PtychoBench, a new multi-modal, multi-task benchmark for ptychographic analysis. Using this benchmark, we systematically compare two specialization strategies: Supervised Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies on a visual artifact detection task with VLMs and a textual parameter recommendation task with LLMs in a data-scarce regime. Our findings reveal that the optimal specialization pathway is task-dependent. For the visual task, SFT and ICL are highly complementary, with a fine-tuned model guided by context-aware examples achieving the highest mean performance (Micro-F1 of 0.728). Conversely, for the textual task, ICL on a large base model is the superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm the superiority of context-aware prompting and identify a consistent contextual interference phenomenon in fine-tuned models. These results, benchmarked against strong baselines including GPT-4o and a DINOv3-based classifier, offer key observations for AI in science: the optimal specialization path in our benchmark is dependent on the task modality, offering a clear framework for developing more effective science-based agentic systems.

Title: Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data

Authors: Jessica Plassmann, Nicolas Schuler, Georg von Freymann, Michael Schuth
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02541
Pdf URL: https://arxiv.org/pdf/2511.02541
Copy Paste: [[2511.02541]] Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data(https://arxiv.org/abs/2511.02541)
Keywords: anomaly
Abstract: Shearography is a non-destructive testing method for detecting subsurface defects, offering high sensitivity and full-field inspection capabilities. However, its industrial adoption remains limited due to the need for expert interpretation. To reduce reliance on labeled data and manual evaluation, this study explores unsupervised learning methods for automated anomaly detection in shearographic images. Three architectures are evaluated: a fully connected autoencoder, a convolutional autoencoder, and a student-teacher feature matching model. All models are trained solely on defect-free data. A controlled dataset was developed using a custom specimen with reproducible defect patterns, enabling systematic acquisition of shearographic measurements under both ideal and realistic deformation conditions. Two training subsets were defined: one containing only undistorted, defect-free samples, and one additionally including globally deformed, yet defect-free, data. The latter simulates practical inspection conditions by incorporating deformation-induced fringe patterns that may obscure localized anomalies. The models are evaluated in terms of binary classification and, for the student-teacher model, spatial defect localization. Results show that the student-teacher approach achieves superior classification robustness and enables precise localization. Compared to the autoencoder-based models, it demonstrates improved separability of feature representations, as visualized through t-SNE embeddings. Additionally, a YOLOv8 model trained on labeled defect data serves as a reference to benchmark localization quality. This study underscores the potential of unsupervised deep learning for scalable, label-efficient shearographic inspection in industrial environments.

Title: Forecasting Future Anatomies: Longitudianl Brain Mri-to-Mri Prediction

Authors: Ali Farki, Elaheh Moradi, Deepika Koundal, Jussi Tohka
Subjects: cs.CV, cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2511.02558
Pdf URL: https://arxiv.org/pdf/2511.02558
Copy Paste: [[2511.02558]] Forecasting Future Anatomies: Longitudianl Brain Mri-to-Mri Prediction(https://arxiv.org/abs/2511.02558)
Keywords: generative
Abstract: Predicting future brain state from a baseline magnetic resonance image (MRI) is a central challenge in neuroimaging and has important implications for studying neurodegenerative diseases such as Alzheimer's disease (AD). Most existing approaches predict future cognitive scores or clinical outcomes, such as conversion from mild cognitive impairment to dementia. Instead, here we investigate longitudinal MRI image-to-image prediction that forecasts a participant's entire brain MRI several years into the future, intrinsically modeling complex, spatially distributed neurodegenerative patterns. We implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR, Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL). Predicted follow-up MRIs are directly compared with the actual follow-up scans using metrics that capture global similarity and local differences. The best performing models achieve high-fidelity predictions, and all models generalize well to an independent external dataset, demonstrating robust cross-cohort performance. Our results indicate that deep learning can reliably predict participant-specific brain MRI at the voxel level, offering new opportunities for individualized prognosis.

Title: TAUE: Training-free Noise Transplant and Cultivation Diffusion Model

Authors: Daichi Nagai, Ryugo Morita, Shunsuke Kitada, Hitoshi Iyatomi
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2511.02580
Pdf URL: https://arxiv.org/pdf/2511.02580
Copy Paste: [[2511.02580]] TAUE: Training-free Noise Transplant and Cultivation Diffusion Model(https://arxiv.org/abs/2511.02580)
Keywords: diffusion, generative
Abstract: Despite the remarkable success of text-to-image diffusion models, their output of a single, flattened image remains a critical bottleneck for professional applications requiring layer-wise control. Existing solutions either rely on fine-tuning with large, inaccessible datasets or are training-free yet limited to generating isolated foreground elements, failing to produce a complete and coherent scene. To address this, we introduce the Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a novel framework for zero-shot, layer-wise image generation. Our core technique, Noise Transplantation and Cultivation (NTC), extracts intermediate latent representations from both foreground and composite generation processes, transplanting them into the initial noise for subsequent layers. This ensures semantic and structural coherence across foreground, background, and composite layers, enabling consistent, multi-layered outputs without requiring fine-tuning or auxiliary datasets. Extensive experiments show that our training-free method achieves performance comparable to fine-tuned methods, enhancing layer-wise consistency while maintaining high image quality and fidelity. TAUE not only eliminates costly training and dataset requirements but also unlocks novel downstream applications, such as complex compositional editing, paving the way for more accessible and controllable generative workflows.

Title: Zero-Shot Multi-Animal Tracking in the Wild

Authors: Jan Frederik Meier, Timo Lüddecke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02591
Pdf URL: https://arxiv.org/pdf/2511.02591
Copy Paste: [[2511.02591]] Zero-Shot Multi-Animal Tracking in the Wild(https://arxiv.org/abs/2511.02591)
Keywords: foundation model
Abstract: Multi-animal tracking is crucial for understanding animal ecology and behavior. However, it remains a challenging task due to variations in habitat, motion patterns, and species appearance. Traditional approaches typically require extensive model fine-tuning and heuristic design for each application scenario. In this work, we explore the potential of recent vision foundation models for zero-shot multi-animal tracking. By combining a Grounding Dino object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully designed heuristics, we develop a tracking framework that can be applied to new datasets without any retraining or hyperparameter adaptation. Evaluations on ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate strong and consistent performance across diverse species and environments. The code is available at this https URL.

Title: UniChange: Unifying Change Detection with Multimodal Large Language Model

Authors: Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2511.02607
Pdf URL: https://arxiv.org/pdf/2511.02607
Copy Paste: [[2511.02607]] UniChange: Unifying Change Detection with Multimodal Large Language Model(https://arxiv.org/abs/2511.02607)
Keywords: generative
Abstract: Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at this https URL.

Title: A Non-Adversarial Approach to Idempotent Generative Modelling

Authors: Mohammed Al-Jaff, Giovanni Luca Marchetti, Michael C Welle, Jens Lundell, Mats G. Gustafsson, Gustav Eje Henter, Hossein Azizpour, Danica Kragic
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.02614
Pdf URL: https://arxiv.org/pdf/2511.02614
Copy Paste: [[2511.02614]] A Non-Adversarial Approach to Idempotent Generative Modelling(https://arxiv.org/abs/2511.02614)
Keywords: generative
Abstract: Idempotent Generative Networks (IGNs) are deep generative models that also function as local data manifold projectors, mapping arbitrary inputs back onto the manifold. They are trained to act as identity operators on the data and as idempotent operators off the data manifold. However, IGNs suffer from mode collapse, mode dropping, and training instability due to their objectives, which contain adversarial components and can cause the model to cover the data manifold only partially -- an issue shared with generative adversarial networks. We introduce Non-Adversarial Idempotent Generative Networks (NAIGNs) to address these issues. Our loss function combines reconstruction with the non-adversarial generative objective of Implicit Maximum Likelihood Estimation (IMLE). This improves on IGN's ability to restore corrupted data and generate new samples that closely match the data distribution. We moreover demonstrate that NAIGNs implicitly learn the distance field to the data manifold, as well as an energy-based model.

Title: VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Authors: Zhicheng Zhang, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02712
Pdf URL: https://arxiv.org/pdf/2511.02712
Copy Paste: [[2511.02712]] VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models(https://arxiv.org/abs/2511.02712)
Keywords: foundation model
Abstract: Understanding and predicting emotion from videos has gathered significant attention in recent studies, driven by advancements in video large language models (VideoLLMs). While advanced methods have made progress in video emotion analysis, the intrinsic nature of emotions poses significant challenges. Emotions are characterized by dynamic and cues-dependent properties, making it difficult to understand complex and evolving emotional states with reasonable rationale. To tackle these challenges, we propose a novel affective cues-guided reasoning framework that unifies fundamental attribute perception, expression analysis, and high-level emotional understanding in a stage-wise manner. At the core of our approach is a family of video emotion foundation models (VidEmo), specifically designed for emotion reasoning and instruction-following. These models undergo a two-stage tuning process: first, curriculum emotion learning for injecting emotion knowledge, followed by affective-tree reinforcement learning for emotion reasoning. Moreover, we establish a foundational data infrastructure and introduce a emotion-centric fine-grained dataset (Emo-CFG) consisting of 2.1M diverse instruction-based samples. Emo-CFG includes explainable emotional question-answering, fine-grained captions, and associated rationales, providing essential resources for advancing emotion understanding tasks. Experimental results demonstrate that our approach achieves competitive performance, setting a new milestone across 15 face perception tasks.

Title: AI Diffusion in Low Resource Language Countries

Authors: Amit Misra, Syed Waqas Zamir, Wassim Hamidouche, Inbal Becker-Reshef, Juan Lavista Ferres
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2511.02752
Pdf URL: https://arxiv.org/pdf/2511.02752
Copy Paste: [[2511.02752]] AI Diffusion in Low Resource Language Countries(https://arxiv.org/abs/2511.02752)
Keywords: diffusion
Abstract: Artificial intelligence (AI) is diffusing globally at unprecedented speed, but adoption remains uneven. Frontier Large Language Models (LLMs) are known to perform poorly on low-resource languages due to data scarcity. We hypothesize that this performance deficit reduces the utility of AI, thereby slowing adoption in Low-Resource Language Countries (LRLCs). To test this, we use a weighted regression model to isolate the language effect from socioeconomic and demographic factors, finding that LRLCs have a share of AI users that is approximately 20% lower relative to their baseline. These results indicate that linguistic accessibility is a significant, independent barrier to equitable AI diffusion.

Title: STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation

Authors: Bum Chul Kwon, Ben Shapira, Moshiko Raboh, Shreyans Sethi, Shruti Murarka, Joseph A Morrone, Jianying Hu, Parthasarathy Suryanarayanan
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2511.02769
Pdf URL: https://arxiv.org/pdf/2511.02769
Copy Paste: [[2511.02769]] STAR-VAE: Latent Variable Transformers for Scalable and Controllable Molecular Generation(https://arxiv.org/abs/2511.02769)
Keywords: generative
Abstract: The chemical space of drug-like molecules is vast, motivating the development of generative models that must learn broad chemical distributions, enable conditional generation by capturing structure-property representations, and provide fast molecular generation. Meeting the objectives depends on modeling choices, including the probabilistic modeling approach, the conditional generative formulation, the architecture, and the molecular input representation. To address the challenges, we present STAR-VAE (Selfies-encoded, Transformer-based, AutoRegressive Variational Auto Encoder), a scalable latent-variable framework with a Transformer encoder and an autoregressive Transformer decoder. It is trained on 79 million drug-like molecules from PubChem, using SELFIES to guarantee syntactic validity. The latent-variable formulation enables conditional generation: a property predictor supplies a conditioning signal that is applied consistently to the latent prior, the inference network, and the decoder. Our contributions are: (i) a Transformer-based latent-variable encoder-decoder model trained on SELFIES representations; (ii) a principled conditional latent-variable formulation for property-guided generation; and (iii) efficient finetuning with low-rank adapters (LoRA) in both encoder and decoder, enabling fast adaptation with limited property and activity data. On the GuacaMol and MOSES benchmarks, our approach matches or exceeds baselines, and latent-space analyses reveal smooth, semantically structured representations that support both unconditional exploration and property-aware generation. On the Tartarus benchmarks, the conditional model shifts docking-score distributions toward stronger predicted binding. These results suggest that a modernized, scale-appropriate VAE remains competitive for molecular generation when paired with principled conditioning and parameter-efficient finetuning.

Title: AI-Generated Image Detection: An Empirical Study and Future Research Directions

Authors: Nusrat Tasnim, Kutub Uddin, Khalid Mahmood Malik
Subjects: cs.CV, cs.GT
Abstract URL: https://arxiv.org/abs/2511.02791
Pdf URL: https://arxiv.org/pdf/2511.02791
Copy Paste: [[2511.02791]] AI-Generated Image Detection: An Empirical Study and Future Research Directions(https://arxiv.org/abs/2511.02791)
Keywords: diffusion
Abstract: The threats posed by AI-generated media, particularly deepfakes, are now raising significant challenges for multimedia forensics, misinformation detection, and biometric system resulting in erosion of public trust in the legal system, significant increase in frauds, and social engineering attacks. Although several forensic methods have been proposed, they suffer from three critical gaps: (i) use of non-standardized benchmarks with GAN- or diffusion-generated images, (ii) inconsistent training protocols (e.g., scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail to capture generalization and explainability. These limitations hinder fair comparison, obscure true robustness, and restrict deployment in security-critical applications. This paper introduces a unified benchmarking framework for systematic evaluation of forensic methods under controlled and reproducible conditions. We benchmark ten SoTA forensic methods (scratch, frozen, and fine-tuned) and seven publicly available datasets (GAN and diffusion) to perform extensive and systematic evaluations. We evaluate performance using multiple metrics, including accuracy, average precision, ROC-AUC, error rate, and class-wise sensitivity. We also further analyze model interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations demonstrate substantial variability in generalization, with certain methods exhibiting strong in-distribution performance but degraded cross-model transferability. This study aims to guide the research community toward a deeper understanding of the strengths and limitations of current forensic approaches, and to inspire the development of more robust, generalizable, and explainable solutions.

Title: TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models

Authors: Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, Vinay Kumar Sankarapu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02802
Pdf URL: https://arxiv.org/pdf/2511.02802
Copy Paste: [[2511.02802]] TabTune: A Unified Library for Inference and Fine-Tuning Tabular Foundation Models(https://arxiv.org/abs/2511.02802)
Keywords: foundation model
Abstract: Tabular foundation models represent a growing paradigm in structured data learning, extending the benefits of large-scale pretraining to tabular domains. However, their adoption remains limited due to heterogeneous preprocessing pipelines, fragmented APIs, inconsistent fine-tuning procedures, and the absence of standardized evaluation for deployment-oriented metrics such as calibration and fairness. We present TabTune, a unified library that standardizes the complete workflow for tabular foundation models through a single interface. TabTune provides consistent access to seven state-of-the-art models supporting multiple adaptation strategies, including zero-shot inference, meta-learning, supervised fine-tuning (SFT), and parameter-efficient fine-tuning (PEFT). The framework automates model-aware preprocessing, manages architectural heterogeneity internally, and integrates evaluation modules for performance, calibration, and fairness. Designed for extensibility and reproducibility, TabTune enables consistent benchmarking of adaptation strategies of tabular foundation models. The library is open source and available at this https URL .

Title: Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities

Authors: Amanda Bertsch, Adithya Pratapa, Teruko Mitamura, Graham Neubig, Matthew R. Gormley
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.02817
Pdf URL: https://arxiv.org/pdf/2511.02817
Copy Paste: [[2511.02817]] Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities(https://arxiv.org/abs/2511.02817)
Keywords: in-context
Abstract: As model context lengths continue to grow, concerns about whether models effectively use the full context length have persisted. While several carefully designed long-context evaluations have recently been released, these evaluations tend to rely on retrieval from one or more sections of the context, which allows nearly all of the context tokens to be disregarded as noise. This represents only one type of task that might be performed with long context. We introduce Oolong, a benchmark of long-context reasoning tasks that require analyzing individual chunks of text on an atomic level, and then aggregating these analyses to answer distributional questions. Oolong is separated into two task sets: Oolong-synth, a set of naturalistic synthetic tasks, where we can easily ablate components of the reasoning problem; and Oolong-real, a downstream setting which requires reasoning over real-world conversational data. Oolong requires models to reason over large quantities of examples, to perform both classification and counting in-context, and to reason over temporal and user relations. Even frontier models struggle on Oolong, with GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro all achieving less than 50% accuracy on both splits at 128K. We release the data and evaluation harness for Oolong to enable further development of models that can reason over large quantities of text.

Title: PLUTO-4: Frontier Pathology Foundation Models

Authors: Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker, Judy Shen, Chintan Shah, Blake Martin, Aashish Sood, Elliot Miller, Ben Glass, Andy Beck, Harsha Pokkalla, Syed Ashar Javed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2511.02826
Pdf URL: https://arxiv.org/pdf/2511.02826
Copy Paste: [[2511.02826]] PLUTO-4: Frontier Pathology Foundation Models(https://arxiv.org/abs/2511.02826)
Keywords: self-supervised, foundation model
Abstract: Foundation models trained on large-scale pathology image corpora have demonstrated strong transfer capabilities across diverse histopathology tasks. Building on this progress, we introduce PLUTO-4, our next generation of pathology foundation models that extend the Pathology-Universal Transformer (PLUTO) to frontier scale. We share two complementary Vision Transformer architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE embeddings, and a frontier-scale PLUTO-4G model trained with a single patch size to maximize representation capacity and stability. Both models are pretrained using a self-supervised objective derived from DINOv2 on a large multi-institutional corpus containing 551,164 WSIs from 137,144 patients across over 50 institutions, spanning over 60 disease types and over 100 stains. Comprehensive evaluation across public and internal benchmarks demonstrates that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying spatial and biological context, including patch-level classification, segmentation, and slide-level diagnosis. The compact PLUTO-4S provides high-throughput and robust performance for practical deployment, while PLUTO-4G establishes new performance frontiers across multiple pathology benchmarks, including an 11% improvement in dermatopathology diagnosis. These diverse improvements underscore PLUTO-4's potential to transform real-world applications as a backbone for translational research and diagnostic use cases.

Title: GeoCrossBench: Cross-Band Generalization for Remote Sensing

Authors: Hakob Tamazyan, Ani Vanyan, Alvard Barseghyan, Anna Khosrovyan, Evan Shelhamer, Hrant Khachatrian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2511.02831
Pdf URL: https://arxiv.org/pdf/2511.02831
Copy Paste: [[2511.02831]] GeoCrossBench: Cross-Band Generalization for Remote Sensing(https://arxiv.org/abs/2511.02831)
Keywords: self-supervised, foundation model
Abstract: The number and diversity of remote sensing satellites grows over time, while the vast majority of labeled data comes from older satellites. As the foundation models for Earth observation scale up, the cost of (re-)training to support new satellites grows too, so the generalization capabilities of the models towards new satellites become increasingly important. In this work we introduce GeoCrossBench, an extension of the popular GeoBench benchmark with a new evaluation protocol: it tests the in-distribution performance; generalization to satellites with no band overlap; and generalization to satellites with additional bands with respect to the training set. We also develop a self-supervised extension of ChannelViT, ChiViT, to improve its cross-satellite performance. First, we show that even the best foundation models for remote sensing (DOFA, TerraFM) do not outperform general purpose models like DINOv3 in the in-distribution setting. Second, when generalizing to new satellites with no band overlap, all models suffer 2-4x drop in performance, and ChiViT significantly outperforms the runner-up DINOv3. Third, the performance of all tested models drops on average by 5-25\% when given additional bands during test time. Finally, we show that fine-tuning just the last linear layer of these models using oracle labels from all bands can get relatively consistent performance across all satellites, highlighting that the benchmark is far from being saturated. We publicly release the code and the datasets to encourage the development of more future-proof remote sensing models with stronger cross-satellite generalization.