2025-05-22

Title: CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity

Authors: Georgiana Manolache, Gerard Schouten, Joaquin Vanschoren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14707
Pdf URL: https://arxiv.org/pdf/2505.14707
Copy Paste: [[2505.14707]] CrypticBio: A Large Multimodal Dataset for Visually Confusing Biodiversity(https://arxiv.org/abs/2505.14707)
Keywords: foundation model
Abstract: We present CrypticBio, the largest publicly available multimodal dataset of visually confusing species, specifically curated to support the development of AI models in the context of biodiversity applications. Visually confusing or cryptic species are groups of two or more taxa that are nearly indistinguishable based on visual characteristics alone. While much existing work addresses taxonomic identification in a broad sense, datasets that directly address the morphological confusion of cryptic species are small, manually curated, and target only a single taxon. Thus, the challenge of identifying such subtle differences in a wide range of taxa remains unaddressed. Curated from real-world trends in species misidentification among community annotators of iNaturalist, CrypticBio contains 52K unique cryptic groups spanning 67K species, represented in 166 million images. Rich research-grade image annotations--including scientific, multicultural, and multilingual species terminology, hierarchical taxonomy, spatiotemporal context, and associated cryptic groups--address multimodal AI in biodiversity research. For easy dataset curation, we provide an open-source pipeline CrypticBio-Curate. The multimodal nature of the dataset beyond vision-language arises from the integration of geographical and temporal data as complementary cues to identifying cryptic species. To highlight the importance of the dataset, we benchmark a suite of state-of-the-art foundation models across CrypticBio subsets of common, unseen, endangered, and invasive species, and demonstrate the substantial impact of geographical context on vision-language zero-shot learning for cryptic species. By introducing CrypticBio, we aim to catalyze progress toward real-world-ready biodiversity AI models capable of handling the nuanced challenges of species ambiguity.

Title: DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance

Authors: Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14708
Pdf URL: https://arxiv.org/pdf/2505.14708
Copy Paste: [[2505.14708]] DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance(https://arxiv.org/abs/2505.14708)
Keywords: diffusion
Abstract: Diffusion transformer-based video generation models (DiTs) have recently attracted widespread attention for their excellent generation quality. However, their computational cost remains a major bottleneck-attention alone accounts for over 80% of total latency, and generating just 8 seconds of 720p video takes tens of minutes-posing serious challenges to practical application and scalability. To address this, we propose the DraftAttention, a training-free framework for the acceleration of video diffusion transformers with dynamic sparse attention on GPUs. We apply down-sampling to each feature map across frames in the compressed latent space, enabling a higher-level receptive field over the latent composed of hundreds of thousands of tokens. The low-resolution draft attention map, derived from draft query and key, exposes redundancy both spatially within each feature map and temporally across frames. We reorder the query, key, and value based on the draft attention map to guide the sparse attention computation in full resolution, and subsequently restore their original order after the attention computation. This reordering enables structured sparsity that aligns with hardware-optimized execution. Our theoretical analysis demonstrates that the low-resolution draft attention closely approximates the full attention, providing reliable guidance for constructing accurate sparse attention. Experimental results show that our method outperforms existing sparse attention approaches in video generation quality and achieves up to 1.75x end-to-end speedup on GPUs. Code: this https URL

Title: Time Series Similarity Score Functions to Monitor and Interact with the Training and Denoising Process of a Time Series Diffusion Model applied to a Human Activity Recognition Dataset based on IMUs

Authors: Heiko Oppel, Andreas Spilz, Michael Munz
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14739
Pdf URL: https://arxiv.org/pdf/2505.14739
Copy Paste: [[2505.14739]] Time Series Similarity Score Functions to Monitor and Interact with the Training and Denoising Process of a Time Series Diffusion Model applied to a Human Activity Recognition Dataset based on IMUs(https://arxiv.org/abs/2505.14739)
Keywords: diffusion, generative
Abstract: Denoising diffusion probabilistic models are able to generate synthetic sensor signals. The training process of such a model is controlled by a loss function which measures the difference between the noise that was added in the forward process and the noise that was predicted by the diffusion model. This enables the generation of realistic data. However, the randomness within the process and the loss function itself makes it difficult to estimate the quality of the data. Therefore, we examine multiple similarity metrics and adapt an existing metric to overcome this issue by monitoring the training and synthetisation process using those metrics. The adapted metric can even be fine-tuned on the input data to comply with the requirements of an underlying classification task. We were able to significantly reduce the amount of training epochs without a performance reduction in the classification task. An optimized training process not only saves resources, but also reduces the time for training generative models.

Title: Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism

Authors: Kunyun Wang, Bohan Li, Kai Yu, Minyi Guo, Jieru Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14741
Pdf URL: https://arxiv.org/pdf/2505.14741
Copy Paste: [[2505.14741]] Communication-Efficient Diffusion Denoising Parallelization via Reuse-then-Predict Mechanism(https://arxiv.org/abs/2505.14741)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as a powerful class of generative models across various modalities, including image, video, and audio synthesis. However, their deployment is often limited by significant inference latency, primarily due to the inherently sequential nature of the denoising process. While existing parallelization strategies attempt to accelerate inference by distributing computation across multiple devices, they typically incur high communication overhead, hindering deployment on commercial hardware. To address this challenge, we propose \textbf{ParaStep}, a novel parallelization method based on a reuse-then-predict mechanism that parallelizes diffusion inference by exploiting similarity between adjacent denoising steps. Unlike prior approaches that rely on layer-wise or stage-wise communication, ParaStep employs lightweight, step-wise communication, substantially reducing overhead. ParaStep achieves end-to-end speedups of up to \textbf{3.88}$\times$ on SVD, \textbf{2.43}$\times$ on CogVideoX-2b, and \textbf{6.56}$\times$ on AudioLDM2-large, while maintaining generation quality. These results highlight ParaStep as a scalable and communication-efficient solution for accelerating diffusion inference, particularly in bandwidth-constrained environments.

Title: Large Language Models for Data Synthesis

Authors: Yihong Tang, Menglin Kong, Lijun Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.14752
Pdf URL: https://arxiv.org/pdf/2505.14752
Copy Paste: [[2505.14752]] Large Language Models for Data Synthesis(https://arxiv.org/abs/2505.14752)
Keywords: generative
Abstract: Generating synthetic data that faithfully captures the statistical structure of real-world distributions is a fundamental challenge in data modeling. Classical approaches often depend on strong parametric assumptions or manual structural design and struggle in high-dimensional or heterogeneous domains. Recent progress in Large Language Models (LLMs) reveals their potential as flexible, high-dimensional priors over real-world distributions. However, when applied to data synthesis, standard LLM-based sampling is inefficient, constrained by fixed context limits, and fails to ensure statistical alignment. Given this, we introduce LLMSynthor, a general framework for data synthesis that transforms LLMs into structure-aware simulators guided by distributional feedback. LLMSynthor treats the LLM as a nonparametric copula simulator for modeling high-order dependencies and introduces LLM Proposal Sampling to generate grounded proposal distributions that improve sampling efficiency without requiring rejection. By minimizing discrepancies in the summary statistics space, the iterative synthesis loop aligns real and synthetic data while gradually uncovering and refining the latent generative structure. We evaluate LLMSynthor in both controlled and real-world settings using heterogeneous datasets in privacy-sensitive domains (e.g., e-commerce, population, and mobility) that encompass both structured and unstructured formats. The synthetic data produced by LLMSynthor shows high statistical fidelity, practical utility, and cross-data adaptability, positioning it as a valuable tool across economics, social science, urban studies, and beyond.

Title: This Time is Different: An Observability Perspective on Time Series Foundation Models

Authors: Ben Cohen, Emaad Khwaja, Youssef Doubli, Salahidine Lemaachi, Chris Lettieri, Charles Masson, Hugo Miccinilli, Elise Ramé, Qiqi Ren, Afshin Rostamizadeh, Jean Ogier du Terrail, Anna-Monica Toon, Kan Wang, Stephan Xie, David Asker, Ameet Talwalkar, Othmane Abou-Amal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14766
Pdf URL: https://arxiv.org/pdf/2505.14766
Copy Paste: [[2505.14766]] This Time is Different: An Observability Perspective on Time Series Foundation Models(https://arxiv.org/abs/2505.14766)
Keywords: foundation model
Abstract: We introduce Toto, a time series forecasting foundation model with 151 million parameters. Toto uses a modern decoder-only architecture coupled with architectural innovations designed to account for specific challenges found in multivariate observability time series data. Toto's pre-training corpus is a mixture of observability data, open datasets, and synthetic data, and is 4-10$\times$ larger than those of leading time series foundation models. Additionally, we introduce BOOM, a large-scale benchmark consisting of 350 million observations across 2,807 real-world time series. For both Toto and BOOM, we source observability data exclusively from Datadog's own telemetry and internal observability metrics. Extensive evaluations demonstrate that Toto achieves state-of-the-art performance on both BOOM and on established general purpose time series forecasting benchmarks. Toto's model weights, inference code, and evaluation scripts, as well as BOOM's data and evaluation code, are all available as open source under the Apache 2.0 License available at this https URL and this https URL.

Title: KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches

Authors: Mingquan Feng, Yixin Huang, Yifan Fu, Shaobo Wang, Junchi Yan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14777
Pdf URL: https://arxiv.org/pdf/2505.14777
Copy Paste: [[2505.14777]] KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches(https://arxiv.org/abs/2505.14777)
Keywords: diffusion
Abstract: The design of optimization algorithms for neural networks remains a critical challenge, with most existing methods relying on heuristic adaptations of gradient-based approaches. This paper introduces KO (Kinetics-inspired Optimizer), a novel neural optimizer inspired by kinetic theory and partial differential equation (PDE) simulations. We reimagine the training dynamics of network parameters as the evolution of a particle system governed by kinetic principles, where parameter updates are simulated via a numerical scheme for the Boltzmann transport equation (BTE) that models stochastic particle collisions. This physics-driven approach inherently promotes parameter diversity during optimization, mitigating the phenomenon of parameter condensation, i.e. collapse of network parameters into low-dimensional subspaces, through mechanisms analogous to thermal diffusion in physical systems. We analyze this property, establishing both a mathematical proof and a physical interpretation. Extensive experiments on image classification (CIFAR-10/100, ImageNet) and text classification (IMDB, Snips) tasks demonstrate that KO consistently outperforms baseline optimizers (e.g., Adam, SGD), achieving accuracy improvements while computation cost remains comparable.

Title: Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation

Authors: Runze Zhao, Yue Yu, Adams Yiyue Zhu, Chen Yang, Dongruo Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14821
Pdf URL: https://arxiv.org/pdf/2505.14821
Copy Paste: [[2505.14821]] Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation(https://arxiv.org/abs/2505.14821)
Keywords: diffusion
Abstract: Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamic functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. We implemented experiments to backup our proposed algorithms on continuous control tasks and diffusion model fine-tuning, demonstrating comparable performance with significantly fewer policy updates and rollouts.

Title: Leveraging Generative AI Models to Explore Human Identity

Authors: Yunha Yeo, Daeho Um
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14843
Pdf URL: https://arxiv.org/pdf/2505.14843
Copy Paste: [[2505.14843]] Leveraging Generative AI Models to Explore Human Identity(https://arxiv.org/abs/2505.14843)
Keywords: diffusion, generative
Abstract: This paper attempts to explore human identity by utilizing neural networks in an indirect manner. For this exploration, we adopt diffusion models, state-of-the-art AI generative models trained to create human face images. By relating the generated human face to human identity, we establish a correspondence between the face image generation process of the diffusion model and the process of human identity formation. Through experiments with the diffusion model, we observe that changes in its external input result in significant changes in the generated face image. Based on the correspondence, we indirectly confirm the dependence of human identity on external factors in the process of human identity formation. Furthermore, we introduce \textit{Fluidity of Human Identity}, a video artwork that expresses the fluid nature of human identity affected by varying external factors. The video is available at this https URL.

Title: A self-regulated convolutional neural network for classifying variable stars

Authors: Francisco Pérez-Galarce, Jorge Martínez-Palomera, Karim Pichara, Pablo Huijse, Márcio Catelan
Subjects: cs.LG, astro-ph.SR
Abstract URL: https://arxiv.org/abs/2505.14877
Pdf URL: https://arxiv.org/pdf/2505.14877
Copy Paste: [[2505.14877]] A self-regulated convolutional neural network for classifying variable stars(https://arxiv.org/abs/2505.14877)
Keywords: generative
Abstract: Over the last two decades, machine learning models have been widely applied and have proven effective in classifying variable stars, particularly with the adoption of deep learning architectures such as convolutional neural networks, recurrent neural networks, and transformer models. While these models have achieved high accuracy, they require high-quality, representative data and a large number of labelled samples for each star type to generalise well, which can be challenging in time-domain surveys. This challenge often leads to models learning and reinforcing biases inherent in the training data, an issue that is not easily detectable when validation is performed on subsamples from the same catalogue. The problem of biases in variable star data has been largely overlooked, and a definitive solution has yet to be established. In this paper, we propose a new approach to improve the reliability of classifiers in variable star classification by introducing a self-regulated training process. This process utilises synthetic samples generated by a physics-enhanced latent space variational autoencoder, incorporating six physical parameters from Gaia Data Release 3. Our method features a dynamic interaction between a classifier and a generative model, where the generative model produces ad-hoc synthetic light curves to reduce confusion during classifier training and populate underrepresented regions in the physical parameter space. Experiments conducted under various scenarios demonstrate that our self-regulated training approach outperforms traditional training methods for classifying variable stars on biased datasets, showing statistically significant improvements.

Title: In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties

Authors: Nathan Roll, Calbert Graham, Yuka Tatsumi, Kim Tien Nguyen, Meghan Sumner, Dan Jurafsky
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.14887
Pdf URL: https://arxiv.org/pdf/2505.14887
Copy Paste: [[2505.14887]] In-Context Learning Boosts Speech Recognition via Human-like Adaptation to Speakers and Language Varieties(https://arxiv.org/abs/2505.14887)
Keywords: in-context
Abstract: Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided--though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.

Title: Foundations of Unknown-aware Machine Learning

Authors: Xuefeng Du
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.14933
Pdf URL: https://arxiv.org/pdf/2505.14933
Copy Paste: [[2505.14933]] Foundations of Unknown-aware Machine Learning(https://arxiv.org/abs/2505.14933)
Keywords: foundation model
Abstract: Ensuring the reliability and safety of machine learning models in open-world deployment is a central challenge in AI safety. This thesis develops both algorithmic and theoretical foundations to address key reliability issues arising from distributional uncertainty and unknown classes, from standard neural networks to modern foundation models like large language models (LLMs). Traditional learning paradigms, such as empirical risk minimization (ERM), assume no distribution shift between training and inference, often leading to overconfident predictions on out-of-distribution (OOD) inputs. This thesis introduces novel frameworks that jointly optimize for in-distribution accuracy and reliability to unseen data. A core contribution is the development of an unknown-aware learning framework that enables models to recognize and handle novel inputs without labeled OOD data. We propose new outlier synthesis methods, VOS, NPOS, and DREAM-OOD, to generate informative unknowns during training. Building on this, we present SAL, a theoretical and algorithmic framework that leverages unlabeled in-the-wild data to enhance OOD detection under realistic deployment conditions. These methods demonstrate that abundant unlabeled data can be harnessed to recognize and adapt to unforeseen inputs, providing formal reliability guarantees. The thesis also extends reliable learning to foundation models. We develop HaloScope for hallucination detection in LLMs, MLLMGuard for defending against malicious prompts in multimodal models, and data cleaning methods to denoise human feedback used for better alignment. These tools target failure modes that threaten the safety of large-scale models in deployment. Overall, these contributions promote unknown-aware learning as a new paradigm, and we hope it can advance the reliability of AI systems with minimal human efforts.

Title: Anomaly Detection Based on Critical Paths for Deep Neural Networks

Authors: Fangzhen Zhao, Chenyi Zhang, Naipeng Dong, Ming Li, Jinxiao Shan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.14967
Pdf URL: https://arxiv.org/pdf/2505.14967
Copy Paste: [[2505.14967]] Anomaly Detection Based on Critical Paths for Deep Neural Networks(https://arxiv.org/abs/2505.14967)
Keywords: anomaly
Abstract: Deep neural networks (DNNs) are notoriously hard to understand and difficult to defend. Extracting representative paths (including the neuron activation values and the connections between neurons) from DNNs using software engineering approaches has recently shown to be a promising approach in interpreting the decision making process of blackbox DNNs, as the extracted paths are often effective in capturing essential features. With this in mind, this work investigates a novel approach that extracts critical paths from DNNs and subsequently applies the extracted paths for the anomaly detection task, based on the observation that outliers and adversarial inputs do not usually induce the same activation pattern on those paths as normal (in-distribution) inputs. In our approach, we first identify critical detection paths via genetic evolution and mutation. Since different paths in a DNN often capture different features for the same target class, we ensemble detection results from multiple paths by integrating random subspace sampling and a voting mechanism. Compared with state-of-the-art methods, our experimental results suggest that our method not only outperforms them, but it is also suitable for the detection of a broad range of anomaly types with high accuracy.

Title: Flattening Hierarchies with Policy Bootstrapping

Authors: John L. Zhou, Jonathan C. Kao
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2505.14975
Pdf URL: https://arxiv.org/pdf/2505.14975
Copy Paste: [[2505.14975]] Flattening Hierarchies with Policy Bootstrapping(https://arxiv.org/abs/2505.14975)
Keywords: self-supervised, foundation model, generative
Abstract: Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail.

Title: Meta-Design Matters: A Self-Design Multi-Agent System

Authors: Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.14996
Pdf URL: https://arxiv.org/pdf/2505.14996
Copy Paste: [[2505.14996]] Meta-Design Matters: A Self-Design Multi-Agent System(https://arxiv.org/abs/2505.14996)
Keywords: self-supervised
Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation-set for tuning and yield static MAS designs lacking adaptability during inference. We introduce SELF-MAS, the first self-supervised, inference-time only framework for automatic MAS design. SELF-MAS employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM back-bones of varying sizes, demonstrate that SELF-MAS outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-supervised design for creating effective and adaptive MAS.

Title: One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks

Authors: Quan Nguyen, Thanh Nguyen-Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15009
Pdf URL: https://arxiv.org/pdf/2505.15009
Copy Paste: [[2505.15009]] One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks(https://arxiv.org/abs/2505.15009)
Keywords: in-context
Abstract: We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, no convergence rates nor generalization abilities were known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When being trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples as well as exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.

Title: RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Authors: Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.15034
Pdf URL: https://arxiv.org/pdf/2505.15034
Copy Paste: [[2505.15034]] RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning(https://arxiv.org/abs/2505.15034)
Keywords: generative
Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: this https URL.

Title: Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Authors: Siyue Zhang, Yilun Zhao, Liyuan Geng, Arman Cohan, Anh Tuan Luu, Chen Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15045
Pdf URL: https://arxiv.org/pdf/2505.15045
Copy Paste: [[2505.15045]] Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective(https://arxiv.org/abs/2505.15045)
Keywords: diffusion
Abstract: Large language model (LLM)-based embedding models, benefiting from large scale pre-training and post-training, have begun to surpass BERT and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, We propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, 2% on instruction-following retrieval, and achieve competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.

Title: Improving the fact-checking performance of language models by relying on their entailment ability

Authors: Gaurav Kumar, Debajyoti Mazumder, Ayush Garg, Jasabanta Patro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15050
Pdf URL: https://arxiv.org/pdf/2505.15050
Copy Paste: [[2505.15050]] Improving the fact-checking performance of language models by relying on their entailment ability(https://arxiv.org/abs/2505.15050)
Keywords: generative
Abstract: Automated fact-checking is a crucial task in this digital age. To verify a claim, current approaches majorly follow one of two strategies i.e. (i) relying on embedded knowledge of language models, and (ii) fine-tuning them with evidence pieces. While the former can make systems to hallucinate, the later have not been very successful till date. The primary reason behind this is that fact verification is a complex process. Language models have to parse through multiple pieces of evidence before making a prediction. Further, the evidence pieces often contradict each other. This makes the reasoning process even more complex. We proposed a simple yet effective approach where we relied on entailment and the generative ability of language models to produce ''supporting'' and ''refuting'' justifications (for the truthfulness of a claim). We trained language models based on these justifications and achieved superior results. Apart from that, we did a systematic comparison of different prompting and fine-tuning strategies, as it is currently lacking in the literature. Some of our observations are: (i) training language models with raw evidence sentences registered an improvement up to 8.20% in macro-F1, over the best performing baseline for the RAW-FC dataset, (ii) similarly, training language models with prompted claim-evidence understanding (TBE-2) registered an improvement (with a margin up to 16.39%) over the baselines for the same dataset, (iii) training language models with entailed justifications (TBE-3) outperformed the baselines by a huge margin (up to 28.57% and 44.26% for LIAR-RAW and RAW-FC, respectively). We have shared our code repository to reproduce the results.

Title: Generalization Through Growth: Hidden Dynamics Controls Depth Dependence

Authors: Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda
Subjects: cs.LG, math.DS, stat.ML
Abstract URL: https://arxiv.org/abs/2505.15064
Pdf URL: https://arxiv.org/pdf/2505.15064
Copy Paste: [[2505.15064]] Generalization Through Growth: Hidden Dynamics Controls Depth Dependence(https://arxiv.org/abs/2505.15064)
Keywords: diffusion
Abstract: Recent theory has reduced the depth dependence of generalization bounds from exponential to polynomial and even depth-independent rates, yet these results remain tied to specific architectures and Euclidean inputs. We present a unified framework for arbitrary \blue{pseudo-metric} spaces in which a depth-$k$ network is the composition of continuous hidden maps $f:\mathcal{X}\to \mathcal{X}$ and an output map $h:\mathcal{X}\to \mathbb{R}$. The resulting bound $O(\sqrt{(\alpha + \log \beta(k))/n})$ isolates the sole depth contribution in $\beta(k)$, the word-ball growth of the semigroup generated by the hidden layers. By Gromov's theorem polynomial (resp. exponential) growth corresponds to virtually nilpotent (resp. expanding) dynamics, revealing a geometric dichotomy behind existing $O(\sqrt{k})$ (sublinear depth) and $\tilde{O}(1)$ (depth-independent) rates. We further provide covering-number estimates showing that expanding dynamics yield an exponential parameter saving via compositional expressivity. Our results decouple specification from implementation, offering architecture-agnostic and dynamical-systems-aware guarantees applicable to modern deep-learning paradigms such as test-time inference and diffusion models.

Title: Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories

Authors: Nanxu Gong, Sixun Dong, Haoyue Bai, Xinyuan Wang, Wangyang Ying, Yanjie Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15076
Pdf URL: https://arxiv.org/pdf/2505.15076
Copy Paste: [[2505.15076]] Agentic Feature Augmentation: Unifying Selection and Generation with Teaming, Planning, and Memories(https://arxiv.org/abs/2505.15076)
Keywords: in-context
Abstract: As a widely-used and practical tool, feature engineering transforms raw data into discriminative features to advance AI model performance. However, existing methods usually apply feature selection and generation separately, failing to strive a balance between reducing redundancy and adding meaningful dimensions. To fill this gap, we propose an agentic feature augmentation concept, where the unification of feature generation and selection is modeled as agentic teaming and planning. Specifically, we develop a Multi-Agent System with Long and Short-Term Memory (MAGS), comprising a selector agent to eliminate redundant features, a generator agent to produce informative new dimensions, and a router agent that strategically coordinates their actions. We leverage in-context learning with short-term memory for immediate feedback refinement and long-term memory for globally optimal guidance. Additionally, we employ offline Proximal Policy Optimization (PPO) reinforcement fine-tuning to train the router agent for effective decision-making to navigate a vast discrete feature space. Extensive experiments demonstrate that this unified agentic framework consistently achieves superior task performance by intelligently orchestrating feature selection and generation.

Title: Data Augmentation and Resolution Enhancement using GANs and Diffusion Models for Tree Segmentation

Authors: Alessandro dos Santos Ferreira, Ana Paula Marques Ramos, José Marcato Junior, Wesley Nunes Gonçalves
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15077
Pdf URL: https://arxiv.org/pdf/2505.15077
Copy Paste: [[2505.15077]] Data Augmentation and Resolution Enhancement using GANs and Diffusion Models for Tree Segmentation(https://arxiv.org/abs/2505.15077)
Keywords: diffusion
Abstract: Urban forests play a key role in enhancing environmental quality and supporting biodiversity in cities. Mapping and monitoring these green spaces are crucial for urban planning and conservation, yet accurately detecting trees is challenging due to complex landscapes and the variability in image resolution caused by different satellite sensors or UAV flight altitudes. While deep learning architectures have shown promise in addressing these challenges, their effectiveness remains strongly dependent on the availability of large and manually labeled datasets, which are often expensive and difficult to obtain in sufficient quantity. In this work, we propose a novel pipeline that integrates domain adaptation with GANs and Diffusion models to enhance the quality of low-resolution aerial images. Our proposed pipeline enhances low-resolution imagery while preserving semantic content, enabling effective tree segmentation without requiring large volumes of manually annotated data. Leveraging models such as pix2pix, Real-ESRGAN, Latent Diffusion, and Stable Diffusion, we generate realistic and structurally consistent synthetic samples that expand the training dataset and unify scale across domains. This approach not only improves the robustness of segmentation models across different acquisition conditions but also provides a scalable and replicable solution for remote sensing scenarios with scarce annotation resources. Experimental results demonstrated an improvement of over 50% in IoU for low-resolution images, highlighting the effectiveness of our method compared to traditional pipelines.

Title: Mechanistic evaluation of Transformers and state space models

Authors: Aryaman Arora, Neil Rathi, Nikil Roashan Selvam, Róbert Csórdas, Dan Jurafsky, Christopher Potts
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15105
Pdf URL: https://arxiv.org/pdf/2505.15105
Copy Paste: [[2505.15105]] Mechanistic evaluation of Transformers and state space models(https://arxiv.org/abs/2505.15105)
Keywords: in-context
Abstract: State space models (SSMs) for language modelling promise an efficient and performant alternative to quadratic-attention Transformers, yet show variable performance on recalling basic information from the context. While performance on synthetic tasks like Associative Recall (AR) can point to this deficiency, behavioural metrics provide little information as to why--on a mechanistic level--certain architectures fail and others succeed. To address this, we conduct experiments on AR and find that only Transformers and Based SSM models fully succeed at AR, with Mamba a close third, whereas the other SSMs (H3, Hyena) fail. We then use causal interventions to explain why. We find that Transformers and Based learn to store key-value associations in-context using induction heads. By contrast, the SSMs compute these associations only at the last state, with only Mamba succeeding because of its short convolution component. To extend and deepen these findings, we introduce Associative Treecall (ATR), a synthetic task similar to AR based on PCFG induction. ATR introduces language-like hierarchical structure into the AR setting. We find that all architectures learn the same mechanism as they did for AR, and the same three models succeed at the task. These results reveal that architectures with similar accuracy may still have substantive differences, motivating the adoption of mechanistic evaluations.

Title: Graph Foundation Models: A Comprehensive Survey

Authors: Zehong Wang, Zheyuan Liu, Tianyi Ma, Jiazheng Li, Zheyuan Zhang, Xingbo Fu, Yiyang Li, Zhengqing Yuan, Wei Song, Yijun Ma, Qingkai Zeng, Xiusi Chen, Jianan Zhao, Jundong Li, Meng Jiang, Pietro Lio, Nitesh Chawla, Chuxu Zhang, Yanfang Ye
Subjects: cs.LG, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2505.15116
Pdf URL: https://arxiv.org/pdf/2505.15116
Copy Paste: [[2505.15116]] Graph Foundation Models: A Comprehensive Survey(https://arxiv.org/abs/2505.15116)
Keywords: foundation model
Abstract: Graph-structured data pervades domains such as social networks, biological systems, knowledge graphs, and recommender systems. While foundation models have transformed natural language processing, vision, and multimodal learning through large-scale pretraining and generalization, extending these capabilities to graphs -- characterized by non-Euclidean structures and complex relational semantics -- poses unique challenges and opens new opportunities. To this end, Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to structured data, enabling broad transfer across graph-centric tasks and domains. This survey provides a comprehensive overview of GFMs, unifying diverse efforts under a modular framework comprising three key components: backbone architectures, pretraining strategies, and adaptation mechanisms. We categorize GFMs by their generalization scope -- universal, task-specific, and domain-specific -- and review representative methods, key innovations, and theoretical insights within each category. Beyond methodology, we examine theoretical foundations including transferability and emergent capabilities, and highlight key challenges such as structural alignment, heterogeneity, scalability, and evaluation. Positioned at the intersection of graph learning and general-purpose AI, GFMs are poised to become foundational infrastructure for open-ended reasoning over structured data. This survey consolidates current progress and outlines future directions to guide research in this rapidly evolving field. Resources are available at this https URL.

Title: Filtering Learning Histories Enhances In-Context Reinforcement Learning

Authors: Weiqin Chen, Xinjie Zhang, Dharmashankar Subramanian, Santiago Paternain
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15143
Pdf URL: https://arxiv.org/pdf/2505.15143
Copy Paste: [[2505.15143]] Filtering Learning Histories Enhances In-Context Reinforcement Learning(https://arxiv.org/abs/2505.15143)
Keywords: in-context
Abstract: Transformer models (TMs) have exhibited remarkable in-context reinforcement learning (ICRL) capabilities, allowing them to generalize to and improve in previously unseen environments without re-training or fine-tuning. This is typically accomplished by imitating the complete learning histories of a source RL algorithm over a substantial amount of pretraining environments, which, however, may transfer suboptimal behaviors inherited from the source algorithm/dataset. Therefore, in this work, we address the issue of inheriting suboptimality from the perspective of dataset preprocessing. Motivated by the success of the weighted empirical risk minimization, we propose a simple yet effective approach, learning history filtering (LHF), to enhance ICRL by reweighting and filtering the learning histories based on their improvement and stability characteristics. To the best of our knowledge, LHF is the first approach to avoid source suboptimality by dataset preprocessing, and can be combined with the current state-of-the-art (SOTA) ICRL algorithms. We substantiate the effectiveness of LHF through a series of experiments conducted on the well-known ICRL benchmarks, encompassing both discrete environments and continuous robotic manipulation tasks, with three SOTA ICRL algorithms (AD, DPT, DICP) as the backbones. LHF exhibits robust performance across a variety of suboptimal scenarios, as well as under varying hyperparameters and sampling strategies. Notably, the superior performance of LHF becomes more pronounced in the presence of noisy data, indicating the significance of filtering learning histories.

Title: From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation

Authors: Quanwei Liu, Tao Huang, Yanni Dong, Jiaqi Yang, Wei Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15147
Pdf URL: https://arxiv.org/pdf/2505.15147
Copy Paste: [[2505.15147]] From Pixels to Images: Deep Learning Advances in Remote Sensing Image Semantic Segmentation(https://arxiv.org/abs/2505.15147)
Keywords: foundation model
Abstract: Remote sensing images (RSIs) capture both natural and human-induced changes on the Earth's surface, serving as essential data for environmental monitoring, urban planning, and resource management. Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in remote sensing analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating feature extraction and improving segmentation accuracy across diverse modalities. This paper revisits the evolution of DL-based RSISS by categorizing existing approaches into four stages: the early pixel-based methods, the prevailing patch-based and tile-based techniques, and the emerging image-based strategies enabled by foundation models. We analyze these developments from the perspective of feature extraction and learning strategies, revealing the field's progression from pixel-level to tile-level and from unimodal to multimodal segmentation. Furthermore, we conduct a comprehensive evaluation of nearly 40 advanced techniques on a unified dataset to quantitatively characterize their performance and applicability. This review offers a holistic view of DL-based SS for RS, highlighting key advancements, comparative insights, and open challenges to guide future research.

Title: Time Tracker: Mixture-of-Experts-Enhanced Foundation Time Series Forecasting Model with Decoupled Training Pipelines

Authors: Xiaohou Shi, Ke Li, Aobo Liang, Yan Sun
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15151
Pdf URL: https://arxiv.org/pdf/2505.15151
Copy Paste: [[2505.15151]] Time Tracker: Mixture-of-Experts-Enhanced Foundation Time Series Forecasting Model with Decoupled Training Pipelines(https://arxiv.org/abs/2505.15151)
Keywords: foundation model
Abstract: In the past few years, time series foundation models have achieved superior predicting accuracy. However, real-world time series often exhibit significant diversity in their temporal patterns across different time spans and domains, making it challenging for a single model architecture to fit all complex scenarios. In addition, time series data may have multiple variables exhibiting complex correlations between each other. Recent mainstream works have focused on modeling times series in a channel-independent manner in both pretraining and finetuning stages, overlooking the valuable inter-series dependencies. To this end, we propose \textbf{Time Tracker} for better predictions on multivariate time series data. Firstly, we leverage sparse mixture of experts (MoE) within Transformers to handle the modeling of diverse time series patterns, thereby alleviating the learning difficulties of a single model while improving its generalization. Besides, we propose Any-variate Attention, enabling a unified model structure to seamlessly handle both univariate and multivariate time series, thereby supporting channel-independent modeling during pretraining and channel-mixed modeling for finetuning. Furthermore, we design a graph learning module that constructs relations among sequences from frequency-domain features, providing more precise guidance to capture inter-series dependencies in channel-mixed modeling. Based on these advancements, Time Tracker achieves state-of-the-art performance in predicting accuracy, model generalization and adaptability.

Title: Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation

Authors: Nanxu Gong, Zijun Li, Sixun Dong, Haoyue Bai, Wangyang Ying, Xinyuan Wang, Yanjie Fu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15152
Pdf URL: https://arxiv.org/pdf/2505.15152
Copy Paste: [[2505.15152]] Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation(https://arxiv.org/abs/2505.15152)
Keywords: diffusion, generative
Abstract: Feature Transformation (FT) crafts new features from original ones via mathematical operations to enhance dataset expressiveness for downstream models. However, existing FT methods exhibit critical limitations: discrete search struggles with enormous combinatorial spaces, impeding practical use; and continuous search, being highly sensitive to initialization and step sizes, often becomes trapped in local optima, restricting global exploration. To overcome these limitations, DIFFT redefines FT as a reward-guided generative task. It first learns a compact and expressive latent space for feature sets using a Variational Auto-Encoder (VAE). A Latent Diffusion Model (LDM) then navigates this space to generate high-quality feature embeddings, its trajectory guided by a performance evaluator towards task-specific optima. This synthesis of global distribution learning (from LDM) and targeted optimization (reward guidance) produces potent embeddings, which a novel semi-autoregressive decoder efficiently converts into structured, discrete features, preserving intra-feature dependencies while allowing parallel inter-feature generation. Extensive experiments on 14 benchmark datasets show DIFFT consistently outperforms state-of-the-art baselines in predictive accuracy and robustness, with significantly lower training and inference times.

Title: MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models

Authors: Yifan Liu, Keyu Fan, Weihao Yu, Chenxin Li, Hao Lu, Yixuan Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15185
Pdf URL: https://arxiv.org/pdf/2505.15185
Copy Paste: [[2505.15185]] MonoSplat: Generalizable 3D Gaussian Splatting from Monocular Depth Foundation Models(https://arxiv.org/abs/2505.15185)
Keywords: foundation model
Abstract: Recent advances in generalizable 3D Gaussian Splatting have demonstrated promising results in real-time high-fidelity rendering without per-scene optimization, yet existing approaches still struggle to handle unfamiliar visual content during inference on novel scenes due to limited generalizability. To address this challenge, we introduce MonoSplat, a novel framework that leverages rich visual priors from pre-trained monocular depth foundation models for robust Gaussian reconstruction. Our approach consists of two key components: a Mono-Multi Feature Adapter that transforms monocular features into multi-view representations, coupled with an Integrated Gaussian Prediction module that effectively fuses both feature types for precise Gaussian generation. Through the Adapter's lightweight attention mechanism, features are seamlessly aligned and aggregated across views while preserving valuable monocular priors, enabling the Prediction module to generate Gaussian primitives with accurate geometry and appearance. Through extensive experiments on diverse real-world datasets, we convincingly demonstrate that MonoSplat achieves superior reconstruction quality and generalization capability compared to existing methods while maintaining computational efficiency with minimal trainable parameters. Codes are available at this https URL.

Title: Leveraging Foundation Models for Multimodal Graph-Based Action Recognition

Authors: Fatemeh Ziaeetabar, Florentin Wörgötter
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15192
Pdf URL: https://arxiv.org/pdf/2505.15192
Copy Paste: [[2505.15192]] Leveraging Foundation Models for Multimodal Graph-Based Action Recognition(https://arxiv.org/abs/2505.15192)
Keywords: foundation model
Abstract: Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates a vision-language foundation, leveraging VideoMAE for dynamic visual encoding and BERT for contextual textual embedding, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the strength of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.

Title: Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection

Authors: Hyogun Lee, Haksub Kim, Ig-Jae Kim, Yonghun Choi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15205
Pdf URL: https://arxiv.org/pdf/2505.15205
Copy Paste: [[2505.15205]] Flashback: Memory-Driven Zero-shot, Real-time Video Anomaly Detection(https://arxiv.org/abs/2505.15205)
Keywords: anomaly
Abstract: Video Anomaly Detection (VAD) automatically identifies anomalous events from video, mitigating the need for human operators in large-scale surveillance deployments. However, three fundamental obstacles hinder real-world adoption: domain dependency and real-time constraints -- requiring near-instantaneous processing of incoming video. To this end, we propose Flashback, a zero-shot and real-time video anomaly detection paradigm. Inspired by the human cognitive mechanism of instantly judging anomalies and reasoning in current scenes based on past experience, Flashback operates in two stages: Recall and Respond. In the offline recall stage, an off-the-shelf LLM builds a pseudo-scene memory of both normal and anomalous captions without any reliance on real anomaly data. In the online respond stage, incoming video segments are embedded and matched against this memory via similarity search. By eliminating all LLM calls at inference time, Flashback delivers real-time VAD even on a consumer-grade GPU. On two large datasets from real-world surveillance scenarios, UCF-Crime and XD-Violence, we achieve 87.3 AUC (+7.0 pp) and 75.1 AP (+13.1 pp), respectively, outperforming prior zero-shot VAD methods by large margins.

Title: Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection

Authors: Haotian Qin, Dongliang Chang, Yueying Gao, Bingyao Yu, Lei Chen, Zhanyu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15217
Pdf URL: https://arxiv.org/pdf/2505.15217
Copy Paste: [[2505.15217]] Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2505.15217)
Keywords: generative
Abstract: Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. To address this issue, incorporating an information bottleneck network into the task presents a straightforward solution. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional bottleneck network to reduce feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model's generalization ability. We begin with a semantic analysis experiment, where we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as "bias". Therefore, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global "bias". Our model achieves exceptional generalization performance on the GenImage dataset and latest generative models. Our code is available at this https URL.

Title: Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework

Authors: Zihao Jiang, Ben Liu, Miao Peng, Wenjie Xu, Yao Xiao, Zhenyan Shan, Min Peng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15245
Pdf URL: https://arxiv.org/pdf/2505.15245
Copy Paste: [[2505.15245]] Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework(https://arxiv.org/abs/2505.15245)
Keywords: generative
Abstract: While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance while also demonstrating its effectiveness as well as strong generalization capabilities. Our dataset and code are available at this https URL.

Title: VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging

Authors: Andre Dourson, Kylie Taylor, Xiaoli Qiao, Michael Fitzke
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15248
Pdf URL: https://arxiv.org/pdf/2505.15248
Copy Paste: [[2505.15248]] VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging(https://arxiv.org/abs/2505.15248)
Keywords: self-supervised
Abstract: Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniques.

Title: Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets

Authors: Idriss Malek, Abhijit Sharma, Salem Lahlou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15251
Pdf URL: https://arxiv.org/pdf/2505.15251
Copy Paste: [[2505.15251]] Loss-Guided Auxiliary Agents for Overcoming Mode Collapse in GFlowNets(https://arxiv.org/abs/2505.15251)
Keywords: generative
Abstract: Although Generative Flow Networks (GFlowNets) are designed to capture multiple modes of a reward function, they often suffer from mode collapse in practice, getting trapped in early discovered modes and requiring prolonged training to find diverse solutions. Existing exploration techniques may rely on heuristic novelty signals. We propose Loss-Guided GFlowNets (LGGFN), a novel approach where an auxiliary GFlowNet's exploration is directly driven by the main GFlowNet's training loss. By prioritizing trajectories where the main model exhibits high loss, LGGFN focuses sampling on poorly understood regions of the state space. This targeted exploration significantly accelerates the discovery of diverse, high-reward samples. Empirically, across various benchmarks including grid environments, structured sequence generation, and Bayesian structure learning, LGGFN consistently enhances exploration efficiency and sample diversity compared to baselines. For instance, on a challenging sequence generation task, it discovered over 40 times more unique valid modes while simultaneously reducing the exploration error metric by approximately 99\%.

Title: An Efficient Private GPT Never Autoregressively Decodes

Authors: Zhengyi Li, Yue Guan, Kang Yang, Yu Feng, Ning Liu, Yu Yu, Jingwen Leng, Minyi Guo
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15252
Pdf URL: https://arxiv.org/pdf/2505.15252
Copy Paste: [[2505.15252]] An Efficient Private GPT Never Autoregressively Decodes(https://arxiv.org/abs/2505.15252)
Keywords: generative
Abstract: The wide deployment of the generative pre-trained transformer (GPT) has raised privacy concerns for both clients and servers. While cryptographic primitives can be employed for secure GPT inference to protect the privacy of both parties, they introduce considerable performance this http URL accelerate secure inference, this study proposes a public decoding and secure verification approach that utilizes public GPT models, motivated by the observation that securely decoding one and multiple tokens takes a similar latency. The client uses the public model to generate a set of tokens, which are then securely verified by the private model for acceptance. The efficiency of our approach depends on the acceptance ratio of tokens proposed by the public model, which we improve from two aspects: (1) a private sampling protocol optimized for cryptographic primitives and (2) model alignment using knowledge distillation. Our approach improves the efficiency of secure decoding while maintaining the same level of privacy and generation quality as standard secure decoding. Experiments demonstrate a $2.1\times \sim 6.0\times$ speedup compared to standard decoding across three pairs of public-private models and different network conditions.

Title: gen2seg: Generative Models Enable Generalizable Instance Segmentation

Authors: Om Khangaonkar, Hamed Pirsiavash
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15263
Pdf URL: https://arxiv.org/pdf/2505.15263
Copy Paste: [[2505.15263]] gen2seg: Generative Models Enable Generalizable Instance Segmentation(https://arxiv.org/abs/2505.15263)
Keywords: diffusion, generative
Abstract: By pretraining to synthesize coherent images from perturbed inputs, generative models inherently learn to understand object boundaries and scene compositions. How can we repurpose these generative representations for general-purpose perceptual organization? We finetune Stable Diffusion and MAE (encoder+decoder) for category-agnostic instance segmentation using our instance coloring loss exclusively on a narrow set of object types (indoor furnishings and cars). Surprisingly, our models exhibit strong zero-shot generalization, accurately segmenting objects of types and styles unseen in finetuning (and in many cases, MAE's ImageNet-1K pretraining too). Our best-performing models closely approach the heavily supervised SAM when evaluated on unseen object types and styles, and outperform it when segmenting fine structures and ambiguous boundaries. In contrast, existing promptable segmentation architectures or discriminatively pretrained models fail to generalize. This suggests that generative models learn an inherent grouping mechanism that transfers across categories and domains, even without internet-scale pretraining. Code, pretrained models, and demos are available on our website.

Title: Scaling Diffusion Transformers Efficiently via $μ$P

Authors: Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.15270
Pdf URL: https://arxiv.org/pdf/2505.15270
Copy Paste: [[2505.15270]] Scaling Diffusion Transformers Efficiently via $μ$P(https://arxiv.org/abs/2505.15270)
Keywords: diffusion, generative
Abstract: Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize standard $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring small tuning cost, only 5.5% of one training run for PixArt-$\alpha$ and 3% of consumption by human experts for MMDiT-18B. These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers.

Title: FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion

Authors: Kazuaki Mishima, Antoni Bigata Casademunt, Stavros Petridis, Maja Pantic, Kenji Suzuki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15313
Pdf URL: https://arxiv.org/pdf/2505.15313
Copy Paste: [[2505.15313]] FaceCrafter: Identity-Conditional Diffusion with Disentangled Control over Facial Pose, Expression, and Emotion(https://arxiv.org/abs/2505.15313)
Keywords: diffusion, generative
Abstract: Human facial images encode a rich spectrum of information, encompassing both stable identity-related traits and mutable attributes such as pose, expression, and emo- tion. While recent advances in image generation have enabled high-quality identity- conditional face synthesis, precise control over non-identity attributes remains challeng- ing, and disentangling identity from these mutable factors is particularly difficult. To address these limitations, we propose a novel identity-conditional diffusion model that introduces two lightweight control modules designed to independently manipulate facial pose, expression, and emotion without compromising identity preservation. These mod- ules are embedded within the cross-attention layers of the base diffusion model, enabling precise attribute control with minimal parameter overhead. Furthermore, our tailored training strategy, which leverages cross-attention between the identity feature and each non-identity control feature, encourages identity features to remain orthogonal to control signals, enhancing controllability and diversity. Quantitative and qualitative evaluations, along with perceptual user studies, demonstrate that our method surpasses existing ap- proaches in terms of control accuracy over pose, expression, and emotion, while also improving generative diversity under identity-only conditioning.

Title: Parameter-Efficient Fine-Tuning of Multispectral Foundation Models for Hyperspectral Image Classification

Authors: Bernardin Ligan, Khalide Jbilou, Fahd Kalloubi, Ahmed Ratnani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15334
Pdf URL: https://arxiv.org/pdf/2505.15334
Copy Paste: [[2505.15334]] Parameter-Efficient Fine-Tuning of Multispectral Foundation Models for Hyperspectral Image Classification(https://arxiv.org/abs/2505.15334)
Keywords: foundation model
Abstract: Foundation models have achieved great success across diverse domains, including remote sensing (RS), thanks to their versatility and strong generalization abilities. However, most RS foundation models are designed for multispectral data, while hyperspectral imagery (HSI) - with its hundreds of spectral bands - remains less explored. Fine-tuning such models for downstream tasks is also challenging, often demanding considerable memory and storage. In this paper, we propose an efficient framework to fine-tune SpectralGPT, a multispectral foundation model, for hyperspectral image classification (HSIC). We explore several Parameter-Efficient Fine-Tuning (PEFT) methods, including Low-Rank Adaptation (LoRA), Kronecker-based adaptation (KronA), Low-Rank Kronecker (LoKr), and the recent LoRA+, which uses distinct learning rates for low-rank adapters scaled by a factor lambda. Inspired by LoRA+, we introduce KronA+, which applies a similar mechanism to the Kronecker matrices. We evaluate our approach on five datasets from different sensors, showing competitive performance with state-of-the-art HSI models. Our full fine-tuning (FFT) setup for SpectralGPT even outperforms a dedicated hyperspectral foundation model on some datasets while requiring only a quarter of the training epochs. Under the same number of epochs, KronA+ reaches similar performance with far fewer trainable parameters - just 0.056 percent - and adds only approximately 0.2 megabytes of storage, making it the most effective PEFT method tested.

Title: My Face Is Mine, Not Yours: Facial Protection Against Diffusion Model Face Swapping

Authors: Hon Ming Yam, Zhongliang Guo, Chun Pong Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15336
Pdf URL: https://arxiv.org/pdf/2505.15336
Copy Paste: [[2505.15336]] My Face Is Mine, Not Yours: Facial Protection Against Diffusion Model Face Swapping(https://arxiv.org/abs/2505.15336)
Keywords: diffusion, generative
Abstract: The proliferation of diffusion-based deepfake technologies poses significant risks for unauthorized and unethical facial image manipulation. While traditional countermeasures have primarily focused on passive detection methods, this paper introduces a novel proactive defense strategy through adversarial attacks that preemptively protect facial images from being exploited by diffusion-based deepfake systems. Existing adversarial protection methods predominantly target conventional generative architectures (GANs, AEs, VAEs) and fail to address the unique challenges presented by diffusion models, which have become the predominant framework for high-quality facial deepfakes. Current diffusion-specific adversarial approaches are limited by their reliance on specific model architectures and weights, rendering them ineffective against the diverse landscape of diffusion-based deepfake implementations. Additionally, they typically employ global perturbation strategies that inadequately address the region-specific nature of facial manipulation in deepfakes.

Title: Revealing Language Model Trajectories via Kullback-Leibler Divergence

Authors: Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15353
Pdf URL: https://arxiv.org/pdf/2505.15353
Copy Paste: [[2505.15353]] Revealing Language Model Trajectories via Kullback-Leibler Divergence(https://arxiv.org/abs/2505.15353)
Keywords: diffusion
Abstract: A recently proposed method enables efficient estimation of the KL divergence between language models, including models with different architectures, by assigning coordinates based on log-likelihood vectors. To better understand the behavior of this metric, we systematically evaluate KL divergence across a wide range of conditions using publicly available language models. Our analysis covers comparisons between pretraining checkpoints, fine-tuned and base models, and layers via the logit lens. We find that trajectories of language models, as measured by KL divergence, exhibit a spiral structure during pretraining and thread-like progressions across layers. Furthermore, we show that, in terms of diffusion exponents, model trajectories in the log-likelihood space are more constrained than those in weight space.

Title: Federated Learning-Enhanced Blockchain Framework for Privacy-Preserving Intrusion Detection in Industrial IoT

Authors: Anas Ali, Mubashar Husain, Peter Hans
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.15376
Pdf URL: https://arxiv.org/pdf/2505.15376
Copy Paste: [[2505.15376]] Federated Learning-Enhanced Blockchain Framework for Privacy-Preserving Intrusion Detection in Industrial IoT(https://arxiv.org/abs/2505.15376)
Keywords: anomaly
Abstract: Industrial Internet of Things (IIoT) systems have become integral to smart manufacturing, yet their growing connectivity has also exposed them to significant cybersecurity threats. Traditional intrusion detection systems (IDS) often rely on centralized architectures that raise concerns over data privacy, latency, and single points of failure. In this work, we propose a novel Federated Learning-Enhanced Blockchain Framework (FL-BCID) for privacy-preserving intrusion detection tailored for IIoT environments. Our architecture combines federated learning (FL) to ensure decentralized model training with blockchain technology to guarantee data integrity, trust, and tamper resistance across IIoT nodes. We design a lightweight intrusion detection model collaboratively trained using FL across edge devices without exposing sensitive data. A smart contract-enabled blockchain system records model updates and anomaly scores to establish accountability. Experimental evaluations using the ToN-IoT and N-BaIoT datasets demonstrate the superior performance of our framework, achieving 97.3% accuracy while reducing communication overhead by 41% compared to baseline centralized methods. Our approach ensures privacy, scalability, and robustness-critical for secure industrial operations. The proposed FL-BCID system provides a promising solution for enhancing trust and privacy in modern IIoT security architectures.

Title: Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions

Authors: Zhiwen Li, Die Chen, Mingyuan Fan, Cen Chen, Yaliang Li, Yanhao Wang, Wenmeng Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15427
Pdf URL: https://arxiv.org/pdf/2505.15427
Copy Paste: [[2505.15427]] Responsible Diffusion Models via Constraining Text Embeddings within Safe Regions(https://arxiv.org/abs/2505.15427)
Keywords: diffusion
Abstract: The remarkable ability of diffusion models to generate high-fidelity images has led to their widespread adoption. However, concerns have also arisen regarding their potential to produce Not Safe for Work (NSFW) content and exhibit social biases, hindering their practical use in real-world applications. In response to this challenge, prior work has focused on employing security filters to identify and exclude toxic text, or alternatively, fine-tuning pre-trained diffusion models to erase sensitive concepts. Unfortunately, existing methods struggle to achieve satisfactory performance in the sense that they can have a significant impact on the normal model output while still failing to prevent the generation of harmful content in some cases. In this paper, we propose a novel self-discovery approach to identifying a semantic direction vector in the embedding space to restrict text embedding within a safe region. Our method circumvents the need for correcting individual words within the input text and steers the entire text prompt towards a safe region in the embedding space, thereby enhancing model robustness against all possibly unsafe prompts. In addition, we employ Low-Rank Adaptation (LoRA) for semantic direction vector initialization to reduce the impact on the model performance for other semantics. Furthermore, our method can also be integrated with existing methods to improve their social responsibility. Extensive experiments on benchmark datasets demonstrate that our method can effectively reduce NSFW content and mitigate social bias generated by diffusion models compared to several state-of-the-art baselines.

Title: Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation

Authors: Jianyuan Guo, Peike Li, Trevor Cohn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15438
Pdf URL: https://arxiv.org/pdf/2505.15438
Copy Paste: [[2505.15438]] Bridging Sign and Spoken Languages: Pseudo Gloss Generation for Sign Language Translation(https://arxiv.org/abs/2505.15438)
Keywords: in-context
Abstract: Sign Language Translation (SLT) aims to map sign language videos to spoken language text. A common approach relies on gloss annotations as an intermediate representation, decomposing SLT into two sub-tasks: video-to-gloss recognition and gloss-to-text translation. While effective, this paradigm depends on expert-annotated gloss labels, which are costly and rarely available in existing datasets, limiting its scalability. To address this challenge, we propose a gloss-free pseudo gloss generation framework that eliminates the need for human-annotated glosses while preserving the structured intermediate representation. Specifically, we prompt a Large Language Model (LLM) with a few example text-gloss pairs using in-context learning to produce draft sign glosses from spoken language text. To enhance the correspondence between LLM-generated pseudo glosses and the sign sequences in video, we correct the ordering in the pseudo glosses for better alignment via a weakly supervised learning process. This reordering facilitates the incorporation of auxiliary alignment objectives, and allows for the use of efficient supervision via a Connectionist Temporal Classification (CTC) loss. We train our SLT mode, which consists of a vision encoder and a translator, through a three-stage pipeline, which progressively narrows the modality gap between sign language and spoken language. Despite its simplicity, our approach outperforms previous state-of-the-art gloss-free frameworks on two SLT benchmarks and achieves competitive results compared to gloss-based methods.

Title: Stronger ViTs With Octic Equivariance

Authors: David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15441
Pdf URL: https://arxiv.org/pdf/2505.15441
Copy Paste: [[2505.15441]] Stronger ViTs With Octic Equivariance(https://arxiv.org/abs/2505.15441)
Keywords: self-supervised
Abstract: Recent efforts at scaling computer vision models have established Vision Transformers (ViTs) as the leading architecture. ViTs incorporate weight sharing over image patches as an important inductive bias. In this work, we show that ViTs benefit from incorporating equivariance under the octic group, i.e., reflections and 90-degree rotations, as a further inductive bias. We develop new architectures, octic ViTs, that use octic-equivariant layers and put them to the test on both supervised and self-supervised learning. Through extensive experiments on DeiT-III and DINOv2 training on ImageNet-1K, we show that octic ViTs yield more computationally efficient networks while also improving performance. In particular, we achieve approximately 40% reduction in FLOPs for ViT-H while simultaneously improving both classification and segmentation results.

Title: Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models

Authors: Die Chen, Zhiwen Li, Cen Chen, Yuexiang Xie, Xiaodan Li, Jinyan Ye, Yingda Chen, Yaliang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15450
Pdf URL: https://arxiv.org/pdf/2505.15450
Copy Paste: [[2505.15450]] Comprehensive Evaluation and Analysis for NSFW Concept Erasure in Text-to-Image Diffusion Models(https://arxiv.org/abs/2505.15450)
Keywords: diffusion
Abstract: Text-to-image diffusion models have gained widespread application across various domains, demonstrating remarkable creative potential. However, the strong generalization capabilities of diffusion models can inadvertently lead to the generation of not-safe-for-work (NSFW) content, posing significant risks to their safe deployment. While several concept erasure methods have been proposed to mitigate the issue associated with NSFW content, a comprehensive evaluation of their effectiveness across various scenarios remains absent. To bridge this gap, we introduce a full-pipeline toolkit specifically designed for concept erasure and conduct the first systematic study of NSFW concept erasure methods. By examining the interplay between the underlying mechanisms and empirical observations, we provide in-depth insights and practical guidance for the effective application of concept erasure methods in various real-world scenarios, with the aim of advancing the understanding of content safety in diffusion models and establishing a solid foundation for future research and development in this critical area.

Title: Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts

Authors: Debarshi Brahma, Anuska Roy, Soma Biswas
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15506
Pdf URL: https://arxiv.org/pdf/2505.15506
Copy Paste: [[2505.15506]] Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts(https://arxiv.org/abs/2505.15506)
Keywords: foundation model
Abstract: Recently, Vision-Language foundation models like CLIP and ALIGN, which are pre-trained on large-scale data have shown remarkable zero-shot generalization to diverse datasets with different classes and even domains. In this work, we take a step further and analyze whether these models can be adapted to target datasets having very different distributions and classes compared to what these models have been trained on, using only a few labeled examples from the target dataset. In such scenarios, finetuning large pretrained models is challenging due to problems of overfitting as well as loss of generalization, and has not been well explored in prior literature. Since, the pre-training data of such models are unavailable, it is difficult to comprehend the performance on various downstream datasets. First, we try to answer the question: Given a target dataset with a few labelled examples, can we estimate whether further fine-tuning can enhance the performance compared to zero-shot evaluation? by analyzing the common vision-language embedding space. Based on the analysis, we propose a novel prompt-tuning method, PromptMargin for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes the text as well as visual prompts for this task, and has two main modules: 1) Firstly, we use a selective augmentation strategy to complement the few training samples in each task; 2) Additionally, to ensure robust training in the presence of unfamiliar class names, we increase the inter-class margin for improved class discrimination using a novel Multimodal Margin Regularizer. Extensive experiments and analysis across fifteen target benchmark datasets, with varying degrees of distribution shifts from natural images, shows the effectiveness of the proposed framework over the existing state-of-the-art approaches applied to this setting. this http URL.

Title: NOMAD Projection

Authors: Brandon Duderstadt, Zach Nussbaum, Laurens van der Maaten
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15511
Pdf URL: https://arxiv.org/pdf/2505.15511
Copy Paste: [[2505.15511]] NOMAD Projection(https://arxiv.org/abs/2505.15511)
Keywords: generative
Abstract: The rapid adoption of generative AI has driven an explosion in the size of datasets consumed and produced by AI models. Traditional methods for unstructured data visualization, such as t-SNE and UMAP, have not kept up with the pace of dataset scaling. This presents a significant challenge for AI explainability, which relies on methods such as t-SNE and UMAP for exploratory data analysis. In this paper, we introduce Negative Or Mean Affinity Discrimination (NOMAD) Projection, the first method for unstructured data visualization via nonlinear dimensionality reduction that can run on multiple GPUs at train time. We provide theory that situates NOMAD Projection as an approximate upper bound on the InfoNC-t-SNE loss, and empirical results that demonstrate NOMAD Projection's superior performance and speed profile compared to existing state-of-the-art methods. We demonstrate the scalability of NOMAD Projection by computing the first complete data map of Multilingual Wikipedia.

Title: Detection of Underwater Multi-Targets Based on Self-Supervised Learning and Deformable Path Aggregation Feature Pyramid Network

Authors: Chang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15518
Pdf URL: https://arxiv.org/pdf/2505.15518
Copy Paste: [[2505.15518]] Detection of Underwater Multi-Targets Based on Self-Supervised Learning and Deformable Path Aggregation Feature Pyramid Network(https://arxiv.org/abs/2505.15518)
Keywords: self-supervised
Abstract: To overcome the constraints of the underwater environment and improve the accuracy and robustness of underwater target detection models, this paper develops a specialized dataset for underwater target detection and proposes an efficient algorithm for underwater multi-target detection. A self-supervised learning based on the SimSiam structure is employed for the pre-training of underwater target detection network. To address the problems of low detection accuracy caused by low contrast, mutual occlusion and dense distribution of underwater targets in underwater object detection, a detection model suitable for underwater target detection is proposed by introducing deformable convolution and dilated convolution. The proposed detection model can obtain more effective information by increasing the receptive field. In addition, the regression loss function EIoU is introduced, which improves model performance by separately calculating the width and height losses of the predicted box. Experiment results show that the accuracy of the underwater target detection has been improved by the proposed detector.

Title: PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting

Authors: Zane K J Hartley, Lewis A G Stuart, Andrew P French, Michael P Pound
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2505.15528
Pdf URL: https://arxiv.org/pdf/2505.15528
Copy Paste: [[2505.15528]] PlantDreamer: Achieving Realistic 3D Plant Models with Diffusion-Guided Gaussian Splatting(https://arxiv.org/abs/2505.15528)
Keywords: diffusion, generative
Abstract: Recent years have seen substantial improvements in the ability to generate synthetic 3D objects using AI. However, generating complex 3D objects, such as plants, remains a considerable challenge. Current generative 3D models struggle with plant generation compared to general objects, limiting their usability in plant analysis tools, which require fine detail and accurate geometry. We introduce PlantDreamer, a novel approach to 3D synthetic plant generation, which can achieve greater levels of realism for complex plant geometry and textures than available text-to-3D models. To achieve this, our new generation pipeline leverages a depth ControlNet, fine-tuned Low-Rank Adaptation and an adaptable Gaussian culling algorithm, which directly improve textural realism and geometric integrity of generated 3D plant models. Additionally, PlantDreamer enables both purely synthetic plant generation, by leveraging L-System-generated meshes, and the enhancement of real-world plant point clouds by converting them into 3D Gaussian Splats. We evaluate our approach by comparing its outputs with state-of-the-art text-to-3D models, demonstrating that PlantDreamer outperforms existing methods in producing high-fidelity synthetic plants. Our results indicate that our approach not only advances synthetic plant generation, but also facilitates the upgrading of legacy point cloud datasets, making it a valuable tool for 3D phenotyping applications.

Title: Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback

Authors: Wangyang Ying, Haoyue Bai, Nanxu Gong, Xinyuan Wang, Sixun Dong, Haifeng Chen, Yanjie Fu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15572
Pdf URL: https://arxiv.org/pdf/2505.15572
Copy Paste: [[2505.15572]] Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback(https://arxiv.org/abs/2505.15572)
Keywords: foundation model
Abstract: The data-to-equation (Data2Eqn) task aims to discover interpretable mathematical equations that map observed values to labels, offering physical insights and broad applicability across academic and industrial domains. Genetic programming and traditional deep learning-based approaches suffer from search inefficiency and poor generalization on small task-specific datasets. Foundation models showed promise in this area, but existing approaches suffer from: 1) They are pretrained on general-purpose data distributions, making them less effective for domain-specific tasks; and 2) their training objectives focus on token-level alignment, overlooking mathematical semantics, which can lead to inaccurate equations. To address these issues, we aim to enhance the domain adaptability of foundation models for Data2Eqn tasks. In this work, we propose a reinforcement learning-based finetuning framework that directly optimizes the generation policy of a pretrained model through reward signals derived from downstream numerical fitness. Our method allows the model to adapt to specific and complex data distributions and generate mathematically meaningful equations. Extensive experiments demonstrate that our approach improves both the accuracy and robustness of equation generation under complex distributions.

Title: Beyond Classification: Evaluating Diffusion Denoised Smoothing for Security-Utility Trade off

Authors: Yury Belousov, Brian Pulfer, Vitaliy Kinakh, Slava Voloshynovskiy
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.15594
Pdf URL: https://arxiv.org/pdf/2505.15594
Copy Paste: [[2505.15594]] Beyond Classification: Evaluating Diffusion Denoised Smoothing for Security-Utility Trade off(https://arxiv.org/abs/2505.15594)
Keywords: diffusion, foundation model
Abstract: While foundation models demonstrate impressive performance across various tasks, they remain vulnerable to adversarial inputs. Current research explores various approaches to enhance model robustness, with Diffusion Denoised Smoothing emerging as a particularly promising technique. This method employs a pretrained diffusion model to preprocess inputs before model inference. Yet, its effectiveness remains largely unexplored beyond classification. We aim to address this gap by analyzing three datasets with four distinct downstream tasks under three different adversarial attack algorithms. Our findings reveal that while foundation models maintain resilience against conventional transformations, applying high-noise diffusion denoising to clean images without any distortions significantly degrades performance by as high as 57%. Low-noise diffusion settings preserve performance but fail to provide adequate protection across all attack types. Moreover, we introduce a novel attack strategy specifically targeting the diffusion process itself, capable of circumventing defenses in the low-noise regime. Our results suggest that the trade-off between adversarial robustness and performance remains a challenge to be addressed.

Title: FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models

Authors: Zhen Sun, Ziyi Zhang, Zeren Luo, Zeyang Sha, Tianshuo Cong, Zheng Li, Shiwen Cui, Weiqiang Wang, Jiaheng Wei, Xinlei He, Qi Li, Qian Wang
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2505.15644
Pdf URL: https://arxiv.org/pdf/2505.15644
Copy Paste: [[2505.15644]] FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models(https://arxiv.org/abs/2505.15644)
Keywords: diffusion
Abstract: Fine-grained edited image detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) Binary classifiers yield only a global real-or-fake label without providing localization; (2) Traditional computer vision methods often rely on costly pixel-level annotations; and (3) No large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.

Title: The Devil is in Fine-tuning and Long-tailed Problems:A New Benchmark for Scene Text Detection

Authors: Tianjiao Cao, Jiahao Lyu, Weichao Zeng, Weimin Mu, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15649
Pdf URL: https://arxiv.org/pdf/2505.15649
Copy Paste: [[2505.15649]] The Devil is in Fine-tuning and Long-tailed Problems:A New Benchmark for Scene Text Detection(https://arxiv.org/abs/2505.15649)
Keywords: self-supervised
Abstract: Scene text detection has seen the emergence of high-performing methods that excel on academic benchmarks. However, these detectors often fail to replicate such success in real-world scenarios. We uncover two key factors contributing to this discrepancy through extensive experiments. First, a \textit{Fine-tuning Gap}, where models leverage \textit{Dataset-Specific Optimization} (DSO) paradigm for one domain at the cost of reduced effectiveness in others, leads to inflated performances on academic benchmarks. Second, the suboptimal performance in practical settings is primarily attributed to the long-tailed distribution of texts, where detectors struggle with rare and complex categories as artistic or overlapped text. Given that the DSO paradigm might undermine the generalization ability of models, we advocate for a \textit{Joint-Dataset Learning} (JDL) protocol to alleviate the Fine-tuning Gap. Additionally, an error analysis is conducted to identify three major categories and 13 subcategories of challenges in long-tailed scene text, upon which we propose a Long-Tailed Benchmark (LTB). LTB facilitates a comprehensive evaluation of ability to handle a diverse range of long-tailed challenges. We further introduce MAEDet, a self-supervised learning-based method, as a strong baseline for LTB. The code is available at this https URL.

Title: Graph Conditional Flow Matching for Relational Data Generation

Authors: Davide Scassola, Sebastiano Saccani, Luca Bortolussi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15668
Pdf URL: https://arxiv.org/pdf/2505.15668
Copy Paste: [[2505.15668]] Graph Conditional Flow Matching for Relational Data Generation(https://arxiv.org/abs/2505.15668)
Keywords: generative
Abstract: Data synthesis is gaining momentum as a privacy-enhancing technology. While single-table tabular data generation has seen considerable progress, current methods for multi-table data often lack the flexibility and expressiveness needed to capture complex relational structures. In particular, they struggle with long-range dependencies and complex foreign-key relationships, such as tables with multiple parent tables or multiple types of links between the same pair of tables. We propose a generative model for relational data that generates the content of a relational dataset given the graph formed by the foreign-key relationships. We do this by learning a deep generative model of the content of the whole relational database by flow matching, where the neural network trained to denoise records leverages a graph neural network to obtain information from connected records. Our method is flexible, as it can support relational datasets with complex structures, and expressive, as the generation of each record can be influenced by any other record within the same connected component. We evaluate our method on several benchmark datasets and show that it achieves state-of-the-art performance in terms of synthetic data fidelity.

Title: UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models

Authors: Miao Yu, Liang Lin, Guibin Zhang, Xinfeng Li, Junfeng Fang, Ningyu Zhang, Kun Wang, Yang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15674
Pdf URL: https://arxiv.org/pdf/2505.15674
Copy Paste: [[2505.15674]] UniErase: Unlearning Token as a Universal Erasure Primitive for Language Models(https://arxiv.org/abs/2505.15674)
Keywords: in-context
Abstract: Large language models require iterative updates to address challenges such as knowledge conflicts and outdated information (e.g., incorrect, private, or illegal contents). Machine unlearning provides a systematic methodology for targeted knowledge removal from trained models, enabling elimination of sensitive information influences. However, mainstream fine-tuning-based unlearning methods often fail to balance unlearning efficacy and model ability, frequently resulting in catastrophic model collapse under extensive knowledge removal. Meanwhile, in-context unlearning, which relies solely on contextual prompting without modifying the model's intrinsic mechanisms, suffers from limited generalizability and struggles to achieve true unlearning. In this work, we introduce UniErase, a novel unlearning paradigm that employs learnable parametric suffix (unlearning token) to steer language models toward targeted forgetting behaviors. UniErase operates through two key phases: (I) an optimization stage that binds desired unlearning outputs to the model's autoregressive probability distribution via token optimization, followed by (II) a lightweight model editing phase that activates the learned token to probabilistically induce specified forgetting objective. Serving as a new research direction for token learning to induce unlearning target, UniErase achieves state-of-the-art (SOTA) performance across batch, sequential, and precise unlearning under fictitious and real-world knowledge settings. Remarkably, in terms of TOFU benchmark, UniErase, modifying only around 3.66% of the LLM parameters, outperforms previous forgetting SOTA baseline by around 4.01 times for model ability with even better unlearning efficacy. Similarly, UniErase, maintaining more ability, also surpasses previous retaining SOTA by 35.96% for unlearning efficacy, showing dual top-tier performances in current unlearing domain.

Title: Constructing a 3D Town from a Single Image

Authors: Kaizhi Zheng, Ruijian Zhang, Jing Gu, Jie Yang, Xin Eric Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15765
Pdf URL: https://arxiv.org/pdf/2505.15765
Copy Paste: [[2505.15765]] Constructing a 3D Town from a Single Image(https://arxiv.org/abs/2505.15765)
Keywords: generative
Abstract: Acquiring detailed 3D scenes typically demands costly equipment, multi-view data, or labor-intensive modeling. Therefore, a lightweight alternative, generating complex 3D scenes from a single top-down image, plays an essential role in real-world applications. While recent 3D generative models have achieved remarkable results at the object level, their extension to full-scene generation often leads to inconsistent geometry, layout hallucinations, and low-quality meshes. In this work, we introduce 3DTown, a training-free framework designed to synthesize realistic and coherent 3D scenes from a single top-down view. Our method is grounded in two principles: region-based generation to improve image-to-3D alignment and resolution, and spatial-aware 3D inpainting to ensure global scene coherence and high-quality geometry generation. Specifically, we decompose the input image into overlapping regions and generate each using a pretrained 3D object generator, followed by a masked rectified flow inpainting process that fills in missing geometry while maintaining structural continuity. This modular design allows us to overcome resolution bottlenecks and preserve spatial structure without requiring 3D supervision or fine-tuning. Extensive experiments across diverse scenes show that 3DTown outperforms state-of-the-art baselines, including Trellis, Hunyuan3D-2, and TripoSG, in terms of geometry quality, spatial coherence, and texture fidelity. Our results demonstrate that high-quality 3D town generation is achievable from a single image using a principled, training-free approach.

Title: dKV-Cache: The Cache for Diffusion Language Models

Authors: Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15781
Pdf URL: https://arxiv.org/pdf/2505.15781
Copy Paste: [[2505.15781]] dKV-Cache: The Cache for Diffusion Language Models(https://arxiv.org/abs/2505.15781)
Keywords: diffusion
Abstract: Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. dKV-Cache, in final, achieves from 2-10x speedup in inference, largely narrowing the gap between ARs and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that cache can also be used in DLMs, even in a training-free manner from current DLMs.

Title: Large Language Models as Computable Approximations to Solomonoff Induction

Authors: Jun Wan, Lingrui Mei
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2505.15784
Pdf URL: https://arxiv.org/pdf/2505.15784
Copy Paste: [[2505.15784]] Large Language Models as Computable Approximations to Solomonoff Induction(https://arxiv.org/abs/2505.15784)
Keywords: in-context
Abstract: The rapid advancement of large language models (LLMs) calls for a rigorous theoretical framework to explain their empirical success. While significant progress has been made in understanding LLM behaviors, existing theoretical frameworks remain fragmented in explaining emergent phenomena through a unified mathematical lens. We establish the first formal connection between LLM architectures and Algorithmic Information Theory (AIT) by proving two fundamental results: (1) the training process computationally approximates Solomonoff prior through loss minimization interpreted as program length optimization, and (2) next-token prediction implements approximate Solomonoff induction. We leverage AIT to provide a unified theoretical explanation for in-context learning, few-shot learning, and scaling laws. Furthermore, our theoretical insights lead to a principled method for few-shot example selection that prioritizes samples where models exhibit lower predictive confidence. We demonstrate through experiments on diverse text classification benchmarks that this strategy yields significant performance improvements, particularly for smaller model architectures, when compared to selecting high-confidence examples. Our framework bridges the gap between theoretical foundations and practical LLM behaviors, providing both explanatory power and actionable insights for future model development.

Title: VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL

Authors: Fengyuan Dai, Zifeng Zhuang, Yufei Huang, Siteng Huang, Bangyan Liao, Donglin Wang, Fajie Yuan
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15791
Pdf URL: https://arxiv.org/pdf/2505.15791
Copy Paste: [[2505.15791]] VARD: Efficient and Dense Fine-Tuning for Diffusion Models with Value-based RL(https://arxiv.org/abs/2505.15791)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as powerful generative tools across various domains, yet tailoring pre-trained models to exhibit specific desirable properties remains challenging. While reinforcement learning (RL) offers a promising solution,current methods struggle to simultaneously achieve stable, efficient fine-tuning and support non-differentiable rewards. Furthermore, their reliance on sparse rewards provides inadequate supervision during intermediate steps, often resulting in suboptimal generation quality. To address these limitations, dense and differentiable signals are required throughout the diffusion process. Hence, we propose VAlue-based Reinforced Diffusion (VARD): a novel approach that first learns a value function predicting expection of rewards from intermediate states, and subsequently uses this value function with KL regularization to provide dense supervision throughout the generation process. Our method maintains proximity to the pretrained model while enabling effective and stable training via backpropagation. Experimental results demonstrate that our approach facilitates better trajectory guidance, improves training efficiency and extends the applicability of RL to diffusion models optimized for complex, non-differentiable reward functions.

Title: Interspatial Attention for Efficient 4D Human Video Generation

Authors: Ruizhi Shao, Yinghao Xu, Yujun Shen, Ceyuan Yang, Yang Zheng, Changan Chen, Yebin Liu, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15800
Pdf URL: https://arxiv.org/pdf/2505.15800
Copy Paste: [[2505.15800]] Interspatial Attention for Efficient 4D Human Video Generation(https://arxiv.org/abs/2505.15800)
Keywords: diffusion
Abstract: Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)--based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variation autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis, demonstrating remarkable motion consistency and identity preservation while providing precise control of the camera and body poses. Our code and model are publicly released at this https URL.

Title: The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation

Authors: Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15807
Pdf URL: https://arxiv.org/pdf/2505.15807
Copy Paste: [[2505.15807]] The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation(https://arxiv.org/abs/2505.15807)
Keywords: in-context
Abstract: Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.

Title: Neural Conditional Transport Maps

Authors: Carlos Rodriguez-Pardo, Leonardo Chiani, Emanuele Borgonovo, Massimo Tavoni
Subjects: cs.LG, cs.AI, math.PR, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2505.15808
Pdf URL: https://arxiv.org/pdf/2505.15808
Copy Paste: [[2505.15808]] Neural Conditional Transport Maps(https://arxiv.org/abs/2505.15808)
Keywords: generative
Abstract: We present a neural framework for learning conditional optimal transport (OT) maps between probability distributions. Our approach introduces a conditioning mechanism capable of processing both categorical and continuous conditioning variables simultaneously. At the core of our method lies a hypernetwork that generates transport layer parameters based on these inputs, creating adaptive mappings that outperform simpler conditioning methods. Comprehensive ablation studies demonstrate the superior performance of our method over baseline configurations. Furthermore, we showcase an application to global sensitivity analysis, offering high performance in computing OT-based sensitivity indices. This work advances the state-of-the-art in conditional optimal transport, enabling broader application of optimal transport principles to complex, high-dimensional domains such as generative modeling and black-box model explainability.

Title: MMaDA: Multimodal Large Diffusion Language Models

Authors: Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15809
Pdf URL: https://arxiv.org/pdf/2505.15809
Copy Paste: [[2505.15809]] MMaDA: Multimodal Large Diffusion Language Models(https://arxiv.org/abs/2505.15809)
Keywords: diffusion, foundation model
Abstract: We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: this https URL

Title: On the creation of narrow AI: hierarchy and nonlocality of neural network skills

Authors: Eric J. Michaud, Asher Parker-Sartori, Max Tegmark
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.15811
Pdf URL: https://arxiv.org/pdf/2505.15811
Copy Paste: [[2505.15811]] On the creation of narrow AI: hierarchy and nonlocality of neural network skills(https://arxiv.org/abs/2505.15811)
Keywords: foundation model
Abstract: We study the problem of creating strong, yet narrow, AI systems. While recent AI progress has been driven by the training of large general-purpose foundation models, the creation of smaller models specialized for narrow domains could be valuable for both efficiency and safety. In this work, we explore two challenges involved in creating such systems, having to do with basic properties of how neural networks learn and structure their representations. The first challenge regards when it is possible to train narrow models from scratch. Through experiments on a synthetic task, we find that it is sometimes necessary to train networks on a wide distribution of data to learn certain narrow skills within that distribution. This effect arises when skills depend on each other hierarchically, and training on a broad distribution introduces a curriculum which substantially accelerates learning. The second challenge regards how to transfer particular skills from large general models into small specialized models. We find that model skills are often not perfectly localized to a particular set of prunable components. However, we find that methods based on pruning can still outperform distillation. We investigate the use of a regularization objective to align desired skills with prunable components while unlearning unnecessary skills.

Title: Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization

Authors: Satoshi Kosugi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.15812
Pdf URL: https://arxiv.org/pdf/2505.15812
Copy Paste: [[2505.15812]] Leveraging the Powerful Attention of a Pre-trained Diffusion Model for Exemplar-based Image Colorization(https://arxiv.org/abs/2505.15812)
Keywords: diffusion
Abstract: Exemplar-based image colorization aims to colorize a grayscale image using a reference color image, ensuring that reference colors are applied to corresponding input regions based on their semantic similarity. To achieve accurate semantic matching between regions, we leverage the self-attention module of a pre-trained diffusion model, which is trained on a large dataset and exhibits powerful attention capabilities. To harness this power, we propose a novel, fine-tuning-free approach based on a pre-trained diffusion model, making two key contributions. First, we introduce dual attention-guided color transfer. We utilize the self-attention module to compute an attention map between the input and reference images, effectively capturing semantic correspondences. The color features from the reference image is then transferred to the semantically matching regions of the input image, guided by this attention map, and finally, the grayscale features are replaced with the corresponding color features. Notably, we utilize dual attention to calculate attention maps separately for the grayscale and color images, achieving more precise semantic alignment. Second, we propose classifier-free colorization guidance, which enhances the transferred colors by combining color-transferred and non-color-transferred outputs. This process improves the quality of colorization. Our experimental results demonstrate that our method outperforms existing techniques in terms of image quality and fidelity to the reference. Specifically, we use 335 input-reference pairs from previous research, achieving an FID of 95.27 (image quality) and an SI-FID of 5.51 (fidelity to the reference). Our source code is available at this https URL.

Title: Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex

Authors: Muquan Yu, Mu Nan, Hossein Adeli, Jacob S. Prince, John A. Pyles, Leila Wehbe, Margaret M. Henderson, Michael J. Tarr, Andrew F. Luo
Subjects: cs.LG, q-bio.NC
Abstract URL: https://arxiv.org/abs/2505.15813
Pdf URL: https://arxiv.org/pdf/2505.15813
Copy Paste: [[2505.15813]] Meta-Learning an In-Context Transformer Model of Human Higher Visual Cortex(https://arxiv.org/abs/2505.15813)
Keywords: in-context
Abstract: Understanding functional representations within higher visual cortex is a fundamental question in computational neuroscience. While artificial neural networks pretrained on large-scale datasets exhibit striking representational alignment with human neural responses, learning image-computable models of visual cortex relies on individual-level, large-scale fMRI datasets. The necessity for expensive, time-intensive, and often impractical data acquisition limits the generalizability of encoders to new subjects and stimuli. BraInCoRL uses in-context learning to predict voxelwise neural responses from few-shot examples without any additional finetuning for novel subjects and stimuli. We leverage a transformer architecture that can flexibly condition on a variable number of in-context image stimuli, learning an inductive bias over multiple subjects. During training, we explicitly optimize the model for in-context learning. By jointly conditioning on image features and voxel activations, our model learns to directly generate better performing voxelwise models of higher visual cortex. We demonstrate that BraInCoRL consistently outperforms existing voxelwise encoder designs in a low-data regime when evaluated on entirely novel images, while also exhibiting strong test-time scaling behavior. The model also generalizes to an entirely new visual fMRI dataset, which uses different subjects and fMRI data acquisition parameters. Further, BraInCoRL facilitates better interpretability of neural signals in higher visual cortex by attending to semantically relevant stimuli. Finally, we show that our framework enables interpretable mappings from natural language queries to voxel selectivity.