2025-05-28

Title: Beyond Demonstrations: Dynamic Vector Construction from Latent Representations

Authors: Wang Cai, Hsiu-Yuan Huang, Zhixiang Wang, Yunfang Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20318
Pdf URL: https://arxiv.org/pdf/2505.20318
Copy Paste: [[2505.20318]] Beyond Demonstrations: Dynamic Vector Construction from Latent Representations(https://arxiv.org/abs/2505.20318)
Keywords: in-context
Abstract: In-Context derived Vector (ICV) methods extract task-relevant representations from large language models (LLMs) and reinject them during inference, achieving comparable performance to few-shot In-Context Learning (ICL) without repeated demonstration processing. However, existing ICV methods remain sensitive to ICL-specific factors, often use coarse or semantically fragmented representations as the source of the vector, and rely on heuristic-based injection positions, limiting their applicability. To address these issues, we propose Dynamic Vector (DyVec), which incorporates an Exhaustive Query Rotation (EQR) strategy to extract robust semantically aggregated latent representations by mitigating variance introduced by ICL. It then applies Dynamic Latent Segmentation and Injection to adaptively partition representations based on task complexity and leverages REINFORCE-based optimization to learn optimal injection positions for each segment. Experiments results show that DyVec outperforms few-shot ICL, LoRA, and prior ICV baselines. Further analysis highlights the effectiveness of dynamically segmenting and injecting semantically aggregated latent representations. DyVec provides a lightweight and data-efficient solution for inference-time task adaptation.

Title: Joint-stochastic-approximation Random Fields with Application to Semi-supervised Learning

Authors: Yunfu Song, Zhijian Ou
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20330
Pdf URL: https://arxiv.org/pdf/2505.20330
Copy Paste: [[2505.20330]] Joint-stochastic-approximation Random Fields with Application to Semi-supervised Learning(https://arxiv.org/abs/2505.20330)
Keywords: generative
Abstract: Our examination of deep generative models (DGMs) developed for semi-supervised learning (SSL), mainly GANs and VAEs, reveals two problems. First, mode missing and mode covering phenomenons are observed in genertion with GANs and VAEs. Second, there exists an awkward conflict between good classification and good generation in SSL by employing directed generative models. To address these problems, we formally present joint-stochastic-approximation random fields (JRFs) -- a new family of algorithms for building deep undirected generative models, with application to SSL. It is found through synthetic experiments that JRFs work well in balancing mode covering and mode missing, and match the empirical data distribution well. Empirically, JRFs achieve good classification results comparable to the state-of-art methods on widely adopted datasets -- MNIST, SVHN, and CIFAR-10 in SSL, and simultaneously perform good generation.

Title: Decision Flow Policy Optimization

Authors: Jifeng Hu, Sili Huang, Siyuan Guo, Zhaogeng Liu, Li Shen, Lichao Sun, Hechang Chen, Yi Chang, Dacheng Tao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20350
Pdf URL: https://arxiv.org/pdf/2505.20350
Copy Paste: [[2505.20350]] Decision Flow Policy Optimization(https://arxiv.org/abs/2505.20350)
Keywords: generative
Abstract: In recent years, generative models have shown remarkable capabilities across diverse fields, including images, videos, language, and decision-making. By applying powerful generative models such as flow-based models to reinforcement learning, we can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces, surpassing the limitations of single-modal action distributions with traditional Gaussian-based policies. Previous methods usually adopt the generative models as behavior models to fit state-conditioned action distributions from datasets, with policy optimization conducted separately through additional policies using value-based sample weighting or gradient-based updates. However, this separation prevents the simultaneous optimization of multi-modal distribution fitting and policy improvement, ultimately hindering the training of models and degrading the performance. To address this issue, we propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization. Specifically, our method formulates the action generation procedure of flow-based models as a flow decision-making process, where each action generation step corresponds to one flow decision. Consequently, our method seamlessly optimizes the flow policy while capturing multi-modal action distributions. We provide rigorous proofs of Decision Flow and validate the effectiveness through extensive experiments across dozens of offline RL environments. Compared with established offline RL baselines, the results demonstrate that our method achieves or matches the SOTA performance.

Title: FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation

Authors: Dong Liu, Jiayi Zhang, Yifan Li, Yanxuan Yu, Ben Lengerich, Ying Nian Wu
Subjects: cs.LG, cs.AI, cs.CV, cs.MM, cs.PF
Abstract URL: https://arxiv.org/abs/2505.20353
Pdf URL: https://arxiv.org/pdf/2505.20353
Copy Paste: [[2505.20353]] FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation(https://arxiv.org/abs/2505.20353)
Keywords: diffusion, generative
Abstract: Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. To alleviate this inefficiency, we propose FastCache, a hidden-state-level caching and compression framework that accelerates DiT inference by exploiting redundancy within the model's internal representations. FastCache introduces a dual strategy: (1) a spatial-aware token selection mechanism that adaptively filters redundant tokens based on hidden state saliency, and (2) a transformer-level cache that reuses latent activations across timesteps when changes are statistically insignificant. These modules work jointly to reduce unnecessary computation while preserving generation fidelity through learnable linear approximation. Theoretical analysis shows that FastCache maintains bounded approximation error under a hypothesis-testing-based decision rule. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage, with best generation output quality compared to other cache methods, as measured by FID and t-FID. Code implementation of FastCache is available on GitHub at this https URL.

Title: GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning

Authors: Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, Eunhyeok Park
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20355
Pdf URL: https://arxiv.org/pdf/2505.20355
Copy Paste: [[2505.20355]] GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2505.20355)
Keywords: generative
Abstract: Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA's structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA's limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at this https URL

Title: What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models

Authors: Lorenzo Baraldi, Davide Bucciarelli, Federico Betti, Marcella Cornia, Lorenzo Baraldi, Nicu Sebe, Rita Cucchiara
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20405
Pdf URL: https://arxiv.org/pdf/2505.20405
Copy Paste: [[2505.20405]] What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models(https://arxiv.org/abs/2505.20405)
Keywords: generative
Abstract: Instruction-based image editing models offer increased personalization opportunities in generative tasks. However, properly evaluating their results is challenging, and most of the existing metrics lag in terms of alignment with human judgment and explainability. To tackle these issues, we introduce DICE (DIfference Coherence Estimator), a model designed to detect localized differences between the original and the edited image and to assess their relevance to the given modification request. DICE consists of two key components: a difference detector and a coherence estimator, both built on an autoregressive Multimodal Large Language Model (MLLM) and trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. Through extensive experiments, we evaluate each stage of our pipeline, comparing different MLLMs within the proposed framework. We demonstrate that DICE effectively identifies coherent edits, effectively evaluating images generated by different editing models with a strong correlation with human judgment. We publicly release our source code, models, and data.

Title: SEMMA: A Semantic Aware Knowledge Graph Foundation Model

Authors: Arvindh Arun, Sumit Kumar, Mojtaba Nayyeri, Bo Xiong, Ponnurangam Kumaraguru, Antonio Vergari, Steffen Staab
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20422
Pdf URL: https://arxiv.org/pdf/2505.20422
Copy Paste: [[2505.20422]] SEMMA: A Semantic Aware Knowledge Graph Foundation Model(https://arxiv.org/abs/2505.20422)
Keywords: foundation model
Abstract: Knowledge Graph Foundation Models (KGFMs) have shown promise in enabling zero-shot reasoning over unseen graphs by learning transferable patterns. However, most existing KGFMs rely solely on graph structure, overlooking the rich semantic signals encoded in textual attributes. We introduce SEMMA, a dual-module KGFM that systematically integrates transferable textual semantics alongside structure. SEMMA leverages Large Language Models (LLMs) to enrich relation identifiers, generating semantic embeddings that subsequently form a textual relation graph, which is fused with the structural component. Across 54 diverse KGs, SEMMA outperforms purely structural baselines like ULTRA in fully inductive link prediction. Crucially, we show that in more challenging generalization settings, where the test-time relation vocabulary is entirely unseen, structural methods collapse while SEMMA is 2x more effective. Our findings demonstrate that textual semantics are critical for generalization in settings where structure alone fails, highlighting the need for foundation models that unify structural and linguistic signals in knowledge reasoning.

Title: In-context Language Learning for Endangered Languages in Speech Recognition

Authors: Zhaolin Li, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20445
Pdf URL: https://arxiv.org/pdf/2505.20445
Copy Paste: [[2505.20445]] In-context Language Learning for Endangered Languages in Speech Recognition(https://arxiv.org/abs/2505.20445)
Keywords: in-context
Abstract: With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs.

Title: Time Series Generation Under Data Scarcity: A Unified Generative Modeling Approach

Authors: Tal Gonen, Itai Pemper, Ilan Naiman, Nimrod Berman, Omri Azencot
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20446
Pdf URL: https://arxiv.org/pdf/2505.20446
Copy Paste: [[2505.20446]] Time Series Generation Under Data Scarcity: A Unified Generative Modeling Approach(https://arxiv.org/abs/2505.20446)
Keywords: diffusion, generative
Abstract: Generative modeling of time series is a central challenge in time series analysis, particularly under data-scarce conditions. Despite recent advances in generative modeling, a comprehensive understanding of how state-of-the-art generative models perform under limited supervision remains lacking. In this work, we conduct the first large-scale study evaluating leading generative models in data-scarce settings, revealing a substantial performance gap between full-data and data-scarce regimes. To close this gap, we propose a unified diffusion-based generative framework that can synthesize high-fidelity time series across diverse domains using just a few examples. Our model is pre-trained on a large, heterogeneous collection of time series datasets, enabling it to learn generalizable temporal representations. It further incorporates architectural innovations such as dynamic convolutional layers for flexible channel adaptation and dataset token conditioning for domain-aware generation. Without requiring abundant supervision, our unified model achieves state-of-the-art performance in few-shot settings-outperforming domain-specific baselines across a wide range of subset sizes. Remarkably, it also surpasses all baselines even when tested on full datasets benchmarks, highlighting the strength of pre-training and cross-domain generalization. We hope this work encourages the community to revisit few-shot generative modeling as a key problem in time series research and pursue unified solutions that scale efficiently across domains. Code is available at this https URL.

Title: DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data

Authors: Ruqi Wu, Xinjie Wang, Liu Liu, Chunle Guo, Jiaxiong Qiu, Chongyi Li, Lichao Huang, Zhizhong Su, Ming-Ming Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20460
Pdf URL: https://arxiv.org/pdf/2505.20460
Copy Paste: [[2505.20460]] DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data(https://arxiv.org/abs/2505.20460)
Keywords: diffusion
Abstract: We present DIPO, a novel framework for the controllable generation of articulated 3D objects from a pair of images: one depicting the object in a resting state and the other in an articulated state. Compared to the single-image approach, our dual-image input imposes only a modest overhead for data collection, but at the same time provides important motion information, which is a reliable guide for predicting kinematic relationships between parts. Specifically, we propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. In addition, we introduce a Chain-of-Thought (CoT) based graph reasoner that explicitly infers part connectivity relationships. To further improve robustness and generalization on complex articulated objects, we develop a fully automated dataset expansion pipeline, name LEGO-Art, that enriches the diversity and complexity of PartNet-Mobility dataset. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions. Extensive experiments demonstrate that DIPO significantly outperforms existing baselines in both the resting state and the articulated state, while the proposed PM-X dataset further enhances generalization to diverse and structurally complex articulated objects. Our code and dataset will be released to the community upon publication.

Title: WeatherEdit: Controllable Weather Editing with 4D Gaussian Field

Authors: Chenghao Qian, Wenjing Li, Yuhu Guo, Gustav Markkula
Subjects: cs.CV, cs.AI, cs.ET, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2505.20471
Pdf URL: https://arxiv.org/pdf/2505.20471
Copy Paste: [[2505.20471]] WeatherEdit: Controllable Weather Editing with 4D Gaussian Field(https://arxiv.org/abs/2505.20471)
Keywords: diffusion
Abstract: In this work, we present WeatherEdit, a novel weather editing pipeline for generating realistic weather effects with controllable types and severity in 3D scenes. Our approach is structured into two key components: weather background editing and weather particle construction. For weather background editing, we introduce an all-in-one adapter that integrates multiple weather styles into a single pretrained diffusion model, enabling the generation of diverse weather effects in 2D image backgrounds. During inference, we design a Temporal-View (TV-) attention mechanism that follows a specific order to aggregate temporal and spatial information, ensuring consistent editing across multi-frame and multi-view images. To construct the weather particles, we first reconstruct a 3D scene using the edited images and then introduce a dynamic 4D Gaussian field to generate snowflakes, raindrops and fog in the scene. The attributes and dynamics of these particles are precisely controlled through physical-based modelling and simulation, ensuring realistic weather representation and flexible severity adjustments. Finally, we integrate the 4D Gaussian field with the 3D scene to render consistent and highly realistic weather effects. Experiments on multiple driving datasets demonstrate that WeatherEdit can generate diverse weather effects with controllable condition severity, highlighting its potential for autonomous driving simulation in adverse weather. See project page: this https URL

Title: Gatsby Without the 'E': Crafting Lipograms with LLMs

Authors: Rohan Balasubramanian, Nitish Gokulakrishnan, Syeda Jannatus Saba, Steven Skiena
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.20501
Pdf URL: https://arxiv.org/pdf/2505.20501
Copy Paste: [[2505.20501]] Gatsby Without the 'E': Crafting Lipograms with LLMs(https://arxiv.org/abs/2505.20501)
Keywords: generative
Abstract: Lipograms are a unique form of constrained writing where all occurrences of a particular letter are excluded from the text, typified by the novel Gadsby, which daringly avoids all usage of the letter 'e'. In this study, we explore the power of modern large language models (LLMs) by transforming the novel F. Scott Fitzgerald's The Great Gatsby into a fully 'e'-less text. We experimented with a range of techniques, from baseline methods like synonym replacement to sophisticated generative models enhanced with beam search and named entity analysis. We show that excluding up to 3.6% of the most common letters (up to the letter 'u') had minimal impact on the text's meaning, although translation fidelity rapidly and predictably decays with stronger lipogram constraints. Our work highlights the surprising flexibility of English under strict constraints, revealing just how adaptable and creative language can be.

Title: CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic

Authors: Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, Lin Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20510
Pdf URL: https://arxiv.org/pdf/2505.20510
Copy Paste: [[2505.20510]] CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists' Diagnostic Logic(https://arxiv.org/abs/2505.20510)
Keywords: foundation model
Abstract: Recent advances in computational pathology have led to the emergence of numerous foundation models. However, these approaches fail to replicate the diagnostic process of pathologists, as they either simply rely on general-purpose encoders with multi-instance learning for classification or directly apply multimodal models to generate reports from images. A significant limitation is their inability to emulate the diagnostic logic employed by pathologists, who systematically examine slides at low magnification for overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. To address this gap, we introduce CPathAgent, an innovative agent-based model that mimics pathologists' reasoning processes by autonomously executing zoom-in/out and navigation operations across pathology images based on observed visual features. To achieve this, we develop a multi-stage training strategy unifying patch-level, region-level, and whole-slide capabilities within a single model, which is essential for mimicking pathologists, who require understanding and reasoning capabilities across all three scales. This approach generates substantially more detailed and interpretable diagnostic reports compared to existing methods, particularly for huge region understanding. Additionally, we construct an expert-validated PathMMU-HR$^{2}$, the first benchmark for huge region analysis, a critical intermediate scale between patches and whole slides, as diagnosticians typically examine several key regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across three scales of benchmarks, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for the future development of computational pathology.

Title: MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning

Authors: Wenhao Gu, Li Gu, Ching Yee Suen, Yang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20513
Pdf URL: https://arxiv.org/pdf/2505.20513
Copy Paste: [[2505.20513]] MetaWriter: Personalized Handwritten Text Recognition Using Meta-Learned Prompt Tuning(https://arxiv.org/abs/2505.20513)
Keywords: self-supervised
Abstract: Recent advancements in handwritten text recognition (HTR) have enabled the effective conversion of handwritten text to digital formats. However, achieving robust recognition across diverse writing styles remains challenging. Traditional HTR methods lack writer-specific personalization at test time due to limitations in model architecture and training strategies. Existing attempts to bridge this gap, through gradient-based meta-learning, still require labeled examples and suffer from parameter-inefficient fine-tuning, leading to substantial computational and memory overhead. To overcome these challenges, we propose an efficient framework that formulates personalization as prompt tuning, incorporating an auxiliary image reconstruction task with a self-supervised loss to guide prompt adaptation with unlabeled test-time examples. To ensure self-supervised loss effectively minimizes text recognition error, we leverage meta-learning to learn the optimal initialization of the prompts. As a result, our method allows the model to efficiently capture unique writing styles by updating less than 1% of its parameters and eliminating the need for time-intensive annotation processes. We validate our approach on the RIMES and IAM Handwriting Database benchmarks, where it consistently outperforms previous state-of-the-art methods while using 20x fewer parameters. We believe this represents a significant advancement in personalized handwritten text recognition, paving the way for more reliable and practical deployment in resource-constrained scenarios.

Title: MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance

Authors: Aniket Roy, Maitreya Suin, Ketul Shah, Rama Chellappa
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20525
Pdf URL: https://arxiv.org/pdf/2505.20525
Copy Paste: [[2505.20525]] MultLFG: Training-free Multi-LoRA composition using Frequency-domain Guidance(https://arxiv.org/abs/2505.20525)
Keywords: generative
Abstract: Low-Rank Adaptation (LoRA) has gained prominence as a computationally efficient method for fine-tuning generative models, enabling distinct visual concept synthesis with minimal overhead. However, current methods struggle to effectively merge multiple LoRA adapters without training, particularly in complex compositions involving diverse visual elements. We introduce MultLFG, a novel framework for training-free multi-LoRA composition that utilizes frequency-domain guidance to achieve adaptive fusion of multiple LoRAs. Unlike existing methods that uniformly aggregate concept-specific LoRAs, MultLFG employs a timestep and frequency subband adaptive fusion strategy, selectively activating relevant LoRAs based on content relevance at specific timesteps and frequency bands. This frequency-sensitive guidance not only improves spatial coherence but also provides finer control over multi-LoRA composition, leading to more accurate and consistent results. Experimental evaluations on the ComposLoRA benchmark reveal that MultLFG substantially enhances compositional fidelity and image quality across various styles and concept sets, outperforming state-of-the-art baselines in multi-concept generation tasks. Code will be released.

Title: Causality and "In-the-Wild" Video-Based Person Re-ID: A Survey

Authors: Md Rashidunnabi, Kailash Hambarde, Hugo Proença
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20540
Pdf URL: https://arxiv.org/pdf/2505.20540
Copy Paste: [[2505.20540]] Causality and "In-the-Wild" Video-Based Person Re-ID: A Survey(https://arxiv.org/abs/2505.20540)
Keywords: self-supervised, generative
Abstract: Video-based person re-identification (Re-ID) remains brittle in real-world deployments despite impressive benchmark performance. Most existing models rely on superficial correlations such as clothing, background, or lighting that fail to generalize across domains, viewpoints, and temporal variations. This survey examines the emerging role of causal reasoning as a principled alternative to traditional correlation-based approaches in video-based Re-ID. We provide a structured and critical analysis of methods that leverage structural causal models, interventions, and counterfactual reasoning to isolate identity-specific features from confounding factors. The survey is organized around a novel taxonomy of causal Re-ID methods that spans generative disentanglement, domain-invariant modeling, and causal transformers. We review current evaluation metrics and introduce causal-specific robustness measures. In addition, we assess practical challenges of scalability, fairness, interpretability, and privacy that must be addressed for real-world adoption. Finally, we identify open problems and outline future research directions that integrate causal modeling with efficient architectures and self-supervised learning. This survey aims to establish a coherent foundation for causal video-based person Re-ID and to catalyze the next phase of research in this rapidly evolving domain.

Title: Emotion Classification In-Context in Spanish

Authors: Bipul Thapa, Gabriel Cofre
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20571
Pdf URL: https://arxiv.org/pdf/2505.20571
Copy Paste: [[2505.20571]] Emotion Classification In-Context in Spanish(https://arxiv.org/abs/2505.20571)
Keywords: in-context
Abstract: Classifying customer feedback into distinct emotion categories is essential for understanding sentiment and improving customer experience. In this paper, we classify customer feedback in Spanish into three emotion categories--positive, neutral, and negative--using advanced NLP and ML techniques. Traditional methods translate feedback from widely spoken languages to less common ones, resulting in a loss of semantic integrity and contextual nuances inherent to the original language. To address this limitation, we propose a hybrid approach that combines TF-IDF with BERT embeddings, effectively transforming Spanish text into rich numerical representations that preserve the semantic depth of the original language by using a Custom Stacking Ensemble (CSE) approach. To evaluate emotion classification, we utilize a range of models, including Logistic Regression, KNN, Bagging classifier with LGBM, and AdaBoost. The CSE model combines these classifiers as base models and uses a one-vs-all Logistic Regression as the meta-model. Our experimental results demonstrate that CSE significantly outperforms the individual and BERT model, achieving a test accuracy of 93.3% on the native Spanish dataset--higher than the accuracy obtained from the translated version. These findings underscore the challenges of emotion classification in Spanish and highlight the advantages of combining vectorization techniques like TF-IDF with BERT for improved accuracy. Our results provide valuable insights for businesses seeking to leverage emotion classification to enhance customer feedback analysis and service improvements.

Title: Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL

Authors: Xingyu Chen, Shihao Ma, Runsheng Lin, Jiecong Lin, Bo Wang
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2505.20578
Pdf URL: https://arxiv.org/pdf/2505.20578
Copy Paste: [[2505.20578]] Ctrl-DNA: Controllable Cell-Type-Specific Regulatory DNA Design via Constrained RL(https://arxiv.org/abs/2505.20578)
Keywords: generative
Abstract: Designing regulatory DNA sequences that achieve precise cell-type-specific gene expression is crucial for advancements in synthetic biology, gene therapy and precision medicine. Although transformer-based language models (LMs) can effectively capture patterns in regulatory DNA, their generative approaches often struggle to produce novel sequences with reliable cell-specific activity. Here, we introduce Ctrl-DNA, a novel constrained reinforcement learning (RL) framework tailored for designing regulatory DNA sequences with controllable cell-type specificity. By formulating regulatory sequence design as a biologically informed constrained optimization problem, we apply RL to autoregressive genomic LMs, enabling the models to iteratively refine sequences that maximize regulatory activity in targeted cell types while constraining off-target effects. Our evaluation on human promoters and enhancers demonstrates that Ctrl-DNA consistently outperforms existing generative and RL-based approaches, generating high-fitness regulatory sequences and achieving state-of-the-art cell-type specificity. Moreover, Ctrl-DNA-generated sequences capture key cell-type-specific transcription factor binding sites (TFBS), short DNA motifs recognized by regulatory proteins that control gene expression, demonstrating the biological plausibility of the generated sequences.

Title: Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction

Authors: Mahdi Pourmirzaei, Farzaneh Esmaili, Salhuldin Alqarghuli, Mohammadreza Pourmirzaei, Ye Han, Kai Chen, Mohsen Rezaei, Duolin Wang, Dong Xu
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2505.20589
Pdf URL: https://arxiv.org/pdf/2505.20589
Copy Paste: [[2505.20589]] Prot2Token: A Unified Framework for Protein Modeling via Next-Token Prediction(https://arxiv.org/abs/2505.20589)
Keywords: self-supervised
Abstract: The diverse nature of protein prediction tasks has traditionally necessitated specialized models, hindering the development of broadly applicable and computationally efficient Protein Language Models (PLMs). In this work, we introduce Prot2Token, a unified framework that overcomes these challenges by converting a wide spectrum of protein-related predictions, from sequence-level properties and residue-specific attributes to complex inter-protein interactions, into a standardized next-token prediction format. At its core, Prot2Token employs an autoregressive decoder, conditioned on embeddings from pre-trained protein encoders and guided by learnable task tokens, to perform diverse predictions. This architecture uniquely facilitates multi-task learning, enabling a single model to master numerous tasks with improved efficiency. We present extensive experimental validation across a variety of benchmarks, demonstrating Prot2Tokens strong predictive power in different types of protein-prediction tasks. Key results include significant speedups (e.g., near 1000x over AlphaFold2 with MSA) and performance often matching or exceeding specialized approaches. Beyond that, we introduce an auxiliary self-supervised decoder pre-training approach to improve spatially sensitive task performance. Prot2Token thus offers a significant step towards a versatile, high-throughput paradigm for protein modeling, promising to accelerate biological discovery and the development of novel therapeutics. The code is available at this https URL .

Title: Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation

Authors: Dancheng Liu, Amir Nassereldine, Chenhui Xu, Jinjun Xiong
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.20606
Pdf URL: https://arxiv.org/pdf/2505.20606
Copy Paste: [[2505.20606]] Towards Pretraining Robust ASR Foundation Model with Acoustic-Aware Data Augmentation(https://arxiv.org/abs/2505.20606)
Keywords: foundation model
Abstract: Whisper's robust performance in automatic speech recognition (ASR) is often attributed to its massive 680k-hour training set, an impractical scale for most researchers. In this work, we examine how linguistic and acoustic diversity in training data affect the robustness of the ASR model and reveal that transcription generalization is primarily driven by acoustic variation rather than linguistic richness. We find that targeted acoustic augmentation methods could significantly improve the generalization ability of ASR models, reducing word-error rates by up to 19.24 percent on unseen datasets when training on the 960-hour Librispeech dataset. These findings highlight strategic acoustically focused data augmentation as a promising alternative to massive datasets for building robust ASR models, offering a potential solution to future foundation ASR models when massive human speech data is lacking.

Title: OccLE: Label-Efficient 3D Semantic Occupancy Prediction

Authors: Naiyu Fang, Zheyuan Zhou, Fayao Liu, Xulei Yang, Jiacheng Wei, Lemiao Qiu, Guosheng Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20617
Pdf URL: https://arxiv.org/pdf/2505.20617
Copy Paste: [[2505.20617]] OccLE: Label-Efficient 3D Semantic Occupancy Prediction(https://arxiv.org/abs/2505.20617)
Keywords: foundation model
Abstract: 3D semantic occupancy prediction offers an intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a Label-Efficient 3D Semantic Occupancy Prediction that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Therefore, the semantic branch distills 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs in cross-plane synergy based on their inherency, employing semi-supervision to enhance geometry learning. We fuse semantic-geometric feature grids through Dual Mamba and incorporate a scatter-accumulated projection to supervise unannotated prediction with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10% of voxel annotations, reaching a mIoU of 16.59% on the SemanticKITTI validation set.

Title: Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training

Authors: Bolin Lai, Sangmin Lee, Xu Cao, Xiang Li, James M. Rehg
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20629
Pdf URL: https://arxiv.org/pdf/2505.20629
Copy Paste: [[2505.20629]] Incorporating Flexible Image Conditioning into Text-to-Video Diffusion Models without Training(https://arxiv.org/abs/2505.20629)
Keywords: diffusion, foundation model
Abstract: Text-image-to-video (TI2V) generation is a critical problem for controllable video generation using both semantic and visual conditions. Most existing methods typically add visual conditions to text-to-video (T2V) foundation models by finetuning, which is costly in resources and only limited to a few predefined conditioning settings. To tackle this issue, we introduce a unified formulation for TI2V generation with flexible visual conditioning. Furthermore, we propose an innovative training-free approach, dubbed FlexTI2V, that can condition T2V foundation models on an arbitrary amount of images at arbitrary positions. Specifically, we firstly invert the condition images to noisy representation in a latent space. Then, in the denoising process of T2V models, our method uses a novel random patch swapping strategy to incorporate visual features into video representations through local image patches. To balance creativity and fidelity, we use a dynamic control mechanism to adjust the strength of visual conditioning to each video frame. Extensive experiments validate that our method surpasses previous training-free image conditioning methods by a notable margin. We also show more insights of our method by detailed ablation study and analysis.

Title: Test-Time Learning for Large Language Models

Authors: Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, Mingkui Tan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20633
Pdf URL: https://arxiv.org/pdf/2505.20633
Copy Paste: [[2505.20633]] Test-Time Learning for Large Language Models(https://arxiv.org/abs/2505.20633)
Keywords: self-supervised
Abstract: While Large Language Models (LLMs) have exhibited remarkable emergent capabilities through extensive pre-training, they still face critical limitations in generalizing to specialized domains and handling diverse linguistic variations, known as distribution shifts. In this paper, we propose a Test-Time Learning (TTL) paradigm for LLMs, namely TLM, which dynamically adapts LLMs to target domains using only unlabeled test data during testing. Specifically, we first provide empirical evidence and theoretical insights to reveal that more accurate predictions from LLMs can be achieved by minimizing the input perplexity of the unlabeled test data. Based on this insight, we formulate the Test-Time Learning process of LLMs as input perplexity minimization, enabling self-supervised enhancement of LLM performance. Furthermore, we observe that high-perplexity samples tend to be more informative for model optimization. Accordingly, we introduce a Sample Efficient Learning Strategy that actively selects and emphasizes these high-perplexity samples for test-time updates. Lastly, to mitigate catastrophic forgetting and ensure adaptation stability, we adopt Low-Rank Adaptation (LoRA) instead of full-parameter optimization, which allows lightweight model updates while preserving more original knowledge from the model. We introduce the AdaptEval benchmark for TTL and demonstrate through experiments that TLM improves performance by at least 20% compared to original LLMs on domain knowledge adaptation.

Title: Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers

Authors: Yukun Zhang, Xueqing Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20666
Pdf URL: https://arxiv.org/pdf/2505.20666
Copy Paste: [[2505.20666]] Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers(https://arxiv.org/abs/2505.20666)
Keywords: diffusion
Abstract: We propose a novel framework, Continuous_Time Attention, which infuses partial differential equations (PDEs) into the Transformer's attention mechanism to address the challenges of extremely long input sequences. Instead of relying solely on a static attention matrix, we allow attention weights to evolve over a pseudo_time dimension via diffusion, wave, or reaction_diffusion dynamics. This mechanism systematically smooths local noise, enhances long_range dependencies, and stabilizes gradient flow. Theoretically, our analysis shows that PDE_based attention leads to better optimization landscapes and polynomial rather than exponential decay of distant interactions. Empirically, we benchmark our method on diverse experiments_demonstrating consistent gains over both standard and specialized long sequence Transformer variants. Our findings highlight the potential of PDE_based formulations to enrich attention mechanisms with continuous_time dynamics and global coherence.

Title: Pretraining Language Models to Ponder in Continuous Space

Authors: Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.20674
Pdf URL: https://arxiv.org/pdf/2505.20674
Copy Paste: [[2505.20674]] Pretraining Language Models to Ponder in Continuous Space(https://arxiv.org/abs/2505.20674)
Keywords: self-supervised
Abstract: Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Our method is straightforward and can be seamlessly integrated with various existing language models. Experiments across three widely used open-source architectures-GPT-2, Pythia, and LLaMA-and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, pondering-enhanced Pythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at this https URL.

Title: Generating Hypotheses of Dynamic Causal Graphs in Neuroscience: Leveraging Generative Factor Models of Observed Time Series

Authors: Zachary C. Brown, David Carlson
Subjects: cs.LG, cs.AI, stat.AP, stat.ML
Abstract URL: https://arxiv.org/abs/2505.20697
Pdf URL: https://arxiv.org/pdf/2505.20697
Copy Paste: [[2505.20697]] Generating Hypotheses of Dynamic Causal Graphs in Neuroscience: Leveraging Generative Factor Models of Observed Time Series(https://arxiv.org/abs/2505.20697)
Keywords: generative
Abstract: The field of hypothesis generation promises to reduce costs in neuroscience by narrowing the range of interventional studies needed to study various phenomena. Existing machine learning methods can generate scientific hypotheses from complex datasets, but many approaches assume causal relationships are static over time, limiting their applicability to systems with dynamic, state-dependent behavior, such as the brain. While some techniques attempt dynamic causal discovery through factor models, they often restrict relationships to linear patterns or impose other simplifying assumptions. We propose a novel method that models dynamic graphs as a conditionally weighted superposition of static graphs, where each static graph can capture nonlinear relationships. This approach enables the detection of complex, time-varying interactions between variables beyond linear limitations. Our method improves f1-scores of predicted dynamic causal patterns by roughly 22-28% on average over baselines in some of our experiments, with some improvements reaching well over 60%. A case study on real brain data demonstrates our method's ability to uncover relationships linked to specific behavioral states, offering valuable insights into neural dynamics.

Title: LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation

Authors: Pascal Zwick, Nils Friederich, Maximilian Beichter, Lennart Hilbert, Ralf Mikut, Oliver Bringmann
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20723
Pdf URL: https://arxiv.org/pdf/2505.20723
Copy Paste: [[2505.20723]] LeDiFlow: Learned Distribution-guided Flow Matching to Accelerate Image Generation(https://arxiv.org/abs/2505.20723)
Keywords: diffusion, generative
Abstract: Enhancing the efficiency of high-quality image generation using Diffusion Models (DMs) is a significant challenge due to the iterative nature of the process. Flow Matching (FM) is emerging as a powerful generative modeling paradigm based on a simulation-free training objective instead of a score-based one used in DMs. Typical FM approaches rely on a Gaussian distribution prior, which induces curved, conditional probability paths between the prior and target data distribution. These curved paths pose a challenge for the Ordinary Differential Equation (ODE) solver, requiring a large number of inference calls to the flow prediction network. To address this issue, we present Learned Distribution-guided Flow Matching (LeDiFlow), a novel scalable method for training FM-based image generation models using a better-suited prior distribution learned via a regression-based auxiliary model. By initializing the ODE solver with a prior closer to the target data distribution, LeDiFlow enables the learning of more computationally tractable probability paths. These paths directly translate to fewer solver steps needed for high-quality image generation at inference time. Our method utilizes a State-Of-The-Art (SOTA) transformer architecture combined with latent space sampling and can be trained on a consumer workstation. We empirically demonstrate that LeDiFlow remarkably outperforms the respective FM baselines. For instance, when operating directly on pixels, our model accelerates inference by up to 3.75x compared to the corresponding pixel-space baseline. Simultaneously, our latent FM model enhances image quality on average by 1.32x in CLIP Maximum Mean Discrepancy (CMMD) metric against its respective baseline.

Title: Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting

Authors: Xiangyu Sun, Runnan Chen, Mingming Gong, Dong Xu, Tongliang Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20729
Pdf URL: https://arxiv.org/pdf/2505.20729
Copy Paste: [[2505.20729]] Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting(https://arxiv.org/abs/2505.20729)
Keywords: foundation model
Abstract: Sparse-view scene reconstruction often faces significant challenges due to the constraints imposed by limited observational data. These limitations result in incomplete information, leading to suboptimal reconstructions using existing methodologies. To address this, we present Intern-GS, a novel approach that effectively leverages rich prior knowledge from vision foundation models to enhance the process of sparse-view Gaussian Splatting, thereby enabling high-quality scene reconstruction. Specifically, Intern-GS utilizes vision foundation models to guide both the initialization and the optimization process of 3D Gaussian splatting, effectively addressing the limitations of sparse inputs. In the initialization process, our method employs DUSt3R to generate a dense and non-redundant gaussian point cloud. This approach significantly alleviates the limitations encountered by traditional structure-from-motion (SfM) methods, which often struggle under sparse-view constraints. During the optimization process, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions. Extensive experiments demonstrate that Intern-GS achieves state-of-the-art rendering quality across diverse datasets, including both forward-facing and large-scale scenes, such as LLFF, DTU, and Tanks and Temples.

Title: MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition

Authors: Hao Zhang, Zhan Zhuang, Xuehao Wang, Xiaodong Yang, Yu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20744
Pdf URL: https://arxiv.org/pdf/2505.20744
Copy Paste: [[2505.20744]] MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition(https://arxiv.org/abs/2505.20744)
Keywords: self-supervised
Abstract: Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two-stages. first stage is to partition multi-channel sensor streams into short segments and quantizing them into discrete "motion primitive" codewords, while the second stage enriches those tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. The proposed MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. Most importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities regardless of dataset origin.

Title: Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction

Authors: Yifei Wang, Weimin Bai, Colin Zhang, Debing Zhang, Weijian Luo, He Sun
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2505.20755
Pdf URL: https://arxiv.org/pdf/2505.20755
Copy Paste: [[2505.20755]] Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction(https://arxiv.org/abs/2505.20755)
Keywords: diffusion
Abstract: In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc, inside a theory-driven framework which we name the \textbf{\emph{Uni-Instruct}}. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of \textbf{\emph{1.46}} for unconditional generation and \textbf{\emph{1.38}} for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of \textbf{\emph{1.02}}, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.

Title: Robust and Explainable Detector of Time Series Anomaly via Augmenting Multiclass Pseudo-Anomalies

Authors: Kohei Obata, Yasuko Matsubara, Yasushi Sakurai
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20765
Pdf URL: https://arxiv.org/pdf/2505.20765
Copy Paste: [[2505.20765]] Robust and Explainable Detector of Time Series Anomaly via Augmenting Multiclass Pseudo-Anomalies(https://arxiv.org/abs/2505.20765)
Keywords: anomaly
Abstract: Unsupervised anomaly detection in time series has been a pivotal research area for decades. Current mainstream approaches focus on learning normality, on the assumption that all or most of the samples in the training set are normal. However, anomalies in the training set (i.e., anomaly contamination) can be misleading. Recent studies employ data augmentation to generate pseudo-anomalies and learn the boundary separating the training samples from the augmented samples. Although this approach mitigates anomaly contamination if augmented samples mimic unseen real anomalies, it suffers from several limitations. (1) Covering a wide range of time series anomalies is challenging. (2) It disregards augmented samples that resemble normal samples (i.e., false anomalies). (3) It places too much trust in the labels of training and augmented samples. In response, we propose RedLamp, which employs diverse data augmentations to generate multiclass pseudo-anomalies and learns the multiclass boundary. Such multiclass pseudo-anomalies cover a wide variety of time series anomalies. We conduct multiclass classification using soft labels, which prevents the model from being overconfident and ensures its robustness against contaminated/false anomalies. The learned latent space is inherently explainable as it is trained to separate pseudo-anomalies into multiclasses. Extensive experiments demonstrate the effectiveness of RedLamp in anomaly detection and its robustness against anomaly contamination.

Title: Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models

Authors: Yang Zheng, Wen Li, Zhaoqiang Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20789
Pdf URL: https://arxiv.org/pdf/2505.20789
Copy Paste: [[2505.20789]] Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models(https://arxiv.org/abs/2505.20789)
Keywords: diffusion, generative
Abstract: Inverse problems (IPs) involve reconstructing signals from noisy observations. Traditional approaches often rely on handcrafted priors, which can fail to capture the complexity of real-world data. The advent of pre-trained generative models has introduced new paradigms, offering improved reconstructions by learning rich priors from data. Among these, diffusion models (DMs) have emerged as a powerful framework, achieving remarkable reconstruction performance across numerous IPs. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug~\cite{wang2024dmplug}, we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approach under appropriate conditions and validate its superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.

Title: Not All Thats Rare Is Lost: Causal Paths to Rare Concept Synthesis

Authors: Bo-Kai Ruan, Zi-Xiang Ni, Bo-Lun Huang, Teng-Fang Hsiao, Hong-Han Shuai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20808
Pdf URL: https://arxiv.org/pdf/2505.20808
Copy Paste: [[2505.20808]] Not All Thats Rare Is Lost: Causal Paths to Rare Concept Synthesis(https://arxiv.org/abs/2505.20808)
Keywords: diffusion, generative
Abstract: Diffusion models have shown strong capabilities in high-fidelity image generation but often falter when synthesizing rare concepts, i.e., prompts that are infrequently observed in the training distribution. In this paper, we introduce RAP, a principled framework that treats rare concept generation as navigating a latent causal path: a progressive, model-aligned trajectory through the generative space from frequent concepts to rare targets. Rather than relying on heuristic prompt alternation, we theoretically justify that rare prompt guidance can be approximated by semantically related frequent prompts. We then formulate prompt switching as a dynamic process based on score similarity, enabling adaptive stage transitions. Furthermore, we reinterpret prompt alternation as a second-order denoising mechanism, promoting smooth semantic progression and coherent visual synthesis. Through this causal lens, we align input scheduling with the model's internal generative dynamics. Experiments across diverse diffusion backbones demonstrate that RAP consistently enhances rare concept generation, outperforming strong baselines in both automated evaluations and human studies.

Title: Frame-Level Captions for Long Video Generation with Complex Multi Scenes

Authors: Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, Nan Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20827
Pdf URL: https://arxiv.org/pdf/2505.20827
Copy Paste: [[2505.20827]] Frame-Level Captions for Long Video Generation with Complex Multi Scenes(https://arxiv.org/abs/2505.20827)
Keywords: diffusion
Abstract: Generating long videos that can show complex stories, like movie scenes from scripts, has great promise and offers much more than short clips. However, current methods that use autoregression with diffusion models often struggle because their step-by-step process naturally leads to a serious error accumulation (drift). Also, many existing ways to make long videos focus on single, continuous scenes, making them less useful for stories with many events and changes. This paper introduces a new approach to solve these problems. First, we propose a novel way to annotate datasets at the frame-level, providing detailed text guidance needed for making complex, multi-scene long videos. This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely. A key feature is that each part (frame) within these windows can be guided by its own distinct text prompt. Our training uses Diffusion Forcing to provide the model with the ability to handle time flexibly. We tested our approach on difficult VBench 2.0 benchmarks ("Complex Plots" and "Complex Landscapes") based on the WanX2.1-T2V-1.3B model. The results show our method is better at following instructions in complex, changing scenes and creates high-quality long videos. We plan to share our dataset annotation methods and trained models with the research community. Project page: this https URL .

Title: HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling

Authors: Hexiong Yang, Mingrui Chen, Huaibo Huang, Junxian Duan, Jie Cao, Zhen Zhou, Ran He
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2505.20836
Pdf URL: https://arxiv.org/pdf/2505.20836
Copy Paste: [[2505.20836]] HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling(https://arxiv.org/abs/2505.20836)
Keywords: self-supervised
Abstract: Inspired by the great success of Masked Language Modeling (MLM) in the natural language domain, the paradigm of self-supervised pre-training and fine-tuning has also achieved remarkable progress in the field of DNA sequence modeling. However, previous methods often relied on massive pre-training data or large-scale base models with huge parameters, imposing a significant computational burden. To address this, many works attempted to use more compact models to achieve similar outcomes but still fell short by a considerable margin. In this work, we propose a Hybrid Architecture Distillation (HAD) approach, leveraging both distillation and reconstruction tasks for more efficient and effective pre-training. Specifically, we employ the NTv2-500M as the teacher model and devise a grouping masking strategy to align the feature embeddings of visible tokens while concurrently reconstructing the invisible tokens during MLM pre-training. To validate the effectiveness of our proposed method, we conducted comprehensive experiments on the Nucleotide Transformer Benchmark and Genomic Benchmark. Compared to models with similar parameters, our model achieved excellent performance. More surprisingly, it even surpassed the distillation ceiling-teacher model on some sub-tasks, which is more than 500 $\times$ larger. Lastly, we utilize t-SNE for more intuitive visualization, which shows that our model can gain a sophisticated understanding of the intrinsic representation pattern in genomic sequences.

Title: Exploring Timeline Control for Facial Motion Generation

Authors: Yifeng Ma, Jinwei Qi, Chaonan Ji, Peng Zhang, Bang Zhang, Zhidong Deng, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20861
Pdf URL: https://arxiv.org/pdf/2505.20861
Copy Paste: [[2505.20861]] Exploring Timeline Control for Facial Motion Generation(https://arxiv.org/abs/2505.20861)
Keywords: diffusion
Abstract: This paper introduces a new control signal for facial motion generation: timeline control. Compared to audio and text signals, timelines provide more fine-grained control, such as generating specific facial motions with precise timing. Users can specify a multi-track timeline of facial actions arranged in temporal intervals, allowing precise control over the timing of each action. To model the timeline control capability, We first annotate the time intervals of facial actions in natural facial motion sequences at a frame-level granularity. This process is facilitated by Toeplitz Inverse Covariance-based Clustering to minimize human labor. Based on the annotations, we propose a diffusion-based generation model capable of generating facial motions that are natural and accurately aligned with input timelines. Our method supports text-guided motion generation by using ChatGPT to convert text into timelines. Experimental results show that our method can annotate facial action intervals with satisfactory accuracy, and produces natural facial motions accurately aligned with timelines.

Title: Respond to Change with Constancy: Instruction-tuning with LLM for Non-I.I.D. Network Traffic Classification

Authors: Xinjie Lin, Gang Xiong, Gaopeng Gou, Wenqi Dong, Jing Yu, Zhen Li, Wei Xia
Subjects: cs.CR, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2505.20866
Pdf URL: https://arxiv.org/pdf/2505.20866
Copy Paste: [[2505.20866]] Respond to Change with Constancy: Instruction-tuning with LLM for Non-I.I.D. Network Traffic Classification(https://arxiv.org/abs/2505.20866)
Keywords: self-supervised
Abstract: Encrypted traffic classification is highly challenging in network security due to the need for extracting robust features from content-agnostic traffic data. Existing approaches face critical issues: (i) Distribution drift, caused by reliance on the closedworld assumption, limits adaptability to realworld, shifting patterns; (ii) Dependence on labeled data restricts applicability where such data is scarce or unavailable. Large language models (LLMs) have demonstrated remarkable potential in offering generalizable solutions across a wide range of tasks, achieving notable success in various specialized fields. However, their effectiveness in traffic analysis remains constrained by challenges in adapting to the unique requirements of the traffic domain. In this paper, we introduce a novel traffic representation model named Encrypted Traffic Out-of-Distribution Instruction Tuning with LLM (ETooL), which integrates LLMs with knowledge of traffic structures through a self-supervised instruction tuning paradigm. This framework establishes connections between textual information and traffic interactions. ETooL demonstrates more robust classification performance and superior generalization in both supervised and zero-shot traffic classification tasks. Notably, it achieves significant improvements in F1 scores: APP53 (I.I.D.) to 93.19%(6.62%) and 92.11%(4.19%), APP53 (O.O.D.) to 74.88%(18.17%) and 72.13%(15.15%), and ISCX-Botnet (O.O.D.) to 95.03%(9.16%) and 81.95%(12.08%). Additionally, we construct NETD, a traffic dataset designed to support dynamic distributional shifts, and use it to validate ETooL's effectiveness under varying distributional conditions. Furthermore, we evaluate the efficiency gains achieved through ETooL's instruction tuning approach.

Title: In Context Learning with Vision Transformers: Case Study

Authors: Antony Zhao, Alex Proshkin, Fergal Hennessy, Francesco Crivelli
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20872
Pdf URL: https://arxiv.org/pdf/2505.20872
Copy Paste: [[2505.20872]] In Context Learning with Vision Transformers: Case Study(https://arxiv.org/abs/2505.20872)
Keywords: in-context
Abstract: Large transformer models have been shown to be capable of performing in-context learning. By using examples in a prompt as well as a query, they are capable of performing tasks such as few-shot, one-shot, or zero-shot learning to output the corresponding answer to this query. One area of interest to us is that these transformer models have been shown to be capable of learning the general class of certain functions, such as linear functions and small 2-layer neural networks, on random data (Garg et al, 2023). We aim to extend this to the image space to analyze their capability to in-context learn more complex functions on the image space, such as convolutional neural networks and other methods.

Title: Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

Authors: Jeongsoo Choi, Jaehun Kim, Joon Son Chung
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.20899
Pdf URL: https://arxiv.org/pdf/2505.20899
Copy Paste: [[2505.20899]] Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing(https://arxiv.org/abs/2505.20899)
Keywords: diffusion
Abstract: This paper introduces a cross-lingual dubbing system that translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. Despite the strong translation quality of existing speech translation approaches, they often overlook the transfer of speech patterns, leading to mismatches with source speech and limiting their suitability for dubbing applications. To address this, we propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech based on the predicted units and source identity with a conditional flow matching model. Additionally, we introduce a unit-based speed adaptation mechanism that guides the translation model to produce speech at a rate consistent with the source, without relying on any text. Extensive experiments demonstrate that our framework generates natural and fluent translations that align with the original speech's duration and speaking pace, while achieving competitive translation performance.

Title: Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects

Authors: Wei Li, Hebei Li, Yansong Peng, Siying Wu, Yueyi Zhang, Xiaoyan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20909
Pdf URL: https://arxiv.org/pdf/2505.20909
Copy Paste: [[2505.20909]] Create Anything Anywhere: Layout-Controllable Personalized Diffusion Model for Multiple Subjects(https://arxiv.org/abs/2505.20909)
Keywords: diffusion, generative
Abstract: Diffusion models have significantly advanced text-to-image generation, laying the foundation for the development of personalized generative frameworks. However, existing methods lack precise layout controllability and overlook the potential of dynamic features of reference subjects in improving fidelity. In this work, we propose Layout-Controllable Personalized Diffusion (LCP-Diffusion) model, a novel framework that integrates subject identity preservation with flexible layout guidance in a tuning-free approach. Our model employs a Dynamic-Static Complementary Visual Refining module to comprehensively capture the intricate details of reference subjects, and introduces a Dual Layout Control mechanism to enforce robust spatial control across both training and inference stages. Extensive experiments validate that LCP-Diffusion excels in both identity preservation and layout controllability. To the best of our knowledge, this is a pioneering work enabling users to "create anything anywhere".

Title: Geometry-Editable and Appearance-Preserving Object Compositon

Authors: Jianman Lin, Haojie Li, Chunmei Qing, Zhijing Yang, Liang Lin, Tianshui Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20914
Pdf URL: https://arxiv.org/pdf/2505.20914
Copy Paste: [[2505.20914]] Geometry-Editable and Appearance-Preserving Object Compositon(https://arxiv.org/abs/2505.20914)
Keywords: diffusion
Abstract: General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties, while simultaneously preserving its fine-grained appearance details. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. However, these highly compact embeddings encode only high-level semantic cues and inevitably discard fine-grained appearance details. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion (DGAD) model that first leverages semantic embeddings to implicitly capture the desired geometric transformations and then employs a cross-attention retrieval mechanism to align fine-grained appearance features with the geometry-edited representation, facilitating both precise geometry editing and faithful appearance preservation in object composition. Specifically, DGAD builds on CLIP/DINO-derived and reference networks to extract semantic embeddings and appearance-preserving representations, which are then seamlessly integrated into the encoding and decoding pipelines in a disentangled manner. We first integrate the semantic embeddings into pre-trained diffusion models that exhibit strong spatial reasoning capabilities to implicitly capture object geometry, thereby facilitating flexible object manipulation and ensuring effective editability. Then, we design a dense cross-attention mechanism that leverages the implicitly learned object geometry to retrieve and spatially align appearance features with their corresponding regions, ensuring faithful appearance consistency. Extensive experiments on public benchmarks demonstrate the effectiveness of the proposed DGAD framework.

Title: NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion

Authors: Max Collins, Jordan Vice, Tim French, Ajmal Mian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.20934
Pdf URL: https://arxiv.org/pdf/2505.20934
Copy Paste: [[2505.20934]] NatADiff: Adversarial Boundary Guidance for Natural Adversarial Diffusion(https://arxiv.org/abs/2505.20934)
Keywords: diffusion
Abstract: Adversarial samples exploit irregularities in the manifold ``learned'' by deep learning models to cause misclassifications. The study of these adversarial samples provides insight into the features a model uses to classify inputs, which can be leveraged to improve robustness against future attacks. However, much of the existing literature focuses on constrained adversarial samples, which do not accurately reflect test-time errors encountered in real-world settings. To address this, we propose `NatADiff', an adversarial sampling scheme that leverages denoising diffusion to generate natural adversarial samples. Our approach is based on the observation that natural adversarial samples frequently contain structural elements from the adversarial class. Deep learning models can exploit these structural elements to shortcut the classification process, rather than learning to genuinely distinguish between classes. To leverage this behavior, we guide the diffusion trajectory towards the intersection of the true and adversarial classes, combining time-travel sampling with augmented classifier guidance to enhance attack transferability while preserving image fidelity. Our method achieves comparable attack success rates to current state-of-the-art techniques, while exhibiting significantly higher transferability across model architectures and better alignment with natural test-time errors as measured by FID. These results demonstrate that NatADiff produces adversarial samples that not only transfer more effectively across models, but more faithfully resemble naturally occurring test-time errors.

Title: ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

Authors: Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20935
Pdf URL: https://arxiv.org/pdf/2505.20935
Copy Paste: [[2505.20935]] ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation(https://arxiv.org/abs/2505.20935)
Keywords: diffusion
Abstract: Text-to-image diffusion models excel at generating single-instance scenes but struggle with multi-instance scenarios, often merging or omitting objects. Unlike previous training-free approaches that rely solely on semantic-level guidance without addressing instance individuation, our training-free method, Instance-to-Semantic Attention Control (ISAC), explicitly resolves incomplete instance formation and semantic entanglement through an instance-first modeling approach. This enables ISAC to effectively leverage a hierarchical, tree-structured prompt mechanism, disentangling multiple object instances and individually aligning them with their corresponding semantic labels. Without employing any external models, ISAC achieves up to 52% average multi-class accuracy and 83% average multi-instance accuracy by effectively forming disentangled instances. The code will be made available upon publication.

Title: Unveiling Impact of Frequency Components on Membership Inference Attacks for Diffusion Models

Authors: Puwei Lian, Yujun Cai, Songze Li
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.20955
Pdf URL: https://arxiv.org/pdf/2505.20955
Copy Paste: [[2505.20955]] Unveiling Impact of Frequency Components on Membership Inference Attacks for Diffusion Models(https://arxiv.org/abs/2505.20955)
Keywords: diffusion
Abstract: Diffusion models have achieved tremendous success in image generation, but they also raise significant concerns regarding privacy and copyright issues. Membership Inference Attacks (MIAs) are designed to ascertain whether specific data were utilized during a model's training phase. As current MIAs for diffusion models typically exploit the model's image prediction ability, we formalize them into a unified general paradigm which computes the membership score for membership identification. Under this paradigm, we empirically find that existing attacks overlook the inherent deficiency in how diffusion models process high-frequency information. Consequently, this deficiency leads to member data with more high-frequency content being misclassified as hold-out data, and hold-out data with less high-frequency content tend to be misclassified as member data. Moreover, we theoretically demonstrate that this deficiency reduces the membership advantage of attacks, thereby interfering with the effective discrimination of member data and hold-out data. Based on this insight, we propose a plug-and-play high-frequency filter module to mitigate the adverse effects of the deficiency, which can be seamlessly integrated into any attacks within this general paradigm without additional time costs. Extensive experiments corroborate that this module significantly improves the performance of baseline attacks across different datasets and models.

Title: OrienText: Surface Oriented Textual Image Generation

Authors: Shubham Singh Paliwal, Arushi Jain, Monika Sharma, Vikram Jamwal, Lovekesh Vig
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20958
Pdf URL: https://arxiv.org/pdf/2505.20958
Copy Paste: [[2505.20958]] OrienText: Surface Oriented Textual Image Generation(https://arxiv.org/abs/2505.20958)
Keywords: diffusion
Abstract: Textual content in images is crucial in e-commerce sectors, particularly in marketing campaigns, product imaging, advertising, and the entertainment industry. Current text-to-image (T2I) generation diffusion models, though proficient at producing high-quality images, often struggle to incorporate text accurately onto complex surfaces with varied perspectives, such as angled views of architectural elements like buildings, banners, or walls. In this paper, we introduce the Surface Oriented Textual Image Generation (OrienText) method, which leverages region-specific surface normals as conditional input to T2I generation diffusion model. Our approach ensures accurate rendering and correct orientation of the text within the image context. We demonstrate the effectiveness of the OrienText method on a self-curated dataset of images and compare it against the existing textual image generation methods.

Title: Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation

Authors: Zhibo Wang, Xiaoze Jiang, Zhiheng Qin, Enyun Yu, Han Li
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.20966
Pdf URL: https://arxiv.org/pdf/2505.20966
Copy Paste: [[2505.20966]] Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation(https://arxiv.org/abs/2505.20966)
Keywords: generative
Abstract: Query auto-completion (QAC) plays a crucial role in modern search systems. However, in real-world applications, there are two pressing challenges that still need to be addressed. First, there is a need for hierarchical personalized representations for users. Previous approaches have typically used users' search behavior as a single, overall representation, which proves inadequate in more nuanced generative scenarios. Additionally, query prefixes are typically short and may contain typos or sensitive information, increasing the likelihood of generating toxic content compared to traditional text generation tasks. Such toxic content can degrade user experience and lead to public relations issues. Therefore, the second critical challenge is detoxifying QAC systems. To address these two limitations, we propose a novel model (LaD) that captures personalized information from both long-term and short-term interests, incorporating adaptive detoxification. In LaD, personalized information is captured hierarchically at both coarse-grained and fine-grained levels. This approach preserves as much personalized information as possible while enabling online generation within time constraints. To move a futher step, we propose an online training method based on Reject Preference Optimization (RPO). By incorporating a special token [Reject] during both the training and inference processes, the model achieves adaptive detoxification. Consequently, the generated text presented to users is both non-toxic and relevant to the given prefix. We conduct comprehensive experiments on industrial-scale datasets and perform online A/B tests, delivering the largest single-experiment metric improvement in nearly two years of our product. Our model has been deployed on Kuaishou search, driving the primary traffic for hundreds of millions of active users. The code is available at this https URL.

Title: DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization

Authors: Shamil Ayupov, Maksim Nakhodnov, Anastasia Yaschenko, Andrey Kuznetsov, Aibek Alanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.20975
Pdf URL: https://arxiv.org/pdf/2505.20975
Copy Paste: [[2505.20975]] DreamBoothDPO: Improving Personalized Generation using Direct Preference Optimization(https://arxiv.org/abs/2505.20975)
Keywords: diffusion
Abstract: Personalized diffusion models have shown remarkable success in Text-to-Image (T2I) generation by enabling the injection of user-defined concepts into diverse contexts. However, balancing concept fidelity with contextual alignment remains a challenging open problem. In this work, we propose an RL-based approach that leverages the diverse outputs of T2I models to address this issue. Our method eliminates the need for human-annotated scores by generating a synthetic paired dataset for DPO-like training using external quality metrics. These better-worse pairs are specifically constructed to improve both concept fidelity and prompt adherence. Moreover, our approach supports flexible adjustment of the trade-off between image fidelity and textual alignment. Through multi-step training, our approach outperforms a naive baseline in convergence speed and output quality. We conduct extensive qualitative and quantitative analysis, demonstrating the effectiveness of our method across various architectures and fine-tuning techniques. The source code can be found at this https URL.

Title: Facial Attribute Based Text Guided Face Anonymization

Authors: Mustafa İzzet Muştu, Hazım Kemal Ekenel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21002
Pdf URL: https://arxiv.org/pdf/2505.21002
Copy Paste: [[2505.21002]] Facial Attribute Based Text Guided Face Anonymization(https://arxiv.org/abs/2505.21002)
Keywords: diffusion, generative
Abstract: The increasing prevalence of computer vision applications necessitates handling vast amounts of visual data, often containing personal information. While this technology offers significant benefits, it should not compromise privacy. Data privacy regulations emphasize the need for individual consent for processing personal data, hindering researchers' ability to collect high-quality datasets containing the faces of the individuals. This paper presents a deep learning-based face anonymization pipeline to overcome this challenge. Unlike most of the existing methods, our method leverages recent advancements in diffusion-based inpainting models, eliminating the need for training Generative Adversarial Networks. The pipeline employs a three-stage approach: face detection with RetinaNet, feature extraction with VGG-Face, and realistic face generation using the state-of-the-art BrushNet diffusion model. BrushNet utilizes the entire image, face masks, and text prompts specifying desired facial attributes like age, ethnicity, gender, and expression. This enables the generation of natural-looking images with unrecognizable individuals, facilitating the creation of privacy-compliant datasets for computer vision research.

Title: Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?

Authors: Yifei Wang, Yu Sheng, Linjing Li, Daniel Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21003
Pdf URL: https://arxiv.org/pdf/2505.21003
Copy Paste: [[2505.21003]] Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?(https://arxiv.org/abs/2505.21003)
Keywords: in-context
Abstract: Recent advances in handling long sequences have facilitated the exploration of long-context in-context learning (ICL). While much of the existing research emphasizes performance improvements driven by additional in-context examples, the influence on the trustworthiness of generated responses remains underexplored. This paper addresses this gap by investigating how increased examples influence predictive uncertainty, an essential aspect in trustworthiness. We begin by systematically quantifying the uncertainty of ICL with varying shot counts, analyzing the impact of example quantity. Through uncertainty decomposition, we introduce a novel perspective on performance enhancement, with a focus on epistemic uncertainty (EU). Our results reveal that additional examples reduce total uncertainty in both simple and complex tasks by injecting task-specific knowledge, thereby diminishing EU and enhancing performance. For complex tasks, these advantages emerge only after addressing the increased noise and uncertainty associated with longer inputs. Finally, we explore the evolution of internal confidence across layers, unveiling the mechanisms driving the reduction in uncertainty.

Title: Efficient and Unbiased Sampling from Boltzmann Distributions via Variance-Tuned Diffusion Models

Authors: Fengzhe Zhang, Laurence I. Midgley, José Miguel Hernández-Lobato
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21005
Pdf URL: https://arxiv.org/pdf/2505.21005
Copy Paste: [[2505.21005]] Efficient and Unbiased Sampling from Boltzmann Distributions via Variance-Tuned Diffusion Models(https://arxiv.org/abs/2505.21005)
Keywords: diffusion
Abstract: Score-based diffusion models (SBDMs) are powerful amortized samplers for Boltzmann distributions; however, imperfect score estimates bias downstream Monte Carlo estimates. Classical importance sampling (IS) can correct this bias, but computing exact likelihoods requires solving the probability-flow ordinary differential equation (PF-ODE), a procedure that is prohibitively costly and scales poorly with dimensionality. We introduce Variance-Tuned Diffusion Importance Sampling (VT-DIS), a lightweight post-training method that adapts the per-step noise covariance of a pretrained SBDM by minimizing the $\alpha$-divergence ($\alpha=2$) between its forward diffusion and reverse denoising trajectories. VT-DIS assigns a single trajectory-wise importance weight to the joint forward-reverse process, yielding unbiased expectation estimates at test time with negligible overhead compared to standard sampling. On the DW-4, LJ-13, and alanine-dipeptide benchmarks, VT-DIS achieves effective sample sizes of approximately 80 %, 35 %, and 3.5 %, respectively, while using only a fraction of the computational budget required by vanilla diffusion + IS or PF-ODE-based IS.

Title: FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models

Authors: Nils Neukirch, Johanna Vielhaben, Nils Strodthoff
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21032
Pdf URL: https://arxiv.org/pdf/2505.21032
Copy Paste: [[2505.21032]] FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models(https://arxiv.org/abs/2505.21032)
Keywords: diffusion
Abstract: Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.

Title: RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy

Authors: Aiyue Chen, Bin Dong, Jingru Li, Jing Lin, Yiwu Yao, Gongyi Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21036
Pdf URL: https://arxiv.org/pdf/2505.21036
Copy Paste: [[2505.21036]] RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy(https://arxiv.org/abs/2505.21036)
Keywords: diffusion
Abstract: Video generation using diffusion models is highly computationally intensive, with 3D attention in Diffusion Transformer (DiT) models accounting for over 80\% of the total computational resources. In this work, we introduce {\bf RainFusion}, a novel training-free sparse attention method that exploits inherent sparsity nature in visual data to accelerate attention computation while preserving video quality. Specifically, we identify three unique sparse patterns in video generation attention calculations--Spatial Pattern, Temporal Pattern and Textural Pattern. The sparse pattern for each attention head is determined online with negligible overhead (\textasciitilde\,0.2\%) with our proposed {\bf ARM} (Adaptive Recognition Module) during inference. Our proposed {\bf RainFusion} is a plug-and-play method, that can be seamlessly integrated into state-of-the-art 3D-attention video generation models without additional training or calibration. We evaluate our method on leading open-sourced models including HunyuanVideo, OpenSoraPlan-1.2 and CogVideoX-5B, demonstrating its broad applicability and effectiveness. Experimental results show that RainFusion achieves over {\bf 2$\times$} speedup in attention computation while maintaining video quality, with only a minimal impact on VBench scores (-0.2\%).

Title: Advancing high-fidelity 3D and Texture Generation with 2.5D latents

Authors: Xin Yang, Jiantao Lin, Yingjie Xu, Haodong Li, Yingcong Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21050
Pdf URL: https://arxiv.org/pdf/2505.21050
Copy Paste: [[2505.21050]] Advancing high-fidelity 3D and Texture Generation with 2.5D latents(https://arxiv.org/abs/2505.21050)
Keywords: foundation model, generative
Abstract: Despite the availability of large-scale 3D datasets and advancements in 3D generative models, the complexity and uneven quality of 3D geometry and texture data continue to hinder the performance of 3D generation techniques. In most existing approaches, 3D geometry and texture are generated in separate stages using different models and non-unified representations, frequently leading to unsatisfactory coherence between geometry and texture. To address these challenges, we propose a novel framework for joint generation of 3D geometry and texture. Specifically, we focus in generate a versatile 2.5D representations that can be seamlessly transformed between 2D and 3D. Our approach begins by integrating multiview RGB, normal, and coordinate images into a unified representation, termed as 2.5D latents. Next, we adapt pre-trained 2D foundation models for high-fidelity 2.5D generation, utilizing both text and image conditions. Finally, we introduce a lightweight 2.5D-to-3D refiner-decoder framework that efficiently generates detailed 3D representations from 2.5D images. Extensive experiments demonstrate that our model not only excels in generating high-quality 3D objects with coherent structure and color from text and image inputs but also significantly outperforms existing methods in geometry-conditioned texture generation.

Title: Minute-Long Videos with Dual Parallelisms

Authors: Zeqing Wang, Bowen Zheng, Xingyi Yang, Yuecong Xu, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21070
Pdf URL: https://arxiv.org/pdf/2505.21070
Copy Paste: [[2505.21070]] Minute-Long Videos with Dual Parallelisms(https://arxiv.org/abs/2505.21070)
Keywords: diffusion
Abstract: Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54$\times$ lower latency and 1.48$\times$ lower memory cost on 8$\times$RTX 4090 GPUs.

Title: Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance

Authors: Badr Moufad, Yazid Janati, Alain Durmus, Ahmed Ghorbel, Eric Moulines, Jimmy Olsson
Subjects: cs.LG, stat.ME
Abstract URL: https://arxiv.org/abs/2505.21101
Pdf URL: https://arxiv.org/pdf/2505.21101
Copy Paste: [[2505.21101]] Conditional Diffusion Models with Classifier-Free Gibbs-like Guidance(https://arxiv.org/abs/2505.21101)
Keywords: diffusion
Abstract: Classifier-Free Guidance (CFG) is a widely used technique for improving conditional diffusion models by linearly combining the outputs of conditional and unconditional denoisers. While CFG enhances visual quality and improves alignment with prompts, it often reduces sample diversity, leading to a challenging trade-off between quality and diversity. To address this issue, we make two key contributions. First, CFG generally does not correspond to a well-defined denoising diffusion model (DDM). In particular, contrary to common intuition, CFG does not yield samples from the target distribution associated with the limiting CFG score as the noise level approaches zero -- where the data distribution is tilted by a power $w \gt 1$ of the conditional distribution. We identify the missing component: a Rényi divergence term that acts as a repulsive force and is required to correct CFG and render it consistent with a proper DDM. Our analysis shows that this correction term vanishes in the low-noise limit. Second, motivated by this insight, we propose a Gibbs-like sampling procedure to draw samples from the desired tilted distribution. This method starts with an initial sample from the conditional diffusion model without CFG and iteratively refines it, preserving diversity while progressively enhancing sample quality. We evaluate our approach on both image and text-to-audio generation tasks, demonstrating substantial improvements over CFG across all considered metrics. The code is available at this https URL

Title: A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction

Authors: Bogdan Bogachov, Yaoyao Fiona Zhao
Subjects: cs.CL, cs.AI, cs.CE, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21109
Pdf URL: https://arxiv.org/pdf/2505.21109
Copy Paste: [[2505.21109]] A Lightweight Multi-Expert Generative Language Model System for Engineering Information and Knowledge Extraction(https://arxiv.org/abs/2505.21109)
Keywords: generative
Abstract: Despite recent advancements in domain adaptation techniques for large language models, these methods remain computationally intensive, and the resulting models can still exhibit hallucination issues. Most existing adaptation methods do not prioritize reducing the computational resources required for fine-tuning and inference of language models. Hallucination issues have gradually decreased with each new model release. However, they remain prevalent in engineering contexts, where generating well-structured text with minimal errors and inconsistencies is critical. This work introduces a novel approach called the Small Language Graph (SLG), which is a lightweight adaptation solution designed to address the two key challenges outlined above. The system is structured in the form of a graph, where each node represents a lightweight expert - a small language model fine-tuned on specific and concise texts. The results of this study have shown that SLG was able to surpass conventional fine-tuning methods on the Exact Match metric by 3 times. Additionally, the fine-tuning process was 1.7 times faster compared to that of a larger stand-alone language model. These findings introduce a potential for small to medium-sized engineering companies to confidently use generative AI technologies, such as LLMs, without the necessity to invest in expensive computational resources. Also, the graph architecture and the small size of expert nodes offer a possible opportunity for distributed AI systems, thus potentially diverting the global need for expensive centralized compute clusters.

Title: Differentiable Solver Search for Fast Diffusion Sampling

Authors: Shuai Wang, Zexian Li, Qipeng zhang, Tianhui Song, Xubin Li, Tiezheng Ge, Bo Zheng, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21114
Pdf URL: https://arxiv.org/pdf/2505.21114
Copy Paste: [[2505.21114]] Differentiable Solver Search for Fast Diffusion Sampling(https://arxiv.org/abs/2505.21114)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion model and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet256 with only 10 steps. Meanwhile, DDPM model, DiT-XL/2, reaches a FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates generality across various model architectures, resolutions, and model sizes.

Title: ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction

Authors: Adeela Islam, Stefano Fiorini, Stuart James, Pietro Morerio, Alessio Del Bue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21117
Pdf URL: https://arxiv.org/pdf/2505.21117
Copy Paste: [[2505.21117]] ReassembleNet: Learnable Keypoints and Diffusion for 2D Fresco Reconstruction(https://arxiv.org/abs/2505.21117)
Keywords: diffusion
Abstract: The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones by Graph Neural Networks pooling inspired techniques. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. Further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 55% and 86% for RMSE Rotation and Translation, respectively.

Title: Learning Single Index Models with Diffusion Priors

Authors: Anqi Tang, Youming Chen, Shuchen Xue, Zhaoqiang Liu
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21135
Pdf URL: https://arxiv.org/pdf/2505.21135
Copy Paste: [[2505.21135]] Learning Single Index Models with Diffusion Priors(https://arxiv.org/abs/2505.21135)
Keywords: diffusion, generative
Abstract: Diffusion models (DMs) have demonstrated remarkable ability to generate diverse and high-quality images by efficiently modeling complex data distributions. They have also been explored as powerful generative priors for signal recovery, resulting in a substantial improvement in the quality of reconstructed signals. However, existing research on signal recovery with diffusion models either focuses on specific reconstruction problems or is unable to handle nonlinear measurement models with discontinuous or unknown link functions. In this work, we focus on using DMs to achieve accurate recovery from semi-parametric single index models, which encompass a variety of popular nonlinear models that may have {\em discontinuous} and {\em unknown} link functions. We propose an efficient reconstruction method that only requires one round of unconditional sampling and (partial) inversion of DMs. Theoretical analysis on the effectiveness of the proposed methods has been established under appropriate conditions. We perform numerical experiments on image datasets for different nonlinear measurement models. We observe that compared to competing methods, our approach can yield more accurate reconstructions while utilizing significantly fewer neural function evaluations.

Title: Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction

Authors: Mengjie Qian, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.21137
Pdf URL: https://arxiv.org/pdf/2505.21137
Copy Paste: [[2505.21137]] Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction(https://arxiv.org/abs/2505.21137)
Keywords: foundation model
Abstract: Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF) are crucial for second language learners, teachers and test takers. Traditional SGEC systems rely on a cascaded pipeline consisting of an ASR, a module for disfluency detection (DD) and removal and one for GEC. With the rise of end-to-end (E2E) speech foundation models, we investigate their effectiveness in SGEC and feedback generation. This work introduces a pseudo-labelling process to address the challenge of limited labelled data, expanding the training data size from 77 hours to approximately 2500 hours, leading to improved performance. Additionally, we prompt an E2E Whisper-based SGEC model with fluent transcriptions, showing a slight improvement in SGEC performance, with more significant gains in feedback generation. Finally, we assess the impact of increasing model size, revealing that while pseudo-labelled data does not yield performance gain for a larger Whisper model, training with prompts proves beneficial.

Title: Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

Authors: Tianyi Xu, Hongjie Chen, Wang Qing, Lv Hang, Jian Kang, Li Jie, Zhennan Lin, Yongxiang Li, Xie Lei
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.21138
Pdf URL: https://arxiv.org/pdf/2505.21138
Copy Paste: [[2505.21138]] Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis(https://arxiv.org/abs/2505.21138)
Keywords: self-supervised
Abstract: Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre- training, combined with large language models (LLM), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and do alignment training on a supervised dataset of 40,000 hours. Then, we systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research

Title: FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention

Authors: Sergey Karpukhin, Vadim Titov, Andrey Kuznetsov, Aibek Alanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21144
Pdf URL: https://arxiv.org/pdf/2505.21144
Copy Paste: [[2505.21144]] FastFace: Tuning Identity Preservation in Distilled Diffusion via Guidance and Attention(https://arxiv.org/abs/2505.21144)
Keywords: diffusion
Abstract: In latest years plethora of identity-preserving adapters for a personalized generation with diffusion models have been released. Their main disadvantage is that they are dominantly trained jointly with base diffusion models, which suffer from slow multi-step inference. This work aims to tackle the challenge of training-free adaptation of pretrained ID-adapters to diffusion models accelerated via distillation - through careful re-design of classifier-free guidance for few-step stylistic generation and attention manipulation mechanisms in decoupled blocks to improve identity similarity and fidelity, we propose universal FastFace framework. Additionally, we develop a disentangled public evaluation protocol for id-preserving adapters.

Title: Assessment of L2 Oral Proficiency using Speech Large Language Models

Authors: Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J.F. Gales
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.21148
Pdf URL: https://arxiv.org/pdf/2505.21148
Copy Paste: [[2505.21148]] Assessment of L2 Oral Proficiency using Speech Large Language Models(https://arxiv.org/abs/2505.21148)
Keywords: self-supervised
Abstract: The growing population of L2 English speakers has increased the demand for developing automatic graders for spoken language assessment (SLA). Historically, statistical models, text encoders, and self-supervised speech models have been utilised for this task. However, cascaded systems suffer from the loss of information, while E2E graders also have limitations. With the recent advancements of multi-modal large language models (LLMs), we aim to explore their potential as L2 oral proficiency graders and overcome these issues. In this work, we compare various training strategies using regression and classification targets. Our results show that speech LLMs outperform all previous competitive baselines, achieving superior performance on two datasets. Furthermore, the trained grader demonstrates strong generalisation capabilities in the cross-part or cross-task evaluation, facilitated by the audio understanding knowledge acquired during LLM pre-training.

Title: RoBiS: Robust Binary Segmentation for High-Resolution Industrial Images

Authors: Xurui Li, Zhonesheng Jiang, Tingxuan Ai, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21152
Pdf URL: https://arxiv.org/pdf/2505.21152
Copy Paste: [[2505.21152]] RoBiS: Robust Binary Segmentation for High-Resolution Industrial Images(https://arxiv.org/abs/2505.21152)
Keywords: anomaly
Abstract: Robust unsupervised anomaly detection (AD) in real-world scenarios is an important task. Current methods exhibit severe performance degradation on the MVTec AD 2 benchmark due to its complex real-world challenges. To solve this problem, we propose a robust framework RoBiS, which consists of three core modules: (1) Swin-Cropping, a high-resolution image pre-processing strategy to preserve the information of small anomalies through overlapping window cropping. (2) The data augmentation of noise addition and lighting simulation is carried out on the training data to improve the robustness of AD model. We use INP-Former as our baseline, which could generate better results on the various sub-images. (3) The traditional statistical-based binarization strategy (mean+3std) is combined with our previous work, MEBin (published in CVPR2025), for joint adaptive binarization. Then, SAM is further employed to refine the segmentation results. Compared with some methods reported by the MVTec AD 2, our RoBiS achieves a 29.2% SegF1 improvement (from 21.8% to 51.00%) on Test_private and 29.82% SegF1 gains (from 16.7% to 46.52%) on Test_private_mixed. Code is available at this https URL.

Title: STEB: In Search of the Best Evaluation Approach for Synthetic Time Series

Authors: Michael Stenger, Robert Leppich, André Bauer, Samuel Kounev
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21160
Pdf URL: https://arxiv.org/pdf/2505.21160
Copy Paste: [[2505.21160]] STEB: In Search of the Best Evaluation Approach for Synthetic Time Series(https://arxiv.org/abs/2505.21160)
Keywords: generative
Abstract: The growing need for synthetic time series, due to data augmentation or privacy regulations, has led to numerous generative models, frameworks, and evaluation measures alike. Objectively comparing these measures on a large scale remains an open challenge. We propose the Synthetic Time series Evaluation Benchmark (STEB) -- the first benchmark framework that enables comprehensive and interpretable automated comparisons of synthetic time series evaluation measures. Using 10 diverse datasets, randomness injection, and 13 configurable data transformations, STEB computes indicators for measure reliability and score consistency. It tracks running time, test errors, and features sequential and parallel modes of operation. In our experiments, we determine a ranking of 41 measures from literature and confirm that the choice of upstream time series embedding heavily impacts the final score.

Title: Normalized Attention Guidance: Universal Negative Guidance for Diffusion Model

Authors: Dar-Yen Chen, Hmrishav Bandyopadhyay, Kai Zou, Yi-Zhe Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21179
Pdf URL: https://arxiv.org/pdf/2505.21179
Copy Paste: [[2505.21179]] Normalized Attention Guidance: Universal Negative Guidance for Diffusion Model(https://arxiv.org/abs/2505.21179)
Keywords: diffusion
Abstract: Negative guidance -- explicitly suppressing unwanted attributes -- remains a fundamental challenge in diffusion models, particularly in few-step sampling regimes. While Classifier-Free Guidance (CFG) works well in standard settings, it fails under aggressive sampling step compression due to divergent predictions between positive and negative branches. We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses while maintaining fidelity. Unlike existing approaches, NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video), functioning as a \textit{universal} plug-in with minimal computational overhead. Through extensive experimentation, we demonstrate consistent improvements in text alignment (CLIP Score), fidelity (FID, PFID), and human-perceived quality (ImageReward). Our ablation studies validate each design component, while user studies confirm significant preference for NAG-guided outputs. As a model-agnostic inference-time approach requiring no retraining, NAG provides effortless negative guidance for all modern diffusion frameworks -- pseudocode in the Appendix!

Title: Sci-Fi: Symmetric Constraint for Frame Inbetweening

Authors: Liuhan Chen, Xiaodong Cun, Xiaoyu Li, Xianyi He, Shenghai Yuan, Jie Chen, Ying Shan, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21205
Pdf URL: https://arxiv.org/pdf/2505.21205
Copy Paste: [[2505.21205]] Sci-Fi: Symmetric Constraint for Frame Inbetweening(https://arxiv.org/abs/2505.21205)
Keywords: diffusion
Abstract: Frame inbetweening aims to synthesize intermediate video sequences conditioned on the given start and end frames. Current state-of-the-art methods mainly extend large-scale pre-trained Image-to-Video Diffusion models (I2V-DMs) by incorporating end-frame constraints via directly fine-tuning or omitting training. We identify a critical limitation in their design: Their injections of the end-frame constraint usually utilize the same mechanism that originally imposed the start-frame (single image) constraint. However, since the original I2V-DMs are adequately trained for the start-frame condition in advance, naively introducing the end-frame constraint by the same mechanism with much less (even zero) specialized training probably can't make the end frame have a strong enough impact on the intermediate content like the start frame. This asymmetric control strength of the two frames over the intermediate content likely leads to inconsistent motion or appearance collapse in generated frames. To efficiently achieve symmetric constraints of start and end frames, we propose a novel framework, termed Sci-Fi, which applies a stronger injection for the constraint of a smaller training scale. Specifically, it deals with the start-frame constraint as before, while introducing the end-frame constraint by an improved mechanism. The new mechanism is based on a well-designed lightweight module, named EF-Net, which encodes only the end frame and expands it into temporally adaptive frame-wise features injected into the I2V-DM. This makes the end-frame constraint as strong as the start-frame constraint, enabling our Sci-Fi to produce more harmonious transitions in various scenarios. Extensive experiments prove the superiority of our Sci-Fi compared with other baselines.

Title: Is Hyperbolic Space All You Need for Medical Anomaly Detection?

Authors: Alvaro Gonzalez-Jimenez, Simone Lionetti, Ludovic Amruthalingam, Philippe Gottfrois, Fabian Gröger, Marc Pouly, Alexander A. Navarini
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21228
Pdf URL: https://arxiv.org/pdf/2505.21228
Copy Paste: [[2505.21228]] Is Hyperbolic Space All You Need for Medical Anomaly Detection?(https://arxiv.org/abs/2505.21228)
Keywords: anomaly
Abstract: Medical anomaly detection has emerged as a promising solution to challenges in data availability and labeling constraints. Traditional methods extract features from different layers of pre-trained networks in Euclidean space; however, Euclidean representations fail to effectively capture the hierarchical relationships within these features, leading to suboptimal anomaly detection performance. We propose a novel yet simple approach that projects feature representations into hyperbolic space, aggregates them based on confidence levels, and classifies samples as healthy or anomalous. Our experiments demonstrate that hyperbolic space consistently outperforms Euclidean-based frameworks, achieving higher AUROC scores at both image and pixel levels across multiple medical benchmark datasets. Additionally, we show that hyperbolic space exhibits resilience to parameter variations and excels in few-shot scenarios, where healthy images are scarce. These findings underscore the potential of hyperbolic space as a powerful alternative for medical anomaly detection. The project website can be found at this https URL

Title: LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners

Authors: Yu He, Zihan Yao, Chentao Song, Tianyu Qi, Jun Liu, Ming Li, Qing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21239
Pdf URL: https://arxiv.org/pdf/2505.21239
Copy Paste: [[2505.21239]] LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners(https://arxiv.org/abs/2505.21239)
Keywords: diffusion
Abstract: Cognitive Diagnosis (CD) has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students' cognitive states. However, traditional CD models often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features but fail to fully bridge the gap between semantic understanding and cognitive profiling. In this work, we propose Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a novel framework designed to handle cold-start challenges by harnessing large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched contents of exercises and knowledge concepts (KCs), establishing stronger semantic links; and (2) Semantic-Cognitive Fusion, where LLMs employ causal attention mechanisms to integrate textual information and student cognitive states, creating comprehensive profiles for both students and exercises. These representations are efficiently trained with off-the-shelf CD models. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at this https URL

Title: BindEnergyCraft: Casting Protein Structure Predictors as Energy-Based Models for Binder Design

Authors: Divya Nori, Anisha Parsan, Caroline Uhler, Wengong Jin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21241
Pdf URL: https://arxiv.org/pdf/2505.21241
Copy Paste: [[2505.21241]] BindEnergyCraft: Casting Protein Structure Predictors as Energy-Based Models for Binder Design(https://arxiv.org/abs/2505.21241)
Keywords: diffusion
Abstract: Protein binder design has been transformed by hallucination-based methods that optimize structure prediction confidence metrics, such as the interface predicted TM-score (ipTM), via backpropagation. However, these metrics do not reflect the statistical likelihood of a binder-target complex under the learned distribution and yield sparse gradients for optimization. In this work, we propose a method to extract such likelihoods from structure predictors by reinterpreting their confidence outputs as an energy-based model (EBM). By leveraging the Joint Energy-based Modeling (JEM) framework, we introduce pTMEnergy, a statistical energy function derived from predicted inter-residue error distributions. We incorporate pTMEnergy into BindEnergyCraft (BECraft), a design pipeline that maintains the same optimization framework as BindCraft but replaces ipTM with our energy-based objective. BECraft outperforms BindCraft, RFDiffusion, and ESM3 across multiple challenging targets, achieving higher in silico binder success rates while reducing structural clashes. Furthermore, pTMEnergy establishes a new state-of-the-art in structure-based virtual screening tasks for miniprotein and RNA aptamer binders.

Title: Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings

Authors: Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21242
Pdf URL: https://arxiv.org/pdf/2505.21242
Copy Paste: [[2505.21242]] Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings(https://arxiv.org/abs/2505.21242)
Keywords: in-context
Abstract: Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation of multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at this https URL.

Title: Supervised and self-supervised land-cover segmentation & classification of the Biesbosch wetlands

Authors: Eva Gmelich Meijling, Roberto Del Prete, Arnoud Visser
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2505.21269
Pdf URL: https://arxiv.org/pdf/2505.21269
Copy Paste: [[2505.21269]] Supervised and self-supervised land-cover segmentation & classification of the Biesbosch wetlands(https://arxiv.org/abs/2505.21269)
Keywords: self-supervised
Abstract: Accurate wetland land-cover classification is essential for environmental monitoring, biodiversity assessment, and sustainable ecosystem management. However, the scarcity of annotated data, especially for high-resolution satellite imagery, poses a significant challenge for supervised learning approaches. To tackle this issue, this study presents a methodology for wetland land-cover segmentation and classification that adopts both supervised and self-supervised learning (SSL). We train a U-Net model from scratch on Sentinel-2 imagery across six wetland regions in the Netherlands, achieving a baseline model accuracy of 85.26%. Addressing the limited availability of labeled data, the results show that SSL pretraining with an autoencoder can improve accuracy, especially for the high-resolution imagery where it is more difficult to obtain labeled data, reaching an accuracy of 88.23%. Furthermore, we introduce a framework to scale manually annotated high-resolution labels to medium-resolution inputs. While the quantitative performance between resolutions is comparable, high-resolution imagery provides significantly sharper segmentation boundaries and finer spatial detail. As part of this work, we also contribute a curated Sentinel-2 dataset with Dynamic World labels, tailored for wetland classification tasks and made publicly available.

Title: Learnable Kernel Density Estimation for Graphs

Authors: Xudong Wang, Ziheng Sun, Chris Ding, Jicong Fan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21285
Pdf URL: https://arxiv.org/pdf/2505.21285
Copy Paste: [[2505.21285]] Learnable Kernel Density Estimation for Graphs(https://arxiv.org/abs/2505.21285)
Keywords: anomaly
Abstract: This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and complexity. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.

Title: A Cross Modal Knowledge Distillation & Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features

Authors: Ihab Bendidi, Yassir El Mesbahi, Alisandra K. Denton, Karush Suri, Kian Kenyon-Dean, Auguste Genovesio, Emmanuel Noutahi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21317
Pdf URL: https://arxiv.org/pdf/2505.21317
Copy Paste: [[2505.21317]] A Cross Modal Knowledge Distillation & Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features(https://arxiv.org/abs/2505.21317)
Keywords: foundation model
Abstract: Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.

Title: MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on

Authors: Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21325
Pdf URL: https://arxiv.org/pdf/2505.21325
Copy Paste: [[2505.21325]] MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on(https://arxiv.org/abs/2505.21325)
Keywords: diffusion
Abstract: Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in terms of spatiotemporal consistency and garment content preservation. First, they use diffusion models based on the U-Net, which are limited in their expressive capability and struggle to reconstruct complex details. Second, they adopt a separative modeling approach for spatial and temporal attention, which hinders the effective capture of structural relationships and dynamic consistency across frames. Third, their expression of garment details remains insufficient, affecting the realism and stability of the overall synthesized results, especially during human motion. To address the above challenges, we propose MagicTryOn, a video virtual try-on framework built upon the large-scale video diffusion this http URL replace the U-Net architecture with a diffusion Transformer and combine full self-attention to jointly model the spatiotemporal consistency of videos. We design a coarse-to-fine garment preservation strategy. The coarse strategy integrates garment tokens during the embedding stage, while the fine strategy incorporates multiple garment-based conditions, such as semantics, textures, and contour lines during the denoising stage. Moreover, we introduce a mask-aware loss to further optimize garment region fidelity. Extensive experiments on both image and video try-on datasets demonstrate that our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.

Title: AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping

Authors: Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21357
Pdf URL: https://arxiv.org/pdf/2505.21357
Copy Paste: [[2505.21357]] AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Crop Mapping(https://arxiv.org/abs/2505.21357)
Keywords: foundation model
Abstract: Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM's superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Codes will be available at urlhttps://github.com/flyakon/AgriFM.

Title: GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution

Authors: Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21375
Pdf URL: https://arxiv.org/pdf/2505.21375
Copy Paste: [[2505.21375]] GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution(https://arxiv.org/abs/2505.21375)
Keywords: foundation model
Abstract: Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data for Earth observation but pose challenges for existing multimodal foundation models due to two key bottlenecks: (1) limited availability of UHR training data, and (2) token explosion caused by the large image size. To address data scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA (avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion, our pilot studies reveal significant redundancy in RS images: crucial information is concentrated in a small subset of object-centric tokens, while pruning background tokens (e.g., ocean or forest) can even improve performance. Motivated by these findings, we propose two strategies: Background Token Pruning and Anchored Token Selection, to reduce the memory footprint while preserving key this http URL these techniques, we introduce GeoLLaVA-8K, the first RS-focused multimodal large language model capable of handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework. Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art on the XLRS-Bench.

Title: ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding

Authors: Linshuang Diao, Dayong Ren, Sensen Song, Yurong Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21381
Pdf URL: https://arxiv.org/pdf/2505.21381
Copy Paste: [[2505.21381]] ZigzagPointMamba: Spatial-Semantic Mamba for Point Cloud Understanding(https://arxiv.org/abs/2505.21381)
Keywords: self-supervised
Abstract: State Space models (SSMs) such as PointMamba enable efficient feature extraction for point cloud self-supervised learning with linear complexity, outperforming Transformers in computational efficiency. However, existing PointMamba-based methods depend on complex token ordering and random masking, which disrupt spatial continuity and local semantic correlations. We propose ZigzagPointMamba to tackle these challenges. The core of our approach is a simple zigzag scan path that globally sequences point cloud tokens, enhancing spatial continuity by preserving the proximity of spatially adjacent point tokens. Nevertheless, random masking undermines local semantic modeling in self-supervised learning. To address this, we introduce a Semantic-Siamese Masking Strategy (SMS), which masks semantically similar tokens to facilitate reconstruction by integrating local features of original and similar tokens. This overcomes the dependence on isolated local features and enables robust global semantic modeling. Our pre-trained ZigzagPointMamba weights significantly improve downstream tasks, achieving a 1.59% mIoU gain on ShapeNetPart for part segmentation, a 0.4% higher accuracy on ModelNet40 for classification, and 0.19%, 1.22%, and 0.72% higher accuracies respectively for the classification tasks on the OBJ-BG, OBJ-ONLY, and PB-T50-RS subsets of ScanObjectNN. The code is available at: this https URL

Title: DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models

Authors: Nastaran Saadati, Zhanhong Jiang, Joshua R. Waite, Shreyan Ganguly, Aditya Balu, Chinmay Hegde, Soumik Sarkar
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21382
Pdf URL: https://arxiv.org/pdf/2505.21382
Copy Paste: [[2505.21382]] DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models(https://arxiv.org/abs/2505.21382)
Keywords: foundation model
Abstract: Low-Rank Adaptation (LoRA) has emerged as one of the most effective, computationally tractable fine-tuning approaches for training Vision-Language Models (VLMs) and Large Language Models (LLMs). LoRA accomplishes this by freezing the pre-trained model weights and injecting trainable low-rank matrices, allowing for efficient learning of these foundation models even on edge devices. However, LoRA in decentralized settings still remains under explored, particularly for the theoretical underpinnings due to the lack of smoothness guarantee and model consensus interference (defined formally below). This work improves the convergence rate of decentralized LoRA (DLoRA) to match the rate of decentralized SGD by ensuring gradient smoothness. We also introduce DeCAF, a novel algorithm integrating DLoRA with truncated singular value decomposition (TSVD)-based matrix factorization to resolve consensus interference. Theoretical analysis shows TSVD's approximation error is bounded and consensus differences between DLoRA and DeCAF vanish as rank increases, yielding DeCAF's matching convergence rate. Extensive experiments across vision/language tasks demonstrate our algorithms outperform local training and rivals federated learning under both IID and non-IID data distributions.

Title: Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios

Authors: Xihong Yang, Siwei Wang, Fangdi Wang, Jiaqi Jin, Suyuan Liu, Yue Liu, En Zhu, Xinwang Liu, Yueming Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21387
Pdf URL: https://arxiv.org/pdf/2505.21387
Copy Paste: [[2505.21387]] Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios(https://arxiv.org/abs/2505.21387)
Keywords: anomaly
Abstract: Leveraging the powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance by effectively integrating multi-source information from diverse views in recent years. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noisy identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios. The code of AIRMVC are available at this https URL on Github.

Title: A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective

Authors: Gen Li, Changxiao Cai
Subjects: cs.LG, cs.IT, math.ST, stat.ML
Abstract URL: https://arxiv.org/abs/2505.21400
Pdf URL: https://arxiv.org/pdf/2505.21400
Copy Paste: [[2505.21400]] A Convergence Theory for Diffusion Language Models: An Information-Theoretic Perspective(https://arxiv.org/abs/2505.21400)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as a powerful paradigm for modern generative modeling, demonstrating strong potential for large language models (LLMs). Unlike conventional autoregressive (AR) models that generate tokens sequentially, diffusion models enable parallel token sampling, leading to faster generation and eliminating left-to-right generation constraints. Despite their empirical success, the theoretical understanding of diffusion model approaches remains underdeveloped. In this work, we develop convergence guarantees for diffusion language models from an information-theoretic perspective. Our analysis demonstrates that the sampling error, measured by the Kullback-Leibler (KL) divergence, decays inversely with the number of iterations $T$ and scales linearly with the mutual information between tokens in the target text sequence. In particular, we establish matching upper and lower bounds, up to some constant factor, to demonstrate the tightness of our convergence analysis. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.

Title: Mentor3AD: Feature Reconstruction-based 3D Anomaly Detection via Multi-modality Mentor Learning

Authors: Jinbao Wang, Hanzhe Liang, Can Gao, Chenxi Hu, Jie Zhou, Yunkang Cao, Linlin Shen, Weiming Shen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2505.21420
Pdf URL: https://arxiv.org/pdf/2505.21420
Copy Paste: [[2505.21420]] Mentor3AD: Feature Reconstruction-based 3D Anomaly Detection via Multi-modality Mentor Learning(https://arxiv.org/abs/2505.21420)
Keywords: anomaly
Abstract: Multimodal feature reconstruction is a promising approach for 3D anomaly detection, leveraging the complementary information from dual modalities. We further advance this paradigm by utilizing multi-modal mentor learning, which fuses intermediate features to further distinguish normal from feature differences. To address these challenges, we propose a novel method called Mentor3AD, which utilizes multi-modal mentor learning. By leveraging the shared features of different modalities, Mentor3AD can extract more effective features and guide feature reconstruction, ultimately improving detection performance. Specifically, Mentor3AD includes a Mentor of Fusion Module (MFM) that merges features extracted from RGB and 3D modalities to create a mentor feature. Additionally, we have designed a Mentor of Guidance Module (MGM) to facilitate cross-modal reconstruction, supported by the mentor feature. Lastly, we introduce a Voting Module (VM) to more accurately generate the final anomaly score. Extensive comparative and ablation studies on MVTec 3D-AD and Eyecandies have verified the effectiveness of the proposed method.

Title: Can Large Reasoning Models Self-Train?

Authors: Sheikh Shafayat, Fahim Tajwar, Ruslan Salakhutdinov, Jeff Schneider, Andrea Zanette
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2505.21444
Pdf URL: https://arxiv.org/pdf/2505.21444
Copy Paste: [[2505.21444]] Can Large Reasoning Models Self-Train?(https://arxiv.org/abs/2505.21444)
Keywords: self-supervised
Abstract: Scaling the performance of large language models (LLMs) increasingly depends on methods that reduce reliance on human supervision. Reinforcement learning from automated verification offers an alternative, but it incurs scalability limitations due to dependency upon human-designed verifiers. Self-training, where the model's own judgment provides the supervisory signal, presents a compelling direction. We propose an online self-training reinforcement learning algorithm that leverages the model's self-consistency to infer correctness signals and train without any ground-truth supervision. We apply the algorithm to challenging mathematical reasoning tasks and show that it quickly reaches performance levels rivaling reinforcement-learning methods trained explicitly on gold-standard answers. Additionally, we analyze inherent limitations of the algorithm, highlighting how the self-generated proxy reward initially correlated with correctness can incentivize reward hacking, where confidently incorrect outputs are favored. Our results illustrate how self-supervised improvement can achieve significant performance gains without external labels, while also revealing its fundamental challenges.

Title: OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Authors: Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21448
Pdf URL: https://arxiv.org/pdf/2505.21448
Copy Paste: [[2505.21448]] OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers(https://arxiv.org/abs/2505.21448)
Keywords: diffusion
Abstract: Lip synchronization is the task of aligning a speaker's lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames and masked-frame inpainting, which limit their robustness to identity consistency, pose variations, facial occlusions, and stylized content. In addition, since audio signals provide weaker conditioning than visual cues, lip shape leakage from the original video will affect lip sync quality. In this paper, we present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and preserving character identity. During inference, we propose a flow-matching-based progressive noise initialization to ensure pose and identity consistency, while allowing precise mouth-region editing. To address the weak conditioning signal of audio, we develop a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength over time and space. We also establish the AIGC-LipSync Benchmark, the first evaluation suite for lip synchronization in diverse AI-generated videos. Extensive experiments demonstrate that OmniSync significantly outperforms prior methods in both visual quality and lip sync accuracy, achieving superior results in both real-world and AI-generated videos.

Title: Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling

Authors: Xiangxin Zhou, Mingyu Li, Yi Xiao, Jiahan Li, Dongyu Xue, Zaixiang Zheng, Jianzhu Ma, Quanquan Gu
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2505.21452
Pdf URL: https://arxiv.org/pdf/2505.21452
Copy Paste: [[2505.21452]] Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling(https://arxiv.org/abs/2505.21452)
Keywords: generative
Abstract: Cyclic peptides offer inherent advantages in pharmaceuticals. For example, cyclic peptides are more resistant to enzymatic hydrolysis compared to linear peptides and usually exhibit excellent stability and affinity. Although deep generative models have achieved great success in linear peptide design, several challenges prevent the development of computational methods for designing diverse types of cyclic peptides. These challenges include the scarcity of 3D structural data on target proteins and associated cyclic peptide ligands, the geometric constraints that cyclization imposes, and the involvement of non-canonical amino acids in cyclization. To address the above challenges, we introduce CpSDE, which consists of two key components: AtomSDE, a generative structure prediction model based on harmonic SDE, and ResRouter, a residue type predictor. Utilizing a routed sampling algorithm that alternates between these two models to iteratively update sequences and structures, CpSDE facilitates the generation of cyclic peptides. By employing explicit all-atom and bond modeling, CpSDE overcomes existing data limitations and is proficient in designing a wide variety of cyclic peptides. Our experimental results demonstrate that the cyclic peptides designed by our method exhibit reliable stability and affinity.

Title: M3S-UPD: Efficient Multi-Stage Self-Supervised Learning for Fine-Grained Encrypted Traffic Classification with Unknown Pattern Discovery

Authors: Yali Yuan, Yu Huang, Xingjian Zeng, Hantao Mei, Guang Cheng
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2505.21462
Pdf URL: https://arxiv.org/pdf/2505.21462
Copy Paste: [[2505.21462]] M3S-UPD: Efficient Multi-Stage Self-Supervised Learning for Fine-Grained Encrypted Traffic Classification with Unknown Pattern Discovery(https://arxiv.org/abs/2505.21462)
Keywords: self-supervised
Abstract: The growing complexity of encrypted network traffic presents dual challenges for modern network management: accurate multiclass classification of known applications and reliable detection of unknown traffic patterns. Although deep learning models show promise in controlled environments, their real-world deployment is hindered by data scarcity, concept drift, and operational constraints. This paper proposes M3S-UPD, a novel Multi-Stage Self-Supervised Unknown-aware Packet Detection framework that synergistically integrates semi-supervised learning with representation analysis. Our approach eliminates artificial segregation between classification and detection tasks through a four-phase iterative process: 1) probabilistic embedding generation, 2) clustering-based structure discovery, 3) distribution-aligned outlier identification, and 4) confidence-aware model updating. Key innovations include a self-supervised unknown detection mechanism that requires neither synthetic samples nor prior knowledge, and a continuous learning architecture that is resistant to performance degradation. Experimental results show that M3S-UPD not only outperforms existing methods on the few-shot encrypted traffic classification task, but also simultaneously achieves competitive performance on the zero-shot unknown traffic discovery task.

Title: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Authors: Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.21467
Pdf URL: https://arxiv.org/pdf/2505.21467
Copy Paste: [[2505.21467]] Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion(https://arxiv.org/abs/2505.21467)
Keywords: diffusion
Abstract: Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream 7B, LLaDA 8B) suffer from slow inference. While they match the quality of similarly sized Autoregressive (AR) Models (e.g., Qwen2.5 7B, Llama3 8B), their iterative denoising requires multiple full-sequence forward passes, resulting in high computational costs and latency, particularly for long input prompts and long-context scenarios. Furthermore, parallel token generation introduces token incoherence problems, and current sampling heuristics suffer from significant quality drops with decreasing denoising steps. We address these limitations with two training-free techniques. First, we propose FreeCache, a Key-Value (KV) approximation caching technique that reuses stable KV projections across denoising steps, effectively reducing the computational cost of DLM inference. Second, we introduce Guided Diffusion, a training-free method that uses a lightweight pretrained autoregressive model to supervise token unmasking, dramatically reducing the total number of denoising iterations without sacrificing quality. We conduct extensive evaluations on open-source reasoning benchmarks, and our combined methods deliver up to a 34x end-to-end speedup without compromising accuracy. For the first time, diffusion language models achieve a comparable and even faster latency as the widely adopted autoregressive models. Our work successfully paved the way for scaling up the diffusion language model to a broader scope of applications across different domains.

Title: MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation

Authors: Kerui Ren, Jiayang Bai, Linning Xu, Lihan Jiang, Jiangmiao Pang, Mulin Yu, Bo Dai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21483
Pdf URL: https://arxiv.org/pdf/2505.21483
Copy Paste: [[2505.21483]] MV-CoLight: Efficient Object Compositing with Consistent Lighting and Shadow Generation(https://arxiv.org/abs/2505.21483)
Keywords: diffusion
Abstract: Object compositing offers significant promise for augmented reality (AR) and embodied intelligence applications. Existing approaches predominantly focus on single-image scenarios or intrinsic decomposition techniques, facing challenges with multi-view consistency, complex scenes, and diverse lighting conditions. Recent inverse rendering advancements, such as 3D Gaussian and diffusion-based methods, have enhanced consistency but are limited by scalability, heavy data requirements, or prolonged reconstruction time per scene. To broaden its applicability, we introduce MV-CoLight, a two-stage framework for illumination-consistent object compositing in both 2D images and 3D scenes. Our novel feed-forward architecture models lighting and shadows directly, avoiding the iterative biases of diffusion-based methods. We employ a Hilbert curve-based mapping to align 2D image inputs with 3D Gaussian scene representations seamlessly. To facilitate training and evaluation, we further introduce a large-scale 3D compositing dataset. Experiments demonstrate state-of-the-art harmonized results across standard benchmarks and our dataset, as well as casually captured real-world scenes demonstrate the framework's robustness and wide generalization.

Title: Be Decisive: Noise-Induced Layouts for Multi-Subject Generation

Authors: Omer Dahary, Yehonathan Cohen, Or Patashnik, Kfir Aberman, Daniel Cohen-Or
Subjects: cs.CV, cs.AI, cs.GR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.21488
Pdf URL: https://arxiv.org/pdf/2505.21488
Copy Paste: [[2505.21488]] Be Decisive: Noise-Induced Layouts for Multi-Subject Generation(https://arxiv.org/abs/2505.21488)
Keywords: diffusion
Abstract: Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject's spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model's prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model's prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model's original distribution.

Title: Frame In-N-Out: Unbounded Controllable Image-to-Video Generation

Authors: Boyang Wang, Xuweiyi Chen, Matheus Gadelha, Zezhou Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2505.21491
Pdf URL: https://arxiv.org/pdf/2505.21491
Copy Paste: [[2505.21491]] Frame In-N-Out: Unbounded Controllable Image-to-Video Generation(https://arxiv.org/abs/2505.21491)
Keywords: diffusion
Abstract: Controllability, temporal coherence, and detail synthesis remain the most critical challenges in video generation. In this paper, we focus on a commonly used yet underexplored cinematic technique known as Frame In and Frame Out. Specifically, starting from image-to-video generation, users can control the objects in the image to naturally leave the scene or provide breaking new identity references to enter the scene, guided by user-specified motion trajectory. To support this task, we introduce a new dataset curated semi-automatically, a comprehensive evaluation protocol targeting this setting, and an efficient identity-preserving motion-controllable video Diffusion Transformer architecture. Our evaluation shows that our proposed approach significantly outperforms existing baselines.