2025-02-18

Title: Leveraging Constraint Violation Signals For Action-Constrained Reinforcement Learning

Authors: Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10431
Pdf URL: https://arxiv.org/pdf/2502.10431
Copy Paste: [[2502.10431]] Leveraging Constraint Violation Signals For Action-Constrained Reinforcement Learning(https://arxiv.org/abs/2502.10431)
Keywords: generative
Abstract: In many RL applications, ensuring an agent's actions adhere to constraints is crucial for safety. Most previous methods in Action-Constrained Reinforcement Learning (ACRL) employ a projection layer after the policy network to correct the action. However, projection-based methods suffer from issues like the zero gradient problem and higher runtime due to the usage of optimization solvers. Recently, methods were proposed to train generative models to learn a differentiable mapping between latent variables and feasible actions to address this issue. However, generative models require training using samples from the constrained action space, which itself is challenging. To address such limitations, first, we define a target distribution for feasible actions based on constraint violation signals, and train normalizing flows by minimizing the KL divergence between an approximated distribution over feasible actions and the target. This eliminates the need to generate feasible action samples, greatly simplifying the flow model learning. Second, we integrate the learned flow model with existing deep RL methods, which restricts it to exploring only the feasible action space. Third, we extend our approach beyond ACRL to handle state-wise constraints by learning the constraint violation signal from the environment. Empirically, our approach has significantly fewer constraint violations while achieving similar or better quality in several control tasks than previous best methods.

Title: Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Ownership Verification with Reasoning

Authors: Junfeng Guo, Yiming Li, Ruibo Chen, Yihan Wu, Chenxi Liu, Yanshuo Chen, Heng Huang
Subjects: cs.CR, cs.AI, cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10440
Pdf URL: https://arxiv.org/pdf/2502.10440
Copy Paste: [[2502.10440]] Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Ownership Verification with Reasoning(https://arxiv.org/abs/2502.10440)
Keywords: anomaly
Abstract: Large language models (LLMs) are increasingly integrated into real-world applications through retrieval-augmented generation (RAG) mechanisms to supplement their responses with up-to-date and domain-specific knowledge. However, the valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning attacks. However, these methods require altering the results of verification samples (e.g., generating incorrect outputs), inevitably making them susceptible to anomaly detection and even introducing new security risks. To address these challenges, we propose a method for 'harmless' copyright protection of knowledge bases. Instead of manipulating the LLM's final output, our method implants distinct verification behaviors in the space of chain-of-thought (CoT) reasoning, maintaining the correctness of the final answer. Our method has three main stages: (1) Generating CoTs: For each verification question, we generate two CoTs, including a target CoT for building watermark behaviors; (2) Optimizing Watermark Phrases and Target CoTs: We optimize them to minimize retrieval errors under the black-box setting of the suspicious LLM, ensuring that watermarked verification queries activate the target CoTs while non-watermarked ones do not; (3) Ownership Verification: We exploit a pairwise Wilcoxon test to statistically verify whether a suspicious LLM is augmented with the protected knowledge base by comparing its responses to watermarked and benign verification queries. Our experiments on diverse benchmarks demonstrate that our method effectively protects knowledge bases against unauthorized usage while preserving the integrity and performance of the RAG.
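
Illustrative sketch (not from the paper): the verification stage above relies on a pairwise Wilcoxon test. Assuming this means the paired Wilcoxon signed-rank test, a minimal pure-Python version with a normal approximation could look as follows. The score lists are hypothetical stand-ins for how strongly each response matches the target CoT, and no tie correction is applied to the variance.

```python
import math

def wilcoxon_signed_rank(x, y):
    """One-sided Wilcoxon signed-rank test (normal approximation).

    Tests whether paired values in x tend to be larger than those in y.
    Returns (w_plus, p_value). Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, averaging ranks across ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / std
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return w_plus, p_value

# Hypothetical similarity scores for paired verification queries.
watermarked = [0.9, 0.8, 0.85, 0.7, 0.95, 0.75, 0.88, 0.92]
benign = [0.2, 0.3, 0.25, 0.4, 0.1, 0.35, 0.15, 0.05]
w_plus, p_value = wilcoxon_signed_rank(watermarked, benign)
```

A small p-value would suggest that the suspicious model responds differently to watermarked queries; for small samples an exact test is usually preferable to the normal approximation used here.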

Title: FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation

Authors: Zheng Fang, Lichuan Xiang, Xu Cai, Kaicheng Zhou, Hongkai Wen
Subjects: cs.LG, cs.GR
Abstract URL: https://arxiv.org/abs/2502.10451
Pdf URL: https://arxiv.org/pdf/2502.10451
Copy Paste: [[2502.10451]] FlexControl: Computation-Aware ControlNet with Differentiable Router for Text-to-Image Generation(https://arxiv.org/abs/2502.10451)
Keywords: diffusion, generative
Abstract: ControlNet offers a powerful way to guide diffusion-based generative models, yet most implementations rely on ad-hoc heuristics to choose which network blocks to control, an approach that varies unpredictably with different tasks. To address this gap, we propose FlexControl, a novel framework that copies all diffusion blocks during training and employs a trainable gating mechanism to dynamically select which blocks to activate at each denoising step. By introducing a computation-aware loss, we encourage control blocks to activate only when they benefit generation quality. By eliminating manual block selection, FlexControl enhances adaptability across diverse tasks and streamlines the design pipeline, using a computation-aware training loss in an end-to-end manner. Through comprehensive experiments on both UNet (e.g., SD1.5) and DiT (e.g., SD3.0), we show that our method outperforms existing ControlNet variants in certain key aspects of interest. As evidenced by both quantitative and qualitative evaluations, FlexControl preserves or enhances image fidelity while also reducing computational overhead by selectively activating the most relevant blocks. These results underscore the potential of a flexible, data-driven approach for controlled diffusion and open new avenues for efficient generative model design.

Title: I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models

Authors: Zhenxing Mi, Kuan-Chieh Wang, Guocheng Qian, Hanrong Ye, Runtao Liu, Sergey Tulyakov, Kfir Aberman, Dan Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10458
Pdf URL: https://arxiv.org/pdf/2502.10458
Copy Paste: [[2502.10458]] I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models(https://arxiv.org/abs/2502.10458)
Keywords: diffusion, in-context
Abstract: This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder large language model (LLM) instead of a diffusion decoder. This proxy task builds on the observation that the LLM decoder shares the same input feature space with diffusion decoders that use the corresponding LLM encoder for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images.

Title: LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search

Authors: Yang Gao, Hong Yang, Yizhi Chen, Junxian Wu, Peng Zhang, Haishuai Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10459
Pdf URL: https://arxiv.org/pdf/2502.10459
Copy Paste: [[2502.10459]] LLM4GNAS: A Large Language Model Based Toolkit for Graph Neural Architecture Search(https://arxiv.org/abs/2502.10459)
Keywords: generative
Abstract: Graph Neural Architecture Search (GNAS) facilitates the automatic design of Graph Neural Networks (GNNs) tailored to specific downstream graph learning tasks. However, existing GNAS approaches often require manual adaptation to new graph search spaces, necessitating substantial code optimization and domain-specific knowledge. To address this challenge, we present LLM4GNAS, a toolkit for GNAS that leverages the generative capabilities of Large Language Models (LLMs). LLM4GNAS includes an algorithm library for graph neural architecture search algorithms based on LLMs, enabling the adaptation of GNAS methods to new search spaces through the modification of LLM prompts. This approach reduces the need for manual intervention in algorithm adaptation and code modification. The LLM4GNAS toolkit is extensible and robust, incorporating LLM-enhanced graph feature engineering, LLM-enhanced graph neural architecture search, and LLM-enhanced hyperparameter optimization. Experimental results indicate that LLM4GNAS outperforms existing GNAS methods on tasks involving both homogeneous and heterogeneous graphs.

Title: SinSim: Sinkhorn-Regularized SimCLR

Authors: M.Hadi Sepanj, Paul Fiegth
Subjects: cs.LG, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2502.10478
Pdf URL: https://arxiv.org/pdf/2502.10478
Copy Paste: [[2502.10478]] SinSim: Sinkhorn-Regularized SimCLR(https://arxiv.org/abs/2502.10478)
Keywords: self-supervised
Abstract: Self-supervised learning has revolutionized representation learning by eliminating the need for labeled data. Contrastive learning methods, such as SimCLR, maximize the agreement between augmented views of an image but lack explicit regularization to enforce a globally structured latent space. This limitation often leads to suboptimal generalization. We propose SinSim, a novel extension of SimCLR that integrates Sinkhorn regularization from optimal transport theory to enhance representation structure. The Sinkhorn loss, an entropy-regularized Wasserstein distance, encourages a well-dispersed and geometry-aware feature space, preserving discriminative power. Empirical evaluations on various datasets demonstrate that SinSim outperforms SimCLR and achieves competitive performance against prominent self-supervised methods such as VICReg and Barlow Twins. UMAP visualizations further reveal improved class separability and structured feature distributions. These results indicate that integrating optimal transport regularization into contrastive learning provides a principled and effective mechanism for learning robust, well-structured representations. Our findings open new directions for applying transport-based constraints in self-supervised learning frameworks.
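
Illustrative sketch (not from the paper): the Sinkhorn loss mentioned above is built on entropy-regularized optimal transport. The pure-Python code below runs generic Sinkhorn iterations between two uniform distributions. It is a minimal sketch of the underlying technique, not SinSim's loss; `epsilon`, `n_iters`, and the toy cost matrix are assumptions chosen for illustration.

```python
import math

def sinkhorn_plan(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan between two uniform distributions.

    cost: n x m list of lists of pairwise costs.
    Returns the plan P (n x m); columns sum to 1/m exactly after the
    final update, and rows sum to 1/n approximately.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n
    b = [1.0 / m] * m
    # Gibbs kernel; a very small epsilon can underflow to zero here.
    K = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale to match the row and column marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(cost, epsilon=0.1, n_iters=200):
    """Transport cost <P, C> under the regularized plan (entropy term omitted)."""
    plan = sinkhorn_plan(cost, epsilon, n_iters)
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))

# Toy cost matrix: matching points cost 0, mismatched points cost 1.
toy_cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn_plan(toy_cost)
```

In a contrastive setting the cost matrix would typically come from pairwise distances between embeddings of two augmented views; practical implementations usually work in the log domain for numerical stability.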

Title: SWA-LDM: Toward Stealthy Watermarks for Latent Diffusion Models

Authors: Zhonghao Yang, Linye Lyu, Xuanhang Chang, Daojing He, Yu Li
Subjects: cs.CR, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10495
Pdf URL: https://arxiv.org/pdf/2502.10495
Copy Paste: [[2502.10495]] SWA-LDM: Toward Stealthy Watermarks for Latent Diffusion Models(https://arxiv.org/abs/2502.10495)
Keywords: diffusion
Abstract: In the rapidly evolving landscape of image generation, Latent Diffusion Models (LDMs) have emerged as powerful tools, enabling the creation of highly realistic images. However, this advancement raises significant concerns regarding copyright infringement and the potential misuse of generated content. Current watermarking techniques employed in LDMs often embed constant signals into the generated images, which compromises their stealthiness and makes them vulnerable to detection by malicious attackers. In this paper, we introduce SWA-LDM, a novel approach that enhances watermarking by randomizing the embedding process, effectively eliminating detectable patterns while preserving image quality and robustness. Our proposed watermark presence attack reveals the inherent vulnerabilities of existing latent-based watermarking methods, demonstrating how easily these can be exposed. Through comprehensive experiments, we validate that SWA-LDM not only fortifies watermark stealthiness but also maintains competitive performance in watermark robustness and visual fidelity. This work represents a pivotal step towards securing LDM-generated images against unauthorized use, ensuring both copyright protection and content integrity in an era where digital image authenticity is paramount.

Title: Hallucinations and Truth: A Comprehensive Accuracy Evaluation of RAG, LoRA and DoRA

Authors: Mohammad Baqar, Rajat Khanda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10497
Pdf URL: https://arxiv.org/pdf/2502.10497
Copy Paste: [[2502.10497]] Hallucinations and Truth: A Comprehensive Accuracy Evaluation of RAG, LoRA and DoRA(https://arxiv.org/abs/2502.10497)
Keywords: generative
Abstract: Recent advancements in Generative AI have significantly improved the efficiency and adaptability of natural language processing (NLP) systems, particularly through Retrieval-Augmented Generation (RAG), Low-Rank Adaptation (LoRA), and Weight-Decomposed Low-Rank Adaptation (DoRA). RAG integrates external knowledge to enhance factual consistency in generative outputs, while LoRA enables parameter-efficient fine-tuning of large language models (LLMs). DoRA further refines this process by optimizing fine-tuning through adaptive parameter ranking and domain-aware weight adjustments, improving learning efficiency while maintaining inference performance. This paper presents a large-scale empirical evaluation of RAG, LoRA, and DoRA, with model fine-tuning and generation performance assessed on 20,000 FAQ-based queries, while the knowledge base spans 400,000 entries. The study analyzes key performance metrics such as accuracy, relevance, and inference latency. Experimental results demonstrate that DoRA achieves the highest accuracy (90.1%), relevance score (0.88), and lowest latency (110 ms per query), outperforming both LoRA and RAG in real-world, domain-specific generative AI applications. Furthermore, this study examines the trade-offs between fine-tuning efficiency, computational cost, and real-time adaptability across different models. Findings highlight RAG's effectiveness in knowledge grounding, LoRA's cost-efficient domain adaptation, and DoRA's ability to balance fine-tuning efficiency with model precision. These insights provide practical guidance for deploying AI-driven generative systems in accuracy-critical domains such as healthcare, finance, and legal services, ensuring scalability, reliability, and optimal performance in dynamic environments.

Title: Preference learning made easy: Everything should be understood through win rate

Authors: Lily H. Zhang, Rajesh Ranganath
Subjects: cs.LG, cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2502.10505
Pdf URL: https://arxiv.org/pdf/2502.10505
Copy Paste: [[2502.10505]] Preference learning made easy: Everything should be understood through win rate(https://arxiv.org/abs/2502.10505)
Keywords: generative
Abstract: Preference learning, or the task of aligning generative models to preference comparison data, has yet to reach the conceptual maturity of classification, density estimation, etc. To close this gap, this work presents a framework to understand preference learning starting from the sampling distribution of pairwise preference data. First, we prove that the only evaluation of a generative model that respects both preferences and prevalences in the data distribution is a form of win rate, justifying win rate as the focal point to understand preference learning. We then analyze preference learning methods as win rate optimization (WRO) or non-WRO. We present novel instances of WRO beyond existing examples (RLHF, NLHF) and identify two key theoretical benefits of all such methods. We prove that common non-WRO methods like DPO and SFT on preferred samples lack these properties and suggest ways to mitigate such theoretical limitations. We also show that WRO underperforms in practice due to optimization difficulties and that optimization success predicts performance better than choices which affect the objective's solution. Our analysis highlights best practices for existing methods and provides recommendations for future research, guided by the principle that one should either align non-WRO methods more closely with WRO or improve the optimization of WRO objectives.
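
Illustrative sketch (not from the paper): one common reading of win rate is the probability that a sample from one model is preferred over a sample from a competitor. The code below estimates it over all sample pairs. The `prefers` callback and the numeric toy samples are hypothetical; the paper's exact definition may differ.

```python
def win_rate(samples_a, samples_b, prefers):
    """Average preference for a over b across all (a, b) pairs.

    prefers(a, b) returns the probability in [0, 1] that a is preferred
    to b; returning 0.5 treats the pair as a tie.
    """
    total = 0.0
    for a in samples_a:
        for b in samples_b:
            total += prefers(a, b)
    return total / (len(samples_a) * len(samples_b))

def toy_preference(a, b):
    """Hypothetical preference: larger numeric score wins, equal scores tie."""
    if a > b:
        return 1.0
    if a == b:
        return 0.5
    return 0.0

# Model A's samples beat model B's in 3 of the 4 pairings.
rate = win_rate([3, 5], [1, 4], toy_preference)
```

In practice the preference callback would be a human rater or a learned preference model, and the samples would be conditioned on the same prompt.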

Title: Efficient Hierarchical Contrastive Self-supervising Learning for Time Series Classification via Importance-aware Resolution Selection

Authors: Kevin Garcia, Juan Manuel Perez, Yifeng Gao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10567
Pdf URL: https://arxiv.org/pdf/2502.10567
Copy Paste: [[2502.10567]] Efficient Hierarchical Contrastive Self-supervising Learning for Time Series Classification via Importance-aware Resolution Selection(https://arxiv.org/abs/2502.10567)
Keywords: self-supervised
Abstract: Recently, there has been a significant advancement in designing Self-Supervised Learning (SSL) frameworks for time series data to reduce the dependency on data labels. Among these works, hierarchical contrastive learning-based SSL frameworks, which learn representations by contrasting data embeddings at multiple resolutions, have gained considerable attention. Due to their ability to gather more information, they exhibit better generalization in various downstream tasks. However, when the time series is very long, the computational cost is often significantly higher than that of other SSL frameworks. In this paper, to address this challenge, we propose an efficient way to train hierarchical contrastive learning models. Inspired by the fact that the data embeddings at different resolutions are highly dependent, we introduce an importance-aware resolution selection-based training framework to reduce the computational cost. In the experiments, we demonstrate that the proposed method significantly improves training time while preserving the original model's integrity in extensive time series classification performance evaluations. Our code is publicly available.

Title: Classifier-free Guidance with Adaptive Scaling

Authors: Dawid Malarz, Artur Kasymov, Maciej Zięba, Jacek Tabor, Przemysław Spurek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10574
Pdf URL: https://arxiv.org/pdf/2502.10574
Copy Paste: [[2502.10574]] Classifier-free Guidance with Adaptive Scaling(https://arxiv.org/abs/2502.10574)
Keywords: diffusion
Abstract: Classifier-free guidance (CFG) is an essential mechanism in contemporary text-driven diffusion models. In practice, controlling the impact of guidance exposes a trade-off between the quality of the generated images and their correspondence to the prompt. When we use strong guidance, generated images fit the conditioned text perfectly but at the cost of their quality. Conversely, we can use small guidance to generate high-quality results, but the generated images do not suit our prompt. In this paper, we present $\beta$-CFG ($\beta$-adaptive scaling in Classifier-Free Guidance), which controls the impact of guidance during generation to solve the above trade-off. First, $\beta$-CFG stabilizes the effects of guiding by gradient-based adaptive normalization. Second, $\beta$-CFG uses the family of single-modal ($\beta$-distribution), time-dependent curves to dynamically adapt the trade-off between prompt matching and the quality of samples during the diffusion denoising process. Our model obtained better FID scores, maintaining the text-to-image CLIP similarity scores at a level similar to that of the reference CFG.
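
Illustrative sketch (not from the paper): standard CFG combines unconditional and conditional noise predictions with a guidance scale, and the abstract describes modulating that scale over time with a Beta-distribution curve. The code below shows one way this could look. The schedule `time_dependent_scale` and its parameters are hypothetical, and the paper's gradient-based normalization is not reproduced.

```python
import math

def beta_pdf(t, a, b):
    """Density of the Beta(a, b) distribution at t in (0, 1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * t ** (a - 1) * (1 - t) ** (b - 1)

def guided_noise(eps_uncond, eps_cond, scale):
    """Standard CFG combination: eps_uncond + scale * (eps_cond - eps_uncond).

    With scale = 1 this returns the conditional prediction unchanged.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def time_dependent_scale(t, base_scale, a=2.0, b=2.0):
    """Hypothetical schedule: a base scale modulated by a unimodal Beta curve.

    t is the normalized denoising time in (0, 1).
    """
    return base_scale * beta_pdf(t, a, b)

# Guidance is strongest mid-trajectory for a = b = 2 and fades near the ends.
mid_scale = time_dependent_scale(0.5, base_scale=4.0)
early_scale = time_dependent_scale(0.05, base_scale=4.0)
guided = guided_noise([0.0, 1.0], [1.0, 3.0], 2.0)
```

Note that some codebases parameterize CFG as `(1 + w)` times the conditional term instead; the form above is the one where a scale of 1 means no extra guidance.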

Title: Towards Self-Supervised Covariance Estimation in Deep Heteroscedastic Regression

Authors: Megh Shukla, Aziz Shameem, Mathieu Salzmann, Alexandre Alahi
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.10587
Pdf URL: https://arxiv.org/pdf/2502.10587
Copy Paste: [[2502.10587]] Towards Self-Supervised Covariance Estimation in Deep Heteroscedastic Regression(https://arxiv.org/abs/2502.10587)
Keywords: self-supervised
Abstract: Deep heteroscedastic regression models the mean and covariance of the target distribution through neural networks. The challenge arises from heteroscedasticity, which implies that the covariance is sample dependent and is often unknown. Consequently, recent methods learn the covariance through unsupervised frameworks, which unfortunately yield a trade-off between computational complexity and accuracy. While this trade-off could be alleviated through supervision, obtaining labels for the covariance is non-trivial. Here, we study self-supervised covariance estimation in deep heteroscedastic regression. We address two questions: (1) How should we supervise the covariance assuming ground truth is available? (2) How can we obtain pseudo labels in the absence of the ground-truth? We address (1) by analysing two popular measures: the KL Divergence and the 2-Wasserstein distance. Subsequently, we derive an upper bound on the 2-Wasserstein distance between normal distributions with non-commutative covariances that is stable to optimize. We address (2) through a simple neighborhood based heuristic algorithm which results in surprisingly effective pseudo labels for the covariance. Our experiments over a wide range of synthetic and real datasets demonstrate that the proposed 2-Wasserstein bound coupled with pseudo label annotations results in a computationally cheaper yet accurate deep heteroscedastic regression.
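
Illustrative sketch (not from the paper): the two measures analysed above have standard closed forms for Gaussians. The code below covers only the diagonal-covariance case, where the covariances commute; the paper's upper bound for non-commutative covariances is not reproduced here. Function names are assumptions for illustration.

```python
import math

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL(N(mu1, diag(var1)) || N(mu2, diag(var2))) for diagonal covariances."""
    return 0.5 * sum(
        v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + math.log(v2 / v1)
        for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)
    )

def w2_squared_diag_gaussians(mu1, var1, mu2, var2):
    """Squared 2-Wasserstein distance between diagonal Gaussians.

    For commuting covariances this reduces to the squared distance between
    means plus the squared distance between standard deviations.
    """
    return sum(
        (m1 - m2) ** 2 + (math.sqrt(v1) - math.sqrt(v2)) ** 2
        for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)
    )

# One-dimensional example: N(0, 1) versus N(1, 4).
kl = kl_diag_gaussians([0.0], [1.0], [1.0], [4.0])
w2 = w2_squared_diag_gaussians([0.0], [1.0], [1.0], [4.0])
```

The KL term contains a logarithm and a division by the predicted variance, which can make it sensitive to small variances, whereas the diagonal 2-Wasserstein form stays polynomial in the standard deviations.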

Title: Post-training an LLM for RAG? Train on Self-Generated Demonstrations

Authors: Matthew Finlayson, Ilia Kulikov, Daniel M. Bikel, Barlas Oguz, Xilun Chen, Aasish Pappu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.10596
Pdf URL: https://arxiv.org/pdf/2502.10596
Copy Paste: [[2502.10596]] Post-training an LLM for RAG? Train on Self-Generated Demonstrations(https://arxiv.org/abs/2502.10596)
Keywords: in-context
Abstract: Large language models (LLMs) often struggle with knowledge intensive NLP tasks, such as answering "Who won the latest World Cup?" because the knowledge they learn during training may be insufficient or outdated. Conditioning generation on retrieved documents -- a technique known as retrieval augmented generation (RAG) -- mitigates these shortcomings by allowing the model to leverage in-context information. Practitioners can improve LLM RAG performance by fine-tuning on retrieval-augmented instructions, but must beware that this can cause undesirable model behaviors like hallucinations. We attribute this degradation to the fact that the training data is likely to be out-of-distribution for the model and may suffer from quality issues, such as misalignment between retrievals and target responses (since retrievals are frequently added post-hoc). We propose a recipe for training RAG-enabled LLMs using self-generated demonstrations, thereby avoiding training on out-of-distribution text and integrating retrievals into the LLM responses. We evaluate our method on knowledge intensive question answering (QA) tasks and show that our method teaches LLMs to properly handle in-context retrievals and abstain from questions they will likely get wrong. Compared to conventional retrieval-augmented instruction tuning (RA-IT) methods, our method prevents model degradation in non-RAG settings while exhibiting superior QA performance.

Title: Federated Learning-Driven Cybersecurity Framework for IoT Networks with Privacy-Preserving and Real-Time Threat Detection Capabilities

Authors: Milad Rahmati
Subjects: cs.CR, cs.LG, cs.NI
Abstract URL: https://arxiv.org/abs/2502.10599
Pdf URL: https://arxiv.org/pdf/2502.10599
Copy Paste: [[2502.10599]] Federated Learning-Driven Cybersecurity Framework for IoT Networks with Privacy-Preserving and Real-Time Threat Detection Capabilities(https://arxiv.org/abs/2502.10599)
Keywords: anomaly
Abstract: The rapid expansion of the Internet of Things (IoT) ecosystem has transformed various sectors but has also introduced significant cybersecurity challenges. Traditional centralized security methods often struggle to balance privacy preservation and real-time threat detection in IoT networks. To address these issues, this study proposes a Federated Learning-Driven Cybersecurity Framework designed specifically for IoT environments. The framework enables decentralized data processing by training models locally on edge devices, ensuring data privacy. Secure aggregation of these locally trained models is achieved using homomorphic encryption, allowing collaborative learning without exposing sensitive information. The proposed framework utilizes recurrent neural networks (RNNs) for anomaly detection, optimized for resource-constrained IoT networks. Experimental results demonstrate that the system effectively detects complex cyber threats, including distributed denial-of-service (DDoS) attacks, with over 98% accuracy. Additionally, it improves energy efficiency by reducing resource consumption by 20% compared to centralized approaches. This research addresses critical gaps in IoT cybersecurity by integrating federated learning with advanced threat detection techniques. The framework offers a scalable and privacy-preserving solution adaptable to various IoT applications. Future work will explore the integration of blockchain for transparent model aggregation and quantum-resistant cryptographic methods to further enhance security in evolving technological landscapes.
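
Illustrative sketch (not from the paper): the aggregation step in federated learning is commonly a data-size-weighted average of locally trained parameters (often called FedAvg). The code below shows that plain averaging only. The homomorphic encryption described in the abstract is deliberately omitted, and the flat parameter lists and client sizes are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors.

    client_weights: list of equal-length parameter lists, one per client.
    client_sizes: number of local training examples per client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]

# Two hypothetical edge devices; the second holds three times as much data.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only model parameters leave each device in this scheme; with secure aggregation the server would additionally see only an encrypted or masked sum.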

Title: HIPPo: Harnessing Image-to-3D Priors for Model-free Zero-shot 6D Pose Estimation

Authors: Yibo Liu, Zhaodong Jiang, Binbin Xu, Guile Wu, Yuan Ren, Tongtong Cao, Bingbing Liu, Rui Heng Yang, Amir Rasouli, Jinjun Shan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2502.10606
Pdf URL: https://arxiv.org/pdf/2502.10606
Copy Paste: [[2502.10606]] HIPPo: Harnessing Image-to-3D Priors for Model-free Zero-shot 6D Pose Estimation(https://arxiv.org/abs/2502.10606)
Keywords: diffusion, foundation model
Abstract: This work focuses on model-free zero-shot 6D object pose estimation for robotics applications. While existing methods can estimate the precise 6D pose of objects, they heavily rely on curated CAD models or reference images, the preparation of which is a time-consuming and labor-intensive process. Moreover, in real-world scenarios, 3D models or reference images may not be available in advance and instant robot reaction is desired. In this work, we propose a novel framework named HIPPo, which eliminates the need for curated CAD models and reference images by harnessing image-to-3D priors from Diffusion Models, enabling model-free zero-shot 6D pose estimation. Specifically, we construct HIPPo Dreamer, a rapid image-to-mesh model built on a multiview Diffusion Model and a 3D reconstruction foundation model. Our HIPPo Dreamer can generate a 3D mesh of any unseen object from a single glance in just a few seconds. Then, as more observations are acquired, we propose to continuously refine the diffusion prior mesh model by joint optimization of object geometry and appearance. This is achieved by a measurement-guided scheme that gradually replaces the plausible diffusion priors with more reliable online observations. Consequently, HIPPo can instantly estimate and track the 6D pose of a novel object and maintain a complete mesh for immediate robotic applications. Thorough experiments on various benchmarks show that HIPPo outperforms state-of-the-art methods in 6D object pose estimation when prior reference images are limited.

Title: Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a "Passage"

Authors: Hao Sun, Chenming Tang, Gengyang Li, Yunfang Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10634
Pdf URL: https://arxiv.org/pdf/2502.10634
Copy Paste: [[2502.10634]] Lost in the Passage: Passage-level In-context Learning Does Not Necessarily Need a "Passage"(https://arxiv.org/abs/2502.10634)
Keywords: in-context
Abstract: By simply incorporating demonstrations into the context, in-context learning (ICL) enables large language models (LLMs) to yield awesome performance on many tasks. In this paper, we focus on passage-level long-context ICL for generation tasks and find that LLMs cannot learn the intrinsic relationships between the demonstration passage and the generation output. We conduct experiments with different LLMs on two typical generation tasks including single-document QA and distractor generation, demonstrating that even a completely meaningless demonstration passage with 1/4 length achieves much better performance than the original full passage. Analysis via attention score reveals that LLMs pay little attention to passages compared to other components in prompt and little attention flows from the passage to other parts of the demonstration, which further confirms our finding. Additionally, experiments on context compression indicate that compression approaches proven effective on other long-context tasks are not suitable for passage-level ICL, since simply using shorter meaningless demonstration passages has achieved competitive performance.

Title: Is Self-Supervised Pre-training on Satellite Imagery Better than ImageNet? A Systematic Study with Sentinel-2

Authors: Saad Lahrichi, Zion Sheng, Shufan Xia, Kyle Bradbury, Jordan Malof
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10669
Pdf URL: https://arxiv.org/pdf/2502.10669
Copy Paste: [[2502.10669]] Is Self-Supervised Pre-training on Satellite Imagery Better than ImageNet? A Systematic Study with Sentinel-2(https://arxiv.org/abs/2502.10669)
Keywords: self-supervised
Abstract: Self-supervised learning (SSL) has demonstrated significant potential in pre-training robust models with limited labeled data, making it particularly valuable for remote sensing (RS) tasks. A common assumption is that pre-training on domain-aligned data provides maximal benefits on downstream tasks, particularly when compared to ImageNet-pretraining (INP). In this work, we investigate this assumption by collecting GeoNet, a large and diverse dataset of global optical Sentinel-2 imagery, and pre-training SwAV and MAE on both GeoNet and ImageNet. Evaluating these models on six downstream tasks in the few-shot setting reveals that SSL pre-training on RS data offers only modest performance improvements over INP, and that INP remains competitive in multiple scenarios. This indicates that the presumed benefits of SSL pre-training on RS data may be overstated, and the additional costs of data curation and pre-training could be unjustified.

Title: Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model

Authors: Jiarui Jin, Haoyu Wang, Hongyan Li, Jun Li, Jiahui Pan, Shenda Hong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10707
Pdf URL: https://arxiv.org/pdf/2502.10707
Copy Paste: [[2502.10707]] Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model(https://arxiv.org/abs/2502.10707)
Keywords: self-supervised
Abstract: Electrocardiogram (ECG) is essential for the clinical diagnosis of arrhythmias and other heart diseases, but deep learning methods based on ECG often face limitations due to the need for high-quality annotations. Although previous ECG self-supervised learning (eSSL) methods have made significant progress in representation learning from unannotated ECG data, they typically treat ECG signals as ordinary time-series data, segmenting the signals using fixed-size and fixed-step time windows, which often ignore the form and rhythm characteristics and latent semantic relationships in ECG signals. In this work, we introduce a novel perspective on ECG signals, treating heartbeats as words and rhythms as sentences. Based on this perspective, we first designed the QRS-Tokenizer, which generates semantically meaningful ECG sentences from the raw ECG signals. Building on these, we then propose HeartLang, a novel self-supervised learning framework for ECG language processing, learning general representations at form and rhythm levels. Additionally, we construct the largest heartbeat-based ECG vocabulary to date, which will further advance the development of ECG language processing. We evaluated HeartLang across six public ECG datasets, where it demonstrated robust competitiveness against other eSSL methods. Our data and code are publicly available.

Title: A Computational Model for Ransomware Detection Using Cross-Domain Entropy Signatures

Authors: Michael Mannon, Evan Statham, Quentin Featherstone, Sebastian Arkwright, Clive Fenwick, Gareth Willoughby
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.10711
Pdf URL: https://arxiv.org/pdf/2502.10711
Copy Paste: [[2502.10711]] A Computational Model for Ransomware Detection Using Cross-Domain Entropy Signatures(https://arxiv.org/abs/2502.10711)
Keywords: anomaly
Abstract: Detecting encryption-driven cyber threats remains a large challenge due to the evolving techniques employed to evade traditional detection mechanisms. An entropy-based computational framework was introduced to analyze multi-domain system variations, enabling the identification of malicious encryption behaviors through entropy deviations. By integrating entropy patterns across file operations, memory allocations, and network transmissions, a detection methodology was developed to differentiate between benign and ransomware-induced entropy shifts. A mathematical model was formulated to quantify entropy dynamics, incorporating time-dependent variations and weighted domain contributions to enhance anomaly detection. Experimental evaluations demonstrated that the proposed approach achieved high accuracy across diverse ransomware families while maintaining low false positive rates. Computational efficiency analysis indicated minimal processing overhead, suggesting feasibility for real-time implementation in security-sensitive environments. The study highlighted entropy fluctuations as a useful indicator for identifying malicious encryption processes, reinforcing entropy-driven methodologies as a viable component of cybersecurity strategies.
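
The entropy signal described in this abstract can be illustrated with a minimal sketch. The abstract does not state the model's formula, so the per-domain Shannon entropy, the domain weights, and the function names below are illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def weighted_entropy_score(samples: dict, weights: dict) -> float:
    """Weighted average of per-domain entropies.

    `samples` maps a domain name (e.g. "file", "memory", "network") to a
    byte buffer observed in that domain; `weights` holds hypothetical
    per-domain weights chosen by the user.
    """
    weight_sum = sum(weights[name] for name in samples)
    weighted = sum(weights[name] * shannon_entropy(buf) for name, buf in samples.items())
    return weighted / weight_sum
```

Encrypted output tends toward the 8 bits-per-byte ceiling, so a sustained rise of such a score across domains is the kind of deviation an entropy-based detector would look for.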

Title: FuncGenFoil: Airfoil Generation and Editing Model in Function Space

Authors: Jinouwen Zhang, Junjie Ren, Aobo Yang, Yan Lu, Lu Chen, Hairun Xie, Jing Wang, Miao Zhang, Wanli Ouyang, Shixiang Tang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10712
Pdf URL: https://arxiv.org/pdf/2502.10712
Copy Paste: [[2502.10712]] FuncGenFoil: Airfoil Generation and Editing Model in Function Space(https://arxiv.org/abs/2502.10712)
Keywords: generative
Abstract: Aircraft manufacturing is the jewel in the crown of industry, among which generating high-fidelity airfoil geometries with controllable and editable representations remains a fundamental challenge. While existing deep-learning-based methods rely on predefined parametric function families, e.g., Bézier curves and discrete point-based representations, they suffer from inherent trade-offs between expressiveness and resolution flexibility. To tackle this challenge, we introduce FuncGenFoil, a novel function-space generative model that directly learns functional airfoil geometries. Our method inherits both the advantages of arbitrary resolution sampling and the smoothness of parametric functions, as well as the strong expressiveness of discrete point-based functions. Empirical evaluations on the AFBench dataset demonstrate that FuncGenFoil improves upon state-of-the-art methods in airfoil generation by achieving a relative -74.4 label error reduction and +23.2 diversity increase on the AF-200K dataset. Our results highlight the advantages of function-space modeling for aerodynamic shape optimization, offering a powerful and flexible framework for high-fidelity airfoil design. Our code will be released.

Title: Disentangle Nighttime Lens Flares: Self-supervised Generation-based Lens Flare Removal

Authors: Yuwen He, Wei Wang, Wanyu Wang, Kui Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10714
Pdf URL: https://arxiv.org/pdf/2502.10714
Copy Paste: [[2502.10714]] Disentangle Nighttime Lens Flares: Self-supervised Generation-based Lens Flare Removal(https://arxiv.org/abs/2502.10714)
Keywords: self-supervised
Abstract: Lens flares arise from light reflection and refraction within sensor arrays, whose diverse types include glow, veiling glare, reflective flare and so on. Existing methods are specialized for one specific type only, and overlook the simultaneous occurrence of multiple types of lens flares, which is common in the real world, e.g. the coexistence of glow and displacement reflections from the same light source. These co-occurring lens flares cannot be effectively resolved by simply combining individual flare removal methods, since the coexisting flares originate from the same light source, are generated simultaneously within the same sensor array, and exhibit a complex interdependence rather than a simple additive relation. To model this interdependent flare relationship, our Nighttime Lens Flare Formation model is the first attempt to learn the intrinsic physical relationship between flares on the imaging plane. Building on this physical model, we introduce a solution to this joint flare removal task named Self-supervised Generation-based Lens Flare Removal Network (SGLFR-Net), which is self-supervised without pre-training. Specifically, the nighttime glow is disentangled in the PSF Rendering Network (PSFR-Net) based on the PSF Rendering Prior, while the reflective flare is modelled in the Texture Prior Based Reflection Flare Removal Network (TPRR-Net). Empirical evaluations demonstrate the effectiveness of the proposed method in both joint and individual glare removal tasks.

Title: BASE-SQL: A powerful open source Text-To-SQL baseline approach

Authors: Lei Sheng, Shuai-Shuai Xu, Wei Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10739
Pdf URL: https://arxiv.org/pdf/2502.10739
Copy Paste: [[2502.10739]] BASE-SQL: A powerful open source Text-To-SQL baseline approach(https://arxiv.org/abs/2502.10739)
Keywords: in-context
Abstract: The conversion of natural language into SQL language for querying databases (Text-to-SQL) has broad application prospects and has attracted widespread attention. At present, the mainstream Text-to-SQL methods are mainly divided into in-context learning (ICL) based methods and supervised fine-tuning (SFT) based methods. ICL-based methods can achieve relatively good results thanks to the use of the most advanced closed-source models. However, in real-world application scenarios, factors such as data privacy, SQL generation efficiency and cost need to be considered, and here SFT-based methods have certain advantages. At present, methods based on fine-tuning of open source models lack easy-to-implement and effective (cost-effective) baseline methods. We propose a pipeline-based method using open source model fine-tuning, referred to as BASE-SQL, which includes four components: Schema Linking, Candidate SQL Generate, SQL Revision and SQL Merge Revision. Experimental results show that BASE-SQL, using the open source model Qwen2.5-Coder-32B-Instruct, achieves an accuracy of 67.47% on the BIRD development set and 88.9% on the Spider test set, which is significantly better than other methods using open source models, and even exceeds several methods using the GPT-4o closed-source model. At the same time, BASE-SQL is easy to implement and highly efficient (on average, only five calls to the large language model are required to generate SQL once). The code will be open sourced at this https URL.

Title: Preconditioned Inexact Stochastic ADMM for Deep Model

Authors: Shenglong Zhou, Ouya Wang, Ziyan Luo, Yongxu Zhu, Geoffrey Ye Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.10784
Pdf URL: https://arxiv.org/pdf/2502.10784
Copy Paste: [[2502.10784]] Preconditioned Inexact Stochastic ADMM for Deep Model(https://arxiv.org/abs/2502.10784)
Keywords: foundation model, generative
Abstract: The recent advancement of foundation models (FMs) has brought about a paradigm shift, revolutionizing various sectors worldwide. The popular optimizers used to train these models are stochastic gradient descent-based algorithms, which face inherent limitations, such as slow convergence and stringent assumptions for convergence. In particular, data heterogeneity arising from distributed settings poses significant challenges to their theoretical and numerical performance. This paper develops an algorithm, PISA (Preconditioned Inexact Stochastic Alternating Direction Method of Multipliers), which enables scalable parallel computing and supports various second-moment schemes. Grounded in rigorous theoretical guarantees, the algorithm converges under the sole assumption of Lipschitz continuity of the gradient, thereby removing the need for other conditions commonly imposed by stochastic methods. This capability enables PISA to tackle the challenge of data heterogeneity effectively. Comprehensive experimental evaluations for training or fine-tuning diverse FMs, including vision models, large language models, reinforcement learning models, generative adversarial networks, and recurrent neural networks, demonstrate its superior numerical performance compared to various state-of-the-art optimizers.

Title: Epidemic-guided deep learning for spatiotemporal forecasting of Tuberculosis outbreak

Authors: Madhab Barman, Madhurima Panja, Nachiketa Mishra, Tanujit Chakraborty
Subjects: cs.LG, q-bio.QM, stat.ML
Abstract URL: https://arxiv.org/abs/2502.10786
Pdf URL: https://arxiv.org/pdf/2502.10786
Copy Paste: [[2502.10786]] Epidemic-guided deep learning for spatiotemporal forecasting of Tuberculosis outbreak(https://arxiv.org/abs/2502.10786)
Keywords: diffusion
Abstract: Tuberculosis (TB) remains a formidable global health challenge, driven by complex spatiotemporal transmission dynamics and influenced by factors such as population mobility and behavioral changes. We propose an Epidemic-Guided Deep Learning (EGDL) approach that fuses mechanistic epidemiological principles with advanced deep learning techniques to enhance early warning systems and intervention strategies for TB outbreaks. Our framework is built upon a networked Susceptible-Infectious-Recovered (SIR) model augmented with a saturated incidence rate and graph Laplacian diffusion, capturing both long-term transmission dynamics and region-specific population mobility patterns. Compartmental model parameters are rigorously estimated using Bayesian inference via the Markov Chain Monte Carlo (MCMC) approach. Theoretical analysis leveraging the comparison principle and Green's formula establishes global stability properties of the disease-free and endemic equilibria. Building on these epidemiological insights, we design two forecasting architectures, EGDL-Parallel and EGDL-Series, that integrate the mechanistic outputs of the networked SIR model within deep neural networks. This integration mitigates the overfitting risks commonly encountered in data-driven methods and filters out noise inherent in surveillance data, resulting in reliable forecasts of real-world epidemic trends. Experiments conducted on TB incidence data from 47 prefectures in Japan demonstrate that our approach delivers robust and accurate predictions across multiple time horizons (short to medium-term forecasts). Additionally, incorporating uncertainty quantification through conformal prediction enhances the model's practical utility for guiding targeted public health interventions.
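
To make the mechanistic component concrete, here is a minimal sketch of one explicit Euler step of a networked SIR model with a saturated incidence rate and graph Laplacian diffusion. The abstract does not give the equations, so the incidence form beta*S*I/(1 + alpha*I), the single shared diffusion coefficient, and all names below are assumptions for illustration only.

```python
def laplacian(adj):
    """Graph Laplacian L = D - A for an adjacency matrix given as nested lists."""
    n = len(adj)
    return [
        [(sum(adj[i]) - adj[i][i] if i == j else -adj[i][j]) for j in range(n)]
        for i in range(n)
    ]


def sir_step(S, I, R, adj, beta, alpha, gamma, diff, dt):
    """One Euler step; S, I, R are per-region lists of compartment sizes."""
    L = laplacian(adj)
    n = len(S)

    def mix(x, i):
        # Laplacian diffusion term: net inflow into region i from its neighbours.
        return -diff * sum(L[i][j] * x[j] for j in range(n))

    new_S, new_I, new_R = [], [], []
    for i in range(n):
        infect = beta * S[i] * I[i] / (1.0 + alpha * I[i])  # saturated incidence
        recover = gamma * I[i]
        new_S.append(S[i] + dt * (-infect + mix(S, i)))
        new_I.append(I[i] + dt * (infect - recover + mix(I, i)))
        new_R.append(R[i] + dt * (recover + mix(R, i)))
    return new_S, new_I, new_R
```

With a symmetric adjacency matrix the diffusion terms only move population between regions, so the total across all compartments is preserved by each step.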

Title: PDA: Generalizable Detection of AI-Generated Images via Post-hoc Distribution Alignment

Authors: Li Wang, Wenyu Chen, Zheng Li, Shanqing Guo
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2502.10803
Pdf URL: https://arxiv.org/pdf/2502.10803
Copy Paste: [[2502.10803]] PDA: Generalizable Detection of AI-Generated Images via Post-hoc Distribution Alignment(https://arxiv.org/abs/2502.10803)
Keywords: diffusion, generative
Abstract: The rapid advancement of generative models has led to the proliferation of highly realistic AI-generated images, posing significant challenges for detection methods to generalize across diverse and evolving generative techniques. Existing approaches often fail to adapt to unknown models without costly retraining, limiting their practicability. To fill this gap, we propose Post-hoc Distribution Alignment (PDA), a novel approach for the generalizable detection of AI-generated images. The key idea is to use the known generative model to regenerate undifferentiated test images. This process aligns the distributions of the re-generated real images with the known fake images, enabling effective distinction from unknown fake images. PDA employs a two-step detection framework: 1) evaluating whether a test image aligns with the known fake distribution based on deep k-nearest neighbor (KNN) distance, and 2) re-generating test images using known generative models to create pseudo-fake images for further classification. This alignment strategy allows PDA to effectively detect fake images without relying on unseen data or requiring retraining. Extensive experiments demonstrate the superiority of PDA, achieving 96.73% average accuracy across six state-of-the-art generative models, including GANs, diffusion models, and text-to-image models, and improving by 16.07% over the best baseline. Through t-SNE visualizations and KNN distance analysis, we provide insights into PDA's effectiveness in separating real and fake images. Our work provides a flexible and effective solution for real-world fake image detection, advancing the generalization ability of detection systems.
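
The first detection step, a KNN distance check against the known fake distribution, can be sketched as follows. This is a minimal illustration that assumes the score is the Euclidean distance to the k-th nearest reference feature and that a fixed threshold is used; the feature extractor, the threshold, and the function names are hypothetical and not taken from the paper.

```python
import math


def knn_distance(query, reference, k):
    """Distance from `query` to its k-th nearest neighbour in `reference`."""
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the number of reference points")
    dists = sorted(math.dist(query, ref) for ref in reference)
    return dists[k - 1]


def aligns_with_known_fake(query, known_fake_features, k, threshold):
    """True when the query feature lies close to the known-fake feature set."""
    return knn_distance(query, known_fake_features, k) <= threshold
```

In practice the feature vectors would come from a deep network, and images that fail this check would go on to the re-generation step described in the abstract.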

Title: HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model

Authors: Mingqian Ma, Guoqing Liu, Chuan Cao, Pan Deng, Tri Dao, Albert Gu, Peiran Jin, Zhao Yang, Yingce Xia, Renqian Luo, Pipi Hu, Zun Wang, Yuan-Jyue Chen, Haiguang Liu, Tao Qin
Subjects: cs.LG, cs.AI, q-bio.GN
Abstract URL: https://arxiv.org/abs/2502.10807
Pdf URL: https://arxiv.org/pdf/2502.10807
Copy Paste: [[2502.10807]] HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model(https://arxiv.org/abs/2502.10807)
Keywords: generative
Abstract: Advances in natural language processing and large language models have sparked growing interest in modeling DNA, often referred to as the "language of life". However, DNA modeling poses unique challenges. First, it requires the ability to process ultra-long DNA sequences while preserving single-nucleotide resolution, as individual nucleotides play a critical role in DNA function. Second, success in this domain requires excelling at both generative and understanding tasks: generative tasks hold potential for therapeutic and industrial applications, while understanding tasks provide crucial insights into biological mechanisms and diseases. To address these challenges, we propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture, seamlessly integrating the strengths of attention mechanisms with selective state-space models. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks, and demonstrates exceptional capability in generating synthetic cis-regulatory elements (CREs) with desired properties. Furthermore, we show that HybriDNA adheres to expected scaling laws, with performance improving consistently as the model scales from 300M to 3B and 7B parameters. These findings underscore HybriDNA's versatility and its potential to advance DNA research and applications, paving the way for innovations in understanding and engineering the "language of life".

Title: BalanceBenchmark: A Survey for Imbalanced Learning

Authors: Shaoxuan Xu, Menglu Cui, Chengxiang Huang, Hongfa Wang, Di Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10816
Pdf URL: https://arxiv.org/pdf/2502.10816
Copy Paste: [[2502.10816]] BalanceBenchmark: A Survey for Imbalanced Learning(https://arxiv.org/abs/2502.10816)
Keywords: foundation model
Abstract: Multimodal learning has gained attention for its capacity to integrate information from different modalities. However, it is often hindered by the multimodal imbalance problem, where certain modalities dominate while others remain underutilized. Although recent studies have proposed various methods to alleviate this problem, they lack comprehensive and fair comparisons. In this paper, we systematically categorize various mainstream multimodal imbalance algorithms into four groups based on the strategies they employ to mitigate imbalance. To facilitate a comprehensive evaluation of these methods, we introduce BalanceBenchmark, a benchmark including multiple widely used multidimensional datasets and evaluation metrics from three perspectives: performance, imbalance degree, and complexity. To ensure fair comparisons, we have developed a modular and extensible toolkit that standardizes the experimental workflow across different methods. Based on the experiments using BalanceBenchmark, we have identified several key insights into the characteristics and advantages of different method groups in terms of performance, balance degree and computational complexity. We expect such analysis could inspire more efficient approaches to address the imbalance problem in the future, as well as foundation models. The code of the toolkit is available at this https URL.

Title: The Vendiscope: An Algorithmic Microscope For Data Collections

Authors: Amey P. Pasarkar, Adji Bousso Dieng
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, q-bio.QM
Abstract URL: https://arxiv.org/abs/2502.10828
Pdf URL: https://arxiv.org/pdf/2502.10828
Copy Paste: [[2502.10828]] The Vendiscope: An Algorithmic Microscope For Data Collections(https://arxiv.org/abs/2502.10828)
Keywords: generative
Abstract: The evolution of microscopy, beginning with its invention in the late 16th century, has continuously enhanced our ability to explore and understand the microscopic world, enabling increasingly detailed observations of structures and phenomena. In parallel, the rise of data-driven science has underscored the need for sophisticated methods to explore and understand the composition of complex data collections. This paper introduces the Vendiscope, the first algorithmic microscope designed to extend traditional microscopy to computational analysis. The Vendiscope leverages the Vendi scores -- a family of differentiable diversity metrics rooted in ecology and quantum mechanics -- and assigns weights to data points based on their contribution to the overall diversity of the collection. These weights enable high-resolution data analysis at scale. We demonstrate this across biology, materials science, and machine learning (ML). We analyzed the 250 million protein sequences in the protein universe, discovering that over 200 million are near-duplicates and that AlphaFold fails on proteins with Gene Ontology (GO) functions that contribute most to diversity. Applying the Vendiscope to the Materials Project database led to similar findings: more than 85% of the crystals with formation energy data are near-duplicates and ML models perform poorly on materials that enhance diversity. Additionally, the Vendiscope can be used to study phenomena such as memorization in generative models. We used the Vendiscope to identify memorized training samples from 13 different generative models and found that the best-performing ones often memorize the training samples that contribute least to diversity. Our findings demonstrate that the Vendiscope can serve as a powerful tool for data-driven science.

Title: SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers

Authors: Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, Mingyuan Fan, Guibin Chen, Xiang Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.10841
Pdf URL: https://arxiv.org/pdf/2502.10841
Copy Paste: [[2502.10841]] SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers(https://arxiv.org/abs/2502.10841)
Keywords: diffusion, generative
Abstract: We present SkyReels-A1, a simple yet effective framework built upon video diffusion Transformer to facilitate portrait image animation. Existing methodologies still encounter issues, including identity distortion, background instability, and unrealistic facial dynamics, particularly in head-only animation scenarios. Besides, extending to accommodate diverse body proportions usually leads to visual inconsistencies or unnatural articulations. To address these challenges, SkyReels-A1 capitalizes on the strong generative capabilities of video DiT, enhancing facial motion transfer precision, identity retention, and temporal coherence. The system incorporates an expression-aware conditioning module that enables seamless video synthesis driven by expression-guided landmark inputs. Integrating the facial image-text alignment module strengthens the fusion of facial attributes with motion trajectories, reinforcing identity preservation. Additionally, SkyReels-A1 incorporates a multi-stage training paradigm to incrementally refine the correlation between expressions and motion while ensuring stable identity reproduction. Extensive empirical evaluations highlight the model's ability to produce visually coherent and compositionally diverse results, making it highly applicable to domains such as virtual avatars, remote communication, and digital media generation.

Title: Do Deepfake Detectors Work in Reality?

Authors: Simiao Ren, Hengwei Xu, Tsang Ng, Kidus Zewde, Shengkai Jiang, Ramini Desai, Disha Patil, Ning-Yau Cheng, Yining Zhou, Ragavi Muthukrishnan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.10920
Pdf URL: https://arxiv.org/pdf/2502.10920
Copy Paste: [[2502.10920]] Do Deepfake Detectors Work in Reality?(https://arxiv.org/abs/2502.10920)
Keywords: generative
Abstract: Deepfakes, particularly those involving faceswap-based manipulations, have sparked significant societal concern due to their increasing realism and potential for misuse. Despite rapid advancements in generative models, detection methods have not kept pace, creating a critical gap in defense strategies. This disparity is further amplified by the disconnect between academic research and real-world applications, which often prioritize different objectives and evaluation criteria. In this study, we take a pivotal step toward bridging this gap by presenting a novel observation: the post-processing step of super-resolution, commonly employed in real-world scenarios, substantially undermines the effectiveness of existing deepfake detection methods. To substantiate this claim, we introduce and publish the first real-world faceswap dataset, collected from popular online faceswap platforms. We then qualitatively evaluate the performance of state-of-the-art deepfake detectors on real-world deepfakes, revealing that their accuracy approaches the level of random guessing. Furthermore, we quantitatively demonstrate the significant performance degradation caused by common post-processing techniques. By addressing this overlooked challenge, our study underscores a critical avenue for enhancing the robustness and practical applicability of deepfake detection methods in real-world settings.

Title: Exploring Contextual Flux in Large Language Models: A Novel Approach to Self-Modulating Semantic Networks

Authors: Henry Evidail, Zachary Mountebank, Alistair Hathersage, Peter Stanhope, Basil Ravenscroft, Tobias Waddingham
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.10942
Pdf URL: https://arxiv.org/pdf/2502.10942
Copy Paste: [[2502.10942]] Exploring Contextual Flux in Large Language Models: A Novel Approach to Self-Modulating Semantic Networks(https://arxiv.org/abs/2502.10942)
Keywords: generative
Abstract: Self-modulating mechanisms introduce dynamic adaptation capabilities within language models through contextual realignment strategies that influence token embedding trajectories across extended sequences. Contextual Flux is explored as an approach to embedding modulation, integrating an auxiliary gating mechanism within the self-attention framework to dynamically adjust token representations based on evolving contextual dependencies. The empirical analysis evaluates entropy variations, latent space realignments, and coherence stability to assess the extent to which self-regulation enhances text generation consistency while preserving generative flexibility. Quantitative assessments suggest that embedding shifts contribute to more structured adaptation in long-form sequences, with measured reductions in redundant phrase repetitions and improvements in thematic retention. Variability in contextual weight computation affects modulation stability, leading to differing levels of adaptation across diverse linguistic structures. The computational demands introduced through real-time embedding reconfiguration are examined in relation to model scalability, emphasizing the need for optimization strategies in high-volume generative applications. The findings suggest that while adaptive embedding updates improve certain aspects of coherence, their impact remains contingent on model capacity and input complexity.

Title: Skillful Nowcasting of Convective Clouds With a Cascade Diffusion Model

Authors: Haoming Chen, Xiaohui Zhong, Qiang Zhai, Xiaomeng Li, Ying Wa Chan, Pak Wai Chan, Yuanyuan Huang, Hao Li, Xiaoming Shi
Subjects: cs.CV, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2502.10957
Pdf URL: https://arxiv.org/pdf/2502.10957
Copy Paste: [[2502.10957]] Skillful Nowcasting of Convective Clouds With a Cascade Diffusion Model(https://arxiv.org/abs/2502.10957)
Keywords: diffusion
Abstract: Accurate nowcasting of convective clouds from satellite imagery is essential for mitigating the impacts of meteorological disasters, especially in developing countries and remote regions with limited ground-based observations. Recent advances in deep learning have shown promise in video prediction; however, existing models frequently produce blurry results and exhibit reduced accuracy when forecasting physical fields. Here, we introduce SATcast, a diffusion model that leverages a cascade architecture and multimodal inputs for nowcasting cloud fields in satellite imagery. SATcast incorporates physical fields predicted by FuXi, a deep-learning weather model, alongside past satellite observations as conditional inputs to generate high-quality future cloud fields. Through comprehensive evaluation, SATcast outperforms conventional methods on multiple metrics, demonstrating its superior accuracy and robustness. Ablation studies underscore the importance of its multimodal design and the cascade architecture in achieving reliable predictions. Notably, SATcast maintains predictive skill for up to 24 hours, underscoring its potential for operational nowcasting applications.

Title: ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations

Authors: Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2502.10999
Pdf URL: https://arxiv.org/pdf/2502.10999
Copy Paste: [[2502.10999]] ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations(https://arxiv.org/abs/2502.10999)
Keywords: diffusion, self-supervised
Abstract: This work demonstrates that diffusion models can achieve font-controllable multilingual text rendering using just raw images without font label annotations. Visual text rendering remains a significant challenge. While recent methods condition diffusion on glyphs, it is impossible to retrieve exact font annotations from large-scale, real-world datasets, which prevents user-specified font control. To address this, we propose a data-driven solution that integrates the conditional diffusion model with a text segmentation model, utilizing segmentation masks to capture and represent fonts in pixel space in a self-supervised manner, thereby eliminating the need for any ground-truth labels and enabling users to customize text rendering with any multilingual font of their choice. The experiment provides a proof of concept of our algorithm in zero-shot text and font editing across diverse fonts and languages, providing valuable insights for the community and industry toward achieving generalized visual text rendering.

Title: Prompt Inject Detection with Generative Explanation as an Investigative Tool

Authors: Jonathan Pan, Swee Liang Wong, Yidi Yuan, Xin Wei Chia
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11006
Pdf URL: https://arxiv.org/pdf/2502.11006
Copy Paste: [[2502.11006]] Prompt Inject Detection with Generative Explanation as an Investigative Tool(https://arxiv.org/abs/2502.11006)
Keywords: generative
Abstract: Large Language Models (LLMs) are vulnerable to adversarial prompt-based injects. These injects could jailbreak or exploit vulnerabilities within these models with explicit prompt requests, leading to undesired responses. In the context of investigating prompt injects, the challenge is the sheer volume of input prompts involved, most of which are likely to be benign. This investigative challenge is further complicated by the semantics and subjectivity of the input prompts involved in the LLM's conversation with its user and by the context of the environment in which the conversation is being carried out. Hence, the challenge for AI security investigators is two-fold: the first step is to identify adversarial prompt injects, and the second is to assess whether the input prompt is contextually benign or adversarial. The first step could be done using existing AI security solutions like guardrails to detect and protect the LLMs. Guardrails have been developed using a variety of approaches. A popular approach is signature-based detection. Another popular approach is to develop AI models that classify such prompts, including NLP-based models such as a language model. However, in the context of conducting an AI security investigation of prompt injects, these guardrails lack the ability to aid investigators in triaging or assessing the identified input prompts. In this applied research exploration, we explore the use of the text generation capabilities of LLMs to detect prompt injects and generate explanations for their detections, to aid AI security investigators in assessing and triaging such prompt inject detections. The practical benefit of such a tool is to ease the task of conducting investigations into prompt injects.

Title: Collaborative Deterministic-Diffusion Model for Probabilistic Urban Spatiotemporal Prediction

Authors: Zhi Sheng, Yuan Yuan, Yudi Zhang, Depeng Jin, Yong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11013
Pdf URL: https://arxiv.org/pdf/2502.11013
Copy Paste: [[2502.11013]] Collaborative Deterministic-Diffusion Model for Probabilistic Urban Spatiotemporal Prediction(https://arxiv.org/abs/2502.11013)
Keywords: diffusion
Abstract: Accurate prediction of urban spatiotemporal dynamics is essential for enhancing urban management and decision-making. Existing spatiotemporal prediction models are predominantly deterministic, focusing on primary spatiotemporal patterns. However, those dynamics are highly complex, exhibiting multi-modal distributions that are challenging for deterministic models to capture. In this paper, we highlight the critical role of probabilistic prediction in capturing the uncertainties and complexities inherent in spatiotemporal data. While mainstream probabilistic models can capture uncertainty, they struggle with accurately learning primary patterns and often suffer from computational inefficiency. To address these challenges, we propose CoST, which collaborates deterministic and probabilistic models to improve both predictive accuracy and the ability to handle uncertainty. To achieve this, we design a mean-residual decomposition framework, where the mean value is modeled by a deterministic model, and the residual variations are learned by a probabilistic model, specifically diffusion models. Moreover, we introduce a scale-aware diffusion process, which better accounts for spatially heterogeneous dynamics across different regions. Extensive experiments on eight real-world datasets demonstrate that CoST significantly outperforms existing methods in both deterministic and probabilistic metrics, achieving a 20% improvement with low computational cost. CoST bridges the gap between deterministic precision and probabilistic uncertainty, making a significant advancement in the field of urban spatiotemporal prediction.
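
The mean-residual decomposition can be sketched in a few lines. This is an illustration under stated assumptions: the deterministic model is represented only by its mean predictions, and the diffusion model is replaced by an arbitrary `residual_sampler` callable (hypothetical), so the sketch shows the decomposition and not CoST itself.

```python
def residual_targets(observations, mean_predictions):
    """Training targets for the probabilistic model: observation minus mean."""
    return [y - m for y, m in zip(observations, mean_predictions)]


def sample_forecasts(mean_predictions, residual_sampler, num_samples):
    """Probabilistic forecasts: deterministic mean plus a sampled residual.

    `residual_sampler` is a zero-argument callable standing in for the
    learned residual model; each call returns one residual value.
    """
    return [
        [m + residual_sampler() for m in mean_predictions]
        for _ in range(num_samples)
    ]
```

For example, `sample_forecasts([10.0, 12.0], lambda: random.gauss(0.0, 1.0), 100)` (after `import random`) would draw 100 forecast trajectories scattered around the deterministic mean.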

Title: ClimateLLM: Efficient Weather Forecasting via Frequency-Aware Large Language Models

Authors: Shixuan Li, Wei Yang, Peiyu Zhang, Xiongye Xiao, Defu Cao, Yuehan Qin, Xiaole Zhang, Yue Zhao, Paul Bogdan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11059
Pdf URL: https://arxiv.org/pdf/2502.11059
Copy Paste: [[2502.11059]] ClimateLLM: Efficient Weather Forecasting via Frequency-Aware Large Language Models(https://arxiv.org/abs/2502.11059)
Keywords: foundation model
Abstract: Weather forecasting is crucial for public safety, disaster prevention and mitigation, agricultural production, and energy management, with global relevance. Although deep learning has significantly advanced weather prediction, current methods face critical limitations: (i) they often struggle to capture both dynamic temporal dependencies and short-term abrupt changes, making extreme weather modeling difficult; (ii) they incur high computational costs due to extensive training and resource requirements; (iii) they have limited adaptability to multi-scale frequencies, leading to challenges when separating global trends from local fluctuations. To address these issues, we propose ClimateLLM, a foundation model for weather forecasting. It captures spatiotemporal dependencies via a cross-temporal and cross-spatial collaborative modeling framework that integrates Fourier-based frequency decomposition with Large Language Models (LLMs) to strengthen spatial and temporal modeling. Our framework uses a Mixture-of-Experts (MoE) mechanism that adaptively processes different frequency components, enabling efficient handling of both global signals and localized extreme events. In addition, we introduce a cross-temporal and cross-spatial dynamic prompting mechanism, allowing LLMs to incorporate meteorological patterns across multiple scales effectively. Extensive experiments on real-world datasets show that ClimateLLM outperforms state-of-the-art approaches in accuracy and efficiency, as a scalable solution for global weather forecasting.
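The idea of separating global trends from local fluctuations with a Fourier-based decomposition can be illustrated with a toy one-dimensional signal. The sketch below is an assumption-laden example, not ClimateLLM's pipeline: it uses a naive discrete Fourier transform and a hard cutoff, and the names `split_by_frequency` and `cutoff` are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def split_by_frequency(signal, cutoff):
    """Split a signal into low-frequency (<= cutoff) and high-frequency parts."""
    n = len(signal)
    low, high = [], []
    for k, coeff in enumerate(dft(signal)):
        freq = min(k, n - k)  # bins k and n-k carry the same frequency
        if freq <= cutoff:
            low.append(coeff)
            high.append(0)
        else:
            low.append(0)
            high.append(coeff)
    return idft(low), idft(high)


n = 64
trend = [math.sin(2 * math.pi * t / n) for t in range(n)]             # 1 cycle
fluct = [0.3 * math.sin(2 * math.pi * 10 * t / n) for t in range(n)]  # 10 cycles
signal = [u + v for u, v in zip(trend, fluct)]

low, high = split_by_frequency(signal, cutoff=3)
```

Here `low` recovers the slow trend and `high` the fast fluctuation, and the two parts sum back to the original signal. Each part could then be routed to a different processing component.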

Title: Are Generative Models Underconfident? An Embarrassingly Simple Quality Estimation Approach

Authors: Tu Anh Dinh, Jan Niehues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11115
Pdf URL: https://arxiv.org/pdf/2502.11115
Copy Paste: [[2502.11115]] Are Generative Models Underconfident? An Embarrassingly Simple Quality Estimation Approach(https://arxiv.org/abs/2502.11115)
Keywords: generative
Abstract: Quality Estimation (QE) is the task of estimating the quality of model output when the ground truth reference is not available. Looking at model uncertainty from its own output probabilities is the most trivial and low-effort way to estimate the output quality. However, for generative models, output probabilities might not be the best quality estimator. At an output step, there can be multiple correct options, making the probability distribution spread out more. Thus, lower token probability does not necessarily mean lower output quality. In other words, the model can be considered underconfident. In this paper, we propose a QE approach called Dominant Mass Probability (DMP) that boosts the model confidence in cases where there are multiple viable output options. We show that, with no increase in complexity, DMP is notably better than sequence probability when estimating the quality of different models (Whisper, Llama, etc.) on different tasks (translation, summarization, etc.). Compared to sequence probability, DMP achieves on average +0.208 improvement in Pearson correlation to ground-truth quality.
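The underconfidence argument can be made concrete with a toy calculation of sequence probability, the baseline this abstract compares against. The sketch does not implement DMP, whose definition is not given in the abstract. The token probabilities are invented for illustration.

```python
import math


def sequence_log_prob(token_probs):
    """Log of the sequence probability: the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical output where every step has one clearly correct token.
single_option = [0.9, 0.9, 0.9]

# Same output, but the second step has two equally valid tokens that
# share the probability mass (0.45 each), so the chosen token scores lower.
two_options = [0.9, 0.45, 0.9]

penalty = sequence_log_prob(single_option) - sequence_log_prob(two_options)
```

Both outputs are equally good by construction, yet the second one is scored lower by exactly log 2, which is the effect the abstract calls underconfidence.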

Title: FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

Authors: Hui Wang, Shujie Liu, Lingwei Meng, Jinyu Li, Yifan Yang, Shiwan Zhao, Haiyang Sun, Yanqing Liu, Haoqin Sun, Jiaming Zhou, Yan Lu, Yong Qin
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.11128
Pdf URL: https://arxiv.org/pdf/2502.11128
Copy Paste: [[2502.11128]] FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching(https://arxiv.org/abs/2502.11128)
Keywords: generative
Abstract: To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By leveraging the autoregressive nature of language models and the generative efficacy of flow matching, FELLE effectively predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, improving coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism, generating continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in this https URL.

Title: Machine Learning-Based Intrusion Detection and Prevention System for IIoT Smart Metering Networks: Challenges and Solutions

Authors: Sahar Lazim, Qutaiba I. Ali
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11138
Pdf URL: https://arxiv.org/pdf/2502.11138
Copy Paste: [[2502.11138]] Machine Learning-Based Intrusion Detection and Prevention System for IIoT Smart Metering Networks: Challenges and Solutions(https://arxiv.org/abs/2502.11138)
Keywords: anomaly
Abstract: The Industrial Internet of Things (IIoT) has revolutionized industries by enabling automation, real-time data exchange, and smart decision-making. However, its increased connectivity introduces cybersecurity threats, particularly in smart metering networks, which play a crucial role in monitoring and optimizing energy consumption. This paper explores the challenges associated with securing IIoT-based smart metering networks and proposes a Machine Learning (ML)-based Intrusion Detection and Prevention System (IDPS) for safeguarding edge devices. The study reviews various intrusion detection approaches, highlighting the strengths and limitations of both signature-based and anomaly-based detection techniques. The findings suggest that integrating ML-driven IDPS in IIoT smart metering environments enhances security, efficiency, and resilience against evolving cyber threats.

Title: AnyRefill: A Unified, Data-Efficient Framework for Left-Prompt-Guided Vision Tasks

Authors: Ming Xie, Chenjie Cao, Yunuo Cai, Xiangyang Xue, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11158
Pdf URL: https://arxiv.org/pdf/2502.11158
Copy Paste: [[2502.11158]] AnyRefill: A Unified, Data-Efficient Framework for Left-Prompt-Guided Vision Tasks(https://arxiv.org/abs/2502.11158)
Keywords: diffusion, generative
Abstract: In this paper, we present a novel Left-Prompt-Guided (LPG) paradigm to address a diverse range of reference-based vision tasks. Inspired by the human creative process, we reformulate these tasks using a left-right stitching formulation to construct contextual input. Building upon this foundation, we propose AnyRefill, an extension of LeftRefill, that effectively adapts Text-to-Image (T2I) models to various vision tasks. AnyRefill leverages the inpainting priors of advanced T2I models based on the Diffusion Transformer (DiT) architecture, and incorporates flexible components to enhance its capabilities. By combining task-specific LoRAs with the stitching input, AnyRefill unlocks its potential across diverse tasks, including conditional generation, visual perception, and image editing, without requiring additional visual encoders. Meanwhile, AnyRefill exhibits remarkable data efficiency, requiring minimal task-specific fine-tuning while maintaining high generative performance. Through extensive ablation studies, we demonstrate that AnyRefill outperforms other image condition injection methods and achieves competitive results compared to state-of-the-art open-source methods. Notably, AnyRefill delivers results comparable to advanced commercial tools, such as IC-Light and SeedEdit, even in challenging scenarios. Comprehensive experiments and ablation studies across versatile tasks validate the strong generation of the proposed simple yet effective LPG formulation, establishing AnyRefill as a unified, highly data-efficient solution for reference-based vision tasks.

Title: LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning

Authors: Tianshi Zheng, Jiayang Cheng, Chunyang Li, Haochen Shi, Zihao Wang, Jiaxin Bai, Yangqiu Song, Ginny Y. Wong, Simon See
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11176
Pdf URL: https://arxiv.org/pdf/2502.11176
Copy Paste: [[2502.11176]] LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning(https://arxiv.org/abs/2502.11176)
Keywords: in-context
Abstract: Modern large language models (LLMs) employ various forms of logical inference, both implicitly and explicitly, when addressing reasoning tasks. Understanding how to optimally leverage these inference paradigms is critical for advancing LLMs' reasoning capabilities. This paper adopts an exploratory approach by introducing a controlled evaluation environment for analogical reasoning -- a fundamental cognitive task -- that is systematically parameterized across three dimensions: modality (textual, visual, symbolic), difficulty (easy, medium, hard), and task format (multiple-choice or free-text generation). We analyze the comparative dynamics of inductive, abductive, and deductive inference pipelines across these dimensions, and demonstrate that our findings generalize to broader in-context learning tasks. Additionally, we investigate advanced paradigms such as hypothesis selection, verification, and refinement, revealing their potential to scale up logical inference in LLM reasoning. This exploratory study provides a foundation for future research in enhancing LLM reasoning through systematic logical inference strategies.

Title: MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation

Authors: Michael Fuest, Vincent Tao Hu, Björn Ommer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11234
Pdf URL: https://arxiv.org/pdf/2502.11234
Copy Paste: [[2502.11234]] MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation(https://arxiv.org/abs/2502.11234)
Keywords: generative
Abstract: Generating long, high-quality videos remains a challenge due to the complex interplay of spatial and temporal dynamics and hardware limitations. In this work, we introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching to enable efficient generation of high-quality long videos. By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos with lengths ten times beyond that of the training sequences. MaskFlow does so very efficiently by enabling the use of fast Masked Generative Model (MGM)-style sampling and can be deployed in both fully autoregressive as well as full-sequence generation modes. We validate the quality of our method on the FaceForensics (FFS) and Deepmind Lab (DMLab) datasets and report Fréchet Video Distance (FVD) competitive with state-of-the-art approaches. We also provide a detailed analysis on the sampling efficiency of our method and demonstrate that MaskFlow can be applied to both timestep-dependent and timestep-independent models in a training-free manner.

Title: Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL

Authors: Matthew Zurek, Yudong Chen
Subjects: cs.LG, cs.IT, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2502.11238
Pdf URL: https://arxiv.org/pdf/2502.11238
Copy Paste: [[2502.11238]] Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL(https://arxiv.org/abs/2502.11238)
Keywords: generative
Abstract: We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without $H$ knowledge, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term horizon calibration. We also develop an empirical span penalization approach, inspired by sample variance penalization, which satisfies an oracle inequality performance guarantee. In particular this algorithm can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.

Title: Uncertainty-Aware Step-wise Verification with Generative Reward Models

Authors: Zihuiwen Ye, Luckeciano Carvalho Melo, Younesse Kaddar, Phil Blunsom, Sam Staton, Yarin Gal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11250
Pdf URL: https://arxiv.org/pdf/2502.11250
Copy Paste: [[2502.11250]] Uncertainty-Aware Step-wise Verification with Generative Reward Models(https://arxiv.org/abs/2502.11250)
Keywords: generative
Abstract: Complex multi-step reasoning tasks, such as solving mathematical problems, remain challenging for large language models (LLMs). While outcome supervision is commonly used, process supervision via process reward models (PRMs) provides intermediate rewards to verify step-wise correctness in solution traces. However, as proxies for human judgement, PRMs suffer from reliability issues, including susceptibility to reward hacking. In this work, we propose leveraging uncertainty quantification (UQ) to enhance the reliability of step-wise verification with generative reward models for mathematical reasoning tasks. We introduce CoT Entropy, a novel UQ method that outperforms existing approaches in quantifying a PRM's uncertainty in step-wise verification. Our results demonstrate that incorporating uncertainty estimates improves the robustness of judge-LM PRMs, leading to more reliable verification.

Title: Exploiting Point-Language Models with Dual-Prompts for 3D Anomaly Detection

Authors: Jiaxiang Wang, Haote Xu, Xiaolu Chen, Haodi Xu, Yue Huang, Xinghao Ding, Xiaotong Tu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11307
Pdf URL: https://arxiv.org/pdf/2502.11307
Copy Paste: [[2502.11307]] Exploiting Point-Language Models with Dual-Prompts for 3D Anomaly Detection(https://arxiv.org/abs/2502.11307)
Keywords: anomaly
Abstract: Anomaly detection (AD) in 3D point clouds is crucial in a wide range of industrial applications, especially in various forms of precision manufacturing. Considering the industrial demand for reliable 3D AD, several methods have been developed. However, most of these approaches typically require training separate models for each category, which is memory-intensive and lacks flexibility. In this paper, we propose a novel Point-Language model with dual-prompts for 3D ANomaly dEtection (PLANE). The approach leverages multi-modal prompts to extend the strong generalization capabilities of pre-trained Point-Language Models (PLMs) to the domain of 3D point cloud AD, achieving impressive detection performance across multiple categories using a single model. Specifically, we propose a dual-prompt learning method, incorporating both text and point cloud prompts. The method utilizes a dynamic prompt creator module (DPCM) to produce sample-specific dynamic prompts, which are then integrated with class-specific static prompts for each modality, effectively driving the PLMs. Additionally, based on the characteristics of point cloud data, we propose a pseudo 3D anomaly generation method (Ano3D) to improve the model's detection capabilities in an unsupervised setting. Experimental results demonstrate that the proposed method, which is under the multi-class-one-model paradigm, achieves a +8.7%/+17% gain on anomaly detection and localization performance as compared to the state-of-the-art one-class-one-model methods for the Anomaly-ShapeNet dataset, and obtains +4.3%/+4.1% gain for the Real3D-AD dataset. Code will be available upon publication.

Title: ALGEN: Few-shot Inversion Attacks on Textual Embeddings using Alignment and Generation

Authors: Yiyi Chen, Qiongkai Xu, Johannes Bjerva
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2502.11308
Pdf URL: https://arxiv.org/pdf/2502.11308
Copy Paste: [[2502.11308]] ALGEN: Few-shot Inversion Attacks on Textual Embeddings using Alignment and Generation(https://arxiv.org/abs/2502.11308)
Keywords: generative
Abstract: With the growing popularity of Large Language Models (LLMs) and vector databases, private textual data is increasingly processed and stored as numerical embeddings. However, recent studies have proven that such embeddings are vulnerable to inversion attacks, where original text is reconstructed to reveal sensitive information. Previous research has largely assumed access to millions of sentences to train attack models, e.g., through data leakage or nearly unrestricted API access. With our method, a single data point is sufficient for a partially successful inversion attack. With as few as 1k data samples, performance reaches an optimum across a range of black-box encoders, without training on leaked data. We present a Few-shot Textual Embedding Inversion Attack using ALignment and GENeration (ALGEN), by aligning victim embeddings to the attack space and using a generative model to reconstruct text. We find that ALGEN attacks can be effectively transferred across domains and languages, revealing key information. We further examine a variety of defense mechanisms against ALGEN, and find that none are effective, highlighting the vulnerabilities posed by inversion attacks. By significantly lowering the cost of inversion and proving that embedding spaces can be aligned through one-step optimization, we establish a new textual embedding inversion paradigm with broader applications for embedding alignment in NLP.

Title: Inverse Flow and Consistency Models

Authors: Yuchen Zhang, Jian Zhou
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11333
Pdf URL: https://arxiv.org/pdf/2502.11333
Copy Paste: [[2502.11333]] Inverse Flow and Consistency Models(https://arxiv.org/abs/2502.11333)
Keywords: diffusion, generative
Abstract: Inverse generation problems, such as denoising without ground truth observations, are a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models achieved impressive results by casting generation as denoising problems, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows, Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalized consistency training to any forward diffusion process or conditional flow, which has applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF's utility in scientific problems. Overall, this work expands the applications of powerful generative models to inverse generation problems.

Title: WRT-SAM: Foundation Model-Driven Segmentation for Generalized Weld Radiographic Testing

Authors: Yunyi Zhou, Kun Shi, Gang Hao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11338
Pdf URL: https://arxiv.org/pdf/2502.11338
Copy Paste: [[2502.11338]] WRT-SAM: Foundation Model-Driven Segmentation for Generalized Weld Radiographic Testing(https://arxiv.org/abs/2502.11338)
Keywords: foundation model, anomaly
Abstract: Radiographic testing is a fundamental non-destructive evaluation technique for identifying weld defects and assessing quality in industrial applications due to its high-resolution imaging capabilities. Over the past decade, deep learning techniques have significantly advanced weld defect identification in radiographic images. However, conventional approaches, which rely on training small-scale, task-specific models on single-scenario datasets, exhibit poor cross-scenario generalization. Recently, the Segment Anything Model (SAM), a pre-trained visual foundation model trained on large-scale datasets, has demonstrated exceptional zero-shot generalization capabilities. Fine-tuning SAM with limited domain-specific data has yielded promising results in fields such as medical image segmentation and anomaly detection. To the best of our knowledge, this work is the first to introduce SAM-based segmentation for general weld radiographic testing images. We propose WRT-SAM, a novel weld radiographic defect segmentation model that leverages SAM through an adapter-based integration with a specialized prompt generator architecture. To improve adaptability to grayscale weld radiographic images, we introduce a frequency prompt generator module, which enhances the model's sensitivity to frequency-domain information. Furthermore, to address the multi-scale nature of weld defects, we incorporate a multi-scale prompt generator module, enabling the model to effectively extract and encode defect information across varying scales. Extensive experimental evaluations demonstrate that WRT-SAM achieves a recall of 78.87%, a precision of 84.04%, and an AUC of 0.9746, setting a new state-of-the-art (SOTA) benchmark. Moreover, the model exhibits superior zero-shot generalization performance, highlighting its potential for practical deployment in diverse radiographic testing scenarios.

Title: Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning

Authors: Yilei Tu, Andrew Xue, Freda Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11364
Pdf URL: https://arxiv.org/pdf/2502.11364
Copy Paste: [[2502.11364]] Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning(https://arxiv.org/abs/2502.11364)
Keywords: in-context
Abstract: While multilingual large language models generally perform adequately, and sometimes even rival English performance on high-resource languages (HRLs), they often significantly underperform on low-resource languages (LRLs). Among several prompting strategies aiming at bridging the gap, multilingual in-context learning (ICL) has been particularly effective when demonstrations in target languages are unavailable. However, a systematic understanding of when and why it works well is lacking. In this work, we systematically analyze multilingual ICL, using demonstrations in HRLs to enhance cross-lingual transfer. We show that demonstrations in mixed HRLs consistently outperform English-only ones across the board, particularly for tasks written in LRLs. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains, suggesting the effectiveness of multilingual exposure itself. Our results highlight the potential of strategically leveraging multilingual resources to bridge the performance gap for underrepresented languages.

Title: Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization

Authors: Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11381
Pdf URL: https://arxiv.org/pdf/2502.11381
Copy Paste: [[2502.11381]] Without Paired Labeled Data: An End-to-End Self-Supervised Paradigm for UAV-View Geo-Localization(https://arxiv.org/abs/2502.11381)
Keywords: self-supervised
Abstract: UAV-View Geo-Localization (UVGL) aims to ascertain the precise location of a UAV by retrieving the most similar GPS-tagged satellite image. However, existing methods predominantly rely on supervised learning paradigms that necessitate annotated paired data for training, which incurs substantial annotation costs and impedes large-scale deployment. To overcome this limitation, we propose the Dynamic Memory-Driven and Neighborhood Information Learning (DMNIL) network, a lightweight end-to-end self-supervised framework for UAV-view geo-localization. The DMNIL framework utilizes a dual-path clustering-based contrastive learning architecture as its baseline to model intra-view structural relationships, enhancing feature consistency and discriminability. Additionally, a dynamic memory-driven hierarchical learning module is proposed to progressively mine local and global information, reinforcing multi-level feature associations to improve model robustness. To bridge the domain gap between UAV and satellite views, we design an information-consistent evolutionary learning mechanism that systematically explores latent correlations within intra-view neighborhoods and across cross-view domains, ultimately constructing a unified cross-view feature representation space. Extensive experiments on three benchmarks (University-1652, SUES-200, and DenseUAV) demonstrate that DMNIL achieves competitive performance against state-of-the-art supervised methods while maintaining computational efficiency. Notably, this superiority is attained without relying on paired training data, underscoring the framework's practicality for real-world deployment. Codes will be released soon.

Title: MARS: Mesh AutoRegressive Model for 3D Shape Detailization

Authors: Jingnan Gao, Weizhe Liu, Weixuan Sun, Senbo Wang, Xibin Song, Taizhang Shang, Shenzhou Chen, Hongdong Li, Xiaokang Yang, Yichao Yan, Pan Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11390
Pdf URL: https://arxiv.org/pdf/2502.11390
Copy Paste: [[2502.11390]] MARS: Mesh AutoRegressive Model for 3D Shape Detailization(https://arxiv.org/abs/2502.11390)
Keywords: generative
Abstract: State-of-the-art methods for mesh detailization predominantly utilize Generative Adversarial Networks (GANs) to generate detailed meshes from coarse ones. These methods typically learn a specific style code for each category or similar categories without enforcing geometry supervision across different Levels of Detail (LODs). Consequently, such methods often fail to generalize across a broader range of categories and cannot ensure shape consistency throughout the detailization process. In this paper, we introduce MARS, a novel approach for 3D shape detailization. Our method capitalizes on a novel multi-LOD, multi-category mesh representation to learn shape-consistent mesh representations in latent space across different LODs. We further propose a mesh autoregressive model capable of generating such latent representations through next-LOD token prediction. This approach significantly enhances the realism of the generated shapes. Extensive experiments conducted on the challenging 3D Shape Detailization benchmark demonstrate that our proposed MARS model achieves state-of-the-art performance, surpassing existing methods in both qualitative and quantitative assessments. Notably, the model's capability to generate fine-grained details while preserving the overall shape integrity is particularly commendable.

Title: Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment

Authors: Jingcheng Deng, Zhongtao Jiang, Liang Pang, Liwei Chen, Kun Xu, Zihao Wei, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11401
Pdf URL: https://arxiv.org/pdf/2502.11401
Copy Paste: [[2502.11401]] Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment(https://arxiv.org/abs/2502.11401)
Keywords: generative
Abstract: A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text embeddings with positive-sample embeddings by leveraging the conditional distribution of embeddings while simultaneously reducing the likelihood of generating negative samples from text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.

Title: Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models

Authors: Yingqing Guo, Yukang Yang, Hui Yuan, Mengdi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11420
Pdf URL: https://arxiv.org/pdf/2502.11420
Copy Paste: [[2502.11420]] Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models(https://arxiv.org/abs/2502.11420)
Keywords: diffusion
Abstract: Training-free guidance enables controlled generation in diffusion and flow models, but most existing methods assume differentiable objectives and rely on gradients. This work focuses on training-free guidance addressing challenges from non-differentiable objectives and discrete data distributions. We propose an algorithmic framework TreeG: Tree Search-Based Path Steering Guidance, applicable to both continuous and discrete settings in diffusion and flow models. TreeG offers a unified perspective on training-free guidance: proposing candidates for the next step, evaluating candidates, and selecting the best to move forward, enhanced by a tree search mechanism over active paths or parallelizing exploration. We comprehensively investigate the design space of TreeG over the candidate proposal module and the evaluation function, instantiating TreeG into three novel algorithms. Our experiments show that TreeG consistently outperforms the top guidance baselines in symbolic music generation, small molecule generation, and enhancer DNA design, all of which involve non-differentiable challenges. Additionally, we identify an inference-time scaling law showing TreeG's scalability in inference-time computation.
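The propose / evaluate / select loop over active paths described in this abstract can be sketched generically. This is a minimal beam-style illustration under stated assumptions, not one of the three TreeG algorithms: uniform random token sampling stands in for a diffusion or flow model's sampling step, and the objective, vocabulary, and all function names are hypothetical.

```python
import random


def propose(path, vocab, n_candidates, rng):
    """Propose next-step candidates (random sampling stands in for the model)."""
    return [path + [rng.choice(vocab)] for _ in range(n_candidates)]


def tree_search(score, vocab, length, n_active, n_candidates, rng):
    """Keep the n_active best partial paths at every step; no gradients needed."""
    active = [[]]
    for _ in range(length):
        candidates = [
            cand
            for path in active
            for cand in propose(path, vocab, n_candidates, rng)
        ]
        # Evaluate every candidate with the black-box objective, keep the best.
        candidates.sort(key=score, reverse=True)
        active = candidates[:n_active]
    return active[0]


def count_target(path):
    """Toy non-differentiable objective: number of 'A' tokens in the sequence."""
    return path.count("A")


rng = random.Random(0)
best = tree_search(
    count_target, ["A", "C", "G", "T"],
    length=12, n_active=4, n_candidates=8, rng=rng,
)
```

Because the objective is only ever called as a black-box scorer, the same loop works for discrete sequences and for rewards that have no gradient. Raising `n_active` or `n_candidates` trades more inference-time compute for better guided samples.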

Title: ADO: Automatic Data Optimization for Inputs in LLM Prompts

Authors: Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11436
Pdf URL: https://arxiv.org/pdf/2502.11436
Copy Paste: [[2502.11436]] ADO: Automatic Data Optimization for Inputs in LLM Prompts(https://arxiv.org/abs/2502.11436)
Keywords: in-context
Abstract: This study explores a novel approach to enhance the performance of Large Language Models (LLMs) through the optimization of input data within prompts. While previous research has primarily focused on refining instruction components and augmenting input data with in-context examples, our work investigates the potential benefits of optimizing the input data itself. We introduce a two-pronged strategy for input data optimization: content engineering and structural reformulation. Content engineering involves imputing missing values, removing irrelevant attributes, and enriching profiles by generating additional information inferred from existing attributes. Subsequent to content engineering, structural reformulation is applied to optimize the presentation of the modified content to LLMs, given their sensitivity to input format. Our findings suggest that these optimizations can significantly improve the performance of LLMs in various tasks, offering a promising avenue for future research in prompt engineering. The source code is available at this https URL

Title: SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL

Authors: Jimin Lee, Ingeol Baek, Byeongjeong Kim, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11438
Pdf URL: https://arxiv.org/pdf/2502.11438
Copy Paste: [[2502.11438]] SAFE-SQL: Self-Augmented In-Context Learning with Fine-grained Example Selection for Text-to-SQL(https://arxiv.org/abs/2502.11438)
Keywords: in-context
Abstract: Text-to-SQL aims to convert natural language questions into executable SQL queries. While previous approaches, such as skeleton-masked selection, have demonstrated strong performance by retrieving similar training examples to guide large language models (LLMs), they struggle in real-world scenarios where such examples are unavailable. To overcome this limitation, we propose Self-Augmentation in-context learning with Fine-grained Example selection for Text-to-SQL (SAFE-SQL), a novel framework that improves SQL generation by generating and filtering self-augmented examples. SAFE-SQL first prompts an LLM to generate multiple Text-to-SQL examples relevant to the test input. Then SAFE-SQL filters these examples through three relevance assessments, constructing high-quality in-context learning examples. Using self-generated examples, SAFE-SQL surpasses previous zero-shot and few-shot Text-to-SQL frameworks, achieving higher execution accuracy. Notably, our approach provides additional performance gains in extra hard and unseen scenarios, where conventional methods often fail.

Title: An Efficient Row-Based Sparse Fine-Tuning

Authors: Cen-Jhih Li, Aditya Bhaskara
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11439
Pdf URL: https://arxiv.org/pdf/2502.11439
Copy Paste: [[2502.11439]] An Efficient Row-Based Sparse Fine-Tuning(https://arxiv.org/abs/2502.11439)
Keywords: foundation model
Abstract: Fine-tuning is an important step in adapting foundation models such as large language models to downstream tasks. To make this step more accessible to users with limited computational budgets, it is crucial to develop fine-tuning methods that are memory and computationally efficient. Sparse Fine-tuning (SFT) and Low-rank adaptation (LoRA) are two frameworks that have emerged for addressing this problem and have been adopted widely in practice. In this work, we develop a new SFT framework, based on ideas from neural network pruning. At a high level, we first identify "important" neurons/nodes using feature importance metrics from network pruning (specifically, we use the structural pruning method), and then perform fine-tuning by restricting to weights involving these neurons. Using experiments on common language tasks, we demonstrate that our method significantly improves the memory efficiency of SFT without increasing training time complexity and implementation complexity, while achieving accuracy comparable to state-of-the-art methods such as LoRA and its variants.

Title: Medical Image Registration Meets Vision Foundation Model: Prototype Learning and Contour Awareness

Authors: Hao Xu, Tengfei Xue, Jianan Fan, Dongnan Liu, Yuqian Chen, Fan Zhang, Carl-Fredrik Westin, Ron Kikinis, Lauren J. O'Donnell, Weidong Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11440
Pdf URL: https://arxiv.org/pdf/2502.11440
Copy Paste: [[2502.11440]] Medical Image Registration Meets Vision Foundation Model: Prototype Learning and Contour Awareness(https://arxiv.org/abs/2502.11440)
Keywords: foundation model
Abstract: Medical image registration is a fundamental task in medical image analysis, aiming to establish spatial correspondences between paired images. However, existing unsupervised deformable registration methods rely solely on intensity-based similarity metrics, lacking explicit anatomical knowledge, which limits their accuracy and robustness. Vision foundation models, such as the Segment Anything Model (SAM), can generate high-quality segmentation masks that provide explicit anatomical structure knowledge, addressing the limitations of traditional methods that depend only on intensity similarity. Based on this, we propose a novel SAM-assisted registration framework incorporating prototype learning and contour awareness. The framework includes: (1) Explicit anatomical information injection, where SAM-generated segmentation masks are used as auxiliary inputs throughout training and testing to ensure the consistency of anatomical information; (2) Prototype learning, which leverages segmentation masks to extract prototype features and aligns prototypes to optimize semantic correspondences between images; and (3) Contour-aware loss, which leverages the edges of segmentation masks to improve the model's performance in fine-grained deformation fields. Extensive experiments demonstrate that the proposed framework significantly outperforms existing methods across multiple datasets, particularly in challenging scenarios with complex anatomical structures and ambiguous boundaries. Our code is available at this https URL.

Title: Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation

Authors: Taeyoung Yun, Dinghuai Zhang, Jinkyoo Park, Ling Pan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11477
Pdf URL: https://arxiv.org/pdf/2502.11477
Copy Paste: [[2502.11477]] Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation(https://arxiv.org/abs/2502.11477)
Keywords: diffusion, generative
Abstract: Recent advances in text-to-image diffusion models have achieved impressive image generation capabilities. However, it remains challenging to control the generation process with desired properties (e.g., aesthetic quality, user intention), which can be expressed as black-box reward functions. In this paper, we focus on prompt adaptation, which refines the original prompt into model-preferred prompts to generate desired images. While prior work uses reinforcement learning (RL) to optimize prompts, we observe that applying RL often results in generating similar postfixes and deterministic behaviors. To this end, we introduce Prompt Adaptation with GFlowNets (PAG), a novel approach that frames prompt adaptation as a probabilistic inference problem. Our key insight is that leveraging Generative Flow Networks (GFlowNets) allows us to shift from reward maximization to sampling from an unnormalized density function, enabling both high-quality and diverse prompt generation. However, we identify that a naive application of GFlowNets suffers from mode collapse, and we uncover a previously overlooked phenomenon: the progressive loss of neural plasticity in the model, which is compounded by inefficient credit assignment in sequential prompt generation. To address this critical challenge, we develop a systematic approach in PAG with flow reactivation, reward-prioritized sampling, and reward decomposition for prompt adaptation. Extensive experiments validate that PAG successfully learns to sample effective and diverse prompts for text-to-image generation. We also show that PAG exhibits strong robustness across various reward functions and transferability to different text-to-image models.

Title: Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models

Authors: Masahiro Kaneko, Alham Fikri Aji, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11495
Pdf URL: https://arxiv.org/pdf/2502.11495
Copy Paste: [[2502.11495]] Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models(https://arxiv.org/abs/2502.11495)
Keywords: in-context
Abstract: Multilingual large language models (MLLMs) can use in-context learning (ICL) to achieve high performance by leveraging cross-lingual knowledge transfer without parameter updates. However, their effectiveness is highly sensitive to example selection, particularly in multilingual settings. Based on the findings of existing work, three key factors influence multilingual ICL: (1) semantic similarity, (2) linguistic alignment, and (3) language-specific performance. However, existing approaches address these factors independently, without explicitly disentangling their combined impact, leaving optimal example selection underexplored. To address this gap, we propose balanced multi-factor ICL (BMF-ICL), a method that quantifies and optimally balances these factors for improved example selection. Experiments on mCSQA and TYDI across four MLLMs demonstrate that BMF-ICL outperforms existing methods. Further analysis highlights the importance of incorporating all three factors and the importance of selecting examples from multiple languages.

Title: DifCluE: Generating Counterfactual Explanations with Diffusion Autoencoders and modal clustering

Authors: Suparshva Jain, Amit Sangroya, Lovekesh Vig
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11509
Pdf URL: https://arxiv.org/pdf/2502.11509
Copy Paste: [[2502.11509]] DifCluE: Generating Counterfactual Explanations with Diffusion Autoencoders and modal clustering(https://arxiv.org/abs/2502.11509)
Keywords: diffusion
Abstract: Generating multiple counterfactual explanations for different modes within a class presents a significant challenge, as these modes are distinct yet converge under the same classification. Diffusion probabilistic models (DPMs) have demonstrated a strong ability to capture the underlying modes of data distributions. In this paper, we harness the power of a Diffusion Autoencoder to generate multiple distinct counterfactual explanations. By clustering in the latent space, we uncover the directions corresponding to the different modes within a class, enabling the generation of diverse and meaningful counterfactuals. We introduce a novel methodology, DifCluE, which consistently identifies these modes and produces more reliable counterfactual explanations. Our experimental results demonstrate that DifCluE outperforms the current state-of-the-art in generating multiple counterfactual explanations, offering a significant advancement in model interpretability.

Title: SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion

Authors: Junxian Ma, Shiwen Wang, Jian Yang, Junyi Hu, Jian Liang, Guosheng Lin, Jingbo Chen, Kai Li, Yu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11515
Pdf URL: https://arxiv.org/pdf/2502.11515
Copy Paste: [[2502.11515]] SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion(https://arxiv.org/abs/2502.11515)
Keywords: diffusion
Abstract: Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate representations to force lip motion synthesis. This leads to complex training pipelines and limited motion naturalness. In this paper, we present SayAnything, a conditional video diffusion framework that directly synthesizes lip movements from audio input while preserving speaker identity. Specifically, we propose three specialized modules: an identity preservation module, an audio guidance module, and an editing control module. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation without requiring additional supervision signals or intermediate representations. Extensive experiments demonstrate that SayAnything generates highly realistic videos with improved lip-teeth coherence, enabling unseen characters to say anything, while effectively generalizing to animated characters.

Title: Training Large Language Models to be Better Rule Followers

Authors: Yi Hu, Shijia Kang, Haotong Yang, Haotian Xu, Muhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11525
Pdf URL: https://arxiv.org/pdf/2502.11525
Copy Paste: [[2502.11525]] Training Large Language Models to be Better Rule Followers(https://arxiv.org/abs/2502.11525)
Keywords: in-context
Abstract: Large language models (LLMs) have shown impressive performance across a wide range of tasks. However, they often exhibit unexpected failures in seemingly straightforward tasks, suggesting a reliance on case-based reasoning rather than rule-based reasoning. While the vast training corpus of LLMs contains numerous textual "rules", current training methods fail to leverage these rules effectively. Crucially, the relationships between these "rules" and their corresponding "instances" are not explicitly modeled. As a result, while LLMs can often recall rules with ease, they fail to apply these rules strictly and consistently in relevant reasoning scenarios. In this paper, we investigate the rule-following capabilities of LLMs and propose Meta Rule-Following Fine-Tuning (Meta-RFFT) to enhance the cross-task transferability of rule-following abilities. We first construct a dataset of 88 tasks requiring following rules, encompassing diverse reasoning domains. We demonstrate through extensive experiments that models trained on large-scale rule-following tasks are better rule followers, outperforming the baselines in both downstream fine-tuning and few-shot prompting scenarios. This highlights the cross-task transferability of models with the aid of Meta-RFFT. Furthermore, we examine the influence of factors such as dataset size, rule formulation, and in-context learning.

Title: Control-CLIP: Decoupling Category and Style Guidance in CLIP for Specific-Domain Generation

Authors: Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Jinchao Zhang, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11532
Pdf URL: https://arxiv.org/pdf/2502.11532
Copy Paste: [[2502.11532]] Control-CLIP: Decoupling Category and Style Guidance in CLIP for Specific-Domain Generation(https://arxiv.org/abs/2502.11532)
Keywords: diffusion
Abstract: Text-to-image diffusion models have shown remarkable capabilities of generating high-quality images closely aligned with textual inputs. However, the effectiveness of text guidance heavily relies on the CLIP text encoder, which is trained to pay more attention to general content but struggles to capture semantics in specific domains like styles. As a result, generation models tend to fail on prompts like "a photo of a cat in Pokemon style", simply producing images depicting "a photo of a cat". To fill this gap, we propose Control-CLIP, a novel decoupled CLIP fine-tuning framework that enables the CLIP model to learn the meaning of category and style in a complementary manner. With specially designed fine-tuning tasks on minimal data and a modified cross-attention mechanism, Control-CLIP can precisely guide the diffusion model to a specific domain. Moreover, the parameters of the diffusion model remain entirely unchanged, preserving the original generation performance and diversity. Experiments across multiple domains confirm the effectiveness of our approach, particularly highlighting its robust plug-and-play capability in generating content with various specific styles.

Title: DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection

Authors: Yingli Shen, Wen Lai, Shuo Wang, Xueren Zhang, Kangyang Luo, Alexander Fraser, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11546
Pdf URL: https://arxiv.org/pdf/2502.11546
Copy Paste: [[2502.11546]] DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection(https://arxiv.org/abs/2502.11546)
Keywords: anomaly
Abstract: The rapid development of multilingual large language models (LLMs) highlights the need for high-quality, diverse, and clean multilingual datasets. In this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a large-scale multilingual corpus built using newly extracted Common Crawl data and existing multilingual datasets. DCAD-2000 includes over 2,282 languages, 46.72TB of data, and 8.63 billion documents, spanning 155 high- and medium-resource languages and 159 writing scripts. To overcome the limitations of current data cleaning methods, which rely on manual heuristic thresholds, we propose reframing data cleaning as an anomaly detection task. This dynamic filtering approach significantly enhances data quality by identifying and removing noisy or anomalous content. We evaluate the quality of DCAD-2000 on the FineTask benchmark, demonstrating substantial improvements in multilingual dataset quality and task performance.

Title: Continuous Diffusion Model for Language Modeling

Authors: Jaehyeong Jo, Sung Ju Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11564
Pdf URL: https://arxiv.org/pdf/2502.11564
Copy Paste: [[2502.11564]] Continuous Diffusion Model for Language Modeling(https://arxiv.org/abs/2502.11564)
Keywords: diffusion
Abstract: Diffusion models have emerged as a promising alternative to autoregressive models in modeling discrete categorical data. Yet diffusion models that directly work on discrete data space do not fully exploit the power of iterative refinement, as the signals are lost during the transition between discrete states. Existing continuous diffusion models for discrete data have limited performance compared to discrete approaches, and the unclear link between them restricts the development of diffusion models for discrete data. In this work, we propose a continuous diffusion model for language modeling that incorporates the geometry of the underlying categorical distribution. We establish a connection between the discrete diffusion and continuous flow on the statistical manifold, and building on the analogy, we introduce a simple design for the diffusion process that generalizes previous discrete diffusion models. We further propose a simulation-free training framework based on radial symmetry and a simple technique to address the high dimensionality of the manifold. Comprehensive experiments on language modeling benchmarks and other modalities show that our method outperforms existing discrete diffusion models and approaches the performance of autoregressive models. Code is available at this https URL.

Title: Towards a Trustworthy Anomaly Detection for Critical Applications through Approximated Partial AUC Loss

Authors: Arnaud Bougaham, Benoît Frénay
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2502.11570
Pdf URL: https://arxiv.org/pdf/2502.11570
Copy Paste: [[2502.11570]] Towards a Trustworthy Anomaly Detection for Critical Applications through Approximated Partial AUC Loss(https://arxiv.org/abs/2502.11570)
Keywords: anomaly
Abstract: Anomaly Detection is a crucial step for critical applications such as those in the industrial, medical or cybersecurity domains. These sectors share the same requirement of handling the different types of classification errors differently. Indeed, even if false positives are acceptable, false negatives are not, because they would reflect a missed detection of a quality issue, a disease or a cyber threat. To fulfill this requirement, we propose a method that dynamically applies a trustworthy approximated partial AUC ROC loss (tapAUC). A binary classifier is trained to optimize the specific range of the AUC ROC curve that prevents the True Positive Rate (TPR) from reaching 100% while minimizing the False Positive Rate (FPR). The optimal threshold that does not trigger any false negative is then kept and used at the test step. The results show a TPR of 92.52% at a 20.43% FPR for an average across 6 datasets, representing a TPR improvement of 4.3% for an FPR cost of 12.2% against other state-of-the-art methods. The code is available at this https URL.

Title: Syllables to Scenes: Literary-Guided Free-Viewpoint 3D Scene Synthesis from Japanese Haiku

Authors: Chunan Yu, Yidong Han, Chaotao Ding, Ying Zang, Lanyun Zhu, Xinhao Chen, Zejian Li, Renjun Xu, Tianrun Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11586
Pdf URL: https://arxiv.org/pdf/2502.11586
Copy Paste: [[2502.11586]] Syllables to Scenes: Literary-Guided Free-Viewpoint 3D Scene Synthesis from Japanese Haiku(https://arxiv.org/abs/2502.11586)
Keywords: diffusion, generative
Abstract: In the era of the metaverse, where immersive technologies redefine human experiences, translating abstract literary concepts into navigable 3D environments presents a fundamental challenge in preserving semantic and emotional fidelity. This research introduces HaikuVerse, a novel framework for transforming poetic abstraction into spatial representation, with Japanese Haiku serving as an ideal test case due to its sophisticated encapsulation of profound emotions and imagery within minimal text. While existing text-to-3D methods struggle with nuanced interpretations, we present a literary-guided approach that synergizes traditional poetry analysis with advanced generative technologies. Our framework centers on two key innovations: (1) Hierarchical Literary-Criticism Theory Grounded Parsing (H-LCTGP), which captures both explicit imagery and implicit emotional resonance through structured semantic decomposition, and (2) Progressive Dimensional Synthesis (PDS), a multi-stage pipeline that systematically transforms poetic elements into coherent 3D scenes through sequential diffusion processes, geometric optimization, and real-time enhancement. Extensive experiments demonstrate that HaikuVerse significantly outperforms conventional text-to-3D approaches in both literary fidelity and visual quality, establishing a new paradigm for preserving cultural heritage in immersive digital spaces. Project website at: this https URL

Title: iMOVE: Instance-Motion-Aware Video Understanding

Authors: Jiaze Li, Yaya Shi, Zongyang Ma, Haoran Xu, Feng Cheng, Huihui Xiao, Ruiwen Kang, Fan Yang, Tingting Gao, Di Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11594
Pdf URL: https://arxiv.org/pdf/2502.11594
Copy Paste: [[2502.11594]] iMOVE: Instance-Motion-Aware Video Understanding(https://arxiv.org/abs/2502.11594)
Keywords: foundation model
Abstract: Enhancing the fine-grained instance spatiotemporal motion perception capabilities of Video Large Language Models is crucial for improving their temporal and general video understanding. However, current models struggle to perceive detailed and complex instance motions. To address these challenges, we have made improvements from both data and model perspectives. In terms of data, we have meticulously curated iMOVE-IT, the first large-scale instance-motion-aware video instruction-tuning dataset. This dataset is enriched with comprehensive instance motion annotations and spatiotemporal mutual-supervision tasks, providing extensive training for the model's instance-motion-awareness. Building on this foundation, we introduce iMOVE, an instance-motion-aware video foundation model that utilizes Event-aware Spatiotemporal Efficient Modeling to retain informative instance spatiotemporal motion details while maintaining computational efficiency. It also incorporates Relative Spatiotemporal Position Tokens to ensure awareness of instance spatiotemporal positions. Evaluations indicate that iMOVE excels not only in video temporal understanding and general video understanding but also demonstrates significant advantages in long-term video understanding.

Title: GraphThought: Graph Combinatorial Optimization with Thought Generation

Authors: Zixiao Huang, Lifeng Guo, Junjie Sheng, Haosheng Chen, Wenhao Li, Bo Jin, Changhong Lu, Xiangfeng Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11607
Pdf URL: https://arxiv.org/pdf/2502.11607
Copy Paste: [[2502.11607]] GraphThought: Graph Combinatorial Optimization with Thought Generation(https://arxiv.org/abs/2502.11607)
Keywords: generative
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across various domains, especially in text processing and generative tasks. Recent advancements in the reasoning capabilities of state-of-the-art LLMs, such as OpenAI-o1, have significantly broadened their applicability, particularly in complex problem-solving and logical inference. However, most existing LLMs struggle with notable limitations in handling graph combinatorial optimization (GCO) problems. To bridge this gap, we formally define the Optimal Thoughts Design (OTD) problem, including its state and action thought space. We then introduce a novel framework, GraphThought, designed to generate high-quality thought datasets for GCO problems. Leveraging these datasets, we fine-tune the Llama-3-8B-Instruct model to develop Llama-GT. Notably, despite its compact 8B-parameter architecture, Llama-GT matches the performance of state-of-the-art LLMs on the GraphArena benchmark. Experimental results show that our approach outperforms both proprietary and open-source models, even rivaling specialized models like o1-mini. This work sets a new state-of-the-art benchmark while challenging the prevailing notion that model scale is the primary driver of reasoning capability.

Title: Maximum Entropy Reinforcement Learning with Diffusion Policy

Authors: Xiaoyi Dong, Jian Cheng, Xi Sheryl Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11612
Pdf URL: https://arxiv.org/pdf/2502.11612
Copy Paste: [[2502.11612]] Maximum Entropy Reinforcement Learning with Diffusion Policy(https://arxiv.org/abs/2502.11612)
Keywords: diffusion, generative
Abstract: The Soft Actor-Critic (SAC) algorithm with a Gaussian policy has become a mainstream implementation for realizing the Maximum Entropy Reinforcement Learning (MaxEnt RL) objective, which incorporates entropy maximization to encourage exploration and enhance policy robustness. While the Gaussian policy performs well on simpler tasks, its exploration capacity and potential performance in complex multi-goal RL environments are limited by its inherent unimodality. In this paper, we employ the diffusion model, a powerful generative model capable of capturing complex multimodal distributions, as the policy representation to fulfill the MaxEnt RL objective, developing a method named MaxEnt RL with Diffusion Policy (MaxEntDP). Our method enables efficient exploration and brings the policy closer to the optimal MaxEnt policy. Experimental results on MuJoCo benchmarks show that MaxEntDP outperforms the Gaussian policy and other generative models within the MaxEnt RL framework, and performs comparably to other state-of-the-art diffusion-based online RL algorithms. Our code is available at this https URL.

Title: In-Context Parametric Inference: Point or Distribution Estimators?

Authors: Sarthak Mittal, Yoshua Bengio, Nikolay Malkin, Guillaume Lajoie
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2502.11617
Pdf URL: https://arxiv.org/pdf/2502.11617
Copy Paste: [[2502.11617]] In-Context Parametric Inference: Point or Distribution Estimators?(https://arxiv.org/abs/2502.11617)
Keywords: diffusion, in-context
Abstract: Bayesian and frequentist inference are two fundamental paradigms in statistical estimation. Bayesian methods treat hypotheses as random variables, incorporating priors and updating beliefs via Bayes' theorem, whereas frequentist methods assume fixed but unknown hypotheses, relying on estimators like maximum likelihood. While extensive research has compared these approaches, the frequentist paradigm of obtaining point estimates has become predominant in deep learning, as Bayesian inference is challenging due to the computational complexity and the approximation gap of posterior estimation methods. However, a good understanding of trade-offs between the two approaches is lacking in the regime of amortized estimators, where in-context learners are trained to estimate either point values via maximum likelihood or maximum a posteriori estimation, or full posteriors using normalizing flows, score-based diffusion samplers, or diagonal Gaussian approximations, conditioned on observations. To help resolve this, we conduct a rigorous comparative analysis spanning diverse problem settings, from linear models to shallow neural networks, with a robust evaluation framework assessing both in-distribution and out-of-distribution generalization on tractable tasks. Our experiments indicate that amortized point estimators generally outperform posterior inference, though the latter remain competitive in some low-dimensional problems, and we further discuss why this might be the case.

Title: Membership Inference Attacks for Face Images Against Fine-Tuned Latent Diffusion Models

Authors: Lauritz Christian Holme, Anton Mosquera Storgaard, Siavash Arjomand Bigdeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11619
Pdf URL: https://arxiv.org/pdf/2502.11619
Copy Paste: [[2502.11619]] Membership Inference Attacks for Face Images Against Fine-Tuned Latent Diffusion Models(https://arxiv.org/abs/2502.11619)
Keywords: diffusion, generative
Abstract: The rise of generative image models leads to privacy concerns when it comes to the huge datasets used to train such models. This paper investigates the possibility of inferring if a set of face images was used for fine-tuning a Latent Diffusion Model (LDM). A Membership Inference Attack (MIA) method is presented for this task. Using generated auxiliary data for the training of the attack model leads to significantly better performance, and so does the use of watermarks. The guidance scale used for inference was found to have a significant influence. If an LDM is fine-tuned for long enough, the text prompt used for inference has no significant influence. The proposed MIA is found to be viable in a realistic black-box setup against LDMs fine-tuned on face images.

Title: GaussianMotion: End-to-End Learning of Animatable Gaussian Avatars with Pose Guidance from Text

Authors: Gyumin Shim, Sangmin Lee, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11642
Pdf URL: https://arxiv.org/pdf/2502.11642
Copy Paste: [[2502.11642]] GaussianMotion: End-to-End Learning of Animatable Gaussian Avatars with Pose Guidance from Text(https://arxiv.org/abs/2502.11642)
Keywords: diffusion
Abstract: In this paper, we introduce GaussianMotion, a novel human rendering model that generates fully animatable scenes aligned with textual descriptions using Gaussian Splatting. Although existing methods achieve reasonable text-to-3D generation of human bodies using various 3D representations, they often face limitations in fidelity and efficiency, or primarily focus on static models with limited pose control. In contrast, our method generates fully animatable 3D avatars by combining deformable 3D Gaussian Splatting with text-to-3D score distillation, achieving high fidelity and efficient rendering for arbitrary poses. By densely generating diverse random poses during optimization, our deformable 3D human model learns to capture a wide range of natural motions distilled from a pose-conditioned diffusion model in an end-to-end manner. Furthermore, we propose Adaptive Score Distillation that effectively balances realistic detail and smoothness to achieve optimal 3D results. Experimental results demonstrate that our approach outperforms existing baselines by producing high-quality textures in both static and animated results, and by generating diverse 3D human models from various textual inputs.

Title: Hyperspherical Energy Transformer with Recurrent Depth

Authors: Yunzhe Hu, Difan Zou, Dong Xu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11646
Pdf URL: https://arxiv.org/pdf/2502.11646
Copy Paste: [[2502.11646]] Hyperspherical Energy Transformer with Recurrent Depth(https://arxiv.org/abs/2502.11646)
Keywords: foundation model
Abstract: Transformer-based foundation models have achieved unprecedented success with a gigantic number of parameters and computational resources. Yet the core building blocks of these models, the Transformer layers, and how they are arranged and configured are primarily engineered from the bottom up and driven by heuristics. Advancing next-generation architectures demands exploring a prototypical model that is highly interpretable and practically competent. To this end, we take a top-down view and design neural networks from an energy minimization perspective. Specifically, to promote an isotropic token distribution on the sphere, we formulate a modified Hopfield energy function on the subspace-embedded hypersphere, based on which Transformer layers with symmetric structures are designed as the iterative optimization of the energy function. By integrating layers with the same parameters, we propose Hyper-Spherical Energy Transformer (Hyper-SET), an alternative to the vanilla Transformer with recurrent depth. This design inherently provides greater interpretability and allows for scaling to deeper layers without a significant increase in the number of parameters. We also empirically demonstrate that Hyper-SET achieves comparable or even superior performance on both synthetic and real-world tasks, such as solving Sudoku and masked image modeling, while utilizing fewer parameters.

Title: Object-Centric Image to Video Generation with Language Guidance

Authors: Angel Villar-Corrales, Gjergj Plepi, Sven Behnke
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11655
Pdf URL: https://arxiv.org/pdf/2502.11655
Copy Paste: [[2502.11655]] Object-Centric Image to Video Generation with Language Guidance(https://arxiv.org/abs/2502.11655)
Keywords: generative
Abstract: Accurate and flexible world models are crucial for autonomous systems to understand their environment and predict future events. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and interactions, but often face challenges in scaling to complex datasets and incorporating external guidance, limiting their applicability in robotics. To address these limitations, we propose TextOCVP, an object-centric model for image-to-video generation guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, thus leading to accurate and controllable predictions. Our method's structured latent space offers enhanced control over the prediction process, outperforming several image-to-video generative baselines. Additionally, we demonstrate that structured object-centric representations provide superior controllability and interpretability, facilitating the modeling of object dynamics and enabling more precise and understandable predictions. Videos and code are available at this https URL.

Title: MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction

Authors: Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11663
Pdf URL: https://arxiv.org/pdf/2502.11663
Copy Paste: [[2502.11663]] MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction(https://arxiv.org/abs/2502.11663)
Keywords: diffusion, generative
Abstract: World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models mainly build on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we address this problem by combining generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask construction task; (2) diffusion-related mask tokens that deal with the fuzzy relations between mask reconstruction and the generative diffusion process; (3) an extension of the mask construction task to the spatial-temporal domain, using a row-wise mask for shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on the above improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model has two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including normal validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.

Title: RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars

Authors: Yuncheng Hua, Lizhen Qu, Zhuang Li, Hao Xue, Flora D. Salim, Gholamreza Haffari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11681
Pdf URL: https://arxiv.org/pdf/2502.11681
Copy Paste: [[2502.11681]] RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars(https://arxiv.org/abs/2502.11681)
Keywords: in-context
Abstract: Alignment tuning is crucial for ensuring large language models (LLMs) behave ethically and helpfully. Current alignment approaches require high-quality annotations and significant training resources. This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment. Through an analysis of high-quality ICL demos, we identified style as a key factor influencing LLM alignment capabilities and explicitly restyled ICL exemplars based on this stylistic framework. Additionally, we combined the restyled demos to achieve a balance between the two conflicting aspects of LLM alignment--factuality and safety. We packaged the restyled examples as prompts to trigger few-shot learning, improving LLM alignment. Compared to the best baseline approach, with an average score of 5.00 as the maximum, our method achieves a maximum 0.10 increase on the Alpaca task (from 4.50 to 4.60), a 0.22 enhancement on the Just-eval benchmark (from 4.34 to 4.56), and a maximum improvement of 0.32 (from 3.53 to 3.85) on the MT-Bench dataset. We release the code and data at this https URL.

Title: Improve LLM-as-a-Judge Ability as a General Ability

Authors: Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11689
Pdf URL: https://arxiv.org/pdf/2502.11689
Copy Paste: [[2502.11689]] Improve LLM-as-a-Judge Ability as a General Ability(https://arxiv.org/abs/2502.11689)
Keywords: generative
Abstract: LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values, ensuring ethical and reliable AI outputs that align with societal norms. Recent studies have proposed many methods to train LLMs as generative judges, but most of them are data-intensive or lack accuracy, and focus only on the LLM's judge ability. In this work, we regard judge ability as a general ability of LLMs and implement a two-stage training approach, comprising supervised fine-tuning (SFT) warm-up and direct preference optimization (DPO) enhancement, to achieve judge style adaptation and improve judgment accuracy. Additionally, we introduce an efficient data synthesis method to generate judgmental content. Experimental results demonstrate that our approach, utilizing only about 2% to 40% of the data required by other methods, achieves SOTA performance on RewardBench. Furthermore, our training method enhances the general capabilities of the model by constructing complicated judge tasks, and in our tests of optimizing a policy model with the judge model, the judge signals provided by our model significantly enhanced the downstream DPO training performance of our internal models. We also open-source our model weights and training data to facilitate further research.

Title: MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow

Authors: Hanzhuo Huang, Yuan Liu, Ge Zheng, Jiepeng Wang, Zhiyang Dou, Sibei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11697
Pdf URL: https://arxiv.org/pdf/2502.11697
Copy Paste: [[2502.11697]] MVTokenFlow: High-quality 4D Content Generation using Multiview Token Flow(https://arxiv.org/abs/2502.11697)
Keywords: diffusion, generative
Abstract: In this paper, we present MVTokenFlow for high-quality 4D content creation from monocular videos. Recent advancements in generative models such as video diffusion models and multiview diffusion models enable us to create videos or 3D models. However, extending these generative models to dynamic 4D content creation is still a challenging task that requires the generated content to be consistent spatially and temporally. To address this challenge, MVTokenFlow utilizes the multiview diffusion model to generate multiview images at different timesteps, which attains spatial consistency across different viewpoints and allows us to reconstruct a reasonable coarse 4D field. Then, MVTokenFlow further regenerates all the multiview images using the rendered 2D flows as guidance. The 2D flows effectively associate pixels from different timesteps and improve the temporal consistency by reusing tokens in the regeneration process. Finally, the regenerated images are spatiotemporally consistent and are used to refine the coarse 4D field into a high-quality 4D field. Experiments demonstrate the effectiveness of our design and show significantly improved quality over baseline methods.

Title: The Worse The Better: Content-Aware Viewpoint Generation Network for Projection-related Point Cloud Quality Assessment

Authors: Zhiyong Su, Bingxu Xie, Zheng Li, Jincan Wu, Weiqing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11710
Pdf URL: https://arxiv.org/pdf/2502.11710
Copy Paste: [[2502.11710]] The Worse The Better: Content-Aware Viewpoint Generation Network for Projection-related Point Cloud Quality Assessment(https://arxiv.org/abs/2502.11710)
Keywords: self-supervised
Abstract: Through experimental studies, we observed that the final predicted quality scores of projection-related point cloud quality assessment (PCQA) approaches are unstable and change significantly across viewpoint settings. Inspired by the "wooden barrel theory", and given the default content-independent viewpoints of existing projection-related PCQA approaches, this paper presents a novel content-aware viewpoint generation network (CAVGN) that learns better viewpoints by taking the distribution of geometric and attribute features of degraded point clouds into consideration. First, the proposed CAVGN extracts multi-scale geometric and texture features of the entire input point cloud. Then, for each default content-independent viewpoint, the extracted geometric and texture features are refined to focus on the corresponding visible part of the input point cloud. Finally, the refined geometric and texture features are concatenated to generate an optimized viewpoint. To train the proposed CAVGN, we present a self-supervised viewpoint ranking network (SSVRN) that selects the viewpoint with the worst-quality projected image to construct a default-optimized viewpoint dataset, which consists of thousands of paired default viewpoints and corresponding optimized viewpoints. Experimental results show that projection-related PCQA methods can achieve higher performance using the viewpoints generated by the proposed CAVGN.

Title: Component-aware Unsupervised Logical Anomaly Generation for Industrial Anomaly Detection

Authors: Xuan Tong, Yang Chang, Qing Zhao, Jiawen Yu, Boyang Wang, Junxiong Lin, Yuxuan Lin, Xinji Mai, Haoran Wang, Zeng Tao, Yan Wang, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11712
Pdf URL: https://arxiv.org/pdf/2502.11712
Copy Paste: [[2502.11712]] Component-aware Unsupervised Logical Anomaly Generation for Industrial Anomaly Detection(https://arxiv.org/abs/2502.11712)
Keywords: generative, anomaly
Abstract: Anomaly detection is critical in industrial manufacturing for ensuring product quality and improving efficiency in automated processes. The scarcity of anomalous samples limits traditional detection methods, making anomaly generation essential for expanding the data repository. However, recent generative models often produce unrealistic anomalies increasing false positives, or require real-world anomaly samples for training. In this work, we treat anomaly generation as a compositional problem and propose ComGEN, a component-aware and unsupervised framework that addresses the gap in logical anomaly generation. Our method comprises a multi-component learning strategy to disentangle visual components, followed by subsequent generation editing procedures. Disentangled text-to-component pairs, revealing intrinsic logical constraints, conduct attention-guided residual mapping and model training with iteratively matched references across multiple scales. Experiments on the MVTecLOCO dataset confirm the efficacy of ComGEN, achieving the best AUROC score of 91.2%. Additional experiments on the real-world scenario of Diesel Engine and widely-used MVTecAD dataset demonstrate significant performance improvements when integrating simulated anomalies generated by ComGEN into automated production workflows.

Title: Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing

Authors: Site Qu, Guoqiang Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11715
Pdf URL: https://arxiv.org/pdf/2502.11715
Copy Paste: [[2502.11715]] Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing(https://arxiv.org/abs/2502.11715)
Keywords: generative
Abstract: The Location-Routing Problem (LRP), which combines the challenges of facility (depot) locating and vehicle route planning, is critically constrained by its reliance on predefined depot candidates, which limits the solution space and can lead to suboptimal outcomes. Previous research on LRP without predefined depots is scant and predominantly relies on heuristic algorithms that iteratively attempt depot placements across a planar area. Such approaches lack the ability to proactively generate depot locations that meet specific geographic requirements, revealing a notable gap in the current research landscape. To bridge this gap, we propose a data-driven generative DRL framework designed to proactively generate depots for LRP without predefined depot candidates, based solely on customer request data, which include geographic and demand information. It can operate in two distinct modes: direct generation of exact depot locations, and creation of a multivariate Gaussian distribution for flexible depot sampling. By extracting the geographic pattern of depots from customer request data, our approach can dynamically respond to logistical needs, identifying high-quality depot locations that further reduce total routing costs compared to traditional methods. Extensive experiments demonstrate that, for the same group of customer requests, our framework proactively generates depots that lead to superior solution routes with lower routing cost than depots identified through random attempts. The implications of our framework potentially extend to real-world applications, particularly emergency medical rescue and disaster relief logistics, where rapid establishment and adjustment of depot locations are paramount, showcasing its potential for addressing LRP in dynamic and unpredictable environments.

Title: ILIAS: Instance-Level Image retrieval At Scale

Authors: Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas, Ondřej Chum, Giorgos Tolias
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11748
Pdf URL: https://arxiv.org/pdf/2502.11748
Copy Paste: [[2502.11748]] ILIAS: Instance-Level Image retrieval At Scale(https://arxiv.org/abs/2502.11748)
Keywords: foundation model
Abstract: This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. It is designed to evaluate the ability of current and future foundation models and retrieval techniques to recognize particular objects. The key benefits over existing datasets include large scale, domain diversity, accurate ground truth, and a performance that is far from saturated. ILIAS includes query and positive images for 1,000 object instances, manually collected to capture challenging conditions and diverse domains. Large-scale retrieval is conducted against 100 million distractor images from YFCC100M. To avoid false negatives without extra annotation effort, we include only query objects confirmed to have emerged after 2014, i.e. the compilation date of YFCC100M. An extensive benchmarking is performed with the following observations: i) models fine-tuned on specific domains, such as landmarks or products, excel in that domain but fail on ILIAS; ii) learning a linear adaptation layer using multi-domain class supervision results in performance improvements, especially for vision-language models; iii) local descriptors in retrieval re-ranking are still a key ingredient, especially in the presence of severe background clutter; iv) the text-to-image performance of the vision-language foundation models is surprisingly close to the corresponding image-to-image case. Website: this https URL

Title: Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning

Authors: Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11751
Pdf URL: https://arxiv.org/pdf/2502.11751
Copy Paste: [[2502.11751]] Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning(https://arxiv.org/abs/2502.11751)
Keywords: in-context
Abstract: Although Large Language Models (LLMs) excel in reasoning and generation for language tasks, they are not specifically designed for multimodal challenges. Training Multimodal Large Language Models (MLLMs), however, is resource-intensive and constrained by various training limitations. In this paper, we propose the Modular-based Visual Contrastive Decoding (MVCD) framework to remove this obstacle. Our framework leverages LLMs' In-Context Learning (ICL) capability and the proposed visual contrastive-example decoding (CED), specifically tailored for this framework, without requiring any additional training. By converting visual signals into text and focusing on contrastive output distributions during decoding, we can highlight the new information introduced by contextual examples, explore their connections, and avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual perception so that they can see and reason over the input visuals. To demonstrate MVCD's effectiveness, we conduct experiments with four LLMs across five question answering datasets. Our results not only show consistent improvement in model accuracy but also explain the effective components of our decoding strategy. Our code will be available at this https URL.

Title: BackdoorDM: A Comprehensive Benchmark for Backdoor Learning in Diffusion Model

Authors: Weilin Lin, Nanjun Zhou, Yanyun Wang, Jianze Li, Hui Xiong, Li Liu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2502.11798
Pdf URL: https://arxiv.org/pdf/2502.11798
Copy Paste: [[2502.11798]] BackdoorDM: A Comprehensive Benchmark for Backdoor Learning in Diffusion Model(https://arxiv.org/abs/2502.11798)
Keywords: diffusion
Abstract: Backdoor learning is a critical research topic for understanding the vulnerabilities of deep neural networks. While it has been extensively studied in discriminative models over the past few years, backdoor learning in diffusion models (DMs) has recently attracted increasing attention, becoming a new research hotspot. Although many different backdoor attack and defense methods have been proposed for DMs, a comprehensive benchmark for backdoor learning in DMs is still lacking. This absence makes it difficult to conduct fair comparisons and thoroughly evaluate existing approaches, thus hindering future research progress. To address this issue, we propose BackdoorDM, the first comprehensive benchmark designed for backdoor learning in DMs. It comprises nine state-of-the-art (SOTA) attack methods, four SOTA defense strategies, and two helpful visualization analysis tools. We first systematically classify and formulate the existing literature in a unified framework, focusing on three different backdoor attack types and five backdoor target types, which are restricted to a single type in discriminative models. Then, we systematically summarize the evaluation metrics for each type and propose a unified backdoor evaluation method based on GPT-4o. Finally, we conduct a comprehensive evaluation and highlight several important conclusions. We believe that BackdoorDM will help overcome current barriers and contribute to building a trustworthy DM community. The code is released at this https URL.

Title: Intuitive physics understanding emerges from self-supervised pretraining on natural videos

Authors: Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11831
Pdf URL: https://arxiv.org/pdf/2502.11831
Copy Paste: [[2502.11831]] Intuitive physics understanding emerges from self-supervised pretraining on natural videos(https://arxiv.org/abs/2502.11831)
Keywords: self-supervised
Abstract: We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.

Title: Enhanced Anomaly Detection in IoMT Networks using Ensemble AI Models on the CICIoMT2024 Dataset

Authors: Prathamesh Chandekar, Mansi Mehta, Swet Chandan
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2502.11854
Pdf URL: https://arxiv.org/pdf/2502.11854
Copy Paste: [[2502.11854]] Enhanced Anomaly Detection in IoMT Networks using Ensemble AI Models on the CICIoMT2024 Dataset(https://arxiv.org/abs/2502.11854)
Keywords: anomaly
Abstract: The rapid proliferation of Internet of Medical Things (IoMT) devices in healthcare has introduced unique cybersecurity challenges, primarily due to the diverse communication protocols and critical nature of these devices. This research aims to develop an advanced, real-time anomaly detection framework tailored for IoMT network traffic, leveraging AI/ML models and the CICIoMT2024 dataset. By integrating multi-protocol (MQTT, WiFi), attack-specific (DoS, DDoS), time-series (active/idle states), and device-specific (Bluetooth) data, our study captures a comprehensive range of IoMT interactions. As part of our data analysis, various machine learning techniques are employed, including an ensemble model using XGBoost for improved performance against specific attack types, sequential models comprising LSTM and CNN-LSTM that leverage time dependencies, and unsupervised models such as Autoencoders and Isolation Forest that perform well at general anomaly detection. The experimental results show that the ensemble model lowers false positive rates and reduces missed detections.

Title: Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu

Authors: Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2502.11862
Pdf URL: https://arxiv.org/pdf/2502.11862
Copy Paste: [[2502.11862]] Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu(https://arxiv.org/abs/2502.11862)
Keywords: in-context
Abstract: In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning (ICL) capability. However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an encrypted version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap the conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.

Title: DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation

Authors: Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2502.11897
Pdf URL: https://arxiv.org/pdf/2502.11897
Copy Paste: [[2502.11897]] DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation(https://arxiv.org/abs/2502.11897)
Keywords: generative
Abstract: In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that can make use of adaptive temporal compression in latent space. While existing video generative models apply fixed compression rates via pretrained VAE, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to the content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines optimal frame rates based on information-theoretic content complexity, and (2) A training-free adaptation mechanism that transforms pretrained VAE architectures into a dynamic VAE that can process features with variable frame rates. Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models and accelerating the video generation process.

Title: Continual Learning Should Move Beyond Incremental Classification

Authors: Rupert Mitchell, Antonio Alliegro, Raffaello Camoriano, Dustin Carrión-Ojeda, Antonio Carta, Georgia Chalvatzaki, Nikhil Churamani, Carlo D'Eramo, Samin Hamidi, Robin Hesse, Fabian Hinder, Roshni Ramanna Kamath, Vincenzo Lomonaco, Subarnaduti Paul, Francesca Pistilli, Tinne Tuytelaars, Gido M van de Ven, Kristian Kersting, Simone Schaub-Meyer, Martin Mundt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2502.11927
Pdf URL: https://arxiv.org/pdf/2502.11927
Copy Paste: [[2502.11927]] Continual Learning Should Move Beyond Incremental Classification(https://arxiv.org/abs/2502.11927)
Keywords: generative
Abstract: Continual learning (CL) is the sub-field of machine learning concerned with accumulating knowledge in dynamic environments. So far, CL research has mainly focused on incremental classification tasks, where models learn to classify new categories while retaining knowledge of previously learned ones. Here, we argue that maintaining such a focus limits both theoretical development and practical applicability of CL methods. Through a detailed analysis of concrete examples - including multi-target classification, robotics with constrained output spaces, learning in continuous task domains, and higher-level concept memorization - we demonstrate how current CL approaches often fail when applied beyond standard classification. We identify three fundamental challenges: (C1) the nature of continuity in learning problems, (C2) the choice of appropriate spaces and metrics for measuring similarity, and (C3) the role of learning objectives beyond classification. For each challenge, we provide specific recommendations to help move the field forward, including formalizing temporal dynamics through distribution processes, developing principled approaches for continuous task spaces, and incorporating density estimation and generative objectives. In so doing, this position paper aims to broaden the scope of CL research while strengthening its theoretical foundations, making it more applicable to real-world problems.

Title: Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

Authors: Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Brian Li, Changyi Wan, Hanpeng Hu, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Kang An, Wei Ji, Wen Li, Xuan Wen, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chengting Feng, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Jianchang Wu, Jiahong Liu, Jianjian Sun, Jiangjie Zhen, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Shaoliang Pang, Shiliang Yang, Shuli Gao, Siqi Liu, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wenqing He, Wen Sun, Xin Han, Xiaomin Deng, Xiaojia Liu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaqiang Shi, Yilei Wang, Yinmin Zhong
Subjects: cs.CL, cs.AI, cs.HC, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2502.11946
Pdf URL: https://arxiv.org/pdf/2502.11946
Copy Paste: [[2502.11946]] Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction(https://arxiv.org/abs/2502.11946)
Keywords: generative
Abstract: Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks like LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at this https URL.

Title: Image Inversion: A Survey from GANs to Diffusion and Beyond

Authors: Yinan Chen, Jiangning Zhang, Yali Bi, Xiaobin Hu, Teng Hu, Zhucun Xue, Ran Yi, Yong Liu, Ying Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.11974
Pdf URL: https://arxiv.org/pdf/2502.11974
Copy Paste: [[2502.11974]] Image Inversion: A Survey from GANs to Diffusion and Beyond(https://arxiv.org/abs/2502.11974)
Keywords: diffusion, generative
Abstract: Image inversion is a fundamental task in generative models, aiming to map images back to their latent representations to enable downstream applications such as editing, restoration, and style transfer. This paper provides a comprehensive review of the latest advancements in image inversion techniques, focusing on two main paradigms: Generative Adversarial Network (GAN) inversion and diffusion model inversion. We categorize these techniques based on their optimization methods. For GAN inversion, we systematically classify existing methods into encoder-based approaches, latent optimization approaches, and hybrid approaches, analyzing their theoretical foundations, technical innovations, and practical trade-offs. For diffusion model inversion, we explore training-free strategies, fine-tuning methods, and the design of additional trainable modules, highlighting their unique advantages and limitations. Additionally, we discuss several popular downstream applications and emerging applications beyond image tasks, identifying current challenges and future research directions. By synthesizing the latest developments, this paper aims to provide researchers and practitioners with a valuable reference resource, promoting further advancements in the field of image inversion. We keep track of the latest works at this https URL.

Title: Unsupervised Structural-Counterfactual Generation under Domain Shift

Authors: Krishn Vishwas Kher, Lokesh Venkata Siva Maruthi Badisa, Kusampudi Venkata Datta Sri Harsha, Chitneedi Geetha Sowmya, SakethaNath Jagarlapudi
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.12013
Pdf URL: https://arxiv.org/pdf/2502.12013
Copy Paste: [[2502.12013]] Unsupervised Structural-Counterfactual Generation under Domain Shift(https://arxiv.org/abs/2502.12013)
Keywords: generative
Abstract: Motivated by the burgeoning interest in cross-domain learning, we present a novel generative modeling challenge: generating counterfactual samples in a target domain based on factual observations from a source domain. Our approach operates within an unsupervised paradigm devoid of parallel or joint datasets, relying exclusively on distinct observational samples and causal graphs for each domain. This setting presents challenges that surpass those of conventional counterfactual generation. Central to our methodology is the disambiguation of exogenous causes into effect-intrinsic and domain-intrinsic categories. This differentiation facilitates the integration of domain-specific causal graphs into a unified joint causal graph via shared effect-intrinsic exogenous variables. We propose leveraging Neural Causal models within this joint framework to enable accurate counterfactual generation under standard identifiability assumptions. Furthermore, we introduce a novel loss function that effectively segregates effect-intrinsic from domain-intrinsic variables during model training. Given a factual observation, our framework combines the posterior distribution of effect-intrinsic variables from the source domain with the prior distribution of domain-intrinsic variables from the target domain to synthesize the desired counterfactuals, adhering to Pearl's causal hierarchy. Intriguingly, when domain shifts are restricted to alterations in causal mechanisms without accompanying covariate shifts, our training regimen parallels the resolution of a conditional optimal transport problem. Empirical evaluations on a synthetic dataset show that our framework generates counterfactuals in the target domain that very closely resemble the ground truth.

Title: HumanGif: Single-View Human Diffusion with Generative Prior

Authors: Shoukang Hu, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Takashi Shibuya, Yuki Mitsufuji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.12080
Pdf URL: https://arxiv.org/pdf/2502.12080
Copy Paste: [[2502.12080]] HumanGif: Single-View Human Diffusion with Generative Prior(https://arxiv.org/abs/2502.12080)
Keywords: diffusion, generative
Abstract: While previous single-view-based 3D human reconstruction methods made significant progress in novel view synthesis, it remains a challenge to synthesize both view-consistent and pose-consistent results for animatable human avatars from a single image input. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with generative prior. Specifically, we formulate the single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif to learn spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformation. Furthermore, we introduce an image-level loss during optimization to bridge the gap between latent and image spaces in diffusion models. Extensive experiments on RenderPeople and DNA-Rendering datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.

Title: Unifying Explainable Anomaly Detection and Root Cause Analysis in Dynamical Systems

Authors: Yue Sun, Rick S. Blum, Parv Venkitasubramaniam
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2502.12086
Pdf URL: https://arxiv.org/pdf/2502.12086
Copy Paste: [[2502.12086]] Unifying Explainable Anomaly Detection and Root Cause Analysis in Dynamical Systems(https://arxiv.org/abs/2502.12086)
Keywords: anomaly
Abstract: Dynamical systems, prevalent in various scientific and engineering domains, are susceptible to anomalies that can significantly impact their performance and reliability. This paper addresses the critical challenges of anomaly detection, root cause localization, and anomaly type classification in dynamical systems governed by ordinary differential equations (ODEs). We define two categories of anomalies: cyber anomalies, which propagate through interconnected variables, and measurement anomalies, which remain localized to individual variables. To address these challenges, we propose the Interpretable Causality Ordinary Differential Equation (ICODE) Networks, a model-intrinsic explainable learning framework. ICODE leverages Neural ODEs for anomaly detection while employing causality inference through an explanation channel to perform root cause analysis (RCA), elucidating why specific time periods are flagged as anomalous. ICODE is designed to simultaneously perform anomaly detection, RCA, and anomaly type classification within a single, interpretable framework. Our approach is grounded in the hypothesis that anomalies alter the underlying ODEs of the system, manifesting as changes in causal relationships between variables. We provide a theoretical analysis of how perturbations in learned model parameters can be utilized to identify anomalies and their root causes in time series data. Comprehensive experimental evaluations demonstrate the efficacy of ICODE across various dynamical systems, showcasing its ability to accurately detect anomalies, classify their types, and pinpoint their origins.

Title: Descriminative-Generative Custom Tokens for Vision-Language Models

Authors: Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.12095
Pdf URL: https://arxiv.org/pdf/2502.12095
Copy Paste: [[2502.12095]] Descriminative-Generative Custom Tokens for Vision-Language Models(https://arxiv.org/abs/2502.12095)
Keywords: generative
Abstract: This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well with words to form new input queries. The targeted concept is specified in terms of a small set of images and a parent concept described using text. We operate on CLIP text features and propose to use a combination of a textual inversion loss and a classification loss to ensure that text features of the learned token are aligned with image features of the concept in the CLIP embedding space. We restrict the learned token to a low-dimensional subspace spanned by tokens for attributes that are appropriate for the given super-class. These modifications improve the quality of compositions of the learned token with natural language for generating new scenes. Further, we show that learned custom tokens can be used to form queries for the text-to-image retrieval task, and also have the important benefit that composite queries can be visualized to ensure that the desired concept is faithfully encoded. Based on this, we introduce the method of Generation Aided Image Retrieval, where the query is modified at inference time to better suit the search intent. On the DeepFashion2 dataset, our method improves Mean Reciprocal Retrieval (MRR) over relevant baselines by 7%.

Title: A-MEM: Agentic Memory for LLM Agents

Authors: Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2502.12110
Pdf URL: https://arxiv.org/pdf/2502.12110
Copy Paste: [[2502.12110]] A-MEM: Agentic Memory for LLM Agents(https://arxiv.org/abs/2502.12110)
Keywords: foundation model
Abstract: While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code is available at this https URL.
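
The note-creation and linking steps described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `MemoryNote` fields, the `add_note` helper, and the keyword-overlap (Jaccard) test are hypothetical stand-ins, not the paper's implementation, which the abstract describes as agent-driven.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNote:
    """Hypothetical structured note: content plus context, keywords, tags, links."""
    note_id: int
    content: str
    context: str
    keywords: set
    tags: set
    links: set = field(default_factory=set)


def jaccard(a, b):
    """Keyword overlap in [0, 1]; a toy stand-in for a real similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def add_note(memory, content, context, keywords, tags, threshold=0.3):
    """Create a note, link it to similar existing notes, and update their tags."""
    note = MemoryNote(len(memory), content, context, set(keywords), set(tags))
    for old in memory:
        if jaccard(note.keywords, old.keywords) >= threshold:
            # Bidirectional link between the new note and the related old note.
            note.links.add(old.note_id)
            old.links.add(note.note_id)
            # Toy "memory evolution": the old note absorbs the new note's tags.
            old.tags |= note.tags
    memory.append(note)
    return note


memory = []
add_note(memory, "Use pytest fixtures for DB setup", "testing advice",
         {"pytest", "fixtures", "database"}, {"testing"})
add_note(memory, "Mock the database in unit tests", "testing advice",
         {"pytest", "database", "mock"}, {"testing", "mocking"})
add_note(memory, "Recipe for sourdough", "cooking",
         {"bread", "flour"}, {"food"})
```

In this toy run the first two notes share two of four distinct keywords, so they are linked and the first note gains the `mocking` tag, while the unrelated third note stays unlinked.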

Title: LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities

Authors: Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2502.12128
Pdf URL: https://arxiv.org/pdf/2502.12128
Copy Paste: [[2502.12128]] LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities(https://arxiv.org/abs/2502.12128)
Keywords: generative
Abstract: Generative models are spearheading recent progress in deep learning, showing strong promise for trajectory sampling in dynamical systems as well. However, while latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems -- from chemical molecule structures to collective human behavior -- are described by interactions of entities, making them inherently linked to connectivity patterns and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), combines the advantages of graph neural networks, i.e., the traceability of entities across time-steps, with the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder are frozen to enable generative modeling in the latent space. The core idea of LaM-SLidE is to introduce identifier representations (IDs) that allow for retrieval of entity properties, e.g., entity coordinates, from latent system representations, thus enabling traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. (Code is available at this https URL)

Title: MagicArticulate: Make Your 3D Models Articulation-Ready

Authors: Chaoyue Song, Jianfeng Zhang, Xiu Li, Fan Yang, Yiwen Chen, Zhongcong Xu, Jun Hao Liew, Xiaoyang Guo, Fayao Liu, Jiashi Feng, Guosheng Lin
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2502.12135
Pdf URL: https://arxiv.org/pdf/2502.12135
Copy Paste: [[2502.12135]] MagicArticulate: Make Your 3D Models Articulation-Ready(https://arxiv.org/abs/2502.12135)
Keywords: diffusion
Abstract: With the explosive growth of 3D content creation, there is an increasing demand for automatically converting static 3D models into articulation-ready versions that support realistic animation. Traditional approaches rely heavily on manual annotation, which is both time-consuming and labor-intensive. Moreover, the lack of large-scale benchmarks has hindered the development of learning-based solutions. In this work, we present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets. Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL. Second, we propose a novel skeleton generation method that formulates the task as a sequence modeling problem, leveraging an auto-regressive transformer to naturally handle varying numbers of bones or joints within skeletons and their inherent dependencies across different 3D models. Third, we predict skinning weights using a functional diffusion process that incorporates volumetric geodesic distance priors between vertices and joints. Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories, achieving high-quality articulation that enables realistic animation. Project page: this https URL.

Title: Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening

Authors: Ye Tian, Ling Yang, Xinchen Zhang, Yunhai Tong, Mengdi Wang, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.12146
Pdf URL: https://arxiv.org/pdf/2502.12146
Copy Paste: [[2502.12146]] Diffusion-Sharpening: Fine-tuning Diffusion Models with Denoising Trajectory Sharpening(https://arxiv.org/abs/2502.12146)
Keywords: diffusion
Abstract: We propose Diffusion-Sharpening, a fine-tuning approach that enhances downstream alignment by optimizing sampling trajectories. Existing RL-based fine-tuning methods focus on single training timesteps and neglect trajectory-level alignment, while recent sampling trajectory optimization methods incur significant inference NFE costs. Diffusion-Sharpening overcomes this by using a path integral framework to select optimal trajectories during training, leveraging reward feedback, and amortizing inference costs. Our method demonstrates superior training efficiency with faster convergence, and the best inference efficiency without requiring additional NFEs. Extensive experiments show that Diffusion-Sharpening outperforms RL-based fine-tuning methods (e.g., Diffusion-DPO) and sampling trajectory optimization methods (e.g., Inference Scaling) across diverse metrics including text alignment, compositional capabilities, and human preferences, offering a scalable and efficient solution for future diffusion model fine-tuning. Code: this https URL

Title: HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation

Authors: Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2502.12148
Pdf URL: https://arxiv.org/pdf/2502.12148
Copy Paste: [[2502.12148]] HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation(https://arxiv.org/abs/2502.12148)
Keywords: foundation model, generative
Abstract: The remarkable success of the autoregressive paradigm has driven significant advances in Multimodal Large Language Models (MLLMs), with powerful models like Show-o, Transfusion and Emu3 achieving notable progress in unified image understanding and generation. For the first time, we uncover a common phenomenon: the understanding capabilities of MLLMs are typically stronger than their generative capabilities, with a significant gap between the two. Building on this insight, we propose HermesFlow, a simple yet general framework designed to seamlessly bridge the gap between understanding and generation in MLLMs. Specifically, we take the homologous data as input to curate homologous preference data of both understanding and generation. Through Pair-DPO and self-play iterative optimization, HermesFlow effectively aligns multimodal understanding and generation using homologous preference data. Extensive experiments demonstrate the significant superiority of our approach over prior methods, particularly in narrowing the gap between multimodal understanding and generation. These findings highlight the potential of HermesFlow as a general alignment framework for next-generation multimodal foundation models. Code: this https URL

Title: Diffusion Models without Classifier-free Guidance

Authors: Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2502.12154
Pdf URL: https://arxiv.org/pdf/2502.12154
Copy Paste: [[2502.12154]] Diffusion Models without Classifier-free Guidance(https://arxiv.org/abs/2502.12154)
Keywords: diffusion
Abstract: This paper presents Model-guidance (MG), a novel objective for training diffusion models that addresses and removes the need for the commonly used Classifier-free guidance (CFG). Our innovative approach transcends the standard modeling of solely the data distribution to incorporate the posterior probability of conditions. The proposed technique originates from the idea of CFG and is easy yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates the training process, doubles the inference speed, and achieves exceptional quality that parallels and even surpasses concurrent diffusion models with CFG. Extensive experiments demonstrate its effectiveness, efficiency, and scalability on different models and datasets. Finally, we establish state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34. Our code is available at this https URL.
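
For context on the inference-speed claim, here is a minimal sketch of the standard CFG combination that this abstract's method aims to replace. It is not the paper's Model-guidance objective. `ToyDenoiser` is a hypothetical stand-in for a diffusion network that only counts forward passes, and the scalar inputs are illustrative. In the common formulation, CFG evaluates the network twice per sampling step (once with the condition and once without) and extrapolates between the two predictions.

```python
class ToyDenoiser:
    """Hypothetical stand-in for a diffusion network; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, cond=None):
        self.calls += 1
        # Arbitrary toy prediction: the condition simply shifts the output.
        shift = 0.0 if cond is None else float(cond)
        return 0.1 * x + shift


def cfg_noise(model, x, cond, scale):
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond)."""
    eps_cond = model(x, cond)    # conditional forward pass
    eps_uncond = model(x, None)  # unconditional forward pass
    return eps_uncond + scale * (eps_cond - eps_uncond)


model = ToyDenoiser()
eps = cfg_noise(model, x=1.0, cond=1.0, scale=3.0)
print(model.calls)  # two network evaluations for a single guided step
```

A sampler that needs only the conditional pass would make one call per step instead of two, which is consistent with the doubled inference speed the abstract reports.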