2025-06-03

Title: Amadeus-Verbo Technical Report: The powerful Qwen2.5 family models trained in Portuguese

Authors: William Alberto Cruz-Castañeda, Marcellus Amadeus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00019
Pdf URL: https://arxiv.org/pdf/2506.00019
Copy Paste: [[2506.00019]] Amadeus-Verbo Technical Report: The powerful Qwen2.5 family models trained in Portuguese(https://arxiv.org/abs/2506.00019)
Keywords: foundation model
Abstract: This report introduces the experience of developing Amadeus Verbo, a family of large language models for Brazilian Portuguese. To handle diverse use cases, Amadeus Verbo includes base-tuned, merged, and instruction-tuned models in sizes of 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters. Thus, the main objective is to show how easy it is to fine-tune foundation models to democratize the open-source development of Brazilian Portuguese LLMs when data and resources are available. Amadeus-Verbo family models are all available at HuggingFace at this https URL.

Title: Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards

Authors: Xun Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00103
Pdf URL: https://arxiv.org/pdf/2506.00103
Copy Paste: [[2506.00103]] Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards(https://arxiv.org/abs/2506.00103)
Keywords: generative
Abstract: Reinforcement learning with verifiable rewards (RLVR) has enabled large language models (LLMs) to achieve remarkable breakthroughs in reasoning tasks with objective ground-truth answers, such as mathematics and code generation. However, a significant gap remains for non-verifiable tasks, like creative writing and open-ended dialogue, where quality assessment is inherently subjective and lacks definitive references. Existing approaches for these domains often rely on scalar reward models trained with human preferences, which suffer from limited generalization and are prone to reward hacking, such as over-explanation and length bias. In this work, we propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. The pairwise writing GenRM leverages self-principled critique to transform subjective assessments into reliable, verifiable rewards, while BRPO enables dynamic, reference-free pairwise comparison by leveraging a bootstrapped response as temporary reference from within group rollouts during RL training. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning, as demonstrated by Writing-Zero, which shows consistent improvement and strong resistance to reward hacking compared to scalar reward baselines. Furthermore, our method achieves competitive results on both in-house and open-source writing benchmarks. Our findings suggest the potential to unify rule-based, reference-based, and reference-free reward modeling under the RLVR framework, thus paving the way for a comprehensive and scalable RL training paradigm applicable across all language tasks.

Title: On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

Authors: Magdalena Proszewska, Nikolay Malkin, N. Siddharth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00136
Pdf URL: https://arxiv.org/pdf/2506.00136
Copy Paste: [[2506.00136]] On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning(https://arxiv.org/abs/2506.00136)
Keywords: diffusion, generative
Abstract: Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for tasks such as downstream classification, controllable generation, and interpolation. However, the generative performance of DAs relies heavily on how well the latent variables can be modelled and subsequently sampled from. Better generative modelling is also the primary goal of another class of diffusion models -- those that learn their forward (noising) process. While effective at adjusting the noise process in an input-dependent manner, they must satisfy additional constraints derived from the terminal conditions of the diffusion process. Here, we draw a connection between these two classes of models and show that certain design decisions (latent variable choice, conditioning method, etc.) in the DA framework -- leading to a model we term DMZ -- allow us to obtain the best of both worlds: effective representations as evaluated on downstream tasks, including domain transfer, as well as more efficient modelling and generation with fewer denoising steps compared to standard DMs.

Title: Cluster-Aware Causal Mixer for Online Anomaly Detection in Multivariate Time Series

Authors: Md Mahmuddun Nabi Murad, Yasin Yilmaz
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00188
Pdf URL: https://arxiv.org/pdf/2506.00188
Copy Paste: [[2506.00188]] Cluster-Aware Causal Mixer for Online Anomaly Detection in Multivariate Time Series(https://arxiv.org/abs/2506.00188)
Keywords: anomaly
Abstract: Early and accurate detection of anomalies in time series data is critical, given the significant risks associated with false or missed detections. While MLP-based mixer models have shown promise in time series analysis, they lack a causality mechanism to preserve temporal dependencies inherent in the system. Moreover, real-world multivariate time series often contain numerous channels with diverse inter-channel correlations. A single embedding mechanism for all channels does not effectively capture these complex relationships. To address these challenges, we propose a novel cluster-aware causal mixer to effectively detect anomalies in multivariate time series. Our model groups channels into clusters based on their correlations, with each cluster processed through a dedicated embedding layer. In addition, we introduce a causal mixer in our model, which mixes the information while maintaining causality. Furthermore, we present an anomaly detection framework that accumulates the anomaly evidence over time to prevent false positives due to nominal outliers. Our proposed model operates in an online fashion, making it suitable for real-time time-series anomaly detection tasks. Experimental evaluations across six public benchmark datasets demonstrate that our model consistently achieves superior F1 scores.

Title: MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models

Authors: Srivathsan Badrinarayanan, Rishikesh Magar, Akshay Antony, Radheesh Sharma Meda, Amir Barati Farimani
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00198
Pdf URL: https://arxiv.org/pdf/2506.00198
Copy Paste: [[2506.00198]] MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models(https://arxiv.org/abs/2506.00198)
Keywords: generative
Abstract: The discovery of Metal-Organic Frameworks (MOFs) with application-specific properties remains a central challenge in materials chemistry, owing to the immense size and complexity of their structural design space. Conventional computational screening techniques such as molecular simulations and density functional theory (DFT), while accurate, are computationally prohibitive at scale. Machine learning offers an exciting alternative by leveraging data-driven approaches to accelerate materials discovery. The complexity of MOFs, with their extended periodic structures and diverse topologies, creates both opportunities and challenges for generative modeling approaches. To address these challenges, we present a reinforcement learning-enhanced, transformer-based framework for the de novo design of MOFs. Central to our approach is MOFid, a chemically-informed string representation encoding both connectivity and topology, enabling scalable generative modeling. Our pipeline comprises three components: (1) a generative GPT model trained on MOFid sequences, (2) MOFormer, a transformer-based property predictor, and (3) a reinforcement learning (RL) module that optimizes generated candidates via property-guided reward functions. By integrating property feedback into sequence generation, our method drives the model toward synthesizable, topologically valid MOFs with desired functional attributes. This work demonstrates the potential of large language models, when coupled with reinforcement learning, to accelerate inverse design in reticular chemistry and unlock new frontiers in computational MOF discovery.

Title: Structuring Radiology Reports: Challenging LLMs with Lightweight Models

Authors: Johannes Moll, Louisa Fay, Asfandyar Azhar, Sophie Ostmeier, Tim Lueth, Sergios Gatidis, Curtis Langlotz, Jean-Benoit Delbrouck
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00200
Pdf URL: https://arxiv.org/pdf/2506.00200
Copy Paste: [[2506.00200]] Structuring Radiology Reports: Challenging LLMs with Lightweight Models(https://arxiv.org/abs/2506.00200)
Keywords: in-context
Abstract: Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (<300M parameters)-specifically T5 and BERT2BERT-for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B-70B), adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) finetuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-finetuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU 6.4%, ROUGE-L 4.8%, BERTScore 3.6%, F1-RadGraph 1.1%, GREEN 3.6%, and F1-SRR-BERT 4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions compared to the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.

Title: Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models

Authors: Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00209
Pdf URL: https://arxiv.org/pdf/2506.00209
Copy Paste: [[2506.00209]] Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models(https://arxiv.org/abs/2506.00209)
Keywords: foundation model
Abstract: Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising of thirty thousand patients, CATCH-FM achieved strong efficacy (60% sensitivity) with low risk (99% specificity and Negative Predictive Value), outperforming feature-based tree models as well as general and medical large language models by large margins. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.

Title: Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Authors: Anthony Gosselin, Ge Ya Luo, Luis Lara, Florian Golemo, Derek Nowrouzezahrai, Liam Paull, Alexia Jolicoeur-Martineau, Christopher Pal
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.00227
Pdf URL: https://arxiv.org/pdf/2506.00227
Copy Paste: [[2506.00227]] Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes(https://arxiv.org/abs/2506.00227)
Keywords: diffusion
Abstract: Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes due to the scarcity of accident events in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To tackle the problem, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation where minor variations in input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with independently tunable scales for each conditioning signal. Ctrl-Crash achieves state-of-the-art performance across quantitative video quality metrics (e.g., FVD and JEDi) and qualitative measurements based on a human-evaluation of physical realism and video quality compared to prior diffusion-based methods.

Title: Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Authors: Oliver Mortensen, Mohammad Sadegh Talebi
Subjects: cs.LG, cs.AI, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00286
Pdf URL: https://arxiv.org/pdf/2506.00286
Copy Paste: [[2506.00286]] Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model(https://arxiv.org/abs/2506.00286)
Keywords: generative
Abstract: In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $\beta\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model based approach which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI) which leads to $(\epsilon,\delta)$-PAC-bounds on $\|Q^*-Q^k\|$, and $\|V^*-V^{\pi_k}\|$ where $Q_k$ is the output of MB-RS-QVI after k iterations and $\pi_k$ is the greedy policy with respect to $Q_k$. Both PAC-bounds have exponential dependence on the effective horizon $\frac{1}{1-\gamma}$ and the strength of this dependence grows with the learners risk-sensitivity $|\beta|$. We also provide two lower bounds which shows that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are both tight in $\varepsilon$ and $\delta$ and that the PAC-bound on $Q$-learning is tight in the number of actions $A$, and that the PAC-bound on policy-learning is nearly tight in $A$.

Title: Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation

Authors: Ahmed Elhady, Eneko Agirre, Mikel Artetxe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00288
Pdf URL: https://arxiv.org/pdf/2506.00288
Copy Paste: [[2506.00288]] Emergent Abilities of Large Language Models under Continued Pretraining for Language Adaptation(https://arxiv.org/abs/2506.00288)
Keywords: in-context
Abstract: Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early on CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts in the target language as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a big shift in the model parameters. Based on these insights, we introduce curriculum learning and exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light into the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation to design more effective methods in the future.

Title: DLM-One: Diffusion Language Models for One-Step Sequence Generation

Authors: Tianqi Chen, Shujian Zhang, Mingyuan Zhou
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00290
Pdf URL: https://arxiv.org/pdf/2506.00290
Copy Paste: [[2506.00290]] DLM-One: Diffusion Language Models for One-Step Sequence Generation(https://arxiv.org/abs/2506.00290)
Keywords: diffusion
Abstract: This paper introduces DLM-One, a score-distillation-based framework for one-step sequence generation with continuous diffusion language models (DLMs). DLM-One eliminates the need for iterative refinement by aligning the scores of a student model's outputs in the continuous token embedding space with the score function of a pretrained teacher DLM. We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling. Through comprehensive experiments on DiffuSeq -- a representative continuous DLM -- we show that DLM-One achieves up to ~500x speedup in inference time while maintaining competitive performance on benchmark text generation tasks used to evaluate the teacher models. We further analyze the method's empirical behavior across multiple datasets, providing initial insights into its generality and practical applicability. Our findings position one-step diffusion as a promising direction for efficient, high-quality language generation and broader adoption of continuous diffusion models operating in embedding space for natural language processing.

Title: Inference-Time Alignment of Diffusion Models with Evolutionary Algorithms

Authors: Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, Yung-Hsiang Lu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00299
Pdf URL: https://arxiv.org/pdf/2506.00299
Copy Paste: [[2506.00299]] Inference-Time Alignment of Diffusion Models with Evolutionary Algorithms(https://arxiv.org/abs/2506.00299)
Keywords: diffusion, generative
Abstract: Diffusion models are state-of-the-art generative models in various domains, yet their samples often fail to satisfy downstream objectives such as safety constraints or domain-specific validity. Existing techniques for alignment require gradients, internal model access, or large computational budgets. We introduce an inference-time alignment framework based on evolutionary algorithms. We treat diffusion models as black-boxes and search their latent space to maximize alignment objectives. Our method enables efficient inference-time alignment for both differentiable and non-differentiable alignment objectives across a range of diffusion models. On the DrawBench and Open Image Preferences benchmark, our EA methods outperform state-of-the-art gradient-based and gradient-free inference-time methods. In terms of memory consumption, we require 55% to 76% lower GPU memory than gradient-based methods. In terms of running-time, we are 72% to 80% faster than gradient-based methods. We achieve higher alignment scores over 50 optimization steps on Open Image Preferences than gradient-based and gradient-free methods.

Title: SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation

Authors: Yufei Tian, Jiao Sun, Nanyun Peng, Zizhao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00319
Pdf URL: https://arxiv.org/pdf/2506.00319
Copy Paste: [[2506.00319]] SkillVerse : Assessing and Enhancing LLMs with Tree Evaluation(https://arxiv.org/abs/2506.00319)
Keywords: in-context
Abstract: As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With LLM as a judge, SkillVerse first critiques the model responses, and then organizes them into a hierarchical structure termed dendrogram. Given proficiency at arbitrary levels of granularity, SkillVerse is flexible to produce insights of behaviors of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.

Title: Towards Effective and Efficient Adversarial Defense with Diffusion Models for Robust Visual Tracking

Authors: Long Xu, Peng Gao, Wen-Jia Tang, Fei Wang, Ru-Yue Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00325
Pdf URL: https://arxiv.org/pdf/2506.00325
Copy Paste: [[2506.00325]] Towards Effective and Efficient Adversarial Defense with Diffusion Models for Robust Visual Tracking(https://arxiv.org/abs/2506.00325)
Keywords: diffusion
Abstract: Although deep learning-based visual tracking methods have made significant progress, they exhibit vulnerabilities when facing carefully designed adversarial attacks, which can lead to a sharp decline in tracking performance. To address this issue, this paper proposes for the first time a novel adversarial defense method based on denoise diffusion probabilistic models, termed DiffDf, aimed at effectively improving the robustness of existing visual tracking methods against adversarial attacks. DiffDf establishes a multi-scale defense mechanism by combining pixel-level reconstruction loss, semantic consistency loss, and structural similarity loss, effectively suppressing adversarial perturbations through a gradual denoising process. Extensive experimental results on several mainstream datasets show that the DiffDf method demonstrates excellent generalization performance for trackers with different architectures, significantly improving various evaluation metrics while achieving real-time inference speeds of over 30 FPS, showcasing outstanding defense performance and efficiency. Codes are available at this https URL.

Title: Latent Guidance in Diffusion Models for Perceptual Evaluations

Authors: Shreshth Saini, Ru-Ling Liao, Yan Ye, Alan C. Bovik
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00327
Pdf URL: https://arxiv.org/pdf/2506.00327
Copy Paste: [[2506.00327]] Latent Guidance in Diffusion Models for Perceptual Evaluations(https://arxiv.org/abs/2506.00327)
Keywords: diffusion
Abstract: Despite recent advancements in latent diffusion models that generate high-dimensional image data and perform various downstream tasks, there has been little exploration into perceptual consistency within these models on the task of No-Reference Image Quality Assessment (NR-IQA). In this paper, we hypothesize that latent diffusion models implicitly exhibit perceptually consistent local regions within the data manifold. We leverage this insight to guide on-manifold sampling using perceptual features and input measurements. Specifically, we propose Perceptual Manifold Guidance (PMG), an algorithm that utilizes pretrained latent diffusion models and perceptual quality features to obtain perceptually consistent multi-scale and multi-timestep feature maps from the denoising U-Net. We empirically demonstrate that these hyperfeatures exhibit high correlation with human perception in IQA tasks. Our method can be applied to any existing pretrained latent diffusion model and is straightforward to integrate. To the best of our knowledge, this paper is the first work on guiding diffusion model with perceptual features for NR-IQA. Extensive experiments on IQA datasets show that our method, LGDM, achieves state-of-the-art performance, underscoring the superior generalization capabilities of diffusion models for NR-IQA tasks.

Title: Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

Authors: Muhammad Adnan, Nithesh Kurella, Akhil Arunkumar, Prashant J. Nair
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.00329
Pdf URL: https://arxiv.org/pdf/2506.00329
Copy Paste: [[2506.00329]] Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation(https://arxiv.org/abs/2506.00329)
Keywords: diffusion
Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves up to 1.63x end-to-end speedup, while maintaining video quality. The source code of Foresight is available at \texttt{this https URL}.

Title: OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning

Authors: Yifan Peng, Shakeel Muhammad, Yui Sudo, William Chen, Jinchuan Tian, Chyi-Jiunn Lin, Shinji Watanabe
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.00338
Pdf URL: https://arxiv.org/pdf/2506.00338
Copy Paste: [[2506.00338]] OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning(https://arxiv.org/abs/2506.00338)
Keywords: foundation model
Abstract: The Open Whisper-style Speech Models (OWSM) project has developed a series of fully open speech foundation models using academic-scale resources, but their training data remains insufficient. This work enhances OWSM by integrating YODAS, a large-scale web-crawled dataset with a Creative Commons license. However, incorporating YODAS is nontrivial due to its wild nature, which introduces challenges such as incorrect language labels and audio-text misalignments. To address this, we develop a scalable data-cleaning pipeline using public toolkits, yielding a dataset with 166,000 hours of speech across 75 languages. Our new series of OWSM v4 models, trained on this curated dataset alongside existing OWSM data, significantly outperform previous versions on multilingual benchmarks. Our models even match or surpass frontier industrial models like Whisper and MMS in multiple scenarios. We will publicly release the cleaned YODAS data, pre-trained models, and all associated scripts via the ESPnet toolkit.

Title: Scaling Textual Gradients via Sampling-Based Momentum

Authors: Zixin Ding, Junyuan Hong, Jiachen T. Wang, Zinan Lin, Zhangyang Wang, Yuxin Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00400
Pdf URL: https://arxiv.org/pdf/2506.00400
Copy Paste: [[2506.00400]] Scaling Textual Gradients via Sampling-Based Momentum(https://arxiv.org/abs/2506.00400)
Keywords: in-context
Abstract: As prompts play an increasingly critical role in large language models (LLMs), optimizing textual prompts has become a crucial challenge. The Textual Gradient Descent (TGD) framework has emerged as a promising data-driven approach that iteratively refines textual prompts using LLM - suggested updates (or textual gradients) over minibatches of training samples. In this paper, we empirically demonstrate that scaling the number of training examples initially improves but later degrades TGD's performance across multiple downstream NLP tasks. However, while data scaling improves results for most tasks, it also significantly increases the computational cost when leveraging LLMs. To address this, we draw inspiration from numerical gradient descent and propose Textual Stochastic Gradient Descent with Momentum (TSGD-M) - a method that facilitates scalable in-context learning by reweighting prompt sampling based on past batch distributions. Across nine NLP tasks spanning three domains - including BIG-Bench Hard (BBH), natural language understanding tasks, and reasoning tasks - TSGD-M significantly outperforms TGD baselines that do not incorporate reweighted sampling, while also reducing variance in most tasks.

Title: JojoSCL: Shrinkage Contrastive Learning for single-cell RNA sequence Clustering

Authors: Ziwen Wang
Subjects: cs.LG, q-bio.GN, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00410
Pdf URL: https://arxiv.org/pdf/2506.00410
Copy Paste: [[2506.00410]] JojoSCL: Shrinkage Contrastive Learning for single-cell RNA sequence Clustering(https://arxiv.org/abs/2506.00410)
Keywords: self-supervised
Abstract: Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular processes by enabling gene expression analysis at the individual cell level. Clustering allows for the identification of cell types and the further discovery of intrinsic patterns in single-cell data. However, the high dimensionality and sparsity of scRNA-seq data continue to challenge existing clustering models. In this paper, we introduce JojoSCL, a novel self-supervised contrastive learning framework for scRNA-seq clustering. By incorporating a shrinkage estimator based on hierarchical Bayesian estimation, which adjusts gene expression estimates towards more reliable cluster centroids to reduce intra-cluster dispersion, and optimized using Stein's Unbiased Risk Estimate (SURE), JojoSCL refines both instance-level and cluster-level contrastive learning. Experiments on ten scRNA-seq datasets substantiate that JojoSCL consistently outperforms prevalent clustering methods, with further validation of its practicality through robustness analysis and ablation studies. JojoSCL's code is available at: this https URL.

Title: Accelerating Diffusion LLMs via Adaptive Parallel Decoding

Authors: Daniel Israel, Guy Van den Broeck, Aditya Grover
Subjects: cs.CL, cs.AI, cs.LG, cs.PF
Abstract URL: https://arxiv.org/abs/2506.00413
Pdf URL: https://arxiv.org/pdf/2506.00413
Copy Paste: [[2506.00413]] Accelerating Diffusion LLMs via Adaptive Parallel Decoding(https://arxiv.org/abs/2506.00413)
Keywords: diffusion
Abstract: The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in practice struggle to achieve the speed of autoregressive models without significantly sacrificing quality. We therefore introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. We achieve this by defining a multiplicative mixture between the dLLM marginal probabilities and the joint probability of sequences under a small auxiliary autoregressive model. This inverts the standard setup of speculative decoding, where the goal is to sample from a large autoregressive verifier by drafting from a smaller model. We further optimize APD by enabling KV caching and limiting the size of the masked input. Altogether, our method puts forward three tunable parameters to flexibly tradeoff throughput and quality. We show that APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.

Title: Dual Debiasing for Noisy In-Context Learning for Text Generation

Authors: Siqi Liang, Sumyeong Ahn, Paramveer S. Dhillon, Jiayu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00418
Pdf URL: https://arxiv.org/pdf/2506.00418
Copy Paste: [[2506.00418]] Dual Debiasing for Noisy In-Context Learning for Text Generation(https://arxiv.org/abs/2506.00418)
Keywords: in-context
Abstract: In context learning (ICL) relies heavily on high quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We reexamine the perplexity based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method's superior noise detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.

Title: A New Spatiotemporal Correlation Anomaly Detection Method that Integrates Contrastive Learning and Few-Shot Learning in Wireless Sensor Networks

Authors: Miao Ye, Suxiao Wang, Jiaguang Han, Yong Wang, Xiaoli Wang, Jingxuan Wei, Peng Wen, Jing Cui
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00420
Pdf URL: https://arxiv.org/pdf/2506.00420
Copy Paste: [[2506.00420]] A New Spatiotemporal Correlation Anomaly Detection Method that Integrates Contrastive Learning and Few-Shot Learning in Wireless Sensor Networks(https://arxiv.org/abs/2506.00420)
Keywords: anomaly
Abstract: Detecting anomalies in the data collected by WSNs can provide crucial evidence for assessing the reliability and stability of WSNs. Existing methods for WSN anomaly detection often face challenges such as the limited extraction of spatiotemporal correlation features, the absence of sample labels, few anomaly samples, and an imbalanced sample distribution. To address these issues, a spatiotemporal correlation detection model (MTAD-RD) considering both model architecture and a two-stage training strategy perspective is proposed. In terms of model structure design, the proposed MTAD-RD backbone network includes a retentive network (RetNet) enhanced by a cross-retention (CR) module, a multigranular feature fusion module, and a graph attention network module to extract internode correlation information. This proposed model can integrate the intermodal correlation features and spatial features of WSN neighbor nodes while extracting global information from time series data. Moreover, its serialized inference characteristic can remarkably reduce inference overhead. For model training, a two-stage training approach was designed. First, a contrastive learning proxy task was designed for time series data with graph structure information in WSNs, enabling the backbone network to learn transferable features from unlabeled data using unsupervised contrastive learning methods, thereby addressing the issue of missing sample labels in the dataset. Then, a caching-based sample sampler was designed to divide samples into few-shot and contrastive learning data. A specific joint loss function was developed to jointly train the dual-graph discriminator network to address the problem of sample imbalance effectively. In experiments carried out on real public datasets, the designed MTAD-RD anomaly detection method achieved an F1 score of 90.97%, outperforming existing supervised WSN anomaly detection methods.

Title: Channel Normalization for Time Series Channel Identification

Authors: Seunghan Lee, Taeyoung Park, Kibok Lee
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00432
Pdf URL: https://arxiv.org/pdf/2506.00432
Copy Paste: [[2506.00432]] Channel Normalization for Time Series Channel Identification(https://arxiv.org/abs/2506.00432)
Keywords: foundation model
Abstract: Channel identifiability (CID) refers to the ability to distinguish between individual channels in time series (TS) modeling. The absence of CID often results in producing identical outputs for identical inputs, disregarding channel-specific characteristics. In this paper, we highlight the importance of CID and propose Channel Normalization (CN), a simple yet effective normalization strategy that enhances CID by assigning distinct affine transformation parameters to each channel. We further extend CN in two ways: 1) Adaptive CN (ACN) dynamically adjusts parameters based on the input TS, improving adaptability in TS models, and 2) Prototypical CN (PCN) introduces a set of learnable prototypes instead of per-channel parameters, enabling applicability to datasets with unknown or varying number of channels and facilitating use in TS foundation models. We demonstrate the effectiveness of CN and its variants by applying them to various TS models, achieving significant performance gains for both non-CID and CID models. In addition, we analyze the success of our approach from an information theory perspective. Code is available at this https URL.

Title: Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free

Authors: Luigi Sigillo, Shengfeng He, Danilo Comminiello
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2506.00433
Pdf URL: https://arxiv.org/pdf/2506.00433
Copy Paste: [[2506.00433]] Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free(https://arxiv.org/abs/2506.00433)
Keywords: diffusion, generative
Abstract: High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.

Title: G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models

Authors: Long Bai, Zixuan Li, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00445
Pdf URL: https://arxiv.org/pdf/2506.00445
Copy Paste: [[2506.00445]] G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models(https://arxiv.org/abs/2506.00445)
Keywords: in-context
Abstract: Forecasting over Temporal Knowledge Graphs (TKGs) which predicts future facts based on historical ones has received much attention. Recent studies have introduced Large Language Models (LLMs) for this task to enhance the models' generalization abilities. However, these models perform forecasting via simultaneously learning two kinds of entangled knowledge in the TKG: (1) general patterns, i.e., invariant temporal structures shared across different scenarios; and (2) scenario information, i.e., factual knowledge engaged in specific scenario, such as entities and relations. As a result, the learning processes of these two kinds of knowledge may interfere with each other, which potentially impact the generalization abilities of the models. To enhance the generalization ability of LLMs on this task, in this paper, we propose a General-to-Specific learning framework (G2S) that disentangles the learning processes of the above two kinds of knowledge. In the general learning stage, we mask the scenario information in different TKGs and convert it into anonymous temporal structures. After training on these structures, the model is able to capture the general patterns across different TKGs. In the specific learning stage, we inject the scenario information into the structures via either in-context learning or fine-tuning modes. Experimental results show that G2S effectively improves the generalization abilities of LLMs.

Title: Comparing Traditional and Reinforcement-Learning Methods for Energy Storage Control

Authors: Elinor Ginzburg, Itay Segev, Yoash Levron, Sarah Keren
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00459
Pdf URL: https://arxiv.org/pdf/2506.00459
Copy Paste: [[2506.00459]] Comparing Traditional and Reinforcement-Learning Methods for Energy Storage Control(https://arxiv.org/abs/2506.00459)
Keywords: generative
Abstract: We aim to better understand the tradeoffs between traditional and reinforcement learning (RL) approaches for energy storage management. More specifically, we wish to better understand the performance loss incurred when using a generative RL policy instead of using a traditional approach to find optimal control policies for specific instances. Our comparison is based on a simplified micro-grid model, that includes a load component, a photovoltaic source, and a storage device. Based on this model, we examine three use cases of increasing complexity: ideal storage with convex cost functions, lossy storage devices, and lossy storage devices with convex transmission losses. With the aim of promoting the principled use RL based methods in this challenging and important domain, we provide a detailed formulation of each use case and a detailed description of the optimization challenges. We then compare the performance of traditional and RL methods, discuss settings in which it is beneficial to use each method, and suggest avenues for future investigation.

Title: Exploring In-context Example Generation for Machine Translation

Authors: Dohyun Lee, Seungil Chad Lee, Chanwoo Yang, Yujin Baek, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00507
Pdf URL: https://arxiv.org/pdf/2506.00507
Copy Paste: [[2506.00507]] Exploring In-context Example Generation for Machine Translation(https://arxiv.org/abs/2506.00507)
Keywords: in-context
Abstract: Large language models (LLMs) have demonstrated strong performance across various tasks, leveraging their exceptional in-context learning ability with only a few examples. Accordingly, the selection of optimal in-context examples has been actively studied in the field of machine translation. However, these studies presuppose the presence of a demonstration pool with human-annotated pairs, making them less applicable to low-resource languages where such an assumption is challenging to meet. To overcome this limitation, this paper explores the research direction of in-context example generation for machine translation. Specifically, we propose Demonstration Augmentation for Translation (DAT), a simple yet effective approach that generates example pairs without relying on any external resources. This method builds upon two prior criteria, relevance and diversity, which have been highlighted in previous work as key factors for in-context example selection. Through experiments and analysis on low-resource languages where human-annotated pairs are scarce, we show that DAT achieves superior translation quality compared to the baselines. Furthermore, we investigate the potential of progressively accumulating generated pairs during test time to build and reuse a demonstration pool. Our implementation is publicly available at this https URL.

Title: SSAM: Self-Supervised Association Modeling for Test-Time Adaption

Authors: Yaxiong Wang, Zhenqiang Zhang, Lechao Cheng, Zhun Zhong, Dan Guo, Meng Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00513
Pdf URL: https://arxiv.org/pdf/2506.00513
Copy Paste: [[2506.00513]] SSAM: Self-Supervised Association Modeling for Test-Time Adaption(https://arxiv.org/abs/2506.00513)
Keywords: self-supervised
Abstract: Test-time adaption (TTA) has witnessed important progress in recent years, the prevailing methods typically first encode the image and the text and design strategies to model the association between them. Meanwhile, the image encoder is usually frozen due to the absence of explicit supervision in TTA scenarios. We identify a critical limitation in this paradigm: While test-time images often exhibit distribution shifts from training data, existing methods persistently freeze the image encoder due to the absence of explicit supervision during adaptation. This practice overlooks the image encoder's crucial role in bridging distribution shift between training and test. To address this challenge, we propose SSAM (Self-Supervised Association Modeling), a new TTA framework that enables dynamic encoder refinement through dual-phase association learning. Our method operates via two synergistic components: 1) Soft Prototype Estimation (SPE), which estimates probabilistic category associations to guide feature space reorganization, and 2) Prototype-anchored Image Reconstruction (PIR), enforcing encoder stability through cluster-conditional image feature reconstruction. Comprehensive experiments across diverse baseline methods and benchmarks demonstrate that SSAM can surpass state-of-the-art TTA baselines by a clear margin while maintaining computational efficiency. The framework's architecture-agnostic design and minimal hyperparameter dependence further enhance its practical applicability.

Title: SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation

Authors: Xingtong Ge, Xin Zhang, Tongda Xu, Yi Zhang, Xinjie Zhang, Yan Wang, Jun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00523
Pdf URL: https://arxiv.org/pdf/2506.00523
Copy Paste: [[2506.00523]] SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation(https://arxiv.org/abs/2506.00523)
Keywords: diffusion
Abstract: The Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX. In this paper, we first analyze the issues when applying vanilla DMD on large-scale models. Then, to overcome the scalability challenge, we propose implicit distribution alignment (IDA) to regularize the distance between the generator and fake distribution. Furthermore, we propose intra-segment guidance (ISG) to relocate the timestep importance distribution from the teacher model. With IDA alone, DMD converges for SD 3.5; employing both IDA and ISG, DMD converges for SD 3.5 and FLUX.1 dev. Along with other improvements such as scaled up discriminator models, our final model, dubbed \textbf{SenseFlow}, achieves superior performance in distillation for both diffusion based text-to-image models such as SDXL, and flow-matching models such as SD 3.5 Large and FLUX. The source code will be avaliable at this https URL.

Title: Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing

Authors: Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, Yiqun Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00536
Pdf URL: https://arxiv.org/pdf/2506.00536
Copy Paste: [[2506.00536]] Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing(https://arxiv.org/abs/2506.00536)
Keywords: in-context
Abstract: Knowledge editing aims to efficiently update Large Language Models (LLMs) by modifying specific knowledge without retraining the entire model. Among knowledge editing approaches, in-context editing (ICE) offers a lightweight solution by injecting new knowledge directly into the input context, leaving model parameters unchanged. However, existing ICE approaches do not explicitly separate the newly injected knowledge from the model's original reasoning process. This entanglement often results in conflicts between external updates and internal parametric knowledge, undermining the consistency and accuracy of the reasoning this http URL this work, we conduct preliminary experiments to examine how parametric knowledge influences reasoning path planning. We find that the model's reasoning is tightly coupled with its internal knowledge, and that naively injecting new information without adapting the reasoning path often leads to performance degradation, particularly in multi-hop tasks. To this end, we propose DecKER, a novel ICE framework that decouples reasoning from knowledge editing by generating a masked reasoning path and then resolving knowledge edits via hybrid retrieval and model-based validation. Experiments on multi-hop QA benchmarks show that DecKER significantly outperforms existing ICE methods by mitigating knowledge conflicts and preserving reasoning consistency. Our code is available at: this https URL .

Title: Imputation of Missing Data in Smooth Pursuit Eye Movements Using a Self-Attention-based Deep Learning Approach

Authors: Mehdi Bejani, Guillermo Perez-de-Arenaza-Pozo, Julián D. Arias-Londoño, Juan I. Godino-LLorente
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00545
Pdf URL: https://arxiv.org/pdf/2506.00545
Copy Paste: [[2506.00545]] Imputation of Missing Data in Smooth Pursuit Eye Movements Using a Self-Attention-based Deep Learning Approach(https://arxiv.org/abs/2506.00545)
Keywords: generative
Abstract: Missing data is a relevant issue in time series, especially in biomedical sequences such as those corresponding to smooth pursuit eye movements, which often contain gaps due to eye blinks and track losses, complicating the analysis and extraction of meaningful biomarkers. In this paper, a novel imputation framework is proposed using Self-Attention-based Imputation networks for time series, which leverages the power of deep learning and self-attention mechanisms to impute missing data. We further refine the imputed data using a custom made autoencoder, tailored to represent smooth pursuit eye movement sequences. The proposed approach was implemented using 5,504 sequences from 172 Parkinsonian patients and healthy controls. Results show a significant improvement in the accuracy of reconstructed eye movement sequences with respect to other state of the art techniques, substantially reducing the values for common time domain error metrics such as the mean absolute error, mean relative error, and root mean square error, while also preserving the signal's frequency domain characteristics. Moreover, it demonstrates robustness when large intervals of data are missing. This method offers an alternative solution for robustly handling missing data in time series, enhancing the reliability of smooth pursuit analysis for the screening and monitoring of neurodegenerative disorders.

Title: SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models

Authors: Yule Zhu, Ping Liu, Zhedong Zheng, Wei Liu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.00562
Pdf URL: https://arxiv.org/pdf/2506.00562
Copy Paste: [[2506.00562]] SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models(https://arxiv.org/abs/2506.00562)
Keywords: diffusion
Abstract: Diffusion models have recently enabled precise and photorealistic facial editing across a wide range of semantic attributes. Beyond single-step modifications, a growing class of applications now demands the ability to analyze and track sequences of progressive edits, such as stepwise changes to hair, makeup, or accessories. However, sequential editing introduces significant challenges in edit attribution and detection robustness, further complicated by the lack of large-scale, finely annotated benchmarks tailored explicitly for this task. We introduce SEED, a large-scale Sequentially Edited facE Dataset constructed via state-of-the-art diffusion models. SEED contains over 90,000 facial images with one to four sequential attribute modifications, generated using diverse diffusion-based editing pipelines (LEdits, SDXL, SD3). Each image is annotated with detailed edit sequences, attribute masks, and prompts, facilitating research on sequential edit tracking, visual provenance analysis, and manipulation robustness assessment. To benchmark this task, we propose FAITH, a frequency-aware transformer-based model that incorporates high-frequency cues to enhance sensitivity to subtle sequential changes. Comprehensive experiments, including systematic comparisons of multiple frequency-domain methods, demonstrate the effectiveness of FAITH and the unique challenges posed by SEED. SEED offers a challenging and flexible resource for studying progressive diffusion-based edits at scale. Dataset and code will be publicly released at: this https URL.

Title: MR2US-Pro: Prostate MR to Ultrasound Image Translation and Registration Based on Diffusion Models

Authors: Xudong Ma, Nantheera Anantrasirichai, Stefanos Bolomytis, Alin Achim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00591
Pdf URL: https://arxiv.org/pdf/2506.00591
Copy Paste: [[2506.00591]] MR2US-Pro: Prostate MR to Ultrasound Image Translation and Registration Based on Diffusion Models(https://arxiv.org/abs/2506.00591)
Keywords: diffusion
Abstract: The diagnosis of prostate cancer increasingly depends on multimodal imaging, particularly magnetic resonance imaging (MRI) and transrectal ultrasound (TRUS). However, accurate registration between these modalities remains a fundamental challenge due to the differences in dimensionality and anatomical representations. In this work, we present a novel framework that addresses these challenges through a two-stage process: TRUS 3D reconstruction followed by cross-modal registration. Unlike existing TRUS 3D reconstruction methods that rely heavily on external probe tracking information, we propose a totally probe-location-independent approach that leverages the natural correlation between sagittal and transverse TRUS views. With the help of our clustering-based feature matching method, we enable the spatial localization of 2D frames without any additional probe tracking information. For the registration stage, we introduce an unsupervised diffusion-based framework guided by modality translation. Unlike existing methods that translate one modality into another, we map both MR and US into a pseudo intermediate modality. This design enables us to customize it to retain only registration-critical features, greatly easing registration. To further enhance anatomical alignment, we incorporate an anatomy-aware registration strategy that prioritizes internal structural coherence while adaptively reducing the influence of boundary inconsistencies. Extensive validation demonstrates that our approach outperforms state-of-the-art methods by achieving superior registration accuracy with physically realistic deformations in a completely unsupervised fashion.

Title: Graph Evidential Learning for Anomaly Detection

Authors: Chunyu Wei, Wenji Hu, Xingjia Hao, Yunhai Wang, Yueguo Chen, Bing Bai, Fei Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00594
Pdf URL: https://arxiv.org/pdf/2506.00594
Copy Paste: [[2506.00594]] Graph Evidential Learning for Anomaly Detection(https://arxiv.org/abs/2506.00594)
Keywords: anomaly
Abstract: Graph anomaly detection faces significant challenges due to the scarcity of reliable anomaly-labeled datasets, driving the development of unsupervised methods. Graph autoencoders (GAEs) have emerged as a dominant approach by reconstructing graph structures and node features while deriving anomaly scores from reconstruction errors. However, relying solely on reconstruction error for anomaly detection has limitations, as it increases the sensitivity to noise and overfitting. To address these issues, we propose Graph Evidential Learning (GEL), a probabilistic framework that redefines the reconstruction process through evidential learning. By modeling node features and graph topology using evidential distributions, GEL quantifies two types of uncertainty: graph uncertainty and reconstruction uncertainty, incorporating them into the anomaly scoring mechanism. Extensive experiments demonstrate that GEL achieves state-of-the-art performance while maintaining high robustness against noise and structural perturbations.

Title: Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Authors: Danfeng li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00596
Pdf URL: https://arxiv.org/pdf/2506.00596
Copy Paste: [[2506.00596]] Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control(https://arxiv.org/abs/2506.00596)
Keywords: diffusion
Abstract: Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.

Title: ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education

Authors: Ruiming Min, Minghao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00605
Pdf URL: https://arxiv.org/pdf/2506.00605
Copy Paste: [[2506.00605]] ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education(https://arxiv.org/abs/2506.00605)
Keywords: generative
Abstract: With the advancement of modern medicine and the development of technologies such as MRI, CT, and cellular analysis, it has become increasingly critical for clinicians to accurately interpret various diagnostic images. However, modern medical education often faces challenges due to limited access to high-quality teaching materials, stemming from privacy concerns and a shortage of educational resources (Balogh et al., 2015). In this context, image data generated by machine learning models, particularly generative models, presents a promising solution. These models can create diverse and comparable imaging datasets without compromising patient privacy, thereby supporting modern medical education. In this study, we explore the use of convolutional neural networks (CNNs) and CycleGAN (Zhu et al., 2017) for generating synthetic medical images. The source code is available at this https URL.

Title: Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models

Authors: JungWoo Chae, Jiyoon Kim, Sangheum Hwang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00607
Pdf URL: https://arxiv.org/pdf/2506.00607
Copy Paste: [[2506.00607]] Parallel Rescaling: Rebalancing Consistency Guidance for Personalized Diffusion Models(https://arxiv.org/abs/2506.00607)
Keywords: diffusion
Abstract: Personalizing diffusion models to specific users or concepts remains challenging, particularly when only a few reference images are available. Existing methods such as DreamBooth and Textual Inversion often overfit to limited data, causing misalignment between generated images and text prompts when attempting to balance identity fidelity with prompt adherence. While Direct Consistency Optimization (DCO) with its consistency-guided sampling partially alleviates this issue, it still struggles with complex or stylized prompts. In this paper, we propose a parallel rescaling technique for personalized diffusion models. Our approach explicitly decomposes the consistency guidance signal into parallel and orthogonal components relative to classifier free guidance (CFG). By rescaling the parallel component, we minimize disruptive interference with CFG while preserving the subject's identity. Unlike prior personalization methods, our technique does not require additional training data or expensive annotations. Extensive experiments show improved prompt alignment and visual fidelity compared to baseline methods, even on challenging stylized prompts. These findings highlight the potential of parallel rescaled guidance to yield more stable and accurate personalization for diverse user inputs.

Title: Improving Dialogue State Tracking through Combinatorial Search for In-Context Examples

Authors: Haesung Pyun, Yoonah Park, Yohan Jo
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2506.00622
Pdf URL: https://arxiv.org/pdf/2506.00622
Copy Paste: [[2506.00622]] Improving Dialogue State Tracking through Combinatorial Search for In-Context Examples(https://arxiv.org/abs/2506.00622)
Keywords: in-context
Abstract: In dialogue state tracking (DST), in-context learning comprises a retriever that selects labeled dialogues as in-context examples and a DST model that uses these examples to infer the dialogue state of the query dialogue. Existing methods for constructing training data for retrievers suffer from three key limitations: (1) the synergistic effect of examples is not considered, (2) the linguistic characteristics of the query are not sufficiently factored in, and (3) scoring is not directly optimized for DST performance. Consequently, the retriever can fail to retrieve examples that would substantially improve DST performance. To address these issues, we present CombiSearch, a method that scores effective in-context examples based on their combinatorial impact on DST performance. Our evaluation on MultiWOZ shows that retrievers trained with CombiSearch surpass state-of-the-art models, achieving a 20x gain in data efficiency and generalizing well to the SGD dataset. Moreover, CombiSearch attains a 12% absolute improvement in the upper bound DST performance over traditional approaches when no retrieval errors are assumed. This significantly increases the headroom for practical DST performance while demonstrating that existing methods rely on suboptimal data for retriever training.

Title: Probabilistic Forecasting for Building Energy Systems using Time-Series Foundation Models

Authors: Young Jin Park, Francois Germain, Jing Liu, Ye Wang, Toshiaki Koike-Akino, Gordon Wichern, Navid Azizan, Christopher R. Laughman, Ankush Chakrabarty
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00630
Pdf URL: https://arxiv.org/pdf/2506.00630
Copy Paste: [[2506.00630]] Probabilistic Forecasting for Building Energy Systems using Time-Series Foundation Models(https://arxiv.org/abs/2506.00630)
Keywords: foundation model
Abstract: Decision-making in building energy systems critically depends on the predictive accuracy of relevant time-series models. In scenarios lacking extensive data from a target building, foundation models (FMs) represent a promising technology that can leverage prior knowledge from vast and diverse pre-training datasets to construct accurate probabilistic predictors for use in decision-making tools. This paper investigates the applicability and fine-tuning strategies of time-series foundation models (TSFMs) in building energy forecasting. We analyze both full fine-tuning and parameter-efficient fine-tuning approaches, particularly low-rank adaptation (LoRA), by using real-world data from a commercial net-zero energy building to capture signals such as room occupancy, carbon emissions, plug loads, and HVAC energy consumption. Our analysis reveals that the zero-shot predictive performance of TSFMs is generally suboptimal. To address this shortcoming, we demonstrate that employing either full fine-tuning or parameter-efficient fine-tuning significantly enhances forecasting accuracy, even with limited historical data. Notably, fine-tuning with low-rank adaptation (LoRA) substantially reduces computational costs without sacrificing accuracy. Furthermore, fine-tuned TSFMs consistently outperform state-of-the-art deep forecasting models (e.g., temporal fusion transformers) in accuracy, robustness, and generalization across varying building zones and seasonal conditions. These results underline the efficacy of TSFMs for practical, data-constrained building energy management systems, enabling improved decision-making in pursuit of energy efficiency and sustainability.

Title: Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Authors: Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00633
Pdf URL: https://arxiv.org/pdf/2506.00633
Copy Paste: [[2506.00633]] Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining(https://arxiv.org/abs/2506.00633)
Keywords: diffusion, generative
Abstract: Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.

Title: Video Signature: In-generation Watermarking for Latent Video Diffusion Models

Authors: Yu Huang, Junhao Chen, Qi Zheng, Hanqian Li, Shuliang Liu, Xuming Hu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2506.00652
Pdf URL: https://arxiv.org/pdf/2506.00652
Copy Paste: [[2506.00652]] Video Signature: In-generation Watermarking for Latent Video Diffusion Models(https://arxiv.org/abs/2506.00652)
Keywords: diffusion
Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios.

Title: Differential Privacy for Deep Learning in Medicine

Authors: Marziyeh Mohammadi, Mohsen Vejdanihemmat, Mahshad Lotfinia, Mirabela Rusu, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00660
Pdf URL: https://arxiv.org/pdf/2506.00660
Copy Paste: [[2506.00660]] Differential Privacy for Deep Learning in Medicine(https://arxiv.org/abs/2506.00660)
Keywords: generative
Abstract: Differential privacy (DP) is a key technique for protecting sensitive patient data in medical deep learning (DL). As clinical models grow more data-dependent, balancing privacy with utility and fairness has become a critical challenge. This scoping review synthesizes recent developments in applying DP to medical DL, with a particular focus on DP-SGD and alternative mechanisms across centralized and federated settings. Using a structured search strategy, we identified 74 studies published up to March 2025. Our analysis spans diverse data modalities, training setups, and downstream tasks, and highlights the tradeoffs between privacy guarantees, model accuracy, and subgroup fairness. We find that while DP-especially at strong privacy budgets-can preserve performance in well-structured imaging tasks, severe degradation often occurs under strict privacy, particularly in underrepresented or complex modalities. Furthermore, privacy-induced performance gaps disproportionately affect demographic subgroups, with fairness impacts varying by data type and task. A small subset of studies explicitly addresses these tradeoffs through subgroup analysis or fairness metrics, but most omit them entirely. Beyond DP-SGD, emerging approaches leverage alternative mechanisms, generative models, and hybrid federated designs, though reporting remains inconsistent. We conclude by outlining key gaps in fairness auditing, standardization, and evaluation protocols, offering guidance for future work toward equitable and clinically robust privacy-preserving DL systems in medicine.

Title: CineMA: A Foundation Model for Cine Cardiac MRI

Authors: Yunguan Fu, Weixi Yi, Charlotte Manisty, Anish N Bhuva, Thomas A Treibel, James C Moon, Matthew J Clarkson, Rhodri Huw Davies, Yipeng Hu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00679
Pdf URL: https://arxiv.org/pdf/2506.00679
Copy Paste: [[2506.00679]] CineMA: A Foundation Model for Cine Cardiac MRI(https://arxiv.org/abs/2506.00679)
Keywords: self-supervised, foundation model
Abstract: Cardiac magnetic resonance (CMR) is a key investigation in clinical cardiovascular medicine and has been used extensively in population research. However, extracting clinically important measurements such as ejection fraction for diagnosing cardiovascular diseases remains time-consuming and subjective. We developed CineMA, a foundation AI model automating these tasks with limited labels. CineMA is a self-supervised autoencoder model trained on 74,916 cine CMR studies to reconstruct images from masked inputs. After fine-tuning, it was evaluated across eight datasets on 23 tasks from four categories: ventricle and myocardium segmentation, left and right ventricle ejection fraction calculation, disease detection and classification, and landmark localisation. CineMA is the first foundation model for cine CMR to match or outperform convolutional neural networks (CNNs). CineMA demonstrated greater label efficiency than CNNs, achieving comparable or better performance with fewer annotations. This reduces the burden of clinician labelling and supports replacing task-specific training with fine-tuning foundation models in future cardiac imaging applications. Models and code for pre-training and fine-tuning are available at this https URL, democratising access to high-performance models that otherwise require substantial computational resources, promoting reproducibility and accelerating clinical translation.

Title: Concept-Centric Token Interpretation for Vector-Quantized Generative Models

Authors: Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00698
Pdf URL: https://arxiv.org/pdf/2506.00698
Copy Paste: [[2506.00698]] Concept-Centric Token Interpretation for Vector-Quantized Generative Models(https://arxiv.org/abs/2506.00698)
Keywords: generative
Abstract: Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGMs transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at this https URL.

Title: RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models

Authors: Valter Hudovernik, Minkai Xu, Juntong Shi, Lovro Šubelj, Stefano Ermon, Erik Štrumbelj, Jure Leskovec
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00710
Pdf URL: https://arxiv.org/pdf/2506.00710
Copy Paste: [[2506.00710]] RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models(https://arxiv.org/abs/2506.00710)
Keywords: diffusion, generative
Abstract: Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a $2K+$SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures both high fidelity and referential integrity, both of which are crucial aspects of synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at this https URL.

Title: QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training

Authors: Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.00711
Pdf URL: https://arxiv.org/pdf/2506.00711
Copy Paste: [[2506.00711]] QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training(https://arxiv.org/abs/2506.00711)
Keywords: foundation model
Abstract: Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at this https URL.

Title: From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models

Authors: Tianqin Li, Ziqi Wen, Leiran Song, Jun Liu, Zhi Jing, Tai Sing Lee
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00718
Pdf URL: https://arxiv.org/pdf/2506.00718
Copy Paste: [[2506.00718]] From Local Cues to Global Percepts: Emergent Gestalt Organization in Self-Supervised Vision Models(https://arxiv.org/abs/2506.00718)
Keywords: self-supervised
Abstract: Human vision organizes local cues into coherent global forms using Gestalt principles like closure, proximity, and figure-ground assignment -- functions reliant on global spatial structure. We investigate whether modern vision models show similar behaviors, and under what training conditions these emerge. We find that Vision Transformers (ViTs) trained with Masked Autoencoding (MAE) exhibit activation patterns consistent with Gestalt laws, including illusory contour completion, convexity preference, and dynamic figure-ground segregation. To probe the computational basis, we hypothesize that modeling global dependencies is necessary for Gestalt-like organization. We introduce the Distorted Spatial Relationship Testbench (DiSRT), which evaluates sensitivity to global spatial perturbations while preserving local textures. Using DiSRT, we show that self-supervised models (e.g., MAE, CLIP) outperform supervised baselines and sometimes even exceed human performance. ConvNeXt models trained with MAE also exhibit Gestalt-compatible representations, suggesting such sensitivity can arise without attention architectures. However, classification finetuning degrades this ability. Inspired by biological vision, we show that a Top-K activation sparsity mechanism can restore global sensitivity. Our findings identify training conditions that promote or suppress Gestalt-like perception and establish DiSRT as a diagnostic for global structure sensitivity across models.

Title: Common Inpainted Objects In-N-Out of Context

Authors: Tianze Yang, Tyson Jordan, Ninghao Liu, Jin Sun
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00721
Pdf URL: https://arxiv.org/pdf/2506.00721
Copy Paste: [[2506.00721]] Common Inpainted Objects In-N-Out of Context(https://arxiv.org/abs/2506.00721)
Keywords: diffusion
Abstract: We present Common Inpainted Objects In-N-Out of Context (COinCO), a novel dataset addressing the scarcity of out-of-context examples in existing vision datasets. By systematically replacing objects in COCO images through diffusion-based inpainting, we create 97,722 unique images featuring both contextually coherent and inconsistent scenes, enabling effective context learning. Each inpainted object is meticulously verified and categorized as in- or out-of-context through a multimodal large language model assessment. Our analysis reveals significant patterns in semantic priors that influence inpainting success across object categories. We demonstrate three key tasks enabled by COinCO: (1) training context classifiers that effectively determine whether existing objects belong in their context; (2) a novel Objects-from-Context prediction task that determines which new objects naturally belong in given scenes at both instance and clique levels, and (3) context-enhanced fake detection on state-of-the-art methods without fine-tuning. COinCO provides a controlled testbed with contextual variations, establishing a foundation for advancing context-aware visual understanding in computer vision and image forensics. Our code and data are at: this https URL.

Title: MoPINNEnKF: Iterative Model Inference using generic-PINN-based ensemble Kalman filter

Authors: Binghang Lu, Changhong Mou, Guang Lin
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00731
Pdf URL: https://arxiv.org/pdf/2506.00731
Copy Paste: [[2506.00731]] MoPINNEnKF: Iterative Model Inference using generic-PINN-based ensemble Kalman filter(https://arxiv.org/abs/2506.00731)
Keywords: diffusion
Abstract: Physics-informed neural networks (PINNs) have emerged as a powerful tool for solving forward and inverse problems involving partial differential equations (PDEs) by incorporating physical laws into the training process. However, the performance of PINNs is often hindered in real-world scenarios involving noisy observational data and missing physics, particularly in inverse problems. In this work, we propose an iterative multi-objective PINN ensemble Kalman filter (MoPINNEnKF) framework that improves the robustness and accuracy of PINNs in both forward and inverse problems by using the \textit{ensemble Kalman filter} and the \textit{non-dominated sorting genetic algorithm} III (NSGA-III). Specifically, NSGA-III is used as a multi-objective optimizer that can generate various ensemble members of PINNs along the optimal Pareto front, while accounting the model uncertainty in the solution space. These ensemble members are then utilized within the EnKF to assimilate noisy observational data. The EnKF's analysis is subsequently used to refine the data loss component for retraining the PINNs, thereby iteratively updating their parameters. The iterative procedure generates improved solutions to the PDEs. The proposed method is tested on two benchmark problems: the one-dimensional viscous Burgers equation and the time-fractional mixed diffusion-wave equation (TFMDWE). The numerical results show it outperforms standard PINNs in handling noisy data and missing physics.

Title: ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

Authors: Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, Yifan Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00742
Pdf URL: https://arxiv.org/pdf/2506.00742
Copy Paste: [[2506.00742]] ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary(https://arxiv.org/abs/2506.00742)
Keywords: in-context
Abstract: Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art benchmarks by a large margin in layout and aesthetic quality by quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation. Project page: this https URL

Title: Aiding Medical Diagnosis through Image Synthesis and Classification

Authors: Kanishk Choudhary
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00786
Pdf URL: https://arxiv.org/pdf/2506.00786
Copy Paste: [[2506.00786]] Aiding Medical Diagnosis through Image Synthesis and Classification(https://arxiv.org/abs/2506.00786)
Keywords: diffusion, generative
Abstract: Medical professionals, especially those in training, often depend on visual reference materials to support an accurate diagnosis and develop pattern recognition skills. However, existing resources may lack the diversity and accessibility needed for broad and effective clinical learning. This paper presents a system designed to generate realistic medical images from textual descriptions and validate their accuracy through a classification model. A pretrained stable diffusion model was fine-tuned using Low-Rank Adaptation (LoRA) on the PathMNIST dataset, consisting of nine colorectal histopathology tissue types. The generative model was trained multiple times using different training parameter configurations, guided by domain-specific prompts to capture meaningful features. To ensure quality control, a ResNet-18 classification model was trained on the same dataset, achieving 99.76% accuracy in detecting the correct label of a colorectal histopathological medical image. Generated images were then filtered using the trained classifier and an iterative process, where inaccurate outputs were discarded and regenerated until they were correctly classified. The highest performing version of the generative model from experimentation achieved an F1 score of 0.6727, with precision and recall scores of 0.6817 and 0.7111, respectively. Some types of tissue, such as adipose tissue and lymphocytes, reached perfect classification scores, while others proved more challenging due to structural complexity. The self-validating approach created demonstrates a reliable method for synthesizing domain-specific medical images because of high accuracy in both the generation and classification portions of the system, with potential applications in both diagnostic support and clinical education. Future work includes improving prompt-specific accuracy and extending the system to other areas of medical imaging.

Title: TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning

Authors: Jiaqi Luo, Yuan Yuan, Shixin Xu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00813
Pdf URL: https://arxiv.org/pdf/2506.00813
Copy Paste: [[2506.00813]] TIME: TabPFN-Integrated Multimodal Engine for Robust Tabular-Image Learning(https://arxiv.org/abs/2506.00813)
Keywords: foundation model
Abstract: Tabular-image multimodal learning, which integrates structured tabular data with imaging data, holds great promise for a variety of tasks, especially in medical applications. Yet, two key challenges remain: (1) the lack of a standardized, pretrained representation for tabular data, as is commonly available in vision and language domains; and (2) the difficulty of handling missing values in the tabular modality, which are common in real-world medical datasets. To address these issues, we propose the TabPFN-Integrated Multimodal Engine (TIME), a novel multimodal framework that builds on the recently introduced tabular foundation model, TabPFN. TIME leverages TabPFN as a frozen tabular encoder to generate robust, strong embeddings that are naturally resilient to missing data, and combines them with image features from pretrained vision backbones. We explore a range of fusion strategies and tabular encoders, and evaluate our approach on both natural and medical datasets. Extensive experiments demonstrate that TIME consistently outperforms competitive baselines across both complete and incomplete tabular inputs, underscoring its practical value in real-world multimodal learning scenarios.

Title: From Plain Text to Poetic Form: Generating Metrically-Constrained Sanskrit Verses

Authors: Manoj Balaji Jagadeeshan, Samarth Bhatia, Pretam Ray, Harshul Raj Surana, Akhil Rajeev P, Priya Mishra, Annarao Kulkarni, Ganesh Ramakrishnan, Prathosh AP, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00815
Pdf URL: https://arxiv.org/pdf/2506.00815
Copy Paste: [[2506.00815]] From Plain Text to Poetic Form: Generating Metrically-Constrained Sanskrit Verses(https://arxiv.org/abs/2506.00815)
Keywords: generative
Abstract: Recent advances in large language models (LLMs) have significantly improved natural language generation, including creative tasks like poetry composition. However, most progress remains concentrated in high-resource languages. This raises an important question: Can LLMs be adapted for structured poetic generation in a low-resource, morphologically rich language such as Sanskrit? In this work, we introduce a dataset designed for translating English prose into structured Sanskrit verse, with strict adherence to classical metrical patterns, particularly the Anushtub meter. We evaluate a range of generative models-both open-source and proprietary-under multiple settings. Specifically, we explore constrained decoding strategies and instruction-based fine-tuning tailored to metrical and semantic fidelity. Our decoding approach achieves over 99% accuracy in producing syntactically valid poetic forms, substantially outperforming general-purpose models in meter conformity. Meanwhile, instruction-tuned variants show improved alignment with source meaning and poetic style, as supported by human assessments, albeit with marginal trade-offs in metrical precision.

Title: QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration

Authors: Jiatong Li, Libo Zhu, Haotong Qin, Jingkai Wang, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00820
Pdf URL: https://arxiv.org/pdf/2506.00820
Copy Paste: [[2506.00820]] QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration(https://arxiv.org/abs/2506.00820)
Keywords: diffusion
Abstract: Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations of diffusion models make it difficult to deploy them on devices like smartphones. In this work, we propose QuantFace, a novel low-bit quantization for one-step diffusion face restoration models, where the full-precision (\ie, 32-bit) weights and activations are quantized to 4$\sim$6-bit. We first analyze the data distribution within activations and find that they are highly variant. To preserve the original data information, we employ rotation-scaling channel balancing. Furthermore, we propose Quantization-Distillation Low-Rank Adaptation (QD-LoRA) that jointly optimizes for quantization and distillation performance. Finally, we propose an adaptive bit-width allocation strategy. We formulate such a strategy as an integer programming problem, which combines quantization error and perceptual metrics to find a satisfactory resource allocation. Extensive experiments on the synthetic and real-world datasets demonstrate the effectiveness of QuantFace under 6-bit and 4-bit. QuantFace achieves significant advantages over recent leading low-bit quantization methods for face restoration. The code is available at this https URL.

Title: SafeGenes: Evaluating the Adversarial Robustness of Genomic Foundation Models

Authors: Huixin Zhan, Jason H. Moore
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00821
Pdf URL: https://arxiv.org/pdf/2506.00821
Copy Paste: [[2506.00821]] SafeGenes: Evaluating the Adversarial Robustness of Genomic Foundation Models(https://arxiv.org/abs/2506.00821)
Keywords: foundation model
Abstract: Genomic Foundation Models (GFMs), such as Evolutionary Scale Modeling (ESM), have demonstrated significant success in variant effect prediction. However, their adversarial robustness remains largely unexplored. To address this gap, we propose SafeGenes: a framework for Secure analysis of genomic foundation models, leveraging adversarial attacks to evaluate robustness against both engineered near-identical adversarial Genes and embedding-space manipulations. In this study, we assess the adversarial vulnerabilities of GFMs using two approaches: the Fast Gradient Sign Method (FGSM) and a soft prompt attack. FGSM introduces minimal perturbations to input sequences, while the soft prompt attack optimizes continuous embeddings to manipulate model predictions without modifying the input tokens. By combining these techniques, SafeGenes provides a comprehensive assessment of GFM susceptibility to adversarial manipulation. Targeted soft prompt attacks led to substantial performance degradation, even in large models such as ESM1b and ESM1v. These findings expose critical vulnerabilities in current foundation models, opening new research directions toward improving their security and robustness in high-stakes genomic applications such as variant effect prediction.

Title: Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks

Authors: Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00823
Pdf URL: https://arxiv.org/pdf/2506.00823
Copy Paste: [[2506.00823]] Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks(https://arxiv.org/abs/2506.00823)
Keywords: in-context
Abstract: Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at this https URL

Title: HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs

Authors: Yongkang Xiao, Rui Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00826
Pdf URL: https://arxiv.org/pdf/2506.00826
Copy Paste: [[2506.00826]] HERGC: Heterogeneous Experts Representation and Generative Completion for Multimodal Knowledge Graphs(https://arxiv.org/abs/2506.00826)
Keywords: generative
Abstract: Multimodal knowledge graphs (MMKGs) enrich traditional knowledge graphs (KGs) by incorporating diverse modalities such as images and text. Multi-modal knowledge graph completion (MMKGC) seeks to exploit these heterogeneous signals to infer missing facts, thereby mitigating the intrinsic incompleteness of MMKGs. Existing MMKGC methods typically leverage only the information contained in the MMKGs under the closed-world assumption and adopt discriminative training objectives, which limits their reasoning capacity during completion. Recent generative completion approaches powered by advanced large language models (LLMs) have shown strong reasoning abilities in unimodal knowledge graph completion, but their potential in MMKGC remains largely unexplored. To bridge this gap, we propose HERGC, a Heterogeneous Experts Representation and Generative Completion framework for MMKGs. HERGC first deploys a Heterogeneous Experts Representation Retriever that enriches and fuses multimodal information and retrieves a compact candidate set for each incomplete triple. It then uses a Generative LLM Predictor fine-tuned on minimal instruction data to accurately identify the correct answer from these candidates. Extensive experiments on three standard MMKG benchmarks demonstrate HERGC's effectiveness and robustness, achieving state-of-the-art performance.

Title: SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

Authors: Zhengcong Fei, Hao Jiang, Di Qiu, Baoxuan Gu, Youqiang Zhang, Jiahua Wang, Jialin Bai, Debang Li, Mingyuan Fan, Guibin Chen, Yahui Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00830
Pdf URL: https://arxiv.org/pdf/2506.00830
Copy Paste: [[2506.00830]] SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers(https://arxiv.org/abs/2506.00830)
Keywords: diffusion
Abstract: The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remains under explored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.

Title: A Large Language Model-Supported Threat Modeling Framework for Transportation Cyber-Physical Systems

Authors: M Sabbir Salek, Mashrur Chowdhury, Muhaimin Bin Munir, Yuchen Cai, Mohammad Imtiaz Hasan, Jean-Michel Tine, Latifur Khan, Mizanur Rahman
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00831
Pdf URL: https://arxiv.org/pdf/2506.00831
Copy Paste: [[2506.00831]] A Large Language Model-Supported Threat Modeling Framework for Transportation Cyber-Physical Systems(https://arxiv.org/abs/2506.00831)
Keywords: in-context
Abstract: Modern transportation systems rely on cyber-physical systems (CPS), where cyber systems interact seamlessly with physical systems like transportation-related sensors and actuators to enhance safety, mobility, and energy efficiency. However, growing automation and connectivity increase exposure to cyber vulnerabilities. Existing threat modeling frameworks for transportation CPS are often limited in scope, resource-intensive, and dependent on significant cybersecurity expertise. To address these gaps, we present TraCR-TMF (Transportation Cybersecurity and Resiliency Threat Modeling Framework), a large language model (LLM)-based framework that minimizes expert intervention. TraCR-TMF identifies threats, potential attack techniques, and corresponding countermeasures by leveraging the MITRE ATT&CK matrix through three LLM-based approaches: (i) a retrieval-augmented generation (RAG) method requiring no expert input, (ii) an in-context learning approach requiring low expert input, and (iii) a supervised fine-tuning method requiring moderate expert input. TraCR-TMF also maps attack paths to critical assets by analyzing vulnerabilities using a customized LLM. The framework was evaluated in two scenarios. First, it identified relevant attack techniques across transportation CPS applications, with 90% precision as validated by experts. Second, using a fine-tuned LLM, it successfully predicted multiple exploitations including lateral movement, data exfiltration, and ransomware-related encryption that occurred during a major real-world cyberattack incident. These results demonstrate TraCR-TMF's effectiveness in CPS threat modeling, its reduced reliance on cybersecurity expertise, and its adaptability across CPS domains.

Title: Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience

Authors: Jiawei Gu, Ziting Xian, Yuanzhen Xie, Ye Liu, Enjie Liu, Ruichao Zhong, Mochi Gao, Yunzhi Tan, Bo Hu, Zang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00842
Pdf URL: https://arxiv.org/pdf/2506.00842
Copy Paste: [[2506.00842]] Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience(https://arxiv.org/abs/2506.00842)
Keywords: in-context
Abstract: Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9x, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.

Title: Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis

Authors: Qi Chen, Jierui Zhu, Florian Shkurti
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00849
Pdf URL: https://arxiv.org/pdf/2506.00849
Copy Paste: [[2506.00849]] Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis(https://arxiv.org/abs/2506.00849)
Keywords: diffusion
Abstract: Despite the empirical success of Diffusion Models (DMs) and Variational Autoencoders (VAEs), their generalization performance remains theoretically underexplored, especially lacking a full consideration of the shared encoder-generator structure. Leveraging recent information-theoretic tools, we propose a unified theoretical framework that provides guarantees for the generalization of both the encoder and generator by treating them as randomized mappings. This framework further enables (1) a refined analysis for VAEs, accounting for the generator's generalization, which was previously overlooked; (2) illustrating an explicit trade-off in generalization terms for DMs that depends on the diffusion time $T$; and (3) providing computable bounds for DMs based solely on the training data, allowing the selection of the optimal $T$ and the integration of such bounds into the optimization process to improve model performance. Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.

Title: FourierFlow: Frequency-aware Flow Matching for Generative Turbulence Modeling

Authors: Haixin Wang, Jiashu Pan, Hao Wu, Fan Zhang, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00862
Pdf URL: https://arxiv.org/pdf/2506.00862
Copy Paste: [[2506.00862]] FourierFlow: Frequency-aware Flow Matching for Generative Turbulence Modeling(https://arxiv.org/abs/2506.00862)
Keywords: diffusion, generative
Abstract: Modeling complex fluid systems, especially turbulence governed by partial differential equations (PDEs), remains a fundamental challenge in science and engineering. Recently, diffusion-based generative models have gained attention as a powerful approach for these tasks, owing to their capacity to capture long-range dependencies and recover hierarchical structures. However, we present both empirical and theoretical evidence showing that generative models struggle with significant spectral bias and common-mode noise when generating high-fidelity turbulent flows. Here we propose FourierFlow, a novel generative turbulence modeling framework that enhances the frequency-aware learning by both implicitly and explicitly mitigating spectral bias and common-mode noise. FourierFlow comprises three key innovations. Firstly, we adopt a dual-branch backbone architecture, consisting of a salient flow attention branch with local-global awareness to focus on sensitive turbulence areas. Secondly, we introduce a frequency-guided Fourier mixing branch, which is integrated via an adaptive fusion strategy to explicitly mitigate spectral bias in the generative model. Thirdly, we leverage the high-frequency modeling capabilities of the masked auto-encoder pre-training and implicitly align the features of the generative model toward high-frequency components. We validate the effectiveness of FourierFlow on three canonical turbulent flow scenarios, demonstrating superior performance compared to state-of-the-art methods. Furthermore, we show that our model exhibits strong generalization capabilities in challenging settings such as out-of-distribution domains, long-term temporal extrapolation, and robustness to noisy inputs. The code can be found at this https URL.

Title: Local Manifold Approximation and Projection for Manifold-Aware Diffusion Planning

Authors: Kyowoon Lee, Jaesik Choi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00867
Pdf URL: https://arxiv.org/pdf/2506.00867
Copy Paste: [[2506.00867]] Local Manifold Approximation and Projection for Manifold-Aware Diffusion Planning(https://arxiv.org/abs/2506.00867)
Keywords: diffusion, generative
Abstract: Recent advances in diffusion-based generative modeling have demonstrated significant promise in tackling long-horizon, sparse-reward tasks by leveraging offline datasets. While these approaches have achieved promising results, their reliability remains inconsistent due to the inherent stochastic risk of producing infeasible trajectories, limiting their applicability in safety-critical applications. We identify that the primary cause of these failures is inaccurate guidance during the sampling procedure, and demonstrate the existence of manifold deviation by deriving a lower bound on the guidance gap. To address this challenge, we propose Local Manifold Approximation and Projection (LoMAP), a training-free method that projects the guided sample onto a low-rank subspace approximated from offline datasets, preventing infeasible trajectory generation. We validate our approach on standard offline reinforcement learning benchmarks that involve challenging long-horizon planning. Furthermore, we show that, as a standalone module, LoMAP can be incorporated into the hierarchical diffusion planner, providing further performance enhancements.

Title: Towards Predicting Any Human Trajectory In Context

Authors: Ryo Fujii, Hideo Saito, Ryo Hachiuma
Subjects: cs.CV, cs.AI, cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2506.00871
Pdf URL: https://arxiv.org/pdf/2506.00871
Copy Paste: [[2506.00871]] Towards Predicting Any Human Trajectory In Context(https://arxiv.org/abs/2506.00871)
Keywords: in-context
Abstract: Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, this process is often impractical on edge devices due to constrained computational resources. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables rapid adaptation without fine-tuning on the scenario-specific data. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. The code will be released at this https URL.

Title: Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection

Authors: Yue Zhou, Xinan He, KaiQing Lin, Bin Fan, Feng Ding, Bin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00874
Pdf URL: https://arxiv.org/pdf/2506.00874
Copy Paste: [[2506.00874]] Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection(https://arxiv.org/abs/2506.00874)
Keywords: diffusion, generative
Abstract: Current AIGC detectors often achieve near-perfect accuracy on images produced by the same generator used for training but struggle to generalize to outputs from unseen generators. We trace this failure in part to latent prior bias: detectors learn shortcuts tied to patterns stemming from the initial noise vector rather than learning robust generative artifacts. To address this, we propose On-Manifold Adversarial Training (OMAT): by optimizing the initial latent noise of diffusion models under fixed conditioning, we generate on-manifold adversarial examples that remain on the generator's output manifold-unlike pixel-space attacks, which introduce off-manifold perturbations that the generator itself cannot reproduce and that can obscure the true discriminative artifacts. To test against state-of-the-art generative models, we introduce GenImage++, a test-only benchmark of outputs from advanced generators (Flux.1, SD3) with extended prompts and diverse styles. We apply our adversarial-training paradigm to ResNet50 and CLIP baselines and evaluate across existing AIGC forensic benchmarks and recent challenge datasets. Extensive experiments show that adversarially trained detectors significantly improve cross-generator performance without any network redesign. Our findings on latent-prior bias offer valuable insights for future dataset construction and detector evaluation, guiding the development of more robust and generalizable AIGC forensic methodologies.

Title: State-Covering Trajectory Stitching for Diffusion Planners

Authors: Kyowoon Lee, Jaesik Choi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00895
Pdf URL: https://arxiv.org/pdf/2506.00895
Copy Paste: [[2506.00895]] State-Covering Trajectory Stitching for Diffusion Planners(https://arxiv.org/abs/2506.00895)
Keywords: diffusion, generative
Abstract: Diffusion-based generative models are emerging as powerful tools for long-horizon planning in reinforcement learning (RL), particularly with offline datasets. However, their performance is fundamentally limited by the quality and diversity of training data. This often restricts their generalization to tasks outside their training distribution or longer planning horizons. To overcome this challenge, we propose State-Covering Trajectory Stitching (SCoTS), a novel reward-free trajectory augmentation method that incrementally stitches together short trajectory segments, systematically generating diverse and extended trajectories. SCoTS first learns a temporal distance-preserving latent representation that captures the underlying temporal structure of the environment, then iteratively stitches trajectory segments guided by directional exploration and novelty to effectively cover and expand this latent space. We demonstrate that SCoTS significantly improves the performance and generalization capabilities of diffusion planners on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning. Furthermore, augmented trajectories generated by SCoTS significantly improve the performance of widely used offline goal-conditioned RL algorithms across diverse environments.

Title: DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation

Authors: Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00908
Pdf URL: https://arxiv.org/pdf/2506.00908
Copy Paste: [[2506.00908]] DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation(https://arxiv.org/abs/2506.00908)
Keywords: diffusion
Abstract: Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.

Title: 3D Skeleton-Based Action Recognition: A Review

Authors: Mengyuan Liu, Hong Liu, Qianshuo Hu, Bin Ren, Junsong Yuan, Jiaying Lin, Jiajun Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00915
Pdf URL: https://arxiv.org/pdf/2506.00915
Copy Paste: [[2506.00915]] 3D Skeleton-Based Action Recognition: A Review(https://arxiv.org/abs/2506.00915)
Keywords: generative
Abstract: With the inherent advantages of skeleton representation, 3D skeleton-based action recognition has become a prominent topic in the field of computer vision. However, previous reviews have predominantly adopted a model-oriented perspective, often neglecting the fundamental steps involved in skeleton-based action recognition. This oversight tends to ignore key components of skeleton-based action recognition beyond model design and has hindered deeper, more intrinsic understanding of the task. To bridge this gap, our review aims to address these limitations by presenting a comprehensive, task-oriented framework for understanding skeleton-based action recognition. We begin by decomposing the task into a series of sub-tasks, placing particular emphasis on preprocessing steps such as modality derivation and data augmentation. The subsequent discussion delves into critical sub-tasks, including feature extraction and spatio-temporal modeling techniques. Beyond foundational action recognition networks, recently advanced frameworks such as hybrid architectures, Mamba models, large language models (LLMs), and generative models have also been highlighted. Finally, a comprehensive overview of public 3D skeleton datasets is presented, accompanied by an analysis of state-of-the-art algorithms evaluated on these benchmarks. By integrating task-oriented discussions, comprehensive examinations of sub-tasks, and an emphasis on the latest advancements, our review provides a fundamental and accessible structured roadmap for understanding and advancing the field of 3D skeleton-based action recognition.

Title: Position as Probability: Self-Supervised Transformers that Think Past Their Training for Length Extrapolation

Authors: Philip Heejun Lee
Subjects: cs.LG, cs.AI, cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2506.00920
Pdf URL: https://arxiv.org/pdf/2506.00920
Copy Paste: [[2506.00920]] Position as Probability: Self-Supervised Transformers that Think Past Their Training for Length Extrapolation(https://arxiv.org/abs/2506.00920)
Keywords: self-supervised
Abstract: Deep sequence models typically degrade in accuracy when test sequences significantly exceed their training lengths, yet many critical tasks--such as algorithmic reasoning, multi-step arithmetic, and compositional generalization--require robust length extrapolation. We introduce PRISM, a Probabilistic Relative-position Implicit Superposition Model, a novel positional encoding mechanism that enables Transformers to extrapolate accurately up to 10x beyond their training length. PRISM learns continuous relative positions through a differentiable histogram-filter update, preserving position uncertainty via a probabilistic superposition rather than conventional deterministic embeddings. Empirically, PRISM achieves state-of-the-art length extrapolation, successfully generalizing to previously intractable sequence lengths across algorithmic benchmarks--including arithmetic (addition, multiplication), SCAN compositionality tasks, and complex copy variants derived from DeepMind's recent datasets. Our analysis demonstrates that PRISM's stochastic positional encoding maintains sharp and interpretable internal states, providing a theoretical basis for reliable length generalization. These results advance the goal of neural sequence models that remain algorithmically robust at lengths far exceeding their training horizon.

Title: Deformable registration and generative modelling of aortic anatomies by auto-decoders and neural ODEs

Authors: Riccardo Tenderini, Luca Pegolotti, Fanwei Kong, Stefano Pagani, Francesco Regazzoni, Alison L. Marsden, Simone Deparis
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2506.00947
Pdf URL: https://arxiv.org/pdf/2506.00947
Copy Paste: [[2506.00947]] Deformable registration and generative modelling of aortic anatomies by auto-decoders and neural ODEs(https://arxiv.org/abs/2506.00947)
Keywords: generative
Abstract: This work introduces AD-SVFD, a deep learning model for the deformable registration of vascular shapes to a pre-defined reference and for the generation of synthetic anatomies. AD-SVFD operates by representing each geometry as a weighted point cloud and models ambient space deformations as solutions at unit time of ODEs, whose time-independent right-hand sides are expressed through artificial neural networks. The model parameters are optimized by minimizing the Chamfer Distance between the deformed and reference point clouds, while backward integration of the ODE defines the inverse transformation. A distinctive feature of AD-SVFD is its auto-decoder structure, that enables generalization across shape cohorts and favors efficient weight sharing. In particular, each anatomy is associated with a low-dimensional code that acts as a self-conditioning field and that is jointly optimized with the network parameters during training. At inference, only the latent codes are fine-tuned, substantially reducing computational overheads. Furthermore, the use of implicit shape representations enables generative applications: new anatomies can be synthesized by suitably sampling from the latent space and applying the corresponding inverse transformations to the reference geometry. Numerical experiments, conducted on healthy aortic anatomies, showcase the high-quality results of AD-SVFD, which yields extremely accurate approximations at competitive computational costs.

Title: Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection

Authors: Geonu Lee, Yujeong Oh, Geonhui Jang, Soyoung Lee, Jeonghyo Song, Sungmin Cha, YoungJoon Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00956
Pdf URL: https://arxiv.org/pdf/2506.00956
Copy Paste: [[2506.00956]] Continual-MEGA: A Large-scale Benchmark for Generalizable Continual Anomaly Detection(https://arxiv.org/abs/2506.00956)
Keywords: anomaly
Abstract: In this paper, we introduce a new benchmark for continual learning in anomaly detection, aimed at better reflecting real-world deployment scenarios. Our benchmark, Continual-MEGA, includes a large and diverse dataset that significantly expands existing evaluation settings by combining carefully curated existing datasets with our newly proposed dataset, ContinualAD. In addition to standard continual learning with expanded quantity, we propose a novel scenario that measures zero-shot generalization to unseen classes, those not observed during continual adaptation. This setting poses a new problem setting that continual adaptation also enhances zero-shot performance. We also present a unified baseline algorithm that improves robustness in few-shot detection and maintains strong generalization. Through extensive evaluations, we report three key findings: (1) existing methods show substantial room for improvement, particularly in pixel-level defect localization; (2) our proposed method consistently outperforms prior approaches; and (3) the newly introduced ContinualAD dataset enhances the performance of strong anomaly detection models. We release the benchmark and code in this https URL.

Title: From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation

Authors: Cheng Cheng, Zhenya Huang, Guanhao Zhao, Yuxiang Guo, Xin Lin, Jinze Wu, Xin Li, Shijin Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.00963
Pdf URL: https://arxiv.org/pdf/2506.00963
Copy Paste: [[2506.00963]] From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation(https://arxiv.org/abs/2506.00963)
Keywords: generative
Abstract: Automatically generating high-quality mathematical problems that align with educational objectives is a crucial task in NLP-based educational technology. Traditional generation methods focus primarily on textual quality, but they often overlook educational objectives. Moreover, these methods address only single-dimensional, simple question generation, failing to meet complex, multifaceted educational requirements. To address these challenges, we constructed and annotated EduMath, a dataset of 16k mathematical questions with multi-dimensional educational objectives. Based on this dataset, we developed EQGEVAL, which incorporates three evaluation dimensions and is designed to assess the ability of models to generate educational questions. Drawing inspiration from teachers' problem design processes, we propose the Educational Question Planning with self-Reflection (EQPR) method for educational mathematical question generation, following a "plan-evaluate-optimize" approach. Specifically, by combining planning algorithm based on Monte Carlo Tree Search with the generative capabilities of Large Language Models, we continuously optimize questions through iterative feedback. This self-optimization mechanism ensures that the generated questions both fit the educational context and strategically achieve specific basic educational objectives. Through extensive experiments based on EQGEVAL, we have demonstrated that EQPR achieves significant improvements in generating questions that meet multi-dimensional educational objectives.

Title: Pilot Contamination-Aware Graph Attention Network for Power Control in CFmMIMO

Authors: Tingting Zhang, Sergiy A. Vorobyov, David J. Love, Taejoon Kim, Kai Dong
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00967
Pdf URL: https://arxiv.org/pdf/2506.00967
Copy Paste: [[2506.00967]] Pilot Contamination-Aware Graph Attention Network for Power Control in CFmMIMO(https://arxiv.org/abs/2506.00967)
Keywords: self-supervised
Abstract: Optimization-based power control algorithms are predominantly iterative with high computational complexity, making them impractical for real-time applications in cell-free massive multiple-input multiple-output (CFmMIMO) systems. Learning-based methods have emerged as a promising alternative, and among them, graph neural networks (GNNs) have demonstrated their excellent performance in solving power control problems. However, all existing GNN-based approaches assume ideal orthogonality among pilot sequences for user equipments (UEs), which is unrealistic given that the number of UEs exceeds the available orthogonal pilot sequences in CFmMIMO schemes. Moreover, most learning-based methods assume a fixed number of UEs, whereas the number of active UEs varies over time in practice. Additionally, supervised training necessitates costly computational resources for computing the target power control solutions for a large volume of training samples. To address these issues, we propose a graph attention network for downlink power control in CFmMIMO systems that operates in a self-supervised manner while effectively handling pilot contamination and adapting to a dynamic number of UEs. Experimental results show its effectiveness, even in comparison to the optimal accelerated projected gradient method as a baseline.

Title: NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

Authors: Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, Peilin Zhao
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.00975
Pdf URL: https://arxiv.org/pdf/2506.00975
Copy Paste: [[2506.00975]] NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction(https://arxiv.org/abs/2506.00975)
Keywords: generative
Abstract: Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications.

Title: Quantization-based Bounds on the Wasserstein Metric

Authors: Jonathan Bobrutsky, Amit Moscovich
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00976
Pdf URL: https://arxiv.org/pdf/2506.00976
Copy Paste: [[2506.00976]] Quantization-based Bounds on the Wasserstein Metric(https://arxiv.org/abs/2506.00976)
Keywords: generative
Abstract: The Wasserstein metric has become increasingly important in many machine learning applications such as generative modeling, image retrieval and domain adaptation. Despite its appeal, it is often too costly to compute. This has motivated approximation methods like entropy-regularized optimal transport, downsampling, and subsampling, which trade accuracy for computational efficiency. In this paper, we consider the challenge of computing efficient approximations to the Wasserstein metric that also serve as strict upper or lower bounds. Focusing on discrete measures on regular grids, our approach involves formulating and exactly solving a Kantorovich problem on a coarse grid using a quantized measure and specially designed cost matrix, followed by an upscaling and correction stage. This is done either in the primal or dual space to obtain valid upper and lower bounds on the Wasserstein metric of the full-resolution inputs. We evaluate our methods on the DOTmark optimal transport images benchmark, demonstrating a 10x-100x speedup compared to entropy-regularized OT while keeping the approximation error below 2%.

Title: IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Authors: Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00979
Pdf URL: https://arxiv.org/pdf/2506.00979
Copy Paste: [[2506.00979]] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection(https://arxiv.org/abs/2506.00979)
Keywords: diffusion, generative
Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no approach supports detecting both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE , a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose Ivy Explainable Detector (IVY-XDETECTOR), a unified AIGC detection and explainable architecture that jointly performs explainable detection for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at this https URL.

Title: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training

Authors: Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.00981
Pdf URL: https://arxiv.org/pdf/2506.00981
Copy Paste: [[2506.00981]] What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training(https://arxiv.org/abs/2506.00981)
Keywords: self-supervised
Abstract: How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it's less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.

Title: GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs

Authors: Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00991
Pdf URL: https://arxiv.org/pdf/2506.00991
Copy Paste: [[2506.00991]] GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs(https://arxiv.org/abs/2506.00991)
Keywords: generative
Abstract: The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curates high-quality prompts of geometric optical scenarios and use MLLMs to construct GOBench-Gen-1k this http URL then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM model, Gemini-2.5Pro, attains a mere 37.35\% accuracy in optical understanding.

Title: Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

Authors: Kinam Kim, Junha Hyung, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00996
Pdf URL: https://arxiv.org/pdf/2506.00996
Copy Paste: [[2506.00996]] Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models(https://arxiv.org/abs/2506.00996)
Keywords: diffusion, in-context
Abstract: Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit this https URL

Title: Motion-Aware Concept Alignment for Consistent Video Editing

Authors: Tong Zhang, Juan C Leon Alcazar, Bernard Ghanem
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01004
Pdf URL: https://arxiv.org/pdf/2506.01004
Copy Paste: [[2506.01004]] Motion-Aware Concept Alignment for Consistent Video Editing(https://arxiv.org/abs/2506.01004)
Keywords: diffusion
Abstract: We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video, while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization for smooth frame transitions. We evaluate MoCA's performance using the standard SSIM, image-level LPIPS, temporal LPIPS, and introduce a novel metric CASS (Conceptual Alignment Shift Score) to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. Using self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite having no training or fine-tuning. MoCA-Video demonstrates that structured manipulation in the diffusion noise trajectory allows for controllable, high-quality video synthesis.

Title: Autoregressive Images Watermarking through Lexical Biasing: An Approach Resistant to Regeneration Attack

Authors: Siqi Hui, Yiren Song, Sanping Zhou, Ye Deng, Wenli Huang, Jinjun Wang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.01011
Pdf URL: https://arxiv.org/pdf/2506.01011
Copy Paste: [[2506.01011]] Autoregressive Images Watermarking through Lexical Biasing: An Approach Resistant to Regeneration Attack(https://arxiv.org/abs/2506.01011)
Keywords: diffusion
Abstract: Autoregressive (AR) image generation models have gained increasing attention for their breakthroughs in synthesis quality, highlighting the need for robust watermarking to prevent misuse. However, existing in-generation watermarking techniques are primarily designed for diffusion models, where watermarks are embedded within diffusion latent states. This design poses significant challenges for direct adaptation to AR models, which generate images sequentially through token prediction. Moreover, diffusion-based regeneration attacks can effectively erase such watermarks by perturbing diffusion latent states. To address these challenges, we propose Lexical Bias Watermarking (LBW), a novel framework designed for AR models that resists regeneration attacks. LBW embeds watermarks directly into token maps by biasing token selection toward a predefined green list during generation. This approach ensures seamless integration with existing AR models and extends naturally to post-hoc watermarking. To increase the security against white-box attacks, instead of using a single green list, the green list for each image is randomly sampled from a pool of green lists. Watermark detection is performed via quantization and statistical analysis of the token distribution. Extensive experiments demonstrate that LBW achieves superior watermark robustness, particularly in resisting regeneration attacks.

Title: AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Authors: Yuyuan Liu, Yuanhong Chen, Chong Wang, Junlin Han, Junde Wu, Can Peng, Jingkun Chen, Yu Tian, Gustavo Carneiro
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01015
Pdf URL: https://arxiv.org/pdf/2506.01015
Copy Paste: [[2506.01015]] AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting(https://arxiv.org/abs/2506.01015)
Keywords: foundation model
Abstract: Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised, leading to misguidance in SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which externally attaches to SAM2 to integrate features from different modalities and generate feature-level prompts, guiding SAM2's decoder in segmenting sounding targets. Such integration is facilitated by a feature pyramid, further refining semantic understanding and enhancing object awareness in multimodal scenarios. Additionally, the audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to also mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over the previous methods in the field. Code is available at this https URL.

Title: Modality Translation and Registration of MR and Ultrasound Images Using Diffusion Models

Authors: Xudong Ma, Nantheera Anantrasirichai, Stefanos Bolomytis, Alin Achim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01025
Pdf URL: https://arxiv.org/pdf/2506.01025
Copy Paste: [[2506.01025]] Modality Translation and Registration of MR and Ultrasound Images Using Diffusion Models(https://arxiv.org/abs/2506.01025)
Keywords: diffusion
Abstract: Multimodal MR-US registration is critical for prostate cancer diagnosis. However, this task remains challenging due to significant modality discrepancies. Existing methods often fail to align critical boundaries while being overly sensitive to irrelevant details. To address this, we propose an anatomically coherent modality translation (ACMT) network based on a hierarchical feature disentanglement design. We leverage shallow-layer features for texture consistency and deep-layer features for boundary preservation. Unlike conventional modality translation methods that convert one modality into another, our ACMT introduces the customized design of an intermediate pseudo modality. Both MR and US images are translated toward this intermediate domain, effectively addressing the bottlenecks faced by traditional translation methods in the downstream registration task. Experiments demonstrate that our method mitigates modality-specific discrepancies while preserving crucial anatomical boundaries for accurate registration. Quantitative evaluations show superior modality similarity compared to state-of-the-art modality translation methods. Furthermore, downstream registration experiments confirm that our translated images achieve the best alignment performance, highlighting the robustness of our framework for multi-modal prostate image registration.

Title: Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

Authors: Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, Kai Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01037
Pdf URL: https://arxiv.org/pdf/2506.01037
Copy Paste: [[2506.01037]] Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution(https://arxiv.org/abs/2506.01037)
Keywords: diffusion, self-supervised
Abstract: Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves superior perceptual quality than state-of-the-arts on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.

Title: ECP-Mamba: An Efficient Multi-scale Self-supervised Contrastive Learning Method with State Space Model for PolSAR Image Classification

Authors: Zuzheng Kuang, Haixia Bi, Chen Xu, Jian Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01040
Pdf URL: https://arxiv.org/pdf/2506.01040
Copy Paste: [[2506.01040]] ECP-Mamba: An Efficient Multi-scale Self-supervised Contrastive Learning Method with State Space Model for PolSAR Image Classification(https://arxiv.org/abs/2506.01040)
Keywords: self-supervised
Abstract: Recently, polarimetric synthetic aperture radar (PolSAR) image classification has been greatly promoted by deep neural networks. However,current deep learning-based PolSAR classification methods encounter difficulties due to its dependence on extensive labeled data and the computational inefficiency of architectures like Transformers. This paper presents ECP-Mamba, an efficient framework integrating multi-scale self-supervised contrastive learning with a state space model (SSM) backbone. Specifically, ECP-Mamba addresses annotation scarcity through a multi-scale predictive pretext task based on local-to-global feature correspondences, which uses a simplified self-distillation paradigm without negative sample pairs. To enhance computational efficiency,the Mamba architecture (a selective SSM) is first tailored for pixel-wise PolSAR classification task by designing a spiral scan strategy. This strategy prioritizes causally relevant features near the central pixel, leveraging the localized nature of pixel-wise classification tasks. Additionally, the lightweight Cross Mamba module is proposed to facilitates complementary multi-scale feature interaction with minimal overhead. Extensive experiments across four benchmark datasets demonstrate ECP-Mamba's effectiveness in balancing high accuracy with resource efficiency. On the Flevoland 1989 dataset, ECP-Mamba achieves state-of-the-art performance with an overall accuracy of 99.70%, average accuracy of 99.64% and Kappa coefficient of 99.62e-2. Our code will be available at this https URL.

Title: AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation

Authors: Dahyeon Kye, Changhyun Roh, Sukhun Ko, Chanho Eom, Jihyong Oh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01061
Pdf URL: https://arxiv.org/pdf/2506.01061
Copy Paste: [[2506.01061]] AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation(https://arxiv.org/abs/2506.01061)
Keywords: diffusion
Abstract: Video Frame Interpolation (VFI) is a fundamental Low-Level Vision (LLV) task that synthesizes intermediate frames between existing ones while maintaining spatial and temporal coherence. VFI techniques have evolved from classical motion compensation-based approach to deep learning-based approach, including kernel-, flow-, hybrid-, phase-, GAN-, Transformer-, Mamba-, and more recently diffusion model-based approach. We introduce AceVFI, the most comprehensive survey on VFI to date, covering over 250+ papers across these approaches. We systematically organize and describe VFI methodologies, detailing the core principles, design assumptions, and technical characteristics of each approach. We categorize the learning paradigm of VFI methods namely, Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI). We analyze key challenges of VFI such as large motion, occlusion, lighting variation, and non-linear motion. In addition, we review standard datasets, loss functions, evaluation metrics. We examine applications of VFI including event-based, cartoon, medical image VFI and joint VFI with other LLV tasks. We conclude by outlining promising future research directions to support continued progress in the field. This survey aims to serve as a unified reference for both newcomers and experts seeking a deep understanding of modern VFI landscapes.

Title: A Large Convolutional Neural Network for Clinical Target and Multi-organ Segmentation in Gynecologic Brachytherapy with Multi-stage Learning

Authors: Mingzhe Hu, Yuan Gao, Yuheng Li, Ricahrd LJ Qiu, Chih-Wei Chang, Keyur D. Shah, Priyanka Kapoor, Beth Bradshaw, Yuan Shao, Justin Roper, Jill Remick, Zhen Tian, Xiaofeng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01073
Pdf URL: https://arxiv.org/pdf/2506.01073
Copy Paste: [[2506.01073]] A Large Convolutional Neural Network for Clinical Target and Multi-organ Segmentation in Gynecologic Brachytherapy with Multi-stage Learning(https://arxiv.org/abs/2506.01073)
Keywords: self-supervised
Abstract: Purpose: Accurate segmentation of clinical target volumes (CTV) and organs-at-risk is crucial for optimizing gynecologic brachytherapy (GYN-BT) treatment planning. However, anatomical variability, low soft-tissue contrast in CT imaging, and limited annotated datasets pose significant challenges. This study presents GynBTNet, a novel multi-stage learning framework designed to enhance segmentation performance through self-supervised pretraining and hierarchical fine-tuning strategies. Methods: GynBTNet employs a three-stage training strategy: (1) self-supervised pretraining on large-scale CT datasets using sparse submanifold convolution to capture robust anatomical representations, (2) supervised fine-tuning on a comprehensive multi-organ segmentation dataset to refine feature extraction, and (3) task-specific fine-tuning on a dedicated GYN-BT dataset to optimize segmentation performance for clinical applications. The model was evaluated against state-of-the-art methods using the Dice Similarity Coefficient (DSC), 95th percentile Hausdorff Distance (HD95), and Average Surface Distance (ASD). Results: Our GynBTNet achieved superior segmentation performance, significantly outperforming nnU-Net and Swin-UNETR. Notably, it yielded a DSC of 0.837 +/- 0.068 for CTV, 0.940 +/- 0.052 for the bladder, 0.842 +/- 0.070 for the rectum, and 0.871 +/- 0.047 for the uterus, with reduced HD95 and ASD compared to baseline models. Self-supervised pretraining led to consistent performance improvements, particularly for structures with complex boundaries. However, segmentation of the sigmoid colon remained challenging, likely due to anatomical ambiguities and inter-patient variability. Statistical significance analysis confirmed that GynBTNet's improvements were significant compared to baseline models.

Title: Contextual Candor: Enhancing LLM Trustworthiness Through Hierarchical Unanswerability Detection

Authors: Steven Robinson, Antonio Carlos Rivera
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01104
Pdf URL: https://arxiv.org/pdf/2506.01104
Copy Paste: [[2506.01104]] Contextual Candor: Enhancing LLM Trustworthiness Through Hierarchical Unanswerability Detection(https://arxiv.org/abs/2506.01104)
Keywords: generative
Abstract: The pervasive deployment of large language models (LLMs) in conversational AI systems has revolutionized information access, yet their propensity for generating factually unsupported or hallucinated responses remains a critical impediment to trustworthiness and widespread adoption. This paper introduces Reinforced Unanswerability Learning (RUL), a novel hybrid training paradigm designed to imbue LLMs with the intrinsic capability to accurately detect unanswerable questions and generate reliably appropriate responses. Unlike conventional approaches that rely on external classifiers or simple prompting, RUL integrates a discriminative unanswerability prediction head with the LLM's generative core, guided by a multi-stage learning strategy. This includes supervised fine-tuning on a novel, richly annotated dataset, Enhanced-CAsT-Answerability (ECA), which features hierarchical answerability labels and ground-truth refusal responses. Crucially, RUL incorporates a subsequent reinforcement learning with human feedback (RLHF) phase to refine the nuance, helpfulness, and informativeness of refusal responses. Extensive experiments demonstrate RUL's superior performance, achieving significantly higher accuracy in unanswerability detection across sentence, paragraph, and ranking levels, and substantially increasing the generation of appropriate refusals for unanswerable queries, alongside strong performance on answerable questions. Human evaluations further corroborate RUL's effectiveness, highlighting a marked improvement in perceived helpfulness and trustworthiness, ultimately paving the way for more reliable and user-centric conversational AI.

Title: Revolutionizing Radiology Workflow with Factual and Efficient CXR Report Generation

Authors: Pimchanok Sukjai, Apiradee Boonmee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01118
Pdf URL: https://arxiv.org/pdf/2506.01118
Copy Paste: [[2506.01118]] Revolutionizing Radiology Workflow with Factual and Efficient CXR Report Generation(https://arxiv.org/abs/2506.01118)
Keywords: foundation model
Abstract: The escalating demand for medical image interpretation underscores the critical need for advanced artificial intelligence solutions to enhance the efficiency and accuracy of radiological diagnoses. This paper introduces CXR-PathFinder, a novel Large Language Model (LLM)-centric foundation model specifically engineered for automated chest X-ray (CXR) report generation. We propose a unique training paradigm, Clinician-Guided Adversarial Fine-Tuning (CGAFT), which meticulously integrates expert clinical feedback into an adversarial learning framework to mitigate factual inconsistencies and improve diagnostic precision. Complementing this, our Knowledge Graph Augmentation Module (KGAM) acts as an inference-time safeguard, dynamically verifying generated medical statements against authoritative knowledge bases to minimize hallucinations and ensure standardized terminology. Leveraging a comprehensive dataset of millions of paired CXR images and expert reports, our experiments demonstrate that CXR-PathFinder significantly outperforms existing state-of-the-art medical vision-language models across various quantitative metrics, including clinical accuracy (Macro F1 (14): 46.5, Micro F1 (14): 59.5). Furthermore, blinded human evaluation by board-certified radiologists confirms CXR-PathFinder's superior clinical utility, completeness, and accuracy, establishing its potential as a reliable and efficient aid for radiological practice. The developed method effectively balances high diagnostic fidelity with computational efficiency, providing a robust solution for automated medical report generation.

Title: Neuro-Symbolic Generative Diffusion Models for Physically Grounded, Robust, and Safe Generation

Authors: Jacob K. Christopher, Michael Cardei, Jinhao Liang, Ferdinando Fioretto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01121
Pdf URL: https://arxiv.org/pdf/2506.01121
Copy Paste: [[2506.01121]] Neuro-Symbolic Generative Diffusion Models for Physically Grounded, Robust, and Safe Generation(https://arxiv.org/abs/2506.01121)
Keywords: diffusion, generative
Abstract: Despite the remarkable generative capabilities of diffusion models, their integration into safety-critical or scientifically rigorous applications remains hindered by the need to ensure compliance with stringent physical, structural, and operational constraints. To address this challenge, this paper introduces Neuro-Symbolic Diffusion (NSD), a novel framework that interleaves diffusion steps with symbolic optimization, enabling the generation of certifiably consistent samples under user-defined functional and logic constraints. This key feature is provided for both standard and discrete diffusion models, enabling, for the first time, the generation of both continuous (e.g., images and trajectories) and discrete (e.g., molecular structures and natural language) outputs that comply with constraints. This ability is demonstrated on tasks spanning three key challenges: (1) Safety, in the context of non-toxic molecular generation and collision-free trajectory optimization; (2) Data scarcity, in domains such as drug discovery and materials engineering; and (3) Out-of-domain generalization, where enforcing symbolic constraints allows adaptation beyond the training distribution.

Title: From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models

Authors: Asım Ersoy, Basel Mousi, Shammur Chowdhury, Firoj Alam, Fahim Dalvi, Nadir Durrani
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01133
Pdf URL: https://arxiv.org/pdf/2506.01133
Copy Paste: [[2506.01133]] From Words to Waves: Analyzing Concept Formation in Speech and Text-Based Foundation Models(https://arxiv.org/abs/2506.01133)
Keywords: foundation model
Abstract: The emergence of large language models (LLMs) has demonstrated that systems trained solely on text can acquire extensive world knowledge, develop reasoning capabilities, and internalize abstract semantic concepts--showcasing properties that can be associated with general intelligence. This raises an intriguing question: Do such concepts emerge in models trained on other modalities, such as speech? Furthermore, when models are trained jointly on multiple modalities: Do they develop a richer, more structured semantic understanding? To explore this, we analyze the conceptual structures learned by speech and textual models both individually and jointly. We employ Latent Concept Analysis, an unsupervised method for uncovering and interpreting latent representations in neural networks, to examine how semantic abstractions form across modalities. For reproducibility we made scripts and other resources available to the community.

Title: FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation

Authors: Ariel Shaulov, Itay Hazan, Lior Wolf, Hila Chefer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01144
Pdf URL: https://arxiv.org/pdf/2506.01144
Copy Paste: [[2506.01144]] FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation(https://arxiv.org/abs/2506.01144)
Keywords: diffusion
Abstract: Text-to-video diffusion models are notoriously limited in their ability to model temporal aspects such as motion, physics, and dynamic interactions. Existing approaches address this limitation by retraining the model or introducing external conditioning signals to enforce temporal consistency. In this work, we explore whether a meaningful temporal representation can be extracted directly from the predictions of a pre-trained model without any additional training or auxiliary inputs. We introduce \textbf{FlowMo}, a novel training-free guidance method that enhances motion coherence using only the model's own predictions in each diffusion step. FlowMo first derives an appearance-debiased temporal representation by measuring the distance between latents corresponding to consecutive frames. This highlights the implicit temporal structure predicted by the model. It then estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling. Extensive experiments across multiple text-to-video models demonstrate that FlowMo significantly improves motion coherence without sacrificing visual quality or prompt alignment, offering an effective plug-and-play solution for enhancing the temporal fidelity of pre-trained video diffusion models.

Title: FORT: Forward-Only Regression Training of Normalizing Flows

Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01158
Pdf URL: https://arxiv.org/pdf/2506.01158
Copy Paste: [[2506.01158]] FORT: Forward-Only Regression Training of Normalizing Flows(https://arxiv.org/abs/2506.01158)
Keywords: diffusion, generative
Abstract: Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to neural dynamical systems that encompass modern large-scale diffusion and flow matching models. Despite the scalability of training, the generation of high-quality samples and their corresponding likelihood under the model requires expensive numerical simulation -- inhibiting adoption in numerous scientific applications such as equilibrium sampling of molecular systems. In this paper, we revisit classical normalizing flows as one-step generative models with exact likelihoods and propose a novel, scalable training objective that does not require computing the expensive change of variable formula used in conventional maximum likelihood training. We propose Forward-Only Regression Training (FORT), a simple $\ell_2$-regression objective that maps prior samples under our flow to specifically chosen targets. We demonstrate that FORT supports a wide class of targets, such as optimal transport targets and targets from pre-trained continuous-time normalizing flows (CNF). We further demonstrate that by using CNF targets, our one-step flows allow for larger-scale training that exceeds the performance and stability of maximum likelihood training, while unlocking a broader class of architectures that were previously challenging to train. Empirically, we elucidate that our trained flows can perform equilibrium conformation sampling in Cartesian coordinates of alanine dipeptide, alanine tripeptide, and alanine tetrapeptide.

Title: Bridging Quantum and Classical Computing in Drug Design: Architecture Principles for Improved Molecule Generation

Authors: Andrew Smith, Erhan Guven
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2506.01177
Pdf URL: https://arxiv.org/pdf/2506.01177
Copy Paste: [[2506.01177]] Bridging Quantum and Classical Computing in Drug Design: Architecture Principles for Improved Molecule Generation(https://arxiv.org/abs/2506.01177)
Keywords: generative
Abstract: Hybrid quantum-classical machine learning offers a path to leverage noisy intermediate-scale quantum (NISQ) devices for drug discovery, but optimal model architectures remain unclear. We systematically optimize the quantum-classical bridge architecture for generative adversarial networks (GANs) in molecular discovery using multi-objective Bayesian optimization. Our optimized model (BO-QGAN) significantly improves performance, achieving a 2.27-fold higher Drug Candidate Score (DCS) than prior quantum-hybrid benchmarks and 2.21-fold higher than the classical baseline, using over 60% fewer parameters. Key findings favor layering multiple (3-4) shallow (4-8 qubit) quantum circuits sequentially, while classical architecture shows less sensitivity above a minimum capacity. This work provides the first empirically grounded architectural guidelines for hybrid models, enabling more effective integration of current quantum computers into pharmaceutical research pipelines.

Title: Self-Supervised Multi-View Representation Learning using Vision-Language Model for 3D/4D Facial Expression Recognition

Authors: Muzammil Behzad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01203
Pdf URL: https://arxiv.org/pdf/2506.01203
Copy Paste: [[2506.01203]] Self-Supervised Multi-View Representation Learning using Vision-Language Model for 3D/4D Facial Expression Recognition(https://arxiv.org/abs/2506.01203)
Keywords: self-supervised
Abstract: Facial expression recognition (FER) is a fundamental task in affective computing with applications in human-computer interaction, mental health analysis, and behavioral understanding. In this paper, we propose SMILE-VLM, a self-supervised vision-language model for 3D/4D FER that unifies multiview visual representation learning with natural language supervision. SMILE-VLM learns robust, semantically aligned, and view-invariant embeddings by proposing three core components: multiview decorrelation via a Barlow Twins-style loss, vision-language contrastive alignment, and cross-modal redundancy minimization. Our framework achieves the state-of-the-art performance on multiple benchmarks. We further extend SMILE-VLM to the task of 4D micro-expression recognition (MER) to recognize the subtle affective cues. The extensive results demonstrate that SMILE-VLM not only surpasses existing unsupervised methods but also matches or exceeds supervised baselines, offering a scalable and annotation-efficient solution for expressive facial behavior understanding.

Title: Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors

Authors: Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01247
Pdf URL: https://arxiv.org/pdf/2506.01247
Copy Paste: [[2506.01247]] Visual Sparse Steering: Improving Zero-shot Image Classification with Sparsity Guided Steering Vectors(https://arxiv.org/abs/2506.01247)
Keywords: foundation model
Abstract: Steering vision foundation models at inference time without retraining or access to large labeled datasets is a desirable yet challenging objective, particularly in dynamic or resource-constrained settings. In this paper, we introduce Visual Sparse Steering (VS2), a lightweight, test-time method that guides vision models using steering vectors derived from sparse features learned by top-$k$ Sparse Autoencoders without requiring contrastive data. Specifically, VS2 surpasses zero-shot CLIP by 4.12% on CIFAR-100, 1.08% on CUB-200, and 1.84% on Tiny-ImageNet. We further propose VS2++, a retrieval-augmented variant that selectively amplifies relevant sparse features using pseudo-labeled neighbors at inference time. With oracle positive/negative sets, VS2++ achieves absolute top-1 gains over CLIP zero-shot of up to 21.44% on CIFAR-100, 7.08% on CUB-200, and 20.47% on Tiny-ImageNet. Interestingly, VS2 and VS2++ raise per-class accuracy by up to 25% and 38%, respectively, showing that sparse steering benefits specific classes by disambiguating visually or taxonomically proximate categories rather than providing a uniform boost. Finally, to better align the sparse features learned through the SAE reconstruction task with those relevant for downstream performance, we propose Prototype-Aligned Sparse Steering (PASS). By incorporating a prototype-alignment loss during SAE training, using labels only during training while remaining fully test-time unsupervised, PASS consistently, though modestly, outperforms VS2, achieving a 6.12% gain over VS2 only on CIFAR-100 with ViT-B/32.

Title: Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines

Authors: Do Xuan Long, Duong Ngoc Yen, Do Xuan Trong, Luu Anh Tuan, Kenji Kawaguchi, Shafiq Joty, Min-Yen Kan, Nancy F. Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01265
Pdf URL: https://arxiv.org/pdf/2506.01265
Copy Paste: [[2506.01265]] Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines(https://arxiv.org/abs/2506.01265)
Keywords: in-context
Abstract: In-context learning (ICL) is an important yet not fully understood ability of pre-trained large language models (LLMs). It can greatly enhance task performance using a few examples, termed demonstrations, without fine-tuning. Although effective in question answering, ICL often underperforms in long-form generation tasks such as summarization. Under appropriately realistic assumptions, we empirically and theoretically show that ICL demonstrations alone are insufficient to teach LLMs the task language and format distributions for generation. We argue for explicit exposure to the task distributions and hypothesize that defining them by prompting enhances model performance. To this end, we present LongGuide, which efficiently generates two parallel streams of guidelines capturing task language and format properties: (i) Metric Guidelines (MGs) that instruct models to optimize self-evaluated metrics; and (ii) Output Constraint Guidelines (OCGs) that constrain generation at both token and sentence levels. LongGuide automatically selects the best combination of guidelines, improving both strong open- and closed-source LLMs by over 5% in both zero- and few-shot settings. We show that LongGuide is generalizable, learnable by weak models to enhance strong ones, and integrates synergistically with automatic prompt optimizers.

Title: Schema as Parameterized Tools for Universal Information Extraction

Authors: Sheng Liang, Yongyue Zhang, Yaxiong Wu, Ruiming Tang, Yong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01276
Pdf URL: https://arxiv.org/pdf/2506.01276
Copy Paste: [[2506.01276]] Schema as Parameterized Tools for Universal Information Extraction(https://arxiv.org/abs/2506.01276)
Keywords: in-context
Abstract: Universal information extraction (UIE) primarily employs an extractive generation approach with large language models (LLMs), typically outputting structured information based on predefined schemas such as JSON or tables. UIE suffers from a lack of adaptability when selecting between predefined schemas and on-the-fly schema generation within the in-context learning paradigm, especially when there are numerous schemas to choose from. In this paper, we propose a unified adaptive text-to-structure generation framework, called Schema as Parameterized Tools (SPT), which reimagines the tool-calling capability of LLMs by treating predefined schemas as parameterized tools for tool selection and parameter filling. Specifically, our SPT method can be applied to unify closed, open, and on-demand IE tasks by adopting Schema Retrieval by fetching the relevant schemas from a predefined pool, Schema Filling by extracting information and filling slots as with tool parameters, or Schema Generation by synthesizing new schemas with uncovered cases. Experiments show that the SPT method can handle four distinct IE tasks adaptively, delivering robust schema retrieval and selection performance. SPT also achieves comparable extraction performance to LoRA baselines and current leading UIE systems with significantly fewer trainable parameters.

Title: TSRating: Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment

Authors: Shunyu Wu, Dan Li, Haozheng Ye, Zhuomin Chen, Jiahui Zhou, Jian Lou, Zibin Zheng, See-Kiong Ng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01290
Pdf URL: https://arxiv.org/pdf/2506.01290
Copy Paste: [[2506.01290]] TSRating: Rating Quality of Diverse Time Series Data by Meta-learning from LLM Judgment(https://arxiv.org/abs/2506.01290)
Keywords: foundation model
Abstract: High-quality time series (TS) data are essential for ensuring TS model performance, rendering research on rating TS data quality indispensable. Existing methods have shown promising rating accuracy within individual domains, primarily by extending data quality rating techniques such as influence functions and Shapley values to account for temporal characteristics. However, they neglect the fact that real-world TS data can span vastly different domains and exhibit distinct properties, hampering the accurate and efficient rating of diverse TS data. In this paper, we propose TSRating, a novel and unified framework for rating the quality of time series data crawled from diverse domains. TSRating is built on the assumption that LLMs inherit ample knowledge, acquired during their extensive pretraining, enabling them to comprehend and discern quality differences in diverse TS data. We verify this assumption by devising a series of prompts to elicit quality comparisons from LLMs for pairs of TS samples. We then fit a dedicated rating model, termed TSRater, to convert the LLMs' judgments into efficient quality predictions via TSRater's inference on future TS samples. To ensure cross-domain adaptability, we develop a meta-learning scheme to train TSRater on quality comparisons collected from nine distinct domains. To improve training efficiency, we employ signSGD for inner-loop updates, thus circumventing the demanding computation of hypergradients. Extensive experimental results on eleven benchmark datasets across three time series tasks, each using both conventional TS models and TS foundation models, demonstrate that TSRating outperforms baselines in terms of estimation accuracy, efficiency, and domain adaptability.

Title: SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

Authors: Haiyang Mei, Pengyu Zhang, Mike Zheng Shou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01304
Pdf URL: https://arxiv.org/pdf/2506.01304
Copy Paste: [[2506.01304]] SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost(https://arxiv.org/abs/2506.01304)
Keywords: foundation model
Abstract: Foundation models like the Segment Anything Model (SAM) have significantly advanced promptable image segmentation in computer vision. However, extending these capabilities to videos presents substantial challenges, particularly in ensuring precise and temporally consistent mask propagation in dynamic scenes. SAM 2 attempts to address this by training a model on massive image and video data from scratch to learn complex spatiotemporal associations, resulting in huge training costs that hinder research and practical deployment. In this paper, we introduce SAM-I2V, an effective image-to-video upgradation method for cultivating a promptable video segmentation (PVS) model. Our approach strategically upgrades the pre-trained SAM to support PVS, significantly reducing training complexity and resource requirements. To achieve this, we introduce three key innovations: (i) an image-to-video feature extraction upgrader built upon SAM's static image encoder to enable spatiotemporal video perception, (ii) a memory filtering strategy that selects the most relevant past frames for more effective utilization of historical information, and (iii) a memory-as-prompt mechanism leveraging object memory to ensure temporally consistent mask propagation in dynamic scenes. Comprehensive experiments demonstrate that our method achieves over 90% of SAM 2's performance while using only 0.2% of its training cost. Our work presents a resource-efficient pathway to PVS, lowering barriers for further research in PVS model design and enabling broader applications and advancements in the field. Code and model are available at: this https URL.

Title: $Ψ$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models

Authors: Taehoon Yoon, Yunhong Min, Kyeongmin Yeo, Minhyuk Sung
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.01320
Pdf URL: https://arxiv.org/pdf/2506.01320
Copy Paste: [[2506.01320]] $Ψ$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models(https://arxiv.org/abs/2506.01320)
Keywords: generative
Abstract: We introduce $\Psi$-Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and results in reduced sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and consistently improves performance across various reward alignment tasks, including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation, as demonstrated in our experiments.

Title: Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation

Authors: Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01331
Pdf URL: https://arxiv.org/pdf/2506.01331
Copy Paste: [[2506.01331]] Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation(https://arxiv.org/abs/2506.01331)
Keywords: diffusion
Abstract: Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (eg, Flux-12B). The source code is publicly available at this https URL.

Title: NoiseAR: AutoRegressing Initial Noise Prior for Diffusion Models

Authors: Zeming Li, Xiangyue Liu, Xiangyu Zhang, Ping Tan, Heung-Yeung Shum
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01337
Pdf URL: https://arxiv.org/pdf/2506.01337
Copy Paste: [[2506.01337]] NoiseAR: AutoRegressing Initial Noise Prior for Diffusion Models(https://arxiv.org/abs/2506.01337)
Keywords: diffusion, generative
Abstract: Diffusion models have emerged as powerful generative frameworks, creating data samples by progressively denoising an initial random state. Traditionally, this initial state is sampled from a simple, fixed distribution like isotropic Gaussian, inherently lacking structure and a direct mechanism for external control. While recent efforts have explored ways to introduce controllability into the diffusion process, particularly at the initialization stage, they often rely on deterministic or heuristic approaches. These methods can be suboptimal, lack expressiveness, and are difficult to scale or integrate into more sophisticated optimization frameworks. In this paper, we introduce NoiseAR, a novel method for AutoRegressive Initial Noise Prior for Diffusion Models. Instead of a static, unstructured source, NoiseAR learns to generate a dynamic and controllable prior distribution for the initial noise. We formulate the generation of the initial noise prior's parameters as an autoregressive probabilistic modeling task over spatial patches or tokens. This approach enables NoiseAR to capture complex spatial dependencies and introduce learned structure into the initial state. Crucially, NoiseAR is designed to be conditional, allowing text prompts to directly influence the learned prior, thereby achieving fine-grained control over the diffusion initialization. Our experiments demonstrate that NoiseAR can generate initial noise priors that lead to improved sample quality and enhanced consistency with conditional inputs, offering a powerful, learned alternative to traditional random initialization. A key advantage of NoiseAR is its probabilistic formulation, which naturally supports seamless integration into probabilistic frameworks like Markov Decision Processes and Reinforcement Learning. Our code will be available at this https URL

Title: Unraveling Spatio-Temporal Foundation Models via the Pipeline Lens: A Comprehensive Review

Authors: Yuchen Fang, Hao Miao, Yuxuan Liang, Liwei Deng, Yue Cui, Ximu Zeng, Yuyang Xia, Yan Zhao, Torben Bach Pedersen, Christian S. Jensen, Xiaofang Zhou, Kai Zheng
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01364
Pdf URL: https://arxiv.org/pdf/2506.01364
Copy Paste: [[2506.01364]] Unraveling Spatio-Temporal Foundation Models via the Pipeline Lens: A Comprehensive Review(https://arxiv.org/abs/2506.01364)
Keywords: foundation model
Abstract: Spatio-temporal deep learning models aims to utilize useful patterns in such data to support tasks like prediction. However, previous deep learning models designed for specific tasks typically require separate training for each use case, leading to increased computational and storage costs. To address this issue, spatio-temporal foundation models have emerged, offering a unified framework capable of solving multiple spatio-temporal tasks. These foundation models achieve remarkable success by learning general knowledge with spatio-temporal data or transferring the general capabilities of pre-trained language models. While previous surveys have explored spatio-temporal data and methodologies separately, they have ignored a comprehensive examination of how foundation models are designed, selected, pre-trained, and adapted. As a result, the overall pipeline for spatio-temporal foundation models remains unclear. To bridge this gap, we innovatively provide an up-to-date review of previous spatio-temporal foundation models from the pipeline perspective. The pipeline begins with an introduction to different types of spatio-temporal data, followed by details of data preprocessing and embedding techniques. The pipeline then presents a novel data property taxonomy to divide existing methods according to data sources and dependencies, providing efficient and effective model design and selection for researchers. On this basis, we further illustrate the training objectives of primitive models, as well as the adaptation techniques of transferred models. Overall, our survey provides a clear and structured pipeline to understand the connection between core elements of spatio-temporal foundation models while guiding researchers to get started quickly. Additionally, we introduce emerging opportunities such as multi-objective training in the field of spatio-temporal foundation models.

Title: Synthetic Data Augmentation using Pre-trained Diffusion Models for Long-tailed Food Image Classification

Authors: GaYeon Koh, Hyun-Jic Oh, Jeonghyun Noh, Won-Ki Jeong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01368
Pdf URL: https://arxiv.org/pdf/2506.01368
Copy Paste: [[2506.01368]] Synthetic Data Augmentation using Pre-trained Diffusion Models for Long-tailed Food Image Classification(https://arxiv.org/abs/2506.01368)
Keywords: diffusion, generative
Abstract: Deep learning-based food image classification enables precise identification of food categories, further facilitating accurate nutritional analysis. However, real-world food images often show a skewed distribution, with some food types being more prevalent than others. This class imbalance can be problematic, causing models to favor the majority (head) classes with overall performance degradation for the less common (tail) classes. Recently, synthetic data augmentation using diffusion-based generative models has emerged as a promising solution to address this issue. By generating high-quality synthetic images, these models can help uniformize the data distribution, potentially improving classification performance. However, existing approaches face challenges: fine-tuning-based methods need a uniformly distributed dataset, while pre-trained model-based approaches often overlook inter-class separation in synthetic data. In this paper, we propose a two-stage synthetic data augmentation framework, leveraging pre-trained diffusion models for long-tailed food classification. We generate a reference set conditioned by a positive prompt on the generation target and then select a class that shares similar features with the generation target as a negative prompt. Subsequently, we generate a synthetic augmentation set using positive and negative prompt conditions by a combined sampling strategy that promotes intra-class diversity and inter-class separation. We demonstrate the efficacy of the proposed method on two long-tailed food benchmark datasets, achieving superior performance compared to previous works in terms of top-1 accuracy.

Title: Playing with Transformer at 30+ FPS via Next-Frame Diffusion

Authors: Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, Jiang Bian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01380
Pdf URL: https://arxiv.org/pdf/2506.01380
Copy Paste: [[2506.01380]] Playing with Transformer at 30+ FPS via Next-Frame Diffusion(https://arxiv.org/abs/2506.01380)
Keywords: diffusion
Abstract: Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications with arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost associated with diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) We extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) To fully leverage parallel computation, motivated by the observation that adjacent frames often share the identical action input, we propose speculative sampling. In this approach, the model generates next few frames using current action input, and discard speculatively generated frames if the input action differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD beats autoregressive baselines in terms of both visual quality and sampling efficiency. We, for the first time, achieves autoregressive video generation at over 30 Frames Per Second (FPS) on an A100 GPU using a 310M model.

Title: System Calls for Malware Detection and Classification: Methodologies and Applications

Authors: Bishwajit Prasad Gond, Durga Prasad Mohapatra
Subjects: cs.CR, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01412
Pdf URL: https://arxiv.org/pdf/2506.01412
Copy Paste: [[2506.01412]] System Calls for Malware Detection and Classification: Methodologies and Applications(https://arxiv.org/abs/2506.01412)
Keywords: anomaly
Abstract: As malware continues to become more complex and harder to detect, Malware Analysis needs to continue to evolve to stay one step ahead. One promising key area approach focuses on using system calls and API Calls, the core communication between user applications and the operating system and their kernels. These calls provide valuable insight into how software or programs behaves, making them an useful tool for spotting suspicious or harmful activity of programs and software. This chapter takes a deep down look at how system calls are used in malware detection and classification, covering techniques like static and dynamic analysis, as well as sandboxing. By combining these methods with advanced techniques like machine learning, statistical analysis, and anomaly detection, researchers can analyze system call patterns to tell the difference between normal and malicious behavior. The chapter also explores how these techniques are applied across different systems, including Windows, Linux, and Android, while also looking at the ways sophisticated malware tries to evade detection.

Title: Self-supervised Latent Space Optimization with Nebula Variational Coding

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2506.01414
Pdf URL: https://arxiv.org/pdf/2506.01414
Copy Paste: [[2506.01414]] Self-supervised Latent Space Optimization with Nebula Variational Coding(https://arxiv.org/abs/2506.01414)
Keywords: self-supervised, generative
Abstract: Deep learning approaches process data in a layer-by-layer way with intermediate (or latent) features. We aim at designing a general solution to optimize the latent manifolds to improve the performance on classification, segmentation, completion and/or reconstruction through probabilistic models. This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called \textbf{nebula anchors}, that guide the latent variables to form clusters during training. To prevent the anchors from clustering among themselves, we employ the variational constraint that enforces the latent features within an anchor to form a Gaussian distribution, resulting in a generative model we refer as Nebula Variational Coding (NVC). Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit. As a consequence, the latent variables of our variational coder form clusters which adapt to the generated semantic of the training data, \textit{e.g.} the categorical labels of each sample. We demonstrate experimentally that it can be used within different architectures designed to solve different problems including text sequence, images, 3D point clouds and volumetric data, validating the advantage of our proposed method.

Title: DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing

Authors: Chenxi Xie, Minghan Li, Shuai Li, Yuhui Wu, Qiaosi Yi, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01430
Pdf URL: https://arxiv.org/pdf/2506.01430
Copy Paste: [[2506.01430]] DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing(https://arxiv.org/abs/2506.01430)
Keywords: diffusion
Abstract: Leveraging the powerful generation capability of large-scale pretrained text-to-image models, training-free methods have demonstrated impressive image editing results. Conventional diffusion-based methods, as well as recent rectified flow (RF)-based methods, typically reverse synthesis trajectories by gradually adding noise to clean images, during which the noisy latent at the current timestep is used to approximate that at the next timesteps, introducing accumulated drift and degrading reconstruction accuracy. Considering the fact that in RF the noisy latent is estimated through direct interpolation between Gaussian noises and clean images at each timestep, we propose Direct Noise Alignment (DNA), which directly refines the desired Gaussian noise in the noise domain, significantly reducing the error accumulation in previous methods. Specifically, DNA estimates the velocity field of the interpolated noised latent at each timestep and adjusts the Gaussian noise by computing the difference between the predicted and expected velocity field. We validate the effectiveness of DNA and reveal its relationship with existing RF-based inversion methods. Additionally, we introduce a Mobile Velocity Guidance (MVG) to control the target prompt-guided generation process, balancing image background preservation and target object editability. DNA and MVG collectively constitute our proposed method, namely DNAEdit. Finally, we introduce DNA-Bench, a long-prompt benchmark, to evaluate the performance of advanced image editing models. Experimental results demonstrate that our DNAEdit achieves superior performance to state-of-the-art text-guided editing methods. Codes and benchmark will be available at \href{ this https URL}{this https URL}.

Title: Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data

Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Satoshi Asakawa
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01439
Pdf URL: https://arxiv.org/pdf/2506.01439
Copy Paste: [[2506.01439]] Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data(https://arxiv.org/abs/2506.01439)
Keywords: self-supervised
Abstract: This paper reports on the development of a large-scale speech recognition model, Whale. Similar to models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale's architecture integrates w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data, of not only public corpora but also in-house data, thereby enhancing the model's robustness to different speaking styles and acoustic conditions. Through evaluations on multiple benchmarks, Whale achieved comparable performance to existing models. In particular, it achieves a word error rate of 2.4% on the Librispeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.

Title: ShaTS: A Shapley-based Explainability Method for Time Series Artificial Intelligence Models applied to Anomaly Detection in Industrial Internet of Things

Authors: Manuel Franco de la Peña (1), Ángel Luis Perales Gómez (1), Lorenzo Fernández Maimó (1) ((1) Departamento de Ingeniería y Tecnología de Computadores, University of Murcia, Spain, Murcia)
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01450
Pdf URL: https://arxiv.org/pdf/2506.01450
Copy Paste: [[2506.01450]] ShaTS: A Shapley-based Explainability Method for Time Series Artificial Intelligence Models applied to Anomaly Detection in Industrial Internet of Things(https://arxiv.org/abs/2506.01450)
Keywords: anomaly
Abstract: Industrial Internet of Things environments increasingly rely on advanced Anomaly Detection and explanation techniques to rapidly detect and mitigate cyberincidents, thereby ensuring operational safety. The sequential nature of data collected from these environments has enabled improvements in Anomaly Detection using Machine Learning and Deep Learning models by processing time windows rather than treating the data as tabular. However, conventional explanation methods often neglect this temporal structure, leading to imprecise or less actionable explanations. This work presents ShaTS (Shapley values for Time Series models), which is a model-agnostic explainable Artificial Intelligence method designed to enhance the precision of Shapley value explanations for time series models. ShaTS addresses the shortcomings of traditional approaches by incorporating an a priori feature grouping strategy that preserves temporal dependencies and produces both coherent and actionable insights. Experiments conducted on the SWaT dataset demonstrate that ShaTS accurately identifies critical time instants, precisely pinpoints the sensors, actuators, and processes affected by anomalies, and outperforms SHAP in terms of both explainability and resource efficiency, fulfilling the real-time requirements of industrial environments.

Title: DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion

Authors: Geunmin Hwang, Hyun-kyu Ko, Younghyun Kim, Seungryong Lee, Eunbyung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01454
Pdf URL: https://arxiv.org/pdf/2506.01454
Copy Paste: [[2506.01454]] DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion(https://arxiv.org/abs/2506.01454)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have revolutionized video generation, enabling the creation of high-quality, temporally consistent videos. However, generating high frame-rate (FPS) videos remains a significant challenge due to issues such as flickering and degradation in long sequences, particularly in fast-motion scenarios. Existing methods often suffer from computational inefficiencies and limitations in maintaining video quality over extended frames. In this paper, we present a novel, training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising, to achieve smooth, consistent video outputs without the need for additional fine-tuning. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity. The proposed method is not only computationally efficient but also adaptable to various video generation tasks, making it ideal for applications such as virtual reality, video games, and high-quality content creation.

Title: Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark

Authors: Shuyu Yang, Yilun Wang, Yaxiong Wang, Li Zhu, Zhedong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01466
Pdf URL: https://arxiv.org/pdf/2506.01466
Copy Paste: [[2506.01466]] Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark(https://arxiv.org/abs/2506.01466)
Keywords: generative, anomaly
Abstract: Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address the aforementioned issues in one go, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via the off-the-shelf LLM (Large Language Model) covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We adopt these texts to guide the video generative model to produce diverse and high-quality videos. Finally, our SVTA involves 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely-used video-text retrieval baselines to comprehensively test our SVTA, revealing SVTA's challenging nature and its effectiveness in evaluating a robust cross-modal retrieval method. SVTA eliminates privacy risks associated with real-world anomaly collection while maintaining realistic scenarios. The dataset demo is available at: [this https URL].

Title: Feature-aware Hypergraph Generation via Next-Scale Prediction

Authors: Dorian Gailhard, Enzo Tartaglione, Lirida Naviner, Jhony H. Giraldo
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2506.01467
Pdf URL: https://arxiv.org/pdf/2506.01467
Copy Paste: [[2506.01467]] Feature-aware Hypergraph Generation via Next-Scale Prediction(https://arxiv.org/abs/2506.01467)
Keywords: generative
Abstract: Hypergraphs generalize traditional graphs by allowing hyperedges to connect multiple nodes, making them well-suited for modeling complex structures with higher-order relationships, such as 3D meshes, molecular systems, and electronic circuits. While topology is central to hypergraph structure, many real-world applications also require node and hyperedge features. Existing hypergraph generation methods focus solely on topology, often overlooking feature modeling. In this work, we introduce FAHNES (feature-aware hypergraph generation via next-scale prediction), a hierarchical approach that jointly generates hypergraph topology and features. FAHNES builds a multi-scale representation through node coarsening, then learns to reconstruct finer levels via localized expansion and refinement, guided by a new node budget mechanism that controls cluster splitting. We evaluate FAHNES on synthetic hypergraphs, 3D meshes, and molecular datasets. FAHNES achieves competitive results in reconstructing topology and features, establishing a foundation for future research in featured hypergraph generative modeling.

Title: SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition

Authors: Yiping Li, Ronald de Jong, Sahar Nasirihaghighi, Tim Jaspers, Romy van Jaarsveld, Gino Kuiper, Richard van Hillegersberg, Fons van der Sommen, Jelle Ruurda, Marcel Breeuwer, Yasmina Al Khalil
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01471
Pdf URL: https://arxiv.org/pdf/2506.01471
Copy Paste: [[2506.01471]] SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition(https://arxiv.org/abs/2506.01471)
Keywords: self-supervised
Abstract: Accurate surgical phase recognition is crucial for computer-assisted interventions and surgical video analysis. Annotating long surgical videos is labor-intensive, driving research toward leveraging unlabeled data for strong performance with minimal annotations. Although self-supervised learning has gained popularity by enabling large-scale pretraining followed by fine-tuning on small labeled subsets, semi-supervised approaches remain largely underexplored in the surgical domain. In this work, we propose a video transformer-based model with a robust pseudo-labeling framework. Our method incorporates temporal consistency regularization for unlabeled data and contrastive learning with class prototypes, which leverages both labeled data and pseudo-labels to refine the feature space. Through extensive experiments on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public Cholec80 dataset, we demonstrate the effectiveness of our approach. By incorporating unlabeled data, we achieve state-of-the-art performance on RAMIE with a 4.9% accuracy increase and obtain comparable results to full supervision while using only 1/4 of the labeled data on Cholec80. Our findings establish a strong benchmark for semi-supervised surgical phase recognition, paving the way for future research in this domain.

Title: Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

Authors: Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang
Subjects: cs.LG, cs.AI, cs.MM, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01482
Pdf URL: https://arxiv.org/pdf/2506.01482
Copy Paste: [[2506.01482]] Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?(https://arxiv.org/abs/2506.01482)
Keywords: generative
Abstract: Stage lighting plays an essential role in live music performances, influencing the engaging experience of both musicians and audiences. Given the high costs associated with hiring or training professional lighting engineers, Automatic Stage Lighting Control (ASLC) has gained increasing attention. However, most existing approaches only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this issue, this paper presents an end-to-end solution that directly learns from experienced lighting engineers -- Skip-BART. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method modifies the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame this http URL validate our method through both quantitative analysis and an human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting this http URL, our method yields a p-value of 0.72 in a statistical comparison based on human evaluations with human lighting engineers, suggesting that the proposed approach closely matches human lighting engineering performance. To support further research, we have made our self-collected dataset, code, and trained model parameters available at this https URL .

Title: Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity

Authors: Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01493
Pdf URL: https://arxiv.org/pdf/2506.01493
Copy Paste: [[2506.01493]] Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity(https://arxiv.org/abs/2506.01493)
Keywords: generative
Abstract: Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significantly reduced the training cost compared with the other large-scale GANs, but we found the model loses the diversity of generation for a given prompt by a large margin. To build an efficient and high-fidelity text-to-image GAN without compromise, we propose to use two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt with better sample fidelity. We also propose to use a metric called Per-Prompt Diversity (PPD) to evaluate the diversity of text-to-image models quantitatively. SCAD achieved a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude less training cost.

Title: Continual Speech Learning with Fused Speech Features

Authors: Guitao Wang, Jinming Zhao, Hao Yang, Guilin Qi, Tongtong Wu, Gholamreza Haffari
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01496
Pdf URL: https://arxiv.org/pdf/2506.01496
Copy Paste: [[2506.01496]] Continual Speech Learning with Fused Speech Features(https://arxiv.org/abs/2506.01496)
Keywords: generative
Abstract: Rapid growth in speech data demands adaptive models, as traditional static methods fail to keep pace with dynamic and diverse speech information. We introduce continuous speech learning, a new set-up targeting at bridging the adaptation gap in current speech models. We use the encoder-decoder Whisper model to standardize speech tasks into a generative format. We integrate a learnable gated-fusion layer on the top of the encoder to dynamically select task-specific features for downstream tasks. Our approach improves accuracy significantly over traditional methods in six speech processing tasks, demonstrating gains in adapting to new speech tasks without full retraining.

Title: Analyzing the Importance of Blank for CTC-Based Knowledge Distillation

Authors: Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.01503
Pdf URL: https://arxiv.org/pdf/2506.01503
Copy Paste: [[2506.01503]] Analyzing the Importance of Blank for CTC-Based Knowledge Distillation(https://arxiv.org/abs/2506.01503)
Keywords: foundation model
Abstract: With the rise of large pre-trained foundation models for automatic speech recognition new challenges appear. While the performance of these models is good, runtime and cost of inference increases. One approach to make use of their strength while retaining efficiency is to distill their knowledge to smaller models during training. In this work, we explore different CTC-based distillation variants, focusing on blank token handling. We show that common approaches like blank elimination do not always work off the shelf. We explore new blank selection patterns as a potential sweet spot between standard knowledge distillation and blank elimination mechanisms. Through the introduction of a symmetric selection method, we are able to remove the CTC loss during knowledge distillation with minimal to no performance degradation. With this, we make the training independent from target labels, potentially allowing for distillation on untranscribed audio data.

Title: Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment

Authors: Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01511
Pdf URL: https://arxiv.org/pdf/2506.01511
Copy Paste: [[2506.01511]] Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment(https://arxiv.org/abs/2506.01511)
Keywords: diffusion
Abstract: Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at this https URL.

Title: FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

Authors: Bobo Li, Yuheng Wang, Hao Fei, Juncheng Li, Wei Ji, Mong-Li Lee, Wynne Hsu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01520
Pdf URL: https://arxiv.org/pdf/2506.01520
Copy Paste: [[2506.01520]] FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents(https://arxiv.org/abs/2506.01520)
Keywords: generative
Abstract: Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.

Title: Beyond Diagonal Covariance: Flexible Posterior VAEs via Free-Form Injective Flows

Authors: Peter Sorrenson, Lukas Lührs, Hans Olischläger, Ullrich Köthe
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01522
Pdf URL: https://arxiv.org/pdf/2506.01522
Copy Paste: [[2506.01522]] Beyond Diagonal Covariance: Flexible Posterior VAEs via Free-Form Injective Flows(https://arxiv.org/abs/2506.01522)
Keywords: generative
Abstract: Variational Autoencoders (VAEs) are powerful generative models widely used for learning interpretable latent spaces, quantifying uncertainty, and compressing data for downstream generative tasks. VAEs typically rely on diagonal Gaussian posteriors due to computational constraints. Using arguments grounded in differential geometry, we demonstrate inherent limitations in the representational capacity of diagonal covariance VAEs, as illustrated by explicit low-dimensional examples. In response, we show that a regularized variant of the recently introduced Free-form Injective Flow (FIF) can be interpreted as a VAE featuring a highly flexible, implicitly defined posterior. Crucially, this regularization yields a posterior equivalent to a full Gaussian covariance distribution, yet maintains computational costs comparable to standard diagonal covariance VAEs. Experiments on image datasets validate our approach, demonstrating that incorporating full covariance substantially improves model likelihood.

Title: A Diffusion-Based Method for Learning the Multi-Outcome Distribution of Medical Treatments

Authors: Yuchen Ma, Jonas Schweisthal, Hengrui Zhang, Stefan Feuerriegel
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01533
Pdf URL: https://arxiv.org/pdf/2506.01533
Copy Paste: [[2506.01533]] A Diffusion-Based Method for Learning the Multi-Outcome Distribution of Medical Treatments(https://arxiv.org/abs/2506.01533)
Keywords: diffusion
Abstract: In medicine, treatments often influence multiple, interdependent outcomes, such as primary endpoints, complications, adverse events, or other secondary endpoints. Hence, to make optimal treatment decisions, clinicians are interested in learning the distribution of multi-dimensional treatment outcomes. However, the vast majority of machine learning methods for predicting treatment effects focus on single-outcome settings, despite the fact that medical data often include multiple, interdependent outcomes. To address this limitation, we propose a novel diffusion-based method called DIME to learn the joint distribution of multiple outcomes of medical treatments. We addresses three challenges relevant in medical practice: (i)it is tailored to learn the joint interventional distribution of multiple medical outcomes, which enables reliable decision-making with uncertainty quantification rather than relying solely on point estimates; (ii)it explicitly captures the dependence structure between outcomes; (iii)it can handle outcomes of mixed type, including binary, categorical, and continuous variables. In DIME, we take into account the fundamental problem of causal inference through causal masking. For training, our method decomposes the joint distribution into a series of conditional distributions with a customized conditional masking to account for the dependence structure across outcomes. For inference, our method auto-regressively generates predictions. This allows our method to move beyond point estimates of causal quantities and thus learn the joint interventional distribution. To the best of our knowledge, DIME is the first neural method tailored to learn the joint, multi-outcome distribution of medical treatments. Across various experiments, we demonstrate that our method effectively learns the joint distribution and captures shared information among multiple outcomes.

Title: G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models

Authors: Tianjiao Zhang, Fei Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01539
Pdf URL: https://arxiv.org/pdf/2506.01539
Copy Paste: [[2506.01539]] G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models(https://arxiv.org/abs/2506.01539)
Keywords: diffusion, generative
Abstract: This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Unlike traditional approaches that rely heavily on discriminative-model-based paradigms or dense visual representations derived from internal attention mechanisms, our method focuses on the intrinsic generative priors in Stable Diffusion~(SD). Specifically, we exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement by establishing a semantic correspondence alignment and updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design, underscoring the potential of leveraging generation discrepancies to model dense representations and encouraging further exploration of generative approaches for solving discriminative tasks.

Title: Adaptive Destruction Processes for Diffusion Samplers

Authors: Timofei Gritsaev, Nikita Morozov, Kirill Tamogashev, Daniil Tiapkin, Sergey Samsonov, Alexey Naumov, Dmitry Vetrov, Nikolay Malkin
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01541
Pdf URL: https://arxiv.org/pdf/2506.01541
Copy Paste: [[2506.01541]] Adaptive Destruction Processes for Diffusion Samplers(https://arxiv.org/abs/2506.01541)
Keywords: diffusion, generative
Abstract: This paper explores the challenges and benefits of a trainable destruction process in diffusion samplers -- diffusion-based generative models trained to sample an unnormalised density without access to data samples. Contrary to the majority of work that views diffusion samplers as approximations to an underlying continuous-time model, we view diffusion models as discrete-time policies trained to produce samples in very few generation steps. We propose to trade some of the elegance of the underlying theory for flexibility in the definition of the generative and destruction policies. In particular, we decouple the generation and destruction variances, enabling both transition kernels to be learned as unconstrained Gaussian densities. We show that, when the number of steps is limited, training both generation and destruction processes results in faster convergence and improved sampling quality on various benchmarks. Through a robust ablation study, we investigate the design choices necessary to facilitate stable training. Finally, we show the scalability of our approach through experiments on GAN latent space sampling for conditional image generation.

Title: LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model

Authors: Xiaodong Wang, Zhirong Wu, Peixi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01546
Pdf URL: https://arxiv.org/pdf/2506.01546
Copy Paste: [[2506.01546]] LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model(https://arxiv.org/abs/2506.01546)
Keywords: diffusion, self-supervised
Abstract: Driving world models are used to simulate futures by video generation based on the condition of the current state and actions. However, current models often suffer serious error accumulations when predicting the long-term future, which limits the practical application. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are always trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and reasonable long videos due to the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving scenes, we propose a simple distillation method where fine-grained video flows are self-supervised signals for coarse-grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term and temporally coherent videos. In the public benchmark NuScenes, compared with the state-of-the-art front-view model, our model improves FVD by $27\%$ and reduces inference time by $85\%$ for the video task of generating 110+ frames. More videos (including 90s duration) are available at this https URL.

Title: HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception

Authors: Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01579
Pdf URL: https://arxiv.org/pdf/2506.01579
Copy Paste: [[2506.01579]] HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception(https://arxiv.org/abs/2506.01579)
Keywords: diffusion
Abstract: Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: this http URL

Title: PMNO: A novel physics guided multi-step neural operator predictor for partial differential equations

Authors: Jin Song, Kenji Kawaguchi, Zhenya Yan
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.01598
Pdf URL: https://arxiv.org/pdf/2506.01598
Copy Paste: [[2506.01598]] PMNO: A novel physics guided multi-step neural operator predictor for partial differential equations(https://arxiv.org/abs/2506.01598)
Keywords: diffusion
Abstract: Neural operators, which aim to approximate mappings between infinite-dimensional function spaces, have been widely applied in the simulation and prediction of physical systems. However, the limited representational capacity of network architectures, combined with their heavy reliance on large-scale data, often hinder effective training and result in poor extrapolation performance. In this paper, inspired by traditional numerical methods, we propose a novel physics guided multi-step neural operator (PMNO) architecture to address these challenges in long-horizon prediction of complex physical systems. Distinct from general operator learning methods, the PMNO framework replaces the single-step input with multi-step historical data in the forward pass and introduces an implicit time-stepping scheme based on the Backward Differentiation Formula (BDF) during backpropagation. This design not only strengthens the model's extrapolation capacity but also facilitates more efficient and stable training with fewer data samples, especially for long-term predictions. Meanwhile, a causal training strategy is employed to circumvent the need for multi-stage training and to ensure efficient end-to-end optimization. The neural operator architecture possesses resolution-invariant properties, enabling the trained model to perform fast extrapolation on arbitrary spatial resolutions. We demonstrate the superior predictive performance of PMNO predictor across a diverse range of physical systems, including 2D linear system, modeling over irregular domain, complex-valued wave dynamics, and reaction-diffusion processes. Depending on the specific problem setting, various neural operator architectures, including FNO, DeepONet, and their variants, can be seamlessly integrated into the PMNO framework.

Title: Minimal Impact ControlNet: Advancing Multi-ControlNet Integration

Authors: Shikun Sun, Min Zhou, Zixuan Wang, Xubin Li, Tiezheng Ge, Zijie Ye, Xiaoyu Qin, Junliang Xing, Bo Zheng, Jia Jia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.01672
Pdf URL: https://arxiv.org/pdf/2506.01672
Copy Paste: [[2506.01672]] Minimal Impact ControlNet: Advancing Multi-ControlNet Integration(https://arxiv.org/abs/2506.01672)
Keywords: diffusion
Abstract: With the advancement of diffusion models, there is a growing demand for high-quality, controllable image generation, particularly through methods that utilize one or multiple control signals based on ControlNet. However, in current ControlNet training, each control is designed to influence all areas of an image, which can lead to conflicts when different control signals are expected to manage different parts of the image in practical applications. This issue is especially pronounced with edge-type control conditions, where regions lacking boundary information often represent low-frequency signals, referred to as silent control signals. When combining multiple ControlNets, these silent control signals can suppress the generation of textures in related areas, resulting in suboptimal outcomes. To address this problem, we propose Minimal Impact ControlNet. Our approach mitigates conflicts through three key strategies: constructing a balanced dataset, combining and injecting feature signals in a balanced manner, and addressing the asymmetry in the score function's Jacobian matrix induced by ControlNet. These improvements enhance the compatibility of control signals, allowing for freer and more harmonious generation in areas with silent control signals.

Title: mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection

Authors: Dominik Macko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01702
Pdf URL: https://arxiv.org/pdf/2506.01702
Copy Paste: [[2506.01702]] mdok of KInIT: Robustly Fine-tuned LLM for Binary and Multiclass AI-Generated Text Detection(https://arxiv.org/abs/2506.01702)
Keywords: generative
Abstract: The large language models (LLMs) are able to generate high-quality texts in multiple languages. Such texts are often not recognizable by humans as generated, and therefore present a potential of LLMs for misuse (e.g., plagiarism, spams, disinformation spreading). An automated detection is able to assist humans to indicate the machine-generated texts; however, its robustness to out-of-distribution data is still challenging. This notebook describes our mdok approach in robust detection, based on fine-tuning smaller LLMs for text classification. It is applied to both subtasks of Voight-Kampff Generative AI Detection 2025, providing remarkable performance in binary detection as well as in multiclass (1st rank) classification of various cases of human-AI collaboration.

Title: Principled data augmentation for learning to solve quadratic programming problems

Authors: Chendi Qian, Christopher Morris
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01728
Pdf URL: https://arxiv.org/pdf/2506.01728
Copy Paste: [[2506.01728]] Principled data augmentation for learning to solve quadratic programming problems(https://arxiv.org/abs/2506.01728)
Keywords: self-supervised
Abstract: Linear and quadratic optimization are crucial in numerous real-world applications, from training machine learning models to integer-linear optimization. Recently, learning-to-optimize methods (L2O) for linear (LPs) or quadratic programs (QPs) using message-passing graph neural networks (MPNNs) have gained traction, promising lightweight, data-driven proxies for solving such optimization problems. For example, they replace the costly computation of strong branching scores in branch-and-bound solvers, requiring solving many such optimization problems. However, robust L2O MPNNs remain challenging in data-scarce settings, especially when addressing complex optimization problems such as QPs. This work introduces a principled approach to data augmentation tailored for QPs via MPNNs. Our method leverages theoretically justified data augmentation techniques to generate diverse yet optimality-preserving instances. Furthermore, we integrate these augmentations into a self-supervised learning framework based on contrastive learning, thereby pretraining MPNNs for enhanced performance on L2O tasks. Extensive experiments demonstrate that our approach improves generalization in supervised scenarios and facilitates effective transfer learning to related optimization problems.

Title: Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

Authors: Tao Yang, Ruibin Li, Yangming Shi, Yuqi Zhang, Qide Dong, Haoran Cheng, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01758
Pdf URL: https://arxiv.org/pdf/2506.01758
Copy Paste: [[2506.01758]] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks(https://arxiv.org/abs/2506.01758)
Keywords: diffusion, foundation model
Abstract: Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially, text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), image and video manipulation tasks, etc. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or several tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source codes are available at this https URL.

Title: Federated Gaussian Mixture Models

Authors: Sophia Zhang Pettersson, Kuo-Yun Liang, Juan Carlos Andresen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.01780
Pdf URL: https://arxiv.org/pdf/2506.01780
Copy Paste: [[2506.01780]] Federated Gaussian Mixture Models(https://arxiv.org/abs/2506.01780)
Keywords: generative, anomaly
Abstract: This paper introduces FedGenGMM, a novel one-shot federated learning approach for Gaussian Mixture Models (GMM) tailored for unsupervised learning scenarios. In federated learning (FL), where multiple decentralized clients collaboratively train models without sharing raw data, significant challenges include statistical heterogeneity, high communication costs, and privacy concerns. FedGenGMM addresses these issues by allowing local GMM models, trained independently on client devices, to be aggregated through a single communication round. This approach leverages the generative property of GMMs, enabling the creation of a synthetic dataset on the server side to train a global model efficiently. Evaluation across diverse datasets covering image, tabular, and time series data demonstrates that FedGenGMM consistently achieves performance comparable to non-federated and iterative federated methods, even under significant data heterogeneity. Additionally, FedGenGMM significantly reduces communication overhead, maintains robust performance in anomaly detection tasks, and offers flexibility in local model complexities, making it particularly suitable for edge computing environments.

Title: Human-Centric Evaluation for Foundation Models

Authors: Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.01793
Pdf URL: https://arxiv.org/pdf/2506.01793
Copy Paste: [[2506.01793]] Human-Centric Evaluation for Foundation Models(https://arxiv.org/abs/2506.01793)
Keywords: foundation model
Abstract: Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is this https URL.

Title: WorldExplorer: Towards Generating Fully Navigable 3D Scenes

Authors: Manuel-Andreas Schneider, Lukas Höllein, Matthias Nießner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01799
Pdf URL: https://arxiv.org/pdf/2506.01799
Copy Paste: [[2506.01799]] WorldExplorer: Towards Generating Fully Navigable 3D Scenes(https://arxiv.org/abs/2506.01799)
Keywords: diffusion
Abstract: Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside of a scene, i.e., produce streched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360 degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories, that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.

Title: OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation

Authors: Sen Liang, Zhentao Yu, Zhengguang Zhou, Teng Hu, Hongmei Wang, Yi Chen, Qin Lin, Yuan Zhou, Xin Li, Qinglin Lu, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01801
Pdf URL: https://arxiv.org/pdf/2506.01801
Copy Paste: [[2506.01801]] OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation(https://arxiv.org/abs/2506.01801)
Keywords: diffusion
Abstract: The emergence of Diffusion Transformers (DiT) has brought significant advancements to video generation, especially in text-to-video and image-to-video tasks. Although video generation is widely applied in various fields, most existing models are limited to single scenarios and cannot perform diverse video generation and editing through dynamic content manipulation. We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations, including: object movement, object addition, mask-guided video edit, try-on, inpainting, outpainting, human animation, and controllable character video synthesis. We explore a unified dynamic content manipulation injection module, which effectively integrates the requirements of the above tasks. In addition, we design a visual-text instruction module based on LLaVA, enabling the model to effectively understand the correspondence between visual content and instructions. Furthermore, we build a comprehensive multi-task data processing system. Since there is data overlap among various tasks, this system can efficiently provide data augmentation. Using this system, we construct a multi-type, multi-scenario OmniV2V dataset and its corresponding OmniV2V-Test benchmark. Extensive experiments show that OmniV2V works as well as, and sometimes better than, the best existing open-source and commercial models for many video generation and editing tasks.

Title: SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model

Authors: Zhao Yang, Jiwei Zhu, Bing Su
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2506.01833
Pdf URL: https://arxiv.org/pdf/2506.01833
Copy Paste: [[2506.01833]] SPACE: Your Genomic Profile Predictor is a Powerful DNA Foundation Model(https://arxiv.org/abs/2506.01833)
Keywords: foundation model
Abstract: Inspired by the success of unsupervised pre-training paradigms, researchers have applied these approaches to DNA pre-training. However, we argue that these approaches alone yield suboptimal results because pure DNA sequences lack sufficient information, since their functions are regulated by genomic profiles like chromatin accessibility. Here, we demonstrate that supervised training for genomic profile prediction serves as a more effective alternative to pure sequence pre-training. Furthermore, considering the multi-species and multi-profile nature of genomic profile prediction, we introduce our $\textbf{S}$pecies-$\textbf{P}$rofile $\textbf{A}$daptive $\textbf{C}$ollaborative $\textbf{E}$xperts (SPACE) that leverages Mixture of Experts (MoE) to better capture the relationships between DNA sequences across different species and genomic profiles, thereby learning more effective DNA representations. Through extensive experiments across various tasks, our model achieves state-of-the-art performance, establishing that DNA models trained with supervised genomic profiles serve as powerful DNA representation learners. The code is available at this https URL.

Title: Learning to Explore: An In-Context Learning Approach for Pure Exploration

Authors: Alessio Russo, Ryan Welch, Aldo Pacchiano
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01876
Pdf URL: https://arxiv.org/pdf/2506.01876
Copy Paste: [[2506.01876]] Learning to Explore: An In-Context Learning Approach for Pure Exploration(https://arxiv.org/abs/2506.01876)
Keywords: in-context
Abstract: In this work, we study the active sequential hypothesis testing problem, also known as pure exploration, where the goal is to actively control a data collection process to efficiently identify the correct hypothesis underlying a decision problem. While relevant across multiple domains, devising adaptive exploration strategies remains challenging, particularly due to difficulties in encoding appropriate inductive biases. Existing Reinforcement Learning (RL)-based methods often underperform when relevant information structures are inadequately represented, whereas more complex methods, like Best Arm Identification (BAI) techniques, may be difficult to devise and typically rely on explicit modeling assumptions. To address these limitations, we introduce In-Context Pure Exploration (ICPE), an in-context learning approach that uses Transformers to learn exploration strategies directly from experience. ICPE combines supervised learning and reinforcement learning to identify and exploit latent structure across related tasks, without requiring prior assumptions. Numerical results across diverse synthetic and semi-synthetic benchmarks highlight ICPE's capability to achieve robust performance performance in deterministic, stochastic, and structured settings. These results demonstrate ICPE's ability to match optimal instance-dependent algorithms using only deep learning techniques, making it a practical and general approach to data-efficient exploration.

Title: SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data

Authors: Yan Zhou, Bradley Malin, Murat Kantarcioglu
Subjects: cs.LG, cs.CR, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01907
Pdf URL: https://arxiv.org/pdf/2506.01907
Copy Paste: [[2506.01907]] SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data(https://arxiv.org/abs/2506.01907)
Keywords: generative
Abstract: Privacy-preserving data publication, including synthetic data sharing, often experiences trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, however, not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but maintains utility in downstream learning tasks.

Title: Elucidating the representation of images within an unconditional diffusion model denoiser

Authors: Zahra Kadkhodaie, Stéphane Mallat, Eero Simoncelli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01912
Pdf URL: https://arxiv.org/pdf/2506.01912
Copy Paste: [[2506.01912]] Elucidating the representation of images within an unconditional diffusion model denoiser(https://arxiv.org/abs/2506.01912)
Keywords: diffusion, generative
Abstract: Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine a UNet trained for denoising on the ImageNet dataset, to better understand its internal representation and computation of the score. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. We develop a novel algorithm for stochastic reconstruction of images from this representation and demonstrate that it recovers a sample from a set of images defined by a target image representation. We then study the properties of the representation and demonstrate that Euclidean distances in the latent space correspond to distances between conditional densities induced by representations as well as semantic similarities in the image space. Applying a clustering algorithm in the representation space yields groups of images that share both fine details (e.g., specialized features, textured regions, small objects), as well as global structure, but are only partially aligned with object identities. Thus, we show for the first time that a network trained solely on denoising contains a rich and accessible sparse representation of images.

Title: TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation

Authors: Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, Cheng Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01923
Pdf URL: https://arxiv.org/pdf/2506.01923
Copy Paste: [[2506.01923]] TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation(https://arxiv.org/abs/2506.01923)
Keywords: diffusion
Abstract: We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels -- starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Project page: this https URL

Title: Esoteric Language Models

Authors: Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01928
Pdf URL: https://arxiv.org/pdf/2506.01928
Copy Paste: [[2506.01928]] Esoteric Language Models(https://arxiv.org/abs/2506.01928)
Keywords: diffusion
Abstract: Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [this http URL](this http URL)

Title: E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models

Authors: Wenyan Cong, Yiqing Liang, Yancheng Zhang, Ziyi Yang, Yan Wang, Boris Ivanovic, Marco Pavone, Chen Chen, Zhangyang Wang, Zhiwen Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01933
Pdf URL: https://arxiv.org/pdf/2506.01933
Copy Paste: [[2506.01933]] E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models(https://arxiv.org/abs/2506.01933)
Keywords: foundation model
Abstract: Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. Since late 2023, the field has exploded with diverse variants, but systematic evaluation is lacking. In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks: sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, novel view synthesis, and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.

Title: Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Authors: Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01943
Pdf URL: https://arxiv.org/pdf/2506.01943
Copy Paste: [[2506.01943]] Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control(https://arxiv.org/abs/2506.01943)
Keywords: diffusion
Abstract: Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture multi-object interaction crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.

Title: MLLMs Need 3D-Aware Representation Supervision for Scene Understanding

Authors: Xiaohu Huang, Jingjing Wu, Qunyi Xie, Kai Han
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01946
Pdf URL: https://arxiv.org/pdf/2506.01946
Copy Paste: [[2506.01946]] MLLMs Need 3D-Aware Representation Supervision for Scene Understanding(https://arxiv.org/abs/2506.01946)
Keywords: foundation model
Abstract: Recent advances in scene understanding have leveraged multimodal large language models (MLLMs) for 3D reasoning by capitalizing on their strong 2D pretraining. However, the lack of explicit 3D data during MLLM pretraining limits 3D representation capability. In this paper, we investigate the 3D-awareness of MLLMs by evaluating multi-view correspondence and reveal a strong positive correlation between the quality of 3D-aware representation and downstream task performance. Motivated by this, we propose 3DRS, a framework that enhances MLLM 3D representation learning by introducing supervision from pretrained 3D foundation models. Our approach aligns MLLM visual features with rich 3D knowledge distilled from 3D models, effectively improving scene understanding. Extensive experiments across multiple benchmarks and MLLMs -- including visual grounding, captioning, and question answering -- demonstrate consistent performance gains. Project page: this https URL

Title: IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

Authors: Fei Shen, Xiaoyu Du, Yutong Gao, Jian Yu, Yushe Cao, Xing Lei, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01949
Pdf URL: https://arxiv.org/pdf/2506.01949
Copy Paste: [[2506.01949]] IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout(https://arxiv.org/abs/2506.01949)
Keywords: diffusion
Abstract: Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at this https URL.

Title: Dual-Process Image Generation

Authors: Grace Luo, Jonathan Granskog, Aleksander Holynski, Trevor Darrell
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01955
Pdf URL: https://arxiv.org/pdf/2506.01955
Copy Paste: [[2506.01955]] Dual-Process Image Generation(https://arxiv.org/abs/2506.01955)
Keywords: in-context
Abstract: Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: this https URL.