2025-06-10

Title: Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow

Authors: Juexiao Zhou, Zhongyi Han, Mankun Xin, Xingwei He, Guotao Wang, Jiaoyan Song, Gongning Luo, Wenjia He, Xintong Li, Yuetan Chu, Juanwen Chen, Bo Wang, Xia Wu, Wenwen Duan, Zhixia Guo, Liyan Bai, Yilin Pan, Xuefei Bi, Lu Liu, Long Feng, Xiaonan He, Xin Gao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06283
Pdf URL: https://arxiv.org/pdf/2506.06283
Copy Paste: [[2506.06283]] Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow(https://arxiv.org/abs/2506.06283)
Keywords: foundation model
Abstract: Global population aging presents increasing challenges to healthcare systems, with coronary artery disease (CAD) responsible for approximately 17.8 million deaths annually, making it a leading cause of global mortality. As CAD is largely preventable, early detection and proactive management are essential. In this work, we introduce DigitalShadow, an advanced early warning system for CAD, powered by a fine-tuned facial foundation model. The system is pre-trained on 21 million facial images and subsequently fine-tuned into LiveCAD, a specialized CAD risk assessment model trained on 7,004 facial images from 1,751 subjects across four hospitals in China. DigitalShadow functions passively and contactlessly, extracting facial features from live video streams without requiring active user engagement. Integrated with a personalized database, it generates natural language risk reports and individualized health recommendations. With privacy as a core design principle, DigitalShadow supports local deployment to ensure secure handling of user data.

Title: dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Authors: Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.06295
Pdf URL: https://arxiv.org/pdf/2506.06295
Copy Paste: [[2506.06295]] dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching(https://arxiv.org/abs/2506.06295)
Keywords: diffusion
Abstract: Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1 x speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Codes are provided in the supplementary material and will be released publicly on GitHub.

Title: Reward Is Enough: LLMs Are In-Context Reinforcement Learners

Authors: Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Yanjun Qi, Shangtong Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.06303
Pdf URL: https://arxiv.org/pdf/2506.06303
Copy Paste: [[2506.06303]] Reward Is Enough: LLMs Are In-Context Reinforcement Learners(https://arxiv.org/abs/2506.06303)
Keywords: in-context
Abstract: Reinforcement learning (RL) is a human-designed framework for solving sequential decision making problems. In this work, we demonstrate that, surprisingly, RL emerges in LLM's (Large Language Model) inference time -- a phenomenon known as in-context RL (ICRL). Specifically, we propose a novel multi-round prompting framework called ICRL prompting. The goal is to prompt the LLM to complete a task. After the LLM generates a response at the current round, we give numerical scalar feedbacks for the response, called the rewards. At the next round, we prompt the LLM again with the same task and a context consisting of all previous responses and rewards. We observe that the quality of the LLM's response increases as the context grows. In other words, the LLM is able to maximize the scalar reward signal in the inference time, just like an RL algorithm. We evaluate ICRL prompting in three benchmarks (Game of 24, creative writing, and ScienceWorld) and demonstrate significant performance improvements over baseline methods such as Self-Refine and Reflexion. Surprisingly, in some experiments the reward signals are generated by the LLM itself, yet performance improvements are still observed from ICRL prompting, offering a promising paradigm for scaling test-time compute.

Title: From Transformers to Large Language Models: A systematic review of AI applications in the energy sector towards Agentic Digital Twins

Authors: Gabriel Antonesi, Tudor Cioara, Ionut Anghel, Vasilis Michalakopoulos, Elissaios Sarmas, Liana Toderean
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06359
Pdf URL: https://arxiv.org/pdf/2506.06359
Copy Paste: [[2506.06359]] From Transformers to Large Language Models: A systematic review of AI applications in the energy sector towards Agentic Digital Twins(https://arxiv.org/abs/2506.06359)
Keywords: foundation model, generative
Abstract: Artificial intelligence (AI) has long promised to improve energy management in smart grids by enhancing situational awareness and supporting more effective decision-making. While traditional machine learning has demonstrated notable results in forecasting and optimization, it often struggles with generalization, situational awareness, and heterogeneous data integration. Recent advances in foundation models such as Transformer architecture and Large Language Models (LLMs) have demonstrated improved capabilities in modelling complex temporal and contextual relationships, as well as in multi-modal data fusion which is essential for most AI applications in the energy sector. In this review we synthesize the rapid expanding field of AI applications in the energy domain focusing on Transformers and LLMs. We examine the architectural foundations, domain-specific adaptations and practical implementations of transformer models across various forecasting and grid management tasks. We then explore the emerging role of LLMs in the field: adaptation and fine tuning for the energy sector, the type of tasks they are suited for, and the new challenges they introduce. Along the way, we highlight practical implementations, innovations, and areas where the research frontier is rapidly expanding. These recent developments reviewed underscore a broader trend: Generative AI (GenAI) is beginning to augment decision-making not only in high-level planning but also in day-to-day operations, from forecasting and grid balancing to workforce training and asset onboarding. Building on these developments, we introduce the concept of the Agentic Digital Twin, a next-generation model that integrates LLMs to bring autonomy, proactivity, and social interaction into digital twin-based energy management systems.

Title: Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events

Authors: Jingyi Gu, Xuan Zhang, Guiling Wang
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2506.06380
Pdf URL: https://arxiv.org/pdf/2506.06380
Copy Paste: [[2506.06380]] Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events(https://arxiv.org/abs/2506.06380)
Keywords: generative
Abstract: Extreme events, such as market crashes, natural disasters, and pandemics, are rare but catastrophic, often triggering cascading failures across interconnected systems. Accurate prediction and early warning can help minimize losses and improve preparedness. While data-driven methods offer powerful capabilities for extreme event modeling, they require abundant training data, yet extreme event data is inherently scarce, creating a fundamental challenge. Synthetic data generation has emerged as a powerful solution. However, existing surveys focus on general data with privacy preservation emphasis, rather than extreme events' unique performance requirements. This survey provides the first overview of synthetic data generation for extreme events. We systematically review generative modeling techniques and large language models, particularly those enhanced by statistical theory as well as specialized training and sampling mechanisms to capture heavy-tailed distributions. We summarize benchmark datasets and introduce a tailored evaluation framework covering statistical, dependence, visual, and task-oriented metrics. A central contribution is our in-depth analysis of each metric's applicability in extremeness and domain-specific adaptations, providing actionable guidance for model evaluation in extreme settings. We categorize key application domains and identify underexplored areas like behavioral finance, wildfires, earthquakes, windstorms, and infectious outbreaks. Finally, we outline open challenges, providing a structured foundation for advancing synthetic rare-event research.

Title: Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights

Authors: Sooyung Choi, Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Xing Xie, JinYeong Bak
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06404
Pdf URL: https://arxiv.org/pdf/2506.06404
Copy Paste: [[2506.06404]] Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights(https://arxiv.org/abs/2506.06404)
Keywords: in-context
Abstract: The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the "black box" of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.

Title: TimeWak: Temporal Chained-Hashing Watermark for Time Series Data

Authors: Zhi Wen Soi, Chaoyi Zhu, Fouad Abiad, Aditya Shankar, Jeroen M. Galjaard, Huijuan Wang, Lydia Y. Chen
Subjects: cs.CR, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.06407
Pdf URL: https://arxiv.org/pdf/2506.06407
Copy Paste: [[2506.06407]] TimeWak: Temporal Chained-Hashing Watermark for Time Series Data(https://arxiv.org/abs/2506.06407)
Keywords: diffusion
Abstract: Synthetic time series generated by diffusion models enable sharing privacy-sensitive datasets, such as patients' functional MRI records. Key criteria for synthetic data include high data utility and traceability to verify the data source. Recent watermarking methods embed in homogeneous latent spaces, but state-of-the-art time series generators operate in real space, making latent-based watermarking incompatible. This creates the challenge of watermarking directly in real space while handling feature heterogeneity and temporal dependencies. We propose TimeWak, the first watermarking algorithm for multivariate time series diffusion models. To handle temporal dependence and spatial heterogeneity, TimeWak embeds a temporal chained-hashing watermark directly within the real temporal-feature space. The other unique feature is the $\epsilon$-exact inversion, which addresses the non-uniform reconstruction error distribution across features from inverting the diffusion process to detect watermarks. We derive the error bound of inverting multivariate time series and further maintain high watermark detectability. We extensively evaluate TimeWak on its impact on synthetic data quality, watermark detectability, and robustness under various post-editing attacks, against 5 datasets and baselines of different temporal lengths. Our results show that TimeWak achieves improvements of 61.96% in context-FID score, and 8.44% in correlational scores against the state-of-the-art baseline, while remaining consistently detectable.

Title: A Systematic Review of Poisoning Attacks Against Large Language Models

Authors: Neil Fendley, Edward W. Staley, Joshua Carney, William Redman, Marie Chau, Nathan Drenkow
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06518
Pdf URL: https://arxiv.org/pdf/2506.06518
Copy Paste: [[2506.06518]] A Systematic Review of Poisoning Attacks Against Large Language Models(https://arxiv.org/abs/2506.06518)
Keywords: generative
Abstract: With the widespread availability of pretrained Large Language Models (LLMs) and their training datasets, concerns about the security risks associated with their usage has increased significantly. One of these security risks is the threat of LLM poisoning attacks where an attacker modifies some part of the LLM training process to cause the LLM to behave in a malicious way. As an emerging area of research, the current frameworks and terminology for LLM poisoning attacks are derived from earlier classification poisoning literature and are not fully equipped for generative LLM settings. We conduct a systematic review of published LLM poisoning attacks to clarify the security implications and address inconsistencies in terminology across the literature. We propose a comprehensive poisoning threat model applicable to categorize a wide range of LLM poisoning attacks. The poisoning threat model includes four poisoning attack specifications that define the logistics and manipulation strategies of an attack as well as six poisoning metrics used to measure key characteristics of an attack. Under our proposed framework, we organize our discussion of published LLM poisoning literature along four critical dimensions of LLM poisoning attacks: concept poisons, stealthy poisons, persistent poisons, and poisons for unique tasks, to better understand the current landscape of security risks.

Title: Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models

Authors: Yannis Spyridis, Vasileios Argyriou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06569
Pdf URL: https://arxiv.org/pdf/2506.06569
Copy Paste: [[2506.06569]] Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models(https://arxiv.org/abs/2506.06569)
Keywords: foundation model
Abstract: Automated sorting is crucial for improving the efficiency and scalability of textile recycling, but accurately identifying material composition and detecting contaminants from sensor data remains challenging. This paper investigates the use of standard RGB imagery, a cost-effective sensing modality, for key pre-processing tasks in an automated system. We present computer vision components designed for a conveyor belt setup to perform (a) classification of four common textile types and (b) segmentation of non-textile features such as buttons and zippers. For classification, several pre-trained architectures were evaluated using transfer learning and cross-validation, with EfficientNetB0 achieving the best performance on a held-out test set with 81.25\% accuracy. For feature segmentation, a zero-shot approach combining the Grounding DINO open-vocabulary detector with the Segment Anything Model (SAM) was employed, demonstrating excellent performance with a mIoU of 0.90 for the generated masks against ground truth. This study demonstrates the feasibility of using RGB images coupled with modern deep learning techniques, including transfer learning for classification and foundation models for zero-shot segmentation, to enable essential analysis steps for automated textile recycling pipelines.

Title: A Deep Learning Approach for Facial Attribute Manipulation and Reconstruction in Surveillance and Reconnaissance

Authors: Anees Nashath Shaik, Barbara Villarini, Vasileios Argyriou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06578
Pdf URL: https://arxiv.org/pdf/2506.06578
Copy Paste: [[2506.06578]] A Deep Learning Approach for Facial Attribute Manipulation and Reconstruction in Surveillance and Reconnaissance(https://arxiv.org/abs/2506.06578)
Keywords: generative
Abstract: Surveillance systems play a critical role in security and reconnaissance, but their performance is often compromised by low-quality images and videos, leading to reduced accuracy in face recognition. Additionally, existing AI-based facial analysis models suffer from biases related to skin tone variations and partially occluded faces, further limiting their effectiveness in diverse real-world scenarios. These challenges are the results of data limitations and imbalances, where available training datasets lack sufficient diversity, resulting in unfair and unreliable facial recognition performance. To address these issues, we propose a data-driven platform that enhances surveillance capabilities by generating synthetic training data tailored to compensate for dataset biases. Our approach leverages deep learning-based facial attribute manipulation and reconstruction using autoencoders and Generative Adversarial Networks (GANs) to create diverse and high-quality facial datasets. Additionally, our system integrates an image enhancement module, improving the clarity of low-resolution or occluded faces in surveillance footage. We evaluate our approach using the CelebA dataset, demonstrating that the proposed platform enhances both training data diversity and model fairness. This work contributes to reducing bias in AI-based facial analysis and improving surveillance accuracy in challenging environments, leading to fairer and more reliable security applications.

Title: EV-LayerSegNet: Self-supervised Motion Segmentation using Event Cameras

Authors: Youssef Farah, Federico Paredes-Vallés, Guido De Croon, Muhammad Ahmed Humais, Hussain Sajwani, Yahya Zweiri
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06596
Pdf URL: https://arxiv.org/pdf/2506.06596
Copy Paste: [[2506.06596]] EV-LayerSegNet: Self-supervised Motion Segmentation using Event Cameras(https://arxiv.org/abs/2506.06596)
Keywords: self-supervised
Abstract: Event cameras are novel bio-inspired sensors that capture motion dynamics with much higher temporal resolution than traditional cameras, since pixels react asynchronously to brightness changes. They are therefore better suited for tasks involving motion such as motion segmentation. However, training event-based networks still represents a difficult challenge, as obtaining ground truth is very expensive, error-prone and limited in frequency. In this article, we introduce EV-LayerSegNet, a self-supervised CNN for event-based motion segmentation. Inspired by a layered representation of the scene dynamics, we show that it is possible to learn affine optical flow and segmentation masks separately, and use them to deblur the input events. The deblurring quality is then measured and used as self-supervised learning loss. We train and test the network on a simulated dataset with only affine motion, achieving IoU and detection rate up to 71% and 87% respectively.

Title: TrustConnect: An In-Vehicle Anomaly Detection Framework through Topology-Based Trust Rating

Authors: Ayan Roy, Jeetkumar Patel, Rik Chakraborti, Shudip Datta
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.06635
Pdf URL: https://arxiv.org/pdf/2506.06635
Copy Paste: [[2506.06635]] TrustConnect: An In-Vehicle Anomaly Detection Framework through Topology-Based Trust Rating(https://arxiv.org/abs/2506.06635)
Keywords: anomaly
Abstract: Modern vehicles are equipped with numerous in-vehicle components that interact with the external environment through remote communications and services, such as Bluetooth and vehicle-to-infrastructure communication. These components form a network, exchanging information to ensure the proper functioning of the vehicle. However, the presence of false or fabricated information can disrupt the vehicle's performance. Given that these components are interconnected, erroneous data can propagate throughout the network, potentially affecting other components and leading to catastrophic consequences. To address this issue, we propose TrustConnect, a framework designed to assess the trustworthiness of a vehicle's in-vehicle network by evaluating the trust levels of individual components under various network configurations. The proposed framework leverages the interdependency of all the vehicle's components, along with the correlation of their values and their vulnerability to remote injection based on the outside exposure of each component, to determine the reliability of the in-vehicle network. The effectiveness of the proposed framework has been validated through programming simulations conducted across various scenarios using a random distribution of an in-vehicle network graph generated with the Networkx package in Python.

Title: Non-Intrusive Load Monitoring Based on Image Load Signatures and Continual Learning

Authors: Olimjon Toirov, Wei Yu
Subjects: cs.LG, cs.AI, cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2506.06637
Pdf URL: https://arxiv.org/pdf/2506.06637
Copy Paste: [[2506.06637]] Non-Intrusive Load Monitoring Based on Image Load Signatures and Continual Learning(https://arxiv.org/abs/2506.06637)
Keywords: self-supervised
Abstract: Non-Intrusive Load Monitoring (NILM) identifies the operating status and energy consumption of each electrical device in the circuit by analyzing the electrical signals at the bus, which is of great significance for smart power management. However, the complex and changeable load combinations and application environments lead to the challenges of poor feature robustness and insufficient model generalization of traditional NILM methods. To this end, this paper proposes a new non-intrusive load monitoring method that integrates "image load signature" and continual learning. This method converts multi-dimensional power signals such as current, voltage, and power factor into visual image load feature signatures, and combines deep convolutional neural networks to realize the identification and classification of multiple devices; at the same time, self-supervised pre-training is introduced to improve feature generalization, and continual online learning strategies are used to overcome model forgetting to adapt to the emergence of new loads. This paper conducts a large number of experiments on high-sampling rate load datasets, and compares a variety of existing methods and model variants. The results show that the proposed method has achieved significant improvements in recognition accuracy.

Title: Learning Robust Heterogeneous Graph Representations via Contrastive-Reconstruction under Sparse Semantics

Authors: Di Lin, Wanjing Ren, Xuanbin Li, Rui Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06682
Pdf URL: https://arxiv.org/pdf/2506.06682
Copy Paste: [[2506.06682]] Learning Robust Heterogeneous Graph Representations via Contrastive-Reconstruction under Sparse Semantics(https://arxiv.org/abs/2506.06682)
Keywords: self-supervised
Abstract: In graph self-supervised learning, masked autoencoders (MAE) and contrastive learning (CL) are two prominent paradigms. MAE focuses on reconstructing masked elements, while CL maximizes similarity between augmented graph views. Recent studies highlight their complementarity: MAE excels at local feature capture, and CL at global information extraction. Hybrid frameworks for homogeneous graphs have been proposed, but face challenges in designing shared encoders to meet the semantic requirements of both tasks. In semantically sparse scenarios, CL struggles with view construction, and gradient imbalance between positive and negative samples persists. This paper introduces HetCRF, a novel dual-channel self-supervised learning framework for heterogeneous graphs. HetCRF uses a two-stage aggregation strategy to adapt embedding semantics, making it compatible with both MAE and CL. To address semantic sparsity, it enhances encoder output for view construction instead of relying on raw features, improving efficiency. Two positive sample augmentation strategies are also proposed to balance gradient contributions. Node classification experiments on four real-world heterogeneous graph datasets demonstrate that HetCRF outperforms state-of-the-art baselines. On datasets with missing node features, such as Aminer and Freebase, at a 40% label rate in node classification, HetCRF improves the Macro-F1 score by 2.75% and 2.2% respectively compared to the second-best baseline, validating its effectiveness and superiority.

Title: Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning

Authors: Yuan Yuan, Yukun Liu, Chonghua Han, Jie Feng, Yong Li
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06694
Pdf URL: https://arxiv.org/pdf/2506.06694
Copy Paste: [[2506.06694]] Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning(https://arxiv.org/abs/2506.06694)
Keywords: foundation model, generative
Abstract: Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models.

Title: MarginSel : Max-Margin Demonstration Selection for LLMs

Authors: Rajeev Bhatt Ambati, James Lester, Shashank Srivastava, Snigdha Chaturvedi
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.06699
Pdf URL: https://arxiv.org/pdf/2506.06699
Copy Paste: [[2506.06699]] MarginSel : Max-Margin Demonstration Selection for LLMs(https://arxiv.org/abs/2506.06699)
Keywords: in-context
Abstract: Large Language Models (LLMs) excel at few-shot learning via in-context learning (ICL). However, the effectiveness of ICL is often sensitive to the selection and ordering of demonstration examples. To address this, we present MarginSel: Max-Margin Demonstration Selection for LLMs, a two-step method that selects hard demonstration examples for the ICL prompt, adapting to each test instance. Our approach achieves 2-7% absolute improvement in F1-score across classification tasks, compared to a random selection of examples. We also provide theoretical insights and empirical evidence showing that MarginSel induces max-margin behavior in LLMs by effectively increasing the margin for hard examples, analogous to support vectors, thereby shifting the decision boundary in a beneficial direction.

Title: Dynamic and Parametric Retrieval-Augmented Generation

Authors: Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, Yiqun Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2506.06704
Pdf URL: https://arxiv.org/pdf/2506.06704
Copy Paste: [[2506.06704]] Dynamic and Parametric Retrieval-Augmented Generation(https://arxiv.org/abs/2506.06704)
Keywords: in-context
Abstract: Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.

Title: Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation

Authors: Saiyu Hu, Chunlei He, Jianfeng Zhang, Dexing Kong, Shoujun Huang
Subjects: cs.CV, math.AP
Abstract URL: https://arxiv.org/abs/2506.06712
Pdf URL: https://arxiv.org/pdf/2506.06712
Copy Paste: [[2506.06712]] Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation(https://arxiv.org/abs/2506.06712)
Keywords: diffusion
Abstract: Parabolic mean curvature flow-driven active contour models (PMCF-ACMs) are widely used in image segmentation, which however depend heavily on the selection of initial curve configurations. In this paper, we firstly propose several hyperbolic mean curvature flow-driven ACMs (HMCF-ACMs), which introduce tunable initial velocity fields, enabling adaptive optimization for diverse segmentation scenarios. We shall prove that HMCF-ACMs are indeed normal flows and establish the numerical equivalence between dissipative HMCF formulations and certain wave equations using the level set method with signed distance function. Building on this framework, we furthermore develop hyperbolic dual-mode regularized flow-driven ACMs (HDRF-ACMs), which utilize smooth Heaviside functions for edge-aware force modulation to suppress over-diffusion near weak boundaries. Then, we optimize a weighted fourth-order Runge-Kutta algorithm with nine-point stencil spatial discretization when solving the above-mentioned wave equations. Experiments show that both HMCF-ACMs and HDRF-ACMs could achieve more precise segmentations with superior noise resistance and numerical stability due to task-adaptive configurations of initial velocities and initial contours.

Title: LADSG: Label-Anonymized Distillation and Similar Gradient Substitution for Label Privacy in Vertical Federated Learning

Authors: Zeyu Yan, Yifei Yao, Xuanbing Wen, Juli Zhang, Kai Fan
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.06742
Pdf URL: https://arxiv.org/pdf/2506.06742
Copy Paste: [[2506.06742]] LADSG: Label-Anonymized Distillation and Similar Gradient Substitution for Label Privacy in Vertical Federated Learning(https://arxiv.org/abs/2506.06742)
Keywords: anomaly
Abstract: Vertical federated learning (VFL) has become a key paradigm for collaborative machine learning, enabling multiple parties to train models over distributed feature spaces while preserving data privacy. Despite security protocols that defend against external attacks - such as gradient masking and encryption, which prevent unauthorized access to sensitive data - recent label inference attacks from within the system have emerged. These attacks exploit gradients and semantic embeddings to reconstruct private labels, bypassing traditional defenses. For example, the passive label inference attack can reconstruct tens of thousands of participants' private data using just 40 auxiliary labels, posing a significant security threat. Existing defenses address single leakage pathways, such as gradient leakage or label exposure. As attack strategies evolve, their limitations become clear, especially against hybrid attacks that combine multiple vectors. To address this, we propose Label-Anonymized Defense with Substitution Gradient (LADSG), a unified defense framework that integrates gradient substitution, label anonymization, and anomaly detection. LADSG mitigates both gradient and label leakage while maintaining the scalability and efficiency of VFL. Experiments on six real-world datasets show that LADSG reduces label inference attack success rates by 30-60%, with minimal computational overhead, underscoring the importance of lightweight defenses in securing VFL.

Title: Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models

Authors: Mohammad Ali Rezaei, Helia Hajikazem, Saeed Khanehgir, Mahdi Javanmardi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06802
Pdf URL: https://arxiv.org/pdf/2506.06802
Copy Paste: [[2506.06802]] Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models(https://arxiv.org/abs/2506.06802)
Keywords: diffusion, generative
Abstract: While diffusion models have demonstrated remarkable generative capabilities, existing style transfer techniques often struggle to maintain identity while achieving high-quality stylization. This limitation is particularly acute for images where faces are small or exhibit significant camera-to-face distances, frequently leading to inadequate identity preservation. To address this, we introduce a novel, training-free framework for identity-preserved stylized image synthesis using diffusion models. Key contributions include: (1) the "Mosaic Restored Content Image" technique, significantly enhancing identity retention, especially in complex scenes; and (2) a training-free content consistency loss that enhances the preservation of fine-grained content details by directing more attention to the original image during stylization. Our experiments reveal that the proposed approach substantially surpasses the baseline model in concurrently maintaining high stylistic fidelity and robust identity integrity, particularly under conditions of small facial regions or significant camera-to-face distances, all without necessitating model retraining or fine-tuning.

Title: Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification

Authors: Subhendu Khatuya, Shashwat Naidu, Saptarshi Ghosh, Pawan Goyal, Niloy Ganguly
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06806
Pdf URL: https://arxiv.org/pdf/2506.06806
Copy Paste: [[2506.06806]] Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification(https://arxiv.org/abs/2506.06806)
Keywords: generative
Abstract: The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the pre-defined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.

Title: IMPA-HGAE:Intra-Meta-Path Augmented Heterogeneous Graph Autoencoder

Authors: Di Lin, Wanjing Ren, Xuanbin Li, Rui Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06809
Pdf URL: https://arxiv.org/pdf/2506.06809
Copy Paste: [[2506.06809]] IMPA-HGAE:Intra-Meta-Path Augmented Heterogeneous Graph Autoencoder(https://arxiv.org/abs/2506.06809)
Keywords: self-supervised, generative
Abstract: Self-supervised learning (SSL) methods have been increasingly applied to diverse downstream tasks due to their superior generalization capabilities and low annotation costs. However, most existing heterogeneous graph SSL models convert heterogeneous graphs into homogeneous ones via meta-paths for training, which only leverage information from nodes at both ends of meta-paths while underutilizing the heterogeneous node information along the meta-paths. To address this limitation, this paper proposes a novel framework named IMPA-HGAE to enhance target node embeddings by fully exploiting internal node information along meta-paths. Experimental results validate that IMPA-HGAE achieves superior performance on heterogeneous datasets. Furthermore, this paper introduce innovative masking strategies to strengthen the representational capacity of generative SSL models on heterogeneous graph data. Additionally, this paper discuss the interpretability of the proposed method and potential future directions for generative self-supervised learning in heterogeneous graphs. This work provides insights into leveraging meta-path-guided structural semantics for robust representation learning in complex graph scenarios.

Title: Path Integral Optimiser: Global Optimisation via Neural Schrödinger-Föllmer Diffusion

Authors: Max McGuinness, Eirik Fladmark, Francisco Vargas
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06815
Pdf URL: https://arxiv.org/pdf/2506.06815
Copy Paste: [[2506.06815]] Path Integral Optimiser: Global Optimisation via Neural Schrödinger-Föllmer Diffusion(https://arxiv.org/abs/2506.06815)
Keywords: diffusion
Abstract: We present an early investigation into the use of neural diffusion processes for global optimisation, focusing on Zhang et al.'s Path Integral Sampler. One can use the Boltzmann distribution to formulate optimization as solving a Schrödinger bridge sampling problem, then apply Girsanov's theorem with a simple (single-point) prior to frame it in stochastic control terms, and compute the solution's integral terms via a neural approximation (a Fourier MLP). We provide theoretical bounds for this optimiser, results on toy optimisation tasks, and a summary of the stochastic theory motivating the model. Ultimately, we found the optimiser to display promising per-step performance at optimisation tasks between 2 and 1,247 dimensions, but struggle to explore higher-dimensional spaces when faced with a 15.9k parameter model, indicating a need for work on adaptation in such environments.

Title: Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs

Authors: Wenyu Zhang, Yingxu He, Geyu Lin, Zhuohan Liu, Shuo Sun, Bin Wang, Xunlong Zou, Jeremy H. M. Wong, Qiongqiong Wang, Hardik B. Sailor, Nancy F. Chen, Ai Ti Aw
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.06820
Pdf URL: https://arxiv.org/pdf/2506.06820
Copy Paste: [[2506.06820]] Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs(https://arxiv.org/abs/2506.06820)
Keywords: generative
Abstract: Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.

Title: Hi-LSplat: Hierarchical 3D Language Gaussian Splatting

Authors: Chenlu Zhan, Yufei Zhang, Gaoang Wang, Hongwei Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06822
Pdf URL: https://arxiv.org/pdf/2506.06822
Copy Paste: [[2506.06822]] Hi-LSplat: Hierarchical 3D Language Gaussian Splatting(https://arxiv.org/abs/2506.06822)
Keywords: foundation model
Abstract: Modeling 3D language fields with Gaussian Splatting for open-ended language queries has recently garnered increasing attention. However, recent 3DGS-based models leverage view-dependent 2D foundation models to refine 3D semantics but lack a unified 3D representation, leading to view inconsistencies. Additionally, inherent open-vocabulary challenges cause inconsistencies in object and relational descriptions, impeding hierarchical semantic understanding. In this paper, we propose Hi-LSplat, a view-consistent Hierarchical Language Gaussian Splatting work for 3D open-vocabulary querying. To achieve view-consistent 3D hierarchical semantics, we first lift 2D features to 3D features by constructing a 3D hierarchical semantic tree with layered instance clustering, which addresses the view inconsistency issue caused by 2D semantic features. Besides, we introduce instance-wise and part-wise contrastive losses to capture all-sided hierarchical semantic representations. Notably, we construct two hierarchical semantic datasets to better assess the model's ability to distinguish different semantic levels. Extensive experiments highlight our method's superiority in 3D open-vocabulary segmentation and localization. Its strong performance on hierarchical semantic datasets underscores its ability to capture complex hierarchical semantics within 3D scenes.

Title: Controllable Coupled Image Generation via Diffusion Models

Authors: Chenfei Yuan, Nanshan Jia, Hangqi Li, Peter W. Glynn, Zeyu Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06826
Pdf URL: https://arxiv.org/pdf/2506.06826
Copy Paste: [[2506.06826]] Controllable Coupled Image Generation via Diffusion Models(https://arxiv.org/abs/2506.06826)
Keywords: diffusion
Abstract: We provide an attention-level control method for the task of coupled image generation, where "coupled" means that multiple simultaneously generated images are expected to have the same or very similar backgrounds. While backgrounds coupled, the centered objects in the generated images are still expected to enjoy the flexibility raised from different text prompts. The proposed method disentangles the background and entity components in the model's cross-attention modules, attached with a sequence of time-varying weight control parameters depending on the time step of sampling. We optimize this sequence of weight control parameters with a combined objective that assesses how coupled the backgrounds are as well as text-to-image alignment and overall visual quality. Empirical results demonstrate that our method outperforms existing approaches across these criteria.

Title: EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery

Authors: Guankun Wang, Rui Tang, Mengya Xu, Long Bai, Huxin Gao, Hongliang Ren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06830
Pdf URL: https://arxiv.org/pdf/2506.06830
Copy Paste: [[2506.06830]] EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery(https://arxiv.org/abs/2506.06830)
Keywords: foundation model
Abstract: Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery, offering significant advantages in early disease detection and precise interventions. However, the complexity of surgical scenes, characterized by high variability in different surgical activity scenarios and confused image features between targets and the background, presents challenges for surgical environment understanding. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. To address this limitation, we explore multi-task learning, which utilizes the interrelated features between tasks to enhance overall task performance. In this paper, we propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopy surgery activity recognition and semantic segmentation. Built upon the DINOv2 foundation model, our approach integrates Low-Rank Adaptation to facilitate efficient fine-tuning while incorporating Task Efficient Shared Low-Rank Adapters to mitigate gradient conflicts across diverse tasks. Additionally, we introduce the Spatially-Aware Multi-Scale Attention that enhances feature representation discrimination by enabling cross-spatial learning of global information. In order to evaluate the effectiveness of our framework, we present three novel datasets, MTLESD, MTLEndovis and MTLEndovis-Gen, tailored for endoscopic surgery scenarios with detailed annotations for both activity recognition and semantic segmentation tasks. Extensive experiments demonstrate that EndoARSS achieves remarkable performance across multiple benchmarks, significantly improving both accuracy and robustness in comparison to existing models. These results underscore the potential of EndoARSS to advance AI-driven endoscopic surgical systems, offering valuable insights for enhancing surgical safety and efficiency.

Title: Harnessing Vision-Language Models for Time Series Anomaly Detection

Authors: Zelin He, Sarah Alnegheimish, Matthew Reimherr
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06836
Pdf URL: https://arxiv.org/pdf/2506.06836
Copy Paste: [[2506.06836]] Harnessing Vision-Language Models for Time Series Anomaly Detection(https://arxiv.org/abs/2506.06836)
Keywords: anomaly
Abstract: Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and industrial monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal reasoning capacity that human experts have to identify contextual anomalies. To fill this gap, we explore a solution based on vision language models (VLMs). Recent studies have shown the ability of VLMs for visual reasoning tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution, with (1) ViT4TS, a vision-screening stage built on a relatively lightweight pretrained vision encoder, which leverages 2-D time-series representations to accurately localize candidate anomalies; (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM reasoning capacity to refine the detection upon the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pretrained and from-scratch baselines in most cases, yielding a 24.6 percent improvement in F1-max score over the best baseline. Moreover, VLM4TS also consistently outperforms existing language-model-based TSAD methods and is on average 36 times more efficient in token usage.

Title: Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation

Authors: John Waithaka, Moise Busogi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06852
Pdf URL: https://arxiv.org/pdf/2506.06852
Copy Paste: [[2506.06852]] Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation(https://arxiv.org/abs/2506.06852)
Keywords: self-supervised
Abstract: Semantic segmentation of satellite imagery is crucial for Earth observation applications, but remains constrained by limited labelled training data. While self-supervised pretraining methods like Masked Autoencoders (MAE) have shown promise, they focus on reconstruction rather than localisation-a fundamental aspect of segmentation tasks. We propose adapting LOCA (Location-aware), a position prediction self-supervised learning method, for multimodal satellite imagery semantic segmentation. Our approach addresses the unique challenges of satellite data by extending SatMAE's channel grouping from multispectral to multimodal data, enabling effective handling of multiple modalities, and introducing same-group attention masking to encourage cross-modal interaction during pretraining. The method uses relative patch position prediction, encouraging spatial reasoning for localisation rather than reconstruction. We evaluate our approach on the Sen1Floods11 flood mapping dataset, where it significantly outperforms existing reconstruction-based self-supervised learning methods for satellite imagery. Our results demonstrate that position prediction tasks, when properly adapted for multimodal satellite imagery, learn representations more effective for satellite image semantic segmentation than reconstruction-based approaches.

Title: Face recognition on point cloud with cgan-top for denoising

Authors: Junyu Liu, Jianfeng Ren, Sunhong Liang, Xudong Jiang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06864
Pdf URL: https://arxiv.org/pdf/2506.06864
Copy Paste: [[2506.06864]] Face recognition on point cloud with cgan-top for denoising(https://arxiv.org/abs/2506.06864)
Keywords: generative
Abstract: Face recognition using 3D point clouds is gaining growing interest, while raw point clouds often contain a significant amount of noise due to imperfect sensors. In this paper, an end-to-end 3D face recognition on a noisy point cloud is proposed, which synergistically integrates the denoising and recognition modules. Specifically, a Conditional Generative Adversarial Network on Three Orthogonal Planes (cGAN-TOP) is designed to effectively remove the noise in the point cloud, and recover the underlying features for subsequent recognition. A Linked Dynamic Graph Convolutional Neural Network (LDGCNN) is then adapted to recognize faces from the processed point cloud, which hierarchically links both the local point features and neighboring features of multiple scales. The proposed method is validated on the Bosphorus dataset. It significantly improves the recognition accuracy under all noise settings, with a maximum gain of 14.81%.

Title: Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?

Authors: Paulius Sasnauskas, Yiğit Yalın, Goran Radanović
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2506.06891
Pdf URL: https://arxiv.org/pdf/2506.06891
Copy Paste: [[2506.06891]] Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?(https://arxiv.org/abs/2506.06891)
Keywords: in-context
Abstract: We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings, under a learned attacker. We additionally evaluate AT-DPT on an adaptive attacker, and observe similar results. Furthermore, we extend our evaluation to the MDP setting, confirming that the robustness observed in bandit scenarios generalizes to more complex environments.

Title: LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer

Authors: Ying Shen, Zhiyang Xu, Jiuhai Chen, Shizhe Diao, Jiaxin Zhang, Yuguang Yao, Joy Rimchala, Ismini Lourentzou, Lifu Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06952
Pdf URL: https://arxiv.org/pdf/2506.06952
Copy Paste: [[2506.06952]] LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer(https://arxiv.org/abs/2506.06952)
Keywords: foundation model
Abstract: Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.

Title: Towards Physics-informed Diffusion for Anomaly Detection in Trajectories

Authors: Arun Sharma, Mingzhou Yang, Majid Farhadloo, Subhankar Ghosh, Bharat Jayaprakash, Shashi Shekhar
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.06999
Pdf URL: https://arxiv.org/pdf/2506.06999
Copy Paste: [[2506.06999]] Towards Physics-informed Diffusion for Anomaly Detection in Trajectories(https://arxiv.org/abs/2506.06999)
Keywords: diffusion, generative, anomaly
Abstract: Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectory). The problem is societally important to curb illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI generated in deep fakes generation (e.g., additive noise, fake trajectories) and lack of adequate amount of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, they do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework results in higher prediction accuracy and lower estimation error rate for anomaly detection and trajectory generation methods, respectively. Our implementation is available at this https URL.

Title: ModelForge: Using GenAI to Improve the Development of Security Protocols

Authors: Martin Duclos, Ivan A. Fernandez, Kaneesha Moore, Sudip Mittal, Edward Zieglar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.07010
Pdf URL: https://arxiv.org/pdf/2506.07010
Copy Paste: [[2506.07010]] ModelForge: Using GenAI to Improve the Development of Security Protocols(https://arxiv.org/abs/2506.07010)
Keywords: generative
Abstract: Formal methods can be used for verifying security protocols, but their adoption can be hindered by the complexity of translating natural language protocol specifications into formal representations. In this paper, we introduce ModelForge, a novel tool that automates the translation of protocol specifications for the Cryptographic Protocol Shapes Analyzer (CPSA). By leveraging advances in Natural Language Processing (NLP) and Generative AI (GenAI), ModelForge processes protocol specifications and generates a CPSA protocol definition. This approach reduces the manual effort required, making formal analysis more accessible. We evaluate ModelForge by fine-tuning a large language model (LLM) to generate protocol definitions for CPSA, comparing its performance with other popular LLMs. The results from our evaluation show that ModelForge consistently produces quality outputs, excelling in syntactic accuracy, though some refinement is needed to handle certain protocol details. The contributions of this work include the architecture and proof of concept for a translating tool designed to simplify the adoption of formal methods in the development of security protocols.

Title: UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment

Authors: Wentao Zhao, Yihe Niu, Yanbo Wang, Tianchen Deng, Shenghai Yuan, Zhenli Wang, Rui Guo, Jingchuan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07013
Pdf URL: https://arxiv.org/pdf/2506.07013
Copy Paste: [[2506.07013]] UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment(https://arxiv.org/abs/2506.07013)
Keywords: self-supervised
Abstract: This work presents UNO, a unified monocular visual odometry framework that enables robust and adaptable pose estimation across diverse environments, platforms, and motion patterns. Unlike traditional methods that rely on deployment-specific tuning or predefined motion priors, our approach generalizes effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots, and handheld devices. To this end, we introduce a Mixture-of-Experts strategy for local state estimation, with several specialized decoders that each handle a distinct class of ego-motion patterns. Moreover, we introduce a fully differentiable Gumbel-Softmax module that constructs a robust inter-frame correlation graph, selects the optimal expert decoder, and prunes erroneous estimates. These cues are then fed into a unified back-end that combines pre-trained, scale-independent depth priors with a lightweight bundling adjustment to enforce geometric consistency. We extensively evaluate our method on three major benchmark datasets: KITTI (outdoor/autonomous driving), EuRoC-MAV (indoor/aerial drones), and TUM-RGBD (indoor/handheld), demonstrating state-of-the-art performance.

Title: Mixture Experts with Test-Time Self-Supervised Aggregation for Tabular Imbalanced Regression

Authors: Yung-Chien Wang, Kuang-Da Wang, Wei-Yao Wang, Wen-Chih Peng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07033
Pdf URL: https://arxiv.org/pdf/2506.07033
Copy Paste: [[2506.07033]] Mixture Experts with Test-Time Self-Supervised Aggregation for Tabular Imbalanced Regression(https://arxiv.org/abs/2506.07033)
Keywords: self-supervised
Abstract: Tabular data serve as a fundamental and ubiquitous representation of structured information in numerous real-world applications, e.g., finance and urban planning. In the realm of tabular imbalanced applications, data imbalance has been investigated in classification tasks with insufficient instances in certain labels, causing the model's ineffective generalizability. However, the imbalance issue of tabular regression tasks is underexplored, and yet is critical due to unclear boundaries for continuous labels and simplifying assumptions in existing imbalance regression work, which often rely on known and balanced test distributions. Such assumptions may not hold in practice and can lead to performance degradation. To address these issues, we propose MATI: Mixture Experts with Test-Time Self-Supervised Aggregation for Tabular Imbalance Regression, featuring two key innovations: (i) the Region-Aware Mixture Expert, which adopts a Gaussian Mixture Model to capture the underlying related regions. The statistical information of each Gaussian component is then used to synthesize and train region-specific experts to capture the unique characteristics of their respective regions. (ii) Test-Time Self-Supervised Expert Aggregation, which dynamically adjusts region expert weights based on test data features to reinforce expert adaptation across varying test distributions. We evaluated MATI on four real-world tabular imbalance regression datasets, including house pricing, bike sharing, and age prediction. To reflect realistic deployment scenarios, we adopted three types of test distributions: a balanced distribution with uniform target frequencies, a normal distribution that follows the training data, and an inverse distribution that emphasizes rare target regions. On average across these three test distributions, MATI achieved a 7.1% improvement in MAE compared to existing methods.

Title: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning

Authors: LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07044
Pdf URL: https://arxiv.org/pdf/2506.07044
Copy Paste: [[2506.07044]] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning(https://arxiv.org/abs/2506.07044)
Keywords: foundation model
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...

Title: FairPFN: A Tabular Foundation Model for Causal Fairness

Authors: Jake Robertson, Noah Hollmann, Samuel Müller, Noor Awad, Frank Hutter
Subjects: cs.LG, cs.CY
Abstract URL: https://arxiv.org/abs/2506.07049
Pdf URL: https://arxiv.org/pdf/2506.07049
Copy Paste: [[2506.07049]] FairPFN: A Tabular Foundation Model for Causal Fairness(https://arxiv.org/abs/2506.07049)
Keywords: foundation model
Abstract: Machine learning (ML) systems are utilized in critical sectors, such as healthcare, law enforcement, and finance. However, these systems are often trained on historical data that contains demographic biases, leading to ML decisions that perpetuate or exacerbate existing social inequalities. Causal fairness provides a transparent, human-in-the-loop framework to mitigate algorithmic discrimination, aligning closely with legal doctrines of direct and indirect discrimination. However, current causal fairness frameworks hold a key limitation in that they assume prior knowledge of the correct causal model, restricting their applicability in complex fairness scenarios where causal models are unknown or difficult to identify. To bridge this gap, we propose FairPFN, a tabular foundation model pre-trained on synthetic causal fairness data to identify and mitigate the causal effects of protected attributes in its predictions. FairPFN's key contribution is that it requires no knowledge of the causal model and still demonstrates strong performance in identifying and removing protected causal effects across a diverse set of hand-crafted and real-world scenarios relative to robust baseline methods. FairPFN paves the way for promising future research, making causal fairness more accessible to a wider variety of complex fairness problems.

Title: A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge

Authors: Tarique Dahri, Zulfiqar Ali Memon, Zhenyu Yu, Mohd. Yamani Idna Idris, Sheheryar Khan, Sadiq Ahmad, Maged Shoman, Saddam Aziz, Rizwan Qureshi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07055
Pdf URL: https://arxiv.org/pdf/2506.07055
Copy Paste: [[2506.07055]] A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge(https://arxiv.org/abs/2506.07055)
Keywords: self-supervised
Abstract: We introduce Layered Self-Supervised Knowledge Distillation (LSSKD) framework for training compact deep learning models. Unlike traditional methods that rely on pre-trained teacher networks, our approach appends auxiliary classifiers to intermediate feature maps, generating diverse self-supervised knowledge and enabling one-to-one transfer across different network stages. Our method achieves an average improvement of 4.54\% over the state-of-the-art PS-KD method and a 1.14% gain over SSKD on CIFAR-100, with a 0.32% improvement on ImageNet compared to HASSKD. Experiments on Tiny ImageNet and CIFAR-100 under few-shot learning scenarios also achieve state-of-the-art results. These findings demonstrate the effectiveness of our approach in enhancing model generalization and performance without the need for large over-parameterized teacher networks. Importantly, at the inference stage, all auxiliary classifiers can be removed, yielding no extra computational cost. This makes our model suitable for deploying small language models on affordable low-computing devices. Owing to its lightweight design and adaptability, our framework is particularly suitable for multimodal sensing and cyber-physical environments that require efficient and responsive inference. LSSKD facilitates the development of intelligent agents capable of learning from limited sensory data under weak supervision.

Title: E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models

Authors: Jiaheng Dong, Hong Jia, Soumyajit Chatterjee, Abhirup Ghosh, James Bailey, Ting Dang
Subjects: cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.07078
Pdf URL: https://arxiv.org/pdf/2506.07078
Copy Paste: [[2506.07078]] E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models(https://arxiv.org/abs/2506.07078)
Keywords: foundation model
Abstract: Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability in speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they are predominantly developed for vision tasks, which fundamentally differ from speech task formulations, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS achieves a balance between adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for a forward-pass-based feature alignment, (ii) a multi-scale loss to capture both global (utterance-level) and local distribution shifts (token-level) and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments conducted on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for developing more efficient adaptation approaches for practical speech processing systems in real-world environments.

Title: Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion

Authors: Wenying He, Jieling Huang, Junhua Gu, Ji Zhang, Yude Bai
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07099
Pdf URL: https://arxiv.org/pdf/2506.07099
Copy Paste: [[2506.07099]] Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion(https://arxiv.org/abs/2506.07099)
Keywords: diffusion
Abstract: Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during the data imputation process, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, which enables more robust imputation. The extensive experiments reveal that CoFILL's noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in imputation accuracy. The source code is publicly available at this https URL.

Title: Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion

Authors: Huaize Liu, Wenzhang Sun, Qiyuan Zhang, Donglin Di, Biao Gong, Hao Li, Chen Wei, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07136
Pdf URL: https://arxiv.org/pdf/2506.07136
Copy Paste: [[2506.07136]] Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion(https://arxiv.org/abs/2506.07136)
Keywords: diffusion, self-supervised, generative
Abstract: Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encode coarse-to-fine motion representations of video dynamics and formulate the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to reduce redundancy significantly. A conditional diffusion decoder then reconstructs videos by combining hierarchical global and detailed motions, enabling high-fidelity video reconstructions. Extensive experiments demonstrate that Hi-VAE achieves a high compression factor of 1428$\times$, almost 30$\times$ higher than baseline methods (e.g., Cosmos-VAE at 48$\times$), validating the efficiency of our approach. Meanwhile, Hi-VAE maintains high reconstruction quality at such high compression rates and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.

Title: Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models

Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Jaehong Yoon, Soo Ye Kim, Zhe Lin, Sung Ju Hwang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07177
Pdf URL: https://arxiv.org/pdf/2506.07177
Copy Paste: [[2506.07177]] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models(https://arxiv.org/abs/2506.07177)
Keywords: diffusion
Abstract: Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, compatible with any video models. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.

Title: GGBall: Graph Generative Model on Poincaré Ball

Authors: Tianci Bu, Chuanrui Wang, Hao Ma, Haoren Zheng, Xin Lu, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07198
Pdf URL: https://arxiv.org/pdf/2506.07198
Copy Paste: [[2506.07198]] GGBall: Graph Generative Model on Poincaré Ball(https://arxiv.org/abs/2506.07198)
Keywords: generative
Abstract: Generating graphs with hierarchical structures remains a fundamental challenge due to the limitations of Euclidean geometry in capturing exponential complexity. Here we introduce \textbf{GGBall}, a novel hyperbolic framework for graph generation that integrates geometric inductive biases with modern generative paradigms. GGBall combines a Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with a Riemannian flow matching prior defined via closed-form geodesics. This design enables flow-based priors to model complex latent distributions, while vector quantization helps preserve the curvature-aware structure of the hyperbolic space. We further develop a suite of hyperbolic GNN and Transformer layers that operate entirely within the manifold, ensuring stability and scalability. Empirically, our model reduces degree MMD by over 75\% on Community-Small and over 40\% on Ego-Small compared to state-of-the-art baselines, demonstrating an improved ability to preserve topological hierarchies. These results highlight the potential of hyperbolic geometry as a powerful foundation for the generative modeling of complex, structured, and hierarchical data domains. Our code is available at \href{this https URL}{here}.

Title: TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation

Authors: Min-Jung Kim, Dongjin Kim, Seokju Yun, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07205
Pdf URL: https://arxiv.org/pdf/2506.07205
Copy Paste: [[2506.07205]] TV-LiVE: Training-Free, Text-Guided Video Editing via Layer Informed Vitality Exploitation(https://arxiv.org/abs/2506.07205)
Keywords: diffusion
Abstract: Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and nonrigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layerinformed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model guided by the layer vitality. For object addition, we further identify prominent layers to extract the mask regions corresponding to the newly added target prompt. We found that the extracted masks from the prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: this https URL

Title: SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes

Authors: Wenxuan Xie, Yaxun Dai, Wenhao Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07245
Pdf URL: https://arxiv.org/pdf/2506.07245
Copy Paste: [[2506.07245]] SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes(https://arxiv.org/abs/2506.07245)
Keywords: in-context
Abstract: Recent advancements in large language models (LLMs) have significantly improved performance on the Text-to-SQL task. However, prior approaches typically rely on static, pre-processed database information provided at inference time, which limits the model's ability to fully understand the database contents. Without dynamic interaction, LLMs are constrained to fixed, human-provided context and cannot autonomously explore the underlying data. To address this limitation, we propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. This is accomplished by generating and executing SQL probes, which allow the model to actively retrieve information from the database and iteratively update its understanding of the data. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations. When evaluated on the BIRD benchmark with Qwen2.5-72B-Instruct, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the vanilla Qwen2.5-72B-Instruct baseline, establishing a new state-of-the-art among methods based on open-source models without supervised fine-tuning (SFT) or model ensembling. Moreover, with SFT, the performance of SDE-SQL can be further enhanced, yielding an additional 0.52% improvement.

Title: Promoting Ensemble Diversity with Interactive Bayesian Distributional Robustness for Fine-tuning Foundation Models

Authors: Ngoc-Quan Pham, Tuan Truong, Quyen Tran, Tan Nguyen, Dinh Phung, Trung Le
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07247
Pdf URL: https://arxiv.org/pdf/2506.07247
Copy Paste: [[2506.07247]] Promoting Ensemble Diversity with Interactive Bayesian Distributional Robustness for Fine-tuning Foundation Models(https://arxiv.org/abs/2506.07247)
Keywords: foundation model
Abstract: We introduce Interactive Bayesian Distributional Robustness (IBDR), a novel Bayesian inference framework that allows modeling the interactions between particles, thereby enhancing ensemble quality through increased particle diversity. IBDR is grounded in a generalized theoretical framework that connects the distributional population loss with the approximate posterior, motivating a practical dual optimization procedure that enforces distributional robustness while fostering particle diversity. We evaluate IBDR's performance against various baseline methods using the VTAB-1K benchmark and the common reasoning language task. The results consistently show that IBDR outperforms these baselines, underscoring its effectiveness in real-world applications.

Title: A Stable Whitening Optimizer for Efficient Neural Network Training

Authors: Kevin Frans, Sergey Levine, Pieter Abbeel
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07254
Pdf URL: https://arxiv.org/pdf/2506.07254
Copy Paste: [[2506.07254]] A Stable Whitening Optimizer for Efficient Neural Network Training(https://arxiv.org/abs/2506.07254)
Keywords: diffusion
Abstract: In this work, we take an experimentally grounded look at neural network optimization. Building on the Shampoo family of algorithms, we identify and alleviate three key issues, resulting in the proposed SPlus method. First, we find that naive Shampoo is prone to divergence when matrix-inverses are cached for long periods. We introduce an alternate bounded update combining a historical eigenbasis with instantaneous normalization, resulting in across-the-board stability and significantly lower computational requirements. Second, we adapt a shape-aware scaling to enable learning rate transfer across network width. Third, we find that high learning rates result in large parameter noise, and propose a simple iterate-averaging scheme which unblocks faster learning. To properly confirm these findings, we introduce a pointed Transformer training benchmark, considering three objectives (language modelling, image classification, and diffusion modelling) across different stages of training. On average, SPlus is able to reach the validation performance of Adam within 44% of the gradient steps and 62% of the wallclock time.

Title: Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs

Authors: Atahan Özer, Çağatay Yıldız
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07270
Pdf URL: https://arxiv.org/pdf/2506.07270
Copy Paste: [[2506.07270]] Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs(https://arxiv.org/abs/2506.07270)
Keywords: in-context
Abstract: Large language models (LLMs) exhibit remarkable capabilities in question answering and reasoning thanks to their extensive parametric memory. However, their knowledge is inherently limited by the scope of their pre-training data, while real-world information evolves continuously. Updating this knowledge typically requires costly and brittle re-training, or in-context learning (ICL), which becomes impractical at scale given the volume and volatility of modern information. Motivated by these limitations, we investigate how LLMs perform when exposed to temporal text corpora, or documents that reflect evolving knowledge over time, such as sports biographies where facts like a player's "current team" change year by year. To this end, we introduce two new benchmarks: Temporal Wiki, which captures factual drift across historical Wikipedia snapshots, and Unified Clark, which aggregates timestamped news articles to simulate real-world information accumulation. Our analysis reveals that LLMs often struggle to reconcile conflicting or outdated facts and can be misled when multiple versions of a fact appear in context. To address these issues, we propose a lightweight, agentic framework that incrementally builds a structured, external memory from source documents without requiring re-training. This knowledge organization strategy enables models to retrieve and reason over temporally filtered, relevant information at inference time. Empirically, our method outperforms ICL and RAG baselines across both benchmarks, especially on questions requiring more complex reasoning or integration of conflicting facts.

Title: From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models

Authors: Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07280
Pdf URL: https://arxiv.org/pdf/2506.07280
Copy Paste: [[2506.07280]] From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models(https://arxiv.org/abs/2506.07280)
Keywords: diffusion, foundation model, generative
Abstract: Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally pushes them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.

Title: Multi-Step Guided Diffusion for Image Restoration on Edge Devices: Toward Lightweight Perception in Embodied AI

Authors: Aditya Chakravarty
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.07286
Pdf URL: https://arxiv.org/pdf/2506.07286
Copy Paste: [[2506.07286]] Multi-Step Guided Diffusion for Image Restoration on Edge Devices: Toward Lightweight Perception in Embodied AI(https://arxiv.org/abs/2506.07286)
Keywords: diffusion
Abstract: Diffusion models have shown remarkable flexibility for solving inverse problems without task-specific retraining. However, existing approaches such as Manifold Preserving Guided Diffusion (MPGD) apply only a single gradient update per denoising step, limiting restoration fidelity and robustness, especially in embedded or out-of-distribution settings. In this work, we introduce a multistep optimization strategy within each denoising timestep, significantly enhancing image quality, perceptual accuracy, and generalization. Our experiments on super-resolution and Gaussian deblurring demonstrate that increasing the number of gradient updates per step improves LPIPS and PSNR with minimal latency overhead. Notably, we validate this approach on a Jetson Orin Nano using degraded ImageNet and a UAV dataset, showing that MPGD, originally trained on face datasets, generalizes effectively to natural and aerial scenes. Our findings highlight MPGD's potential as a lightweight, plug-and-play restoration module for real-time visual perception in embodied AI agents such as drones and mobile robots.

Title: Pre-trained Large Language Models Learn Hidden Markov Models In-context

Authors: Yijia Dai, Zhaolin Gao, Yahya Satter, Sarah Dean, Jennifer J. Sun
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07298
Pdf URL: https://arxiv.org/pdf/2506.07298
Copy Paste: [[2506.07298]] Pre-trained Large Language Models Learn Hidden Markov Models In-context(https://arxiv.org/abs/2506.07298)
Keywords: in-context
Abstract: Hidden Markov Models (HMMs) are foundational tools for modeling sequential data with latent Markovian structure, yet fitting them to real-world data remains computationally challenging. In this work, we show that pre-trained large language models (LLMs) can effectively model data generated by HMMs via in-context learning (ICL)$\unicode{x2013}$their ability to infer patterns from examples within a prompt. On a diverse set of synthetic HMMs, LLMs achieve predictive accuracy approaching the theoretical optimum. We uncover novel scaling trends influenced by HMM properties, and offer theoretical conjectures for these empirical observations. We also provide practical guidelines for scientists on using ICL as a diagnostic tool for complex data. On real-world animal decision-making tasks, ICL achieves competitive performance with models designed by human experts. To our knowledge, this is the first demonstration that ICL can learn and predict HMM-generated sequences$\unicode{x2013}$an advance that deepens our understanding of in-context learning in LLMs and establishes its potential as a powerful tool for uncovering hidden structure in complex scientific data.

Title: Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference

Authors: Thomas Joshi, Herman Saini, Neil Dhillon, Antoni Viros i Martin, Kaoutar El Maghraoui
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07311
Pdf URL: https://arxiv.org/pdf/2506.07311
Copy Paste: [[2506.07311]] Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference(https://arxiv.org/abs/2506.07311)
Keywords: foundation model
Abstract: Large Language Models (LLMs) encounter severe memory inefficiencies during long-context inference due to conventional handling of key-value (KV) caches. In this work, we introduce a novel integration of PagedAttention with PyTorch's FlexAttention, addressing internal fragmentation and inefficiencies associated with monolithic KV cache allocations. Implemented within IBM's Foundation Model Stack (FMS), our fused attention kernel efficiently gathers scattered KV data. Our benchmarks on an NVIDIA L4 GPU (24GB) demonstrate significantly reduced inference latency, growing only linearly (~2x) with sequence length from 128 to 2048 tokens when utilizing a global KV cache, compared to exponential latency increases without caching. While peak memory usage remains largely unchanged for single-step evaluations (dominated by model weights and activations), paged attention causes minimal incremental memory usage, observable only at sequence lengths exceeding 2048 tokens due to its power-of-two cache allocations. We open-source the full implementation and discuss its implications for future long-context model deployment.

Title: Generative Modeling of Networked Time-Series via Transformer Architectures

Authors: Yusuf Elnady
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07312
Pdf URL: https://arxiv.org/pdf/2506.07312
Copy Paste: [[2506.07312]] Generative Modeling of Networked Time-Series via Transformer Architectures(https://arxiv.org/abs/2506.07312)
Keywords: generative
Abstract: Many security and network applications require having large datasets to train the machine learning models. Limited data access is a well-known problem in the security domain. Recent studies have shown the potential of Transformer models to enlarge the size of data by synthesizing new samples, but the synthesized samples don't improve the models over the real data. To address this issue, we design an efficient transformer-based model as a generative framework to generate time-series data, that can be used to boost the performance of existing and new ML workflows. Our new transformer model achieves the SOTA results. We style our model to be generalizable and work across different datasets, and produce high-quality samples.

Title: DEF: Diffusion-augmented Ensemble Forecasting

Authors: David Millard, Arielle Carr, Stéphane Gaudreault, Ali Baheri
Subjects: cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2506.07324
Pdf URL: https://arxiv.org/pdf/2506.07324
Copy Paste: [[2506.07324]] DEF: Diffusion-augmented Ensemble Forecasting(https://arxiv.org/abs/2506.07324)
Keywords: diffusion
Abstract: We present DEF (\textbf{\ul{D}}iffusion-augmented \textbf{\ul{E}}nsemble \textbf{\ul{F}}orecasting), a novel approach for generating initial condition perturbations. Modern approaches to initial condition perturbations are primarily designed for numerical weather prediction (NWP) solvers, limiting their applicability in the rapidly growing field of machine learning for weather prediction. Consequently, stochastic models in this domain are often developed on a case-by-case basis. We demonstrate that a simple conditional diffusion model can (1) generate meaningful structured perturbations, (2) be applied iteratively, and (3) utilize a guidance term to intuitivey control the level of perturbation. This method enables the transformation of any deterministic neural forecasting system into a stochastic one. With our stochastic extended systems, we show that the model accumulates less error over long-term forecasts while producing meaningful forecast distributions. We validate our approach on the 5.625$^\circ$ ERA5 reanalysis dataset, which comprises atmospheric and surface variables over a discretized global grid, spanning from the 1960s to the present. On this dataset, our method demonstrates improved predictive performance along with reasonable spread estimates.

Title: Reward Model Interpretability via Optimal and Pessimal Tokens

Authors: Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07326
Pdf URL: https://arxiv.org/pdf/2506.07326
Copy Paste: [[2506.07326]] Reward Model Interpretability via Optimal and Pessimal Tokens(https://arxiv.org/abs/2506.07326)
Keywords: generative
Abstract: Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.

Title: Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding

Authors: Bolin Chen, Shanzhi Yin, Goluck Konuko, Giuseppe Valenzise, Zihan Zhang, Shiqi Wang, Yan Ye
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07369
Pdf URL: https://arxiv.org/pdf/2506.07369
Copy Paste: [[2506.07369]] Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding(https://arxiv.org/abs/2506.07369)
Keywords: generative
Abstract: The rise of deep generative models has greatly advanced video compression, reshaping the paradigm of face video coding through their powerful capability for semantic-aware representation and lifelike synthesis. Generative Face Video Coding (GFVC) stands at the forefront of this revolution, which could characterize complex facial dynamics into compact latent codes for bitstream compactness at the encoder side and leverages powerful deep generative models to reconstruct high-fidelity face signal from the compressed latent codes at the decoder side. As such, this well-designed GFVC paradigm could enable high-fidelity face video communication at ultra-low bitrate ranges, far surpassing the capabilities of the latest Versatile Video Coding (VVC) standard. To pioneer foundational research and accelerate the evolution of GFVC, this paper presents the first comprehensive survey of GFVC technologies, systematically bridging critical gaps between theoretical innovation and industrial standardization. In particular, we first review a broad range of existing GFVC methods with different feature representations and optimization strategies, and conduct a thorough benchmarking analysis. In addition, we construct a large-scale GFVC-compressed face video database with subjective Mean Opinion Scores (MOSs) based on human perception, aiming to identify the most appropriate quality metrics tailored to GFVC. Moreover, we summarize the GFVC standardization potentials with a unified high-level syntax and develop a low-complexity GFVC system which are both expected to push forward future practical deployments and applications. Finally, we envision the potential of GFVC in industrial applications and deliberate on the current challenges and future opportunities.

Title: Enhanced Consistency Bi-directional GAN(CBiGAN) for Malware Anomaly Detection

Authors: Thesath Wijayasiri, Kar Wai Fok, Vrizlynn L. L. Thing
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.07372
Pdf URL: https://arxiv.org/pdf/2506.07372
Copy Paste: [[2506.07372]] Enhanced Consistency Bi-directional GAN(CBiGAN) for Malware Anomaly Detection(https://arxiv.org/abs/2506.07372)
Keywords: anomaly
Abstract: Static analysis, a cornerstone technique in cybersecurity, offers a noninvasive method for detecting malware by analyzing dormant software without executing potentially harmful code. However, traditional static analysis often relies on biased or outdated datasets, leading to gaps in detection capabilities against emerging malware threats. To address this, our study focuses on the binary content of files as key features for malware detection. These binary contents are transformed and represented as images, which then serve as inputs to deep learning models. This method takes into account the visual patterns within the binary data, allowing the model to analyze potential malware effectively. This paper introduces the application of the CBiGAN in the domain of malware anomaly detection. Our approach leverages the CBiGAN for its superior latent space mapping capabilities, critical for modeling complex malware patterns by utilizing a reconstruction error-based anomaly detection method. We utilized several datasets including both portable executable (PE) files as well as Object Linking and Embedding (OLE) files. We then evaluated our model against a diverse set of both PE and OLE files, including self-collected malicious executables from 214 malware families. Our findings demonstrate the robustness of this innovative approach, with the CBiGAN achieving high Area Under the Curve (AUC) results with good generalizability, thereby confirming its capability to distinguish between benign and diverse malicious files with reasonably high accuracy.

Title: DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models

Authors: Xunjie He, Christina Dao Wen Lee, Meiling Wang, Chengran Yuan, Zefan Huang, Yufeng Yue, Marcelo H. Ang Jr
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07375
Pdf URL: https://arxiv.org/pdf/2506.07375
Copy Paste: [[2506.07375]] DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models(https://arxiv.org/abs/2506.07375)
Keywords: foundation model
Abstract: Collaborative perception plays a crucial role in enhancing environmental understanding by expanding the perceptual range and improving robustness against sensor failures, which primarily involves collaborative 3D detection and tracking tasks. The former focuses on object recognition in individual frames, while the latter captures continuous instance tracklets over time. However, existing works in both areas predominantly focus on the vehicle superclass, lacking effective solutions for both multi-class collaborative detection and tracking. This limitation hinders their applicability in real-world scenarios, which involve diverse object classes with varying appearances and motion patterns. To overcome these limitations, we propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, enhancing multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics with a vision foundation model to effectively reduce ID SWitch (IDSW) errors, in cases of erroneous mismatches involving small objects like pedestrians. We further design a velocity-based adaptive tracklet management (VATM) module that adjusts the tracking interval dynamically based on object motion. Extensive experiments on the V2X-Real and OPV2V datasets show that our approach significantly outperforms existing state-of-the-art methods in both detection and tracking accuracy.

Title: Anomaly Detection and Early Warning Mechanism for Intelligent Monitoring Systems in Multi-Cloud Environments Based on LLM

Authors: Yihong Jin, Ze Yang, Juntian Liu, Xinhe Xu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07407
Pdf URL: https://arxiv.org/pdf/2506.07407
Copy Paste: [[2506.07407]] Anomaly Detection and Early Warning Mechanism for Intelligent Monitoring Systems in Multi-Cloud Environments Based on LLM(https://arxiv.org/abs/2506.07407)
Keywords: anomaly
Abstract: With the rapid development of multi-cloud environments, it is increasingly important to ensure the security and reliability of intelligent monitoring systems. In this paper, we propose an anomaly detection and early warning mechanism for intelligent monitoring system in multi-cloud environment based on Large-Scale Language Model (LLM). On the basis of the existing monitoring framework, the proposed model innovatively introduces a multi-level feature extraction method, which combines the natural language processing ability of LLM with traditional machine learning methods to enhance the accuracy of anomaly detection and improve the real-time response efficiency. By introducing the contextual understanding capabilities of LLMs, the model dynamically adapts to different cloud service providers and environments, so as to more effectively detect abnormal patterns and predict potential failures. Experimental results show that the proposed model is significantly better than the traditional anomaly detection system in terms of detection accuracy and latency, and significantly improves the resilience and active management ability of cloud infrastructure.

Title: LG-ANNA-Embedding technical report

Authors: Jooyoung Choi, Hyun Kim, Hansol Jang, Changwook Jun, Kyunghoon Bae, Hyewon Choi, Stanley Jungkyu Choi, Honglak Lee, Chulmin Yun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07438
Pdf URL: https://arxiv.org/pdf/2506.07438
Copy Paste: [[2506.07438]] LG-ANNA-Embedding technical report(https://arxiv.org/abs/2506.07438)
Keywords: in-context
Abstract: This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out semantically ambiguous negatives based on their similarity to positive examples, thereby enhancing training stability and retrieval robustness. Our model is evaluated on the newly introduced MTEB (English, v2) benchmark, covering 41 tasks across seven categories. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score, outperforming several larger or fully fine-tuned baselines. These findings highlight the effectiveness of combining in-context prompting, soft supervision, and adaptive sampling for scalable, high-quality embedding generation.

Title: Federated In-Context Learning: Iterative Refinement for Improved Answer Quality

Authors: Ruhan Wang, Zhiyong Wang, Chengkai Huang, Rui Wang, Tong Yu, Lina Yao, John C.S. Lui, Dongruo Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07440
Pdf URL: https://arxiv.org/pdf/2506.07440
Copy Paste: [[2506.07440]] Federated In-Context Learning: Iterative Refinement for Improved Answer Quality(https://arxiv.org/abs/2506.07440)
Keywords: in-context
Abstract: For question-answering (QA) tasks, in-context learning (ICL) enables language models to generate responses without modifying their parameters by leveraging examples provided in the input. However, the effectiveness of ICL heavily depends on the availability of high-quality examples, which are often scarce due to data privacy constraints, annotation costs, and distribution disparities. A natural solution is to utilize examples stored on client devices, but existing approaches either require transmitting model parameters - incurring significant communication overhead - or fail to fully exploit local datasets, limiting their effectiveness. To address these challenges, we propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters. We establish theoretical guarantees for the convergence of Fed-ICL and conduct extensive experiments on standard QA benchmarks, demonstrating that our proposed approach achieves strong performance while maintaining low communication costs.

Title: PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation

Authors: Wei Yao, Yunlian Sun, Chang Liu, Hongwen Zhang, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07456
Pdf URL: https://arxiv.org/pdf/2506.07456
Copy Paste: [[2506.07456]] PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation(https://arxiv.org/abs/2506.07456)
Keywords: generative
Abstract: Driven by advancements in motion capture and generative artificial intelligence, leveraging large-scale MoCap datasets to train generative models for synthesizing diverse, realistic human motions has become a promising research direction. However, existing motion-capture techniques and generative models often neglect physical constraints, leading to artifacts such as interpenetration, sliding, and floating. These issues are exacerbated in multi-person motion generation, where complex interactions are involved. To address these limitations, we introduce physical mapping, integrated throughout the human interaction generation pipeline. Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space. The resulting motions are adjusted to adhere to real-world physics constraints while retaining their original semantic meaning. This mapping not only improves MoCap data quality but also directly informs post-processing of generated motions. Given the unique interactivity of multi-person scenarios, we propose a tailored motion representation framework. Motion Consistency (MC) and Marker-based Interaction (MI) loss functions are introduced to improve model performance. Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity. Project page this http URL

Title: ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning

Authors: Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, Ge Liu
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.07459
Pdf URL: https://arxiv.org/pdf/2506.07459
Copy Paste: [[2506.07459]] ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning(https://arxiv.org/abs/2506.07459)
Keywords: generative
Abstract: Protein generative models have shown remarkable promise in protein design but still face limitations in success rate, due to the scarcity of high-quality protein datasets for supervised pretraining. We present ProteinZero, a novel framework that enables scalable, automated, and continuous self-improvement of the inverse folding model through online reinforcement learning. To achieve computationally tractable online feedback, we introduce efficient proxy reward models based on ESM-fold and a novel rapid ddG predictor that significantly accelerates evaluation speed. ProteinZero employs a general RL framework balancing multi-reward maximization, KL-divergence from a reference model, and a novel protein-embedding level diversity regularization that prevents mode collapse while promoting higher sequence diversity. Through extensive experiments, we demonstrate that ProteinZero substantially outperforms existing methods across every key metric in protein design, achieving significant improvements in structural accuracy, designability, thermodynamic stability, and sequence diversity. Most impressively, ProteinZero reduces design failure rates by approximately 36% - 48% compared to widely-used methods like ProteinMPNN, ESM-IF and InstructPLM, consistently achieving success rates exceeding 90% across diverse and complex protein folds. Notably, the entire RL run on CATH-4.3 can be done with a single 8 X GPU node in under 3 days, including reward computation. Our work establishes a new paradigm for protein design where models evolve continuously from their own generated outputs, opening new possibilities for exploring the vast protein design space.

Title: Circumventing Backdoor Space via Weight Symmetry

Authors: Jie Peng, Hongwei Yang, Jing Zhao, Hengji Dong, Hui He, Weizhe Zhang, Haoyu He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07467
Pdf URL: https://arxiv.org/pdf/2506.07467
Copy Paste: [[2506.07467]] Circumventing Backdoor Space via Weight Symmetry(https://arxiv.org/abs/2506.07467)
Keywords: self-supervised
Abstract: Deep neural networks are vulnerable to backdoor attacks, where malicious behaviors are implanted during training. While existing defenses can effectively purify compromised models, they typically require labeled data or specific training procedures, making them difficult to apply beyond supervised learning settings. Notably, recent studies have shown successful backdoor attacks across various learning paradigms, highlighting a critical security concern. To address this gap, we propose Two-stage Symmetry Connectivity (TSC), a novel backdoor purification defense that operates independently of data format and requires only a small fraction of clean samples. Through theoretical analysis, we prove that by leveraging permutation invariance in neural networks and quadratic mode connectivity, TSC amplifies the loss on poisoned samples while maintaining bounded clean accuracy. Experiments demonstrate that TSC achieves robust performance comparable to state-of-the-art methods in supervised learning scenarios. Furthermore, TSC generalizes to self-supervised learning frameworks, such as SimCLR and CLIP, maintaining its strong defense capabilities. Our code is available at this https URL.

Title: Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video

Authors: Yahao Shi, Yang Liu, Yanmin Wu, Xing Liu, Chen Zhao, Jie Luo, Bin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07489
Pdf URL: https://arxiv.org/pdf/2506.07489
Copy Paste: [[2506.07489]] Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video(https://arxiv.org/abs/2506.07489)
Keywords: diffusion
Abstract: We propose DriveAnyMesh, a method for driving mesh guided by monocular video. Current 4D generation techniques encounter challenges with modern rendering engines. Implicit methods have low rendering efficiency and are unfriendly to rasterization-based engines, while skeletal methods demand significant manual effort and lack cross-category generalization. Animating existing 3D assets, instead of creating 4D assets from scratch, demands a deep understanding of the input's 3D structure. To tackle these challenges, we present a 4D diffusion model that denoises sequences of latent sets, which are then decoded to produce mesh animations from point cloud trajectory sequences. These latent sets leverage a transformer-based variational autoencoder, simultaneously capturing 3D shape and motion information. By employing a spatiotemporal, transformer-based diffusion model, information is exchanged across multiple latent frames, enhancing the efficiency and generalization of the generated results. Our experimental results demonstrate that DriveAnyMesh can rapidly produce high-quality animations for complex motions and is compatible with modern rendering engines. This method holds potential for applications in both the gaming and filming industries.

Title: Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency

Authors: Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07497
Pdf URL: https://arxiv.org/pdf/2506.07497
Copy Paste: [[2506.07497]] Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency(https://arxiv.org/abs/2506.07497)
Keywords: diffusion
Abstract: We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.

Title: DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction

Authors: Solee Im, Wonjun Lee, Jinmyeong An, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07510
Pdf URL: https://arxiv.org/pdf/2506.07510
Copy Paste: [[2506.07510]] DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction(https://arxiv.org/abs/2506.07510)
Keywords: generative, in-context
Abstract: We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: this https URL

Title: APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs

Authors: Bowen Liu, Weiyi Zhang, Peranut Chotcomwongse, Xiaolan Chen, Ruoyu Chen, Pawin Pakaymaskul, Niracha Arjkongharn, Nattaporn Vongsa, Xuelian Cheng, Zongyuan Ge, Kun Huang, Xiaohui Li, Yiru Duan, Zhenbang Wang, BaoYe Xie, Qiang Chen, Huazhu Fu, Michael A. Mahr, Jiaqi Qu, Wangyiyang Chen, Shiye Wang, Yubo Tan, Yongjie Li, Mingguang He, Danli Shi, Paisan Ruamviboonsuk
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07542
Pdf URL: https://arxiv.org/pdf/2506.07542
Copy Paste: [[2506.07542]] APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs(https://arxiv.org/abs/2506.07542)
Keywords: foundation model, generative
Abstract: Optical Coherence Tomography (OCT) provides high-resolution, 3D, and non-invasive visualization of retinal layers in vivo, serving as a critical tool for lesion localization and disease diagnosis. However, its widespread adoption is limited by equipment costs and the need for specialized operators. In comparison, 2D color fundus photography offers faster acquisition and greater accessibility with less dependence on expensive devices. Although generative artificial intelligence has demonstrated promising results in medical image synthesis, translating 2D fundus images into 3D OCT images presents unique challenges due to inherent differences in data dimensionality and biological information between modalities. To advance generative models in the fundus-to-3D-OCT setting, the Asia Pacific Tele-Ophthalmology Society (APTOS-2024) organized a challenge titled Artificial Intelligence-based OCT Generation from Fundus Images. This paper details the challenge framework (referred to as APTOS-2024 Challenge), including: the benchmark dataset, evaluation methodology featuring two fidelity metrics-image-based distance (pixel-level OCT B-scan similarity) and video-based distance (semantic-level volumetric consistency), and analysis of top-performing solutions. The challenge attracted 342 participating teams, with 42 preliminary submissions and 9 finalists. Leading methodologies incorporated innovations in hybrid data preprocessing or augmentation (cross-modality collaborative paradigms), pre-training on external ophthalmic imaging datasets, integration of vision foundation models, and model architecture improvement. The APTOS-2024 Challenge is the first benchmark demonstrating the feasibility of fundus-to-3D-OCT synthesis as a potential solution for improving ophthalmic care accessibility in under-resourced healthcare settings, while helping to expedite medical research and clinical applications.

Title: Cross-channel Perception Learning for H&E-to-IHC Virtual Staining

Authors: Hao Yang, JianYu Wu, Run Fang, Xuelian Zhao, Yuan Ji, Zhiyu Chen, Guibin He, Junceng Guo, Yang Liu, Xinhua Zeng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07559
Pdf URL: https://arxiv.org/pdf/2506.07559
Copy Paste: [[2506.07559]] Cross-channel Perception Learning for H&E-to-IHC Virtual Staining(https://arxiv.org/abs/2506.07559)
Keywords: foundation model
Abstract: With the rapid development of digital pathology, virtual staining has become a key technology in multimedia medical information systems, offering new possibilities for the analysis and diagnosis of pathological images. However, existing H&E-to-IHC studies often overlook the cross-channel correlations between cell nuclei and cell membranes. To address this issue, we propose a novel Cross-Channel Perception Learning (CCPL) strategy. Specifically, CCPL first decomposes HER2 immunohistochemical staining into Hematoxylin and DAB staining channels, corresponding to cell nuclei and cell membranes, respectively. Using the pathology foundation model Gigapath's Tile Encoder, CCPL extracts dual-channel features from both the generated and real images and measures cross-channel correlations between nuclei and membranes. The features of the generated and real stained images, obtained through the Tile Encoder, are also used to calculate feature distillation loss, enhancing the model's feature extraction capabilities without increasing the inference burden. Additionally, CCPL performs statistical analysis on the focal optical density maps of both single channels to ensure consistency in staining distribution and intensity. Experimental results, based on quantitative metrics such as PSNR, SSIM, PCC, and FID, along with professional evaluations from pathologists, demonstrate that CCPL effectively preserves pathological features, generates high-quality virtual stained images, and provides robust support for automated pathological diagnosis using multimedia medical data.

Title: LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization

Authors: Yixuan Yang, Zhen Luo, Tongsheng Ding, Junru Lu, Mingqi Gao, Jinyu Yang, Victor Sanchez, Feng Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07570
Pdf URL: https://arxiv.org/pdf/2506.07570
Copy Paste: [[2506.07570]] LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization(https://arxiv.org/abs/2506.07570)
Keywords: diffusion
Abstract: Automatic indoor layout generation has attracted increasing attention due to its potential in interior design, virtual environment construction, and embodied AI. Existing methods fall into two categories: prompt-driven approaches that leverage proprietary LLM services (e.g., GPT APIs) and learning-based methods trained on layout data upon diffusion-based models. Prompt-driven methods often suffer from spatial inconsistency and high computational costs, while learning-based methods are typically constrained by coarse relational graphs and limited datasets, restricting their generalization to diverse room categories. In this paper, we revisit LLM-based indoor layout generation and present 3D-SynthPlace, a large-scale dataset that combines synthetic layouts generated via a 'GPT synthesize, Human inspect' pipeline, upgraded from the 3D-Front dataset. 3D-SynthPlace contains nearly 17,000 scenes, covering four common room types -- bedroom, living room, kitchen, and bathroom -- enriched with diverse objects and high-level spatial annotations. We further introduce OptiScene, a strong open-source LLM optimized for indoor layout generation, fine-tuned based on our 3D-SynthPlace dataset through our two-stage training. For the warum-up stage I, we adopt supervised fine-tuning (SFT), which is taught to first generate high-level spatial descriptions then conditionally predict concrete object placements. For the reinforcing stage II, to better align the generated layouts with human design preferences, we apply multi-turn direct preference optimization (DPO), which significantly improving layout quality and generation success rates. Extensive experiments demonstrate that OptiScene outperforms traditional prompt-driven and learning-based baselines. Moreover, OptiScene shows promising potential in interactive tasks such as scene editing and robot navigation.

Title: Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding

Authors: Boyu Chen, Siran Chen, Kunchang Li, Qinglin Xu, Yu Qiao, Yali Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07576
Pdf URL: https://arxiv.org/pdf/2506.07576
Copy Paste: [[2506.07576]] Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding(https://arxiv.org/abs/2506.07576)
Keywords: foundation model
Abstract: Video understanding has been considered as one critical step towards world modeling, which is an important long-term problem in AI research. Recently, multi-modal foundation models have shown such potential via large-scale pretraining. However, these models simply align encoders of different modalities via contrastive learning, while lacking deeper multi-modal interactions, which is critical for understanding complex target movements with diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through recursive association of multi-modal encoders in the foundation models. Specifically, we creatively treat those well-trained encoders as "super neurons" in our SEN. Via designing a Recursive Association (RA) block, we progressively fuse multi-modalities with the input video, based on knowledge integrating, distributing, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multi-modal interactions, for prompting various video understanding tasks in downstream. Extensive experiments show that, our SEN can remarkably boost the four most representative video tasks, including tracking, recognition, chatting, and editing, e.g., for pixel-level tracking, the average jaccard index improves 2.7%, temporal coherence(TC) drops 8.8% compared to the popular CaDeX++ approach. For one-shot video editing, textual alignment improves 6.4%, and frame consistency increases 4.1% compared to the popular TuneA-Video approach.

Title: MIRA: Medical Time Series Foundation Model for Real-World Health Data

Authors: Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, Jiang Bian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07584
Pdf URL: https://arxiv.org/pdf/2506.07584
Copy Paste: [[2506.07584]] MIRA: Medical Time Series Foundation Model for Real-World Health Data(https://arxiv.org/abs/2506.07584)
Keywords: foundation model
Abstract: A unified foundation model for medical time series -- pretrained on open access and ethics board-approved medical corpora -- offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundation models struggle to handle medical time series data due to their inherent challenges, including irregular intervals, heterogeneous sampling rates, and frequent missing values. To address these challenges, we introduce MIRA, a unified foundation model specifically designed for medical time series forecasting. MIRA incorporates a Continuous-Time Rotary Positional Encoding that enables fine-grained modeling of variable time intervals, a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes to further promote temporal specialization, and a Continuous Dynamics Extrapolation Block based on Neural ODE that models the continuous trajectory of latent states, enabling accurate forecasting at arbitrary target timestamps. Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collect from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 10% and 7% in out-of-distribution and in-distribution scenarios, respectively, when compared to other zero-shot and fine-tuned baselines. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.

Title: MalGEN: A Generative Agent Framework for Modeling Malicious Software in Cybersecurity

Authors: Bikash Saha, Sandeep Kumar Shukla
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.07586
Pdf URL: https://arxiv.org/pdf/2506.07586
Copy Paste: [[2506.07586]] MalGEN: A Generative Agent Framework for Modeling Malicious Software in Cybersecurity(https://arxiv.org/abs/2506.07586)
Keywords: generative
Abstract: The dual use nature of Large Language Models (LLMs) presents a growing challenge in cybersecurity. While LLM enhances automation and reasoning for defenders, they also introduce new risks, particularly their potential to be misused for generating evasive, AI crafted malware. Despite this emerging threat, the research community currently lacks controlled and extensible tools that can simulate such behavior for testing and defense preparation. We present MalGEN, a multi agent framework that simulates coordinated adversarial behavior to generate diverse, activity driven malware samples. The agents work collaboratively to emulate attacker workflows, including payload planning, capability selection, and evasion strategies, within a controlled environment built for ethical and defensive research. Using MalGEN, we synthesized ten novel malware samples and evaluated them against leading antivirus and behavioral detection engines. Several samples exhibited stealthy and evasive characteristics that bypassed current defenses, validating MalGEN's ability to model sophisticated and new threats. By transforming the threat of LLM misuse into an opportunity for proactive defense, MalGEN offers a valuable framework for evaluating and strengthening cybersecurity systems. The framework addresses data scarcity, enables rigorous testing, and supports the development of resilient and future ready detection strategies.

Title: Explore the vulnerability of black-box models via diffusion models

Authors: Jiacheng Shi, Yanfu Zhang, Huajie Shao, Ashley Gao
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07590
Pdf URL: https://arxiv.org/pdf/2506.07590
Copy Paste: [[2506.07590]] Explore the vulnerability of black-box models via diffusion models(https://arxiv.org/abs/2506.07590)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have enabled high-fidelity and photorealistic image generation across diverse applications. However, these models also present security and privacy risks, including copyright violations, sensitive information leakage, and the creation of harmful or offensive content that could be exploited maliciously. In this study, we uncover a novel security threat where an attacker leverages diffusion model APIs to generate synthetic images, which are then used to train a high-performing substitute model. This enables the attacker to execute model extraction and transfer-based adversarial attacks on black-box classification models with minimal queries, without needing access to the original training data. The generated images are sufficiently high-resolution and diverse to train a substitute model whose outputs closely match those of the target model. Across the seven benchmarks, including CIFAR and ImageNet subsets, our method shows an average improvement of 27.37% over state-of-the-art methods while using just 0.01 times of the query budget, achieving a 98.68% success rate in adversarial attacks on the target model.

Title: SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis

Authors: Jianhui Wei, Zikai Xiao, Danyu Sun, Luqi Gong, Zongxin Yang, Zuozhu Liu, Jian Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07603
Pdf URL: https://arxiv.org/pdf/2506.07603
Copy Paste: [[2506.07603]] SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis(https://arxiv.org/abs/2506.07603)
Keywords: foundation model
Abstract: Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce \textbf{SurgBench}, a unified surgical video benchmarking framework comprising a pretraining dataset, \textbf{SurgBench-P}, and an evaluation benchmark, \textbf{SurgBench-E}. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.

Title: ProSplat: Improved Feed-Forward 3D Gaussian Splatting for Wide-Baseline Sparse Views

Authors: Xiaohan Lu, Jiaye Fu, Jiaqi Zhang, Zetian Song, Chuanmin Jia, Siwei Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07670
Pdf URL: https://arxiv.org/pdf/2506.07670
Copy Paste: [[2506.07670]] ProSplat: Improved Feed-Forward 3D Gaussian Splatting for Wide-Baseline Sparse Views(https://arxiv.org/abs/2506.07670)
Keywords: diffusion
Abstract: Feed-forward 3D Gaussian Splatting (3DGS) has recently demonstrated promising results for novel view synthesis (NVS) from sparse input views, particularly under narrow-baseline conditions. However, its performance significantly degrades in wide-baseline scenarios due to limited texture details and geometric inconsistencies across views. To address these challenges, in this paper, we propose ProSplat, a two-stage feed-forward framework designed for high-fidelity rendering under wide-baseline conditions. The first stage involves generating 3D Gaussian primitives via a 3DGS generator. In the second stage, rendered views from these primitives are enhanced through an improvement model. Specifically, this improvement model is based on a one-step diffusion model, further optimized by our proposed Maximum Overlap Reference view Injection (MORI) and Distance-Weighted Epipolar Attention (DWEA). MORI supplements missing texture and color by strategically selecting a reference view with maximum viewpoint overlap, while DWEA enforces geometric consistency using epipolar constraints. Additionally, we introduce a divide-and-conquer training strategy that aligns data distributions between the two stages through joint optimization. We evaluate ProSplat on the RealEstate10K and DL3DV-10K datasets under wide-baseline settings. Experimental results demonstrate that ProSplat achieves an average improvement of 1 dB in PSNR compared to recent SOTA methods.

Title: NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation

Authors: Yuxiao Yang, Peihao Li, Yuhong Zhang, Junzhe Lu, Xianglong He, Minghan Qin, Weitao Wang, Haoqian Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07698
Pdf URL: https://arxiv.org/pdf/2506.07698
Copy Paste: [[2506.07698]] NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation(https://arxiv.org/abs/2506.07698)
Keywords: diffusion
Abstract: 3D AI-generated content (AIGC) has made it increasingly accessible for anyone to become a 3D content creator. While recent methods leverage Score Distillation Sampling to distill 3D objects from pretrained image diffusion models, they often suffer from inadequate 3D priors, leading to insufficient multi-view consistency. In this work, we introduce NOVA3D, an innovative single-image-to-3D generation framework. Our key insight lies in leveraging strong 3D priors from a pretrained video diffusion model and integrating geometric information during multi-view video fine-tuning. To facilitate information exchange between color and geometric domains, we propose the Geometry-Temporal Alignment (GTA) attention mechanism, thereby improving generalization and multi-view consistency. Moreover, we introduce the de-conflict geometry fusion algorithm, which improves texture fidelity by addressing multi-view inaccuracies and resolving discrepancies in pose alignment. Extensive experiments validate the superiority of NOVA3D over existing baselines.

Title: Evaluating Robustness in Latent Diffusion Models via Embedding Level Augmentation

Authors: Boris Martirosyan, Alexey Karmanov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07706
Pdf URL: https://arxiv.org/pdf/2506.07706
Copy Paste: [[2506.07706]] Evaluating Robustness in Latent Diffusion Models via Embedding Level Augmentation(https://arxiv.org/abs/2506.07706)
Keywords: diffusion
Abstract: Latent diffusion models (LDMs) achieve state-of-the-art performance across various tasks, including image generation and video synthesis. However, they generally lack robustness, a limitation that remains not fully explored in current research. In this paper, we propose several methods to address this gap. First, we hypothesize that the robustness of LDMs primarily should be measured without their text encoder, because if we take and explore the whole architecture, the problems of image generator and text encoders wll be fused. Second, we introduce novel data augmentation techniques designed to reveal robustness shortcomings in LDMs when processing diverse textual prompts. We then fine-tune Stable Diffusion 3 and Stable Diffusion XL models using Dreambooth, incorporating these proposed augmentation methods across multiple tasks. Finally, we propose a novel evaluation pipeline specifically tailored to assess the robustness of LDMs fine-tuned via Dreambooth.

Title: Consistent Video Editing as Flow-Driven Image-to-Video Generation

Authors: Ge Wang, Songlin Fan, Hangxu Liu, Quanjian Song, Hewei Wang, Jinfeng Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07713
Pdf URL: https://arxiv.org/pdf/2506.07713
Copy Paste: [[2506.07713]] Consistent Video Editing as Flow-Driven Image-to-Video Generation(https://arxiv.org/abs/2506.07713)
Keywords: diffusion
Abstract: With the prosper of video diffusion models, down-stream applications like video editing have been significantly promoted without consuming much computational cost. One particular challenge in this task lies at the motion transfer process from the source video to the edited one, where it requires the consideration of the shape deformation in between, meanwhile maintaining the temporal consistency in the generated video sequence. However, existing methods fail to model complicated motion patterns for video editing, and are fundamentally limited to object replacement, where tasks with non-rigid object motions like multi-object and portrait editing are largely neglected. In this paper, we observe that optical flows offer a promising alternative in complex motion modeling, and present FlowV2V to re-investigate video editing as a task of flow-driven Image-to-Video (I2V) generation. Specifically, FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates pseudo flow sequence that aligns with the deformed shape, thus ensuring the consistency during editing. Experimental results on DAVIS-EDIT with improvements of 13.67% and 50.66% on DOVER and warping error illustrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art ones. Furthermore, we conduct comprehensive ablation studies to analyze the internal functionalities of the first-frame paradigm and flow alignment in the proposed method.

Title: AssetDropper: Asset Extraction via Diffusion Models with Reward-Driven Optimization

Authors: Lanjiong Li, Guanhua Zhao, Lingting Zhu, Zeyu Cai, Lequan Yu, Jian Zhang, Zeyu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07738
Pdf URL: https://arxiv.org/pdf/2506.07738
Copy Paste: [[2506.07738]] AssetDropper: Asset Extraction via Diffusion Models with Reward-Driven Optimization(https://arxiv.org/abs/2506.07738)
Keywords: diffusion, generative
Abstract: Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw materials for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first framework designed to extract assets from reference images, providing artists with an open-world asset palette. Our model adeptly extracts a front view of selected subjects from input images, effectively handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating the exploration of future research in downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to fulfill a closed-loop with feedback. We design the reward model to perform an inverse task that pastes the extracted assets back into the reference sources, which assists training with additional consistency and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves the state-of-the-art results in asset extraction. Project page: this http URL.

Title: Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images

Authors: Yingping Liang, Ying Fu, Yutao Hu, Wenqi Shao, Jiaming Liu, Debing Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07740
Pdf URL: https://arxiv.org/pdf/2506.07740
Copy Paste: [[2506.07740]] Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images(https://arxiv.org/abs/2506.07740)
Keywords: foundation model
Abstract: Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for video tasks. However, the real-world robustness is limited by animated synthetic datasets for training. This introduces domain gaps when applied to real-world applications and limits the benefits of scaling up datasets. To address these challenges, we propose \textbf{Flow-Anything}, a large-scale data generation framework designed to learn optical flow estimation from any single-view images in the real world. We employ two effective steps to make data scaling-up promising. First, we convert a single-view image into a 3D representation using advanced monocular depth estimation networks. This allows us to render optical flow and novel view images under a virtual camera. Second, we develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to model the dynamic objects in the 3D representation. These two steps allow us to generate realistic datasets for training from large-scale single-view images, namely \textbf{FA-Flow Dataset}. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images, outperforming the most advanced unsupervised methods and supervised methods on synthetic datasets. Moreover, our models serve as a foundation model and enhance the performance of various downstream video tasks.

Title: Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation

Authors: Hyunsoo Kim, Donghyun Kim, Suhyun Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07750
Pdf URL: https://arxiv.org/pdf/2506.07750
Copy Paste: [[2506.07750]] Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation(https://arxiv.org/abs/2506.07750)
Keywords: diffusion, in-context
Abstract: How can we generate an image B' that satisfies A:A'::B:B', given the input images A,A' and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g. InstructPix2Pix. Inpainting models) rather than general diffusion models (e.g. Stable Diffusion, SDXL). This dependency may lead to inherited biases or lower editing capabilities. In this paper, we propose Difference Inversion, a method that isolates only the difference from A and A' and applies it to B to generate a plausible B'. To address model dependency, it is crucial to structure prompts in the form of a "Full Prompt" suitable for input to stable diffusion models, rather than using an "Instruction Prompt". To this end, we accurately extract the Difference between A and A' and combine it with the prompt of B, enabling a plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose the 2) Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate more feasible B' in a model-agnostic manner.

Title: Comparing Credit Risk Estimates in the Gen-AI Era

Authors: Nicola Lavecchia, Sid Fadanelli, Federico Ricciuti, Gennaro Aloe, Enrico Bagli, Pietro Giuffrida, Daniele Vergari
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07754
Pdf URL: https://arxiv.org/pdf/2506.07754
Copy Paste: [[2506.07754]] Comparing Credit Risk Estimates in the Gen-AI Era(https://arxiv.org/abs/2506.07754)
Keywords: generative
Abstract: Generative AI technologies have demonstrated significant potential across diverse applications. This study provides a comparative analysis of credit score modeling techniques, contrasting traditional approaches with those leveraging generative AI. Our findings reveal that current generative AI models fall short of matching the performance of traditional methods, regardless of the integration strategy employed. These results highlight the limitations in the current capabilities of generative AI for credit risk scoring, emphasizing the need for further research and development before the possibility of applying generative AI for this specific task, or equivalent ones.

Title: Language-Vision Planner and Executor for Text-to-Visual Reasoning

Authors: Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Sihao Hu, Tiansheng Huang, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07778
Pdf URL: https://arxiv.org/pdf/2506.07778
Copy Paste: [[2506.07778]] Language-Vision Planner and Executor for Text-to-Visual Reasoning(https://arxiv.org/abs/2506.07778)
Keywords: in-context
Abstract: The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal visual-text reasoning capabilities. However, existing vision-language models (VLMs) to date suffer from generalization performance. Inspired by recent development in LLMs for visual reasoning, this paper presents VLAgent, an AI system that can create a step-by-step visual reasoning plan with an easy-to-understand script and execute each step of the plan in real time by integrating planning script with execution verifications via an automated process supported by VLAgent. In the task planning phase, VLAgent fine-tunes an LLM through in-context learning to generate a step-by-step planner for each user-submitted text-visual reasoning task. During the plan execution phase, VLAgent progressively refines the composition of neuro-symbolic executable modules to generate high-confidence reasoning results. VLAgent has three unique design characteristics: First, we improve the quality of plan generation through in-context learning, improving logic reasoning by reducing erroneous logic steps, incorrect programs, and LLM hallucinations. Second, we design a syntax-semantics parser to identify and correct additional logic errors of the LLM-generated planning script prior to launching the plan executor. Finally, we employ the ensemble method to improve the generalization performance of our step-executor. Extensive experiments with four visual reasoning benchmarks (GQA, MME, NLVR2, VQAv2) show that VLAgent achieves significant performance enhancement for multimodal text-visual reasoning applications, compared to the exiting representative VLMs and LLM based visual composition approaches like ViperGPT and VisProg, thanks to the novel optimization modules of VLAgent back-engine (SS-Parser, Plan Repairer, Output Verifiers). Code and data will be made available upon paper acceptance.

Title: Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger

Authors: Qi Yang, Chenghao Zhang, Lubin Fan, Kun Ding, Jieping Ye, Shiming Xiang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07785
Pdf URL: https://arxiv.org/pdf/2506.07785
Copy Paste: [[2506.07785]] Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger(https://arxiv.org/abs/2506.07785)
Keywords: in-context
Abstract: Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs. Our code is available at this https URL.

Title: Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

Authors: Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07813
Pdf URL: https://arxiv.org/pdf/2506.07813
Copy Paste: [[2506.07813]] Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution(https://arxiv.org/abs/2506.07813)
Keywords: diffusion, generative
Abstract: Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches in this domain utilize regression-based or generative models, but many of them are a single-stage upsampling process, which may be challenging to learn across a wide, continuous distribution of scaling factors. Progressive upsampling strategies have shown promise in mitigating this issue, yet their integration with diffusion models for flexible upscaling remains underexplored. Here, we present CasArbi, a novel self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi meets the varying scaling demands by breaking them down into smaller sequential factors and progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. Our novel coordinate-guided residual diffusion model allows for the learning of continuous image representations while enabling efficient diffusion sampling. Extensive experiments demonstrate that our CasArbi outperforms prior arts in both perceptual and distortion performance metrics across diverse arbitrary-scale super-resolution benchmarks.

Title: WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

Authors: Zhiyu Lin, Zhengda Zhou, Zhiyuan Zhao, Tianrui Wan, Yilun Ma, Junyu Gao, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.07818
Pdf URL: https://arxiv.org/pdf/2506.07818
Copy Paste: [[2506.07818]] WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code(https://arxiv.org/abs/2506.07818)
Keywords: generative
Abstract: With the rapid advancement of Generative AI technology, Multimodal Large Language Models(MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming,WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. The extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weakness that models encountered during the development process.

Title: Accelerating Diffusion Models in Offline RL via Reward-Aware Consistency Trajectory Distillation

Authors: Xintong Duan, Yutong He, Fahim Tajwar, Ruslan Salakhutdinov, J. Zico Kolter, Jeff Schneider
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07822
Pdf URL: https://arxiv.org/pdf/2506.07822
Copy Paste: [[2506.07822]] Accelerating Diffusion Models in Offline RL via Reward-Aware Consistency Trajectory Distillation(https://arxiv.org/abs/2506.07822)
Keywords: diffusion
Abstract: Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While the consistency model offers a potential solution, its applications to decision-making often struggle with suboptimal demonstrations or rely on complex concurrent training of multiple networks. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method enables single-step generation while maintaining higher performance and simpler training. Empirical evaluations on the Gym MuJoCo benchmarks and long horizon planning demonstrate that our approach can achieve an 8.7% improvement over previous state-of-the-art while offering up to 142x speedup over diffusion counterparts in inference time.

Title: R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation

Authors: William Ljungbergh, Bernardo Taveira, Wenzhao Zheng, Adam Tonderski, Chensheng Peng, Fredrik Kahl, Christoffer Petersson, Michael Felsberg, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan
Subjects: cs.CV, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.07826
Pdf URL: https://arxiv.org/pdf/2506.07826
Copy Paste: [[2506.07826]] R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation(https://arxiv.org/abs/2506.07826)
Keywords: diffusion, generative
Abstract: Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects-such as shadows and consistent lighting-in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we will release our dataset and code, see this https URL.

Title: Diffusion models under low-noise regime

Authors: Elizabeth Pavlova, Xue-Xin Wei
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.07841
Pdf URL: https://arxiv.org/pdf/2506.07841
Copy Paste: [[2506.07841]] Diffusion models under low-noise regime(https://arxiv.org/abs/2506.07841)
Keywords: diffusion, generative
Abstract: Recent work on diffusion models proposed that they operate in two regimes: memorization, in which models reproduce their training data, and generalization, in which they generate novel samples. While this has been tested in high-noise settings, the behavior of diffusion models as effective denoisers when the corruption level is small remains unclear. To address this gap, we systematically investigated the behavior of diffusion models under low-noise diffusion dynamics, with implications for model robustness and interpretability. Using (i) CelebA subsets of varying sample sizes and (ii) analytic Gaussian mixture benchmarks, we reveal that models trained on disjoint data diverge near the data manifold even when their high-noise outputs converge. We quantify how training set size, data geometry, and model objective choice shape denoising trajectories and affect score accuracy, providing insights into how these models actually learn representations of data distributions. This work starts to address gaps in our understanding of generative model reliability in practical applications where small perturbations are common.

Title: Jarzynski Reweighting and Sampling Dynamics for Training Energy-Based Models: Theoretical Analysis of Different Transition Kernels

Authors: Davide Carbone
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.07843
Pdf URL: https://arxiv.org/pdf/2506.07843
Copy Paste: [[2506.07843]] Jarzynski Reweighting and Sampling Dynamics for Training Energy-Based Models: Theoretical Analysis of Different Transition Kernels(https://arxiv.org/abs/2506.07843)
Keywords: diffusion, generative
Abstract: Energy-Based Models (EBMs) provide a flexible framework for generative modeling, but their training remains theoretically challenging due to the need to approximate normalization constants and efficiently sample from complex, multi-modal distributions. Traditional methods, such as contrastive divergence and score matching, introduce biases that can hinder accurate learning. In this work, we present a theoretical analysis of Jarzynski reweighting, a technique from non-equilibrium statistical mechanics, and its implications for training EBMs. We focus on the role of the choice of the kernel and we illustrate these theoretical considerations in two key generative frameworks: (i) flow-based diffusion models, where we reinterpret Jarzynski reweighting in the context of stochastic interpolants to mitigate discretization errors and improve sample quality, and (ii) Restricted Boltzmann Machines, where we analyze its role in correcting the biases of contrastive divergence. Our results provide insights into the interplay between kernel choice and model performance, highlighting the potential of Jarzynski reweighting as a principled tool for generative learning.

Title: VIVAT: Virtuous Improving VAE Training through Artifact Mitigation

Authors: Lev Novitskiy, Viacheslav Vasilev, Maria Kovaleva, Vladimir Arkhipkin, Denis Dimitrov
Subjects: cs.CV, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.07863
Pdf URL: https://arxiv.org/pdf/2506.07863
Copy Paste: [[2506.07863]] VIVAT: Virtuous Improving VAE Training through Artifact Mitigation(https://arxiv.org/abs/2506.07863)
Keywords: generative
Abstract: Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts - color shift, grid patterns, blur, corner and droplet artifacts - and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.

Title: Diffusion Counterfactual Generation with Semantic Abduction

Authors: Rajat Rasal, Avinash Kori, Fabio De Sousa Ribeiro, Tian Xia, Ben Glocker
Subjects: cs.LG, cs.AI, cs.CV, stat.ML
Abstract URL: https://arxiv.org/abs/2506.07883
Pdf URL: https://arxiv.org/pdf/2506.07883
Copy Paste: [[2506.07883]] Diffusion Counterfactual Generation with Semantic Abduction(https://arxiv.org/abs/2506.07883)
Keywords: diffusion
Abstract: Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing auto-encoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To our knowledge, this is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.

Title: EgoM2P: Egocentric Multimodal Multitask Pretraining

Authors: Gen Li, Yutong Chen, Yiqian Wu, Kaifeng Zhao, Marc Pollefeys, Siyu Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07886
Pdf URL: https://arxiv.org/pdf/2506.07886
Copy Paste: [[2506.07886]] EgoM2P: Egocentric Multimodal Multitask Pretraining(https://arxiv.org/abs/2506.07886)
Keywords: foundation model, generative
Abstract: Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction. These capabilities enable systems to better interpret the camera wearer's actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video. EgoM2P also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research. Project page: this https URL

Title: Video Unlearning via Low-Rank Refusal Vector

Authors: Simone Facchiano, Stefano Saravalle, Matteo Migliarini, Edoardo De Matteis, Alessio Sampieri, Andrea Pilzer, Emanuele Rodolà, Indro Spinelli, Luca Franco, Fabio Galasso
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07891
Pdf URL: https://arxiv.org/pdf/2506.07891
Copy Paste: [[2506.07891]] Video Unlearning via Low-Rank Refusal Vector(https://arxiv.org/abs/2506.07891)
Keywords: diffusion, generative
Abstract: Video generative models democratize the creation of visual content through intuitive instruction following, but they also inherit the biases and harmful concepts embedded within their web-scale training data. This inheritance creates a significant risk, as users can readily generate undesirable and even illegal content. This work introduces the first unlearning technique tailored explicitly for video diffusion models to address this critical issue. Our method requires 5 multi-modal prompt pairs only. Each pair contains a "safe" and an "unsafe" example that differ only by the target concept. Averaging their per-layer latent differences produces a "refusal vector", which, once subtracted from the model parameters, neutralizes the unsafe concept. We introduce a novel low-rank factorization approach on the covariance difference of embeddings that yields robust refusal vectors. This isolates the target concept while minimizing collateral unlearning of other semantics, thus preserving the visual quality of the generated video. Our method preserves the model's generation quality while operating without retraining or access to the original training data. By embedding the refusal direction directly into the model's weights, the suppression mechanism becomes inherently more robust against adversarial bypass attempts compared to surface-level input-output filters. In a thorough qualitative and quantitative evaluation, we show that we can neutralize a variety of harmful contents, including explicit nudity, graphic violence, copyrights, and trademarks. Project page: this https URL.

Title: FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling

Authors: Sifan Wang, Zehao Dou, Tong-Rui Liu, Lu Lu
Subjects: cs.LG, physics.comp-ph, stat.ML
Abstract URL: https://arxiv.org/abs/2506.07902
Pdf URL: https://arxiv.org/pdf/2506.07902
Copy Paste: [[2506.07902]] FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling(https://arxiv.org/abs/2506.07902)
Keywords: diffusion, generative
Abstract: Recent advances in generative modeling -- particularly diffusion models and flow matching -- have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce $\textbf{FunDiff}$, a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at this https URL.

Title: Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces

Authors: Kevin Rojas, Yuchen Zhu, Sichen Zhu, Felix X.-F. Ye, Molei Tao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07903
Pdf URL: https://arxiv.org/pdf/2506.07903
Copy Paste: [[2506.07903]] Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces(https://arxiv.org/abs/2506.07903)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. On the contrary, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches heavily rely on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process heavily demands the high accuracy of encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.

Title: CausalPFN: Amortized Causal Effect Estimation via In-Context Learning

Authors: Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, Rahul G. Krishnan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.07918
Pdf URL: https://arxiv.org/pdf/2506.07918
Copy Paste: [[2506.07918]] CausalPFN: Amortized Causal Effect Estimation via In-Context Learning(https://arxiv.org/abs/2506.07918)
Keywords: in-context
Abstract: Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out-of-the-box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model does not require any further training or tuning and takes a step toward automated causal inference (this https URL).

Title: A Generative Physics-Informed Reinforcement Learning-Based Approach for Construction of Representative Drive Cycle

Authors: Amirreza Yasami, Mohammadali Tofigh, Mahdi Shahbakhti, Charles Robert Koch
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2506.07929
Pdf URL: https://arxiv.org/pdf/2506.07929
Copy Paste: [[2506.07929]] A Generative Physics-Informed Reinforcement Learning-Based Approach for Construction of Representative Drive Cycle(https://arxiv.org/abs/2506.07929)
Keywords: generative
Abstract: Accurate driving cycle construction is crucial for vehicle design, fuel economy analysis, and environmental impact assessments. A generative Physics-Informed Expected SARSA-Monte Carlo (PIESMC) approach that constructs representative driving cycles by capturing transient dynamics, acceleration, deceleration, idling, and road grade transitions while ensuring model fidelity is introduced. Leveraging a physics-informed reinforcement learning framework with Monte Carlo sampling, PIESMC delivers efficient cycle construction with reduced computational cost. Experimental evaluations on two real-world datasets demonstrate that PIESMC replicates key kinematic and energy metrics, achieving up to a 57.3% reduction in cumulative kinematic fragment errors compared to the Micro-trip-based (MTB) method and a 10.5% reduction relative to the Markov-chain-based (MCB) method. Moreover, it is nearly an order of magnitude faster than conventional techniques. Analyses of vehicle-specific power distributions and wavelet-transformed frequency content further confirm its ability to reproduce experimental central tendencies and variability.

Title: Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models

Authors: Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, Zsolt Kira
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07936
Pdf URL: https://arxiv.org/pdf/2506.07936
Copy Paste: [[2506.07936]] Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models(https://arxiv.org/abs/2506.07936)
Keywords: in-context
Abstract: Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics -- such as copying or majority voting -- rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0. We conduct controlled studies varying shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.

Title: Cost-Optimal Active AI Model Evaluation

Authors: Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agarwal, Adam Fisch
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.07949
Pdf URL: https://arxiv.org/pdf/2506.07949
Copy Paste: [[2506.07949]] Cost-Optimal Active AI Model Evaluation(https://arxiv.org/abs/2506.07949)
Keywords: generative
Abstract: The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater -- such as a model-based autorater that is designed to automatically assess the quality of generated content -- with a more expensive, but also more accurate, strong rater alternative such as a human. More specifically, the goal of our approach is to produce a low variance, unbiased estimate of the mean of the target "strong" rating, subject to some total annotation budget. Building on recent work in active and prediction-powered statistical inference, we derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Using synthetic and real-world data, we empirically characterize the conditions under which these policies yield improvements over prior methods. We find that, especially in tasks where there is high variability in the difficulty of examples, our policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods.

Title: Neural Tangent Kernel Analysis to Probe Convergence in Physics-informed Neural Solvers: PIKANs vs. PINNs

Authors: Salah A. Faroughi, Farinaz Mostajeran
Subjects: cs.LG, math-ph, math.AP, math.SP
Abstract URL: https://arxiv.org/abs/2506.07958
Pdf URL: https://arxiv.org/pdf/2506.07958
Copy Paste: [[2506.07958]] Neural Tangent Kernel Analysis to Probe Convergence in Physics-informed Neural Solvers: PIKANs vs. PINNs(https://arxiv.org/abs/2506.07958)
Keywords: diffusion
Abstract: Physics-informed Kolmogorov-Arnold Networks (PIKANs), and in particular their Chebyshev-based variants (cPIKANs), have recently emerged as promising models for solving partial differential equations (PDEs). However, their training dynamics and convergence behavior remain largely unexplored both theoretically and numerically. In this work, we aim to advance the theoretical understanding of cPIKANs by analyzing them using Neural Tangent Kernel (NTK) theory. Our objective is to discern the evolution of kernel structure throughout gradient-based training and its subsequent impact on learning efficiency. We first derive the NTK of standard cKANs in a supervised setting, and then extend the analysis to the physics-informed context. We analyze the spectral properties of NTK matrices, specifically their eigenvalue distributions and spectral bias, for four representative PDEs: the steady-state Helmholtz equation, transient diffusion and Allen-Cahn equations, and forced vibrations governed by the Euler-Bernoulli beam equation. We also conduct an investigation into the impact of various optimization strategies, e.g., first-order, second-order, and hybrid approaches, on the evolution of the NTK and the resulting learning dynamics. Results indicate a tractable behavior for NTK in the context of cPIKANs, which exposes learning dynamics that standard physics-informed neural networks (PINNs) cannot capture. Spectral trends also reveal when domain decomposition improves training, directly linking kernel behavior to convergence rates under different setups. To the best of our knowledge, this is the first systematic NTK study of cPIKANs, providing theoretical insight that clarifies and predicts their empirical performance.

Title: CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray

Authors: Mingquan Lin, Gregory Holste, Song Wang, Yiliang Zhou, Yishu Wei, Imon Banerjee, Pengyi Chen, Tianjie Dai, Yuexi Du, Nicha C. Dvornek, Yuyan Ge, Zuowei Guo, Shouhei Hanaoka, Dongkyun Kim, Pablo Messina, Yang Lu, Denis Parra, Donghyun Son, Álvaro Soto, Aisha Urooj, René Vidal, Yosuke Yamagishi, Zefan Yang, Ruichi Zhang, Yang Zhou, Leo Anthony Celi, Ronald M. Summers, Zhiyong Lu, Hao Chen, Adam Flanders, George Shih, Zhangyang Wang, Yifan Peng
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07984
Pdf URL: https://arxiv.org/pdf/2506.07984
Copy Paste: [[2506.07984]] CXR-LT 2024: A MICCAI challenge on long-tailed, multi-label, and zero-shot disease classification from chest X-ray(https://arxiv.org/abs/2506.07984)
Keywords: generative
Abstract: The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXR). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, the CXR-LT 2024 expands the dataset to 377,110 chest X-rays (CXRs) and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set, (ii) long-tailed classification on a manually annotated "gold standard" subset, and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.

Title: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers

Authors: Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.07986
Pdf URL: https://arxiv.org/pdf/2506.07986
Copy Paste: [[2506.07986]] Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers(https://arxiv.org/abs/2506.07986)
Keywords: diffusion
Abstract: Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose \textbf{Temperature-Adjusted Cross-modal Attention (TACA)}, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at \href{this https URL}

Title: Generative Modeling of Weights: Generalization or Memorization?

Authors: Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.07998
Pdf URL: https://arxiv.org/pdf/2506.07998
Copy Paste: [[2506.07998]] Generative Modeling of Weights: Generalization or Memorization?(https://arxiv.org/abs/2506.07998)
Keywords: diffusion, generative
Abstract: Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Surprisingly, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. We further show that this memorization cannot be effectively mitigated by modifying modeling factors commonly associated with memorization in image diffusion models, or applying data augmentations. Our findings provide a realistic assessment of what types of data current generative models can model, and highlight the need for more careful evaluation of generative models in new domains. Our code is available at this https URL.

Title: MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation

Authors: Junhao Chen, Yulia Tsvetkov, Xiaochuang Han
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.07999
Pdf URL: https://arxiv.org/pdf/2506.07999
Copy Paste: [[2506.07999]] MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation(https://arxiv.org/abs/2506.07999)
Keywords: diffusion, generative
Abstract: Recent progress in multimodal generation has increasingly combined autoregressive (AR) and diffusion-based approaches, leveraging their complementary strengths: AR models capture long-range dependencies and produce fluent, context-aware outputs, while diffusion models operate in continuous latent spaces to refine high-fidelity visual details. However, existing hybrids often lack systematic guidance on how and why to allocate model capacity between these paradigms. In this work, we introduce MADFormer, a Mixed Autoregressive and Diffusion Transformer that serves as a testbed for analyzing AR-diffusion trade-offs. MADFormer partitions image generation into spatial blocks, using AR layers for one-pass global conditioning across blocks and diffusion layers for iterative local refinement within each block. Through controlled experiments on FFHQ-1024 and ImageNet, we identify two key insights: (1) block-wise partitioning significantly improves performance on high-resolution images, and (2) vertically mixing AR and diffusion layers yields better quality-efficiency balances--improving FID by up to 75% under constrained inference compute. Our findings offer practical design principles for future hybrid generative models.

Title: Dynamic View Synthesis as an Inverse Problem

Authors: Hidir Yesiltepe, Pinar Yanardag
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.08004
Pdf URL: https://arxiv.org/pdf/2506.08004
Copy Paste: [[2506.08004]] Dynamic View Synthesis as an Inverse Problem(https://arxiv.org/abs/2506.08004)
Keywords: diffusion
Abstract: In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and the DDIM inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.

Title: Dreamland: Controllable World Creation with Simulator and Generative Models

Authors: Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, Bolei Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.08006
Pdf URL: https://arxiv.org/pdf/2506.08006
Copy Paste: [[2506.08006]] Dreamland: Controllable World Creation with Simulator and Generative Models(https://arxiv.org/abs/2506.08006)
Keywords: generative
Abstract: Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, hindering their use in editing scenes and training embodied AI agents. We propose Dreamland, a hybrid world generation framework combining the granular control of a physics-based simulator and the photorealistic content output of large-scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation to bridge the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models. We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality, 17.9% stronger controllability, and has great potential to enhance embodied agent training. Code and data will be made available.

Title: Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Authors: Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08009
Pdf URL: https://arxiv.org/pdf/2506.08009
Copy Paste: [[2506.08009]] Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion(https://arxiv.org/abs/2506.08009)
Keywords: diffusion
Abstract: We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: this http URL

Title: StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets

Authors: Anh-Quan Cao, Ivan Lopes, Raoul de Charette
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.08013
Pdf URL: https://arxiv.org/pdf/2506.08013
Copy Paste: [[2506.08013]] StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets(https://arxiv.org/abs/2506.08013)
Keywords: diffusion
Abstract: Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression. Adapting a denoising framework with task encoding, per-task conditioning and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.