2025-06-24

Title: Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation

Authors: Dip Roy
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17237
Pdf URL: https://arxiv.org/pdf/2506.17237
Copy Paste: [[2506.17237]] Mechanistic Interpretability of Diffusion Models: Circuit-Level Analysis and Causal Validation(https://arxiv.org/abs/2506.17237)
Keywords: diffusion, generative
Abstract: We present a quantitative circuit-level analysis of diffusion models, establishing computational pathways and mechanistic principles underlying image generation processes. Through systematic intervention experiments across 2,000 synthetic and 2,000 CelebA facial images, we discover fundamental algorithmic differences in how diffusion architectures process synthetic versus naturalistic data distributions. Our investigation reveals that real-world face processing requires circuits with measurably higher computational complexity (complexity ratio = 1.084 plus/minus 0.008, p < 0.001), exhibiting distinct attention specialization patterns with entropy divergence ranging from 0.015 to 0.166 across denoising timesteps. We identify eight functionally distinct attention mechanisms showing specialized computational roles: edge detection (entropy = 3.18 plus/minus 0.12), texture analysis (entropy = 4.16 plus/minus 0.08), and semantic understanding (entropy = 2.67 plus/minus 0.15). Intervention analysis demonstrates critical computational bottlenecks where targeted ablations produce 25.6% to 128.3% performance degradation, providing causal evidence for identified circuit functions. These findings establish quantitative foundations for algorithmic understanding and control of generative model behavior through mechanistic intervention strategies.

Title: Recursive Learning-Based Virtual Buffering for Analytical Global Placement

Authors: Andrew B. Kahng, Yiting Liu, Zhiang Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17247
Pdf URL: https://arxiv.org/pdf/2506.17247
Copy Paste: [[2506.17247]] Recursive Learning-Based Virtual Buffering for Analytical Global Placement(https://arxiv.org/abs/2506.17247)
Keywords: generative
Abstract: Due to the skewed scaling of interconnect versus cell delay in modern technology nodes, placement with buffer porosity (i.e., cell density) awareness is essential for timing closure in physical synthesis flows. However, existing approaches face two key challenges: (i) traditional van Ginneken-Lillis-style buffering approaches are computationally expensive during global placement; and (ii) machine learning-based approaches, such as BufFormer, lack a thorough consideration of Electrical Rule Check (ERC) violations and fail to "close the loop" back into the physical design flow. In this work, we propose MLBuf-RePlAce, the first open-source learning-driven virtual buffering-aware analytical global placement framework, built on top of the OpenROAD infrastructure. MLBuf-RePlAce adopts an efficient recursive learning-based generative buffering approach to predict buffer types and locations, addressing ERC violations during global placement. We compare MLBuf-RePlAce against the default virtual buffering-based timing-driven global placer in OpenROAD, using open-source testcases from the TILOS MacroPlacement and OpenROAD-flow-scripts repositories. Without degradation of post-route power, MLBuf-RePlAce achieves (maximum, average) improvements of (56%, 31%) in total negative slack (TNS) within the open-source OpenROAD flow. When evaluated by completion in a commercial flow, MLBuf-RePlAce achieves (maximum, average) improvements of (53%, 28%) in TNS with an average of 0.2% improvement in post-route power.

Title: Securing Generative AI Agentic Workflows: Risks, Mitigation, and a Proposed Firewall Architecture

Authors: Sunil Kumar Jang Bahadur, Gopala Dhar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.17266
Pdf URL: https://arxiv.org/pdf/2506.17266
Copy Paste: [[2506.17266]] Securing Generative AI Agentic Workflows: Risks, Mitigation, and a Proposed Firewall Architecture(https://arxiv.org/abs/2506.17266)
Keywords: generative
Abstract: Generative Artificial Intelligence (GenAI) presents significant advancements but also introduces novel security challenges, particularly within agentic workflows where AI agents operate autonomously. These risks escalate in multi-agent systems due to increased interaction complexity. This paper outlines critical security vulnerabilities inherent in GenAI agentic workflows, including data privacy breaches, model manipulation, and issues related to agent autonomy and system integration. It discusses key mitigation strategies such as data encryption, access control, prompt engineering, model monitoring, agent sandboxing, and security audits. Furthermore, it details a proposed "GenAI Security Firewall" architecture designed to provide comprehensive, adaptable, and efficient protection for these systems by integrating various security services and leveraging GenAI itself for enhanced defense. Addressing these security concerns is paramount for the responsible and safe deployment of this transformative technology.

Title: Mercury: Ultra-Fast Language Models Based on Diffusion

Authors: Inception Labs, Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum, Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, Aditya Grover, Volodymyr Kuleshov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.17298
Pdf URL: https://arxiv.org/pdf/2506.17298
Copy Paste: [[2506.17298]] Mercury: Ultra-Fast Language Models Based on Diffusion(https://arxiv.org/abs/2506.17298)
Keywords: diffusion
Abstract: We present Mercury, a new generation of commercial-scale large language models (LLMs) based on diffusion. These models are parameterized via the Transformer architecture and trained to predict multiple tokens in parallel. In this report, we detail Mercury Coder, our first set of diffusion LLMs designed for coding applications. Currently, Mercury Coder comes in two sizes: Mini and Small. These models set a new state-of-the-art on the speed-quality frontier. Based on independent evaluations conducted by Artificial Analysis, Mercury Coder Mini and Mercury Coder Small achieve state-of-the-art throughputs of 1109 tokens/sec and 737 tokens/sec, respectively, on NVIDIA H100 GPUs and outperform speed-optimized frontier models by up to 10x on average while maintaining comparable quality. We discuss additional results on a variety of code benchmarks spanning multiple languages and use-cases as well as real-world validation by developers on Copilot Arena, where the model currently ranks second on quality and is the fastest model overall. We also release a public API at this https URL and free playground at this https URL

Title: Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning

Authors: Yijun Lin, Theresa Chen, Colby Brungard, Grunwald Sabine, Sue Ives, Matt Macander, Timm Nawrocki, Yao-Yi Chiang, Nic Jelinski
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.17302
Pdf URL: https://arxiv.org/pdf/2506.17302
Copy Paste: [[2506.17302]] Fine-Scale Soil Mapping in Alaska with Multimodal Machine Learning(https://arxiv.org/abs/2506.17302)
Keywords: foundation model
Abstract: Fine-scale soil mapping in Alaska, traditionally relying on fieldwork and localized simulations, remains a critical yet underdeveloped task, despite the region's ecological importance and extensive permafrost coverage. As permafrost thaw accelerates due to climate change, it threatens infrastructure stability and key ecosystem services, such as soil carbon storage. High-resolution soil maps are essential for characterizing permafrost distribution, identifying vulnerable areas, and informing adaptation strategies. We present MISO, a vision-based machine learning (ML) model to produce statewide fine-scale soil maps for near-surface permafrost and soil taxonomy. The model integrates a geospatial foundation model for visual feature extraction, implicit neural representations for continuous spatial prediction, and contrastive learning for multimodal alignment and geo-location awareness. We compare MISO with Random Forest (RF), a traditional ML model that has been widely used in soil mapping applications. Spatial cross-validation and regional analysis across Permafrost Zones and Major Land Resource Areas (MLRAs) show that MISO generalizes better to remote, unseen locations and achieves higher recall than RF, which is critical for monitoring permafrost thaw and related environmental processes. These findings demonstrate the potential of advanced ML approaches for fine-scale soil mapping and provide practical guidance for future soil sampling and infrastructure planning in permafrost-affected landscapes. The project will be released at this https URL.

Title: Origins of Creativity in Attention-Based Diffusion Models

Authors: Emma Finn, T. Anderson Keller, Manos Theodosis, Demba E. Ba
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.17324
Pdf URL: https://arxiv.org/pdf/2506.17324
Copy Paste: [[2506.17324]] Origins of Creativity in Attention-Based Diffusion Models(https://arxiv.org/abs/2506.17324)
Keywords: diffusion
Abstract: As diffusion models have become the tool of choice for image generation and as the quality of the images continues to improve, the question of how `creativity' originates in diffusion has become increasingly important. The score matching perspective on diffusion has proven particularly fruitful for understanding how and why diffusion models generate images that remain plausible while differing significantly from their training images. In particular, as explained in (Kamb \& Ganguli, 2024) and others, e.g., (Ambrogioni, 2023), theory suggests that if our score matching were optimal, we would only be able to recover training samples through our diffusion process. However, as shown by Kamb \& Ganguli, (2024), in diffusion models where the score is parametrized by a simple CNN, the inductive biases of the CNN itself (translation equivariance and locality) allow the model to generate samples that globally do not match any training samples, but are rather patch-wise `mosaics'. Notably, however, this theory does not extend to describe the role of self-attention in this process. In this work, we take a preliminary step in this direction to extend this theory to the case of diffusion models whose score is parametrized by a CNN with a final self-attention layer. We show that our theory suggests that self-attention will induce a globally image-consistent arrangement of local features beyond the patch-level in generated samples, and we verify this behavior empirically on a carefully crafted dataset.

Title: AndroIDS : Android-based Intrusion Detection System using Federated Learning

Authors: Akarsh K Nair, Shanik Hubert Satheesh Kumar., Deepti Gupta
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.17349
Pdf URL: https://arxiv.org/pdf/2506.17349
Copy Paste: [[2506.17349]] AndroIDS : Android-based Intrusion Detection System using Federated Learning(https://arxiv.org/abs/2506.17349)
Keywords: anomaly
Abstract: The exponential growth of android-based mobile IoT systems has significantly increased the susceptibility of devices to cyberattacks, particularly in smart homes, UAVs, and other connected mobile environments. This article presents a federated learning-based intrusion detection framework called AndroIDS that leverages system call traces as a personalized and privacy-preserving data source. Unlike conventional centralized approaches, the proposed method enables collaborative anomaly detection without sharing raw data, thus preserving user privacy across distributed nodes. A generalized system call dataset was generated to reflect realistic android system behavior and serves as the foundation for experimentation. Extensive evaluation demonstrates the effectiveness of the FL model under both IID and non-IID conditions, achieving an accuracy of 96.46 % and 92.87 %, and F1-scores of 89 % and 86 %, respectively. These results highlight the models robustness to data heterogeneity, with only a minor performance drop in the non-IID case. Further, a detailed comparison with centralized deep learning further illustrates trade-offs in detection performance and deployment feasibility. Overall, the results validate the practical applicability of the proposed approach for secure and scalable intrusion detection in real-world mobile IoT scenarios.

Title: Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos

Authors: Zhiyi Shi, Junsik Kim, Helen Y. Yang, Yonghyun Song, Hyun-Jic Oh, Dalit Ben-Yosef, Daniel Needleman, Hanspeter Pfister
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17403
Pdf URL: https://arxiv.org/pdf/2506.17403
Copy Paste: [[2506.17403]] Spatial-Temporal Pre-Training for Embryo Viability Prediction Using Time-Lapse Videos(https://arxiv.org/abs/2506.17403)
Keywords: self-supervised
Abstract: Automating embryo viability prediction for in vitro fertilization (IVF) is important but challenging due to the limited availability of labeled pregnancy outcome data, as only a small fraction of embryos are labeled after transfer. Self-supervised learning (SSL) can leverage both labeled and unlabeled data to improve prediction. However, existing SSL methods for videos are not directly applicable to embryo development videos due to two challenges: (1) embryo time-lapse videos contain hundreds of frames, requiring significant GPU memory for conventional SSL; (2) the dataset contains videos with varying lengths and many outlier frames, causing traditional video alignment methods to struggle with semantic misalignment. We propose Spatial-Temporal Pre-Training (STPT) to address these challenges. STPT includes two stages: spatial and temporal. In each stage, only one encoder is trained while the other is frozen, reducing memory demands. To handle temporal misalignment, STPT avoids frame-by-frame alignment across videos. The spatial stage learns from alignments within each video and its temporally consistent augmentations. The temporal stage then models relationships between video embeddings. Our method efficiently handles long videos and temporal variability. On 23,027 time-lapse videos (3,286 labeled), STPT achieves the highest AUC of 0.635 (95% CI: 0.632-0.638) compared to baselines, with limited computational resources.

Title: Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study

Authors: Danielle R. Thomas, Conrad Borchers, Jionghao Lin, Sanjit Kakarla, Shambhavi Bhushan, Erin Gatz, Shivang Gupta, Ralph Abboud, Kenneth R. Koedinger
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2506.17410
Pdf URL: https://arxiv.org/pdf/2506.17410
Copy Paste: [[2506.17410]] Leveraging LLMs to Assess Tutor Moves in Real-Life Dialogues: A Feasibility Study(https://arxiv.org/abs/2506.17410)
Keywords: generative
Abstract: Tutoring improves student achievement, but identifying and studying what tutoring actions are most associated with student learning at scale based on audio transcriptions is an open research problem. This present study investigates the feasibility and scalability of using generative AI to identify and evaluate specific tutor moves in real-life math tutoring. We analyze 50 randomly selected transcripts of college-student remote tutors assisting middle school students in mathematics. Using GPT-4, GPT-4o, GPT-4-turbo, Gemini-1.5-pro, and LearnLM, we assess tutors' application of two tutor skills: delivering effective praise and responding to student math errors. All models reliably detected relevant situations, for example, tutors providing praise to students (94-98% accuracy) and a student making a math error (82-88% accuracy) and effectively evaluated the tutors' adherence to tutoring best practices, aligning closely with human judgments (83-89% and 73-77%, respectively). We propose a cost-effective prompting strategy and discuss practical implications for using large language models to support scalable assessment in authentic settings. This work further contributes LLM prompts to support reproducibility and research in AI-supported learning.

Title: When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network

Authors: Dong Xiao, Guangyao Chen, Peixi Peng, Yangru Huang, Yifan Zhao, Yongxing Dai, Yonghong Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17457
Pdf URL: https://arxiv.org/pdf/2506.17457
Copy Paste: [[2506.17457]] When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network(https://arxiv.org/abs/2506.17457)
Keywords: anomaly
Abstract: Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.

Title: Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception

Authors: Nitin Venkateswaran, Kevin Tang, Ratree Wayland
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.17542
Pdf URL: https://arxiv.org/pdf/2506.17542
Copy Paste: [[2506.17542]] Probing for Phonology in Self-Supervised Speech Representations: A Case Study on Accent Perception(https://arxiv.org/abs/2506.17542)
Keywords: self-supervised
Abstract: Traditional models of accent perception underestimate the role of gradient variations in phonological features which listeners rely upon for their accent judgments. We investigate how pretrained representations from current self-supervised learning (SSL) models of speech encode phonological feature-level variations that influence the perception of segmental accent. We focus on three segments: the labiodental approximant, the rhotic tap, and the retroflex stop, which are uniformly produced in the English of native speakers of Hindi as well as other languages in the Indian sub-continent. We use the CSLU Foreign Accented English corpus (Lander, 2007) to extract, for these segments, phonological feature probabilities using Phonet (Vásquez-Correa et al., 2019) and pretrained representations from Wav2Vec2-BERT (Barrault et al., 2023) and WavLM (Chen et al., 2022) along with accent judgements by native speakers of American English. Probing analyses show that accent strength is best predicted by a subset of the segment's pretrained representation features, in which perceptually salient phonological features that contrast the expected American English and realized non-native English segments are given prominent weighting. A multinomial logistic regression of pretrained representation-based segment distances from American and Indian English baselines on accent ratings reveals strong associations between the odds of accent strength and distances from the baselines, in the expected directions. These results highlight the value of self-supervised speech representations for modeling accent perception using interpretable phonological features.

Title: Accelerating Residual Reinforcement Learning with Uncertainty Estimation

Authors: Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia, George Konidaris, Stefanie Tellex
Subjects: cs.LG, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.17564
Pdf URL: https://arxiv.org/pdf/2506.17564
Copy Paste: [[2506.17564]] Accelerating Residual Reinforcement Learning with Uncertainty Estimation(https://arxiv.org/abs/2506.17564)
Keywords: diffusion
Abstract: Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned polices in the real world to demonstrate their robustness with zero-shot sim-to-real transfer.

Title: OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor

Authors: Pengyu Kan, Craig Jones, Kenichi Oishi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17597
Pdf URL: https://arxiv.org/pdf/2506.17597
Copy Paste: [[2506.17597]] OpenMAP-BrainAge: Generalizable and Interpretable Brain Age Predictor(https://arxiv.org/abs/2506.17597)
Keywords: self-supervised, generative
Abstract: Purpose: To develop an age prediction model which is interpretable and robust to demographic and technological variances in brain MRI scans. Materials and Methods: We propose a transformer-based architecture that leverages self-supervised pre-training on large-scale datasets. Our model processes pseudo-3D T1-weighted MRI scans from three anatomical views and incorporates brain volumetric information. By introducing a stem architecture, we reduce the conventional quadratic complexity of transformer models to linear complexity, enabling scalability for high-dimensional MRI data. We trained our model on ADNI2 $\&$ 3 (N=1348) and OASIS3 (N=716) datasets (age range: 42 - 95) from the North America, with an 8:1:1 split for train, validation and test. Then, we validated it on the AIBL dataset (N=768, age range: 60 - 92) from Australia. Results: We achieved an MAE of 3.65 years on ADNI2 $\&$ 3 and OASIS3 test set and a high generalizability of MAE of 3.54 years on AIBL. There was a notable increase in brain age gap (BAG) across cognitive groups, with mean of 0.15 years (95% CI: [-0.22, 0.51]) in CN, 2.55 years ([2.40, 2.70]) in MCI, 6.12 years ([5.82, 6.43]) in AD. Additionally, significant negative correlation between BAG and cognitive scores was observed, with correlation coefficient of -0.185 (p < 0.001) for MoCA and -0.231 (p < 0.001) for MMSE. Gradient-based feature attribution highlighted ventricles and white matter structures as key regions influenced by brain aging. Conclusion: Our model effectively fused information from different views and volumetric information to achieve state-of-the-art brain age prediction accuracy, improved generalizability and interpretability with association to neurodegenerative disorders.

Title: Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning

Authors: Shih-Wen Liu, Hsuan-Yu Fan, Wei-Ta Chu, Fu-En Yang, Yu-Chiang Frank Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17645
Pdf URL: https://arxiv.org/pdf/2506.17645
Copy Paste: [[2506.17645]] Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning(https://arxiv.org/abs/2506.17645)
Keywords: in-context
Abstract: Automating medical report generation from histopathology images is a critical challenge requiring effective visual representations and domain-specific knowledge. Inspired by the common practices of human experts, we propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning (ICL) mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality. Evaluated on the HistGen benchmark, the framework achieves state-of-the-art results, with significant improvements across BLEU, METEOR, and ROUGE-L metrics, and demonstrates robustness across diverse report lengths and disease categories. By maximizing training data utility and bridging vision and language with ICL, our work offers a solution for AI-driven histopathology reporting, setting a strong foundation for future advancements in multimodal clinical applications.

Title: SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification

Authors: Gnana Praveen Rajasekhar, Jahangir Alam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17694
Pdf URL: https://arxiv.org/pdf/2506.17694
Copy Paste: [[2506.17694]] SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification(https://arxiv.org/abs/2506.17694)
Keywords: self-supervised
Abstract: Conventional audio-visual methods for speaker verification rely on large amounts of labeled data and separate modality-specific architectures, which is computationally expensive, limiting their scalability. To address these problems, we propose a self-supervised learning framework based on contrastive learning with asymmetric masking and masked data modeling to obtain robust audiovisual feature representations. In particular, we employ a unified framework for self-supervised audiovisual speaker verification using a single shared backbone for audio and visual inputs, leveraging the versatility of vision transformers. The proposed unified framework can handle audio, visual, or audiovisual inputs using a single shared vision transformer backbone during training and testing while being computationally efficient and robust to missing modalities. Extensive experiments demonstrate that our method achieves competitive performance without labeled data while reducing computational costs compared to traditional approaches.

Title: DreamJourney: Perpetual View Generation with Video Diffusion Models

Authors: Bo Pan, Yang Chen, Yingwei Pan, Ting Yao, Wei Chen, Tao Mei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17705
Pdf URL: https://arxiv.org/pdf/2506.17705
Copy Paste: [[2506.17705]] DreamJourney: Perpetual View Generation with Video Diffusion Models(https://arxiv.org/abs/2506.17705)
Keywords: diffusion, generative
Abstract: Perpetual view generation aims to synthesize a long-term video corresponding to an arbitrary camera trajectory solely from a single input image. Recent methods commonly utilize a pre-trained text-to-image diffusion model to synthesize new content of previously unseen regions along camera movement. However, the underlying 2D diffusion model lacks 3D awareness and results in distorted artifacts. Moreover, they are limited to generating views of static 3D scenes, neglecting to capture object movements within the dynamic 4D world. To alleviate these issues, we present DreamJourney, a two-stage framework that leverages the world simulation capacity of video diffusion models to trigger a new perpetual scene view generation task with both camera movements and object dynamics. Specifically, in stage I, DreamJourney first lifts the input image to 3D point cloud and renders a sequence of partial images from a specific camera trajectory. A video diffusion model is then utilized as generative prior to complete the missing regions and enhance visual coherence across the sequence, producing a cross-view consistent video adheres to the 3D scene and camera trajectory. Meanwhile, we introduce two simple yet effective strategies (early stopping and view padding) to further stabilize the generation process and improve visual quality. Next, in stage II, DreamJourney leverages a multimodal large language model to produce a text prompt describing object movements in current view, and uses video diffusion model to animate current view with object movements. Stage I and II are repeated recurrently, enabling perpetual dynamic scene view generation. Extensive experiments demonstrate the superiority of our DreamJourney over state-of-the-art methods both quantitatively and qualitatively. Our project page: this https URL.

Title: Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models

Authors: Jihyun Kim, Junho Park, Kyeongbo Kong, Suk-Ju Kang
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.17707
Pdf URL: https://arxiv.org/pdf/2506.17707
Copy Paste: [[2506.17707]] Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models(https://arxiv.org/abs/2506.17707)
Keywords: diffusion
Abstract: We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of a room's each attribute, we decompose the challenging task into simpler steps such as creating plausible 3D coordinates for room meshes, generating panorama images for the texture, constructing 3D meshes by integrating the coordinates and panorama texture images, and arranging furniture. To support the various decomposed tasks with a unified framework, we incorporate visual programming (VP). VP is a method that utilizes a large language model (LLM) to write a Python-like program which is an ordered list of necessary modules for the various tasks given in natural language. We develop most of the modules. Especially, for the texture generating module, we utilize a pretrained large-scale diffusion model to generate panorama images conditioned on text and visual prompts (i.e., layout, depth, and semantic map) simultaneously. Specifically, we enhance the panorama image generation quality by optimizing the training objective with a 1D representation of a panorama scene obtained from bidirectional LSTM. We demonstrate Programmable-Room's flexibility in generating and editing 3D room meshes, and prove our framework's superiority to an existing model quantitatively and qualitatively. Project page is available in this https URL.

Title: PhysID: Physics-based Interactive Dynamics from a Single-view Image

Authors: Sourabh Vasant Gothe, Ayon Chattopadhyay, Gunturi Venkata Sai Phani Kiran, Pratik, Vibhav Agarwal, Jayesh Rajkumar Vachhani, Sourav Ghosh, Parameswaranath VM, Barath Raj KR
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17746
Pdf URL: https://arxiv.org/pdf/2506.17746
Copy Paste: [[2506.17746]] PhysID: Physics-based Interactive Dynamics from a Single-view Image(https://arxiv.org/abs/2506.17746)
Keywords: generative
Abstract: Transforming static images into interactive experiences remains a challenging task in computer vision. Tackling this challenge holds the potential to elevate mobile user experiences, notably through interactive and AR/VR applications. Current approaches aim to achieve this either using pre-recorded video responses or requiring multi-view images as input. In this paper, we present PhysID, that streamlines the creation of physics-based interactive dynamics from a single-view image by leveraging large generative models for 3D mesh generation and physical property prediction. This significantly reduces the expertise required for engineering-intensive tasks like 3D modeling and intrinsic property calibration, enabling the process to be scaled with minimal manual intervention. We integrate an on-device physics-based engine for physically plausible real-time rendering with user interactions. PhysID represents a leap forward in mobile-based interactive dynamics, offering real-time, non-deterministic interactions and user-personalization with efficient on-device memory consumption. Experiments evaluate the zero-shot capabilities of various Multimodal Large Language Models (MLLMs) on diverse tasks and the performance of 3D reconstruction models. These results demonstrate the cohesive functioning of all modules within the end-to-end framework, contributing to its effectiveness.

Title: Towards a Unified Textual Graph Framework for Spectral Reasoning via Physical and Chemical Information Fusion

Authors: Jiheng Liang, Ziru Yu, Zujie Xie, Yuchen Guo, Yulan Guo, Xiangyang Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.17761
Pdf URL: https://arxiv.org/pdf/2506.17761
Copy Paste: [[2506.17761]] Towards a Unified Textual Graph Framework for Spectral Reasoning via Physical and Chemical Information Fusion(https://arxiv.org/abs/2506.17761)
Keywords: in-context
Abstract: Motivated by the limitations of current spectral analysis methods-such as reliance on single-modality data, limited generalizability, and poor interpretability-we propose a novel multi-modal spectral analysis framework that integrates prior knowledge graphs with Large Language Models. Our method explicitly bridges physical spectral measurements and chemical structural semantics by representing them in a unified Textual Graph format, enabling flexible, interpretable, and generalizable spectral understanding. Raw spectra are first transformed into TAGs, where nodes and edges are enriched with textual attributes describing both spectral properties and chemical context. These are then merged with relevant prior knowledge-including functional groups and molecular graphs-to form a Task Graph that incorporates "Prompt Nodes" supporting LLM-based contextual reasoning. A Graph Neural Network further processes this structure to complete downstream tasks. This unified design enables seamless multi-modal integration and automated feature decoding with minimal manual annotation. Our framework achieves consistently high performance across multiple spectral analysis tasks, including node-level, edge-level, and graph-level classification. It demonstrates robust generalization in both zero-shot and few-shot settings, highlighting its effectiveness in learning from limited data and supporting in-context reasoning. This work establishes a scalable and interpretable foundation for LLM-driven spectral analysis, unifying physical and chemical modalities for scientific applications.

Title: PhysiX: A Foundation Model for Physics Simulations

Authors: Tung Nguyen, Arsh Koneru, Shufan Li, Aditya grover
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.17774
Pdf URL: https://arxiv.org/pdf/2506.17774
Copy Paste: [[2506.17774]] PhysiX: A Foundation Model for Physics Simulations(https://arxiv.org/abs/2506.17774)
Keywords: foundation model, generative
Abstract: Foundation models have achieved remarkable success across video, image, and language domains. By scaling up the number of parameters and training datasets, these models acquire generalizable world knowledge and often surpass task-specific approaches. However, such progress has yet to extend to the domain of physics simulation. A primary bottleneck is data scarcity: while millions of images, videos, and textual resources are readily available on the internet, the largest physics simulation datasets contain only tens of thousands of samples. This data limitation hinders the use of large models, as overfitting becomes a major concern. As a result, physics applications typically rely on small models, which struggle with long-range prediction due to limited context understanding. Additionally, unlike images, videos, or text-which typically exhibit fixed granularity-physics datasets often vary drastically in scale, amplifying the challenges of scaling up multitask training. We introduce PhysiX, the first large-scale foundation model for physics simulation. PhysiX is a 4.5B parameter autoregressive generative model. It uses a discrete tokenizer to encode physical processes at different scales into a sequence of discrete tokens, and employs an autoregressive next-token prediction objective to model such processes in the token space. To mitigate the rounding error in the discretization process, PhysiX incorporates a specialized refinement module. Through extensive experiments, we show that PhysiX effectively addresses the data bottleneck, outperforming task-specific baselines under comparable settings as well as the previous absolute state-of-the-art approaches on The Well benchmark. Our results indicate that knowledge learned from natural videos can be successfully transferred to physics simulation, and that joint training across diverse simulation tasks enables synergistic learning.

Title: Reimagining Parameter Space Exploration with Diffusion Models

Authors: Lijun Zhang, Xiao Liu, Hui Guan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17807
Pdf URL: https://arxiv.org/pdf/2506.17807
Copy Paste: [[2506.17807]] Reimagining Parameter Space Exploration with Diffusion Models(https://arxiv.org/abs/2506.17807)
Keywords: diffusion, generative
Abstract: Adapting neural networks to new tasks typically requires task-specific fine-tuning, which is time-consuming and reliant on labeled data. We explore a generative alternative that produces task-specific parameters directly from task identity, eliminating the need for task-specific training. To this end, we propose using diffusion models to learn the underlying structure of effective task-specific parameter space and synthesize parameters on demand. Once trained, the task-conditioned diffusion model can generate specialized weights directly from task identifiers. We evaluate this approach across three scenarios: generating parameters for a single seen task, for multiple seen tasks, and for entirely unseen tasks. Experiments show that diffusion models can generate accurate task-specific parameters and support multi-task interpolation when parameter subspaces are well-structured, but fail to generalize to unseen tasks, highlighting both the potential and limitations of this generative solution.

Title: Time-Contrastive Pretraining for In-Context Image and Video Segmentation

Authors: Assefa Wahd, Jacob Jaremko, Abhilash Hareendranathan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17837
Pdf URL: https://arxiv.org/pdf/2506.17837
Copy Paste: [[2506.17837]] Time-Contrastive Pretraining for In-Context Image and Video Segmentation(https://arxiv.org/abs/2506.17837)
Keywords: self-supervised, in-context
Abstract: In-context learning (ICL) enables generalization to new tasks with minimal labeled data. However, mainstream ICL approaches rely on a gridding strategy, which lacks the flexibility required for vision applications. We introduce Temporal, a time-contrastive self-supervised objective that pretrains a prompt retriever for visual ICL, and formulate ICL as a video object segmentation (VOS) task. Temporal addresses key limitations of grid-based methods that restrict the number and resolution of context images. By reframing ICL as a VOS problem, our approach supports a variable number of context images while preserving their full resolution. To address the challenge of selecting optimal context sets for queries, we pretrain a prompt retriever on videos via self-supervised learning, where adjacent frames serve as positives and distant frames as negatives. For image segmentation, the prompt retriever selects relevant sequences that, when combined with the query, form coherent videos for VOS processing. For video segmentation, it identifies keyframes, predicts their masks using our ICL pipeline, and propagates them throughout the sequence. When evaluated on MICCAI FLARE 2022, our method achieves substantial improvements over baselines: 90.95% Dice score for image segmentation (10.64% improvement) and 92.45% Dice for video segmentation (14.88% improvement).

Title: In-Context Learning Strategies Emerge Rationally

Authors: Daniel Wurgaft, Ekdeep Singh Lubana, Core Francisco Park, Hidenori Tanaka, Gautam Reddy, Noah D. Goodman
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17859
Pdf URL: https://arxiv.org/pdf/2506.17859
Copy Paste: [[2506.17859]] In-Context Learning Strategies Emerge Rationally(https://arxiv.org/abs/2506.17859)
Keywords: in-context
Abstract: Recent work analyzing in-context learning (ICL) has identified a broad set of strategies that describe model behavior in different experimental conditions. We aim to unify these findings by asking why a model learns these disparate strategies in the first place. Specifically, we start with the observation that when trained to learn a mixture of tasks, as is popular in the literature, the strategies learned by a model for performing ICL can be captured by a family of Bayesian predictors: a memorizing predictor, which assumes a discrete prior on the set of seen tasks, and a generalizing predictor, wherein the prior matches the underlying task distribution. Adopting the lens of rational analysis from cognitive science, where a learner's behavior is explained as an optimal adaptation to data given computational constraints, we develop a hierarchical Bayesian framework that almost perfectly predicts Transformer next token predictions throughout training without assuming access to its weights. Under this framework, pretraining is viewed as a process of updating the posterior probability of different strategies, and its inference-time behavior as a posterior-weighted average over these strategies' predictions. Our framework draws on common assumptions about neural network learning dynamics, which make explicit a tradeoff between loss and complexity among candidate strategies: beyond how well it explains the data, a model's preference towards implementing a strategy is dictated by its complexity. This helps explain well-known ICL phenomena, while offering novel predictions: e.g., we show a superlinear trend in the timescale for transition to memorization as task diversity is increased. Overall, our work advances an explanatory and predictive account of ICL grounded in tradeoffs between strategy loss and complexity.

Title: How Alignment Shrinks the Generative Horizon

Authors: Chenghao Yang, Ari Holtzman
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.17871
Pdf URL: https://arxiv.org/pdf/2506.17871
Copy Paste: [[2506.17871]] How Alignment Shrinks the Generative Horizon(https://arxiv.org/abs/2506.17871)
Keywords: generative
Abstract: Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the Branching Factor (BF) -- a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.

Title: EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations

Authors: Junho Park, Andrew Sangwoo Ye, Taein Kwon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.17896
Pdf URL: https://arxiv.org/pdf/2506.17896
Copy Paste: [[2506.17896]] EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations(https://arxiv.org/abs/2506.17896)
Keywords: diffusion
Abstract: Egocentric vision is essential for both human and machine visual understanding, particularly in capturing the detailed hand-object interactions needed for manipulation tasks. Translating third-person views into first-person views significantly benefits augmented reality (AR), virtual reality (VR) and robotics applications. However, current exocentric-to-egocentric translation methods are limited by their dependence on 2D cues, synchronized multi-view settings, and unrealistic assumptions such as necessity of initial egocentric frame and relative camera poses during inference. To overcome these challenges, we introduce EgoWorld, a novel two-stage framework that reconstructs an egocentric view from rich exocentric observations, including projected point clouds, 3D hand poses, and textual descriptions. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. Evaluated on the H2O and TACO datasets, EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects. Moreover, EgoWorld shows promising results even on unlabeled real-world examples.

Title: Adapting Vision-Language Models for Evaluating World Models

Authors: Mariya Hendriksen, Tabish Rashid, David Bignell, Raluca Georgescu, Abdelhak Lemkhenter, Katja Hofmann, Sam Devlin, Sarah Parisot
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.17967
Pdf URL: https://arxiv.org/pdf/2506.17967
Copy Paste: [[2506.17967]] Adapting Vision-Language Models for Evaluating World Models(https://arxiv.org/abs/2506.17967)
Keywords: generative
Abstract: World models -- generative models that simulate environment dynamics conditioned on past observations and actions -- are gaining prominence in planning, simulation, and embodied AI. However, evaluating their rollouts remains a fundamental challenge, requiring fine-grained, temporally grounded assessment of action alignment and semantic consistency -- capabilities not captured by existing metrics. Vision-Language Models (VLMs) have shown promise as automatic evaluators of generative content due to their strong multimodal reasoning abilities. Yet, their use in fine-grained, temporally sensitive evaluation tasks remains limited and requires targeted adaptation. We introduce a evaluation protocol targeting two recognition tasks -- action recognition and character recognition -- each assessed across binary, multiple-choice, and open-ended formats. To support this, we present UNIVERSE (UNIfied Vision-language Evaluator for Rollouts in Simulated Environments), a method for adapting VLMs to rollout evaluation under data and compute constraints. We conduct a large-scale study comparing full, partial, and parameter-efficient finetuning across task formats, context lengths, sampling strategies, and data compositions. The resulting unified evaluator matches the performance of task-specific baselines using a single checkpoint. Human studies confirm strong alignment with human judgments, establishing UNIVERSE as a scalable, semantics-aware evaluator for world models.

Title: Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models

Authors: Mischa Dombrowski, Bernhard Kainz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.17975
Pdf URL: https://arxiv.org/pdf/2506.17975
Copy Paste: [[2506.17975]] Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models(https://arxiv.org/abs/2506.17975)
Keywords: diffusion
Abstract: Synthetic data has recently reached a level of visual fidelity that makes it nearly indistinguishable from real data, offering great promise for privacy-preserving data sharing in medical imaging. However, fully synthetic datasets still suffer from significant limitations: First and foremost, the legal aspect of sharing synthetic data is often neglected and data regulations, such as the GDPR, are largley ignored. Secondly, synthetic models fall short of matching the performance of real data, even for in-domain downstream applications. Recent methods for image generation have focused on maximising image diversity instead of fidelity solely to improve the mode coverage and therefore the downstream performance of synthetic data. In this work, we shift perspective and highlight how maximizing diversity can also be interpreted as protecting natural persons from being singled out, which leads to predicate singling-out (PSO) secure synthetic datasets. Specifically, we propose a generalisable framework for training diffusion models on personal data which leads to unpersonal synthetic datasets achieving performance within one percentage point of real-data models while significantly outperforming state-of-the-art methods that do not ensure privacy. Our code is available at this https URL.

Title: Imputation of Longitudinal Data Using GANs: Challenges and Implications for Classification

Authors: Sharon Torao Pingi, Md Abul Bashar, Richi Nayak
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.18007
Pdf URL: https://arxiv.org/pdf/2506.18007
Copy Paste: [[2506.18007]] Imputation of Longitudinal Data Using GANs: Challenges and Implications for Classification(https://arxiv.org/abs/2506.18007)
Keywords: generative
Abstract: Longitudinal data is commonly utilised across various domains, such as health, biomedical, education and survey studies. This ubiquity has led to a rise in statistical, machine and deep learning-based methods for Longitudinal Data Classification (LDC). However, the intricate nature of the data, characterised by its multi-dimensionality, causes instance-level heterogeneity and temporal correlations that add to the complexity of longitudinal data analysis. Additionally, LDC accuracy is often hampered by the pervasiveness of missing values in longitudinal data. Despite ongoing research that draw on the generative power and utility of Generative Adversarial Networks (GANs) to address the missing data problem, critical considerations include statistical assumptions surrounding longitudinal data and missingness within it, as well as other data-level challenges like class imbalance and mixed data types that impact longitudinal data imputation (LDI) and the subsequent LDC process in GANs. This paper provides a comprehensive overview of how GANs have been applied in LDI, with a focus whether GANS have adequately addressed fundamental assumptions about the data from a LDC perspective. We propose a categorisation of main approaches to GAN-based LDI, highlight strengths and limitations of methods, identify key research trends, and provide promising future directions. Our findings indicate that while GANs show great potential for LDI to improve usability and quality of longitudinal data for tasks like LDC, there is need for more versatile approaches that can handle the wider spectrum of challenges presented by longitudinal data with missing values. By synthesising current knowledge and identifying critical research gaps, this survey aims to guide future research efforts in developing more effective GAN-based solutions to address LDC challenges.

Title: On the Robustness of Human-Object Interaction Detection against Distribution Shift

Authors: Chi Xie, Shuang Liang, Jie Li, Feng Zhu, Rui Zhao, Yichen Wei, Shengjie Zhao
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.18021
Pdf URL: https://arxiv.org/pdf/2506.18021
Copy Paste: [[2506.18021]] On the Robustness of Human-Object Interaction Detection against Distribution Shift(https://arxiv.org/abs/2506.18021)
Keywords: foundation model
Abstract: Human-Object Interaction (HOI) detection has seen substantial advances in recent years. However, existing works focus on the standard setting with ideal images and natural distribution, far from practical scenarios with inevitable distribution shifts. This hampers the practical applicability of HOI detection. In this work, we investigate this issue by benchmarking, analyzing, and enhancing the robustness of HOI detection models under various distribution shifts. We start by proposing a novel automated approach to create the first robustness evaluation benchmark for HOI detection. Subsequently, we evaluate more than 40 existing HOI detection models on this benchmark, showing their insufficiency, analyzing the features of different frameworks, and discussing how the robustness in HOI is different from other tasks. With the insights from such analyses, we propose to improve the robustness of HOI detection methods through: (1) a cross-domain data augmentation integrated with mixup, and (2) a feature fusion strategy with frozen vision foundation models. Both are simple, plug-and-play, and applicable to various methods. Our experimental results demonstrate that the proposed approach significantly increases the robustness of various methods, with benefits on standard benchmarks, too. The dataset and code will be released.

Title: TAB: Unified Benchmarking of Time Series Anomaly Detection Methods

Authors: Xiangfei Qiu, Zhe Li, Wanghui Qiu, Shiyan Hu, Lekui Zhou, Xingjian Wu, Zhengyu Li, Chenjuan Guo, Aoying Zhou, Zhenli Sheng, Jilin Hu, Christian S. Jensen, Bin Yang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.18046
Pdf URL: https://arxiv.org/pdf/2506.18046
Copy Paste: [[2506.18046]] TAB: Unified Benchmarking of Time Series Anomaly Detection Methods(https://arxiv.org/abs/2506.18046)
Keywords: anomaly
Abstract: Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare. With the ongoing instrumentation of reality, more time series data will be available, leading also to growing demands for TSAD. While many TSAD methods already exist, new and better methods are still desirable. However, effective progress hinges on the availability of reliable means of evaluating new methods and comparing them with existing methods. We address deficiencies in current evaluation procedures related to datasets and experimental settings and protocols. Specifically, we propose a new time series anomaly detection benchmark, called TAB. First, TAB encompasses 29 public multivariate datasets and 1,635 univariate time series from different domains to facilitate more comprehensive evaluations on diverse datasets. Second, TAB covers a variety of TSAD methods, including Non-learning, Machine learning, Deep learning, LLM-based, and Time-series pre-trained methods. Third, TAB features a unified and automated evaluation pipeline that enables fair and easy evaluation of TSAD methods. Finally, we employ TAB to evaluate existing TSAD methods and report on the outcomes, thereby offering a deeper insight into the performance of these methods. Besides, all datasets and code are available at this https URL.

Title: CLGRPO: Reasoning Ability Enhancement for Small VLMs

Authors: Fanyi Wang, Binzhi Dong, Haotian Hu, Jinjin Xu, Zhiwang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18048
Pdf URL: https://arxiv.org/pdf/2506.18048
Copy Paste: [[2506.18048]] CLGRPO: Reasoning Ability Enhancement for Small VLMs(https://arxiv.org/abs/2506.18048)
Keywords: self-supervised
Abstract: Small Vision Language Models (SVLMs) generally refer to models with parameter sizes less than or equal to 2B. Their low cost and power consumption characteristics confer high commercial value. However, their reasoning abilities are limited by the number of parameters. To address this issue, this paper proposes a post-training optimization paradigm called the Incremental Training Strategy to enhance the reasoning ability of SVLMs. Firstly, we constructed a Self-Supervised Chain-of-Thought (COT) Data Construction System, which leverages multiple LVLMs with 7B parameters or more to transform original data into COT data in a self-supervised manner. Our proposed Incremental Training Strategy consists of four stages. Stage 1 injects domain knowledge by performing Supervised Fine-Tuning (SFT) to the pretrained model on the COT data. Stage 2 aligns the COT data format by conducting a small amount of Group Relative Policy Optimization (GRPO) training constrained only by format rewards on the COT data. Stage 3 enhances reasoning ability by applying GRPO training on the COT data with constraints on both format and accuracy rewards. The resulting model shows significant improvement compared to the baseline. Stage 4 addresses the limited capacity of the SVLMs and the weak ability to capture complex patterns by proposing ClipLow GRPO (CLGRPO) to constrain the capture space of the training process. We conducted extensive comparative and ablation experiments on the abstract semantic recognition dataset EMOSet-118K. Experimental results demonstrate that our method significantly improves the reasoning ability of 1B SVLM. Compared to the baseline model fine-tuned on the original data, accuracy increased by 2.77 and recall by 0.69, achieving performance comparable to that of 8B models.

Title: Deep Supervised LSTM for 3D morphology estimation from Multi-View RGB Images of Wheat Spikes

Authors: Olivia Zumsteg, Nico Graf, Aaron Haeusler, Norbert Kirchgessner, Nicola Storni, Lukas Roth, Andreas Hund
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18060
Pdf URL: https://arxiv.org/pdf/2506.18060
Copy Paste: [[2506.18060]] Deep Supervised LSTM for 3D morphology estimation from Multi-View RGB Images of Wheat Spikes(https://arxiv.org/abs/2506.18060)
Keywords: self-supervised
Abstract: Estimating three-dimensional morphological traits from two-dimensional RGB images presents inherent challenges due to the loss of depth information, projection distortions, and occlusions under field conditions. In this work, we explore multiple approaches for non-destructive volume estimation of wheat spikes, using RGB image sequences and structured-light 3D scans as ground truth references. Due to the complex geometry of the spikes, we propose a neural network approach for volume estimation in 2D images, employing a transfer learning pipeline that combines DINOv2, a self-supervised Vision Transformer, with a unidirectional Long Short-Term Memory (LSTM) network. By using deep supervision, the model is able to learn more robust intermediate representations, which enhances its generalisation ability across varying evaluation sequences. We benchmark our model against two conventional baselines: a 2D area-based projection and a geometric reconstruction using axis-aligned cross-sections. Our deep supervised model achieves a mean absolute percentage error (MAPE) of 6.46% on six-view indoor images, outperforming the area (9.36%) and geometric (13.98%) baselines. Fine-tuning the model on field-based single-image data enables domain adaptation, yielding a MAPE of 10.82%. We demonstrate that object shape significantly impacts volume prediction accuracy, with irregular geometries such as wheat spikes posing greater challenges for geometric methods compared to our deep learning approach.

Title: Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution

Authors: Patrik Stano, Aleš Horák
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.18091
Pdf URL: https://arxiv.org/pdf/2506.18091
Copy Paste: [[2506.18091]] Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution(https://arxiv.org/abs/2506.18091)
Keywords: generative
Abstract: Anaphora resolution plays a critical role in natural language understanding, especially in morphologically rich languages like Czech. This paper presents a comparative evaluation of two modern approaches to anaphora resolution on Czech text: prompt engineering with large language models (LLMs) and fine-tuning compact generative models. Using a dataset derived from the Prague Dependency Treebank, we evaluate several instruction-tuned LLMs, including Mistral Large 2 and Llama 3, using a series of prompt templates. We compare them against fine-tuned variants of the mT5 and Mistral models that we trained specifically for Czech anaphora resolution. Our experiments demonstrate that while prompting yields promising few-shot results (up to 74.5% accuracy), the fine-tuned models, particularly mT5-large, outperform them significantly, achieving up to 88% accuracy while requiring fewer computational resources. We analyze performance across different anaphora types, antecedent distances, and source corpora, highlighting key strengths and trade-offs of each approach.

Title: ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Authors: Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.18095
Pdf URL: https://arxiv.org/pdf/2506.18095
Copy Paste: [[2506.18095]] ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation(https://arxiv.org/abs/2506.18095)
Keywords: generative
Abstract: Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.

Title: Enhancing VICReg: Random-Walk Pairing for Improved Generalization and Better Global Semantics Capturing

Authors: Idan Simai, Ronen Talmon, Uri Shaham
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.18104
Pdf URL: https://arxiv.org/pdf/2506.18104
Copy Paste: [[2506.18104]] Enhancing VICReg: Random-Walk Pairing for Improved Generalization and Better Global Semantics Capturing(https://arxiv.org/abs/2506.18104)
Keywords: self-supervised
Abstract: In this paper, we argue that viewing VICReg-a popular self-supervised learning (SSL) method--through the lens of spectral embedding reveals a potential source of sub-optimality: it may struggle to generalize robustly to unseen data due to overreliance on the training data. This observation invites a closer look at how well this method achieves its goal of producing meaningful representations of images outside of the training set as well. Here, we investigate this issue and introduce SAG-VICReg (Stable and Generalizable VICReg), a method that builds on VICReg by incorporating new training techniques. These enhancements improve the model's ability to capture global semantics within the data and strengthen the generalization capabilities. Experiments demonstrate that SAG-VICReg effectively addresses the generalization challenge while matching or surpassing diverse state-of-the-art SSL baselines. Notably, our method exhibits superior performance on metrics designed to evaluate global semantic understanding, while simultaneously maintaining competitive results on local evaluation metrics. Furthermore, we propose a new standalone evaluation metric for embeddings that complements the standard evaluation methods and accounts for the global data structure without requiring labels--a key issue when tagged data is scarce or not available.

Title: $ϕ^{\infty}$: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models

Authors: Bugra Kilictas, Faruk Alpay
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18129
Pdf URL: https://arxiv.org/pdf/2506.18129
Copy Paste: [[2506.18129]] $ϕ^{\infty}$: Clause Purification, Embedding Realignment, and the Total Suppression of the Em Dash in Autoregressive Language Models(https://arxiv.org/abs/2506.18129)
Keywords: foundation model
Abstract: We identify a critical vulnerability in autoregressive transformer language models where the em dash token induces recursive semantic drift, leading to clause boundary hallucination and embedding space entanglement. Through formal analysis of token-level perturbations in semantic lattices, we demonstrate that em dash insertion fundamentally alters the model's latent representations, causing compounding errors in long-form generation. We propose a novel solution combining symbolic clause purification via the phi-infinity operator with targeted embedding matrix realignment. Our approach enables total suppression of problematic tokens without requiring model retraining, while preserving semantic coherence through fixed-point convergence guarantees. Experimental validation shows significant improvements in generation consistency and topic maintenance. This work establishes a general framework for identifying and mitigating token-level vulnerabilities in foundation models, with immediate implications for AI safety, model alignment, and robust deployment of large language models in production environments. The methodology extends beyond punctuation to address broader classes of recursive instabilities in neural text generation systems.

Title: Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection

Authors: Quan Zhou, Gan Luo, Qiang Hu, Qingyong Zhang, Jinhua Zhang, Yinjiao Tian, Qiang Li, Zhiwei Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18134
Pdf URL: https://arxiv.org/pdf/2506.18134
Copy Paste: [[2506.18134]] Targeted False Positive Synthesis via Detector-guided Adversarial Diffusion Attacker for Robust Polyp Detection(https://arxiv.org/abs/2506.18134)
Keywords: diffusion, generative
Abstract: Polyp detection is crucial for colorectal cancer screening, yet existing models are limited by the scale and diversity of available data. While generative models show promise for data augmentation, current methods mainly focus on enhancing polyp diversity, often overlooking the critical issue of false positives. In this paper, we address this gap by proposing an adversarial diffusion framework to synthesize high-value false positives. The extensive variability of negative backgrounds presents a significant challenge in false positive synthesis. To overcome this, we introduce two key innovations: First, we design a regional noise matching strategy to construct a negative synthesis space using polyp detection datasets. This strategy trains a negative-centric diffusion model by masking polyp regions, ensuring the model focuses exclusively on learning diverse background patterns. Second, we introduce the Detector-guided Adversarial Diffusion Attacker (DADA) module, which perturbs the negative synthesis process to disrupt a pre-trained detector's decision, guiding the negative-centric diffusion model to generate high-value, detector-confusing false positives instead of low-value, ordinary backgrounds. Our approach is the first to apply adversarial diffusion to lesion detection, establishing a new paradigm for targeted false positive synthesis and paving the way for more reliable clinical applications in colorectal cancer screening. Extensive results on public and in-house datasets verify the superiority of our method over the current state-of-the-arts, with our synthesized data improving the detectors by at least 2.6% and 2.7% in F1-score, respectively, over the baselines. Codes are at this https URL.

Title: Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry

Authors: Christian Sax, Jochen Kriegseis
Subjects: cs.CV, physics.app-ph, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2506.18157
Pdf URL: https://arxiv.org/pdf/2506.18157
Copy Paste: [[2506.18157]] Pattern-Based Phase-Separation of Tracer and Dispersed Phase Particles in Two-Phase Defocusing Particle Tracking Velocimetry(https://arxiv.org/abs/2506.18157)
Keywords: generative
Abstract: This work investigates the feasibility of a post-processing-based approach for phase separation in defocusing particle tracking velocimetry for dispersed two-phase flows. The method enables the simultaneous 3D localization determination of both tracer particles and particles of the dispersed phase, using a single-camera setup. The distinction between phases is based on pattern differences in defocused particle images, which arise from distinct light scattering behaviors of tracer particles and bubbles or droplets. Convolutional neural networks, including Faster R-CNN and YOLOv4 variants, are trained to detect and classify particle images based on these pattern features. To generate large, labeled training datasets, a generative adversarial network based framework is introduced, allowing the generation of auto-labeled data that more closely reflects experiment-specific visual appearance. Evaluation across six datasets, comprising synthetic two-phase and real single- and two-phase flows, demonstrates high detection precision and classification accuracy (95-100%), even under domain shifts. The results confirm the viability of using CNNs for robust phase separation in disperse two-phase DPTV, particularly in scenarios where traditional wavelength-, size-, or ensemble correlation-based methods are impractical.

Title: CDG-MAE: Learning Correspondences from Diffusion Generated Views

Authors: Varun Belagali, Pierre Marza, Srikar Yellapragada, Zilinghan Li, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18164
Pdf URL: https://arxiv.org/pdf/2506.18164
Copy Paste: [[2506.18164]] CDG-MAE: Learning Correspondences from Diffusion Generated Views(https://arxiv.org/abs/2506.18164)
Keywords: diffusion, self-supervised
Abstract: Learning dense correspondences, critical for application such as video label propagation, is hindered by tedious and unscalable manual annotation. Self-supervised methods address this by using a cross-view pretext task, often modeled with a masked autoencoder, where a masked target view is reconstructed from an anchor view. However, acquiring effective training data remains a challenge - collecting diverse video datasets is difficult and costly, while simple image crops lack necessary pose variations. This paper introduces CDG-MAE, a novel MAE-based self-supervised method that uses diverse synthetic views generated from static images via an image-conditioned diffusion model. These generated views exhibit substantial changes in pose and perspective, providing a rich training signal that overcomes the limitations of video and crop-based anchors. We present a quantitative method to evaluate local and global consistency of generated images, discussing their use for cross-view self-supervised pretraining. Furthermore, we enhance the standard single-anchor MAE setting to a multi-anchor strategy to effectively modulate the difficulty of pretext task. CDG-MAE significantly outperforms state-of-the-art MAE methods reliant only on images and substantially narrows the performance gap to video-based approaches.

Title: Non-equilibrium Annealed Adjoint Sampler

Authors: Jaemoo Choi, Yongxin Chen, Molei Tao, Guan-Horng Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18165
Pdf URL: https://arxiv.org/pdf/2506.18165
Copy Paste: [[2506.18165]] Non-equilibrium Annealed Adjoint Sampler(https://arxiv.org/abs/2506.18165)
Keywords: diffusion
Abstract: Recently, there has been significant progress in learning-based diffusion samplers, which aim to sample from a given unnormalized density. These methods typically follow one of two paradigms: (i) formulating sampling as an unbiased stochastic optimal control (SOC) problem using a canonical reference process, or (ii) refining annealed path measures through importance-weighted sampling. Although annealing approaches have advantages in guiding samples toward high-density regions, reliance on importance sampling leads to high variance and limited scalability in practice. In this paper, we introduce the \textbf{Non-equilibrium Annealed Adjoint Sampler (NAAS)}, a novel SOC-based diffusion sampler that leverages annealed reference dynamics without resorting to importance sampling. NAAS employs a lean adjoint system inspired by adjoint matching, enabling efficient and scalable training. We demonstrate the effectiveness of our approach across a range of tasks, including sampling from classical energy landscapes and molecular Boltzmann distribution.

Title: Joint Embedding Predictive Architecture for self-supervised pretraining on polymer molecular graphs

Authors: Francesco Picolli, Gabriel Vogel, Jana M. Weber
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.18194
Pdf URL: https://arxiv.org/pdf/2506.18194
Copy Paste: [[2506.18194]] Joint Embedding Predictive Architecture for self-supervised pretraining on polymer molecular graphs(https://arxiv.org/abs/2506.18194)
Keywords: self-supervised
Abstract: Recent advances in machine learning (ML) have shown promise in accelerating the discovery of polymers with desired properties by aiding in tasks such as virtual screening via property prediction. However, progress in polymer ML is hampered by the scarcity of high-quality labeled datasets, which are necessary for training supervised ML models. In this work, we study the use of the very recent 'Joint Embedding Predictive Architecture' (JEPA), a type of architecture for self-supervised learning (SSL), on polymer molecular graphs to understand whether pretraining with the proposed SSL strategy improves downstream performance when labeled data is scarce. Our results indicate that JEPA-based self-supervised pretraining on polymer graphs enhances downstream performance, particularly when labeled data is very scarce, achieving improvements across all tested datasets.

Title: Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano

Authors: Berk Yilmaz, Aniruddh Aiyengar
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.18220
Pdf URL: https://arxiv.org/pdf/2506.18220
Copy Paste: [[2506.18220]] Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano(https://arxiv.org/abs/2506.18220)
Keywords: self-supervised, anomaly
Abstract: Early and accurate identification of retinal ailments is crucial for averting ocular decline; however, access to dependable diagnostic devices is not often available in low-resourced settings. This project proposes to solve that by developing a lightweight, edge-device deployable disease classifier using cross-architecture knowledge distilling. We first train a high-capacity vision transformer (ViT) teacher model, pre-trained using I-JEPA self-supervised learning, to classify fundus images into four classes: Normal, Diabetic Retinopathy, Glaucoma, and Cataract. We kept an Internet of Things (IoT) focus when compressing to a CNN-based student model for deployment in resource-limited conditions, such as the NVIDIA Jetson Nano. This was accomplished using a novel framework which included a Partitioned Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and a multi-view robust training method. The teacher model has 97.4 percent more parameters than the student model, with it achieving 89 percent classification with a roughly 93 percent retention of the teacher model's diagnostic performance. The retention of clinical classification behavior supports our method's initial aim: compression of the ViT while retaining accuracy. Our work serves as an example of a scalable, AI-driven triage solution for retinal disorders in under-resourced areas.

Title: Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability

Authors: Jongoh Jeong, Hunmin Yang, Jaeseok Jeong, Kuk-Jin Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18248
Pdf URL: https://arxiv.org/pdf/2506.18248
Copy Paste: [[2506.18248]] Semantic Structure-Aware Generative Attacks for Enhanced Adversarial Transferability(https://arxiv.org/abs/2506.18248)
Keywords: generative
Abstract: Generative adversarial attacks train a perturbation generator on a white-box surrogate model and subsequently apply the crafted perturbations to unseen black-box victim models. In contrast to iterative attacks, these methods deliver superior inference-time efficiency, scalability, and transferability; however, up until now, existing studies have not fully exploited the representational capacity of generative models to preserve and harness semantic information. Specifically, the intermediate activations of the generator encode rich semantic features--object boundaries and coarse shapes--that remain under-exploited, thereby limiting the alignment of perturbations with object-salient regions which are critical for adversarial transferability. To remedy this, we introduce a semantic structure-aware attack framework based on the Mean Teacher, which serves as a temporally smoothed feature reference. With this smoothed reference, we further direct semantic consistency between the early-layer activations in the student and those of the semantically rich teacher by feature distillation. By anchoring perturbation synthesis to the semantically salient early intermediate blocks within the generator based on empirical findings, our method guides progressive adversarial perturbation on regions that substantially enhance adversarial transferability. We conduct extensive experiments over diverse models, domains and tasks to demonstrate consistent improvements relative to state-of-the-art generative attacks, comprehensively evaluated using conventional metrics and our newly proposed Accidental Correction Rate (ACR).

Title: YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos

Authors: Haoming Chen, Lichen Yuan, TianFang Sun, Jingyu Gong, Xin Tan, Zhizhong Zhang, Yuan Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18266
Pdf URL: https://arxiv.org/pdf/2506.18266
Copy Paste: [[2506.18266]] YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos(https://arxiv.org/abs/2506.18266)
Keywords: self-supervised, foundation model
Abstract: 3D semantic occupancy prediction in the past was considered to require precise geometric relationships in order to enable effective training. However, in complex indoor environments, the large-scale and widespread collection of data, along with the necessity for fine-grained annotations, becomes impractical due to the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without the need for any pre-knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Upon on this web dataset, we establish a fully self-supervised model to leverage accessible 2D prior knowledge for reaching powerful 3D indoor perception. Specifically, we harness the advantages of the prosperous vision foundation models, distilling the 2D region-level knowledge into the occupancy network by grouping the similar pixels into superpixels. Experimental results show that our method achieves state-of-the-art zero-shot performance on two popular benchmarks (NYUv2 and OccScanNet

Title: ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs

Authors: Haseeb Ullah Khan Shinwari, Muhammad Usama
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18267
Pdf URL: https://arxiv.org/pdf/2506.18267
Copy Paste: [[2506.18267]] ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs(https://arxiv.org/abs/2506.18267)
Keywords: foundation model
Abstract: Conventional Low-Rank Adaptation (LoRA) methods employ a fixed rank, imposing uniform adaptation across transformer layers and attention heads despite their heterogeneous learning dynamics. This paper introduces Adaptive Rank Dynamic LoRA (ARD-LoRA), a novel framework that automates rank allocation through learnable scaling factors. These factors are optimized via a meta-objective balancing task performance and parameter efficiency, incorporating $\ell_1$ sparsity for minimal rank and Total Variation regularization for stable rank transitions. ARD-LoRA enables continuous, differentiable, per-head rank adaptation. Experiments on LLAMA-3.1-70B and PaliGemma-2 demonstrate ARD-LoRA's efficacy, achieving up to 99.3% of full fine-tuning performance with only 0.32% trainable parameters, outperforming strong baselines like DoRA and AdaLoRA. Furthermore, it reduces multimodal adaptation memory by 41%. These results establish dynamic, fine-grained rank allocation as a critical paradigm for efficient foundation model adaptation.

Title: Adaptive Mask-guided K-space Diffusion for Accelerated MRI Reconstruction

Authors: Qinrong Cai, Yu Guan, Zhibo Chen, Dong Liang, Qiuyun Fan, Qiegen Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18270
Pdf URL: https://arxiv.org/pdf/2506.18270
Copy Paste: [[2506.18270]] Adaptive Mask-guided K-space Diffusion for Accelerated MRI Reconstruction(https://arxiv.org/abs/2506.18270)
Keywords: diffusion
Abstract: As the deep learning revolution marches on, masked modeling has emerged as a distinctive approach that involves predicting parts of the original data that are proportionally masked during training, and has demonstrated exceptional performance in multiple fields. Magnetic Resonance Imaging (MRI) reconstruction is a critical task in medical imaging that seeks to recover high-quality images from under-sampled k-space data. However, previous MRI reconstruction strategies usually optimized the entire image domain or k-space, without considering the importance of different frequency regions in the k-space This work introduces a diffusion model based on adaptive masks (AMDM), which utilizes the adaptive adjustment of frequency distribution based on k-space data to develop a hybrid masks mechanism that adapts to different k-space inputs. This enables the effective separation of high-frequency and low-frequency components, producing diverse frequency-specific representations. Additionally, the k-space frequency distribution informs the generation of adaptive masks, which, in turn, guide a closed-loop diffusion process. Experimental results verified the ability of this method to learn specific frequency information and thereby improved the quality of MRI reconstruction, providing a flexible framework for optimizing k-space data using masks in the future.

Title: Learning Causal Graphs at Scale: A Foundation Model Approach

Authors: Naiyu Yin, Tian Gao, Yue Yu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18285
Pdf URL: https://arxiv.org/pdf/2506.18285
Copy Paste: [[2506.18285]] Learning Causal Graphs at Scale: A Foundation Model Approach(https://arxiv.org/abs/2506.18285)
Keywords: foundation model
Abstract: Due to its human-interpretability and invariance properties, Directed Acyclic Graph (DAG) has been a foundational tool across various areas of AI research, leading to significant advancements. However, DAG learning remains highly challenging, due to its super-exponential growth in computational cost and identifiability issues, particularly in small-sample regimes. To address these two challenges, in this work we leverage the recent success of linear transformers and develop a foundation model approach for discovering multiple order-consistent DAGs across tasks. In particular, we propose Attention-DAG (ADAG), a novel attention-mechanism-based architecture for learning multiple linear Structural Equation Models (SEMs). ADAG learns the mapping from observed data to both graph structure and parameters via a nonlinear attention-based kernel, enabling efficient multi-task estimation of the underlying linear SEMs. By formulating the learning process across multiple tasks as a continuous optimization problem, the pre-trained ADAG model captures the common structural properties as a shared low-dimensional prior, thereby reducing the ill-posedness of downstream DAG learning tasks in small-sample regimes. We evaluate our proposed approach on benchmark synthetic datasets and find that ADAG achieves substantial improvements in both DAG learning accuracy and zero-shot inference efficiency. To the best of our knowledge, this is the first practical approach for pre-training a foundation model specifically designed for DAG learning, representing a step toward more efficient and generalizable down-stream applications in causal discovery.

Title: Learning High-Quality Latent Representations for Anomaly Detection and Signal Integrity Enhancement in High-Speed Signals

Authors: Muhammad Usama, Hee-Deok Jang, Soham Shanbhag, Yoo-Chang Sung, Seung-Jun Bae, Dong Eui Chang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.18288
Pdf URL: https://arxiv.org/pdf/2506.18288
Copy Paste: [[2506.18288]] Learning High-Quality Latent Representations for Anomaly Detection and Signal Integrity Enhancement in High-Speed Signals(https://arxiv.org/abs/2506.18288)
Keywords: anomaly
Abstract: This paper addresses the dual challenge of improving anomaly detection and signal integrity in high-speed dynamic random access memory signals. To achieve this, we propose a joint training framework that integrates an autoencoder with a classifier to learn more distinctive latent representations by focusing on valid data features. Our approach is evaluated across three anomaly detection algorithms and consistently outperforms two baseline methods. Detailed ablation studies further support these findings. Furthermore, we introduce a signal integrity enhancement algorithm that improves signal integrity by an average of 11.3%. The source code and data used in this study are available at this https URL.

Title: Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction

Authors: Han Zhang, Jinghong Mao, Shangwen Zhu, Zhantao Yang, Lianghua Huang, Yu Liu, Deli Zhao, Ruili Feng, Fan Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.18290
Pdf URL: https://arxiv.org/pdf/2506.18290
Copy Paste: [[2506.18290]] Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction(https://arxiv.org/abs/2506.18290)
Keywords: diffusion
Abstract: Diffusion reconstruction plays a critical role in various applications such as image editing, restoration, and style transfer. In theory, the reconstruction should be simple - it just inverts and regenerates images by numerically solving the Probability Flow-Ordinary Differential Equation (PF-ODE). Yet in practice, noticeable reconstruction errors have been observed, which cannot be well explained by numerical errors. In this work, we identify a deeper intrinsic property in the PF-ODE generation process, the instability, that can further amplify the reconstruction errors. The root of this instability lies in the sparsity inherent in the generation distribution, which means that the probability is concentrated on scattered and small regions while the vast majority remains almost empty. To demonstrate the existence of instability and its amplification on reconstruction error, we conduct experiments on both toy numerical examples and popular open-sourced diffusion models. Furthermore, based on the characteristics of image data, we theoretically prove that the instability's probability converges to one as the data dimensionality increases. Our findings highlight the inherent challenges in diffusion-based reconstruction and can offer insights for future improvements.

Title: NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation

Authors: Yu Xie, Chengjie Zeng, Lingyun Zhang, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18325
Pdf URL: https://arxiv.org/pdf/2506.18325
Copy Paste: [[2506.18325]] NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation(https://arxiv.org/abs/2506.18325)
Keywords: diffusion
Abstract: The rapid advancement of text-to-image (T2I) models, such as Stable Diffusion, has enhanced their capability to synthesize images from textual prompts. However, this progress also raises significant risks of misuse, including the generation of harmful content (e.g., pornography, violence, discrimination), which contradicts the ethical goals of T2I technology and hinders its sustainable development. Inspired by "jailbreak" attacks in large language models, which bypass restrictions through subtle prompt modifications, this paper proposes NSFW-Classifier Guided Prompt Sanitization (PromptSan), a novel approach to detoxify harmful prompts without altering model architecture or degrading generation capability. PromptSan includes two variants: PromptSan-Modify, which iteratively identifies and replaces harmful tokens in input prompts using text NSFW classifiers during inference, and PromptSan-Suffix, which trains an optimized suffix token sequence to neutralize harmful intent while passing both text and image NSFW classifier checks. Extensive experiments demonstrate that PromptSan achieves state-of-the-art performance in reducing harmful content generation across multiple metrics, effectively balancing safety and usability.

Title: Geometry-Aware Preference Learning for 3D Texture Generation

Authors: AmirHossein Zamani, Tianhao Xie, Amir G. Aghdam, Tiberiu Popa, Eugene Belilovsky
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18331
Pdf URL: https://arxiv.org/pdf/2506.18331
Copy Paste: [[2506.18331]] Geometry-Aware Preference Learning for 3D Texture Generation(https://arxiv.org/abs/2506.18331)
Keywords: generative
Abstract: Recent advances in 3D generative models have achieved impressive results but 3D contents generated by these models may not align with subjective human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.

Title: Controlled Generation with Equivariant Variational Flow Matching

Authors: Floor Eijkelboom, Heiko Zimmermann, Sharvaree Vadgama, Erik J Bekkers, Max Welling, Christian A. Naesseth, Jan-Willem van de Meent
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18340
Pdf URL: https://arxiv.org/pdf/2506.18340
Copy Paste: [[2506.18340]] Controlled Generation with Equivariant Variational Flow Matching(https://arxiv.org/abs/2506.18340)
Keywords: generative
Abstract: We derive a controlled generation objective within the framework of Variational Flow Matching (VFM), which casts flow matching as a variational inference problem. We demonstrate that controlled generation can be implemented two ways: (1) by way of end-to-end training of conditional generative models, or (2) as a Bayesian inference problem, enabling post hoc control of unconditional models without retraining. Furthermore, we establish the conditions required for equivariant generation and provide an equivariant formulation of VFM tailored for molecular generation, ensuring invariance to rotations, translations, and permutations. We evaluate our approach on both uncontrolled and controlled molecular generation, achieving state-of-the-art performance on uncontrolled generation and outperforming state-of-the-art models in controlled generation, both with end-to-end training and in the Bayesian inference setting. This work strengthens the connection between flow-based generative modeling and Bayesian inference, offering a scalable and principled framework for constraint-driven and symmetry-aware generation.

Title: Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection

Authors: Anja Delić, Matej Grcić, Siniša Šegvić
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18368
Pdf URL: https://arxiv.org/pdf/2506.18368
Copy Paste: [[2506.18368]] Sequential keypoint density estimator: an overlooked baseline of skeleton-based video anomaly detection(https://arxiv.org/abs/2506.18368)
Keywords: anomaly
Abstract: Detecting anomalous human behaviour is an important visual task in safety-critical applications such as healthcare monitoring, workplace safety, or public surveillance. In these contexts, abnormalities are often reflected with unusual human poses. Thus, we propose SeeKer, a method for detecting anomalies in sequences of human skeletons. Our method formulates the skeleton sequence density through autoregressive factorization at the keypoint level. The corresponding conditional distributions represent probable keypoint locations given prior skeletal motion. We formulate the joint distribution of the considered skeleton as causal prediction of conditional Gaussians across its constituent keypoints. A skeleton is flagged as anomalous if its keypoint locations surprise our model (i.e. receive a low density). In practice, our anomaly score is a weighted sum of per-keypoint log-conditionals, where the weights account for the confidence of the underlying keypoint detector. Despite its conceptual simplicity, SeeKer surpasses all previous methods on the UBnormal and MSAD-HR datasets while delivering competitive performance on the ShanghaiTech dataset.

Title: Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging

Authors: Filippo Ruffini, Elena Mulero Ayllon, Linlin Shen, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18434
Pdf URL: https://arxiv.org/pdf/2506.18434
Copy Paste: [[2506.18434]] Benchmarking Foundation Models and Parameter-Efficient Fine-Tuning for Prognosis Prediction in Medical Imaging(https://arxiv.org/abs/2506.18434)
Keywords: foundation model
Abstract: Artificial Intelligence (AI) holds significant promise for improving prognosis prediction in medical imaging, yet its effective application remains challenging. In this work, we introduce a structured benchmark explicitly designed to evaluate and compare the transferability of Convolutional Neural Networks and Foundation Models in predicting clinical outcomes in COVID-19 patients, leveraging diverse publicly available Chest X-ray datasets. Our experimental methodology extensively explores a wide set of fine-tuning strategies, encompassing traditional approaches such as Full Fine-Tuning and Linear Probing, as well as advanced Parameter-Efficient Fine-Tuning methods including Low-Rank Adaptation, BitFit, VeRA, and IA3. The evaluations were conducted across multiple learning paradigms, including both extensive full-data scenarios and more clinically realistic Few-Shot Learning settings, which are critical for modeling rare disease outcomes and rapidly emerging health threats. By implementing a large-scale comparative analysis involving a diverse selection of pretrained models, including general-purpose architectures pretrained on large-scale datasets such as CLIP and DINOv2, to biomedical-specific models like MedCLIP, BioMedCLIP, and PubMedCLIP, we rigorously assess each model's capacity to effectively adapt and generalize to prognosis tasks, particularly under conditions of severe data scarcity and pronounced class imbalance. The benchmark was designed to capture critical conditions common in prognosis tasks, including variations in dataset size and class distribution, providing detailed insights into the strengths and limitations of each fine-tuning strategy. This extensive and structured evaluation aims to inform the practical deployment and adoption of robust, efficient, and generalizable AI-driven solutions in real-world clinical prognosis prediction workflows.

Title: CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

Authors: Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18438
Pdf URL: https://arxiv.org/pdf/2506.18438
Copy Paste: [[2506.18438]] CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing(https://arxiv.org/abs/2506.18438)
Keywords: diffusion
Abstract: Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques.

Title: DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Authors: Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18463
Pdf URL: https://arxiv.org/pdf/2506.18463
Copy Paste: [[2506.18463]] DIP: Unsupervised Dense In-Context Post-training of Visual Representations(https://arxiv.org/abs/2506.18463)
Keywords: diffusion, in-context
Abstract: We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: this https URL

Title: GANs vs. Diffusion Models for virtual staining with the HER2match dataset

Authors: Pascal Klöckner, José Teixeira, Diana Montezuma, Jaime S. Cardoso, Hugo M. Horlings, Sara P. Oliveira
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18484
Pdf URL: https://arxiv.org/pdf/2506.18484
Copy Paste: [[2506.18484]] GANs vs. Diffusion Models for virtual staining with the HER2match dataset(https://arxiv.org/abs/2506.18484)
Keywords: diffusion, generative
Abstract: Virtual staining is a promising technique that uses deep generative models to recreate histological stains, providing a faster and more cost-effective alternative to traditional tissue chemical staining. Specifically for H&E-HER2 staining transfer, despite a rising trend in publications, the lack of sufficient public datasets has hindered progress in the topic. Additionally, it is currently unclear which model frameworks perform best for this particular task. In this paper, we introduce the HER2match dataset, the first publicly available dataset with the same breast cancer tissue sections stained with both H&E and HER2. Furthermore, we compare the performance of several Generative Adversarial Networks (GANs) and Diffusion Models (DMs), and implement a novel Brownian Bridge Diffusion Model for H&E-HER2 translation. Our findings indicate that, overall, GANs perform better than DMs, with only the BBDM achieving comparable results. Furthermore, we emphasize the importance of data alignment, as all models trained on HER2match produced vastly improved visuals compared to the widely used consecutive-slide BCI dataset. This research provides a new high-quality dataset ([available upon publication acceptance]), improving both model training and evaluation. In addition, our comparison of frameworks offers valuable guidance for researchers working on the topic.

Title: MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models

Authors: Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18485
Pdf URL: https://arxiv.org/pdf/2506.18485
Copy Paste: [[2506.18485]] MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models(https://arxiv.org/abs/2506.18485)
Keywords: in-context
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful learn-to-reason paradigm for Large Language Models (LLMs) to tackle complex reasoning tasks. However, existing RLVR methods overlook one of the most distinctive capabilities of LLMs, their in-context learning ability, as prominently demonstrated by the success of Chain-of-Thought (CoT) prompting. This motivates us to explore how reinforcement learning can be effectively combined with in-context learning to better improve the reasoning capabilities of LLMs. In this paper, we introduce Motivation-enhanced Reinforcement Finetuning} (MeRF), an intuitive yet effective method enhancing reinforcement learning of LLMs by involving ``telling LLMs the rules of the game''. Specifically, MeRF directly injects the reward specification into the prompt, which serves as an in-context motivation for model to improve its responses with awareness of the optimization objective. This simple modification leverages the in-context learning ability of LLMs aligning generation with optimization, thereby incentivizing the model to generate desired outputs from both inner motivation and external reward. Empirical evaluations on the Knights and Knaves~(K&K) logic puzzle reasoning benchmark demonstrate that \texttt{MeRF} achieves substantial performance gains over baselines. Moreover, ablation studies show that performance improves with greater consistency between the in-context motivation and the external reward function, while the model also demonstrates an ability to adapt to misleading motivations through reinforcement learning.

Title: Multi-Scale Representation of Follicular Lymphoma Pathology Images in a Single Hyperbolic Space

Authors: Kei Taguchi, Kazumasa Ohara, Tatsuya Yokota, Hiroaki Miyoshi, Noriaki Hashimoto, Ichiro Takeuchi, Hidekata Hontani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18523
Pdf URL: https://arxiv.org/pdf/2506.18523
Copy Paste: [[2506.18523]] Multi-Scale Representation of Follicular Lymphoma Pathology Images in a Single Hyperbolic Space(https://arxiv.org/abs/2506.18523)
Keywords: self-supervised
Abstract: We propose a method for representing malignant lymphoma pathology images, from high-resolution cell nuclei to low-resolution tissue images, within a single hyperbolic space using self-supervised learning. To capture morphological changes that occur across scales during disease progression, our approach embeds tissue and corresponding nucleus images close to each other based on inclusion relationships. Using the Poincaré ball as the feature space enables effective encoding of this hierarchical structure. The learned representations capture both disease state and cell type variations.

Title: Auto-Regressively Generating Multi-View Consistent Images

Authors: JiaKui Hu, Yuxiao Yang, Jialun Liu, Jinbo Wu, Chen Zhao, Yanye Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18527
Pdf URL: https://arxiv.org/pdf/2506.18527
Copy Paste: [[2506.18527]] Auto-Regressively Generating Multi-View Consistent Images(https://arxiv.org/abs/2506.18527)
Keywords: diffusion
Abstract: Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (MV-AR) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architecture designing and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model through the randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, thus significantly expanding the training data by several magnitudes. Experiments demonstrate the performance and versatility of our MV-AR, which consistently generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models. Code and models will be released at this https URL.

Title: End-to-End Spoken Grammatical Error Correction

Authors: Mengjie Qian, Rao Ma, Stefano Bannò, Mark J.F. Gales, Kate M. Knill
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.18532
Pdf URL: https://arxiv.org/pdf/2506.18532
Copy Paste: [[2506.18532]] End-to-End Spoken Grammatical Error Correction(https://arxiv.org/abs/2506.18532)
Keywords: foundation model
Abstract: Grammatical Error Correction (GEC) and feedback play a vital role in supporting second language (L2) learners, educators, and examiners. While written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback based on learners' speech, poses additional challenges due to disfluencies, transcription errors, and the lack of structured input. SGEC systems typically follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR), disfluency detection, and GEC, making them vulnerable to error propagation across modules. This work examines an End-to-End (E2E) framework for SGEC and feedback generation, highlighting challenges and possible solutions when developing these systems. Cascaded, partial-cascaded and E2E architectures are compared, all built on the Whisper foundation model. A challenge for E2E systems is the scarcity of GEC labeled spoken data. To address this, an automatic pseudo-labeling framework is examined, increasing the training data from 77 to over 2500 hours. To improve the accuracy of the SGEC system, additional contextual information, exploiting the ASR output, is investigated. Candidate feedback of their mistakes is an essential step to improving performance. In E2E systems the SGEC output must be compared with an estimate of the fluent transcription to obtain the feedback. To improve the precision of this feedback, a novel reference alignment process is proposed that aims to remove hypothesised edits that results from fluent transcription errors. Finally, these approaches are combined with an edit confidence estimation approach, to exclude low-confidence edits. Experiments on the in-house Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I) corpus show that the proposed approaches significantly boost E2E SGEC performance.

Title: Normality Prior Guided Multi-Semantic Fusion Network for Unsupervised Image Anomaly Detection

Authors: Muhao Xu, Xueying Zhou, Xizhan Gao, Weiye Song, Guang Feng, Sijie Niu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18544
Pdf URL: https://arxiv.org/pdf/2506.18544
Copy Paste: [[2506.18544]] Normality Prior Guided Multi-Semantic Fusion Network for Unsupervised Image Anomaly Detection(https://arxiv.org/abs/2506.18544)
Keywords: anomaly
Abstract: Recently, detecting logical anomalies is becoming a more challenging task compared to detecting structural ones. Existing encoder decoder based methods typically compress inputs into low-dimensional bottlenecks on the assumption that the compression process can effectively suppress the transmission of logical anomalies to the decoder. However, logical anomalies present a particular difficulty because, while their local features often resemble normal semantics, their global semantics deviate significantly from normal patterns. Thanks to the generalisation capabilities inherent in neural networks, these abnormal semantic features can propagate through low-dimensional bottlenecks. This ultimately allows the decoder to reconstruct anomalous images with misleading fidelity. To tackle the above challenge, we propose a novel normality prior guided multi-semantic fusion network for unsupervised anomaly detection. Instead of feeding the compressed bottlenecks to the decoder directly, we introduce the multi-semantic features of normal samples into the reconstruction process. To this end, we first extract abstract global semantics of normal cases by a pre-trained vision-language network, then the learnable semantic codebooks are constructed to store representative feature vectors of normal samples by vector quantisation. Finally, the above multi-semantic features are fused and employed as input to the decoder to guide the reconstruction of anomalies to approximate normality. Extensive experiments are conducted to validate the effectiveness of our proposed method, and it achieves the SOTA performance on the MVTec LOCO AD dataset with improvements of 5.7% in pixel-sPRO and 2.6% in image-AUROC. The source code is available at this https URL.

Title: Resampling Augmentation for Time Series Contrastive Learning: Application to Remote Sensing

Authors: Antoine Saget, Baptiste Lafabregue, Antoine Cornuéjols, Pierre Gançarski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18587
Pdf URL: https://arxiv.org/pdf/2506.18587
Copy Paste: [[2506.18587]] Resampling Augmentation for Time Series Contrastive Learning: Application to Remote Sensing(https://arxiv.org/abs/2506.18587)
Keywords: self-supervised
Abstract: Given the abundance of unlabeled Satellite Image Time Series (SITS) and the scarcity of labeled data, contrastive self-supervised pretraining emerges as a natural tool to leverage this vast quantity of unlabeled data. However, designing effective data augmentations for contrastive learning remains challenging for time series. We introduce a novel resampling-based augmentation strategy that generates positive pairs by upsampling time series and extracting disjoint subsequences while preserving temporal coverage. We validate our approach on multiple agricultural classification benchmarks using Sentinel-2 imagery, showing that it outperforms common alternatives such as jittering, resizing, and masking. Further, we achieve state-of-the-art performance on the S2-Agri100 dataset without employing spatial information or temporal encodings, surpassing more complex masked-based SSL frameworks. Our method offers a simple, yet effective, contrastive learning augmentation for remote sensing time series.

Title: No Training Wheels: Steering Vectors for Bias Correction at Inference Time

Authors: Aviral Gupta, Armaan Sethi, Ameesh Sethi
Subjects: cs.LG, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2506.18598
Pdf URL: https://arxiv.org/pdf/2506.18598
Copy Paste: [[2506.18598]] No Training Wheels: Steering Vectors for Bias Correction at Inference Time(https://arxiv.org/abs/2506.18598)
Keywords: generative
Abstract: Neural network classifiers trained on datasets with uneven group representation often inherit class biases and learn spurious correlations. These models may perform well on average but consistently fail on atypical groups. For example, in hair color classification, datasets may over-represent females with blond hair, reinforcing stereotypes. Although various algorithmic and data-centric methods have been proposed to address such biases, they often require retraining or significant compute. In this work, we propose a cheap, training-free method inspired by steering vectors used to edit behaviors in large language models. We compute the difference in mean activations between majority and minority groups to define a "bias vector," which we subtract from the model's residual stream. This leads to reduced classification bias and improved worst-group accuracy. We explore multiple strategies for extracting and applying these vectors in transformer-like classifiers, showing that steering vectors, traditionally used in generative models, can also be effective in classification. More broadly, we showcase an extremely cheap, inference time, training free method to mitigate bias in classification models.

Title: Simulation-Free Differential Dynamics through Neural Conservation Laws

Authors: Mengjian Hua, Eric Vanden-Eijnden, Ricky T.Q. Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18604
Pdf URL: https://arxiv.org/pdf/2506.18604
Copy Paste: [[2506.18604]] Simulation-Free Differential Dynamics through Neural Conservation Laws(https://arxiv.org/abs/2506.18604)
Keywords: diffusion, generative
Abstract: We present a novel simulation-free framework for training continuous-time diffusion processes over very general objective functions. Existing methods typically involve either prescribing the optimal diffusion process -- which only works for heavily restricted problem formulations -- or require expensive simulation to numerically obtain the time-dependent densities and sample from the diffusion process. In contrast, we propose a coupled parameterization which jointly models a time-dependent density function, or probability path, and the dynamics of a diffusion process that generates this probability path. To accomplish this, our approach directly bakes in the Fokker-Planck equation and density function requirements as hard constraints, by extending and greatly simplifying the construction of Neural Conservation Laws. This enables simulation-free training for a large variety of problem formulations, from data-driven objectives as in generative modeling and dynamical optimal transport, to optimality-based objectives as in stochastic optimal control, with straightforward extensions to mean-field objectives due to the ease of accessing exact density functions. We validate our method in a diverse range of application domains from modeling spatio-temporal events to learning optimal dynamics from population data.

Title: ReDit: Reward Dithering for Improved LLM Policy Optimization

Authors: Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.18631
Pdf URL: https://arxiv.org/pdf/2506.18631
Copy Paste: [[2506.18631]] ReDit: Reward Dithering for Improved LLM Policy Optimization(https://arxiv.org/abs/2506.18631)
Keywords: anomaly
Abstract: DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it's a ''perfect'' reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomaly, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence. The injected noise also introduces stochasticity into flat reward regions, encouraging the model to explore novel policies and escape local optima. Experiments across diverse tasks demonstrate the effectiveness and efficiency of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO with only approximately 10% the training steps, and furthermore, still exhibits a 4% performance improvement over vanilla GRPO when trained for a similar duration. Visualizations confirm significant mitigation of gradient issues with ReDit. Moreover, theoretical analyses are provided to further validate these advantages.

Title: Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping

Authors: Pablo Meseguer, Rocío del Amor, Valery Naranjo
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18668
Pdf URL: https://arxiv.org/pdf/2506.18668
Copy Paste: [[2506.18668]] Benchmarking histopathology foundation models in a multi-center dataset for skin cancer subtyping(https://arxiv.org/abs/2506.18668)
Keywords: foundation model
Abstract: Pretraining on large-scale, in-domain datasets grants histopathology foundation models (FM) the ability to learn task-agnostic data representations, enhancing transfer learning on downstream tasks. In computational pathology, automated whole slide image analysis requires multiple instance learning (MIL) frameworks due to the gigapixel scale of the slides. The diversity among histopathology FMs has highlighted the need to design real-world challenges for evaluating their effectiveness. To bridge this gap, our work presents a novel benchmark for evaluating histopathology FMs as patch-level feature extractors within a MIL classification framework. For that purpose, we leverage the AI4SkIN dataset, a multi-center cohort encompassing slides with challenging cutaneous spindle cell neoplasm subtypes. We also define the Foundation Model - Silhouette Index (FM-SI), a novel metric to measure model consistency against distribution shifts. Our experimentation shows that extracting less biased features enhances classification performance, especially in similarity-based MIL classifiers.

Title: MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation

Authors: Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen, Guohao Peng, Zhenyu Wu, Jingchuan Wang, Lihua Xie, Danwei Wang, Hesheng Wang, Weidong Chen
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2506.18678
Pdf URL: https://arxiv.org/pdf/2506.18678
Copy Paste: [[2506.18678]] MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation(https://arxiv.org/abs/2506.18678)
Keywords: foundation model
Abstract: Neural implicit scene representations have recently shown promising results in dense visual SLAM. However, existing implicit SLAM algorithms are constrained to single-agent scenarios, and fall difficulties in large-scale scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks cannot meet the constraints of communication bandwidth. To this end, we propose the first distributed multi-agent collaborative neural SLAM framework with hybrid scene representation, distributed camera tracking, intra-to-inter loop closure, and online distillation for multiple submap fusion. A novel triplane-grid joint scene representation method is proposed to improve scene reconstruction. A novel intra-to-inter loop closure method is designed to achieve local (single-agent) and global (multi-agent) consistency. We also design a novel online distillation method to fuse the information of different submaps to achieve global consistency. Furthermore, to the best of our knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that provides both continuous-time trajectories groundtruth and high-accuracy 3D meshes groundtruth. To this end, we propose the first real-world Dense slam (DES) dataset covering both single-agent and multi-agent scenarios, ranging from small rooms to large-scale outdoor scenes, with high-accuracy ground truth for both 3D mesh and continuous-time camera trajectory. This dataset can advance the development of the research in both SLAM, 3D reconstruction, and visual foundation model. Experiments on various datasets demonstrate the superiority of the proposed method in both mapping, tracking, and communication. The dataset and code will open-source on this https URL.

Title: Matrix-Game: Interactive World Foundation Model

Authors: Yifan Zhang, Chunli Peng, Boyang Wang, Puyi Wang, Qingcheng Zhu, Fei Kang, Biao Jiang, Zedong Gao, Eric Li, Yang Liu, Yahui Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18701
Pdf URL: https://arxiv.org/pdf/2506.18701
Copy Paste: [[2506.18701]] Matrix-Game: Interactive World Foundation Model(https://arxiv.org/abs/2506.18701)
Keywords: foundation model
Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at this https URL.

Title: Benchmarking the Pedagogical Knowledge of Large Language Models

Authors: Maxime Lelièvre, Amy Waldock, Meng Liu, Natalia Valdés Aspillaga, Alasdair Mackintosh, María José Ogando Portelo, Jared Lee, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18710
Pdf URL: https://arxiv.org/pdf/2506.18710
Copy Paste: [[2506.18710]] Benchmarking the Pedagogical Knowledge of Large Language Models(https://arxiv.org/abs/2506.18710)
Keywords: generative
Abstract: Benchmarks like Massive Multitask Language Understanding (MMLU) have played a pivotal role in evaluating AI's knowledge and abilities across diverse domains. However, existing benchmarks predominantly focus on content knowledge, leaving a critical gap in assessing models' understanding of pedagogy - the method and practice of teaching. This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their Cross-Domain Pedagogical Knowledge (CDPK) and Special Education Needs and Disability (SEND) pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers, which cover a range of pedagogical subdomains such as teaching strategies and assessment methods. Here we outline the methodology and development of these benchmarks. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions. We consider the relationship between cost and accuracy and chart the progression of the Pareto value frontier over time. We provide online leaderboards at this https URL which are updated with new models and allow interactive exploration and filtering based on various model properties, such as cost per token and open-vs-closed weights, as well as looking at performance in different subjects. LLMs and generative AI have tremendous potential to influence education and help to address the global learning crisis. Education-focused benchmarks are crucial to measure models' capacities to understand pedagogical concepts, respond appropriately to learners' needs, and support effective teaching practices across diverse contexts. They are needed for informing the responsible and evidence-based deployment of LLMs and LLM-based tools in educational settings, and for guiding both development and policy decisions.

Title: Towards Group Fairness with Multiple Sensitive Attributes in Federated Foundation Models

Authors: Yuning Yang, Han Yu, Tianrun Gao, Xiaodong Xu, Guangyu Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.18732
Pdf URL: https://arxiv.org/pdf/2506.18732
Copy Paste: [[2506.18732]] Towards Group Fairness with Multiple Sensitive Attributes in Federated Foundation Models(https://arxiv.org/abs/2506.18732)
Keywords: foundation model
Abstract: The deep integration of foundation models (FM) with federated learning (FL) enhances personalization and scalability for diverse downstream tasks, making it crucial in sensitive domains like healthcare. Achieving group fairness has become an increasingly prominent issue in the era of federated foundation models (FFMs), since biases in sensitive attributes might lead to inequitable treatment for under-represented demographic groups. Existing studies mostly focus on achieving fairness with respect to a single sensitive attribute. This renders them unable to provide clear interpretability of dependencies among multiple sensitive attributes which is required to achieve group fairness. Our paper takes the first attempt towards a causal analysis of the relationship between group fairness across various sensitive attributes in the FFM. We extend the FFM structure to trade off multiple sensitive attributes simultaneously and quantify the causal effect behind the group fairness through causal discovery and inference. Extensive experiments validate its effectiveness, offering insights into interpretability towards building trustworthy and fair FFM systems.

Title: ContinualFlow: Learning and Unlearning with Neural Flow Matching

Authors: Lorenzo Simone, Davide Bacciu, Shuangge Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18747
Pdf URL: https://arxiv.org/pdf/2506.18747
Copy Paste: [[2506.18747]] ContinualFlow: Learning and Unlearning with Neural Flow Matching(https://arxiv.org/abs/2506.18747)
Keywords: generative
Abstract: We introduce ContinualFlow, a principled framework for targeted unlearning in generative models via Flow Matching. Our method leverages an energy-based reweighting loss to softly subtract undesired regions of the data distribution without retraining from scratch or requiring direct access to the samples to be unlearned. Instead, it relies on energy-based proxies to guide the unlearning process. We prove that this induces gradients equivalent to Flow Matching toward a soft mass-subtracted target, and validate the framework through experiments on 2D and image domains, supported by interpretable visualizations and quantitative evaluations.

Title: 3D Arena: An Open Platform for Generative 3D Evaluation

Authors: Dylan Ebert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18787
Pdf URL: https://arxiv.org/pdf/2506.18787
Copy Paste: [[2506.18787]] 3D Arena: An Open Platform for Generative 3D Evaluation(https://arxiv.org/abs/2506.18787)
Keywords: generative
Abstract: Evaluating Generative 3D models remains challenging due to misalignment between automated metrics and human perception of quality. Current benchmarks rely on image-based metrics that ignore 3D structure or geometric measures that fail to capture perceptual appeal and real-world utility. To address this gap, we present 3D Arena, an open platform for evaluating image-to-3D generation models through large-scale human preference collection using pairwise comparisons. Since launching in June 2024, the platform has collected 123,243 votes from 8,096 users across 19 state-of-the-art models, establishing the largest human preference evaluation for Generative 3D. We contribute the iso3d dataset of 100 evaluation prompts and demonstrate quality control achieving 99.75% user authenticity through statistical fraud detection. Our ELO-based ranking system provides reliable model assessment, with the platform becoming an established evaluation resource. Through analysis of this preference data, we present insights into human preference patterns. Our findings reveal preferences for visual presentation features, with Gaussian splat outputs achieving a 16.6 ELO advantage over meshes and textured models receiving a 144.1 ELO advantage over untextured models. We provide recommendations for improving evaluation methods, including multi-criteria assessment, task-oriented evaluation, and format-aware comparison. The platform's community engagement establishes 3D Arena as a benchmark for the field while advancing understanding of human-centered evaluation in Generative 3D.

Title: ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs

Authors: Michal Nazarczuk, Sibi Catley-Chandar, Thomas Tanay, Zhensong Zhang, Gregory Slabaugh, Eduardo Pérez-Pellitero
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18792
Pdf URL: https://arxiv.org/pdf/2506.18792
Copy Paste: [[2506.18792]] ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs(https://arxiv.org/abs/2506.18792)
Keywords: diffusion
Abstract: Dynamic Novel View Synthesis aims to generate photorealistic views of moving subjects from arbitrary viewpoints. This task is particularly challenging when relying on monocular video, where disentangling structure from motion is ill-posed and supervision is scarce. We introduce Video Diffusion-Aware Reconstruction (ViDAR), a novel 4D reconstruction framework that leverages personalised diffusion models to synthesise a pseudo multi-view supervision signal for training a Gaussian splatting representation. By conditioning on scene-specific features, ViDAR recovers fine-grained appearance details while mitigating artefacts introduced by monocular ambiguity. To address the spatio-temporal inconsistency of diffusion-based supervision, we propose a diffusion-aware loss function and a camera pose optimisation strategy that aligns synthetic views with the underlying scene geometry. Experiments on DyCheck, a challenging benchmark with extreme viewpoint variation, show that ViDAR outperforms all state-of-the-art baselines in visual quality and geometric consistency. We further highlight ViDAR's strong improvement over baselines on dynamic regions and provide a new benchmark to compare performance in reconstructing motion-rich parts of the scene. Project page: this https URL

Title: RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies

Authors: Arjun Mukerji, Michael L. Jackson, Jason Jones, Neil Sanghavi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.18819
Pdf URL: https://arxiv.org/pdf/2506.18819
Copy Paste: [[2506.18819]] RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies(https://arxiv.org/abs/2506.18819)
Keywords: foundation model
Abstract: Large Language Models (LLMs) have been extensively evaluated for general summarization tasks as well as medical research assistance, but they have not been specifically evaluated for the task of summarizing real-world evidence (RWE) from structured output of RWE studies. We introduce RWESummary, a proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al., 2025) to enable benchmarking of LLMs for this task. RWESummary includes one scenario and three evaluations covering major types of errors observed in summarization of medical research studies and was developed using Atropos Health proprietary data. Additionally, we use RWESummary to compare the performance of different LLMs in our internal RWE summarization tool. At the time of publication, with 13 distinct RWE studies, we found the Gemini 2.5 models performed best overall (both Flash and Pro). We suggest RWESummary as a novel and useful foundation model benchmark for real-world evidence study summarization.

Title: 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation

Authors: Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Peter Wonka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18839
Pdf URL: https://arxiv.org/pdf/2506.18839
Copy Paste: [[2506.18839]] 4Real-Video-V2: Fused View-Time Attention and Feedforward Reconstruction for 4D Scene Generation(https://arxiv.org/abs/2506.18839)
Keywords: diffusion
Abstract: We propose the first framework capable of computing a 4D spatio-temporal grid of video frames and 3D Gaussian particles for each time step using a feed-forward architecture. Our architecture has two main components, a 4D video model and a 4D reconstruction model. In the first part, we analyze current 4D video diffusion architectures that perform spatial and temporal attention either sequentially or in parallel within a two-stream design. We highlight the limitations of existing approaches and introduce a novel fused architecture that performs spatial and temporal attention within a single layer. The key to our method is a sparse attention pattern, where tokens attend to others in the same frame, at the same timestamp, or from the same viewpoint. In the second part, we extend existing 3D reconstruction algorithms by introducing a Gaussian head, a camera token replacement algorithm, and additional dynamic layers and training. Overall, we establish a new state of the art for 4D generation, improving both visual quality and reconstruction capability.

Title: OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

Authors: Qijun Gan, Ruizi Yang, Jianke Zhu, Shaofei Xue, Steven Hoi
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2506.18866
Pdf URL: https://arxiv.org/pdf/2506.18866
Copy Paste: [[2506.18866]] OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation(https://arxiv.org/abs/2506.18866)
Keywords: foundation model
Abstract: Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is this https URL.

Title: OmniGen2: Exploration to Advanced Multimodal Generation

Authors: Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.18871
Pdf URL: https://arxiv.org/pdf/2506.18871
Copy Paste: [[2506.18871]] OmniGen2: Exploration to Advanced Multimodal Generation(https://arxiv.org/abs/2506.18871)
Keywords: generative, in-context
Abstract: In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: this https URL GitHub Link: this https URL

Title: Let Your Video Listen to Your Music!

Authors: Xinyu Zhang, Dong Gong, Zicheng Duan, Anton van den Hengel, Lingqiao Liu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.18881
Pdf URL: https://arxiv.org/pdf/2506.18881
Copy Paste: [[2506.18881]] Let Your Video Listen to Your Music!(https://arxiv.org/abs/2506.18881)
Keywords: diffusion, generative
Abstract: Aligning the rhythm of visual motion in a video with a given music track is a practical need in multimedia production, yet remains an underexplored task in autonomous video editing. Effective alignment between motion and musical beats enhances viewer engagement and visual appeal, particularly in music videos, promotional content, and cinematic editing. Existing methods typically depend on labor-intensive manual cutting, speed adjustments, or heuristic-based editing techniques to achieve synchronization. While some generative models handle joint video and music generation, they often entangle the two modalities, limiting flexibility in aligning video to music beats while preserving the full visual content. In this paper, we propose a novel and efficient framework, termed MVAA (Music-Video Auto-Alignment), that automatically edits video to align with the rhythm of a given music track while preserving the original visual content. To enhance flexibility, we modularize the task into a two-step process in our MVAA: aligning motion keyframes with audio beats, followed by rhythm-aware video inpainting. Specifically, we first insert keyframes at timestamps aligned with musical beats, then use a frame-conditioned diffusion model to generate coherent intermediate frames, preserving the original video's semantic content. Since comprehensive test-time training can be time-consuming, we adopt a two-stage strategy: pretraining the inpainting module on a small video set to learn general motion priors, followed by rapid inference-time fine-tuning for video-specific adaptation. This hybrid approach enables adaptation within 10 minutes with one epoch on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone. Extensive experiments show that our approach can achieve high-quality beat alignment and visual smoothness.

Title: Universal Video Temporal Grounding with Generative Multi-modal Large Language Models

Authors: Zeqian Li, Shangzhe Di, Zhonghua Zhai, Weilin Huang, Yanfeng Wang, Weidi Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18883
Pdf URL: https://arxiv.org/pdf/2506.18883
Copy Paste: [[2506.18883]] Universal Video Temporal Grounding with Generative Multi-modal Large Language Models(https://arxiv.org/abs/2506.18883)
Keywords: generative
Abstract: This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries (e.g., questions or descriptions). Unlike existing methods that are often limited to specific video domains or durations, we propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries. The key contributions include: (i) We consider steering strong MLLMs for temporal grounding in videos. To enable precise timestamp outputs, we incorporate temporal information by interleaving timestamp tokens with video tokens. (ii) By training the model to handle videos with different input granularities through adaptive frame scaling, our approach achieves robust temporal grounding for both short and long videos. (iii) Comprehensive experiments show that UniTime outperforms state-of-the-art approaches in both zero-shot and dataset-specific finetuned settings across five public temporal grounding benchmarks. (iv) When employed as a preliminary moment retriever for long-form video question-answering (VideoQA), UniTime significantly improves VideoQA accuracy, highlighting its value for complex video understanding tasks.

Title: 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time

Authors: Ziqiao Ma, Xuweiyi Chen, Shoubin Yu, Sai Bi, Kai Zhang, Chen Ziwen, Sihan Xu, Jianing Yang, Zexiang Xu, Kalyan Sunkavalli, Mohit Bansal, Joyce Chai, Hao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18890
Pdf URL: https://arxiv.org/pdf/2506.18890
Copy Paste: [[2506.18890]] 4D-LRM: Large Space-Time Reconstruction Model From and To Any View at Any Time(https://arxiv.org/abs/2506.18890)
Keywords: generative
Abstract: Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at some times to any view at any time? We provide an affirmative answer with 4D-LRM, the first large-scale 4D reconstruction model that takes input from unconstrained views and timestamps and renders arbitrary novel view-time combinations. Unlike prior 4D approaches, e.g., optimization-based, geometry-based, or generative, that struggle with efficiency, generalization, or faithfulness, 4D-LRM learns a unified space-time representation and directly predicts per-pixel 4D Gaussian primitives from posed image tokens across time, enabling fast, high-quality rendering at, in principle, infinite frame rate. Our results demonstrate that scaling spatiotemporal pretraining enables accurate and efficient 4D reconstruction. We show that 4D-LRM generalizes to novel objects, interpolates across time, and handles diverse camera setups. It reconstructs 24-frame sequences in one forward pass with less than 1.5 seconds on a single A100 GPU.

Title: Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2506.18898
Pdf URL: https://arxiv.org/pdf/2506.18898
Copy Paste: [[2506.18898]] Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations(https://arxiv.org/abs/2506.18898)
Keywords: diffusion, generative
Abstract: This paper presents a multimodal framework that attempts to unify visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at this https URL

Title: FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation

Authors: Kaiyi Huang, Yukun Huang, Xintao Wang, Zinan Lin, Xuefei Ning, Pengfei Wan, Di Zhang, Yu Wang, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18899
Pdf URL: https://arxiv.org/pdf/2506.18899
Copy Paste: [[2506.18899]] FilMaster: Bridging Cinematic Principles and Generative AI for Automated Film Generation(https://arxiv.org/abs/2506.18899)
Keywords: generative
Abstract: AI-driven content creation has shown potential in film production. However, existing film generation systems struggle to implement cinematic principles and thus fail to generate professional-quality films, particularly lacking diverse camera language and cinematic rhythm. This results in templated visuals and unengaging narratives. To address this, we introduce FilMaster, an end-to-end AI system that integrates real-world cinematic principles for professional-grade film generation, yielding editable, industry-standard outputs. FilMaster is built on two key principles: (1) learning cinematography from extensive real-world film data and (2) emulating professional, audience-centric post-production workflows. Inspired by these principles, FilMaster incorporates two stages: a Reference-Guided Generation Stage which transforms user input to video clips, and a Generative Post-Production Stage which transforms raw footage into audiovisual outputs by orchestrating visual and auditory elements for cinematic rhythm. Our generation stage highlights a Multi-shot Synergized RAG Camera Language Design module to guide the AI in generating professional camera language by retrieving reference clips from a vast corpus of 440,000 film clips. Our post-production stage emulates professional workflows by designing an Audience-Centric Cinematic Rhythm Control module, including Rough Cut and Fine Cut processes informed by simulated audience feedback, for effective integration of audiovisual elements to achieve engaging content. The system is empowered by generative AI models like (M)LLMs and video generation models. Furthermore, we introduce FilmEval, a comprehensive benchmark for evaluating AI-generated films. Extensive experiments show FilMaster's superior performance in camera language design and cinematic rhythm control, advancing generative AI in professional filmmaking.

Title: Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models

Authors: Kiymet Akdemir, Tahira Kazimi, Pinar Yanardag
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.18900
Pdf URL: https://arxiv.org/pdf/2506.18900
Copy Paste: [[2506.18900]] Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.18900)
Keywords: diffusion
Abstract: Story visualization has become a popular task where visual scenes are generated to depict a narrative across multiple panels. A central challenge in this setting is maintaining visual consistency, particularly in how characters and objects persist and evolve throughout the story. Despite recent advances in diffusion models, current approaches often fail to preserve key character attributes, leading to incoherent narratives. In this work, we propose a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies across multi-panel story visualizations. The agents operate in an iterative loop, enabling fine-grained, panel-level updates without re-generating entire sequences. Our framework is model-agnostic and flexibly integrates with a variety of diffusion models, including rectified flow transformers such as Flux and latent diffusion models such as Stable Diffusion. Quantitative and qualitative experiments show that our method outperforms prior approaches in terms of multi-panel consistency.