2025-06-09

Title: AI-Driven Dynamic Firewall Optimization Using Reinforcement Learning for Anomaly Detection and Prevention

Authors: Taimoor Ahmad
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05356
Pdf URL: https://arxiv.org/pdf/2506.05356
Copy Paste: [[2506.05356]] AI-Driven Dynamic Firewall Optimization Using Reinforcement Learning for Anomaly Detection and Prevention(https://arxiv.org/abs/2506.05356)
Keywords: anomaly
Abstract: The growing complexity of cyber threats has rendered static firewalls increasingly ineffective for dynamic, real-time intrusion prevention. This paper proposes a novel AI-driven dynamic firewall optimization framework that leverages deep reinforcement learning (DRL) to autonomously adapt and update firewall rules in response to evolving network threats. Our system employs a Markov Decision Process (MDP) formulation, where the RL agent observes network states, detects anomalies using a hybrid LSTM-CNN model, and dynamically modifies firewall configurations to mitigate risks. We train and evaluate our framework on the NSL-KDD and CIC-IDS2017 datasets using a simulated software-defined network environment. Results demonstrate significant improvements in detection accuracy, false positive reduction, and rule update latency when compared to traditional signature- and behavior-based firewalls. The proposed method provides a scalable, autonomous solution for enhancing network resilience against complex attack vectors in both enterprise and critical infrastructure settings.

Title: Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching

Authors: Tinglin Huang, Tianyu Liu, Mehrtash Babadi, Wengong Jin, Rex Ying
Subjects: cs.CV, q-bio.GN
Abstract URL: https://arxiv.org/abs/2506.05361
Pdf URL: https://arxiv.org/pdf/2506.05361
Copy Paste: [[2506.05361]] Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching(https://arxiv.org/abs/2506.05361)
Keywords: foundation model, generative
Abstract: Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.

Title: Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion

Authors: Yui Tatsumi, Ziyue Zeng, Hiroshi Watanabe
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05363
Pdf URL: https://arxiv.org/pdf/2506.05363
Copy Paste: [[2506.05363]] Seed Selection for Human-Oriented Image Reconstruction via Guided Diffusion(https://arxiv.org/abs/2506.05363)
Keywords: diffusion
Abstract: Conventional methods for scalable image coding for humans and machines require the transmission of additional information to achieve scalability. A recent diffusion-based method avoids this by generating human-oriented images from machine-oriented images without extra bitrate. This method, however, uses a single random seed, which may lead to suboptimal image quality. In this paper, we propose a seed selection method that identifies the optimal seed from multiple candidates to improve image quality without increasing the bitrate. To reduce computational cost, the selection is performed based on intermediate outputs obtained from early steps of the reverse diffusion process. Experimental results demonstrate that our method outperforms the baseline across multiple metrics.

Title: Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards

Authors: Aakash Garg, Libing Zeng, Andrii Tsarov, Nima Khademi Kalantari
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05367
Pdf URL: https://arxiv.org/pdf/2506.05367
Copy Paste: [[2506.05367]] Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards(https://arxiv.org/abs/2506.05367)
Keywords: diffusion
Abstract: In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.

Title: Speaking images. A novel framework for the automated self-description of artworks

Authors: Valentine Bernasconi, Gustavo Marfia
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05368
Pdf URL: https://arxiv.org/pdf/2506.05368
Copy Paste: [[2506.05368]] Speaking images. A novel framework for the automated self-description of artworks(https://arxiv.org/abs/2506.05368)
Keywords: generative
Abstract: Recent breakthroughs in generative AI have opened the door to new research perspectives in the domain of art and cultural heritage, where a large number of artifacts have been digitized. There is a need for innovation to ease the access and highlight the content of digital collections. Such innovations develop into creative explorations of the digital image in relation to its malleability and contemporary interpretation, in confrontation to the original historical object. Based on the concept of the autonomous image, we propose a new framework towards the production of self-explaining cultural artifacts using open-source large-language, face detection, text-to-speech and audio-to-animation models. The goal is to start from a digitized artwork and to automatically assemble a short video of the latter where the main character animates to explain its content. The whole process questions cultural biases encapsulated in large-language models, the potential of digital images and deepfakes of artworks for educational purposes, along with concerns of the field of art history regarding such creative diversions.

Title: An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos

Authors: Shayantani Kar, B. Shresth Bhimrajka, Aditya Kumar, Sahil Gupta, Sourav Ghosh, Subhamita Mukherjee, Shauvik Paul
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05377
Pdf URL: https://arxiv.org/pdf/2506.05377
Copy Paste: [[2506.05377]] An Independent Discriminant Network Towards Identification of Counterfeit Images and Videos(https://arxiv.org/abs/2506.05377)
Keywords: generative
Abstract: Rapid spread of false images and videos on online platforms is an emerging problem. Anyone may add, delete, clone or modify people and entities from an image using various editing software which are readily available. This generates false and misleading proof to hide the crime. Now-a-days, these false and counterfeit images and videos are flooding on the internet. These spread false information. Many methods are available in literature for detecting those counterfeit contents but new methods of counterfeiting are also evolving. Generative Adversarial Networks (GAN) are observed to be one effective method as it modifies the context and definition of images producing plausible results via image-to-image translation. This work uses an independent discriminant network that can identify GAN generated image or video. A discriminant network has been created using a convolutional neural network based on InceptionResNetV2. The article also proposes a platform where users can detect forged images and videos. This proposed work has the potential to help the forensics domain to detect counterfeit videos and hidden criminal evidence towards the identification of criminal activities.

Title: Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?

Authors: Abu Sufian, Marco Leo, Cosimo Distante, Anirudha Ghosh, Debaditya Barman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05383
Pdf URL: https://arxiv.org/pdf/2506.05383
Copy Paste: [[2506.05383]] Can Vision Transformers with ResNet's Global Features Fairly Authenticate Demographic Faces?(https://arxiv.org/abs/2506.05383)
Keywords: foundation model
Abstract: Biometric face authentication is crucial in computer vision, but ensuring fairness and generalization across demographic groups remains a big challenge. Therefore, we investigated whether Vision Transformer (ViT) and ResNet, leveraging pre-trained global features, can fairly authenticate different demographic faces while relying minimally on local features. In this investigation, we used three pre-trained state-of-the-art (SOTA) ViT foundation models from Facebook, Google, and Microsoft for global features as well as ResNet-18. We concatenated the features from ViT and ResNet, passed them through two fully connected layers, and trained on customized face image datasets to capture the local features. Then, we designed a novel few-shot prototype network with backbone features embedding. We also developed new demographic face image support and query datasets for this empirical study. The network's testing was conducted on this dataset in one-shot, three-shot, and five-shot scenarios to assess how performance improves as the size of the support set increases. We observed results across datasets with varying races/ethnicities, genders, and age groups. The Microsoft Swin Transformer backbone performed better among the three SOTA ViT for this task. The code and data are available at: this https URL.

Title: LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models

Authors: Xinxin Li, Huiyao Chen, Chengjun Liu, Jing Li, Meishan Zhang, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05385
Pdf URL: https://arxiv.org/pdf/2506.05385
Copy Paste: [[2506.05385]] LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models(https://arxiv.org/abs/2506.05385)
Keywords: generative
Abstract: Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.

Title: Attacking Attention of Foundation Models Disrupts Downstream Tasks

Authors: Hondamunige Prasanna Silva, Federico Becattini, Lorenzo Seidenari
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05394
Pdf URL: https://arxiv.org/pdf/2506.05394
Copy Paste: [[2506.05394]] Attacking Attention of Foundation Models Disrupts Downstream Tasks(https://arxiv.org/abs/2506.05394)
Keywords: foundation model
Abstract: Foundation models represent the most prominent and recent paradigm shift in artificial this http URL models are large models, trained on broad data that deliver high accuracy in many downstream tasks, often without fine-tuning. For this reason, models such as CLIP , DINO or Vision Transfomers (ViT), are becoming the bedrock of many industrial AI-powered applications. However, the reliance on pre-trained foundation models also introduces significant security concerns, as these models are vulnerable to adversarial attacks. Such attacks involve deliberately crafted inputs designed to deceive AI systems, jeopardizing their this http URL paper studies the vulnerabilities of vision foundation models, focusing specifically on CLIP and ViTs, and explores the transferability of adversarial attacks to downstream tasks. We introduce a novel attack, targeting the structure of transformer-based architectures in a task-agnostic this http URL demonstrate the effectiveness of our attack on several downstream tasks: classification, captioning, image/text retrieval, segmentation and depth estimation.

Title: Poisoning Behavioral-based Worker Selection in Mobile Crowdsensing using Generative Adversarial Networks

Authors: Ruba Nasser, Ahmed Alagha, Shakti Singh, Rabeb Mizouni, Hadi Otrok, Jamal Bentahar
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05403
Pdf URL: https://arxiv.org/pdf/2506.05403
Copy Paste: [[2506.05403]] Poisoning Behavioral-based Worker Selection in Mobile Crowdsensing using Generative Adversarial Networks(https://arxiv.org/abs/2506.05403)
Keywords: generative
Abstract: With the widespread adoption of Artificial intelligence (AI), AI-based tools and components are becoming omnipresent in today's solutions. However, these components and tools are posing a significant threat when it comes to adversarial attacks. Mobile Crowdsensing (MCS) is a sensing paradigm that leverages the collective participation of workers and their smart devices to collect data. One of the key challenges faced at the selection stage is ensuring task completion due to workers' varying behavior. AI has been utilized to tackle this challenge by building unique models for each worker to predict their behavior. However, the integration of AI into the system introduces vulnerabilities that can be exploited by malicious insiders to reduce the revenue obtained by victim workers. This work proposes an adversarial attack targeting behavioral-based selection models in MCS. The proposed attack leverages Generative Adversarial Networks (GANs) to generate poisoning points that can mislead the models during the training stage without being detected. This way, the potential damage introduced by GANs on worker selection in MCS can be anticipated. Simulation results using a real-life dataset show the effectiveness of the proposed attack in compromising the victim workers' model and evading detection by an outlier detector, compared to a benchmark. In addition, the impact of the attack on reducing the payment obtained by victim workers is evaluated.

Title: A VLM-based Method for Visual Anomaly Detection in Robotic Scientific Laboratories

Authors: Shiwei Lin, Chenxu Wang, Xiaozhen Ding, Yi Wang, Boyuan Du, Lei Song, Chenggang Wang, Huaping Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05405
Pdf URL: https://arxiv.org/pdf/2506.05405
Copy Paste: [[2506.05405]] A VLM-based Method for Visual Anomaly Detection in Robotic Scientific Laboratories(https://arxiv.org/abs/2506.05405)
Keywords: anomaly
Abstract: In robot scientific laboratories, visual anomaly detection is important for the timely identification and resolution of potential faults or deviations. It has become a key factor in ensuring the stability and safety of experimental processes. To address this challenge, this paper proposes a VLM-based visual reasoning approach that supports different levels of supervision through four progressively informative prompt configurations. To systematically evaluate its effectiveness, we construct a visual benchmark tailored for process anomaly detection in scientific workflows. Experiments on two representative vision-language models show that detection accuracy improves as more contextual information is provided, confirming the effectiveness and adaptability of the proposed reasoning approach for process anomaly detection in scientific workflows. Furthermore, real-world validations at selected experimental steps confirm that first-person visual observation can effectively identify process-level anomalies. This work provides both a data-driven foundation and an evaluation framework for vision anomaly detection in scientific experiment workflows.

Title: PCEvolve: Private Contrastive Evolution for Synthetic Dataset Generation via Few-Shot Private Data and Generative APIs

Authors: Jianqing Zhang, Yang Liu, Jie Fu, Yang Hua, Tianyuan Zou, Jian Cao, Qiang Yang
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2506.05407
Pdf URL: https://arxiv.org/pdf/2506.05407
Copy Paste: [[2506.05407]] PCEvolve: Private Contrastive Evolution for Synthetic Dataset Generation via Few-Shot Private Data and Generative APIs(https://arxiv.org/abs/2506.05407)
Keywords: diffusion, generative
Abstract: The rise of generative APIs has fueled interest in privacy-preserving synthetic data generation. While the Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs, it struggles with few-shot private data due to the limitations of its DP-protected similarity voting approach. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. To address this challenge, we propose a novel API-assisted algorithm, Private Contrastive Evolution (PCEvolve), which iteratively mines inherent inter-class contrastive relationships in few-shot private data beyond individual data points and seamlessly integrates them into an adapted Exponential Mechanism (EM) to optimize DP's utility in an evolution loop. We conduct extensive experiments on four specialized datasets, demonstrating that PCEvolve outperforms PE and other API-assisted baselines. These results highlight the potential of leveraging API access with private data for quality evaluation, enabling the generation of high-quality DP synthetic images and paving the way for more accessible and effective privacy-preserving generative API applications. Our code is available at this https URL.

Title: Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions

Authors: Jeongsoo Ha, Kyungsoo Kim, Yusung Kim
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05419
Pdf URL: https://arxiv.org/pdf/2506.05419
Copy Paste: [[2506.05419]] Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions(https://arxiv.org/abs/2506.05419)
Keywords: self-supervised
Abstract: Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks in highdimensional image observations. Although recent MBRL algorithms perform well in trained observations, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite, and the randomizing environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at this https URL

Title: Self-supervised One-Stage Learning for RF-based Multi-Person Pose Estimation

Authors: Seunghwan Shin, Yusung Kim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05420
Pdf URL: https://arxiv.org/pdf/2506.05420
Copy Paste: [[2506.05420]] Self-supervised One-Stage Learning for RF-based Multi-Person Pose Estimation(https://arxiv.org/abs/2506.05420)
Keywords: self-supervised
Abstract: In the field of Multi-Person Pose Estimation (MPPE), Radio Frequency (RF)-based methods can operate effectively regardless of lighting conditions and obscured line-of-sight situations. Existing RF-based MPPE methods typically involve either 1) converting RF signals into heatmap images through complex preprocessing, or 2) applying a deep embedding network directly to raw RF signals. The first approach, while delivering decent performance, is computationally intensive and time-consuming. The second method, though simpler in preprocessing, results in lower MPPE accuracy and generalization performance. This paper proposes an efficient and lightweight one-stage MPPE model based on raw RF signals. By sub-grouping RF signals and embedding them using a shared single-layer CNN followed by multi-head attention, this model outperforms previous methods that embed all signals at once through a large and deep CNN. Additionally, we propose a new self-supervised learning (SSL) method that takes inputs from both one unmasked subgroup and the remaining masked subgroups to predict the latent representations of the masked data. Empirical results demonstrate that our model improves MPPE accuracy by up to 15 in PCKh@0.5 compared to previous methods using raw RF signals. Especially, the proposed SSL method has shown to significantly enhance performance improvements when placed in new locations or in front of obstacles at RF antennas, contributing to greater performance gains as the number of people increases. Our code and dataset is open at Github. this https URL .

Title: Mixture-of-Experts Meets In-Context Reinforcement Learning

Authors: Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05426
Pdf URL: https://arxiv.org/pdf/2506.05426
Copy Paste: [[2506.05426]] Mixture-of-Experts Meets In-Context Reinforcement Learning(https://arxiv.org/abs/2506.05426)
Keywords: in-context
Abstract: In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose \textbf{T2MIR} (\textbf{T}oken- and \textbf{T}ask-wise \textbf{M}oE for \textbf{I}n-context \textbf{R}L), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at this https URL.

Title: Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction

Authors: Zhihao Tang, Chaozhuo Li, Litian Zhang, Xi Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05428
Pdf URL: https://arxiv.org/pdf/2506.05428
Copy Paste: [[2506.05428]] Diffusion with a Linguistic Compass: Steering the Generation of Clinically Plausible Future sMRI Representations for Early MCI Conversion Prediction(https://arxiv.org/abs/2506.05428)
Keywords: diffusion
Abstract: Early prediction of Mild Cognitive Impairment (MCI) conversion is hampered by a trade-off between immediacy--making fast predictions from a single baseline sMRI--and accuracy--leveraging longitudinal scans to capture disease progression. We propose MCI-Diff, a diffusion-based framework that synthesizes clinically plausible future sMRI representations directly from baseline data, achieving both real-time risk assessment and high predictive performance. First, a multi-task sequence reconstruction strategy trains a shared denoising network on interpolation and extrapolation tasks to handle irregular follow-up sampling and learn robust latent trajectories. Second, an LLM-driven "linguistic compass" is introduced for clinical plausibility sampling: generated feature candidates are quantized, tokenized, and scored by a fine-tuned language model conditioned on expected structural biomarkers, guiding autoregressive generation toward realistic disease patterns. Experiments on ADNI and AIBL cohorts show that MCI-Diff outperforms state-of-the-art baselines, improving early conversion accuracy by 5-12%.

Title: Towards Reliable Identification of Diffusion-based Image Manipulations

Authors: Alex Costanzino, Woody Bayliss, Juil Sock, Marc Gorriz Blanch, Danijela Horak, Ivan Laptev, Philip Torr, Fabio Pizzati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05466
Pdf URL: https://arxiv.org/pdf/2506.05466
Copy Paste: [[2506.05466]] Towards Reliable Identification of Diffusion-based Image Manipulations(https://arxiv.org/abs/2506.05466)
Keywords: diffusion, foundation model
Abstract: Changing facial expressions, gestures, or background details may dramatically alter the meaning conveyed by an image. Notably, recent advances in diffusion models greatly improve the quality of image manipulation while also opening the door to misuse. Identifying changes made to authentic images, thus, becomes an important task, constantly challenged by new diffusion-based editing tools. To this end, we propose a novel approach for ReliAble iDentification of inpainted AReas (RADAR). RADAR builds on existing foundation models and combines features from different image modalities. It also incorporates an auxiliary contrastive loss that helps to isolate manipulated image patches. We demonstrate these techniques to significantly improve both the accuracy of our method and its generalisation to a large number of diffusion models. To support realistic evaluation, we further introduce BBC-PAIR, a new comprehensive benchmark, with images tampered by 28 diffusion models. Our experiments show that RADAR achieves excellent results, outperforming the state-of-the-art in detecting and localising image edits made by both seen and unseen diffusion models. Our code, data and models will be publicly available at this http URL.

Title: Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models

Authors: Sima Noorani, Shayan Kiyani, George Pappas, Hamed Hassani
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05497
Pdf URL: https://arxiv.org/pdf/2506.05497
Copy Paste: [[2506.05497]] Conformal Prediction Beyond the Seen: A Missing Mass Perspective for Uncertainty Quantification in Generative Models(https://arxiv.org/abs/2506.05497)
Keywords: generative
Abstract: Uncertainty quantification (UQ) is essential for safe deployment of generative AI models such as large language models (LLMs), especially in high stakes applications. Conformal prediction (CP) offers a principled uncertainty quantification framework, but classical methods focus on regression and classification, relying on geometric distances or softmax scores: tools that presuppose structured outputs. We depart from this paradigm by studying CP in a query only setting, where prediction sets must be constructed solely from finite queries to a black box generative model, introducing a new trade off between coverage, test time query budget, and informativeness. We introduce Conformal Prediction with Query Oracle (CPQ), a framework characterizing the optimal interplay between these objectives. Our finite sample algorithm is built on two core principles: one governs the optimal query policy, and the other defines the optimal mapping from queried samples to prediction sets. Remarkably, both are rooted in the classical missing mass problem in statistics. Specifically, the optimal query policy depends on the rate of decay, or the derivative, of the missing mass, for which we develop a novel estimator. Meanwhile, the optimal mapping hinges on the missing mass itself, which we estimate using Good Turing estimators. We then turn our focus to implementing our method for language models, where outputs are vast, variable, and often under specified. Fine grained experiments on three real world open ended tasks and two LLMs, show CPQ applicability to any black box LLM and highlight: (1) individual contribution of each principle to CPQ performance, and (2) CPQ ability to yield significantly more informative prediction sets than existing conformal methods for language uncertainty quantification.

Title: The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models

Authors: Alex Damian, Jason D. Lee, Joan Bruna
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05500
Pdf URL: https://arxiv.org/pdf/2506.05500
Copy Paste: [[2506.05500]] The Generative Leap: Sharp Sample Complexity for Efficiently Learning Gaussian Multi-Index Models(https://arxiv.org/abs/2506.05500)
Keywords: generative
Abstract: In this work we consider generic Gaussian Multi-index models, in which the labels only depend on the (Gaussian) $d$-dimensional inputs through their projection onto a low-dimensional $r = O_d(1)$ subspace, and we study efficient agnostic estimation procedures for this hidden subspace. We introduce the \emph{generative leap} exponent $k^\star$, a natural extension of the generative exponent from [Damian et al.'24] to the multi-index setting. We first show that a sample complexity of $n=\Theta(d^{1 \vee \k/2})$ is necessary in the class of algorithms captured by the Low-Degree-Polynomial framework. We then establish that this sample complexity is also sufficient, by giving an agnostic sequential estimation procedure (that is, requiring no prior knowledge of the multi-index model) based on a spectral U-statistic over appropriate Hermite tensors. We further compute the generative leap exponent for several examples including piecewise linear functions (deep ReLU networks with bias), and general deep neural networks (with $r$-dimensional first hidden layer).

Title: FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

Authors: Kaihang Pan, Wendong Bu, Yuruo Wu, Yang Wu, Kai Shen, Yunfei Li, Hang Zhao, Juncheng Li, Siliang Tang, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05501
Pdf URL: https://arxiv.org/pdf/2506.05501
Copy Paste: [[2506.05501]] FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL(https://arxiv.org/abs/2506.05501)
Keywords: diffusion
Abstract: Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment thus failing to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, further introducing a novel reinforcement learning algorithm to emphasize such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing text-to-image benchmarks and significantly outperforms prior methods on PairComp.

Title: MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning

Authors: Zikui Cai, Andrew Wang, Anirudh Satheesh, Ankit Nakhawa, Hyunwoo Jae, Keenan Powell, Minghui Liu, Neel Jay, Sungbin Oh, Xiyao Wang, Yongyuan Liang, Tom Goldstein, Furong Huang
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05523
Pdf URL: https://arxiv.org/pdf/2506.05523
Copy Paste: [[2506.05523]] MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning(https://arxiv.org/abs/2506.05523)
Keywords: generative
Abstract: Despite rapid advances in vision-language models (VLMs), current benchmarks for multimodal reasoning fall short in three key dimensions. First, they overwhelmingly rely on static images, failing to capture the temporal complexity of real-world environments. Second, they narrowly focus on mathematical problem-solving, neglecting the broader spectrum of reasoning skills -- including abstract, physical, planning, spatial, and temporal capabilities -- required for robust multimodal intelligence. Third, many benchmarks quickly saturate, offering limited headroom for diagnosing failure modes or measuring continued progress. We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is programmatically generated using deterministic Python scripts (via Manim, Matplotlib, MoviePy), generative video models, and curated real footage. This script-driven design allows fine-grained control over visual complexity, distractor density, and temporal dynamics -- enabling difficulty to be scaled systematically as models improve. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve: its controllable generation pipeline supports the creation of arbitrarily challenging new instances, making it ideally suited for stress-testing next-generation models. Initial experiments with state-of-the-art systems -- including various Gemini 2.5 Pro and OpenAI o3 which represent the strongest available at the time, alongside strong open-source models -- reveal substantial performance gaps across all categories, with particularly large deficits in abstract and planning tasks. We release the full dataset, generation scripts, and evaluation harness to support transparent, reproducible, and forward-looking multimodal reasoning research.

Title: SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms

Authors: Arnesh Batra, Anushk Kumar, Jashn Khemani, Arush Gumber, Arhan Jain, Somil Gupta
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05538
Pdf URL: https://arxiv.org/pdf/2506.05538
Copy Paste: [[2506.05538]] SocialDF: Benchmark Dataset and Detection Model for Mitigating Harmful Deepfake Content on Social Media Platforms(https://arxiv.org/abs/2506.05538)
Keywords: generative
Abstract: The rapid advancement of deep generative models has significantly improved the realism of synthetic media, presenting both opportunities and security challenges. While deepfake technology has valuable applications in entertainment and accessibility, it has emerged as a potent vector for misinformation campaigns, particularly on social media. Existing detection frameworks struggle to distinguish between benign and adversarially generated deepfakes engineered to manipulate public perception. To address this challenge, we introduce SocialDF, a curated dataset reflecting real-world deepfake challenges on social media platforms. This dataset encompasses high-fidelity deepfakes sourced from various online ecosystems, ensuring broad coverage of manipulative techniques. We propose a novel LLM-based multi-factor detection approach that combines facial recognition, automated speech transcription, and a multi-agent LLM pipeline to cross-verify audio-visual cues. Our methodology emphasizes robust, multi-modal verification techniques that incorporate linguistic, behavioral, and contextual analysis to effectively discern synthetic media from authentic content.

Title: FRAME: Pre-Training Video Feature Representations via Anticipation and Memory

Authors: Sethuraman TV, Savya Khosla, Vignesh Srinivasakumar, Jiahui Huang, Seoung Wug Oh, Simon Jenni, Derek Hoiem, Joon-Young Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05543
Pdf URL: https://arxiv.org/pdf/2506.05543
Copy Paste: [[2506.05543]] FRAME: Pre-Training Video Feature Representations via Anticipation and Memory(https://arxiv.org/abs/2506.05543)
Keywords: self-supervised
Abstract: Dense video prediction tasks, such as object tracking and semantic segmentation, require video encoders that generate temporally consistent, spatially dense features for every frame. However, existing approaches fall short: image encoders like DINO or CLIP lack temporal awareness, while video models such as VideoMAE underperform compared to image encoders on dense prediction tasks. We address this gap with FRAME, a self-supervised video frame encoder tailored for dense video understanding. FRAME learns to predict current and future DINO patch features from past and present RGB frames, leading to spatially precise and temporally coherent representations. To our knowledge, FRAME is the first video encoder to leverage image-based models for dense prediction while outperforming them on tasks requiring fine-grained visual correspondence. As an auxiliary capability, FRAME aligns its class token with CLIP's semantic space, supporting language-driven tasks such as video classification. We evaluate FRAME across six dense prediction tasks on seven datasets, where it consistently outperforms image encoders and existing self-supervised video models. Despite its versatility, FRAME maintains a compact architecture suitable for a range of downstream applications.

Title: EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh

Authors: Tao Hu, Haoyang Peng, Xiao Liu, Yuewen Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05554
Pdf URL: https://arxiv.org/pdf/2506.05554
Copy Paste: [[2506.05554]] EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh(https://arxiv.org/abs/2506.05554)
Keywords: diffusion
Abstract: Generating high-quality camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoint. Existing methods often struggle with geometric inconsistencies and occlusion artifacts in boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency in extreme camera pose. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data only from monocular videos. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.

Title: VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction

Authors: Ziyue Zhu, Shenlong Wang, Jin Xie, Jiang-jiang Liu, Jingdong Wang, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05563
Pdf URL: https://arxiv.org/pdf/2506.05563
Copy Paste: [[2506.05563]] VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction(https://arxiv.org/abs/2506.05563)
Keywords: self-supervised
Abstract: Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow, a task that presents significant challenges due to specific difficulties, e.g., occlusions and unbalanced dynamic environments. In this paper, we analyze these challenges and their underlying causes. To address them, we propose a novel regularization framework called VoxelSplat. This framework leverages recent developments in 3D Gaussian Splatting to enhance model performance in two key ways: (i) Enhanced Semantics Supervision through 2D Projection: During training, our method decodes sparse semantic 3D Gaussians from 3D representations and projects them onto the 2D camera view. This provides additional supervision signals in the camera-visible space, allowing 2D labels to improve the learning of 3D semantics. (ii) Scene Flow Learning: Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner using the labels of adjacent frames. Our method can be seamlessly integrated into various existing occupancy models, enhancing performance without increasing inference time. Extensive experiments on benchmark datasets demonstrate the effectiveness of VoxelSplat in improving the accuracy of both semantic occupancy and scene flow estimation. The project page and codes are available at this https URL.

Title: PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers

Authors: Yuchen Lin, Chenguo Lin, Panwang Pan, Honglei Yan, Yiqiang Feng, Yadong Mu, Katerina Fragkiadaki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05573
Pdf URL: https://arxiv.org/pdf/2506.05573
Copy Paste: [[2506.05573]] PartCrafter: Structured 3D Mesh Generation via Compositional Latent Diffusion Transformers(https://arxiv.org/abs/2506.05573)
Keywords: diffusion, generative
Abstract: We introduce PartCrafter, the first structured 3D generative model that jointly synthesizes multiple semantically meaningful and geometrically distinct 3D meshes from a single RGB image. Unlike existing methods that either produce monolithic 3D shapes or follow two-stage pipelines, i.e., first segmenting an image and then reconstructing each segment, PartCrafter adopts a unified, compositional generation architecture that does not rely on pre-segmented inputs. Conditioned on a single image, it simultaneously denoises multiple 3D parts, enabling end-to-end part-aware generation of both individual objects and complex multi-object scenes. PartCrafter builds upon a pretrained 3D mesh diffusion transformer (DiT) trained on whole objects, inheriting the pretrained weights, encoder, and decoder, and introduces two key innovations: (1) A compositional latent space, where each 3D part is represented by a set of disentangled latent tokens; (2) A hierarchical attention mechanism that enables structured information flow both within individual parts and across all parts, ensuring global coherence while preserving part-level detail during generation. To support part-level supervision, we curate a new dataset by mining part-level annotations from large-scale 3D object datasets. Experiments show that PartCrafter outperforms existing approaches in generating decomposable 3D meshes, including parts that are not directly visible in input images, demonstrating the strength of part-aware generative priors for 3D understanding and synthesis. Code and training data will be released.

Title: When can in-context learning generalize out of task distribution?

Authors: Chase Goddard, Lindsay M. Smith, Vudtiwat Ngampruetikorn, David J. Schwab
Subjects: cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, q-bio.NC, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05574
Pdf URL: https://arxiv.org/pdf/2506.05574
Copy Paste: [[2506.05574]] When can in-context learning generalize out of task distribution?(https://arxiv.org/abs/2506.05574)
Keywords: in-context
Abstract: In-context learning (ICL) is a remarkable capability of pretrained transformers that allows models to generalize to unseen tasks after seeing only a few examples. We investigate empirically the conditions necessary on the pretraining distribution for ICL to emerge and generalize \emph{out-of-distribution}. Previous work has focused on the number of distinct tasks necessary in the pretraining dataset. Here, we use a different notion of task diversity to study the emergence of ICL in transformers trained on linear functions. We find that as task diversity increases, transformers undergo a transition from a specialized solution, which exhibits ICL only within the pretraining task distribution, to a solution which generalizes out of distribution to the entire task space. We also investigate the nature of the solutions learned by the transformer on both sides of the transition, and observe similar transitions in nonlinear regression problems. We construct a phase diagram to characterize how our concept of task diversity interacts with the number of pretraining tasks. In addition, we explore how factors such as the depth of the model and the dimensionality of the regression problem influence the transition.

Title: TabFlex: Scaling Tabular Learning to Millions with Linear Attention

Authors: Yuchen Zeng, Tuan Dinh, Wonjun Kang, Andreas C Mueller
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05584
Pdf URL: https://arxiv.org/pdf/2506.05584
Copy Paste: [[2506.05584]] TabFlex: Scaling Tabular Learning to Millions with Linear Attention(https://arxiv.org/abs/2506.05584)
Keywords: in-context
Abstract: Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel in small-scale tabular datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to complexity-quadratic self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2x speedup compared to TabPFN and a 1.5x speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.

Title: FaCTR: Factorized Channel-Temporal Representation Transformers for Efficient Time Series Forecasting

Authors: Yash Vijay, Harini Subramanyan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05597
Pdf URL: https://arxiv.org/pdf/2506.05597
Copy Paste: [[2506.05597]] FaCTR: Factorized Channel-Temporal Representation Transformers for Efficient Time Series Forecasting(https://arxiv.org/abs/2506.05597)
Keywords: self-supervised
Abstract: While Transformers excel in language and vision-where inputs are semantically rich and exhibit univariate dependency structures-their architectural complexity leads to diminishing returns in time series forecasting. Time series data is characterized by low per-timestep information density and complex dependencies across channels and covariates, requiring conditioning on structured variable interactions. To address this mismatch and overparameterization, we propose FaCTR, a lightweight spatiotemporal Transformer with an explicitly structural design. FaCTR injects dynamic, symmetric cross-channel interactions-modeled via a low-rank Factorization Machine into temporally contextualized patch embeddings through a learnable gating mechanism. It further encodes static and dynamic covariates for multivariate conditioning. Despite its compact design, FaCTR achieves state-of-the-art performance on eleven public forecasting benchmarks spanning both short-term and long-term horizons, with its largest variant using close to only 400K parameters-on average 50x smaller than competitive spatiotemporal transformer baselines. In addition, its structured design enables interpretability through cross-channel influence scores-an essential requirement for real-world decision-making. Finally, FaCTR supports self-supervised pretraining, positioning it as a compact yet versatile foundation for downstream time series tasks.

Title: UniRes: Universal Image Restoration for Complex Degradations

Authors: Mo Zhou, Keren Ye, Mauricio Delbracio, Peyman Milanfar, Vishal M. Patel, Hossein Talebi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05599
Pdf URL: https://arxiv.org/pdf/2506.05599
Copy Paste: [[2506.05599]] UniRes: Universal Image Restoration for Complex Degradations(https://arxiv.org/abs/2506.05599)
Keywords: diffusion, generative
Abstract: Real-world image restoration is hampered by diverse degradations stemming from varying capture conditions, capture devices and post-processing pipelines. Existing works make improvements through simulating those degradations and leveraging image generative priors, however generalization to in-the-wild data remains an unresolved problem. In this paper, we focus on complex degradations, i.e., arbitrary mixtures of multiple types of known degradations, which is frequently seen in the wild. A simple yet flexible diffusionbased framework, named UniRes, is proposed to address such degradations in an end-to-end manner. It combines several specialized models during the diffusion sampling steps, hence transferring the knowledge from several well-isolated restoration tasks to the restoration of complex in-the-wild degradations. This only requires well-isolated training data for several degradation types. The framework is flexible as extensions can be added through a unified formulation, and the fidelity-quality trade-off can be adjusted through a new paradigm. Our proposed method is evaluated on both complex-degradation and single-degradation image restoration datasets. Extensive qualitative and quantitative experimental results show consistent performance gain especially for images with complex degradations.

Title: GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance

Authors: Jiri Navratil, Jarret Ross, Payel Das, Youssef Mroueh, Samuel C Hoffman, Vijil Chenthamarakshan, Brian Belgodere
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05628
Pdf URL: https://arxiv.org/pdf/2506.05628
Copy Paste: [[2506.05628]] GP-MoLFormer-Sim: Test Time Molecular Optimization through Contextual Similarity Guidance(https://arxiv.org/abs/2506.05628)
Keywords: generative
Abstract: The ability to design molecules while preserving similarity to a target molecule and/or property is crucial for various applications in drug discovery, chemical design, and biology. We introduce in this paper an efficient training-free method for navigating and sampling from the molecular space with a generative Chemical Language Model (CLM), while using the molecular similarity to the target as a guide. Our method leverages the contextual representations learned from the CLM itself to estimate the molecular similarity, which is then used to adjust the autoregressive sampling strategy of the CLM. At each step of the decoding process, the method tracks the distance of the current generations from the target and updates the logits to encourage the preservation of similarity in generations. We implement the method using a recently proposed $\sim$47M parameter SMILES-based CLM, GP-MoLFormer, and therefore refer to the method as GP-MoLFormer-Sim, which enables a test-time update of the deep generative policy to reflect the contextual similarity to a set of guide molecules. The method is further integrated into a genetic algorithm (GA) and tested on a set of standard molecular optimization benchmarks involving property optimization, molecular rediscovery, and structure-based drug design. Results show that, GP-MoLFormer-Sim, combined with GA (GP-MoLFormer-Sim+GA) outperforms existing training-free baseline methods, when the oracle remains black-box. The findings in this work are a step forward in understanding and guiding the generative mechanisms of CLMs.

Title: Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones

Authors: Andrey Zhmoginov, Jihwan Lee, Mark Sandler
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05641
Pdf URL: https://arxiv.org/pdf/2506.05641
Copy Paste: [[2506.05641]] Projectable Models: One-Shot Generation of Small Specialized Transformers from Large Ones(https://arxiv.org/abs/2506.05641)
Keywords: foundation model
Abstract: Modern Foundation Models (FMs) are typically trained on corpora spanning a wide range of different data modalities, topics and downstream tasks. Utilizing these models can be very computationally expensive and is out of reach for most consumer devices. Furthermore, most of the broad FM knowledge may actually be irrelevant for a specific task at hand. Here we explore a technique for mapping parameters of a large Transformer to parameters of a smaller specialized model. By making this transformation task-specific, we aim to capture a narrower scope of the knowledge needed for performing a specific task by a smaller model. We study our method on image modeling tasks, showing that performance of generated models exceeds that of universal conditional models.

Title: Learning to Weight Parameters for Data Attribution

Authors: Shuangqi Li, Hieu Le, Jingyi Xu, Mathieu Salzmann
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05647
Pdf URL: https://arxiv.org/pdf/2506.05647
Copy Paste: [[2506.05647]] Learning to Weight Parameters for Data Attribution(https://arxiv.org/abs/2506.05647)
Keywords: diffusion, generative
Abstract: We study data attribution in generative models, aiming to identify which training examples most influence a given output. Existing methods achieve this by tracing gradients back to training data. However, they typically treat all network parameters uniformly, ignoring the fact that different layers encode different types of information and may thus draw information differently from the training set. We propose a method that models this by learning parameter importance weights tailored for attribution, without requiring labeled data. This allows the attribution process to adapt to the structure of the model, capturing which training examples contribute to specific semantic aspects of an output, such as subject, style, or background. Our method improves attribution accuracy across diffusion models and enables fine-grained insights into how outputs borrow from training data.

Title: RNE: a plug-and-play framework for diffusion density estimation and inference-time control

Authors: Jiajun He, José Miguel Hernández-Lobato, Yuanqi Du, Francisco Vargas
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05668
Pdf URL: https://arxiv.org/pdf/2506.05668
Copy Paste: [[2506.05668]] RNE: a plug-and-play framework for diffusion density estimation and inference-time control(https://arxiv.org/abs/2506.05668)
Keywords: diffusion
Abstract: In this paper, we introduce the Radon-Nikodym Estimator (RNE), a flexible, plug-and-play framework for diffusion inference-time density estimation and control, based on the concept of the density ratio between path distributions. RNE connects and unifies a variety of existing density estimation and inference-time control methods under a single and intuitive perspective, stemming from basic variational inference and probabilistic principles therefore offering both theoretical clarity and practical versatility. Experiments demonstrate that RNE achieves promising performances in diffusion density estimation and inference-time control tasks, including annealing, composition of diffusion models, and reward-tilting.

Title: Contextually Guided Transformers via Low-Rank Adaptation

Authors: Andrey Zhmoginov, Jihwan Lee, Max Vladymyrov, Mark Sandler
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05672
Pdf URL: https://arxiv.org/pdf/2506.05672
Copy Paste: [[2506.05672]] Contextually Guided Transformers via Low-Rank Adaptation(https://arxiv.org/abs/2506.05672)
Keywords: in-context
Abstract: Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates the need for explicit prompts by learning to encode context into the model's weights. Our Contextually Guided Transformer (CGT) model maintains a contextual summary at each sequence position, allowing it to update the weights on the fly based on the preceding context. This approach enables the model to self-specialize, effectively creating a tailored model for processing information following a given prefix. We demonstrate the effectiveness of our method on synthetic in-context learning tasks and language modeling benchmarks. Furthermore, we introduce techniques for enhancing the interpretability of the learned contextual representations, drawing connections to Variational Autoencoders and promoting smoother, more consistent context encoding. This work offers a novel direction for efficient and adaptable language modeling by integrating context directly into the model's architecture.

Title: Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

Authors: Sajjad Abdoli, Freeman Lewin, Gediminas Vasiliauskas, Fabian Schonholz
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05673
Pdf URL: https://arxiv.org/pdf/2506.05673
Copy Paste: [[2506.05673]] Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery(https://arxiv.org/abs/2506.05673)
Keywords: diffusion
Abstract: The development of modern Artificial Intelligence (AI) models, particularly diffusion-based models employed in computer vision and image generation tasks, is undergoing a paradigmatic shift in development methodologies. Traditionally dominated by a "Model Centric" approach, in which performance gains were primarily pursued through increasingly complex model architectures and hyperparameter optimization, the field is now recognizing a more nuanced "Data-Centric" approach. This emergent framework foregrounds the quality, structure, and relevance of training data as the principal driver of model performance. To operationalize this paradigm shift, we introduce the this http URL sample dataset (the "DSD"), initially comprised of approximately 10,610 high-quality human peer-ranked photography images accompanied by extensive multi-tier annotations. The DSD is a foundational computer vision dataset designed to usher in a new standard for commercial image datasets. Representing a small fraction of this http URL's 100 million-plus image catalog, the DSD provides a scalable foundation necessary for robust commercial and multimodal AI development. Through this in-depth exploratory analysis, we document the quantitative improvements generated by the DSD on specific models against known benchmarks and make the code and the trained models used in our evaluation publicly available.

Title: Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization

Authors: Tailin Zhou, Zhilin Chen, Wenlong Lyu, Zhitang Chen, Danny H.K. Tsang, Jun Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05680
Pdf URL: https://arxiv.org/pdf/2506.05680
Copy Paste: [[2506.05680]] Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization(https://arxiv.org/abs/2506.05680)
Keywords: diffusion
Abstract: Optimizing complex systems, from discovering therapeutic drugs to designing high-performance materials, remains a fundamental challenge across science and engineering, as the underlying rules are often unknown and costly to evaluate. Offline optimization aims to optimize designs for target scores using pre-collected datasets without system interaction. However, conventional approaches may fail beyond training data, predicting inaccurate scores and generating inferior designs. This paper introduces ManGO, a diffusion-based framework that learns the design-score manifold, capturing the design-score interdependencies holistically. Unlike existing methods that treat design and score spaces in isolation, ManGO unifies forward prediction and backward generation, attaining generalization beyond training data. Key to this is its derivative-free guidance for conditional generation, coupled with adaptive inference-time scaling that dynamically optimizes denoising paths. Extensive evaluations demonstrate that ManGO outperforms 24 single- and 10 multi-objective optimization methods across diverse domains, including synthetic tasks, robot control, material design, DNA sequence, and real-world engineering optimization.

Title: Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR

Authors: Fardis Nadimi, Payam Abdisarabshali, Kasra Borazjani, Jacob Chakareski, Seyyedali Hosseinalipour
Subjects: cs.LG, cs.AI, cs.CR, cs.MM
Abstract URL: https://arxiv.org/abs/2506.05683
Pdf URL: https://arxiv.org/pdf/2506.05683
Copy Paste: [[2506.05683]] Multi-Modal Multi-Task Federated Foundation Models for Next-Generation Extended Reality Systems: Towards Privacy-Preserving Distributed Intelligence in AR/VR/MR(https://arxiv.org/abs/2506.05683)
Keywords: foundation model
Abstract: Extended reality (XR) systems, which consist of virtual reality (VR), augmented reality (AR), and mixed reality (XR), offer a transformative interface for immersive, multi-modal, and embodied human-computer interaction. In this paper, we envision that multi-modal multi-task (M3T) federated foundation models (FedFMs) can offer transformative capabilities for XR systems through integrating the representational strength of M3T foundation models (FMs) with the privacy-preserving model training principles of federated learning (FL). We present a modular architecture for FedFMs, which entails different coordination paradigms for model training and aggregations. Central to our vision is the codification of XR challenges that affect the implementation of FedFMs under the SHIFT dimensions: (1) Sensor and modality diversity, (2) Hardware heterogeneity and system-level constraints, (3) Interactivity and embodied personalization, (4) Functional/task variability, and (5) Temporality and environmental variability. We illustrate the manifestation of these dimensions across a set of emerging and anticipated applications of XR systems. Finally, we propose evaluation metrics, dataset requirements, and design tradeoffs necessary for the development of resource-aware FedFMs in XR. This perspective aims to chart the technical and conceptual foundations for context-aware privacy-preserving intelligence in the next generation of XR systems.

Title: Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application

Authors: Xiucheng Wang, Honggang Jia, Nan Cheng, Dusit Niyato
Subjects: cs.LG, cs.IT, eess.SY
Abstract URL: https://arxiv.org/abs/2506.05710
Pdf URL: https://arxiv.org/pdf/2506.05710
Copy Paste: [[2506.05710]] Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application(https://arxiv.org/abs/2506.05710)
Keywords: diffusion, generative
Abstract: In this paper, a novel semantic communication framework empowered by generative artificial intelligence (GAI) is proposed, specifically leveraging the capabilities of diffusion models (DMs). A rigorous theoretical foundation is established based on stochastic differential equations (SDEs), which elucidates the denoising properties of DMs in mitigating additive white Gaussian noise (AWGN) in latent semantic representations. Crucially, a closed-form analytical relationship between the signal-to-noise ratio (SNR) and the denoising timestep is derived, enabling the optimal selection of diffusion parameters for any given channel condition. To address the distribution mismatch between the received signal and the DM's training data, a mathematically principled scaling mechanism is introduced, ensuring robust performance across a wide range of SNRs without requiring model fine-tuning. Built upon this theoretical insight, we develop a latent diffusion model (LDM)-based semantic transceiver, wherein a variational autoencoder (VAE) is employed for efficient semantic compression, and a pretrained DM serves as a universal denoiser. Notably, the proposed architecture is fully training-free at inference time, offering high modularity and compatibility with large-scale pretrained LDMs. This design inherently supports zero-shot generalization and mitigates the challenges posed by out-of-distribution inputs. Extensive experimental evaluations demonstrate that the proposed framework significantly outperforms conventional neural-network-based semantic communication baselines, particularly under low SNR conditions and distributional shifts, thereby establishing a promising direction for GAI-driven robust semantic transmission in future 6G systems.

Title: Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation

Authors: Zhan Zhuang, Xiequn Wang, Wei Li, Yulong Zhang, Qiushi Huang, Shuhao Chen, Xuehao Wang, Yanbin Wei, Yuhe Nie, Kede Ma, Yu Zhang, Ying Wei
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05713
Pdf URL: https://arxiv.org/pdf/2506.05713
Copy Paste: [[2506.05713]] Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation(https://arxiv.org/abs/2506.05713)
Keywords: foundation model
Abstract: Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models, yet it often locks adapters into suboptimal minima near their initialization. This hampers model generalization and limits downstream operators such as adapter merging and pruning. Here, we propose CoTo, a progressive training strategy that gradually increases adapters' activation probability over the course of fine-tuning. By stochastically deactivating adapters, CoTo encourages more balanced optimization and broader exploration of the loss landscape. We provide a theoretical analysis showing that CoTo promotes layer-wise dropout stability and linear mode connectivity, and we adopt a cooperative-game approach to quantify each adapter's marginal contribution. Extensive experiments demonstrate that CoTo consistently boosts single-task performance, enhances multi-task merging accuracy, improves pruning robustness, and reduces training overhead, all while remaining compatible with diverse LoRA variants. Code is available at this https URL.

Title: When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning

Authors: Ruining Sun, Hongsheng Hu, Wei Luo, Zhaoxi Zhang, Yanjun Zhang, Haizhuan Yuan, Leo Yu Zhang
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05743
Pdf URL: https://arxiv.org/pdf/2506.05743
Copy Paste: [[2506.05743]] When Better Features Mean Greater Risks: The Performance-Privacy Trade-Off in Contrastive Learning(https://arxiv.org/abs/2506.05743)
Keywords: self-supervised
Abstract: With the rapid advancement of deep learning technology, pre-trained encoder models have demonstrated exceptional feature extraction capabilities, playing a pivotal role in the research and application of deep learning. However, their widespread use has raised significant concerns about the risk of training data privacy leakage. This paper systematically investigates the privacy threats posed by membership inference attacks (MIAs) targeting encoder models, focusing on contrastive learning frameworks. Through experimental analysis, we reveal the significant impact of model architecture complexity on membership privacy leakage: As more advanced encoder frameworks improve feature-extraction performance, they simultaneously exacerbate privacy-leakage risks. Furthermore, this paper proposes a novel membership inference attack method based on the p-norm of feature vectors, termed the Embedding Lp-Norm Likelihood Attack (LpLA). This method infers membership status, by leveraging the statistical distribution characteristics of the p-norm of feature vectors. Experimental results across multiple datasets and model architectures demonstrate that LpLA outperforms existing methods in attack performance and robustness, particularly under limited attack knowledge and query volumes. This study not only uncovers the potential risks of privacy leakage in contrastive learning frameworks, but also provides a practical basis for privacy protection research in encoder models. We hope that this work will draw greater attention to the privacy risks associated with self-supervised learning models and shed light on the importance of a balance between model utility and training data privacy. Our code is publicly available at: this https URL.

Title: BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

Authors: Yunpeng Qing, Shuo Chen, Yixiao Chi, Shunyu Liu, Sixu Lin, Changqing Zou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05762
Pdf URL: https://arxiv.org/pdf/2506.05762
Copy Paste: [[2506.05762]] BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning(https://arxiv.org/abs/2506.05762)
Keywords: diffusion, generative
Abstract: Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history this http URL can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.

Title: LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models

Authors: Haojie Yu, Zhaonian Wang, Yihan Pan, Meng Cheng, Hao Yang, Chao Wang, Tao Xie, Xiaoming Xu, Xiaoming Wei, Xunliang Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05806
Pdf URL: https://arxiv.org/pdf/2506.05806
Copy Paste: [[2506.05806]] LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models(https://arxiv.org/abs/2506.05806)
Keywords: diffusion
Abstract: Diffusion-based models have gained wide adoption in the virtual human generation due to their outstanding expressiveness. However, their substantial computational requirements have constrained their deployment in real-time interactive avatar applications, where stringent speed, latency, and duration requirements are paramount. We present a novel audio-driven portrait video generation framework based on the diffusion model to address these challenges. Firstly, we propose robust variable-length video generation to reduce the minimum time required to generate the initial video clip or state transitions, which significantly enhances the user experience. Secondly, we propose a consistency model training strategy for Audio-Image-to-Video to ensure real-time performance, enabling a fast few-step generation. Model quantization and pipeline parallelism are further employed to accelerate the inference speed. To mitigate the stability loss incurred by the diffusion process and model quantization, we introduce a new inference strategy tailored for long-duration video generation. These methods ensure real-time performance and low latency while maintaining high-fidelity output. Thirdly, we incorporate class labels as a conditional input to seamlessly switch between speaking, listening, and idle states. Lastly, we design a novel mechanism for fine-grained facial expression control to exploit our model's inherent capacity. Extensive experiments demonstrate that our approach achieves low-latency, fluid, and authentic two-way communication. On an NVIDIA RTX 4090D, our model achieves a maximum of 78 FPS at a resolution of 384x384 and 45 FPS at a resolution of 512x512, with an initial video generation latency of 140 ms and 215 ms, respectively.

Title: Heartcare Suite: Multi-dimensional Understanding of ECG with Raw Multi-lead Signal Modeling

Authors: Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenqiao Zhang, Haoyuan Li, Hao Jiang, Fengda Zhang, Qishan Chen, Jun Xiao, Yueting Zhuang, Beng Chin Ooi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05831
Pdf URL: https://arxiv.org/pdf/2506.05831
Copy Paste: [[2506.05831]] Heartcare Suite: Multi-dimensional Understanding of ECG with Raw Multi-lead Signal Modeling(https://arxiv.org/abs/2506.05831)
Keywords: diffusion
Abstract: We present Heartcare Suite, a multimodal comprehensive framework for finegrained electrocardiogram (ECG) understanding. It comprises three key components: (i) Heartcare-220K, a high-quality, structured, and comprehensive multimodal ECG dataset covering essential tasks such as disease diagnosis, waveform morphology analysis, and rhythm interpretation. (ii) Heartcare-Bench, a systematic and multi-dimensional benchmark designed to evaluate diagnostic intelligence and guide the optimization of Medical Multimodal Large Language Models (Med-MLLMs) in ECG scenarios. and (iii) HeartcareGPT with a tailored tokenizer Bidirectional ECG Abstract Tokenization (Beat), which compresses raw multi-lead signals into semantically rich discrete tokens via duallevel vector quantization and query-guided bidirectional diffusion mechanism. Built upon Heartcare-220K, HeartcareGPT achieves strong generalization and SoTA performance across multiple clinically meaningful tasks. Extensive experiments demonstrate that Heartcare Suite is highly effective in advancing ECGspecific multimodal understanding and evaluation. Our project is available at this https URL .

Title: FontAdapter: Instant Font Adaptation in Visual Text Generation

Authors: Myungkyu Koo, Subin Kim, Sangkyung Kwak, Jaehyun Nam, Seojin Kim, Jinwoo Shin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05843
Pdf URL: https://arxiv.org/pdf/2506.05843
Copy Paste: [[2506.05843]] FontAdapter: Instant Font Adaptation in Visual Text Generation(https://arxiv.org/abs/2506.05843)
Keywords: diffusion
Abstract: Text-to-image diffusion models have significantly improved the seamless integration of visual text into diverse image contexts. Recent approaches further improve control over font styles through fine-tuning with predefined font dictionaries. However, adapting unseen fonts outside the preset is computationally expensive, often requiring tens of minutes, making real-time customization impractical. In this paper, we present FontAdapter, a framework that enables visual text generation in unseen fonts within seconds, conditioned on a reference glyph image. To this end, we find that direct training on font datasets fails to capture nuanced font attributes, limiting generalization to new glyphs. To overcome this, we propose a two-stage curriculum learning approach: FontAdapter first learns to extract font attributes from isolated glyphs and then integrates these styles into diverse natural backgrounds. To support this two-stage training scheme, we construct synthetic datasets tailored to each stage, leveraging large-scale online fonts effectively. Experiments demonstrate that FontAdapter enables high-quality, robust font customization across unseen fonts without additional fine-tuning during inference. Furthermore, it supports visual text editing, font style blending, and cross-lingual font transfer, positioning FontAdapter as a versatile framework for font customization tasks.

Title: Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Authors: Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo, Kangmin Yoo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05850
Pdf URL: https://arxiv.org/pdf/2506.05850
Copy Paste: [[2506.05850]] Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models(https://arxiv.org/abs/2506.05850)
Keywords: foundation model
Abstract: We identify \textbf{Cross-lingual Collapse}, a systematic drift in which the chain-of-thought (CoT) of a multilingual language model reverts to its dominant pre-training language even when the prompt is expressed in a different language. Recent large language models (LLMs) with reinforcement learning with verifiable reward (RLVR) have achieved strong logical reasoning performances by exposing their intermediate reasoning traces, giving rise to large reasoning models (LRMs). However, the mechanism behind multilingual reasoning in LRMs is not yet fully explored. To investigate the issue, we fine-tune multilingual LRMs with Group-Relative Policy Optimization (GRPO) on translated versions of the GSM$8$K and SimpleRL-Zoo datasets in three different languages: Chinese, Korean, and Ukrainian. During training, we monitor both task accuracy and language consistency of the reasoning chains. Our experiments reveal three key findings: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; (ii) language consistency reward mitigates this drift but does so at the expense of an almost 5 - 10 pp drop in accuracy. and (iii) the resulting language collapse is severely damaging and largely irreversible, as subsequent fine-tuning struggles to steer the model back toward its original target-language reasoning capabilities. Together, these findings point to a remarkable conclusion: \textit{not all languages are trained equally for reasoning}. Furthermore, our paper sheds light on the roles of reward shaping, data difficulty, and pre-training priors in eliciting multilingual reasoning.

Title: ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On

Authors: Jinjuan Wang, Wenzhang Sun, Ming Li, Yun Zheng, Fanyao Li, Zhulin Tao, Donglin Di, Hao Li, Wei Chen, Xianglin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05858
Pdf URL: https://arxiv.org/pdf/2506.05858
Copy Paste: [[2506.05858]] ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On(https://arxiv.org/abs/2506.05858)
Keywords: diffusion
Abstract: Video virtual try-on aims to seamlessly replace the clothing of a person in a source video with a target garment. Despite significant progress in this field, existing approaches still struggle to maintain continuity and reproduce garment details. In this paper, we introduce ChronoTailor, a diffusion-based framework that generates temporally consistent videos while preserving fine-grained garment details. By employing a precise spatio-temporal attention mechanism to guide the integration of fine-grained garment features, ChronoTailor achieves robust try-on performance. First, ChronoTailor leverages region-aware spatial guidance to steer the evolution of spatial attention and employs an attention-driven temporal feature fusion mechanism to generate more continuous temporal features. This dual approach not only enables fine-grained local editing but also effectively mitigates artifacts arising from video dynamics. Second, ChronoTailor integrates multi-scale garment features to preserve low-level visual details and incorporates a garment-pose feature alignment to ensure temporal continuity during dynamic motion. Additionally, we collect StyleDress, a new dataset featuring intricate garments, varied environments, and diverse poses, offering advantages over existing public datasets, and will be publicly available for research. Extensive experiments show that ChronoTailor maintains spatio-temporal continuity and preserves garment details during motion, significantly outperforming previous methods.

Title: CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy

Authors: Jiakai Zhang, Shouchen Zhou, Haizhao Dai, Xinhang Liu, Peihao Wang, Zhiwen Fan, Yuan Pei, Jingyi Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05864
Pdf URL: https://arxiv.org/pdf/2506.05864
Copy Paste: [[2506.05864]] CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy(https://arxiv.org/abs/2506.05864)
Keywords: foundation model
Abstract: Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from Cryo-EM noisy images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.

Title: Stealix: Model Stealing via Prompt Evolution

Authors: Zhixiong Zhuang, Hui-Po Wang, Maria-Irina Nicolae, Mario Fritz
Subjects: cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.05867
Pdf URL: https://arxiv.org/pdf/2506.05867
Copy Paste: [[2506.05867]] Stealix: Model Stealing via Prompt Evolution(https://arxiv.org/abs/2506.05867)
Keywords: diffusion, generative
Abstract: Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model's data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized.

Title: Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection

Authors: Yu Li, Xingyu Qiu, Yuqian Fu, Jie Chen, Tianwen Qian, Xu Zheng, Danda Pani Paudel, Yanwei Fu, Xuanjing Huang, Luc Van Gool, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05872
Pdf URL: https://arxiv.org/pdf/2506.05872
Copy Paste: [[2506.05872]] Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection(https://arxiv.org/abs/2506.05872)
Keywords: generative
Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results. Codes will be released upon acceptance.

Title: Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router

Authors: Chenyang Shao, Xinyang Liu, Yutang Lin, Fengli Xu, Yong Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05901
Pdf URL: https://arxiv.org/pdf/2506.05901
Copy Paste: [[2506.05901]] Route-and-Reason: Scaling Large Language Model Reasoning with Reinforced Model Router(https://arxiv.org/abs/2506.05901)
Keywords: self-supervised
Abstract: Multi-step reasoning has proven essential for enhancing the problem-solving capabilities of Large Language Models (LLMs) by decomposing complex tasks into intermediate steps, either explicitly or implicitly. Extending the reasoning chain at test time through deeper thought processes or broader exploration, can furthur improve performance, but often incurs substantial costs due to the explosion in token usage. Yet, many reasoning steps are relatively simple and can be handled by more efficient smaller-scale language models (SLMs). This motivates hybrid approaches that allocate subtasks across models of varying capacities. However, realizing such collaboration requires accurate task decomposition and difficulty-aware subtask allocation, which is challenging. To address this, we propose R2-Reasoner, a novel framework that enables collaborative reasoning across heterogeneous LLMs by dynamically routing sub-tasks based on estimated complexity. At the core of our framework is a Reinforced Model Router, composed of a task decomposer and a subtask allocator. The task decomposer segments complex input queries into logically ordered subtasks, while the subtask allocator assigns each subtask to the most appropriate model, ranging from lightweight SLMs to powerful LLMs, balancing accuracy and efficiency. To train this router, we introduce a staged pipeline that combines supervised fine-tuning on task-specific datasets with Group Relative Policy Optimization algorithm, enabling self-supervised refinement through iterative reinforcement learning. Extensive experiments across four challenging benchmarks demonstrate that R2-Reasoner reduces API costs by 86.85% while maintaining or surpassing baseline accuracy. Our framework paves the way for more cost-effective and adaptive LLM reasoning. The code is open-source at this https URL .

Title: FADE: Frequency-Aware Diffusion Model Factorization for Video Editing

Authors: Yixuan Zhu, Haolin Wang, Shilin Ma, Wenliang Zhao, Yansong Tang, Lei Chen, Jie Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05934
Pdf URL: https://arxiv.org/pdf/2506.05934
Copy Paste: [[2506.05934]] FADE: Frequency-Aware Diffusion Model Factorization for Video Editing(https://arxiv.org/abs/2506.05934)
Keywords: diffusion
Abstract: Recent advancements in diffusion frameworks have significantly enhanced video editing, achieving high fidelity and strong alignment with textual prompts. However, conventional approaches using image diffusion models fall short in handling video dynamics, particularly for challenging temporal edits like motion adjustments. While current video diffusion models produce high-quality results, adapting them for efficient editing remains difficult due to the heavy computational demands that prevent the direct application of previous image editing techniques. To overcome these limitations, we introduce FADE, a training-free yet highly effective video editing approach that fully leverages the inherent priors from pre-trained video diffusion models via frequency-aware factorization. Rather than simply using these models, we first analyze the attention patterns within the video model to reveal how video priors are distributed across different components. Building on these insights, we propose a factorization strategy to optimize each component's specialized role. Furthermore, we devise spectrum-guided modulation to refine the sampling trajectory with frequency domain cues, preventing information leakage and supporting efficient, versatile edits while preserving the basic spatial and temporal structure. Extensive experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic and temporally coherent editing results both qualitatively and quantitatively. Code is available at this https URL .

Title: Exponential Family Variational Flow Matching for Tabular Data Generation

Authors: Andrés Guzmán-Cordero, Floor Eijkelboom, Jan-Willem van de Meent
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05940
Pdf URL: https://arxiv.org/pdf/2506.05940
Copy Paste: [[2506.05940]] Exponential Family Variational Flow Matching for Tabular Data Generation(https://arxiv.org/abs/2506.05940)
Keywords: diffusion, generative
Abstract: While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We hereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.

Title: AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models

Authors: Adil Hasan, Thomas Peyrin
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05960
Pdf URL: https://arxiv.org/pdf/2506.05960
Copy Paste: [[2506.05960]] AQUATIC-Diff: Additive Quantization for Truly Tiny Compressed Diffusion Models(https://arxiv.org/abs/2506.05960)
Keywords: diffusion
Abstract: Significant investments have been made towards the commodification of diffusion models for generation of diverse media. Their mass-market adoption is however still hobbled by the intense hardware resource requirements of diffusion model inference. Model quantization strategies tailored specifically towards diffusion models have been useful in easing this burden, yet have generally explored the Uniform Scalar Quantization (USQ) family of quantization methods. In contrast, Vector Quantization (VQ) methods, which operate on groups of multiple related weights as the basic unit of compression, have seen substantial success in Large Language Model (LLM) quantization. In this work, we apply codebook-based additive vector quantization to the problem of diffusion model compression. Our resulting approach achieves a new Pareto frontier for the extremely low-bit weight quantization on the standard class-conditional benchmark of LDM-4 on ImageNet at 20 inference time steps. Notably, we report sFID 1.92 points lower than the full-precision model at W4A8 and the best-reported results for FID, sFID and ISC at W2A8. We are also able to demonstrate FLOPs savings on arbitrary hardware via an efficient inference kernel, as opposed to savings resulting from small integer operations which may lack broad hardware support.

Title: LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles

Authors: Egil Rønningstad, Gaurav Negi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.05976
Pdf URL: https://arxiv.org/pdf/2506.05976
Copy Paste: [[2506.05976]] LTG at SemEval-2025 Task 10: Optimizing Context for Classification of Narrative Roles(https://arxiv.org/abs/2506.05976)
Keywords: generative
Abstract: Our contribution to the SemEval 2025 shared task 10, subtask 1 on entity framing, tackles the challenge of providing the necessary segments from longer documents as context for classification with a masked language model. We show that a simple entity-oriented heuristics for context selection can enable text classification using models with limited context window. Our context selection approach and the XLM-RoBERTa language model is on par with, or outperforms, Supervised Fine-Tuning with larger generative language models.

Title: LightGTS: A Lightweight General Time Series Forecasting Model

Authors: Yihang Wang, Yuying Qiu, Peng Chen, Yang Shu, Zhongwen Rao, Lujia Pan, Bin Yang, Chenjuan Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06005
Pdf URL: https://arxiv.org/pdf/2506.06005
Copy Paste: [[2506.06005]] LightGTS: A Lightweight General Time Series Forecasting Model(https://arxiv.org/abs/2506.06005)
Keywords: foundation model
Abstract: Existing works on general time series forecasting build foundation models with heavy model parameters through large-scale multi-source pre-training. These models achieve superior generalization ability across various datasets at the cost of significant computational burdens and limitations in resource-constrained scenarios. This paper introduces LightGTS, a lightweight general time series forecasting model designed from the perspective of consistent periodical modeling. To handle diverse scales and intrinsic periods in multi-source pre-training, we introduce Periodical Tokenization, which extracts consistent periodic patterns across different datasets with varying scales. To better utilize the periodicity in the decoding process, we further introduce Periodical Parallel Decoding, which leverages historical tokens to improve forecasting. Based on the two techniques above which fully leverage the inductive bias of periods inherent in time series, LightGTS uses a lightweight model to achieve outstanding performance on general time series forecasting. It achieves state-of-the-art forecasting performance on 9 real-world benchmarks in both zero-shot and full-shot settings with much better efficiency compared with existing time series foundation models.

Title: Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models

Authors: Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.06006
Pdf URL: https://arxiv.org/pdf/2506.06006
Copy Paste: [[2506.06006]] Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models(https://arxiv.org/abs/2506.06006)
Keywords: foundation model
Abstract: To what extent do vision-and-language foundation models possess a realistic world model (observation $\times$ action $\rightarrow$ observation) and a dynamics model (observation $\times$ observation $\rightarrow$ action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of $15\%$ on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.

Title: Restereo: Diffusion stereo video generation and restoration

Authors: Xingchang Huang, Ashish Kumar Singh, Florian Dubost, Cristina Nader Vasconcelos, Sakar Khattar, Liang Shi, Christian Theobalt, Cengiz Oztireli, Gurprit Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06023
Pdf URL: https://arxiv.org/pdf/2506.06023
Copy Paste: [[2506.06023]] Restereo: Diffusion stereo video generation and restoration(https://arxiv.org/abs/2506.06023)
Keywords: diffusion
Abstract: Stereo video generation has been gaining increasing attention with recent advancements in video diffusion models. However, most existing methods focus on generating 3D stereoscopic videos from monocular 2D videos. These approaches typically assume that the input monocular video is of high quality, making the task primarily about inpainting occluded regions in the warped video while preserving disoccluded areas. In this paper, we introduce a new pipeline that not only generates stereo videos but also enhances both left-view and right-view videos consistently with a single model. Our approach achieves this by fine-tuning the model on degraded data for restoration, as well as conditioning the model on warped masks for consistent stereo generation. As a result, our method can be fine-tuned on a relatively small synthetic stereo video datasets and applied to low-quality real-world videos, performing both stereo video generation and restoration. Experiments demonstrate that our method outperforms existing approaches both qualitatively and quantitatively in stereo video generation from low-resolution inputs.

Title: Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification

Authors: Yuhao Sun, Jiacheng Zhang, Zesheng Ye, Chaowei Xiao, Feng Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06027
Pdf URL: https://arxiv.org/pdf/2506.06027
Copy Paste: [[2506.06027]] Sample-Specific Noise Injection For Diffusion-Based Adversarial Purification(https://arxiv.org/abs/2506.06027)
Keywords: diffusion, generative
Abstract: Diffusion-based purification (DBP) methods aim to remove adversarial noise from the input sample by first injecting Gaussian noise through a forward diffusion process, and then recovering the clean example through a reverse generative process. In the above process, how much Gaussian noise is injected to the input sample is key to the success of DBP methods, which is controlled by a constant noise level $t^*$ for all samples in existing methods. In this paper, we discover that an optimal $t^*$ for each sample indeed could be different. Intuitively, the cleaner a sample is, the less the noise it should be injected, and vice versa. Motivated by this finding, we propose a new framework, called Sample-specific Score-aware Noise Injection (SSNI). Specifically, SSNI uses a pre-trained score network to estimate how much a data point deviates from the clean data distribution (i.e., score norms). Then, based on the magnitude of score norms, SSNI applies a reweighting function to adaptively adjust $t^*$ for each sample, achieving sample-specific noise injections. Empirically, incorporating our framework with existing DBP methods results in a notable improvement in both accuracy and robustness on CIFAR-10 and ImageNet-1K, highlighting the necessity to allocate distinct noise levels to different samples in DBP methods. Our code is available at: this https URL.

Title: Large Language Models are Demonstration Pre-Selectors for Themselves

Authors: Jiarui Jin, Yuwei Wu, Haoxuan Li, Xiaoting He, Weinan Zhang, Yiming Yang, Yong Yu, Jun Wang, Mengyue Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06033
Pdf URL: https://arxiv.org/pdf/2506.06033
Copy Paste: [[2506.06033]] Large Language Models are Demonstration Pre-Selectors for Themselves(https://arxiv.org/abs/2506.06033)
Keywords: in-context
Abstract: In-context learning (ICL) with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training data. However, existing ICL methods, which rely on similarity or diversity scores to choose demonstrations, incur high computational costs due to repeatedly retrieval from large-scale datasets for each query. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel pre-selection framework that identifies a representative subset of demonstrations containing the most representative examples in the training data, tailored to specific LLMs. To construct this subset, we introduce the "sufficiency" and "necessity" metrics in the pre-selection stage and design a tree-based algorithm to identify representative examples efficiently. Once pre-selected, this representative subset can effectively replace the full training data, improving efficiency while maintaining comparable performance in ICL. Additionally, our pre-selected subset also benefits fine-tuning LLMs, where we introduce a bi-level optimization method that enhances training efficiency without sacrificing performance. Experiments with LLMs ranging from 300M to 8B parameters show that FEEDER can reduce training data size by over 20% while maintaining performance and seamlessly integrating with various downstream demonstration selection strategies in ICL.

Title: HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Authors: Shiyi Zhang, Dong Liang, Hairong Zheng, Yihang Zhou
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06035
Pdf URL: https://arxiv.org/pdf/2506.06035
Copy Paste: [[2506.06035]] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion(https://arxiv.org/abs/2506.06035)
Keywords: diffusion, generative
Abstract: Reconstructing visual information from brain activity bridges the gap between neuroscience and computer vision. Even though progress has been made in decoding images from fMRI using generative models, a challenge remains in accurately recovering highly complex visual stimuli. This difficulty stems from their elemental density and diversity, sophisticated spatial structures, and multifaceted semantic information. To address these challenges, we propose HAVIR that contains two adapters: (1) The AutoKL Adapter transforms fMRI voxels into a latent diffusion prior, capturing topological structures; (2) The CLIP Adapter converts the voxels to CLIP text and image embeddings, containing semantic information. These complementary representations are fused by Versatile Diffusion to generate the final reconstructed image. To extract the most essential semantic information from complex scenarios, the CLIP Adapter is trained with text captions describing the visual stimuli and their corresponding semantic images synthesized from these captions. The experimental results demonstrate that HAVIR effectively reconstructs both structural features and semantic information of visual stimuli even in complex scenarios, outperforming existing models.

Title: Do-PFN: In-Context Learning for Causal Effect Estimation

Authors: Jake Robertson, Arik Reuter, Siyuan Guo, Noah Hollmann, Frank Hutter, Bernhard Schölkopf
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06039
Pdf URL: https://arxiv.org/pdf/2506.06039
Copy Paste: [[2506.06039]] Do-PFN: In-Context Learning for Causal Effect Estimation(https://arxiv.org/abs/2506.06039)
Keywords: in-context
Abstract: Estimation of causal effects is critical to a range of scientific disciplines. Existing methods for this task either require interventional data, knowledge about the ground truth causal graph, or rely on assumptions such as unconfoundedness, restricting their applicability in real-world settings. In the domain of tabular machine learning, Prior-data fitted networks (PFNs) have achieved state-of-the-art predictive performance, having been pre-trained on synthetic data to solve tabular prediction problems via in-context learning. To assess whether this can be transferred to the harder problem of causal effect estimation, we pre-train PFNs on synthetic data drawn from a wide variety of causal structures, including interventions, to predict interventional outcomes given observational data. Through extensive experiments on synthetic case studies, we show that our approach allows for the accurate estimation of causal effects without knowledge of the underlying causal graph. We also perform ablation studies that elucidate Do-PFN's scalability and robustness across datasets with a variety of causal characteristics.

Title: Tensor-to-Tensor Models with Fast Iterated Sum Features

Authors: Joscha Diehl, Rasheed Ibraheem, Leonard Schmitz, Yue Wu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06041
Pdf URL: https://arxiv.org/pdf/2506.06041
Copy Paste: [[2506.06041]] Tensor-to-Tensor Models with Fast Iterated Sum Features(https://arxiv.org/abs/2506.06041)
Keywords: anomaly
Abstract: Data in the form of images or higher-order tensors is ubiquitous in modern deep learning applications. Owing to their inherent high dimensionality, the need for subquadratic layers processing such data is even more pressing than for sequence data. We propose a novel tensor-to-tensor layer with linear cost in the input size, utilizing the mathematical gadget of ``corner trees'' from the field of permutation counting. In particular, for order-two tensors, we provide an image-to-image layer that can be plugged into image processing pipelines. On the one hand, our method can be seen as a higher-order generalization of state-space models. On the other hand, it is based on a multiparameter generalization of the signature of iterated integrals (or sums). The proposed tensor-to-tensor concept is used to build a neural network layer called the Fast Iterated Sums (FIS) layer which integrates seamlessly with other layer types. We demonstrate the usability of the FIS layer with both classification and anomaly detection tasks. By replacing some layers of a smaller ResNet architecture with FIS, a similar accuracy (with a difference of only 0.1\%) was achieved in comparison to a larger ResNet while reducing the number of trainable parameters and multi-add operations. The FIS layer was also used to build an anomaly detection model that achieved an average AUROC of 97.3\% on the texture images of the popular MVTec AD dataset. The processing and modelling codes are publicly available at this https URL.

Title: Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics

Authors: Tobias Würth, Niklas Freymuth, Gerhard Neumann, Luise Kärger
Subjects: cs.LG, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.06045
Pdf URL: https://arxiv.org/pdf/2506.06045
Copy Paste: [[2506.06045]] Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics(https://arxiv.org/abs/2506.06045)
Keywords: diffusion
Abstract: Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion, a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.

Title: Full Conformal Adaptation of Medical Vision-Language Models

Authors: Julio Silva-Rodríguez, Leo Fillioux, Paul-Henry Cournède, Maria Vakalopoulou, Stergios Christodoulidis, Ismail Ben Ayed, Jose Dolz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06076
Pdf URL: https://arxiv.org/pdf/2506.06076
Copy Paste: [[2506.06076]] Full Conformal Adaptation of Medical Vision-Language Models(https://arxiv.org/abs/2506.06076)
Keywords: foundation model
Abstract: Vision-language models (VLMs) pre-trained at large scale have shown unprecedented transferability capabilities and are being progressively integrated into medical image analysis. Although its discriminative potential has been widely explored, its reliability aspect remains overlooked. This work investigates their behavior under the increasingly popular split conformal prediction (SCP) framework, which theoretically guarantees a given error level on output sets by leveraging a labeled calibration set. However, the zero-shot performance of VLMs is inherently limited, and common practice involves few-shot transfer learning pipelines, which cannot absorb the rigid exchangeability assumptions of SCP. To alleviate this issue, we propose full conformal adaptation, a novel setting for jointly adapting and conformalizing pre-trained foundation models, which operates transductively over each test data point using a few-shot adaptation set. Moreover, we complement this framework with SS-Text, a novel training-free linear probe solver for VLMs that alleviates the computational cost of such a transductive approach. We provide comprehensive experiments using 3 different modality-specialized medical VLMs and 9 adaptation tasks. Our framework requires exactly the same data as SCP, and provides consistent relative improvements of up to 27% on set efficiency while maintaining the same coverage guarantees.

Title: Feedback Guidance of Diffusion Models

Authors: Koulischer Felix, Handke Florian, Deleu Johannes, Demeester Thomas, Ambrogioni Luca
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06085
Pdf URL: https://arxiv.org/pdf/2506.06085
Copy Paste: [[2506.06085]] Feedback Guidance of Diffusion Models(https://arxiv.org/abs/2506.06085)
Keywords: diffusion
Abstract: While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG's implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive to Limited Interval Guidance (LIG) while benefitting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.

Title: Text-to-LoRA: Instant Transformer Adaption

Authors: Rujikorn Charakorn, Edoardo Cetin, Yujin Tang, Robert Tjarko Lange
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06105
Pdf URL: https://arxiv.org/pdf/2506.06105
Copy Paste: [[2506.06105]] Text-to-LoRA: Instant Transformer Adaption(https://arxiv.org/abs/2506.06105)
Keywords: foundation model
Abstract: While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyper-parameter choices. To overcome these limitations, we introduce Text-to-LoRA (T2L), a model capable of adapting Large Language Models on the fly solely based on a natural language description of the target task. T2L is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training T2L on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, T2L can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements. Our code is available at this https URL

Title: Bridging the Gap: In-Context Learning for Modeling Human Disagreement

Authors: Benedetta Muscato, Yue Li, Gizem Gezici, Zhixue Zhao, Fosca Giannotti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.06113
Pdf URL: https://arxiv.org/pdf/2506.06113
Copy Paste: [[2506.06113]] Bridging the Gap: In-Context Learning for Modeling Human Disagreement(https://arxiv.org/abs/2506.06113)
Keywords: in-context
Abstract: Large Language Models (LLMs) have shown strong performance on NLP classification tasks. However, they typically rely on aggregated labels-often via majority voting-which can obscure the human disagreement inherent in subjective annotations. This study examines whether LLMs can capture multiple perspectives and reflect annotator disagreement in subjective tasks such as hate speech and offensive language detection. We use in-context learning (ICL) in zero-shot and few-shot settings, evaluating four open-source LLMs across three label modeling strategies: aggregated hard labels, and disaggregated hard and soft labels. In few-shot prompting, we assess demonstration selection methods based on textual similarity (BM25, PLM-based), annotation disagreement (entropy), a combined ranking, and example ordering strategies (random vs. curriculum-based). Results show that multi-perspective generation is viable in zero-shot settings, while few-shot setups often fail to capture the full spectrum of human judgments. Prompt design and demonstration selection notably affect performance, though example ordering has limited impact. These findings highlight the challenges of modeling subjectivity with LLMs and the importance of building more perspective-aware, socially intelligent models.

Title: Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models

Authors: Rihui Jin, Zheyu Xin, Xing Xie, Zuoyi Li, Guilin Qi, Yongrui Chen, Xinbang Dai, Tongtong Wu, Gholamreza Haffari
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.06137
Pdf URL: https://arxiv.org/pdf/2506.06137
Copy Paste: [[2506.06137]] Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models(https://arxiv.org/abs/2506.06137)
Keywords: self-supervised
Abstract: Table reasoning (TR) requires structured reasoning over semi-structured tabular data and remains challenging, particularly for small language models (SLMs, e.g., LLaMA-8B) due to their limited capacity compared to large LMs (LLMs, e.g., GPT-4o). To narrow this gap, we explore program-based TR (P-TR), which circumvents key limitations of text-based TR (T-TR), notably in numerical reasoning, by generating executable programs. However, applying P-TR to SLMs introduces two challenges: (i) vulnerability to heterogeneity in table layouts, and (ii) inconsistency in reasoning due to limited code generation capability. We propose Table-r1, a two-stage P-TR method designed for SLMs. Stage 1 introduces an innovative self-supervised learning task, Layout Transformation Inference, to improve tabular layout generalization from a programmatic view. Stage 2 adopts a mix-paradigm variant of Group Relative Policy Optimization, enhancing P-TR consistency while allowing dynamic fallback to T-TR when needed. Experiments on four TR benchmarks demonstrate that Table-r1 outperforms all SLM-based methods, achieving at least a 15% accuracy improvement over the base model (LLaMA-8B) across all datasets and reaching performance competitive with LLMs.

Title: ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Authors: Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06158
Pdf URL: https://arxiv.org/pdf/2506.06158
Copy Paste: [[2506.06158]] ENMA: Tokenwise Autoregression for Generative Neural PDE Operators(https://arxiv.org/abs/2506.06158)
Keywords: generative, in-context
Abstract: Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete-as is often the case-a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

Title: Antithetic Noise in Diffusion Models

Authors: Jing Jia, Sifan Liu, Bowen Song, Wei Yuan, Liyue Shen, Guanyang Wang
Subjects: cs.LG, math.NA, stat.CO, stat.ML
Abstract URL: https://arxiv.org/abs/2506.06185
Pdf URL: https://arxiv.org/pdf/2506.06185
Copy Paste: [[2506.06185]] Antithetic Noise in Diffusion Models(https://arxiv.org/abs/2506.06185)
Keywords: diffusion
Abstract: We initiate a systematic study of antithetic initial noise in diffusion models. Across unconditional models trained on diverse datasets, text-conditioned latent-diffusion models, and diffusion-posterior samplers, we find that pairing each initial noise with its negation consistently yields strongly negatively correlated samples. To explain this phenomenon, we combine experiments and theoretical analysis, leading to a symmetry conjecture that the learned score function is approximately affine antisymmetric (odd symmetry up to a constant shift), and provide evidence supporting it. Leveraging this negative correlation, we enable two applications: (1) enhancing image diversity in models like Stable Diffusion without quality loss, and (2) sharpening uncertainty quantification (e.g., up to 90% narrower confidence intervals) when estimating downstream statistics. Building on these gains, we extend the two-point pairing to a randomized quasi-Monte Carlo estimator, which further improves estimation accuracy. Our framework is training-free, model-agnostic, and adds no runtime overhead.

Title: PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts

Authors: Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.06211
Pdf URL: https://arxiv.org/pdf/2506.06211
Copy Paste: [[2506.06211]] PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts(https://arxiv.org/abs/2506.06211)
Keywords: foundation model
Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at this https URL to support future work on building more general, open-ended, and creative reasoning systems.

Title: Model-Driven Graph Contrastive Learning

Authors: Ali Azizpour, Nicolas Zilberstein, Santiago Segarra
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.06212
Pdf URL: https://arxiv.org/pdf/2506.06212
Copy Paste: [[2506.06212]] Model-Driven Graph Contrastive Learning(https://arxiv.org/abs/2506.06212)
Keywords: self-supervised, generative
Abstract: We propose $\textbf{MGCL}$, a model-driven graph contrastive learning (GCL) framework that leverages graphons (probabilistic generative models for graphs) to guide contrastive learning by accounting for the data's underlying generative process. GCL has emerged as a powerful self-supervised framework for learning expressive node or graph representations without relying on annotated labels, which are often scarce in real-world data. By contrasting augmented views of graph data, GCL has demonstrated strong performance across various downstream tasks, such as node and graph classification. However, existing methods typically rely on manually designed or heuristic augmentation strategies that are not tailored to the underlying data distribution and operate at the individual graph level, ignoring similarities among graphs generated from the same model. Conversely, in our proposed approach, MGCL first estimates the graphon associated with the observed data and then defines a graphon-informed augmentation process, enabling data-adaptive and principled augmentations. Additionally, for graph-level tasks, MGCL clusters the dataset and estimates a graphon per group, enabling contrastive pairs to reflect shared semantics and structure. Extensive experiments on benchmark datasets demonstrate that MGCL achieves state-of-the-art performance, highlighting the advantages of incorporating generative models into GCL.

Title: GenIR: Generative Visual Feedback for Mental Image Retrieval

Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.06220
Pdf URL: https://arxiv.org/pdf/2506.06220
Copy Paste: [[2506.06220]] GenIR: Generative Visual Feedback for Mental Image Retrieval(https://arxiv.org/abs/2506.06220)
Keywords: diffusion, generative
Abstract: Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.

Title: Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study

Authors: Leon Mayer, Tim Rädsch, Dominik Michael, Lucas Luttner, Amine Yamlahi, Evangelia Christodoulou, Patrick Godau, Marcel Knopp, Annika Reinke, Fiona Kolbinger, Lena Maier-Hein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06232
Pdf URL: https://arxiv.org/pdf/2506.06232
Copy Paste: [[2506.06232]] Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study(https://arxiv.org/abs/2506.06232)
Keywords: foundation model
Abstract: While traditional computer vision models have historically struggled to generalize to endoscopic domains, the emergence of foundation models has shown promising cross-domain performance. In this work, we present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks with a specific focus on laparoscopic surgery. Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions: (1) Can current VLMs solve basic perception tasks on surgical images? (2) Can they handle advanced frame-based endoscopic scene understanding tasks? and (3) How do specialized medical VLMs compare to generalist models in this context? Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks. However, their performance deteriorates significantly when the tasks require medical knowledge. Notably, we find that specialized medical VLMs currently underperform compared to generalist models across both basic and advanced surgical tasks, suggesting that they are not yet optimized for the complexity of surgical environments. These findings highlight the need for further advancements to enable VLMs to handle the unique challenges posed by surgery. Overall, our work provides important insights for the development of next-generation endoscopic AI systems and identifies key areas for improvement in medical visual language models.

Title: Cartridges: Lightweight and general-purpose long context representations via self-study

Authors: Sabri Eyuboglu, Ryan Ehrlich, Simran Arora, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Christopher Re
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.06266
Pdf URL: https://arxiv.org/pdf/2506.06266
Copy Paste: [[2506.06266]] Cartridges: Lightweight and general-purpose long context representations via self-study(https://arxiv.org/abs/2506.06266)
Keywords: in-context
Abstract: Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-1M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.

Title: STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Authors: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06276
Pdf URL: https://arxiv.org/pdf/2506.06276
Copy Paste: [[2506.06276]] STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis(https://arxiv.org/abs/2506.06276)
Keywords: diffusion, generative
Abstract: We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.

Title: TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation

Authors: Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, Salman Khan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.06281
Pdf URL: https://arxiv.org/pdf/2506.06281
Copy Paste: [[2506.06281]] TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation(https://arxiv.org/abs/2506.06281)
Keywords: self-supervised, foundation model
Abstract: Modern Earth observation (EO) increasingly leverages deep learning to harness the scale and diversity of satellite imagery across sensors and regions. While recent foundation models have demonstrated promising generalization across EO tasks, many remain limited by the scale, geographical coverage, and spectral diversity of their training data, factors critical for learning globally transferable representations. In this work, we introduce TerraFM, a scalable self-supervised learning model that leverages globally distributed Sentinel-1 and Sentinel-2 imagery, combined with large spatial tiles and land-cover aware sampling to enrich spatial and semantic coverage. By treating sensing modalities as natural augmentations in our self-supervised approach, we unify radar and optical inputs via modality-specific patch embeddings and adaptive cross-attention fusion. Our training strategy integrates local-global contrastive learning and introduces a dual-centering mechanism that incorporates class-frequency-aware regularization to address long-tailed distributions in land this http URL achieves strong generalization on both classification and segmentation tasks, outperforming prior models on GEO-Bench and Copernicus-Bench. Our code and pretrained models are publicly available at: this https URL .