2025-06-30

Title: Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning

Authors: Kaiying Yan, Moyang Liu, Yukun Liu, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xuefei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21557
Pdf URL: https://arxiv.org/pdf/2506.21557
Copy Paste: [[2506.21557]] Debunk and Infer: Multimodal Fake News Detection via Diffusion-Generated Evidence and LLM Reasoning(https://arxiv.org/abs/2506.21557)
Keywords: diffusion, generative
Abstract: The rapid spread of fake news across multimedia platforms presents serious challenges to information credibility. In this paper, we propose a Debunk-and-Infer framework for Fake News Detection(DIFND) that leverages debunking knowledge to enhance both the performance and interpretability of fake news detection. DIFND integrates the generative strength of conditional diffusion models with the collaborative reasoning capabilities of multimodal large language models (MLLMs). Specifically, debunk diffusion is employed to generate refuting or authenticating evidence based on the multimodal content of news videos, enriching the evaluation process with diverse yet semantically aligned synthetic samples. To improve inference, we propose a chain-of-debunk strategy where a multi-agent MLLM system produces logic-grounded, multimodal-aware reasoning content and final veracity judgment. By jointly modeling multimodal features, generative debunking cues, and reasoning-rich verification within a unified architecture, DIFND achieves notable improvements in detection accuracy. Extensive experiments on the FakeSV and FVC datasets show that DIFND not only outperforms existing approaches but also delivers trustworthy decisions.

Title: GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations

Authors: Junze Chen, Cheng Yang, Shujie Li, Zhiqiang Zhang, Yawen Li, Junping Du, Chuan Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21559
Pdf URL: https://arxiv.org/pdf/2506.21559
Copy Paste: [[2506.21559]] GraphLAMA: Enabling Efficient Adaptation of Graph Language Models with Limited Annotations(https://arxiv.org/abs/2506.21559)
Keywords: in-context
Abstract: Large language models (LLMs) have demonstrated their strong capabilities in various domains, and have been recently integrated for graph analysis as graph language models (GLMs). With LLMs as the predictor, some GLMs can interpret unseen tasks described by natural language, and learn from a few examples in the prompts without parameter tuning, known as in-context learning (ICL). Another subset of GLMs utilizes abundant training labels to enhance model performance, known as instruction tuning. However, we argue that ICL on graphs has effectiveness issues due to fixed parameters and efficiency issues due to long context. Meanwhile, the large amount of labeled data required for instruction tuning can be difficult to obtain in real-world scenarios. To this end, we aim to introduce an extra parameter adaptation stage that can efficiently tailor GLMs to an unseen graph and task with only a few labeled examples, in exchange for better prediction accuracy and faster inference speed. For implementation, in this paper we propose GraphLAMA method, with its model backbone and learning schemes specialized for efficient tuning and inference. Specifically, for model backbone, we use a graph neural network (GNN) with several well-designed components to transform nodes into the representation space of LLM tokens. Task instructions can then be represented as a mixture of node and language tokens. In the pre-training stage, model parameters except the LLM will be trained with different tasks to capture general knowledge. In the adaptation stage, only a few pre-trained parameters will be updated based on few-shot examples. Extensive experiments on few/zero-shot node classification and summary generation show that our proposed GraphLAMA achieves state-of-the-art performance with 4.91% absolution improvement in accuracy. Compared with ICL, our inference speed can be 10 times faster under 5-shot setting.

Title: FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction

Authors: Jun Yin, Pengyu Zeng, Jing Zhong, Peilin Li, Miao Zhang, Ran Luo, Shuai Lu
Subjects: cs.CL, cs.AI, cs.AR
Abstract URL: https://arxiv.org/abs/2506.21562
Pdf URL: https://arxiv.org/pdf/2506.21562
Copy Paste: [[2506.21562]] FloorPlan-DeepSeek (FPDS): A multimodal approach to floorplan generation using vector-based next room prediction(https://arxiv.org/abs/2506.21562)
Keywords: diffusion, generative
Abstract: In the architectural design process, floor plan generation is inherently progressive and iterative. However, existing generative models for floor plans are predominantly end-to-end generation that produce an entire pixel-based layout in a single pass. This paradigm is often incompatible with the incremental workflows observed in real-world architectural practice. To address this issue, we draw inspiration from the autoregressive 'next token prediction' mechanism commonly used in large language models, and propose a novel 'next room prediction' paradigm tailored to architectural floor plan modeling. Experimental evaluation indicates that FPDS demonstrates competitive performance in comparison to diffusion models and Tell2Design in the text-to-floorplan task, indicating its potential applicability in supporting future intelligent architectural design.

Title: VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Authors: Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyi Liu, Kwan-Liu Ma
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2506.21582
Pdf URL: https://arxiv.org/pdf/2506.21582
Copy Paste: [[2506.21582]] VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents(https://arxiv.org/abs/2506.21582)
Keywords: generative
Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaroration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

Title: Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops

Authors: Christoph Brosch, Sian Brumm, Rolf Krieger, Jonas Scheffler
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21585
Pdf URL: https://arxiv.org/pdf/2506.21585
Copy Paste: [[2506.21585]] Evaluation of LLM-based Strategies for the Extraction of Food Product Information from Online Shops(https://arxiv.org/abs/2506.21585)
Keywords: generative
Abstract: Generative AI and large language models (LLMs) offer significant potential for automating the extraction of structured information from web pages. In this work, we focus on food product pages from online retailers and explore schema-constrained extraction approaches to retrieve key product attributes, such as ingredient lists and nutrition tables. We compare two LLM-based approaches, direct extraction and indirect extraction via generated functions, evaluating them in terms of accuracy, efficiency, and cost on a curated dataset of 3,000 food product pages from three different online shops. Our results show that although the indirect approach achieves slightly lower accuracy (96.48\%, $-1.61\%$ compared to direct extraction), it reduces the number of required LLM calls by 95.82\%, leading to substantial efficiency gains and lower operational costs. These findings suggest that indirect extraction approaches can provide scalable and cost-effective solutions for large-scale information extraction tasks from template-based web pages using LLMs.

Title: Does Multimodality Lead to Better Time Series Forecasting?

Authors: Xiyuan Zhang, Boran Han, Haoyang Fang, Abdul Fatir Ansari, Shuai Zhang, Danielle C. Maddix, Cuixiong Hu, Andrew Gordon Wilson, Michael W. Mahoney, Hao Wang, Yan Liu, Huzefa Rangwala, George Karypis, Bernie Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21611
Pdf URL: https://arxiv.org/pdf/2506.21611
Copy Paste: [[2506.21611]] Does Multimodality Lead to Better Time Series Forecasting?(https://arxiv.org/abs/2506.21611)
Keywords: foundation model
Abstract: Recently, there has been growing interest in incorporating textual information into foundation models for time series forecasting. However, it remains unclear whether and under what conditions such multimodal integration consistently yields gains. We systematically investigate these questions across a diverse benchmark of 14 forecasting tasks spanning 7 domains, including health, environment, and economics. We evaluate two popular multimodal forecasting paradigms: aligning-based methods, which align time series and text representations; and prompting-based methods, which directly prompt large language models for forecasting. Although prior works report gains from multimodal input, we find these effects are not universal across datasets and models, and multimodal methods sometimes do not outperform the strongest unimodal baselines. To understand when textual information helps, we disentangle the effects of model architectural properties and data characteristics. Our findings highlight that on the modeling side, incorporating text information is most helpful given (1) high-capacity text models, (2) comparatively weaker time series models, and (3) appropriate aligning strategies. On the data side, performance gains are more likely when (4) sufficient training data is available and (5) the text offers complementary predictive signal beyond what is already captured from the time series alone. Our empirical findings offer practical guidelines for when multimodality can be expected to aid forecasting tasks, and when it does not.

Title: Performance of diverse evaluation metrics in NLP-based assessment and text generation of consumer complaints

Authors: Peiheng Gao, Chen Yang, Ning Sun, Ričardas Zitikis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21623
Pdf URL: https://arxiv.org/pdf/2506.21623
Copy Paste: [[2506.21623]] Performance of diverse evaluation metrics in NLP-based assessment and text generation of consumer complaints(https://arxiv.org/abs/2506.21623)
Keywords: generative
Abstract: Machine learning (ML) has significantly advanced text classification by enabling automated understanding and categorization of complex, unstructured textual data. However, accurately capturing nuanced linguistic patterns and contextual variations inherent in natural language, particularly within consumer complaints, remains a challenge. This study addresses these issues by incorporating human-experience-trained algorithms that effectively recognize subtle semantic differences crucial for assessing consumer relief eligibility. Furthermore, we propose integrating synthetic data generation methods that utilize expert evaluations of generative adversarial networks and are refined through expert annotations. By combining expert-trained classifiers with high-quality synthetic data, our research seeks to significantly enhance machine learning classifier performance, reduce dataset acquisition costs, and improve overall evaluation metrics and robustness in text classification tasks.

Title: TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation

Authors: Hakan Çapuk, Andrew Bond, Muhammed Burak Kızıl, Emir Göçen, Erkut Erdem, Aykut Erdem
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21681
Pdf URL: https://arxiv.org/pdf/2506.21681
Copy Paste: [[2506.21681]] TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation(https://arxiv.org/abs/2506.21681)
Keywords: diffusion, generative
Abstract: Recent advances in image generation have led to remarkable improvements in synthesizing perspective images. However, these models still struggle with panoramic image generation due to unique challenges, including varying levels of geometric distortion and the requirement for seamless loop-consistency. To address these issues while leveraging the strengths of the existing models, we introduce TanDiT, a method that synthesizes panoramic scenes by generating grids of tangent-plane images covering the entire 360$^\circ$ view. Unlike previous methods relying on multiple diffusion branches, TanDiT utilizes a unified diffusion model trained to produce these tangent-plane images simultaneously within a single denoising iteration. Furthermore, we propose a model-agnostic post-processing step specifically designed to enhance global coherence across the generated panoramas. To accurately assess panoramic image quality, we also present two specialized metrics, TangentIS and TangentFID, and provide a comprehensive benchmark comprising captioned panoramic datasets and standardized evaluation scripts. Extensive experiments demonstrate that our method generalizes effectively beyond its training data, robustly interprets detailed and complex text prompts, and seamlessly integrates with various generative models to yield high-quality, diverse panoramic images.

Title: Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers

Authors: Tzu-Quan Lin, Hsi-Chun Cheng, Hung-yi Lee, Hao Tang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.21712
Pdf URL: https://arxiv.org/pdf/2506.21712
Copy Paste: [[2506.21712]] Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers(https://arxiv.org/abs/2506.21712)
Keywords: self-supervised
Abstract: In recent years, the impact of self-supervised speech Transformers has extended to speaker-related applications. However, little research has explored how these models encode speaker information. In this work, we address this gap by identifying neurons in the feed-forward layers that are correlated with speaker information. Specifically, we analyze neurons associated with k-means clusters of self-supervised features and i-vectors. Our analysis reveals that these clusters correspond to broad phonetic and gender classes, making them suitable for identifying neurons that represent speakers. By protecting these neurons during pruning, we can significantly preserve performance on speaker-related task, demonstrating their crucial role in encoding speaker information.

Title: $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$: Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling

Authors: Denis Gudovskiy, Wenzhao Zheng, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21714
Pdf URL: https://arxiv.org/pdf/2506.21714
Copy Paste: [[2506.21714]] $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$: Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling(https://arxiv.org/abs/2506.21714)
Keywords: diffusion
Abstract: Recently, continuous normalizing flows (CNFs) and diffusion models (DMs) have been studied using the unified theoretical framework. Although such models can generate high-quality data points from a noise distribution, the sampling demands multiple iterations to solve an ordinary differential equation (ODE) with high computational complexity. Most existing methods focus on reducing the number of time steps during the sampling process to improve efficiency. In this work, we explore a complementary direction in which the quality-complexity tradeoff can be dynamically controlled in terms of time steps and in the length of the neural network. We achieve this by rewiring the blocks in the transformer-based architecture to solve an inner discretized ODE w.r.t. its length. Then, we employ time- and length-wise consistency terms during flow matching training, and as a result, the sampling can be performed with an arbitrary number of time steps and transformer blocks. Unlike others, our $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$ approach is solver-agnostic in time dimension and decreases both latency and memory usage. Compared to the previous state of the art, image generation experiments on CelebA-HQ and ImageNet show a latency reduction of up to $3\times$ in the most efficient sampling mode, and a FID score improvement of up to $3.5$ points for high-quality sampling. We release our code and model weights with fully reproducible experiments.

Title: Elucidating and Endowing the Diffusion Training Paradigm for General Image Restoration

Authors: Xin Lu, Xueyang Fu, Jie Xiao, Zihao Fan, Yurui Zhu, Zheng-Jun Zha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21722
Pdf URL: https://arxiv.org/pdf/2506.21722
Copy Paste: [[2506.21722]] Elucidating and Endowing the Diffusion Training Paradigm for General Image Restoration(https://arxiv.org/abs/2506.21722)
Keywords: diffusion, generative
Abstract: While diffusion models demonstrate strong generative capabilities in image restoration (IR) tasks, their complex architectures and iterative processes limit their practical application compared to mainstream reconstruction-based general ordinary IR networks. Existing approaches primarily focus on optimizing network architecture and diffusion paths but overlook the integration of the diffusion training paradigm within general ordinary IR frameworks. To address these challenges, this paper elucidates key principles for adapting the diffusion training paradigm to general IR training through systematic analysis of time-step dependencies, network hierarchies, noise-level relationships, and multi-restoration task correlations, proposing a new IR framework supported by diffusion-based training. To enable IR networks to simultaneously restore images and model generative representations, we introduce a series of regularization strategies that align diffusion objectives with IR tasks, improving generalization in single-task scenarios. Furthermore, recognizing that diffusion-based generation exerts varying influences across different IR tasks, we develop an incremental training paradigm and task-specific adaptors, further enhancing performance in multi-task unified IR. Experiments demonstrate that our method significantly improves the generalization of IR networks in single-task IR and achieves superior performance in multi-task unified IR. Notably, the proposed framework can be seamlessly integrated into existing general IR architectures.

Title: Asymmetric Dual Self-Distillation for 3D Self-Supervised Representation Learning

Authors: Remco F. Leijenaar, Hamidreza Kasaei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21724
Pdf URL: https://arxiv.org/pdf/2506.21724
Copy Paste: [[2506.21724]] Asymmetric Dual Self-Distillation for 3D Self-Supervised Representation Learning(https://arxiv.org/abs/2506.21724)
Keywords: self-supervised
Abstract: Learning semantically meaningful representations from unstructured 3D point clouds remains a central challenge in computer vision, especially in the absence of large-scale labeled datasets. While masked point modeling (MPM) is widely used in self-supervised 3D learning, its reconstruction-based objective can limit its ability to capture high-level semantics. We propose AsymDSD, an Asymmetric Dual Self-Distillation framework that unifies masked modeling and invariance learning through prediction in the latent space rather than the input space. AsymDSD builds on a joint embedding architecture and introduces several key design choices: an efficient asymmetric setup, disabling attention between masked queries to prevent shape leakage, multi-mask sampling, and a point cloud adaptation of multi-crop. AsymDSD achieves state-of-the-art results on ScanObjectNN (90.53%) and further improves to 93.72% when pretrained on 930k shapes, surpassing prior methods.

Title: Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis

Authors: Chenqiu Zhao, Anup Basu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21731
Pdf URL: https://arxiv.org/pdf/2506.21731
Copy Paste: [[2506.21731]] Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis(https://arxiv.org/abs/2506.21731)
Keywords: generative
Abstract: We propose two theoretical frameworks, the Mutually Exclusive Probability Space (MESP) and the Local Correlation Hypothesis (LCH), to explore a potential limitation in probabilistic generative models; namely that learning global distributions leads to memorization rather than generative behavior. MESP emerges from our rethinking of the Variational Autoencoder (VAE). We observe that latent variable distributions in VAE exhibit overlap, which leads to an optimization conflict between the reconstruction loss and KL-divergence loss. A lower bound based on the overlap coefficient is proposed. We refer to this phenomenon as Mutually Exclusive Probability Spaces. Based on MESP, a Binary Latent Autoencoder (BL-AE) is proposed to encode images into binary latent representations. These binary latents are used as the input to our Autoregressive Random Variable Model (ARVM), a modified autoregressive model outputting histograms. Our ARVM achieves competitive FID scores, outperforming state-of-the-art methods on standard datasets. However, such scores reflect memorization rather than generation. To address this issue, we propose the Local Correlation Hypothesis (LCH), which posits that generative capability arising from local correlations among latent variables. Comprehensive experiments and discussions are conducted to validate our frameworks.

Title: M3PO: Massively Multi-Task Model-Based Policy Optimization

Authors: Aditya Narendra, Dmitry Makarov, Aleksandr Panov
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21782
Pdf URL: https://arxiv.org/pdf/2506.21782
Copy Paste: [[2506.21782]] M3PO: Massively Multi-Task Model-Based Policy Optimization(https://arxiv.org/abs/2506.21782)
Keywords: generative
Abstract: We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This eliminates the bias-variance trade-off in prior methods by using discrepancies between model-based and model-free value estimates to guide exploration, while maintaining stable policy updates through a trust-region optimizer. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.

Title: Evaluating List Construction and Temporal Understanding capabilities of Large Language Models

Authors: Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21783
Pdf URL: https://arxiv.org/pdf/2506.21783
Copy Paste: [[2506.21783]] Evaluating List Construction and Temporal Understanding capabilities of Large Language Models(https://arxiv.org/abs/2506.21783)
Keywords: generative
Abstract: Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors on particularly temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, generate a complete list of entities in answers or reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of the model to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time referenced List based Question Answering or TLQA benchmark that requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark, requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup and the need to improve retrieval in open-domain setup, providing clear future directions for research on TLQA. The benchmark and code at this https URL.

Title: Multi-task parallelism for robust pre-training of graph foundation models on multi-source, multi-fidelity atomistic modeling data

Authors: Massimiliano Lupo Pasini, Jong Youl Choi, Pei Zhang, Kshitij Mehta, Rylie Weaver, Ashwin M. Aji, Karl W. Schulz, Jorda Polo, Prasanna Balaprakash
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI, physics.atm-clus
Abstract URL: https://arxiv.org/abs/2506.21788
Pdf URL: https://arxiv.org/pdf/2506.21788
Copy Paste: [[2506.21788]] Multi-task parallelism for robust pre-training of graph foundation models on multi-source, multi-fidelity atomistic modeling data(https://arxiv.org/abs/2506.21788)
Keywords: foundation model
Abstract: Graph foundation models using graph neural networks promise sustainable, efficient atomistic modeling. To tackle challenges of processing multi-source, multi-fidelity data during pre-training, recent studies employ multi-task learning, in which shared message passing layers initially process input atomistic structures regardless of source, then route them to multiple decoding heads that predict data-specific outputs. This approach stabilizes pre-training and enhances a model's transferability to unexplored chemical regions. Preliminary results on approximately four million structures are encouraging, yet questions remain about generalizability to larger, more diverse datasets and scalability on supercomputers. We propose a multi-task parallelism method that distributes each head across computing resources with GPU acceleration. Implemented in the open-source HydraGNN architecture, our method was trained on over 24 million structures from five datasets and tested on the Perlmutter, Aurora, and Frontier supercomputers, demonstrating efficient scaling on all three highly heterogeneous super-computing architectures.

Title: Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models

Authors: Rafael Sterzinger, Marco Peer, Robert Sablatnig
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21826
Pdf URL: https://arxiv.org/pdf/2506.21826
Copy Paste: [[2506.21826]] Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models(https://arxiv.org/abs/2506.21826)
Keywords: foundation model
Abstract: As rich sources of history, maps provide crucial insights into historical changes, yet their diverse visual representations and limited annotated data pose significant challenges for automated processing. We propose a simple yet effective approach for few-shot segmentation of historical maps, leveraging the rich semantic embeddings of large vision foundation models combined with parameter-efficient fine-tuning. Our method outperforms the state-of-the-art on the Siegfried benchmark dataset in vineyard and railway segmentation, achieving +5% and +13% relative improvements in mIoU in 10-shot scenarios and around +20% in the more challenging 5-shot setting. Additionally, it demonstrates strong performance on the ICDAR 2021 competition dataset, attaining a mean PQ of 67.3% for building block segmentation, despite not being optimized for this shape-sensitive metric, underscoring its generalizability. Notably, our approach maintains high performance even in extremely low-data regimes (10- & 5-shot), while requiring only 689k trainable parameters - just 0.21% of the total model size. Our approach enables precise segmentation of diverse historical maps while drastically reducing the need for manual annotations, advancing automated processing and analysis in the field. Our implementation is publicly available at: this https URL.

Title: TaleForge: Interactive Multimodal System for Personalized Story Creation

Authors: Minh-Loi Nguyen, Quang-Khai Le, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21832
Pdf URL: https://arxiv.org/pdf/2506.21832
Copy Paste: [[2506.21832]] TaleForge: Interactive Multimodal System for Personalized Story Creation(https://arxiv.org/abs/2506.21832)
Keywords: diffusion
Abstract: Storytelling is a deeply personal and creative process, yet existing methods often treat users as passive consumers, offering generic plots with limited personalization. This undermines engagement and immersion, especially where individual style or appearance is crucial. We introduce TaleForge, a personalized story-generation system that integrates large language models (LLMs) and text-to-image diffusion to embed users' facial images within both narratives and illustrations. TaleForge features three interconnected modules: Story Generation, where LLMs create narratives and character descriptions from user prompts; Personalized Image Generation, merging users' faces and outfit choices into character illustrations; and Background Generation, creating scene backdrops that incorporate personalized characters. A user study demonstrated heightened engagement and ownership when individuals appeared as protagonists. Participants praised the system's real-time previews and intuitive controls, though they requested finer narrative editing tools. TaleForge advances multimodal storytelling by aligning personalized text and imagery to create immersive, user-centric experiences.

Title: PrefPaint: Enhancing Image Inpainting through Expert Human Feedback

Authors: Duy-Bao Bui, Hoang-Khang Nguyen, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21834
Pdf URL: https://arxiv.org/pdf/2506.21834
Copy Paste: [[2506.21834]] PrefPaint: Enhancing Image Inpainting through Expert Human Feedback(https://arxiv.org/abs/2506.21834)
Keywords: diffusion
Abstract: Inpainting, the process of filling missing or corrupted image parts, has broad applications, including medical imaging. However, in specialized fields like medical polyps imaging, where accuracy and reliability are critical, inpainting models can generate inaccurate images, leading to significant errors in medical diagnosis and treatment. To ensure reliability, medical images should be annotated by experts like oncologists for effective model training. We propose PrefPaint, an approach that incorporates human feedback into the training process of Stable Diffusion Inpainting, bypassing the need for computationally expensive reward models. In addition, we develop a web-based interface streamlines training, fine-tuning, and inference. This interactive interface provides a smooth and intuitive user experience, making it easier to offer feedback and manage the fine-tuning process. User study on various domains shows that PrefPaint outperforms existing methods, reducing visual inconsistencies and improving image rendering, particularly in medical contexts, where our model generates more realistic polyps images.

Title: ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts

Authors: Xiaoqi Wang, Clint Sebastian, Wenbin He, Liu Ren
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21835
Pdf URL: https://arxiv.org/pdf/2506.21835
Copy Paste: [[2506.21835]] ProSAM: Enhancing the Robustness of SAM-based Visual Reference Segmentation with Probabilistic Prompts(https://arxiv.org/abs/2506.21835)
Keywords: foundation model
Abstract: The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at object boundaries due to suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5$^i$ and COCO-20$^i$ datasets, providing a more robust solution for visual reference segmentation.

Title: PARSI: Persian Authorship Recognition via Stylometric Integration

Authors: Kourosh Shahnazari, Mohammadali Keshtparvar, Seyed Moein Ayyoubzadeh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21840
Pdf URL: https://arxiv.org/pdf/2506.21840
Copy Paste: [[2506.21840]] PARSI: Persian Authorship Recognition via Stylometric Integration(https://arxiv.org/abs/2506.21840)
Keywords: generative
Abstract: The intricate linguistic, stylistic, and metrical aspects of Persian classical poetry pose a challenge for computational authorship attribution. In this work, we present a versatile framework to determine authorship among 67 prominent poets. We employ a multi-input neural framework consisting of a transformer-based language encoder complemented by features addressing the semantic, stylometric, and metrical dimensions of Persian poetry. Our feature set encompasses 100-dimensional Word2Vec embeddings, seven stylometric measures, and categorical encodings of poetic form and meter. We compiled a vast corpus of 647,653 verses of the Ganjoor digital collection, validating the data through strict preprocessing and author verification while preserving poem-level splitting to prevent overlap. This work employs verse-level classification and majority and weighted voting schemes in evaluation, revealing that weighted voting yields 71% accuracy. We further investigate threshold-based decision filtering, allowing the model to generate highly confident predictions, achieving 97% accuracy at a 0.9 threshold, though at lower coverage. Our work focuses on the integration of deep representational forms with domain-specific features for improved authorship attribution. The results illustrate the potential of our approach for automated classification and the contribution to stylistic analysis, authorship disputes, and general computational literature research. This research will facilitate further research on multilingual author attribution, style shift, and generative modeling of Persian poetry.

Title: 3D-Telepathy: Reconstructing 3D Objects from EEG Signals

Authors: Yuxiang Ge, Jionghao Cheng, Ruiquan Ge, Zhaojie Fang, Gangyong Jia, Xiang Wan, Nannan Li, Ahmed Elazab, Changmiao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21843
Pdf URL: https://arxiv.org/pdf/2506.21843
Copy Paste: [[2506.21843]] 3D-Telepathy: Reconstructing 3D Objects from EEG Signals(https://arxiv.org/abs/2506.21843)
Keywords: diffusion, self-supervised
Abstract: Reconstructing 3D visual stimuli from Electroencephalography (EEG) data holds significant potential for applications in Brain-Computer Interfaces (BCIs) and aiding individuals with communication disorders. Traditionally, efforts have focused on converting brain activity into 2D images, neglecting the translation of EEG data into 3D objects. This limitation is noteworthy, as the human brain inherently processes three-dimensional spatial information regardless of whether observing 2D images or the real world. The neural activities captured by EEG contain rich spatial information that is inevitably lost when reconstructing only 2D images, thus limiting its practical applications in BCI. The transition from EEG data to 3D object reconstruction faces considerable obstacles. These include the presence of extensive noise within EEG signals and a scarcity of datasets that include both EEG and 3D information, which complicates the extraction process of 3D visual data. Addressing this challenging task, we propose an innovative EEG encoder architecture that integrates a dual self-attention mechanism. We use a hybrid training strategy to train the EEG Encoder, which includes cross-attention, contrastive learning, and self-supervised learning techniques. Additionally, by employing stable diffusion as a prior distribution and utilizing Variational Score Distillation to train a neural radiation field, we successfully generate 3D objects with similar content and structure from EEG data.

Title: Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Authors: Jiho Choi, Sang Jun Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21855
Pdf URL: https://arxiv.org/pdf/2506.21855
Copy Paste: [[2506.21855]] Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation(https://arxiv.org/abs/2506.21855)
Keywords: self-supervised
Abstract: In this paper, we propose a method that learns a general representation of periodic signals from unlabeled facial videos by capturing subtle changes in skin tone over time. The proposed framework employs the video masked autoencoder to learn a high-dimensional spatio-temporal representation of the facial region through self-supervised learning. Capturing quasi-periodic signals in the video is crucial for remote photoplethysmography (rPPG) estimation. To account for signal periodicity, we apply frame masking in terms of video sampling, which allows the model to capture resampled quasi-periodic signals during the pre-training stage. Moreover, the framework incorporates physiological bandlimit constraints, leveraging the property that physiological signals are sparse within their frequency bandwidth to provide pulse cues to the model. The pre-trained encoder is then transferred to the rPPG task, where it is used to extract physiological signals from facial videos. We evaluate the proposed method through extensive experiments on the PURE, UBFC-rPPG, MMPD, and V4V datasets. Our results demonstrate significant performance improvements, particularly in challenging cross-dataset evaluations. Our code is available at this https URL.

Title: SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space

Authors: Ekaterina Redekop, Mara Pleasure, Zichen Wang, Kimberly Flores, Anthony Sisk, William Speier, Corey W. Arnold
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21857
Pdf URL: https://arxiv.org/pdf/2506.21857
Copy Paste: [[2506.21857]] SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space(https://arxiv.org/abs/2506.21857)
Keywords: self-supervised, foundation model
Abstract: The rapid growth of digital pathology and advances in self-supervised deep learning have enabled the development of foundational models for various pathology tasks across diverse diseases. While multimodal approaches integrating diverse data sources have emerged, a critical gap remains in the comprehensive integration of whole-slide images (WSIs) with spatial transcriptomics (ST), which is crucial for capturing critical molecular heterogeneity beyond standard hematoxylin & eosin (H&E) staining. We introduce SPADE, a foundation model that integrates histopathology with ST data to guide image representation learning within a unified framework, in effect creating an ST-informed latent space. SPADE leverages a mixture-of-data experts technique, where experts, created via two-stage feature-space clustering, use contrastive learning to learn representations of co-registered WSI patches and gene expression profiles. Pre-trained on the comprehensive HEST-1k dataset, SPADE is evaluated on 14 downstream tasks, demonstrating significantly superior few-shot performance compared to baseline models, highlighting the benefits of integrating morphological and molecular information into one latent space.

Title: On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling

Authors: Stanley Wu, Ronik Bhaskar, Anna Yoo Jeong Ha, Shawn Shan, Haitao Zheng, Ben Y. Zhao
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21874
Pdf URL: https://arxiv.org/pdf/2506.21874
Copy Paste: [[2506.21874]] On the Feasibility of Poisoning Text-to-Image AI Models via Adversarial Mislabeling(https://arxiv.org/abs/2506.21874)
Keywords: generative
Abstract: Today's text-to-image generative models are trained on millions of images sourced from the Internet, each paired with a detailed caption produced by Vision-Language Models (VLMs). This part of the training pipeline is critical for supplying the models with large volumes of high-quality image-caption pairs during training. However, recent work suggests that VLMs are vulnerable to stealthy adversarial attacks, where adversarial perturbations are added to images to mislead the VLMs into producing incorrect captions. In this paper, we explore the feasibility of adversarial mislabeling attacks on VLMs as a mechanism to poisoning training pipelines for text-to-image models. Our experiments demonstrate that VLMs are highly vulnerable to adversarial perturbations, allowing attackers to produce benign-looking images that are consistently miscaptioned by the VLM models. This has the effect of injecting strong "dirty-label" poison samples into the training pipeline for text-to-image models, successfully altering their behavior with a small number of poisoned samples. We find that while potential defenses can be effective, they can be targeted and circumvented by adaptive attackers. This suggests a cat-and-mouse game that is likely to reduce the quality of training data and increase the cost of text-to-image model development. Finally, we demonstrate the real-world effectiveness of these attacks, achieving high attack success (over 73%) even in black-box scenarios against commercial VLMs (Google Vertex AI and Microsoft Azure).

Title: TOAST: Task-Oriented Adaptive Semantic Transmission over Dynamic Wireless Environments

Authors: Sheng Yun, Jianhua Pei, Ping Wang
Subjects: cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2506.21900
Pdf URL: https://arxiv.org/pdf/2506.21900
Copy Paste: [[2506.21900]] TOAST: Task-Oriented Adaptive Semantic Transmission over Dynamic Wireless Environments(https://arxiv.org/abs/2506.21900)
Keywords: diffusion
Abstract: The evolution toward 6G networks demands a fundamental shift from bit-centric transmission to semantic-aware communication that emphasizes task-relevant information. This work introduces TOAST (Task-Oriented Adaptive Semantic Transmission), a unified framework designed to address the core challenge of multi-task optimization in dynamic wireless environments through three complementary components. First, we formulate adaptive task balancing as a Markov decision process, employing deep reinforcement learning to dynamically adjust the trade-off between image reconstruction fidelity and semantic classification accuracy based on real-time channel conditions. Second, we integrate module-specific Low-Rank Adaptation (LoRA) mechanisms throughout our Swin Transformer-based joint source-channel coding architecture, enabling parameter-efficient fine-tuning that dramatically reduces adaptation overhead while maintaining full performance across diverse channel impairments including Additive White Gaussian Noise (AWGN), fading, phase noise, and impulse interference. Third, we incorporate an Elucidating diffusion model that operates in the latent space to restore features corrupted by channel noises, providing substantial quality improvements compared to baseline approaches. Extensive experiments across multiple datasets demonstrate that TOAST achieves superior performance compared to baseline approaches, with significant improvements in both classification accuracy and reconstruction quality at low Signal-to-Noise Ratio (SNR) conditions while maintaining robust performance across all tested scenarios.

Title: Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications

Authors: Yangyang Wan, Haotian Wang, Xuhui Yu, Jiageng Chen, Xinyu Fan, Zuyuan He
Subjects: cs.LG, physics.app-ph, physics.optics
Abstract URL: https://arxiv.org/abs/2506.21952
Pdf URL: https://arxiv.org/pdf/2506.21952
Copy Paste: [[2506.21952]] Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications(https://arxiv.org/abs/2506.21952)
Keywords: generative
Abstract: Distributed acoustic sensing (DAS) has attracted considerable attention across various fields and artificial intelligence (AI) technology plays an important role in DAS applications to realize event recognition and denoising. Existing AI models require real-world data (RWD), whether labeled or not, for training, which is contradictory to the fact of limited available event data in real-world scenarios. Here, a physics-informed DAS neural network paradigm is proposed, which does not need real-world events data for training. By physically modeling target events and the constraints of real world and DAS system, physical functions are derived to train a generative network for generation of DAS events data. DAS debackground net is trained by using the generated DAS events data to eliminate background noise in DAS data. The effectiveness of the proposed paradigm is verified in event identification application based on a public dataset of DAS spatiotemporal data and in belt conveyor fault monitoring application based on DAS time-frequency data, and achieved comparable or better performance than data-driven networks trained with RWD. Owing to the introduction of physical information and capability of background noise removal, the paradigm demonstrates generalization in same application on different sites. A fault diagnosis accuracy of 91.8% is achieved in belt conveyor field with networks which transferred from simulation test site without any fault events data of test site and field for training. The proposed paradigm is a prospective solution to address significant obstacles of data acquisition and intense noise in practical DAS applications and explore more potential fields for DAS.

Title: Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement

Authors: Hao Jiang, Yongxiang Tang, Yanxiang Zeng, Pengjia Yuan, Yanhua Cheng, Teng Sha, Xialong Liu, Peng Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21956
Pdf URL: https://arxiv.org/pdf/2506.21956
Copy Paste: [[2506.21956]] Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement(https://arxiv.org/abs/2506.21956)
Keywords: generative
Abstract: In the realm of online advertising, advertisers partake in ad auctions to obtain advertising slots, frequently taking advantage of auto-bidding tools provided by demand-side platforms. To improve the automation of these bidding systems, we adopt generative models, namely the Decision Transformer (DT), to tackle the difficulties inherent in automated bidding. Applying the Decision Transformer to the auto-bidding task enables a unified approach to sequential modeling, which efficiently overcomes short-sightedness by capturing long-term dependencies between past bidding actions and user behavior. Nevertheless, conventional DT has certain drawbacks: (1) DT necessitates a preset return-to-go (RTG) value before generating actions, which is not inherently produced; (2) The policy learned by DT is restricted by its training data, which is consists of mixed-quality trajectories. To address these challenges, we introduce the R* Decision Transformer (R* DT), developed in a three-step process: (1) R DT: Similar to traditional DT, R DT stores actions based on state and RTG value, as well as memorizing the RTG for a given state using the training set; (2) R^ DT: We forecast the highest value (within the training set) of RTG for a given state, deriving a suboptimal policy based on the current state and the forecasted supreme RTG value; (3) R* DT: Based on R^ DT, we generate trajectories and select those with high rewards (using a simulator) to augment our training dataset. This data enhancement has been shown to improve the RTG of trajectories in the training data and gradually leads the suboptimal policy towards optimality. Comprehensive tests on a publicly available bidding dataset validate the R* DT's efficacy and highlight its superiority when dealing with mixed-quality trajectories.

Title: Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding

Authors: Yixin Zha, Chuxin Wang, Wenfei Yang, Tianzhu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21957
Pdf URL: https://arxiv.org/pdf/2506.21957
Copy Paste: [[2506.21957]] Exploring Semantic Masked Autoencoder for Self-supervised Point Cloud Understanding(https://arxiv.org/abs/2506.21957)
Keywords: self-supervised
Abstract: Point cloud understanding aims to acquire robust and general feature representations from unlabeled data. Masked point modeling-based methods have recently shown significant performance across various downstream tasks. These pre-training methods rely on random masking strategies to establish the perception of point clouds by restoring corrupted point cloud inputs, which leads to the failure of capturing reasonable semantic relationships by the self-supervised models. To address this issue, we propose Semantic Masked Autoencoder, which comprises two main components: a prototype-based component semantic modeling module and a component semantic-enhanced masking strategy. Specifically, in the component semantic modeling module, we design a component semantic guidance mechanism to direct a set of learnable prototypes in capturing the semantics of different components from objects. Leveraging these prototypes, we develop a component semantic-enhanced masking strategy that addresses the limitations of random masking in effectively covering complete component structures. Furthermore, we introduce a component semantic-enhanced prompt-tuning strategy, which further leverages these prototypes to improve the performance of pre-trained models in downstream tasks. Extensive experiments conducted on datasets such as ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of our proposed modules.

Title: Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism

Authors: Simon Münker, Nils Schwager, Achim Rettinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.21974
Pdf URL: https://arxiv.org/pdf/2506.21974
Copy Paste: [[2506.21974]] Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism(https://arxiv.org/abs/2506.21974)
Keywords: generative
Abstract: The ability of Large Language Models (LLMs) to mimic human behavior triggered a plethora of computational social science research, assuming that empirical studies of humans can be conducted with AI agents instead. Since there have been conflicting research findings on whether and when this hypothesis holds, there is a need to better understand the differences in their experimental designs. We focus on replicating the behavior of social network users with the use of LLMs for the analysis of communication on social networks. First, we provide a formal framework for the simulation of social networks, before focusing on the sub-task of imitating user communication. We empirically test different approaches to imitate user behavior on X in English and German. Our findings suggest that social simulations should be validated by their empirical realism measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling for social simulation.

Title: TASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models

Authors: Meng Yu, Te Cui, Qitong Chu, Wenjie Song, Yi Yang, Yufeng Yue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21975
Pdf URL: https://arxiv.org/pdf/2506.21975
Copy Paste: [[2506.21975]] TASeg: Text-aware RGB-T Semantic Segmentation based on Fine-tuning Vision Foundation Models(https://arxiv.org/abs/2506.21975)
Keywords: foundation model
Abstract: Reliable semantic segmentation of open environments is essential for intelligent systems, yet significant problems remain: 1) Existing RGB-T semantic segmentation models mainly rely on low-level visual features and lack high-level textual information, which struggle with accurate segmentation when categories share similar visual characteristics. 2) While SAM excels in instance-level segmentation, integrating it with thermal images and text is hindered by modality heterogeneity and computational inefficiency. To address these, we propose TASeg, a text-aware RGB-T segmentation framework by using Low-Rank Adaptation (LoRA) fine-tuning technology to adapt vision foundation models. Specifically, we propose a Dynamic Feature Fusion Module (DFFM) in the image encoder, which effectively merges features from multiple visual modalities while freezing SAM's original transformer blocks. Additionally, we incorporate CLIP-generated text embeddings in the mask decoder to enable semantic alignment, which further rectifies the classification error and improves the semantic understanding accuracy. Experimental results across diverse datasets demonstrate that our method achieves superior performance in challenging scenarios with fewer trainable parameters.

Title: SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model

Authors: Shuhan Tan, John Lambert, Hong Jeon, Sakshum Kulshrestha, Yijing Bai, Jing Luo, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang
Subjects: cs.LG, cs.AI, cs.CV, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21976
Pdf URL: https://arxiv.org/pdf/2506.21976
Copy Paste: [[2506.21976]] SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model(https://arxiv.org/abs/2506.21976)
Keywords: generative
Abstract: The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the requirements above. We demonstrate the city-scale traffic simulation capability of SceneDiffuser++ and study its superior realism under long simulation conditions. We evaluate the simulation quality on an augmented version of the Waymo Open Motion Dataset (WOMD) with larger map regions to support trip-level simulation.

Title: RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, Abhinav Valada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22007
Pdf URL: https://arxiv.org/pdf/2506.22007
Copy Paste: [[2506.22007]] RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation(https://arxiv.org/abs/2506.22007)
Keywords: diffusion
Abstract: We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulations in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each of the two generated frames, achieving the long-horizon video. 2) We propose a semantics preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.

Title: Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion

Authors: Ying Zhang, Yu Zhao, Xuhui Sui, Baohang Zhou, Xiangrui Cai, Li Shen, Xiaojie Yuan, Dacheng Tao
Subjects: cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.22036
Pdf URL: https://arxiv.org/pdf/2506.22036
Copy Paste: [[2506.22036]] Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion(https://arxiv.org/abs/2506.22036)
Keywords: diffusion
Abstract: With the increasing multimodal knowledge privatization requirements, multimodal knowledge graphs in different institutes are usually decentralized, lacking of effective collaboration system with both stronger reasoning ability and transmission safety guarantees. In this paper, we propose the Federated Multimodal Knowledge Graph Completion (FedMKGC) task, aiming at training over federated MKGs for better predicting the missing links in clients without sharing sensitive knowledge. We propose a framework named MMFeD3-HidE for addressing multimodal uncertain unavailability and multimodal client heterogeneity challenges of FedMKGC. (1) Inside the clients, our proposed Hyper-modal Imputation Diffusion Embedding model (HidE) recovers the complete multimodal distributions from incomplete entity embeddings constrained by available modalities. (2) Among clients, our proposed Multimodal FeDerated Dual Distillation (MMFeD3) transfers knowledge mutually between clients and the server with logit and feature distillation to improve both global convergence and semantic consistency. We propose a FedMKGC benchmark for a comprehensive evaluation, consisting of a general FedMKGC backbone named MMFedE, datasets with heterogeneous multimodal information, and three groups of constructed baselines. Experiments conducted on our benchmark validate the effectiveness, semantic consistency, and convergence robustness of MMFeD3-HidE.

Title: UniCA: Adapting Time Series Foundation Model to General Covariate-Aware Forecasting

Authors: Lu Han, Yu Liu, Qiwen Deng, Jian Jiang, Yinbo Sun, Zhe Yu, Binfeng Wang, Xingyu Lu, Lintao Ma, Han-Jia Ye, De-Chuan Zhan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22039
Pdf URL: https://arxiv.org/pdf/2506.22039
Copy Paste: [[2506.22039]] UniCA: Adapting Time Series Foundation Model to General Covariate-Aware Forecasting(https://arxiv.org/abs/2506.22039)
Keywords: foundation model
Abstract: Time Series Foundation Models (TSFMs) have achieved remarkable success through large-scale pretraining. However, their design primarily targets real-valued series, limiting their ability to handle general forecasting tasks involving diverse and often heterogeneous covariates--such as categorical variables and multimodal data (e.g., images, text)--which are typically task-specific and difficult to leverage during pretraining. To address this gap, we propose Unified Covariate Adaptation (UniCA), a framework to bridge TSFMs with general covariate-aware forecasting. UniCA first performs covariate homogenization to transform heterogeneous covariates into high-level homogeneous series representations and then fuses them via a unified attention-based fusion mechanism. UniCA is compatible and universal for adaptation with both homogeneous and heterogeneous covariates, incorporating extra covariate information while preserving the generalization ability of this http URL experiments on multiple unimodal and multimodal covariate-aware forecasting benchmarks demonstrate the superiority of UniCA, highlighting the promise of covariate-aware TSFM adaptation in real-world forecasting scenarios. Codes are released on this https URL.

Title: Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field

Authors: Hong Nie, Fuyuan Cao, Lu Chen, Fengxin Chen, Yuefeng Zou, Jun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22044
Pdf URL: https://arxiv.org/pdf/2506.22044
Copy Paste: [[2506.22044]] Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field(https://arxiv.org/abs/2506.22044)
Keywords: generative
Abstract: Reconstruction and rendering-based talking head synthesis methods achieve high-quality results with strong identity preservation but are limited by their dependence on identity-specific models. Each new identity requires training from scratch, incurring high computational costs and reduced scalability compared to generative model-based approaches. To overcome this limitation, we propose FIAG, a novel 3D speaking head synthesis framework that enables efficient identity-specific adaptation using only a few training footage. FIAG incorporates Global Gaussian Field, which supports the representation of multiple identities within a shared field, and Universal Motion Field, which captures the common motion dynamics across diverse identities. Benefiting from the shared facial structure information encoded in the Global Gaussian Field and the general motion priors learned in the motion field, our framework enables rapid adaptation from canonical identity representations to specific ones with minimal data. Extensive comparative and ablation experiments demonstrate that our method outperforms existing state-of-the-art approaches, validating both the effectiveness and generalizability of the proposed framework. Code is available at: \textit{this https URL}.

Title: MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation

Authors: Dechao Meng, Steven Xiao, Xindi Zhang, Guangyuan Wang, Peng Zhang, Qi Wang, Bang Zhang, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22065
Pdf URL: https://arxiv.org/pdf/2506.22065
Copy Paste: [[2506.22065]] MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation(https://arxiv.org/abs/2506.22065)
Keywords: diffusion
Abstract: Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into denoising processes, their reliance on frame-by-frame UNet architectures introduces prohibitive latency and struggles with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent space denoising. To address LTX's trade-offs between compression and semantic fidelity, we propose three innovations: 1. A reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; 2. A causal audio encoder and adapter tailored to LTX's temporal structure, enabling precise audio-expression synchronization; and 3. A progressive training strategy combining close-up facial training, half-body synthesis with facial masking, and hand pose integration for enhanced gesture control. Extensive experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.

Title: Reasoning in machine vision: learning to think fast and slow

Authors: Shaheer U. Saeed, Yipei Wang, Veeru Kasivisvanathan, Brian R. Davidson, Matthew J. Clarkson, Yipeng Hu, Daniel C. Alexander
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22075
Pdf URL: https://arxiv.org/pdf/2506.22075
Copy Paste: [[2506.22075]] Reasoning in machine vision: learning to think fast and slow(https://arxiv.org/abs/2506.22075)
Keywords: foundation model
Abstract: Reasoning is a hallmark of human intelligence, enabling adaptive decision-making in complex and unfamiliar scenarios. In contrast, machine intelligence remains bound to training data, lacking the ability to dynamically refine solutions at inference time. While some recent advances have explored reasoning in machines, these efforts are largely limited to verbal domains such as mathematical problem-solving, where explicit rules govern step-by-step reasoning. Other critical real-world tasks - including visual perception, spatial reasoning, and radiological diagnosis - require non-verbal reasoning, which remains an open challenge. Here we present a novel learning paradigm that enables machine reasoning in vision by allowing performance improvement with increasing thinking time (inference-time compute), even under conditions where labelled data is very limited. Inspired by dual-process theories of human cognition in psychology, our approach integrates a fast-thinking System I module for familiar tasks, with a slow-thinking System II module that iteratively refines solutions using self-play reinforcement learning. This paradigm mimics human reasoning by proposing, competing over, and refining solutions in data-scarce scenarios. We demonstrate superior performance through extended thinking time, compared not only to large-scale supervised learning but also foundation models and even human experts, in real-world vision tasks. These tasks include computer-vision benchmarks and cancer localisation on medical images across five organs, showcasing transformative potential for non-verbal machine reasoning.

Title: Tied Prototype Model for Few-Shot Medical Image Segmentation

Authors: Hyeongji Kim, Stine Hansen, Michael Kampffmeyer
Subjects: cs.CV, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.22101
Pdf URL: https://arxiv.org/pdf/2506.22101
Copy Paste: [[2506.22101]] Tied Prototype Model for Few-Shot Medical Image Segmentation(https://arxiv.org/abs/2506.22101)
Keywords: anomaly
Abstract: Common prototype-based medical image few-shot segmentation (FSS) methods model foreground and background classes using class-specific prototypes. However, given the high variability of the background, a more promising direction is to focus solely on foreground modeling, treating the background as an anomaly -- an approach introduced by ADNet. Yet, ADNet faces three key limitations: dependence on a single prototype per class, a focus on binary classification, and fixed thresholds that fail to adapt to patient and organ variability. To address these shortcomings, we propose the Tied Prototype Model (TPM), a principled reformulation of ADNet with tied prototype locations for foreground and background distributions. Building on its probabilistic foundation, TPM naturally extends to multiple prototypes and multi-class segmentation while effectively separating non-typical background features. Notably, both extensions lead to improved segmentation accuracy. Finally, we leverage naturally occurring class priors to define an ideal target for adaptive thresholds, boosting segmentation performance. Taken together, TPM provides a fresh perspective on prototype-based FSS for medical image segmentation. The code can be found at this https URL.

Title: RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models

Authors: Ronald Fecso, José Morano, Ursula Schmidt-Erfurth, Hrvoje Bogunović
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22149
Pdf URL: https://arxiv.org/pdf/2506.22149
Copy Paste: [[2506.22149]] RetFiner: A Vision-Language Refinement Scheme for Retinal Foundation Models(https://arxiv.org/abs/2506.22149)
Keywords: self-supervised, foundation model
Abstract: The rise of imaging techniques such as optical coherence tomography (OCT) and advances in deep learning (DL) have enabled clinicians and researchers to streamline retinal disease staging. A popular DL approach is self-supervised learning (SSL), where models learn from vast amounts of unlabeled data, avoiding costly annotation. SSL has allowed the development of foundation models (FMs), large models that can be used for a variety of downstream tasks. However, existing FMs for OCT, trained solely on image data, lack a comprehensive and robust semantic understanding of images, as evidenced by their downstream performance (especially for complex tasks), and thus require supervised fine-tuning (which may be unfeasible) to better adapt to specific applications and populations. To address this, we propose RetFiner, an SSL vision-language refinement scheme that improves the representations of existing FMs and enables their efficient and direct adaptation to specific populations for improved downstream performance. Our method uses a diverse set of training objectives which take advantage of the rich supervisory signal found in textual data. We tested RetFiner on the retinal FMs RETFound, UrFound, and VisionFM, showing significant improvements in linear probing performance on seven highly diverse OCT classification tasks, with an average increase of 5.8, 3.9, and 2.1 percentage points over their baselines, respectively. Our code and model weights are publicly available at this https URL.

Title: Leveraging In-Context Learning for Political Bias Testing of LLMs

Authors: Patrick Haller, Jannis Vamvas, Rico Sennrich, Lena A. Jäger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2506.22232
Pdf URL: https://arxiv.org/pdf/2506.22232
Copy Paste: [[2506.22232]] Leveraging In-Context Learning for Political Bias Testing of LLMs(https://arxiv.org/abs/2506.22232)
Keywords: in-context
Abstract: A growing body of work has been querying LLMs with political questions to evaluate their potential biases. However, this probing method has limited stability, making comparisons between models unreliable. In this paper, we argue that LLMs need more context. We propose a new probing task, Questionnaire Modeling (QM), that uses human survey data as in-context examples. We show that QM improves the stability of question-based bias evaluation, and demonstrate that it may be used to compare instruction-tuned models to their base versions. Experiments with LLMs of various sizes indicate that instruction tuning can indeed change the direction of bias. Furthermore, we observe a trend that larger models are able to leverage in-context examples more effectively, and generally exhibit smaller bias scores in QM. Data and code are publicly available.

Title: OutDreamer: Video Outpainting with a Diffusion Transformer

Authors: Linhao Zhong, Fan Li, Yi Huang, Jianzhuang Liu, Renjing Pei, Fenglong Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22298
Pdf URL: https://arxiv.org/pdf/2506.22298
Copy Paste: [[2506.22298]] OutDreamer: Video Outpainting with a Diffusion Transformer(https://arxiv.org/abs/2506.22298)
Keywords: diffusion
Abstract: Video outpainting is a challenging task that generates new video content by extending beyond the boundaries of an original input video, requiring both temporal and spatial consistency. Many state-of-the-art methods utilize latent diffusion models with U-Net backbones but still struggle to achieve high quality and adaptability in generated content. Diffusion transformers (DiTs) have emerged as a promising alternative because of their superior performance. We introduce OutDreamer, a DiT-based video outpainting framework comprising two main components: an efficient video control branch and a conditional outpainting branch. The efficient video control branch effectively extracts masked video information, while the conditional outpainting branch generates missing content based on these extracted conditions. Additionally, we propose a mask-driven self-attention layer that dynamically integrates the given mask information, further enhancing the model's adaptability to outpainting tasks. Furthermore, we introduce a latent alignment loss to maintain overall consistency both within and between frames. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content, ensuring temporal consistency across video clips. Extensive evaluations demonstrate that our zero-shot OutDreamer outperforms state-of-the-art zero-shot methods on widely recognized benchmarks.

Title: Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling

Authors: Erkan Turan, Aristotelis Siozopoulos, Maks Ovsjanikov
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.22304
Pdf URL: https://arxiv.org/pdf/2506.22304
Copy Paste: [[2506.22304]] Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling(https://arxiv.org/abs/2506.22304)
Keywords: diffusion, generative
Abstract: Conditional Flow Matching (CFM) offers a simulation-free framework for training continuous-time generative models, bridging diffusion and flow-based approaches. However, sampling from CFM still relies on numerically solving non-linear ODEs which can be computationally expensive and difficult to interpret. Recent alternatives address sampling speed via trajectory straightening, mini-batch coupling or distillation. However, these methods typically do not shed light on the underlying \textit{structure} of the generative process. In this work, we propose to accelerate CFM and introduce an interpretable representation of its dynamics by integrating Koopman operator theory, which models non-linear flows as linear evolution in a learned space of observables. We introduce a decoder-free Koopman-CFM architecture that learns an embedding where the generative dynamics become linear, enabling closed-form, one-step sampling via matrix exponentiation. This results in significant speedups over traditional CFM as demonstrated on controlled 2D datasets and real-world benchmarks, MNIST, Fashion-MNIST (F-MNIST), and the Toronto Face Dataset (TFD). Unlike previous methods, our approach leads to a well-structured Koopman generator, whose spectral properties, eigenvalues, and eigenfunctions offer principled tools for analyzing generative behavior such as temporal scaling, mode stability, and decomposition in Koopman latent space. By combining sampling efficiency with analytical structure, Koopman-enhanced flow matching offers a potential step toward fast and interpretable generative modeling.

Title: Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment

Authors: Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22385
Pdf URL: https://arxiv.org/pdf/2506.22385
Copy Paste: [[2506.22385]] Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment(https://arxiv.org/abs/2506.22385)
Keywords: generative
Abstract: Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning-the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For solving the classification task, we propose the Chain of Counterfactual Thought framework, utilizing counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset, with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting our proposed method in enhancing dynamic reasoning capabilities of VLMMs.

Title: HyperCLOVA X THINK Technical Report

Authors: NAVER Cloud HyperCLOVA X Team
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22403
Pdf URL: https://arxiv.org/pdf/2506.22403
Copy Paste: [[2506.22403]] HyperCLOVA X THINK Technical Report(https://arxiv.org/abs/2506.22403)
Keywords: foundation model
Abstract: We introduce HyperCLOVA X THINK, the first reasoning-focused large language model in the HyperCLOVA X family, pre-trained on roughly $6$ trillion high-quality Korean, and English tokens, augmented with targeted synthetic Korean data. It was implemented as a compute-memory-balanced Peri-LN Transformer scaled with $\mu$P, pre-trained through a three-stage curriculum that expands the context window to $128$K tokens, and post-trained via supervised fine-tuning with Reinforcement Learning from Verifiable Rewards supports both detailed rationale and concise-answer modes. It delivers competitive performance against similarly sized models on Korea-focused benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while preserving robust bilingual consistency and translation quality. In addition, a vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM benchmark, all of which are achieved with substantially lower training compute than existing models of similar sizes. We also present a pruning and distillation technique that will soon be applied to HyperCLOVA X THINK for an open-source and business-friendly foundation model. Altogether, these capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI innovation and a valuable resource for the global research community.

Title: Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy

Authors: Yuhao Liu, Tengfei Wang, Fang Liu, Zhenwei Wang, Rynson W.H. Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22432
Pdf URL: https://arxiv.org/pdf/2506.22432
Copy Paste: [[2506.22432]] Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy(https://arxiv.org/abs/2506.22432)
Keywords: diffusion, generative
Abstract: Recent advances in deep generative modeling have unlocked unprecedented opportunities for video synthesis. In real-world applications, however, users often seek tools to faithfully realize their creative editing intentions with precise and consistent control. Despite the progress achieved by existing methods, ensuring fine-grained alignment with user intentions remains an open and challenging problem. In this work, we present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Shape-for-Motion achieves this by converting the target object in the input video to a time-consistent mesh, i.e., a 3D proxy, allowing edits to be performed directly on the proxy and then inferred back to the video frames. To simplify the editing process, we design a novel Dual-Propagation Strategy that allows users to perform edits on the 3D mesh of a single frame, and the edits are then automatically propagated to the 3D meshes of the other frames. The 3D meshes for different frames are further projected onto the 2D space to produce the edited geometry and texture renderings, which serve as inputs to a decoupled video diffusion model for generating edited results. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition. Our approach marks a key step toward high-quality, controllable video editing workflows. Extensive experiments demonstrate the superiority and effectiveness of our approach. Project page: this https URL

Title: MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Authors: Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22434
Pdf URL: https://arxiv.org/pdf/2506.22434
Copy Paste: [[2506.22434]] MiCo: Multi-image Contrast for Reinforcement Visual Reasoning(https://arxiv.org/abs/2506.22434)
Keywords: self-supervised
Abstract: This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.