2025-07-23

Title: PAT++: a cautionary tale about generative visual augmentation for Object Re-identification

Authors: Leonardo Santiago Benitez Pereira, Arathy Jeevan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.15888
Pdf URL: https://arxiv.org/pdf/2507.15888
Copy Paste: [[2507.15888]] PAT++: a cautionary tale about generative visual augmentation for Object Re-identification(https://arxiv.org/abs/2507.15888)
Keywords: diffusion, generative
Abstract: Generative data augmentation has demonstrated gains in several vision tasks, but its impact on object re-identification - where preserving fine-grained visual details is essential - remains largely unexplored. In this work, we assess the effectiveness of identity-preserving image generation for object re-identification. Our novel pipeline, named PAT++, incorporates Diffusion Self-Distillation into the well-established Part-Aware Transformer. Using the Urban Elements ReID Challenge dataset, we conduct extensive experiments with generated images used for both model training and query expansion. Our results show consistent performance degradation, driven by domain shifts and failure to retain identity-defining features. These findings challenge assumptions about the transferability of generative models to fine-grained recognition tasks and expose key limitations in current approaches to visual augmentation for identity-preserving applications.

Title: ReDi: Rectified Discrete Flow

Authors: Jaehoon Yoo, Wonjung Kim, Seunghoon Hong
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2507.15897
Pdf URL: https://arxiv.org/pdf/2507.15897
Copy Paste: [[2507.15897]] ReDi: Rectified Discrete Flow(https://arxiv.org/abs/2507.15897)
Keywords: generative
Abstract: Discrete Flow-based Models (DFMs) are powerful generative models for high-quality discrete data but typically suffer from slow sampling speeds due to their reliance on iterative decoding processes. This reliance on a multi-step process originates from the factorization approximation of DFMs, which is necessary for handling high-dimensional data. In this paper, we rigorously characterize the approximation error from factorization using Conditional Total Correlation (TC), which depends on the coupling. To reduce the Conditional TC and enable efficient few-step generation, we propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces factorization error by rectifying the coupling between source and target distributions. We theoretically prove that each ReDi step guarantees a monotonic decreasing Conditional TC, ensuring its convergence. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation. Moreover, we demonstrate that the rectified couplings are well-suited for training efficient one-step models on image generation. ReDi offers a simple and theoretically grounded approach for tackling the few-step challenge, providing a new perspective on efficient discrete data synthesis. Code is available at this https URL

Title: Foundation Models and Transformers for Anomaly Detection: A Survey

Authors: Mouïn Ben Ammar, Arturo Mendoza, Nacim Belkhir, Antoine Manzanera, Gianni Franchi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.15905
Pdf URL: https://arxiv.org/pdf/2507.15905
Copy Paste: [[2507.15905]] Foundation Models and Transformers for Anomaly Detection: A Survey(https://arxiv.org/abs/2507.15905)
Keywords: foundation model, anomaly
Abstract: In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.

Title: HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs

Authors: Adrian Kaiser, Claudiu Leoveanu-Condrei, Ryan Gold, Marius-Constantin Dinu, Markus Hofmarcher
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.15917
Pdf URL: https://arxiv.org/pdf/2507.15917
Copy Paste: [[2507.15917]] HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs(https://arxiv.org/abs/2507.15917)
Keywords: generative
Abstract: The synergy between symbolic knowledge, often represented by Knowledge Graphs (KGs), and the generative capabilities of neural networks is central to advancing neurosymbolic AI. A primary bottleneck in realizing this potential is the difficulty of automating KG construction, which faces challenges related to output reliability, consistency, and verifiability. These issues can manifest as structural inconsistencies within the generated graphs, such as the formation of disconnected $\textit{isolated islands}$ of data or the inaccurate conflation of abstract classes with specific instances. To address these challenges, we propose HyDRA, a $\textbf{Hy}$brid-$\textbf{D}$riven $\textbf{R}$easoning $\textbf{A}$rchitecture designed for verifiable KG automation. Given a domain or an initial set of documents, HyDRA first constructs an ontology via a panel of collaborative neurosymbolic agents. These agents collaboratively agree on a set of competency questions (CQs) that define the scope and requirements the ontology must be able to answer. Given these CQs, we build an ontology graph that subsequently guides the automated extraction of triplets for KG generation from arbitrary documents. Inspired by design-by-contracts (DbC) principles, our method leverages verifiable contracts as the primary control mechanism to steer the generative process of Large Language Models (LLMs). To verify the output of our approach, we extend beyond standard benchmarks and propose an evaluation framework that assesses the functional correctness of the resulting KG by leveraging symbolic verifications as described by the neurosymbolic AI framework, $\textit{SymbolicAI}$. This work contributes a hybrid-driven architecture for improving the reliability of automated KG construction and the exploration of evaluation methods for measuring the functional integrity of its output. The code is publicly available.

Title: Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation

Authors: Sumit Singh, Rohit Mishra, Uma Shanker Tiwary
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16002
Pdf URL: https://arxiv.org/pdf/2507.16002
Copy Paste: [[2507.16002]] Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation(https://arxiv.org/abs/2507.16002)
Keywords: generative
Abstract: One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. In order to improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and Generative Models ( Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with retrieved data from external relevant contexts, notably from Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without RA. However, Llama2-70B, lama3-70B and GPT3.5-turbo are utilised for few-shot NER generation. Our investigation shows that the mentioned language models (LMs) with Retrieval Augmentation (RA) outperform baseline methods that don't incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, in the presence of RA. Fine-tuned Llama2-7B outperforms Llama2-7B by a significant margin. On the other hand the generative models which are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and llama3-70B did not adopt RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.

Title: Learning without training: The implicit dynamics of in-context learning

Authors: Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, Javier Gonzalvo
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16003
Pdf URL: https://arxiv.org/pdf/2507.16003
Copy Paste: [[2507.16003]] Learning without training: The implicit dynamics of in-context learning(https://arxiv.org/abs/2507.16003)
Keywords: in-context
Abstract: One of the most striking features of Large Language Models (LLM) is their ability to learn in context. Namely at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP, allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in context and not only during training. Specifically, we show under mild simplifying assumptions how a transformer block implicitly transforms a context into a low-rank weight-update of the MLP layer.

Title: AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering

Authors: Simon Baeuerle, Max Radyschevski, Ulrike Pado
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16054
Pdf URL: https://arxiv.org/pdf/2507.16054
Copy Paste: [[2507.16054]] AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering(https://arxiv.org/abs/2507.16054)
Keywords: generative
Abstract: In large organisations, knowledge is mainly shared in meetings, which takes up significant amounts of work time. Additionally, frequent in-person meetings produce inconsistent documentation -- official minutes, personal notes, presentations may or may not exist. Shared information therefore becomes hard to retrieve outside of the meeting, necessitating lengthy updates and high-frequency meeting schedules. Generative Artificial Intelligence (genAI) models like Large Language Models (LLMs) exhibit an impressive performance on spoken and written language processing. This motivates a practical usage of genAI for knowledge management in engineering departments: using genAI for transcribing meetings and integrating heterogeneous additional information sources into an easily usable format for ad-hoc searches. We implement an end-to-end pipeline to automate the entire meeting documentation workflow in a proof-of-concept state: meetings are recorded and minutes are created by genAI. These are further made easily searchable through a chatbot interface. The core of our work is to test this genAI-based software tooling in a real-world engineering department and collect extensive survey data on both ethical and technical aspects. Direct feedback from this real-world setup points out both opportunities and risks: a) users agree that the effort for meetings could be significantly reduced with the help of genAI models, b) technical aspects are largely solved already, c) organizational aspects are crucial for a successful ethical usage of such a system.

Title: Deep Researcher with Test-Time Diffusion

Authors: Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solène Maître, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, Chen-Yu Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16075
Pdf URL: https://arxiv.org/pdf/2507.16075
Copy Paste: [[2507.16075]] Deep Researcher with Test-Time Diffusion(https://arxiv.org/abs/2507.16075)
Keywords: diffusion
Abstract: Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a "denoising" process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.

Title: Efficient Compositional Multi-tasking for On-device Large Language Models

Authors: Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16083
Pdf URL: https://arxiv.org/pdf/2507.16083
Copy Paste: [[2507.16083]] Efficient Compositional Multi-tasking for On-device Large Language Models(https://arxiv.org/abs/2507.16083)
Keywords: generative
Abstract: Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

Title: Improving Personalized Image Generation through Social Context Feedback

Authors: Parul Gupta, Abhinav Dhall, Thanh-Toan Do
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16095
Pdf URL: https://arxiv.org/pdf/2507.16095
Copy Paste: [[2507.16095]] Improving Personalized Image Generation through Social Context Feedback(https://arxiv.org/abs/2507.16095)
Keywords: diffusion
Abstract: Personalized image generation, where reference images of one or more subjects are used to generate their image according to a scene description, has gathered significant interest in the community. However, such generated images suffer from three major limitations -- complex activities, such as $<$man, pushing, motorcycle$>$ are not generated properly with incorrect human poses, reference human identities are not preserved, and generated human gaze patterns are unnatural/inconsistent with the scene description. In this work, we propose to overcome these shortcomings through feedback-based fine-tuning of existing personalized generation methods, wherein, state-of-art detectors of pose, human-object-interaction, human facial recognition and human gaze-point estimation are used to refine the diffusion model. We also propose timestep-based inculcation of different feedback modules, depending upon whether the signal is low-level (such as human pose), or high-level (such as gaze point). The images generated in this manner show an improvement in the generated interactions, facial identities and image quality over three benchmark datasets.

Title: Stop-band Energy Constraint for Orthogonal Tunable Wavelet Units in Convolutional Neural Networks for Computer Vision problems

Authors: An D. Le, Hung Nguyen, Sungbal Seo, You-Suk Bae, Truong Q. Nguyen
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2507.16114
Pdf URL: https://arxiv.org/pdf/2507.16114
Copy Paste: [[2507.16114]] Stop-band Energy Constraint for Orthogonal Tunable Wavelet Units in Convolutional Neural Networks for Computer Vision problems(https://arxiv.org/abs/2507.16114)
Keywords: anomaly
Abstract: This work introduces a stop-band energy constraint for filters in orthogonal tunable wavelet units with a lattice structure, aimed at improving image classification and anomaly detection in CNNs, especially on texture-rich datasets. Integrated into ResNet-18, the method enhances convolution, pooling, and downsampling operations, yielding accuracy gains of 2.48% on CIFAR-10 and 13.56% on the Describable Textures dataset. Similar improvements are observed in ResNet-34. On the MVTec hazelnut anomaly detection task, the proposed method achieves competitive results in both segmentation and detection, outperforming existing approaches.

Title: PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

Authors: Yaofang Liu, Yumeng Ren, Aitor Artola, Yuxuan Hu, Xiaodong Cun, Xiaotong Zhao, Alan Zhao, Raymond H. Chan, Suiyun Zhang, Rui Liu, Dandan Tu, Jean-Michel Morel
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16116
Pdf URL: https://arxiv.org/pdf/2507.16116
Copy Paste: [[2507.16116]] PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation(https://arxiv.org/abs/2507.16116)
Keywords: diffusion, foundation model, generative
Abstract: The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Besides, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency -- surpassing the performance of Wan-I2V-14B with $\leq$ 1/200 of the training cost (\$500 vs. $\geq$ \$100,000) and $\leq$ 1/2500 of the dataset size (4K vs. $\geq$ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32\% (vs. 86.86\% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension -- all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model's generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at this https URL

Title: DP2Guard: A Lightweight and Byzantine-Robust Privacy-Preserving Federated Learning Scheme for Industrial IoT

Authors: Baofu Han, Bing Li, Yining Qi, Raja Jurdak, Kaibin Huang, Chau Yuen
Subjects: cs.CR, cs.DC
Abstract URL: https://arxiv.org/abs/2507.16134
Pdf URL: https://arxiv.org/pdf/2507.16134
Copy Paste: [[2507.16134]] DP2Guard: A Lightweight and Byzantine-Robust Privacy-Preserving Federated Learning Scheme for Industrial IoT(https://arxiv.org/abs/2507.16134)
Keywords: anomaly
Abstract: Privacy-Preserving Federated Learning (PPFL) has emerged as a secure distributed Machine Learning (ML) paradigm that aggregates locally trained gradients without exposing raw data. To defend against model poisoning threats, several robustness-enhanced PPFL schemes have been proposed by integrating anomaly detection. Nevertheless, they still face two major challenges: (1) the reliance on heavyweight encryption techniques results in substantial communication and computation overhead; and (2) single-strategy defense mechanisms often fail to provide sufficient robustness against adaptive adversaries. To overcome these challenges, we propose DP2Guard, a lightweight PPFL framework that enhances both privacy and robustness. DP2Guard leverages a lightweight gradient masking mechanism to replace costly cryptographic operations while ensuring the privacy of local gradients. A hybrid defense strategy is proposed, which extracts gradient features using singular value decomposition and cosine similarity, and applies a clustering algorithm to effectively identify malicious gradients. Additionally, DP2Guard adopts a trust score-based adaptive aggregation scheme that adjusts client weights according to historical behavior, while blockchain records aggregated results and trust scores to ensure tamper-proof and auditable training. Extensive experiments conducted on two public datasets demonstrate that DP2Guard effectively defends against four advanced poisoning attacks while ensuring privacy with reduced communication and computation costs.

Title: Learning Patient-Specific Spatial Biomarker Dynamics via Operator Learning for Alzheimer's Disease Progression

Authors: Jindong Wang, Yutong Mao, Xiao Liu, Wenrui Hao
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2507.16148
Pdf URL: https://arxiv.org/pdf/2507.16148
Copy Paste: [[2507.16148]] Learning Patient-Specific Spatial Biomarker Dynamics via Operator Learning for Alzheimer's Disease Progression(https://arxiv.org/abs/2507.16148)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a complex, multifactorial neurodegenerative disorder with substantial heterogeneity in progression and treatment response. Despite recent therapeutic advances, predictive models capable of accurately forecasting individualized disease trajectories remain limited. Here, we present a machine learning-based operator learning framework for personalized modeling of AD progression, integrating longitudinal multimodal imaging, biomarker, and clinical data. Unlike conventional models with prespecified dynamics, our approach directly learns patient-specific disease operators governing the spatiotemporal evolution of amyloid, tau, and neurodegeneration biomarkers. Using Laplacian eigenfunction bases, we construct geometry-aware neural operators capable of capturing complex brain dynamics. Embedded within a digital twin paradigm, the framework enables individualized predictions, simulation of therapeutic interventions, and in silico clinical trials. Applied to AD clinical data, our method achieves high prediction accuracy exceeding 90% across multiple biomarkers, substantially outperforming existing approaches. This work offers a scalable, interpretable platform for precision modeling and personalized therapeutic optimization in neurodegenerative diseases.

Title: LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation

Authors: Jyun-Ze Tang, Chih-Fan Hsu, Jeng-Lin Li, Ming-Ching Chang, Wei-Chao Chen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16154
Pdf URL: https://arxiv.org/pdf/2507.16154
Copy Paste: [[2507.16154]] LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation(https://arxiv.org/abs/2507.16154)
Keywords: diffusion
Abstract: Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose {\bf Latent Space Scaling Generation (LSSGen)}, a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating $1024^2$ images at similar speeds, it achieves up to 246\% TOPIQ score improvement.

Title: EBaReT: Expert-guided Bag Reward Transformer for Auto Bidding

Authors: Kaiyuan Li, Pengyu Wang, Yunshan Peng, Pengjia Yuan, Yanxiang Zeng, Rui Xiang, Yanhua Cheng, Xialong Liu, Peng Jiang
Subjects: cs.LG, cs.IR
Abstract URL: https://arxiv.org/abs/2507.16186
Pdf URL: https://arxiv.org/pdf/2507.16186
Copy Paste: [[2507.16186]] EBaReT: Expert-guided Bag Reward Transformer for Auto Bidding(https://arxiv.org/abs/2507.16186)
Keywords: generative
Abstract: Reinforcement learning has been widely applied in automated bidding. Traditional approaches model bidding as a Markov Decision Process (MDP). Recently, some studies have explored using generative reinforcement learning methods to address long-term dependency issues in bidding environments. Although effective, these methods typically rely on supervised learning approaches, which are vulnerable to low data quality due to the amount of sub-optimal bids and low probability rewards resulting from the low click and conversion rates. Unfortunately, few studies have addressed these challenges. In this paper, we formalize the automated bidding as a sequence decision-making problem and propose a novel Expert-guided Bag Reward Transformer (EBaReT) to address concerns related to data quality and uncertainty rewards. Specifically, to tackle data quality issues, we generate a set of expert trajectories to serve as supplementary data in the training process and employ a Positive-Unlabeled (PU) learning-based discriminator to identify expert transitions. To ensure the decision also meets the expert level, we further design a novel expert-guided inference strategy. Moreover, to mitigate the uncertainty of rewards, we consider the transitions within a certain period as a "bag" and carefully design a reward function that leads to a smoother acquisition of rewards. Extensive experiments demonstrate that our model achieves superior performance compared to state-of-the-art bidding methods.

Title: METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark

Authors: Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16206
Pdf URL: https://arxiv.org/pdf/2507.16206
Copy Paste: [[2507.16206]] METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark(https://arxiv.org/abs/2507.16206)
Keywords: generative
Abstract: With the rapid advancement of generative AI, synthetic content across images, videos, and audio has become increasingly realistic, amplifying the risk of misinformation. Existing detection approaches predominantly focus on binary classification while lacking detailed and interpretable explanations of forgeries, which limits their applicability in safety-critical scenarios. Moreover, current methods often treat each modality separately, without a unified benchmark for cross-modal forgery detection and interpretation. To address these challenges, we introduce METER, a unified, multi-modal benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations, including spatio-temporal localization, textual rationales, and forgery type tracing. Compared to prior benchmarks, METER offers broader modality coverage and richer interpretability metrics such as spatial/temporal IoU, multi-class tracing, and evidence consistency. We further propose a human-aligned, three-stage Chain-of-Thought (CoT) training strategy combining SFT, DPO, and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning. We hope METER will serve as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media.

Title: Advancing Visual Large Language Model for Multi-granular Versatile Perception

Authors: Wentao Xiang, Haoxian Tan, Cong Wei, Yujie Zhong, Dengjie Li, Yujiu Yang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16213
Pdf URL: https://arxiv.org/pdf/2507.16213
Copy Paste: [[2507.16213]] Advancing Visual Large Language Model for Multi-granular Versatile Perception(https://arxiv.org/abs/2507.16213)
Keywords: generative
Abstract: Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing researches often focus solely on a limited subset of these potential combinations, which constrains their applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at this https URL.

Title: Towards Compute-Optimal Many-Shot In-Context Learning

Authors: Shahriar Golchin, Yanfei Chen, Rujun Han, Manan Gandhi, Tianli Yu, Swaroop Mishra, Mihai Surdeanu, Rishabh Agarwal, Chen-Yu Lee, Tomas Pfister
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16217
Pdf URL: https://arxiv.org/pdf/2507.16217
Copy Paste: [[2507.16217]] Towards Compute-Optimal Many-Shot In-Context Learning(https://arxiv.org/abs/2507.16217)
Keywords: in-context
Abstract: Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.

Title: Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective

Authors: Seunghyeon Kim, Kyeongryeol Go
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16254
Pdf URL: https://arxiv.org/pdf/2507.16254
Copy Paste: [[2507.16254]] Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective(https://arxiv.org/abs/2507.16254)
Keywords: generative
Abstract: Fisheye cameras introduce significant distortion and pose unique challenges to object detection models trained on conventional datasets. In this work, we propose a data-centric pipeline that systematically improves detection performance by focusing on the key question of identifying the blind spots of the model. Through detailed error analysis, we identify critical edge-cases such as confusing class pairs, peripheral distortions, and underrepresented contexts. Then we directly address them through edge-case synthesis. We fine-tuned an image generative model and guided it with carefully crafted prompts to produce images that replicate real-world failure modes. These synthetic images are pseudo-labeled using a high-quality detector and integrated into training. Our approach results in consistent performance gains, highlighting how deeply understanding data and selectively fixing its weaknesses can be impactful in specialized domains like fisheye object detection.

Title: Dens3R: A Foundation Model for 3D Geometry Prediction

Authors: Xianze Fang, Jingnan Gao, Zhe Wang, Zhuo Chen, Xingyu Ren, Jiangjing Lyu, Qiaomu Ren, Zhonglei Yang, Xiaokang Yang, Yichao Yan, Chengfei Lyu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16290
Pdf URL: https://arxiv.org/pdf/2507.16290
Copy Paste: [[2507.16290]] Dens3R: A Foundation Model for 3D Geometry Prediction(https://arxiv.org/abs/2507.16290)
Keywords: foundation model
Abstract: Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.

Title: Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning

Authors: Boheng Li, Renjie Gu, Junjie Wang, Leyi Qi, Yiming Li, Run Wang, Zhan Qin, Tianwei Zhang
Subjects: cs.LG, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2507.16302
Pdf URL: https://arxiv.org/pdf/2507.16302
Copy Paste: [[2507.16302]] Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning(https://arxiv.org/abs/2507.16302)
Keywords: diffusion
Abstract: Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are identified to be fragile to downstream fine-tuning, where we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability well.

Title: M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision

Authors: Kailai Zhou, Fuqiang Yang, Shixian Wang, Bihan Wen, Chongde Zi, Linsen Chen, Qiu Shen, Xun Cao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16318
Pdf URL: https://arxiv.org/pdf/2507.16318
Copy Paste: [[2507.16318]] M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision(https://arxiv.org/abs/2507.16318)
Keywords: self-supervised, foundation model
Abstract: RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at this https URL.

Title: DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling

Authors: Boheng Li, Junjie Wang, Yiming Li, Zhiyang Hu, Leyi Qi, Jianshuo Dong, Run Wang, Han Qiu, Zhan Qin, Tianwei Zhang
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.16329
Pdf URL: https://arxiv.org/pdf/2507.16329
Copy Paste: [[2507.16329]] DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling(https://arxiv.org/abs/2507.16329)
Keywords: generative
Abstract: Despite the integration of safety alignment and external filters, text-to-image (T2I) generative models are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system (including the core generative model as well as potential external safety filters and other processing components), is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. Yet, existing automated red teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, in this paper, we propose DREAM, a scalable red teaming framework to automatically uncover diverse problematic prompts from a given T2I system. Unlike most prior works that optimize prompts individually, DREAM directly models the probabilistic distribution of the target system's problematic prompts, which enables explicit optimization over both effectiveness and diversity, and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the objective into simple and tractable objectives. We further introduce GC-SPSA, an efficient optimization algorithm that provide stable gradient estimates through the long and potentially non-differentiable T2I pipeline. The effectiveness of DREAM is validated through extensive experiments, demonstrating that it surpasses 9 state-of-the-art baselines by a notable margin across a broad range of T2I models and safety filters in terms of prompt success rate and diversity.

Title: Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Authors: Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Zhuangzhuang He, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.16331
Pdf URL: https://arxiv.org/pdf/2507.16331
Copy Paste: [[2507.16331]] Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny(https://arxiv.org/abs/2507.16331)
Keywords: generative
Abstract: Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.

Title: One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution

Authors: Xinyu Mao, Xiaohan Xing, Fei Meng, Jianbang Liu, Fan Bai, Qiang Nie, Max Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16337
Pdf URL: https://arxiv.org/pdf/2507.16337
Copy Paste: [[2507.16337]] One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution(https://arxiv.org/abs/2507.16337)
Keywords: foundation model
Abstract: Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM's prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations as well as filter out noisy transfers. Instead of dumping all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM's effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%.

Title: Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model

Authors: Mingtao Guo, Guanyu Xing, Yanci Zhang, Yanli Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16341
Pdf URL: https://arxiv.org/pdf/2507.16341
Copy Paste: [[2507.16341]] Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model(https://arxiv.org/abs/2507.16341)
Keywords: diffusion
Abstract: Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image while preserving the source identity. Although existing methods based on either implicit or explicit keypoints have shown promise, they struggle with large pose variations due to warping artifacts or the limitations of coarse facial landmarks. In this paper, we present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes. Our method first employs a motion extractor to extract implicit facial keypoints from the source and driving images to represent fine-grained motion and to perform motion alignment through a warping module. To address the degradation introduced by warping, we introduce a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model. This latent space encodes rich priors of facial dynamics learned from large-scale video data, enabling effective warping correction and enhancing temporal coherence. Extensive experiments show that FRVD achieves superior performance over existing methods in terms of pose accuracy, identity preservation, and visual quality, especially in challenging scenarios with extreme pose variations.

Title: Are Foundation Models All You Need for Zero-shot Face Presentation Attack Detection?

Authors: Lazaro Janier Gonzalez-Sole, Juan E. Tapia, Christoph Busch
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16393
Pdf URL: https://arxiv.org/pdf/2507.16393
Copy Paste: [[2507.16393]] Are Foundation Models All You Need for Zero-shot Face Presentation Attack Detection?(https://arxiv.org/abs/2507.16393)
Keywords: foundation model
Abstract: Although face recognition systems have undergone an impressive evolution in the last decade, these technologies are vulnerable to attack presentations (AP). These attacks are mostly easy to create and, by executing them against the system's capture device, the malicious actor can impersonate an authorised subject and thus gain access to the latter's information (e.g., financial transactions). To protect facial recognition schemes against presentation attacks, state-of-the-art deep learning presentation attack detection (PAD) approaches require a large amount of data to produce reliable detection performances and even then, they decrease their performance for unknown presentation attack instruments (PAI) or database (information not seen during training), i.e. they lack generalisability. To mitigate the above problems, this paper focuses on zero-shot PAD. To do so, we first assess the effectiveness and generalisability of foundation models in established and challenging experimental scenarios and then propose a simple but effective framework for zero-shot PAD. Experimental results show that these models are able to achieve performance in difficult scenarios with minimal effort of the more advanced PAD mechanisms, whose weights were optimised mainly with training sets that included APs and bona fide presentations. The top-performing foundation model outperforms by a margin the best from the state of the art observed with the leaving-one-out protocol on the SiW-Mv2 database, which contains challenging unknown 2D and 3D attacks

Title: Sparse-View 3D Reconstruction: Recent Advances and Open Challenges

Authors: Tanveer Younis, Zhanglin Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16406
Pdf URL: https://arxiv.org/pdf/2507.16406
Copy Paste: [[2507.16406]] Sparse-View 3D Reconstruction: Recent Advances and Open Challenges(https://arxiv.org/abs/2507.16406)
Keywords: diffusion, foundation model, generative
Abstract: Sparse-view 3D reconstruction is essential for applications in which dense image acquisition is impractical, such as robotics, augmented/virtual reality (AR/VR), and autonomous systems. In these settings, minimal image overlap prevents reliable correspondence matching, causing traditional methods, such as structure-from-motion (SfM) and multiview stereo (MVS), to fail. This survey reviews the latest advances in neural implicit models (e.g., NeRF and its regularized versions), explicit point-cloud-based approaches (e.g., 3D Gaussian Splatting), and hybrid frameworks that leverage priors from diffusion and vision foundation models (VFMs).We analyze how geometric regularization, explicit shape modeling, and generative inference are used to mitigate artifacts such as floaters and pose ambiguities in sparse-view settings. Comparative results on standard benchmarks reveal key trade-offs between the reconstruction accuracy, efficiency, and generalization. Unlike previous reviews, our survey provides a unified perspective on geometry-based, neural implicit, and generative (diffusion-based) methods. We highlight the persistent challenges in domain generalization and pose-free reconstruction and outline future directions for developing 3D-native generative priors and achieving real-time, unconstrained sparse-view reconstruction.

Title: Robust Noisy Pseudo-label Learning for Semi-supervised Medical Image Segmentation Using Diffusion Model

Authors: Lin Xi, Yingliang Ma, Cheng Wang, Sandra Howell, Aldo Rinaldi, Kawal S. Rhode
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16429
Pdf URL: https://arxiv.org/pdf/2507.16429
Copy Paste: [[2507.16429]] Robust Noisy Pseudo-label Learning for Semi-supervised Medical Image Segmentation Using Diffusion Model(https://arxiv.org/abs/2507.16429)
Keywords: diffusion
Abstract: Obtaining pixel-level annotations in the medical domain is both expensive and time-consuming, often requiring close collaboration between clinical experts and developers. Semi-supervised medical image segmentation aims to leverage limited annotated data alongside abundant unlabeled data to achieve accurate segmentation. However, existing semi-supervised methods often struggle to structure semantic distributions in the latent space due to noise introduced by pseudo-labels. In this paper, we propose a novel diffusion-based framework for semi-supervised medical image segmentation. Our method introduces a constraint into the latent structure of semantic labels during the denoising diffusion process by enforcing prototype-based contrastive consistency. Rather than explicitly delineating semantic boundaries, the model leverages class prototypes centralized semantic representations in the latent space as anchors. This strategy improves the robustness of dense predictions, particularly in the presence of noisy pseudo-labels. We also introduce a new publicly available benchmark: Multi-Object Segmentation in X-ray Angiography Videos (MOSXAV), which provides detailed, manually annotated segmentation ground truth for multiple anatomical structures in X-ray angiography videos. Extensive experiments on the EndoScapes2023 and MOSXAV datasets demonstrate that our method outperforms state-of-the-art medical image segmentation approaches under the semi-supervised learning setting. This work presents a robust and data-efficient diffusion model that offers enhanced flexibility and strong potential for a wide range of clinical applications.

Title: VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Authors: Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16443
Pdf URL: https://arxiv.org/pdf/2507.16443
Copy Paste: [[2507.16443]] VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences(https://arxiv.org/abs/2507.16443)
Keywords: foundation model
Abstract: Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene in real-world settings, especially for autonomous driving scenarios. Code is available at this https URL.

Title: The Ever-Evolving Science Exam

Authors: Junying Wang, Zicheng Zhang, Yijin Guo, Farong Wen, Ye Shen, Yingji Liang, Yalun Wu, Wenzhe Li, Chunyi Li, Zijian Chen, Qi Jia, Guangtao Zhai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16514
Pdf URL: https://arxiv.org/pdf/2507.16514
Copy Paste: [[2507.16514]] The Ever-Evolving Science Exam(https://arxiv.org/abs/2507.16514)
Keywords: foundation model
Abstract: As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad **Range**, wide **Reach**, and high **Rigor**, yet they often face two major challenges: **data leakage risks** that compromise benchmarking validity, and **evaluation inefficiency** due to large-scale testing. To address these issues, we introduce the **Ever-Evolving Science Exam (EESE)**, a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public **EESE-Pool** with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring **Range**, **Reach**, and **Rigor**, 2) a periodically updated 500-instance subset **EESE**, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: this https URL.

Title: EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion

Authors: Shang Liu, Chenjie Cao, Chaohui Yu, Wen Qian, Jing Wang, Fan Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16535
Pdf URL: https://arxiv.org/pdf/2507.16535
Copy Paste: [[2507.16535]] EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion(https://arxiv.org/abs/2507.16535)
Keywords: diffusion
Abstract: Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth's surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.

Title: Scaling Linear Attention with Sparse State Expansion

Authors: Yuqi Pan, Yongqi An, Zheng Li, Yuhong Chou, Ruijie Zhu, Xiaohui Wang, Mingxuan Wang, Jinqiao Wang, Guoqi Li
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.16577
Pdf URL: https://arxiv.org/pdf/2507.16577
Copy Paste: [[2507.16577]] Scaling Linear Attention with Sparse State Expansion(https://arxiv.org/abs/2507.16577)
Keywords: in-context
Abstract: The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.

Title: Dyna3DGR: 4D Cardiac Motion Tracking with Dynamic 3D Gaussian Representation

Authors: Xueming Fu, Pei Wu, Yingtai Li, Xin Luo, Zihang Jiang, Junhao Mei, Jian Lu, Gao-Jun Teng, S. Kevin Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16608
Pdf URL: https://arxiv.org/pdf/2507.16608
Copy Paste: [[2507.16608]] Dyna3DGR: 4D Cardiac Motion Tracking with Dynamic 3D Gaussian Representation(https://arxiv.org/abs/2507.16608)
Keywords: self-supervised
Abstract: Accurate analysis of cardiac motion is crucial for evaluating cardiac function. While dynamic cardiac magnetic resonance imaging (CMR) can capture detailed tissue motion throughout the cardiac cycle, the fine-grained 4D cardiac motion tracking remains challenging due to the homogeneous nature of myocardial tissue and the lack of distinctive features. Existing approaches can be broadly categorized into image based and representation-based, each with its limitations. Image-based methods, including both raditional and deep learning-based registration approaches, either struggle with topological consistency or rely heavily on extensive training data. Representation-based methods, while promising, often suffer from loss of image-level details. To address these limitations, we propose Dynamic 3D Gaussian Representation (Dyna3DGR), a novel framework that combines explicit 3D Gaussian representation with implicit neural motion field modeling. Our method simultaneously optimizes cardiac structure and motion in a self-supervised manner, eliminating the need for extensive training data or point-to-point correspondences. Through differentiable volumetric rendering, Dyna3DGR efficiently bridges continuous motion representation with image-space alignment while preserving both topological and temporal consistency. Comprehensive evaluations on the ACDC dataset demonstrate that our approach surpasses state-of-the-art deep learning-based diffeomorphic registration methods in tracking accuracy. The code will be available in this https URL.

Title: CTSL: Codebook-based Temporal-Spatial Learning for Accurate Non-Contrast Cardiac Risk Prediction Using Cine MRIs

Authors: Haoyang Su, Shaohao Rui, Jinyi Xiang, Lianming Wu, Xiaosong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16612
Pdf URL: https://arxiv.org/pdf/2507.16612
Copy Paste: [[2507.16612]] CTSL: Codebook-based Temporal-Spatial Learning for Accurate Non-Contrast Cardiac Risk Prediction Using Cine MRIs(https://arxiv.org/abs/2507.16612)
Keywords: self-supervised
Abstract: Accurate and contrast-free Major Adverse Cardiac Events (MACE) prediction from Cine MRI sequences remains a critical challenge. Existing methods typically necessitate supervised learning based on human-refined masks in the ventricular myocardium, which become impractical without contrast agents. We introduce a self-supervised framework, namely Codebook-based Temporal-Spatial Learning (CTSL), that learns dynamic, spatiotemporal representations from raw Cine data without requiring segmentation masks. CTSL decouples temporal and spatial features through a multi-view distillation strategy, where the teacher model processes multiple Cine views, and the student model learns from reduced-dimensional Cine-SA sequences. By leveraging codebook-based feature representations and dynamic lesion self-detection through motion cues, CTSL captures intricate temporal dependencies and motion patterns. High-confidence MACE risk predictions are achieved through our model, providing a rapid, non-invasive solution for cardiac risk assessment that outperforms traditional contrast-dependent methods, thereby enabling timely and accessible heart disease diagnosis in clinical settings.

Title: Meta-Learning for Cold-Start Personalization in Prompt-Tuned LLMs

Authors: Yushang Zhao, Huijie Shen, Dannier Li, Lu Chang, Chengrui Zhou, Yinuo Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.16672
Pdf URL: https://arxiv.org/pdf/2507.16672
Copy Paste: [[2507.16672]] Meta-Learning for Cold-Start Personalization in Prompt-Tuned LLMs(https://arxiv.org/abs/2507.16672)
Keywords: generative
Abstract: Generative, explainable, and flexible recommender systems, derived using Large Language Models (LLM) are promising and poorly adapted to the cold-start user situation, where there is little to no history of interaction. The current solutions i.e. supervised fine-tuning and collaborative filtering are dense-user-item focused and would be expensive to maintain and update. This paper introduces a meta-learning framework, that can be used to perform parameter-efficient prompt-tuning, to effectively personalize LLM-based recommender systems quickly at cold-start. The model learns soft prompt embeddings with first-order (Reptile) and second-order (MAML) optimization by treating each of the users as the tasks. As augmentations to the input tokens, these learnable vectors are the differentiable control variables that represent user behavioral priors. The prompts are meta-optimized through episodic sampling, inner-loop adaptation, and outer-loop generalization. On MovieLens-1M, Amazon Reviews, and Recbole, we can see that our adaptive model outperforms strong baselines in NDCG@10, HR@10, and MRR, and it runs in real-time (i.e., below 300 ms) on consumer GPUs. Zero-history personalization is also supported by this scalable solution, and its 275 ms rate of adaptation allows successful real-time risk profiling of financial systems by shortening detection latency and improving payment network stability. Crucially, the 275 ms adaptation capability can enable real-time risk profiling for financial institutions, reducing systemic vulnerability detection latency significantly versus traditional compliance checks. By preventing contagion in payment networks (e.g., Fedwire), the framework strengthens national financial infrastructure resilience.

Title: PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

Authors: Han Jiang, Dongyao Zhu, Zhihua Wei, Xiaoyuan Yi, Ziang Xiao, Xing Xie
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2507.16679
Pdf URL: https://arxiv.org/pdf/2507.16679
Copy Paste: [[2507.16679]] PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization(https://arxiv.org/abs/2507.16679)
Keywords: in-context
Abstract: In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs' understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.

Title: FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

Authors: Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu
Subjects: cs.LG, cs.AI, cs.MM, cs.SD
Abstract URL: https://arxiv.org/abs/2507.16696
Pdf URL: https://arxiv.org/pdf/2507.16696
Copy Paste: [[2507.16696]] FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation(https://arxiv.org/abs/2507.16696)
Keywords: foundation model
Abstract: With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain up to 5.03%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future works. FISHER is now open-sourced on this https URL

Title: Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation

Authors: Yiguo He, Junjie Zhu, Yiying Li, Xiaoyu Zhang, Chunping Qiu, Jun Wang, Qiangjuan Huang, Ke Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16716
Pdf URL: https://arxiv.org/pdf/2507.16716
Copy Paste: [[2507.16716]] Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation(https://arxiv.org/abs/2507.16716)
Keywords: foundation model, generative
Abstract: The application of Vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained their VLFMs. However, due to the rudimentary methods used for generating captions, the quality of datasets is suboptimal, requiring larger volumes of training data, while only yielding modest performance improvements. In this paper, we propose a two-stage method named MpGI(Multi-Perspective Generation and Integration) for generating high-quality text captions for RS images. Firstly, we generate distinct and detailed descriptions from different perspectives using Rule-MLLM(Multimodal Large Language Model) Relay Generation and MLLMs generation methods. Next, we utilize Large Language Models (LLMs) to integrate these diverse descriptions into comprehensive captions, capturing details from multiple perspectives. Finally, we have created the HQRS-IT-210K dataset, including about 210,000 RS images and 1.3 million captions. We fine-tuned two VLFMs using our dataset: CLIP, a discriminative model, and CoCa, an image-to-text generative model. This process resulted in our proposed HQRS-CLIP and RS-CoCa models. Experimental results demonstrate that HQRS-CLIP surpassed the previous SOTA RS CLIP model in various downstream tasks while using only 4.2\% of the training data. RS-CoCa outperforms other advanced approaches across benchmark datasets and can generate captions for RS images that rival or even exceed manual annotations. Dataset, pre-trained models, and codes will be released at this https URL.

Title: HarmonPaint: Harmonized Training-Free Diffusion Inpainting

Authors: Ying Li, Xinzhe Li, Yong Du, Yangyang Xu, Junyu Dong, Shengfeng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16732
Pdf URL: https://arxiv.org/pdf/2507.16732
Copy Paste: [[2507.16732]] HarmonPaint: Harmonized Training-Free Diffusion Inpainting(https://arxiv.org/abs/2507.16732)
Keywords: diffusion
Abstract: Existing inpainting methods often require extensive retraining or fine-tuning to integrate new content seamlessly, yet they struggle to maintain coherence in both structure and style between inpainted regions and the surrounding background. Motivated by these limitations, we introduce HarmonPaint, a training-free inpainting framework that seamlessly integrates with the attention mechanisms of diffusion models to achieve high-quality, harmonized image inpainting without any form of training. By leveraging masking strategies within self-attention, HarmonPaint ensures structural fidelity without model retraining or fine-tuning. Additionally, we exploit intrinsic diffusion model properties to transfer style information from unmasked to masked regions, achieving a harmonious integration of styles. Extensive experiments demonstrate the effectiveness of HarmonPaint across diverse scenes and styles, validating its versatility and performance.

Title: CMP: A Composable Meta Prompt for SAM-Based Cross-Domain Few-Shot Segmentation

Authors: Shuai Chen, Fanman Meng, Chunjin Yang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.16753
Pdf URL: https://arxiv.org/pdf/2507.16753
Copy Paste: [[2507.16753]] CMP: A Composable Meta Prompt for SAM-Based Cross-Domain Few-Shot Segmentation(https://arxiv.org/abs/2507.16753)
Keywords: foundation model
Abstract: Cross-Domain Few-Shot Segmentation (CD-FSS) remains challenging due to limited data and domain shifts. Recent foundation models like the Segment Anything Model (SAM) have shown remarkable zero-shot generalization capability in general segmentation tasks, making it a promising solution for few-shot scenarios. However, adapting SAM to CD-FSS faces two critical challenges: reliance on manual prompt and limited cross-domain ability. Therefore, we propose the Composable Meta-Prompt (CMP) framework that introduces three key modules: (i) the Reference Complement and Transformation (RCT) module for semantic expansion, (ii) the Composable Meta-Prompt Generation (CMPG) module for automated meta-prompt synthesis, and (iii) the Frequency-Aware Interaction (FAI) module for domain discrepancy mitigation. Evaluations across four cross-domain datasets demonstrate CMP's state-of-the-art performance, achieving 71.8\% and 74.5\% mIoU in 1-shot and 5-shot scenarios respectively.

Title: Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning

Authors: Yanjun Zheng, Xiyang Du, Longfei Liao, Xiaoke Zhao, Zhaowen Zhou, Bo Zhang, Jiawei Liu, Xiang Qi, Zhe Li, Zhiqiang Zhang, Wang Wei, Peng Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.16802
Pdf URL: https://arxiv.org/pdf/2507.16802
Copy Paste: [[2507.16802]] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning(https://arxiv.org/abs/2507.16802)
Keywords: foundation model
Abstract: Large Language Models (LLMs) demonstrate tremendous potential in the financial domain, yet existing models often fall short in scenarios demanding robust reasoning capabilities, stringent trustworthiness requirements, and efficient adaptation to task-specific needs. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task taxonomy with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, tow-stage learning processes, and detailed attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including FinEva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications.