2025-07-08

Title: Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions

Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das, Tapas Samanta
Subjects: cs.CV, cs.AI, cs.GR, cs.HC, cs.MM
Abstract URL: https://arxiv.org/abs/2507.02900
Pdf URL: https://arxiv.org/pdf/2507.02900
Copy Paste: [[2507.02900]] Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions(https://arxiv.org/abs/2507.02900)
Keywords: diffusion
Abstract: Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D--based, 3D--based, Neural Radiance Fields (NeRF)--based, diffusion--based, parameter-driven techniques and many other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre--trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre--trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: this https URL.

Title: Controllable diffusion-based generation for multi-channel biological data

Authors: Haoran Zhang, Mingyuan Zhou, Wesley Tansey
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2507.02902
Pdf URL: https://arxiv.org/pdf/2507.02902
Copy Paste: [[2507.02902]] Controllable diffusion-based generation for multi-channel biological data(https://arxiv.org/abs/2507.02902)
Keywords: diffusion, generative
Abstract: Spatial profiling technologies in biology, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate high-dimensional, multi-channel data with strong spatial alignment and complex inter-channel relationships. Generative modeling of such data requires jointly capturing intra- and inter-channel structure, while also generalizing across arbitrary combinations of observed and missing channels for practical application. Existing diffusion-based models generally assume low-dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that break spatial correspondence and ignore inter-channel dependencies. This work proposes a unified diffusion framework for controllable generation over structured and spatial biological data. Our model contains two key innovations: (1) a hierarchical feature injection mechanism that enables multi-resolution conditioning on spatially aligned channels, and (2) a combination of latent-space and output-space channel-wise attention to capture inter-channel relationships. To support flexible conditioning and generalization to arbitrary subsets of observed channels, we train the model using a random masking strategy, enabling it to reconstruct missing channels from any combination of inputs. We demonstrate state-of-the-art performance across both spatial and non-spatial prediction tasks, including protein imputation in IMC and gene-to-protein prediction in single-cell datasets, and show strong generalization to unseen conditional configurations.

Title: Hyperbolic Kernel Graph Neural Networks for Neurocognitive Decline Analysis from Multimodal Brain Imaging

Authors: Meimei Yang, Yongheng Sun, Qianqian Wang, Andrea Bozoki, Maureen Kohi, Mingxia Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02908
Pdf URL: https://arxiv.org/pdf/2507.02908
Copy Paste: [[2507.02908]] Hyperbolic Kernel Graph Neural Networks for Neurocognitive Decline Analysis from Multimodal Brain Imaging(https://arxiv.org/abs/2507.02908)
Keywords: diffusion
Abstract: Multimodal neuroimages, such as diffusion tensor imaging (DTI) and resting-state functional MRI (fMRI), offer complementary perspectives on brain activities by capturing structural or functional interactions among brain regions. While existing studies suggest that fusing these multimodal data helps detect abnormal brain activity caused by neurocognitive decline, they are generally implemented in Euclidean space and can't effectively capture intrinsic hierarchical organization of structural/functional brain networks. This paper presents a hyperbolic kernel graph fusion (HKGF) framework for neurocognitive decline analysis with multimodal neuroimages. It consists of a multimodal graph construction module, a graph representation learning module that encodes brain graphs in hyperbolic space through a family of hyperbolic kernel graph neural networks (HKGNNs), a cross-modality coupling module that enables effective multimodal data fusion, and a hyperbolic neural network for downstream predictions. Notably, HKGNNs represent graphs in hyperbolic space to capture both local and global dependencies among brain regions while preserving the hierarchical structure of brain networks. Extensive experiments involving over 4,000 subjects with DTI and/or fMRI data suggest the superiority of HKGF over state-of-the-art methods in two neurocognitive decline prediction tasks. HKGF is a general framework for multimodal data analysis, facilitating objective quantification of structural/functional brain connectivity changes associated with neurocognitive decline.

Title: DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective

Authors: Hyung Gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe, Ahmed Hussen Abdelaziz
Subjects: cs.LG, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2507.02911
Pdf URL: https://arxiv.org/pdf/2507.02911
Copy Paste: [[2507.02911]] DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective(https://arxiv.org/abs/2507.02911)
Keywords: self-supervised, foundation model
Abstract: We introduce DiceHuBERT, a knowledge distillation framework for compressing HuBERT, a widely used self-supervised learning (SSL)-based speech foundation model. Unlike existing distillation methods that rely on layer-wise and feature-wise mapping between teacher and student models, DiceHuBERT leverages HuBERT's iterative self-distillation mechanism by directly replacing the original model with a student model. This replacement allows the student to be trained using the same SSL objective used when pre-training HuBERT, eliminating the need for additional modules or architectural constraints. Experimental results on SUPERB show that DiceHuBERT consistently outperforms existing distillation methods, improving phoneme recognition performance by over 21% and ASR performance by more than 14%. Furthermore, DiceHuBERT demonstrates competitive performance across multiple tasks, highlighting its clear advantage.

Title: PlaceFM: A Training-free Geospatial Foundation Model of Places

Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02921
Pdf URL: https://arxiv.org/pdf/2507.02921
Copy Paste: [[2507.02921]] PlaceFM: A Training-free Geospatial Foundation Model of Places(https://arxiv.org/abs/2507.02921)
Keywords: foundation model
Abstract: Spatial structure is central to effective geospatial intelligence systems. While foundation models have shown promise, they often lack the flexibility to reason about places, which are context-rich regions spanning different spatial granularities. We propose PlaceFM, a spatial foundation model that captures place representations using a training-free graph condensation method. PlaceFM condenses a nationwide POI graph built from integrated Foursquare and OpenStreetMap data in the U.S., generating general-purpose embeddings of places. These embeddings can be seamlessly integrated into geolocation data pipelines to support a wide range of downstream tasks. Without requiring pretraining, PlaceFM offers a scalable and adaptable solution for multi-scale geospatial analysis.

Title: OBSER: Object-Based Sub-Environment Recognition for Zero-Shot Environmental Inference

Authors: Won-Seok Choi, Dong-Sig Han, Suhyung Choi, Hyeonseo Yang, Byoung-Tak Zhang
Subjects: cs.CV, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.02929
Pdf URL: https://arxiv.org/pdf/2507.02929
Copy Paste: [[2507.02929]] OBSER: Object-Based Sub-Environment Recognition for Zero-Shot Environmental Inference(https://arxiv.org/abs/2507.02929)
Keywords: self-supervised
Abstract: We present the Object-Based Sub-Environment Recognition (OBSER) framework, a novel Bayesian framework that infers three fundamental relationships between sub-environments and their constituent objects. In the OBSER framework, metric and self-supervised learning models estimate the object distributions of sub-environments on the latent space to compute these measures. Both theoretically and empirically, we validate the proposed framework by introducing the ($\epsilon,\delta$) statistically separable (EDS) function which indicates the alignment of the representation. Our framework reliably performs inference in open-world and photorealistic environments and outperforms scene-based methods in chained retrieval tasks. The OBSER framework enables zero-shot recognition of environments to achieve autonomous environment understanding.

Title: GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation

Authors: Yi-Chun Chen, Arnav Jhala
Subjects: cs.CV, cs.AI, cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2507.02941
Pdf URL: https://arxiv.org/pdf/2507.02941
Copy Paste: [[2507.02941]] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation(https://arxiv.org/abs/2507.02941)
Keywords: generative
Abstract: GameTileNet is a dataset designed to provide semantic labels for low-resolution digital game art, advancing procedural content generation (PCG) and related AI research as a vision-language alignment task. Large Language Models (LLMs) and image-generative AI models have enabled indie developers to create visual assets, such as sprites, for game interactions. However, generating visuals that align with game narratives remains challenging due to inconsistent AI outputs, requiring manual adjustments by human artists. The diversity of visual representations in automatically generated game content is also limited because of the imbalance in distributions across styles for training data. GameTileNet addresses this by collecting artist-created game tiles from this http URL under Creative Commons licenses and providing semantic annotations to support narrative-driven content generation. The dataset introduces a pipeline for object detection in low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images. TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through visual-language alignment.

Title: Concept-based Adversarial Attack: a Probabilistic Perspective

Authors: Andi Zhang, Xuan Ding, Steven McDonagh, Samuel Kaski
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02965
Pdf URL: https://arxiv.org/pdf/2507.02965
Copy Paste: [[2507.02965]] Concept-based Adversarial Attack: a Probabilistic Perspective(https://arxiv.org/abs/2507.02965)
Keywords: generative
Abstract: We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept -- represented by a probabilistic generative model or a set of images -- to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.

Title: Leveraging the Structure of Medical Data for Improved Representation Learning

Authors: Andrea Agostini, Sonia Laguna, Alain Ryser, Samuel Ruiperez-Campillo, Moritz Vandenhirtz, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02987
Pdf URL: https://arxiv.org/pdf/2507.02987
Copy Paste: [[2507.02987]] Leveraging the Structure of Medical Data for Improved Representation Learning(https://arxiv.org/abs/2507.02987)
Keywords: self-supervised
Abstract: Building generalizable medical AI systems requires pretraining strategies that are data-efficient and domain-aware. Unlike internet-scale corpora, clinical datasets such as MIMIC-CXR offer limited image counts and scarce annotations, but exhibit rich internal structure through multi-view imaging. We propose a self-supervised framework that leverages the inherent structure of medical datasets. Specifically, we treat paired chest X-rays (i.e., frontal and lateral views) as natural positive pairs, learning to reconstruct each view from sparse patches while aligning their latent embeddings. Our method requires no textual supervision and produces informative representations. Evaluated on MIMIC-CXR, we show strong performance compared to supervised objectives and baselines being trained without leveraging structure. This work provides a lightweight, modality-agnostic blueprint for domain-specific pretraining where data is structured but scarce

Title: FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images

Authors: Guang Yang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2507.02995
Pdf URL: https://arxiv.org/pdf/2507.02995
Copy Paste: [[2507.02995]] FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images(https://arxiv.org/abs/2507.02995)
Keywords: diffusion
Abstract: The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8\% accuracy, outperforming state-of-the-art baselines by 5.2\%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1--0.4 normalised frequency range, providing theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.

Title: PDFMathTranslate: Scientific Document Translation Preserving Layouts

Authors: Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03009
Pdf URL: https://arxiv.org/pdf/2507.03009
Copy Paste: [[2507.03009]] PDFMathTranslate: Scientific Document Translation Preserving Layouts(https://arxiv.org/abs/2507.03009)
Keywords: diffusion
Abstract: Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world's first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at this https URL with more than 22k downloads.

Title: Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH

Authors: Felix Friedrich, Thiemo Ganesha Welsch, Patrick Schramowski, Kristian Kersting
Subjects: cs.CL, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03015
Pdf URL: https://arxiv.org/pdf/2507.03015
Copy Paste: [[2507.03015]] Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH(https://arxiv.org/abs/2507.03015)
Keywords: diffusion
Abstract: Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.

Title: Rethinking Data Protection in the (Generative) Artificial Intelligence Era

Authors: Yiming Li, Shuo Shao, Yu He, Junfeng Guo, Tianwei Zhang, Zhan Qin, Pin-Yu Chen, Michael Backes, Philip Torr, Dacheng Tao, Kui Ren
Subjects: cs.LG, cs.AI, cs.CR, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2507.03034
Pdf URL: https://arxiv.org/pdf/2507.03034
Copy Paste: [[2507.03034]] Rethinking Data Protection in the (Generative) Artificial Intelligence Era(https://arxiv.org/abs/2507.03034)
Keywords: generative
Abstract: The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual, underscoring the urgent need to clearly delineate the scope of and rigorously enforce data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.

Title: Intelligent Histology for Tumor Neurosurgery

Authors: Xinhai Hou, Akhil Kondepudi, Cheng Jiang, Yiwei Lyu, Samir Harake, Asadur Chowdury, Anna-Katharina Meißner, Volker Neuschmelting, David Reinecke, Gina Furtjes, Georg Widhalm, Lisa Irina Koerner, Jakob Straehle, Nicolas Neidert, Pierre Scheffler, Juergen Beck, Michael Ivan, Ashish Shah, Aditya Pandey, Sandra Camelo-Piragua, Dieter Henrik Heiland, Oliver Schnell, Chris Freudiger, Jacob Young, Melike Pekmezci, Katie Scotford, Shawn Hervey-Jumper, Daniel Orringer, Mitchel Berger, Todd Hollon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03037
Pdf URL: https://arxiv.org/pdf/2507.03037
Copy Paste: [[2507.03037]] Intelligent Histology for Tumor Neurosurgery(https://arxiv.org/abs/2507.03037)
Keywords: foundation model
Abstract: The importance of rapid and accurate histologic analysis of surgical tissue in the operating room has been recognized for over a century. Our standard-of-care intraoperative pathology workflow is based on light microscopy and H\&E histology, which is slow, resource-intensive, and lacks real-time digital imaging capabilities. Here, we present an emerging and innovative method for intraoperative histologic analysis, called Intelligent Histology, that integrates artificial intelligence (AI) with stimulated Raman histology (SRH). SRH is a rapid, label-free, digital imaging method for real-time microscopic tumor tissue analysis. SRH generates high-resolution digital images of surgical specimens within seconds, enabling AI-driven tumor histologic analysis, molecular classification, and tumor infiltration detection. We review the scientific background, clinical translation, and future applications of intelligent histology in tumor neurosurgery. We focus on the major scientific and clinical studies that have demonstrated the transformative potential of intelligent histology across multiple neurosurgical specialties, including neurosurgical oncology, skull base, spine oncology, pediatric tumors, and periperal nerve tumors. Future directions include the development of AI foundation models through multi-institutional datasets, incorporating clinical and radiologic data for multimodal learning, and predicting patient outcomes. Intelligent histology represents a transformative intraoperative workflow that can reinvent real-time tumor analysis for 21st century neurosurgery.

Title: LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection

Authors: Ana Vasilcoiu, Ivona Najdenkoska, Zeno Geradts, Marcel Worring
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03054
Pdf URL: https://arxiv.org/pdf/2507.03054
Copy Paste: [[2507.03054]] LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection(https://arxiv.org/abs/2507.03054)
Keywords: diffusion
Abstract: The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This can erode trust in digital media, making it critical to develop generalizable detectors for generated images. Recent methods leverage diffusion denoising cues, but mainly focus on single-step reconstruction errors, ignoring the inherent sequential nature of the denoising process. In this work, we propose LATTE - Latent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across several denoising timesteps. By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images. Each latent is refined by employing our latent-visual feature refinement module and aggregated into a unified representation. Afterwards, it is fused with the visual features and finally passed into a lightweight classifier. Our experiments demonstrate that LATTE surpasses the baselines on several established benchmarks, such as GenImage and DiffusionFake. Moreover, it demonstrates strong performance in cross-generator and cross-datasets settings, highlighting the potential of using the trajectory of latent embeddings for generated image detection. The code is available on the following link: this https URL.

Title: Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference

Authors: Xin Li
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03065
Pdf URL: https://arxiv.org/pdf/2507.03065
Copy Paste: [[2507.03065]] Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference(https://arxiv.org/abs/2507.03065)
Keywords: generative
Abstract: The Helmholtz Machine (HM) is a foundational architecture for unsupervised learning, coupling a bottom-up recognition model with a top-down generative model through alternating inference. However, its reliance on symmetric, data-driven updates constrains its ability to perform goal-directed reasoning or simulate temporally extended processes. In this work, we introduce the \emph{Cycle-Consistent Helmholtz Machine} (C$^2$HM), a novel extension that reframes inference as a \emph{goal-seeded}, \emph{asymmetric} process grounded in structured internal priors. Rather than inferring latent causes solely from sensory data, C$^2$HM simulates plausible latent trajectories conditioned on abstract goals, aligning them with observed outcomes through a recursive cycle of forward generation and inverse refinement. This cycle-consistent formulation integrates top-down structure with bottom-up evidence via a variational loop, enforcing mutual alignment between goal-conditioned latent predictions and recognition-based reconstructions. We formalize this mechanism within the framework of the \emph{Context-Content Uncertainty Principle} (CCUP), which posits that inference proceeds by aligning structured, low-entropy content with high-entropy, ambiguous context. C$^2$HM improves representational efficiency, supports memory chaining via path-dependent inference, and enables spatial compositional imagination. By offering a biologically inspired alternative to classical amortized inference, $C^2$HM reconceives generative modeling as intentional simulation, bridging memory-based planning and unsupervised learning in a unified probabilistic framework.

Title: Expert-level validation of AI-generated medical text with scalable language models

Authors: Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03152
Pdf URL: https://arxiv.org/pdf/2507.03152
Copy Paste: [[2507.03152]] Expert-level validation of AI-generated medical text with scalable language models(https://arxiv.org/abs/2507.03152)
Keywords: self-supervised
Abstract: With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (a LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase ( this https URL ), 2) MedVAL-Bench ( this https URL ), and 3) MedVAL-4B ( this https URL ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.

Title: Adopting a human developmental visual diet yields robust, shape-based AI vision

Authors: Zejin Lu, Sushrut Thorat, Radoslaw M Cichy, Tim C Kietzmann
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.03168
Pdf URL: https://arxiv.org/pdf/2507.03168
Copy Paste: [[2507.03168]] Adopting a human developmental visual diet yields robust, shape-based AI vision(https://arxiv.org/abs/2507.03168)
Keywords: foundation model
Abstract: Despite years of research and the dramatic scaling of artificial intelligence (AI) systems, a striking misalignment between artificial and human vision persists. Contrary to humans, AI heavily relies on texture-features rather than shape information, lacks robustness to image distortions, remains highly vulnerable to adversarial attacks, and struggles to recognise simple abstract shapes within complex backgrounds. To close this gap, we here introduce a solution that arises from a previously underexplored direction: rather than scaling up, we take inspiration from how human vision develops from early infancy into adulthood. We quantified the visual maturation by synthesising decades of psychophysical and neurophysiological research into a novel developmental visual diet (DVD) for AI vision. We show that guiding AI systems through this human-inspired curriculum produces models that closely align with human behaviour on every hallmark of robust vision tested yielding the strongest reported reliance on shape information to date, abstract shape recognition beyond the state of the art, higher robustness to image corruptions, and stronger resilience to adversarial attacks. By outperforming high parameter AI foundation models trained on orders of magnitude more data, we provide evidence that robust AI vision can be achieved by guiding the way how a model learns, not merely how much it learns, offering a resource-efficient route toward safer and more human-like artificial visual systems.

Title: Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data

Authors: Yunrui Qiu, Richard John, Lukas Herron, Pratyush Tiwary
Subjects: cs.LG, cond-mat.stat-mech, physics.bio-ph, physics.chem-ph
Abstract URL: https://arxiv.org/abs/2507.03174
Pdf URL: https://arxiv.org/pdf/2507.03174
Copy Paste: [[2507.03174]] Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data(https://arxiv.org/abs/2507.03174)
Keywords: generative
Abstract: Accurate characterization of the equilibrium distributions of complex molecular systems and their dependence on environmental factors such as temperature is essential for understanding thermodynamic properties and transition mechanisms. Projecting these distributions onto meaningful low-dimensional representations enables interpretability and downstream analysis. Recent advances in generative AI, particularly flow models such as Normalizing Flows (NFs), have shown promise in modeling such distributions, but their scope is limited without tailored representation learning. In this work, we introduce Latent Thermodynamic Flows (LaTF), an end-to-end framework that tightly integrates representation learning and generative modeling. LaTF unifies the State Predictive Information Bottleneck (SPIB) with NFs to simultaneously learn low-dimensional latent representations, referred to as Collective Variables (CVs), classify metastable states, and generate equilibrium distributions across temperatures beyond the training data. The two components of representation learning and generative modeling are optimized jointly, ensuring that the learned latent features capture the system's slow, important degrees of freedom while the generative model accurately reproduces the system's equilibrium behavior. We demonstrate LaTF's effectiveness across diverse systems, including a model potential, the Chignolin protein, and cluster of Lennard Jones particles, with thorough evaluations and benchmarking using multiple metrics and extensive simulations. Finally, we apply LaTF to a RNA tetraloop system, where despite using simulation data from only two temperatures, LaTF reconstructs the temperature-dependent structural ensemble and melting behavior, consistent with experimental and prior extensive computational results.

Title: Subject Invariant Contrastive Learning for Human Activity Recognition

Authors: Yavuz Yarici, Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03250
Pdf URL: https://arxiv.org/pdf/2507.03250
Copy Paste: [[2507.03250]] Subject Invariant Contrastive Learning for Human Activity Recognition(https://arxiv.org/abs/2507.03250)
Keywords: self-supervised
Abstract: The high cost of annotating data makes self-supervised approaches, such as contrastive learning methods, appealing for Human Activity Recognition (HAR). Effective contrastive learning relies on selecting informative positive and negative samples. However, HAR sensor signals are subject to significant domain shifts caused by subject variability. These domain shifts hinder model generalization to unseen subjects by embedding subject-specific variations rather than activity-specific features. As a result, human activity recognition models trained with contrastive learning often struggle to generalize to new subjects. We introduce Subject-Invariant Contrastive Learning (SICL), a simple yet effective loss function to improve generalization in human activity recognition. SICL re-weights negative pairs drawn from the same subject to suppress subject-specific cues and emphasize activity-specific information. We evaluate our loss function on three public benchmarks: UTD-MHAD, MMAct, and DARai. We show that SICL improves performance by up to 11% over traditional contrastive learning methods. Additionally, we demonstrate the adaptability of our loss function across various settings, including multiple self-supervised methods, multimodal scenarios, and supervised learning frameworks.

Title: LACONIC: A 3D Layout Adapter for Controllable Image Creation

Authors: Léopold Maillard, Tom Durand, Adrien Ramanana Rahary, Maks Ovsjanikov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03257
Pdf URL: https://arxiv.org/pdf/2507.03257
Copy Paste: [[2507.03257]] LACONIC: A 3D Layout Adapter for Controllable Image Creation(https://arxiv.org/abs/2507.03257)
Keywords: diffusion, generative
Abstract: Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect consistent three-dimensional geometric structure, underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable number of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.

Title: ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization

Authors: Haosheng Gan, Berk Tinaz, Mohammad Shahab Sepehri, Zalan Fabian, Mahdi Soltanolkotabi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03275
Pdf URL: https://arxiv.org/pdf/2507.03275
Copy Paste: [[2507.03275]] ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization(https://arxiv.org/abs/2507.03275)
Keywords: diffusion, generative
Abstract: Current text-to-image (T2I) benchmarks evaluate models on rigid prompts, potentially underestimating true generative capabilities due to prompt sensitivity and creating biases that favor certain models while disadvantaging others. We introduce ConceptMix++, a framework that disentangles prompt phrasing from visual generation capabilities by applying iterative prompt optimization. Building on ConceptMix, our approach incorporates a multimodal optimization pipeline that leverages vision-language model feedback to refine prompts systematically. Through extensive experiments across multiple diffusion models, we show that optimized prompts significantly improve compositional generation performance, revealing previously hidden model capabilities and enabling fairer comparisons across T2I models. Our analysis reveals that certain visual concepts -- such as spatial relationships and shapes -- benefit more from optimization than others, suggesting that existing benchmarks systematically underestimate model performance in these categories. Additionally, we find strong cross-model transferability of optimized prompts, indicating shared preferences for effective prompt phrasing across models. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities, while our framework provides more accurate assessment and insights for future development.

Title: Global Variational Inference Enhanced Robust Domain Adaptation

Authors: Lingkun Luo, Shiqiang Hu, Liming Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03291
Pdf URL: https://arxiv.org/pdf/2507.03291
Copy Paste: [[2507.03291]] Global Variational Inference Enhanced Robust Domain Adaptation(https://arxiv.org/abs/2507.03291)
Keywords: diffusion, generative
Abstract: Deep learning-based domain adaptation (DA) methods have shown strong performance by learning transferable representations. However, their reliance on mini-batch training limits global distribution modeling, leading to unstable alignment and suboptimal generalization. We propose Global Variational Inference Enhanced Domain Adaptation (GVI-DA), a framework that learns continuous, class-conditional global priors via variational inference to enable structure-aware cross-domain alignment. GVI-DA minimizes domain gaps through latent feature reconstruction, and mitigates posterior collapse using global codebook learning with randomized sampling. It further improves robustness by discarding low-confidence pseudo-labels and generating reliable target-domain samples. Extensive experiments on four benchmarks and thirty-eight DA tasks demonstrate consistent state-of-the-art performance. We also derive the model's evidence lower bound (ELBO) and analyze the effects of prior continuity, codebook size, and pseudo-label noise tolerance. In addition, we compare GVI-DA with diffusion-based generative frameworks in terms of optimization principles and efficiency, highlighting both its theoretical soundness and practical advantages.

Title: Zero-shot Inexact CAD Model Alignment from a Single Image

Authors: Pattaramanee Arsomngern, Sasikarn Khwanmuang, Matthias Nießner, Supasorn Suwajanakorn
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03292
Pdf URL: https://arxiv.org/pdf/2507.03292
Copy Paste: [[2507.03292]] Zero-shot Inexact CAD Model Alignment from a Single Image(https://arxiv.org/abs/2507.03292)
Keywords: self-supervised
Abstract: One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensure multi-view consistency and overcome symmetry ambiguities inherent in foundation features using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.

Title: CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection

Authors: Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Yaqi Wang, Chengfeng Zhou, Xiaobo Li, Dahong Qian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03295
Pdf URL: https://arxiv.org/pdf/2507.03295
Copy Paste: [[2507.03295]] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection(https://arxiv.org/abs/2507.03295)
Keywords: diffusion, generative
Abstract: Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.

Title: Personalized Image Generation from an Author Writing Style

Authors: Sagar Gandhi, Vishal Gandhi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03313
Pdf URL: https://arxiv.org/pdf/2507.03313
Copy Paste: [[2507.03313]] Personalized Image Generation from an Author Writing Style(https://arxiv.org/abs/2507.03313)
Keywords: diffusion, generative
Abstract: Translating nuanced, textually-defined authorial writing styles into compelling visual representations presents a novel challenge in generative AI. This paper introduces a pipeline that leverages Author Writing Sheets (AWS) - structured summaries of an author's literary characteristics - as input to a Large Language Model (LLM, Claude 3.7 Sonnet). The LLM interprets the AWS to generate three distinct, descriptive text-to-image prompts, which are then rendered by a diffusion model (Stable Diffusion 3.5 Medium). We evaluated our approach using 49 author styles from Reddit data, with human evaluators assessing the stylistic match and visual distinctiveness of the generated images. Results indicate a good perceived alignment between the generated visuals and the textual authorial profiles (mean style match: $4.08/5$), with images rated as moderately distinctive. Qualitative analysis further highlighted the pipeline's ability to capture mood and atmosphere, while also identifying challenges in representing highly abstract narrative elements. This work contributes a novel end-to-end methodology for visual authorial style personalization and provides an initial empirical validation, opening avenues for applications in creative assistance and cross-modal understanding.

Title: Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents

Authors: Zhao Wang, Bowen Chen, Yotaro Shimose, Sota Moriyama, Heng Wang, Shingo Takamatsu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03326
Pdf URL: https://arxiv.org/pdf/2507.03326
Copy Paste: [[2507.03326]] Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents(https://arxiv.org/abs/2507.03326)
Keywords: diffusion, generative
Abstract: Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity -- they require structured layouts, precise typography, consistent branding, and more. In this paper, we introduce MIMO (Mirror In-the-Model), an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural language based prompt and logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion and LLM-based baselines in real-world banner design scenarios.

Title: Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling

Authors: Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, Miki Haseyama
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03331
Pdf URL: https://arxiv.org/pdf/2507.03331
Copy Paste: [[2507.03331]] Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling(https://arxiv.org/abs/2507.03331)
Keywords: generative
Abstract: To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks.

Title: De-Fake: Style based Anomaly Deepfake Detection

Authors: Sudev Kumar Padhi, Harshit Kumar, Umesh Kashyap, Sk. Subidh Ali
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03334
Pdf URL: https://arxiv.org/pdf/2507.03334
Copy Paste: [[2507.03334]] De-Fake: Style based Anomaly Deepfake Detection(https://arxiv.org/abs/2507.03334)
Keywords: anomaly
Abstract: Detecting deepfakes involving face-swaps presents a significant challenge, particularly in real-world scenarios where anyone can perform face-swapping with freely available tools and apps without any technical knowledge. Existing deepfake detection methods rely on facial landmarks or inconsistencies in pixel-level features and often struggle with face-swap deepfakes, where the source face is seamlessly blended into the target image or video. The prevalence of face-swap is evident in everyday life, where it is used to spread false information, damage reputations, manipulate political opinions, create non-consensual intimate deepfakes (NCID), and exploit children by enabling the creation of child sexual abuse material (CSAM). Even prominent public figures are not immune to its impact, with numerous deepfakes of them circulating widely across social media platforms. Another challenge faced by deepfake detection methods is the creation of datasets that encompass a wide range of variations, as training models require substantial amounts of data. This raises privacy concerns, particularly regarding the processing and storage of personal facial data, which could lead to unauthorized access or misuse. Our key idea is to identify these style discrepancies to detect face-swapped images effectively without accessing the real facial image. We perform comprehensive evaluations using multiple datasets and face-swapping methods, which showcases the effectiveness of SafeVision in detecting face-swap deepfakes across diverse scenarios. SafeVision offers a reliable and scalable solution for detecting face-swaps in a privacy preserving manner, making it particularly effective in challenging real-world applications. To the best of our knowledge, SafeVision is the first deepfake detection using style features while providing inherent privacy protection.

Title: Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Authors: Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03393
Pdf URL: https://arxiv.org/pdf/2507.03393
Copy Paste: [[2507.03393]] Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos(https://arxiv.org/abs/2507.03393)
Keywords: diffusion
Abstract: In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model's capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at this https URL.

Title: Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images

Authors: Yuran Dong, Mang Ye
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03402
Pdf URL: https://arxiv.org/pdf/2507.03402
Copy Paste: [[2507.03402]] Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images(https://arxiv.org/abs/2507.03402)
Keywords: diffusion
Abstract: To advance real-world fashion image editing, we analyze existing two-stage pipelines(mask generation followed by diffusion-based editing)which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like upper torso; fine-grained clothes masks preserve poses but forbid style/length customization). II) weak pose robustness (mask generators fail due to articulated poses and miss rare regions like waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. In Pose-Star, we calibrate diffusion-derived attention (Star tokens) via skeletal keypoints to enhance rare structure localization in complex poses, suppress noise through phase-aware analysis of attention dynamics (Convergence,Stabilization,Divergence) with threshold masking and sliding-window fusion, and refine edges via cross-self attention merging and Canny alignment. This work bridges controlled benchmarks and open-world demands, pioneering anatomy-aware, pose-robust editing and laying the foundation for industrial fashion image editing.

Title: Generating Synthetic Relational Tabular Data via Structural Causal Models

Authors: Frederik Hoppe, Astrid Franz, Lars Kleinemeier, Udo Göbel
Subjects: cs.LG, cs.AI, stat.AP
Abstract URL: https://arxiv.org/abs/2507.03528
Pdf URL: https://arxiv.org/pdf/2507.03528
Copy Paste: [[2507.03528]] Generating Synthetic Relational Tabular Data via Structural Causal Models(https://arxiv.org/abs/2507.03528)
Keywords: foundation model
Abstract: Synthetic tabular data generation has received increasing attention in recent years, particularly with the emergence of foundation models for tabular data. The breakthrough success of TabPFN (Hollmann et al.,2025), which leverages vast quantities of synthetic tabular datasets derived from structural causal models (SCMs), demonstrates the critical role synthetic data plays in developing powerful tabular foundation models. However, most real-world tabular data exists in relational formats spanning multiple interconnected tables - a structure not adequately addressed by current generation methods. In this work, we extend the SCM-based approach by developing a novel framework that generates realistic synthetic relational tabular data including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.

Title: Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition

Authors: Redwan Sony, Parisa Farmanifard, Arun Ross, Anil K. Jain
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03541
Pdf URL: https://arxiv.org/pdf/2507.03541
Copy Paste: [[2507.03541]] Foundation versus Domain-specific Models: Performance Comparison, Fusion, and Explainability in Face Recognition(https://arxiv.org/abs/2507.03541)
Keywords: foundation model
Abstract: In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVa, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we are able to report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models improves on over-segmented face images than tightly cropped faces thereby suggesting the importance of contextual clues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250 while the TMR of domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., ``Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour...''). In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., ``Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity...and the pair is likely of the same person''), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.

Title: Beyond Accuracy: Metrics that Uncover What Makes a `Good' Visual Descriptor

Authors: Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03542
Pdf URL: https://arxiv.org/pdf/2507.03542
Copy Paste: [[2507.03542]] Beyond Accuracy: Metrics that Uncover What Makes a `Good' Visual Descriptor(https://arxiv.org/abs/2507.03542)
Keywords: foundation model
Abstract: Text-based visual descriptors-ranging from simple class names to more descriptive phrases-are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics-Global Alignment and CLIP Similarity-that move beyond accuracy. These metrics allow us to shed light on how different descriptor generation strategies interact with foundation model properties, offering insights into ways of studying descriptor effectiveness beyond accuracy evaluations.

Title: SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications

Authors: Yana Hasson, Pauline Luc, Liliane Momeni, Maks Ovsjanikov, Guillaume Le Moing, Alina Kuznetsova, Ira Ktena, Jennifer J. Sun, Skanda Koppula, Dilara Gokay, Joseph Heyward, Etienne Pot, Andrew Zisserman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03578
Pdf URL: https://arxiv.org/pdf/2507.03578
Copy Paste: [[2507.03578]] SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications(https://arxiv.org/abs/2507.03578)
Keywords: foundation model
Abstract: In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five *Sci*entific *Vid*eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at this https URL to facilitate further research in the development of ViFMs.

Title: Kinetic Langevin Diffusion for Crystalline Materials Generation

Authors: François Cornet, Federico Bergamin, Arghya Bhowmik, Juan Maria Garcia Lastra, Jes Frellsen, Mikkel N. Schmidt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03602
Pdf URL: https://arxiv.org/pdf/2507.03602
Copy Paste: [[2507.03602]] Kinetic Langevin Diffusion for Crystalline Materials Generation(https://arxiv.org/abs/2507.03602)
Keywords: diffusion, generative
Abstract: Generative modeling of crystalline materials using diffusion models presents a series of challenges: the data distribution is characterized by inherent symmetries and involves multiple modalities, with some defined on specific manifolds. Notably, the treatment of fractional coordinates representing atomic positions in the unit cell requires careful consideration, as they lie on a hypertorus. In this work, we introduce Kinetic Langevin Diffusion for Materials (KLDM), a novel diffusion model for crystalline materials generation, where the key innovation resides in the modeling of the coordinates. Instead of resorting to Riemannian diffusion on the hypertorus directly, we generalize Trivialized Diffusion Model (TDM) to account for the symmetries inherent to crystals. By coupling coordinates with auxiliary Euclidean variables representing velocities, the diffusion process is now offset to a flat space. This allows us to effectively perform diffusion on the hypertorus while providing a training objective that accounts for the periodic translation symmetry of the true data distribution. We evaluate KLDM on both Crystal Structure Prediction (CSP) and De-novo Generation (DNG) tasks, demonstrating its competitive performance with current state-of-the-art models.

Title: From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis

Authors: Amir Hojjati, Lu Li, Ibrahim Hameed, Anis Yazidi, Pedro G. Lind, Rabindra Khadka
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03633
Pdf URL: https://arxiv.org/pdf/2507.03633
Copy Paste: [[2507.03633]] From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis(https://arxiv.org/abs/2507.03633)
Keywords: self-supervised
Abstract: EEG signals capture brain activity with high temporal and low spatial resolution, supporting applications such as neurological diagnosis, cognitive monitoring, and brain-computer interfaces. However, effective analysis is hindered by limited labeled data, high dimensionality, and the absence of scalable models that fully capture spatiotemporal dependencies. Existing self-supervised learning (SSL) methods often focus on either spatial or temporal features, leading to suboptimal representations. To this end, we propose EEG-VJEPA, a novel adaptation of the Video Joint Embedding Predictive Architecture (V-JEPA) for EEG classification. By treating EEG as video-like sequences, EEG-VJEPA learns semantically meaningful spatiotemporal representations using joint embeddings and adaptive masking. To our knowledge, this is the first work that exploits V-JEPA for EEG classification and explores the visual concepts learned by the model. Evaluations on the publicly available Temple University Hospital (TUH) Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification this http URL classification accuracy, EEG-VJEPA captures physiologically relevant spatial and temporal signal patterns, offering interpretable embeddings that may support human-AI collaboration in diagnostic workflows. These findings position EEG-VJEPA as a promising framework for scalable, trustworthy EEG analysis in real-world clinical settings.

Title: SecureT2I: No More Unauthorized Manipulation on AI Generated Images from Prompts

Authors: Xiaodong Wu, Xiangman Li, Qi Li, Jianbing Ni, Rongxing Lu
Subjects: cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2507.03636
Pdf URL: https://arxiv.org/pdf/2507.03636
Copy Paste: [[2507.03636]] SecureT2I: No More Unauthorized Manipulation on AI Generated Images from Prompts(https://arxiv.org/abs/2507.03636)
Keywords: diffusion, generative
Abstract: Text-guided image manipulation with diffusion models enables flexible and precise editing based on prompts, but raises ethical and copyright concerns due to potential unauthorized modifications. To address this, we propose SecureT2I, a secure framework designed to prevent unauthorized editing in diffusion-based generative models. SecureT2I is compatible with both general-purpose and domain-specific models and can be integrated via lightweight fine-tuning without architectural changes. We categorize images into a permit set and a forbid set based on editing permissions. For the permit set, the model learns to perform high-quality manipulations as usual. For the forbid set, we introduce training objectives that encourage vague or semantically ambiguous outputs (e.g., blurred images), thereby suppressing meaningful edits. The core challenge is to block unauthorized editing while preserving editing quality for permitted inputs. To this end, we design separate loss functions that guide selective editing behavior. Extensive experiments across multiple datasets and models show that SecureT2I effectively degrades manipulation quality on forbidden images while maintaining performance on permitted ones. We also evaluate generalization to unseen inputs and find that SecureT2I consistently outperforms baselines. Additionally, we analyze different vagueness strategies and find that resize-based degradation offers the best trade-off for secure manipulation control.

Title: When There Is No Decoder: Removing Watermarks from Stable Diffusion Models in a No-box Setting

Authors: Xiaodong Wu, Tianyi Tang, Xiangman Li, Jianbing Ni, Yong Yu
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.03646
Pdf URL: https://arxiv.org/pdf/2507.03646
Copy Paste: [[2507.03646]] When There Is No Decoder: Removing Watermarks from Stable Diffusion Models in a No-box Setting(https://arxiv.org/abs/2507.03646)
Keywords: diffusion
Abstract: Watermarking has emerged as a promising solution to counter harmful or deceptive AI-generated content by embedding hidden identifiers that trace content origins. However, the robustness of current watermarking techniques is still largely unexplored, raising critical questions about their effectiveness against adversarial attacks. To address this gap, we examine the robustness of model-specific watermarking, where watermark embedding is integrated with text-to-image generation in models like latent diffusion models. We introduce three attack strategies: edge prediction-based, box blurring, and fine-tuning-based attacks in a no-box setting, where an attacker does not require access to the ground-truth watermark decoder. Our findings reveal that while model-specific watermarking is resilient against basic evasion attempts, such as edge prediction, it is notably vulnerable to blurring and fine-tuning-based attacks. Our best-performing attack achieves a reduction in watermark detection accuracy to approximately 47.92\%. Additionally, we perform an ablation study on factors like message length, kernel size and decoder depth, identifying critical parameters influencing the fine-tuning attack's success. Finally, we assess several advanced watermarking defenses, finding that even the most robust methods, such as multi-label smoothing, result in watermark extraction accuracy that falls below an acceptable level when subjected to our no-box attacks.

Title: When Network Architecture Meets Physics: Deep Operator Learning for Coupled Multiphysics

Authors: Kazuma Kobayashi, Jaewan Park, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03660
Pdf URL: https://arxiv.org/pdf/2507.03660
Copy Paste: [[2507.03660]] When Network Architecture Meets Physics: Deep Operator Learning for Coupled Multiphysics(https://arxiv.org/abs/2507.03660)
Keywords: diffusion
Abstract: Scientific applications increasingly demand real-time surrogate models that can capture the behavior of strongly coupled multiphysics systems driven by multiple input functions, such as in thermo-mechanical and electro-thermal processes. While neural operator frameworks, such as Deep Operator Networks (DeepONets), have shown considerable success in single-physics settings, their extension to multiphysics problems remains poorly understood. In particular, the challenge of learning nonlinear interactions between tightly coupled physical fields has received little systematic attention. This study addresses a foundational question: should the architectural design of a neural operator reflect the strength of physical coupling it aims to model? To answer this, we present the first comprehensive, architecture-aware evaluation of DeepONet variants across three regimes: single-physics, weakly coupled, and strongly coupled multiphysics systems. We consider a reaction-diffusion equation with dual spatial inputs, a nonlinear thermo-electrical problem with bidirectional coupling through temperature-dependent conductivity, and a viscoplastic thermo-mechanical model of steel solidification governed by transient phase-driven interactions. Two operator-learning frameworks, the classical DeepONet and its sequential GRU-based extension, S-DeepONet, are benchmarked using both single-branch and multi-branch (MIONet-style) architectures. Our results demonstrate that architectural alignment with physical coupling is crucial: single-branch networks significantly outperform multi-branch counterparts in strongly coupled settings, whereas multi-branch encodings offer advantages for decoupled or single-physics problems. Once trained, these surrogates achieve full-field predictions up to 1.8e4 times faster than high-fidelity finite-element solvers, without compromising solution accuracy.

Title: SAMed-2: Selective Memory Enhanced Medical Segment Anything Model

Authors: Zhiling Yan, Sifan Song, Dingjie Song, Yiwei Li, Rong Zhou, Weixiang Sun, Zhennong Chen, Sekeun Kim, Hui Ren, Tianming Liu, Quanzheng Li, Xiang Li, Lifang He, Lichao Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03698
Pdf URL: https://arxiv.org/pdf/2507.03698
Copy Paste: [[2507.03698]] SAMed-2: Selective Memory Enhanced Medical Segment Anything Model(https://arxiv.org/abs/2507.03698)
Keywords: foundation model
Abstract: Recent "segment anything" efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: this https URL.

Title: FAROS: Fair Graph Generation via Attribute Switching Mechanisms

Authors: Abdennacer Badaoui, Oussama Kharouiche, Hatim Mrabet, Daniele Malitesta, Fragkiskos D. Malliaros
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03728
Pdf URL: https://arxiv.org/pdf/2507.03728
Copy Paste: [[2507.03728]] FAROS: Fair Graph Generation via Attribute Switching Mechanisms(https://arxiv.org/abs/2507.03728)
Keywords: diffusion
Abstract: Recent advancements in graph diffusion models (GDMs) have enabled the synthesis of realistic network structures, yet ensuring fairness in the generated data remains a critical challenge. Existing solutions attempt to mitigate bias by re-training the GDMs with ad-hoc fairness constraints. Conversely, with this work, we propose FAROS, a novel FAir graph geneRatiOn framework leveraging attribute Switching mechanisms and directly running in the generation process of the pre-trained GDM. Technically, our approach works by altering nodes' sensitive attributes during the generation. To this end, FAROS calculates the optimal fraction of switching nodes, and selects the diffusion step to perform the switch by setting tailored multi-criteria constraints to preserve the node-topology profile from the original distribution (a proxy for accuracy) while ensuring the edge independence on the sensitive attributes for the generated graph (a proxy for fairness). Our experiments on benchmark datasets for link prediction demonstrate that the proposed approach effectively reduces fairness discrepancies while maintaining comparable (or even higher) accuracy performance to other similar baselines. Noteworthy, FAROS is also able to strike a better accuracy-fairness trade-off than other competitors in some of the tested settings under the Pareto optimality concept, demonstrating the effectiveness of the imposed multi-criteria constraints.

Title: Flow-Anchored Consistency Models

Authors: Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, Feng Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03738
Pdf URL: https://arxiv.org/pdf/2507.03738
Copy Paste: [[2507.03738]] Flow-Anchored Consistency Models(https://arxiv.org/abs/2507.03738)
Keywords: generative
Abstract: Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow during training. We introduce the Flow-Anchored Consistency Model (FACM), a simple but effective training strategy that uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective. This Flow-Anchoring approach requires no architectural modifications and is broadly compatible with standard model architectures. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.76 with just one step (NFE=1) on ImageNet 256x256, significantly outperforming previous methods. This provides a general and effective recipe for building high-performance, few-step generative models. Our code and pretrained models: this https URL.

Title: ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

Authors: Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03739
Pdf URL: https://arxiv.org/pdf/2507.03739
Copy Paste: [[2507.03739]] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays(https://arxiv.org/abs/2507.03739)
Keywords: generative
Abstract: The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists' capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists' workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.

Title: StreamDiT: Real-Time Streaming Text-to-Video Generation

Authors: Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao
Subjects: cs.CV, cs.AI, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2507.03745
Pdf URL: https://arxiv.org/pdf/2507.03745
Copy Paste: [[2507.03745]] StreamDiT: Real-Time Streaming Text-to-Video Generation(https://arxiv.org/abs/2507.03745)
Keywords: diffusion
Abstract: Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples in our project website: https://arxiv.org/abs/2507.03779
Pdf URL: https://arxiv.org/pdf/2507.03779
Copy Paste: [[2507.03779]] FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed(https://arxiv.org/abs/2507.03779)
Keywords: self-supervised, foundation model
Abstract: Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning--which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence--and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum--low-frequency being seen first--and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models robustness. The code is available at this https URL

Title: Interpretable Diffusion Models with B-cos Networks

Authors: Nicola Bernold, Moritz Vandenhirtz, Alice Bizeul, Julia E. Vogt
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.03846
Pdf URL: https://arxiv.org/pdf/2507.03846
Copy Paste: [[2507.03846]] Interpretable Diffusion Models with B-cos Networks(https://arxiv.org/abs/2507.03846)
Keywords: diffusion
Abstract: Text-to-image diffusion models generate images by iteratively denoising random noise, conditioned on a prompt. While these models have enabled impressive progress in image generation, they often fail to accurately reflect all semantic information described in the prompt -- failures that are difficult to detect automatically. In this work, we introduce a diffusion model architecture built with B-cos modules that offers inherent interpretability. Our approach provides insight into how individual prompt tokens affect the generated image by producing explanations that highlight the pixel regions influenced by each token. We demonstrate that B-cos diffusion models can produce high-quality images while providing meaningful insights into prompt-image alignment.

Title: Enhanced accuracy through ensembling of randomly initialized auto-regressive models for time-dependent PDEs

Authors: Ishan Khurjekar, Indrashish Saha, Lori Graham-Brady, Somdatta Goswami
Subjects: cs.LG, cs.AI, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2507.03863
Pdf URL: https://arxiv.org/pdf/2507.03863
Copy Paste: [[2507.03863]] Enhanced accuracy through ensembling of randomly initialized auto-regressive models for time-dependent PDEs(https://arxiv.org/abs/2507.03863)
Keywords: diffusion
Abstract: Systems governed by partial differential equations (PDEs) require computationally intensive numerical solvers to predict spatiotemporal field evolution. While machine learning (ML) surrogates offer faster solutions, autoregressive inference with ML models suffer from error accumulation over successive predictions, limiting their long-term accuracy. We propose a deep ensemble framework to address this challenge, where multiple ML surrogate models with random weight initializations are trained in parallel and aggregated during inference. This approach leverages the diversity of model predictions to mitigate error propagation while retaining the autoregressive strategies ability to capture the system's time dependent relations. We validate the framework on three PDE-driven dynamical systems - stress evolution in heterogeneous microstructures, Gray-Scott reaction-diffusion, and planetary-scale shallow water system - demonstrating consistent reduction in error accumulation over time compared to individual models. Critically, the method requires only a few time steps as input, enabling full trajectory predictions with inference times significantly faster than numerical solvers. Our results highlight the robustness of ensemble methods in diverse physical systems and their potential as efficient and accurate alternatives to traditional solvers. The codes for this work are available on GitHub (this https URL).

Title: GenAI-Powered Inference

Authors: Kosuke Imai, Kentaro Nakamura
Subjects: cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03897
Pdf URL: https://arxiv.org/pdf/2507.03897
Copy Paste: [[2507.03897]] GenAI-Powered Inference(https://arxiv.org/abs/2507.03897)
Keywords: diffusion, generative
Abstract: We introduce GenAI-Powered Inference (GPI), a statistical framework for both causal and predictive inference using unstructured data, including text and images. GPI leverages open-source Generative Artificial Intelligence (GenAI) models - such as large language models and diffusion models - not only to generate unstructured data at scale but also to extract low-dimensional representations that capture their underlying structure. Applying machine learning to these representations, GPI enables estimation of causal and predictive effects while quantifying associated estimation uncertainty. Unlike existing approaches to representation learning, GPI does not require fine-tuning of generative models, making it computationally efficient and broadly accessible. We illustrate the versatility of the GPI framework through three applications: (1) analyzing Chinese social media censorship, (2) estimating predictive effects of candidates' facial appearance on electoral outcomes, and (3) assessing the persuasiveness of political rhetoric. An open-source software package is available for implementing GPI.

Title: Transformer Model for Alzheimer's Disease Progression Prediction Using Longitudinal Visit Sequences

Authors: Mahdi Moghaddami, Clayton Schubring, Mohammad-Reza Siadat
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2507.03899
Pdf URL: https://arxiv.org/pdf/2507.03899
Copy Paste: [[2507.03899]] Transformer Model for Alzheimer's Disease Progression Prediction Using Longitudinal Visit Sequences(https://arxiv.org/abs/2507.03899)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder with no known cure that affects tens of millions of people worldwide. Early detection of AD is critical for timely intervention to halt or slow the progression of the disease. In this study, we propose a Transformer model for predicting the stage of AD progression at a subject's next clinical visit using features from a sequence of visits extracted from the subject's visit history. We also rigorously compare our model to recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent unit (GRU), and minimalRNN and assess their performances based on factors such as the length of prior visits and data imbalance. We test the importance of different feature categories and visit history, as well as compare the model to a newer Transformer-based model optimized for time series. Our model demonstrates strong predictive performance despite missing visits and missing features in available visits, particularly in identifying converter subjects -- individuals transitioning to more severe disease stages -- an area that has posed significant challenges in longitudinal prediction. The results highlight the model's potential in enhancing early diagnosis and patient outcomes.

Title: Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection

Authors: Hanzhe Liang, Jie Zhang, Tao Dai, Linlin Shen, Jinbao Wang, Can Gao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03903
Pdf URL: https://arxiv.org/pdf/2507.03903
Copy Paste: [[2507.03903]] Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection(https://arxiv.org/abs/2507.03903)
Keywords: anomaly
Abstract: Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network~(Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.

Title: Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces

Authors: Henry B. Moss, Sebastian W. Ober, Tom Diethe
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03910
Pdf URL: https://arxiv.org/pdf/2507.03910
Copy Paste: [[2507.03910]] Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces(https://arxiv.org/abs/2507.03910)
Keywords: generative
Abstract: Bayesian optimisation in the latent space of a Variational AutoEncoder (VAE) is a powerful framework for optimisation tasks over complex structured domains, such as the space of scientifically interesting molecules. However, existing approaches tightly couple the surrogate and generative models, which can lead to suboptimal performance when the latent space is not tailored to specific tasks, which in turn has led to the proposal of increasingly sophisticated algorithms. In this work, we explore a new direction, instead proposing a decoupled approach that trains a generative model and a Gaussian Process (GP) surrogate separately, then combines them via a simple yet principled Bayesian update rule. This separation allows each component to focus on its strengths -- structure generation from the VAE and predictive modelling by the GP. We show that our decoupled approach improves our ability to identify high-potential candidates in molecular optimisation problems under constrained evaluation budgets.

Title: DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering

Authors: Rongjia Zheng, Qing Zhang, Chengjiang Long, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03924
Pdf URL: https://arxiv.org/pdf/2507.03924
Copy Paste: [[2507.03924]] DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering(https://arxiv.org/abs/2507.03924)
Keywords: diffusion, generative
Abstract: Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to constrain that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods.

Title: Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study

Authors: Kai Ye, Tianyi Chen, Zhen Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.03953
Pdf URL: https://arxiv.org/pdf/2507.03953
Copy Paste: [[2507.03953]] Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study(https://arxiv.org/abs/2507.03953)
Keywords: diffusion
Abstract: With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation based protection methods: AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC--across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: this https URL.

Title: Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data

Authors: Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, Frank Hutter
Subjects: cs.LG, cs.AI, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2507.03971
Pdf URL: https://arxiv.org/pdf/2507.03971
Copy Paste: [[2507.03971]] Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data(https://arxiv.org/abs/2507.03971)
Keywords: foundation model
Abstract: Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.

Title: CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning

Authors: Jeonghyo Song, Kimin Yun, DaeUng Jo, Jinyoung Kim, Youngjoon Yoo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.03984
Pdf URL: https://arxiv.org/pdf/2507.03984
Copy Paste: [[2507.03984]] CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning(https://arxiv.org/abs/2507.03984)
Keywords: foundation model, anomaly
Abstract: Effective Out-of-Distribution (OOD) detection is criti-cal for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.

Title: Fast Re-Trainable Attention Autoencoder for Liquid Sensor Anomaly Detection at the Edge

Authors: Seongyun Choi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.03995
Pdf URL: https://arxiv.org/pdf/2507.03995
Copy Paste: [[2507.03995]] Fast Re-Trainable Attention Autoencoder for Liquid Sensor Anomaly Detection at the Edge(https://arxiv.org/abs/2507.03995)
Keywords: anomaly
Abstract: A lightweight, edge-deployable pipeline is proposed for detecting sensor anomalies in chemistry and biology laboratories. A custom PCB captures seven sensor channels and streams them over the local network. An Attention-based One-Class Autoencoder reaches a usable state after training on only thirty minutes of normal data. Despite the small data set, the model already attains an F1 score of 0.72, a precision of 0.89, and a recall of 0.61 when tested on synthetic micro-anomalies. The trained network is converted into a TensorFlow-Lite binary of about 31 kB and runs on an Advantech ARK-1221L, a fan-less x86 edge device without AVX instructions; end-to-end inference latency stays below two seconds. The entire collect-train-deploy workflow finishes within one hour, which demonstrates that the pipeline adapts quickly whenever a new liquid or sensor is introduced.

Title: Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation

Authors: Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04049
Pdf URL: https://arxiv.org/pdf/2507.04049
Copy Paste: [[2507.04049]] Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation(https://arxiv.org/abs/2507.04049)
Keywords: diffusion
Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate the diversity of multi-mode this http URL experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.

Title: Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

Authors: Xiao Liu, Nan Pu, Haiyang Zheng, Wenjing Li, Nicu Sebe, Zhun Zhong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04051
Pdf URL: https://arxiv.org/pdf/2507.04051
Copy Paste: [[2507.04051]] Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery(https://arxiv.org/abs/2507.04051)
Keywords: diffusion
Abstract: In this paper, we investigate a practical yet challenging task: On-the-fly Category Discovery (OCD). This task focuses on the online identification of newly arriving stream data that may belong to both known and unknown categories, utilizing the category knowledge from only labeled data. Existing OCD methods are devoted to fully mining transferable knowledge from only labeled data. However, the transferability learned by these methods is limited because the knowledge contained in known categories is often insufficient, especially when few annotated data/categories are available in fine-grained recognition. To mitigate this limitation, we propose a diffusion-based OCD framework, dubbed DiffGRE, which integrates Generation, Refinement, and Encoding in a multi-stage fashion. Specifically, we first design an attribute-composition generation method based on cross-image interpolation in the diffusion latent space to synthesize novel samples. Then, we propose a diversity-driven refinement approach to select the synthesized images that differ from known categories for subsequent OCD model training. Finally, we leverage a semi-supervised leader encoding to inject additional category knowledge contained in synthesized data into the OCD models, which can benefit the discovery of both known and unknown categories during the on-the-fly inference process. Extensive experiments demonstrate the superiority of our DiffGRE over previous methods on six fine-grained datasets.

Title: Consistent and Invariant Generalization Learning for Short-video Misinformation Detection

Authors: Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2507.04061
Pdf URL: https://arxiv.org/pdf/2507.04061
Copy Paste: [[2507.04061]] Consistent and Invariant Generalization Learning for Short-video Misinformation Detection(https://arxiv.org/abs/2507.04061)
Keywords: diffusion
Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify the misinformation in the video format accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we propose deep insights into the characteristics of different domains: (1) The detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) will be accumulated in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We involve the cross-modal feature interpolation to map multiple modalities into a shared space and the interpolation distillation to synchronize multi-modal learning; (2) We design the diffusion model to add noise to retain core features of multi modal and enhance domain invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is public available at this https URL.

Title: Token Level Hallucination Detection via Variance in Language Models

Authors: Keshav Kumar
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04137
Pdf URL: https://arxiv.org/pdf/2507.04137
Copy Paste: [[2507.04137]] Token Level Hallucination Detection via Variance in Language Models(https://arxiv.org/abs/2507.04137)
Keywords: generative
Abstract: Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.

Title: Pedestrian Intention Prediction via Vision-Language Foundation Models

Authors: Mohsen Azarmi, Mahdi Rezaei, He Wang
Subjects: cs.CV, cs.AI, cs.ET, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04141
Pdf URL: https://arxiv.org/pdf/2507.04141
Copy Paste: [[2507.04141]] Pedestrian Intention Prediction via Vision-Language Foundation Models(https://arxiv.org/abs/2507.04141)
Keywords: foundation model
Abstract: Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, physical cues observations, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets-JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances the prediction accuracy up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded 12.5% further accuracy gains. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.

Title: Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation

Authors: Fernando Gabriela Garcia, Spencer Burns, Ryan Shaw, Hunter Young
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04151
Pdf URL: https://arxiv.org/pdf/2507.04151
Copy Paste: [[2507.04151]] Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation(https://arxiv.org/abs/2507.04151)
Keywords: diffusion, self-supervised, generative
Abstract: This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component. Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM's enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.

Title: LVLM-Composer's Explicit Planning for Image Generation

Authors: Spencer Ramsey, Jeffrey Lee, Amina Grant
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04152
Pdf URL: https://arxiv.org/pdf/2507.04152
Copy Paste: [[2507.04152]] LVLM-Composer's Explicit Planning for Image Generation(https://arxiv.org/abs/2507.04152)
Keywords: generative
Abstract: The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. We propose a multi-stage training paradigm, featuring Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction, to instill robust compositional reasoning. Extensive experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer's superior performance across critical compositional dimensions including object accuracy, composition fidelity, and pose accuracy, significantly outperforming state-of-the-art baselines. An in-depth ablation study further validates the indispensable contribution of our proposed modules, while human evaluations confirm the perceptual superiority of our generated images. LVLM-Composer represents a significant step towards truly controllable and compositionally accurate open-ended text-to-image generation.

Title: ML-Enhanced AES Anomaly Detection for Real-Time Embedded Security

Authors: Nishant Chinnasami, Rye Stahle-Smith, Rasha Karakchi
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2507.04197
Pdf URL: https://arxiv.org/pdf/2507.04197
Copy Paste: [[2507.04197]] ML-Enhanced AES Anomaly Detection for Real-Time Embedded Security(https://arxiv.org/abs/2507.04197)
Keywords: anomaly
Abstract: Advanced Encryption Standard (AES) is a widely adopted cryptographic algorithm, yet its practical implementations remain susceptible to side-channel and fault injection attacks. In this work, we propose a comprehensive framework that enhances AES-128 encryption security through controlled anomaly injection and real-time anomaly detection using both statistical and machine learning (ML) methods. We simulate timing and fault-based anomalies by injecting execution delays and ciphertext perturbations during encryption, generating labeled datasets for detection model training. Two complementary detection mechanisms are developed: a threshold-based timing anomaly detector and a supervised Random Forest classifier trained on combined timing and ciphertext features. We implement and evaluate the framework on both CPU and FPGA-based SoC hardware (PYNQ-Z1), measuring performance across varying block sizes, injection rates, and core counts. Our results show that ML-based detection significantly outperforms threshold-based methods in precision and recall while maintaining real-time performance on embedded hardware. Compared to existing AES anomaly detection methods, our solution offers a low-cost, real-time, and accurate detection approach deployable on lightweight FPGA platforms.

Title: An explicit formulation of the learned noise predictor $ε_θ({\bf x}_t, t)$ via the forward-process noise $ε_{t}$ in denoising diffusion probabilistic models (DDPMs)

Authors: KiHyun Yun
Subjects: cs.LG, math.AP
Abstract URL: https://arxiv.org/abs/2507.04203
Pdf URL: https://arxiv.org/pdf/2507.04203
Copy Paste: [[2507.04203]] An explicit formulation of the learned noise predictor $ε_θ({\bf x}_t, t)$ via the forward-process noise $ε_{t}$ in denoising diffusion probabilistic models (DDPMs)(https://arxiv.org/abs/2507.04203)
Keywords: diffusion, generative
Abstract: In denoising diffusion probabilistic models (DDPMs), the learned noise predictor $ \epsilon_{\theta} ( {\bf x}_t , t)$ is trained to approximate the forward-process noise $\epsilon_t$. The equality $\nabla_{{\bf x}_t} \log q({\bf x}_t) = -\frac 1 {\sqrt {1- {\bar \alpha}_t} } \epsilon_{\theta} ( {\bf x}_t , t)$ plays a fundamental role in both theoretical analyses and algorithmic design, and thus is frequently employed across diffusion-based generative models. In this paper, an explicit formulation of $ \epsilon_{\theta} ( {\bf x}_t , t)$ in terms of the forward-process noise $\epsilon_t$ is derived. This result show how the forward-process noise $\epsilon_t$ contributes to the learned predictor $ \epsilon_{\theta} ( {\bf x}_t , t)$. Furthermore, based on this formulation, we present a novel and mathematically rigorous proof of the fundamental equality above, clarifying its origin and providing new theoretical insight into the structure of diffusion models.

Title: Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration

Authors: Yu-Shan Tai, An-Yeu (Andy)Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04207
Pdf URL: https://arxiv.org/pdf/2507.04207
Copy Paste: [[2507.04207]] Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration(https://arxiv.org/abs/2507.04207)
Keywords: diffusion
Abstract: Recent advancements in diffusion models have demonstrated remarkable success in various image generation tasks. Building upon these achievements, diffusion models have also been effectively adapted to image restoration tasks, e.g., super-resolution and deblurring, aiming to recover high-quality images from degraded inputs. Although existing zero-shot approaches enable pretrained diffusion models to perform restoration tasks without additional fine-tuning, these methods often suffer from prolonged iteration times in the denoising process. To address this limitation, we propose a Quick Bypass Mechanism (QBM), a strategy that significantly accelerates the denoising process by initializing from an intermediate approximation, effectively bypassing early denoising steps. Furthermore, recognizing that approximation may introduce inconsistencies, we introduce a Revised Reverse Process (RRP), which adjusts the weighting of random noise to enhance the stochasticity and mitigate potential disharmony. We validate proposed methods on ImageNet-1K and CelebA-HQ across multiple image restoration tasks, e.g., super-resolution, deblurring, and compressed sensing. Our experimental results show that the proposed methods can effectively accelerate existing methods while maintaining original performance.

Title: DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design

Authors: Xiwei Hu, Haokun Chen, Zhongqi Qi, Hui Zhang, Dexiang Hong, Jie Shao, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04218
Pdf URL: https://arxiv.org/pdf/2507.04218
Copy Paste: [[2507.04218]] DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design(https://arxiv.org/abs/2507.04218)
Keywords: generative
Abstract: We present DreamPoster, a Text-to-Image generation framework that intelligently synthesizes high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. Specifically, DreamPoster is built upon our T2I model, Seedream3.0 to uniformly process different poster generating types. For dataset construction, we propose a systematic data annotation pipeline that precisely annotates textual content and typographic hierarchy information within poster images, while employing comprehensive methodologies to construct paired datasets comprising source materials (e.g., raw graphics/text) and their corresponding final poster outputs. Additionally, we implement a progressive training strategy that enables the model to hierarchically acquire multi-task generation capabilities while maintaining high-quality generation. Evaluations on our testing benchmarks demonstrate DreamPoster's superiority over existing methods, achieving a high usability rate of 88.55\%, compared to GPT-4o (47.56\%) and SeedEdit3.0 (25.96\%). DreamPoster will be online in Jimeng and other Bytedance Apps.

Title: Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Authors: Yan Scholten, Sophie Xhonneux, Stephan Günnemann, Leo Schwinn
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04219
Pdf URL: https://arxiv.org/pdf/2507.04219
Copy Paste: [[2507.04219]] Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs(https://arxiv.org/abs/2507.04219)
Keywords: generative
Abstract: Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their training objectives. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method - Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from the model. Our core idea is to leverage this collapse for unlearning by triggering collapse partially on the sensitive data. We theoretically analyze that our approach converges to the desired outcome, i.e. the LLM unlearns the information in the forget set. We empirically demonstrate that PMC overcomes two key limitations of existing unlearning approaches that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints. Code available at this https URL.

Title: Context Tuning for In-Context Optimization

Authors: Jack Lu, Ryan Teehan, Zhenbang Yang, Mengye Ren
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04221
Pdf URL: https://arxiv.org/pdf/2507.04221
Copy Paste: [[2507.04221]] Context Tuning for In-Context Optimization(https://arxiv.org/abs/2507.04221)
Keywords: in-context
Abstract: We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for large language models (LLMs), they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model's inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.

Title: Zero-Shot Cyclic Peptide Design with Composable Geometric Conditions

Authors: Dapeng Jiang, Xiangzhe Kong, Jiaqi Han, Mingyu Li, Rui Jiao, Wenbing Huang, Stefano Ermon, Jianzhu Ma, Yang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04225
Pdf URL: https://arxiv.org/pdf/2507.04225
Copy Paste: [[2507.04225]] Zero-Shot Cyclic Peptide Design with Composable Geometric Conditions(https://arxiv.org/abs/2507.04225)
Keywords: diffusion, generative
Abstract: Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge the gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite trained with linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38% to 84% on different cyclization strategies.

Title: Scaling Context Requires Rethinking Attention

Authors: Carles Gelada, Jacob Buckman, Sean Zhang, Txus Bach
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04239
Pdf URL: https://arxiv.org/pdf/2507.04239
Copy Paste: [[2507.04239]] Scaling Context Requires Rethinking Attention(https://arxiv.org/abs/2507.04239)
Keywords: in-context
Abstract: We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention which reduce the cost-per-token of a transformer impair in-context learning, and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention shows that these models dominate both exponential attention and linear attention at long-context training.

Title: Domain Generalizable Portrait Style Transfer

Authors: Xinbo Wang, Wenju Xu, Qing Zhang, Wei-Shi Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04243
Pdf URL: https://arxiv.org/pdf/2507.04243
Copy Paste: [[2507.04243]] Domain Generalizable Portrait Style Transfer(https://arxiv.org/abs/2507.04243)
Keywords: diffusion
Abstract: This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transform to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from AdaIN-Wavelet transform, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model are available at this https URL.

Title: An Explainable Transformer Model for Alzheimer's Disease Detection Using Retinal Imaging

Authors: Saeed Jamshidiha, Alireza Rezaee, Farshid Hajati, Mojtaba Golzan, Raymond Chiong
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04259
Pdf URL: https://arxiv.org/pdf/2507.04259
Copy Paste: [[2507.04259]] An Explainable Transformer Model for Alzheimer's Disease Detection Using Retinal Imaging(https://arxiv.org/abs/2507.04259)
Keywords: generative
Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder that affects millions worldwide. In the absence of effective treatment options, early diagnosis is crucial for initiating management strategies to delay disease onset and slow down its progression. In this study, we propose Retformer, a novel transformer-based architecture for detecting AD using retinal imaging modalities, leveraging the power of transformers and explainable artificial intelligence. The Retformer model is trained on datasets of different modalities of retinal images from patients with AD and age-matched healthy controls, enabling it to learn complex patterns and relationships between image features and disease diagnosis. To provide insights into the decision-making process of our model, we employ the Gradient-weighted Class Activation Mapping algorithm to visualize the feature importance maps, highlighting the regions of the retinal images that contribute most significantly to the classification outcome. These findings are compared to existing clinical studies on detecting AD using retinal biomarkers, allowing us to identify the most important features for AD detection in each imaging modality. The Retformer model outperforms a variety of benchmark algorithms across different performance metrics by margins of up to 11\.

Title: ZERO: Multi-modal Prompt-based Visual Grounding

Authors: Sangbum Choi, Kyeongryeol Go
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04270
Pdf URL: https://arxiv.org/pdf/2507.04270
Copy Paste: [[2507.04270]] ZERO: Multi-modal Prompt-based Visual Grounding(https://arxiv.org/abs/2507.04270)
Keywords: foundation model
Abstract: Recent advances in artificial intelligence have led to the emergence of foundation models, large-scale pre-trained neural networks that serve as versatile starting points for a wide range of downstream tasks. In this work, we present ZERO, a zero-shot multi-prompt object detection model specifically designed for robust, production-ready deployment across diverse industrial domains. ZERO integrates direct image input with multiple user-defined prompts, which can include both textual and visual cues, and processes them through dedicated encoders to generate accurate detection outputs. The model architecture is optimized for scalability, with a total of 1.033 TFLOPS and 622.346 million parameters, and is trained using a domain-specific image database exceeding one billion images. For the CVPR 2025 Foundational Few-Shot Object Detection (FSOD) Challenge, we introduce a domain-specific fine-tuning strategy that emphasizes prompt diversity and conservative pseudo-labeling, enabling effective adaptation to new domains with minimal supervision. Our approach demonstrates practical advantages in flexibility, efficiency, and real-world applicability, achieving strong performance on the RF20VL-fsod benchmark despite limited annotation budgets. The results highlight the potential of prompt-driven, data-centric AI for scalable and adaptive object detection in dynamic industrial environments.

Title: SeqTex: Generate Mesh Textures in Video Sequence

Authors: Ze Yuan (1), Xin Yu (1), Yangtian Sun (1), Yuan-Chen Guo (2), Yan-Pei Cao (2), Ding Liang (2), Xiaojuan Qi (1) ((1) HKU, (2) VAST)
Subjects: cs.CV, cs.AI, cs.GR
Abstract URL: https://arxiv.org/abs/2507.04285
Pdf URL: https://arxiv.org/pdf/2507.04285
Copy Paste: [[2507.04285]] SeqTex: Generate Mesh Textures in Video Sequence(https://arxiv.org/abs/2507.04285)
Keywords: foundation model, generative
Abstract: Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.

Title: MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation

Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Boyu Diao, Fuzhen Zhuang, Michele Magno, Yongjun Xu, Yingli Tian, Tingwen Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04290
Pdf URL: https://arxiv.org/pdf/2507.04290
Copy Paste: [[2507.04290]] MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation(https://arxiv.org/abs/2507.04290)
Keywords: diffusion
Abstract: Diffusion models have demonstrated remarkable performance on vision generation tasks. However, the high computational complexity hinders its wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization. Directly applying these methods will cause severe performance degradation. We identify that the existing quantization framework suffers from the outlier-unfriendly quantizer design, suboptimal initialization, and optimization strategy. We present MPQ-DMv2, an improved \textbf{M}ixed \textbf{P}recision \textbf{Q}uantization framework for extremely low-bit \textbf{D}iffusion \textbf{M}odels. For the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for uniform quantizer. We propose \textit{Flexible Z-Order Residual Mixed Quantization} that utilizes an efficient binary residual branch for flexible quant steps to handle salient error. For the optimization framework, we theoretically analyzed the convergence and optimality of the LoRA module and propose \textit{Object-Oriented Low-Rank Initialization} to use prior quantization error for informative initialization. We then propose \textit{Memory-based Temporal Relation Distillation} to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between quantized and full-precision model. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.

Title: Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

Authors: Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
Subjects: cs.CR, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04365
Pdf URL: https://arxiv.org/pdf/2507.04365
Copy Paste: [[2507.04365]] Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs(https://arxiv.org/abs/2507.04365)
Keywords: in-context
Abstract: As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.

Title: Time2Agri: Temporal Pretext Tasks for Agricultural Monitoring

Authors: Moti Rattan Gupta, Anupam Sobti
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.04366
Pdf URL: https://arxiv.org/pdf/2507.04366
Copy Paste: [[2507.04366]] Time2Agri: Temporal Pretext Tasks for Agricultural Monitoring(https://arxiv.org/abs/2507.04366)
Keywords: foundation model
Abstract: Self Supervised Learning(SSL) has emerged as a prominent paradigm for label-efficient learning, and has been widely utilized by remote sensing foundation models(RSFMs). Recent RSFMs including SatMAE, DoFA, primarily rely on masked autoencoding(MAE), contrastive learning or some combination of them. However, these pretext tasks often overlook the unique temporal characteristics of agricultural landscape, namely nature's cycle. Motivated by this gap, we propose three novel agriculture-specific pretext tasks, namely Time-Difference Prediction(TD), Temporal Frequency Prediction(FP), and Future-Frame Prediction(FF). Comprehensive evaluation on SICKLE dataset shows FF achieves 69.6% IoU on crop mapping and FP reduces yield prediction error to 30.7% MAPE, outperforming all baselines, and TD remains competitive on most tasks. Further, we also scale FF to the national scale of India, achieving 54.2% IoU outperforming all baselines on field boundary delineation on FTW India dataset.

Title: Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

Authors: Tongyan Hua, Lutao Jiang, Ying-Cong Chen, Wufan Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04403
Pdf URL: https://arxiv.org/pdf/2507.04403
Copy Paste: [[2507.04403]] Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion(https://arxiv.org/abs/2507.04403)
Keywords: diffusion, generative
Abstract: Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) A cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance this http URL overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.

Title: A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields

Authors: Aoxiang Fan, Corentin Dumery, Nicolas Talabot, Pascal Fua
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04408
Pdf URL: https://arxiv.org/pdf/2507.04408
Copy Paste: [[2507.04408]] A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields(https://arxiv.org/abs/2507.04408)
Keywords: foundation model
Abstract: Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularizations have proven to be the most effective ones. However, depth estimation models not only require expensive 3D supervision in training, but also suffer from generalization issues. As a result, the depth estimations can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimations to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the projected 2D pixel-locations from per-ray sampled 3D points. By sampling from the view-consistency distributions, an implicit regularization is imposed on the training of NeRF. We also utilize a depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularizations for eliminating the failure modes. Extensive experiments conducted on various scenes from public datasets demonstrate that our proposed method can generate significantly better novel view synthesis results than state-of-the-art NeRF variants as well as different depth regularization methods.

Title: DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04447
Pdf URL: https://arxiv.org/pdf/2507.04447
Copy Paste: [[2507.04447]] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge(https://arxiv.org/abs/2507.04447)
Keywords: diffusion
Abstract: Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.

Title: CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step

Authors: Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, Li Yuan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04451
Pdf URL: https://arxiv.org/pdf/2507.04451
Copy Paste: [[2507.04451]] CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step(https://arxiv.org/abs/2507.04451)
Keywords: diffusion
Abstract: Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.

Title: Dealing with Uncertainty in Contextual Anomaly Detection

Authors: Luca Bindini, Lorenzo Perini, Stefano Nistri, Jesse Davis, Paolo Frasconi
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2507.04490
Pdf URL: https://arxiv.org/pdf/2507.04490
Copy Paste: [[2507.04490]] Dealing with Uncertainty in Contextual Anomaly Detection(https://arxiv.org/abs/2507.04490)
Keywords: anomaly
Abstract: Contextual anomaly detection (CAD) aims to identify anomalies in a target (behavioral) variable conditioned on a set of contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In many anomaly detection tasks, there exist contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In this work, we propose a novel framework for CAD, normalcy score (NS), that explicitly models both the aleatoric and epistemic uncertainties. Built on heteroscedastic Gaussian process regression, our method regards the Z-score as a random variable, providing confidence intervals that reflect the reliability of the anomaly assessment. Through experiments on benchmark datasets and a real-world application in cardiology, we demonstrate that NS outperforms state-of-the-art CAD methods in both detection accuracy and interpretability. Moreover, confidence intervals enable an adaptive, uncertainty-driven decision-making process, which may be very important in domains such as healthcare.

Title: Unveiling the Potential of Diffusion Large Language Model in Controllable Generation

Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2507.04504
Pdf URL: https://arxiv.org/pdf/2507.04504
Copy Paste: [[2507.04504]] Unveiling the Potential of Diffusion Large Language Model in Controllable Generation(https://arxiv.org/abs/2507.04504)
Keywords: diffusion
Abstract: Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose \textbf{S}elf-adaptive \textbf{S}chema \textbf{S}caffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65\% increase in structural adherence, 48\% enhancement in content fidelity, and 17\% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.

Title: MambaVideo for Discrete Video Tokenization with Channel-Split Quantization

Authors: Dawit Mureja Argaw, Xian Liu, Joon Son Chung, Ming-Yu Liu, Fitsum Reda
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04559
Pdf URL: https://arxiv.org/pdf/2507.04559
Copy Paste: [[2507.04559]] MambaVideo for Discrete Video Tokenization with Channel-Split Quantization(https://arxiv.org/abs/2507.04559)
Keywords: generative
Abstract: Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequencebased tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolutionbased and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.

Title: Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts

Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04569
Pdf URL: https://arxiv.org/pdf/2507.04569
Copy Paste: [[2507.04569]] Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts(https://arxiv.org/abs/2507.04569)
Keywords: generative
Abstract: We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts, into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.

Title: S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control

Authors: Xudong Liu, Zikun Chen, Ruowei Jiang, Ziyi Wu, Kejia Yin, Han Zhao, Parham Aarabi, Igor Gilitschenski
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04584
Pdf URL: https://arxiv.org/pdf/2507.04584
Copy Paste: [[2507.04584]] S$^2$Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control(https://arxiv.org/abs/2507.04584)
Keywords: diffusion
Abstract: Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by texts, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions with identity information and high-frequency details lost during the editing process, or irrelevant image regions altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity with semantically disentangled and spatially focused identity token learned. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.

Title: QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation

Authors: Jiahui Yang, Yongjia Ma, Donglin Di, Hao Li, Wei Chen, Yan Xie, Jianxun Cui, Xun Yang, Wangmeng Zuo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04599
Pdf URL: https://arxiv.org/pdf/2507.04599
Copy Paste: [[2507.04599]] QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation(https://arxiv.org/abs/2507.04599)
Keywords: generative
Abstract: Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.

Title: Information-Guided Diffusion Sampling for Dataset Distillation

Authors: Linfeng Ye, Shayan Mohajer Hamidi, Guang Li, Takahiro Ogawa, Miki Haseyama, Konstantinos N. Plataniotis
Subjects: cs.LG, cs.AI, cs.CV, cs.IT
Abstract URL: https://arxiv.org/abs/2507.04619
Pdf URL: https://arxiv.org/pdf/2507.04619
Copy Paste: [[2507.04619]] Information-Guided Diffusion Sampling for Dataset Distillation(https://arxiv.org/abs/2507.04619)
Keywords: diffusion
Abstract: Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: ($i$) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and ($ii$) contextual information $\mathrm{H}(X | Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + \beta \mathrm{H}(X | Y)$ during the DM sampling process, where $\beta$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X | Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.

Title: Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences

Authors: Yusong Zhang, Yuxuan Sun, Lei Guo, Wei Chen, Bo Ai, Deniz Gunduz
Subjects: cs.LG, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2507.04621
Pdf URL: https://arxiv.org/pdf/2507.04621
Copy Paste: [[2507.04621]] Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences(https://arxiv.org/abs/2507.04621)
Keywords: diffusion, foundation model, generative
Abstract: 6G networks promise revolutionary immersive communication experiences including augmented reality (AR), virtual reality (VR), and holographic communications. These applications demand high-dimensional multimodal data transmission and intelligent data processing in real-time, which is extremely challenging over resource-limited wireless communication systems. Moreover, a joint understanding of the environment, context, and user intent is essential to deliver task-relevant content effectively. This article presents a novel multimodal large language model (MLLM) integrated semantic communications framework, termed MLLM-SC, which fully leverages reasoning and generative capabilities of pre-trained foundation models for context-aware and task-oriented wireless communication. The MLLM-SC framework adopts a device-edge collaborative architecture. At the edge, MLLM-empowered semantic guidance module analyzes multimodal inputs, user intents, and channel conditions to generate importance-aware attention maps prioritizing semantically critical information. An importance-aware semantic encoder and a resource-adaptive semantic decoder are jointly designed and optimized, which can utilize the semantic guidance for adaptive bandwidth allocation and high-quality content reconstruction or generation. Extensive case studies on visual question answering for AR/VR applications and diffusion-driven image generation validate the effectiveness of MLLM-SC.

Title: Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts

Authors: Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2507.04631
Pdf URL: https://arxiv.org/pdf/2507.04631
Copy Paste: [[2507.04631]] Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts(https://arxiv.org/abs/2507.04631)
Keywords: foundation model
Abstract: Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at \textcolor{red}{this https URL}.

Title: Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction

Authors: Suiyan Shang, Chi Fai Cheung, Pai Zheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04665
Pdf URL: https://arxiv.org/pdf/2507.04665
Copy Paste: [[2507.04665]] Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction(https://arxiv.org/abs/2507.04665)
Keywords: generative
Abstract: Accurate surface roughness prediction in ultra-precision machining (UPM) is critical for real-time quality control, but small datasets hinder model performance. We propose HAS-CGAN, a Hybrid Adversarial Spectral Loss CGAN, for effective UPM data augmentation. Among five CGAN variants tested, HAS-CGAN excels in 1D force signal generation, particularly for high-frequency signals, achieving >0.85 wavelet coherence through Fourier-domain optimization. By combining generated signals with machining parameters, prediction accuracy significantly improves. Experiments with traditional ML (SVR, RF, LSTM) and deep learning models (BPNN, 1DCNN, CNN-Transformer) demonstrate that augmenting training data with 520+ synthetic samples reduces prediction error from 31.4% (original 52 samples) to ~9%, effectively addressing data scarcity in UPM roughness prediction."

Title: ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing

Authors: Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Zhuo Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04678
Pdf URL: https://arxiv.org/pdf/2507.04678
Copy Paste: [[2507.04678]] ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing(https://arxiv.org/abs/2507.04678)
Keywords: diffusion, generative
Abstract: Recent advancements in generative methods, especially diffusion models, have made great progress in remote sensing image synthesis. Despite these advancements, existing methods have not explored the simulation of future scenarios based on given scenario images. This simulation capability has wide applications for urban planning, land managementChangeBridge: Spatiotemporal Image Generation with Multimodal Controls, and beyond. In this work, we propose ChangeBridge, a conditional spatiotemporal diffusion model. Given pre-event images and conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), ChangeBridge can synthesize post-event images. The core idea behind ChangeBridge is to modeling the noise-to-image diffusion model, as a pre-to-post diffusion bridge. Conditioned on multimodal controls, ChangeBridge leverages a stochastic Brownian-bridge diffusion, directly modeling the spatiotemporal evolution between pre-event and post-event states. To the best of our knowledge, ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing. Experimental results demonstrate that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Code will be available.

Title: TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation

Authors: Changsong Lei, Yaqian Liang, Shaofeng Wang, Jiajia Dai, Yong-Jin Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04685
Pdf URL: https://arxiv.org/pdf/2507.04685
Copy Paste: [[2507.04685]] TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation(https://arxiv.org/abs/2507.04685)
Keywords: diffusion
Abstract: Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset are available at this https URL.

Title: Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal

Authors: Wanchang Yu, Qing Zhang, Rongjia Zheng, Wei-Shi Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04692
Pdf URL: https://arxiv.org/pdf/2507.04692
Copy Paste: [[2507.04692]] Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal(https://arxiv.org/abs/2507.04692)
Keywords: diffusion, generative
Abstract: We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows to generate a shadow-independent structure map including facial details while excluding the unwanted shadow boundaries. The structure map is then used as condition to train a structure-guided inpainting diffusion model for removing shadows in a generative manner. Finally, to restore the fine-scale details (e.g., eyelashes, moles and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on the benchmark datasets show that our method clearly outperforms existing methods, and is effective to avoid previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code is available at this https URL.

Title: Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation

Authors: Daichi Mukunoki, Shun-ichiro Hayashi, Tetsuya Hoshino, Takahiro Katagiri
Subjects: cs.LG, cs.DC, cs.MS
Abstract URL: https://arxiv.org/abs/2507.04697
Pdf URL: https://arxiv.org/pdf/2507.04697
Copy Paste: [[2507.04697]] Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation(https://arxiv.org/abs/2507.04697)
Keywords: generative
Abstract: Generative AI technology based on Large Language Models (LLM) has been developed and applied to assist or automatically generate program codes. In this paper, we evaluate the capability of existing general LLMs for Basic Linear Algebra Subprograms (BLAS) code generation for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series of Reasoning models. Both have been released in April 2025. For the routines from level-1 to 3 BLAS, we tried to generate (1) C code without optimization from routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from routine name only, and (3) C code with basic performance optimizations based on Fortran reference code. As a result, we found that correct code can be generated in many cases even when only routine name are given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the code is faster than the reference code.

Title: A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets

Authors: Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, Jie Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04699
Pdf URL: https://arxiv.org/pdf/2507.04699
Copy Paste: [[2507.04699]] A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets(https://arxiv.org/abs/2507.04699)
Keywords: diffusion
Abstract: Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as "puzzle pieces" coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.

Title: Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication

Authors: Samuel Pfrommer, George Ma, Yixiao Huang, Somayeh Sojoudi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04709
Pdf URL: https://arxiv.org/pdf/2507.04709
Copy Paste: [[2507.04709]] Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication(https://arxiv.org/abs/2507.04709)
Keywords: diffusion
Abstract: This work shows that normalization layers can facilitate a surprising degree of communication across the spatial dimensions of an input tensor. We study a toy localization task with a convolutional architecture and show that normalization layers enable an iterative message passing procedure, allowing information aggregation from well outside the local receptive field. Our results suggest that normalization layers should be employed with caution in applications such as diffusion-based trajectory generation, where maintaining a spatially limited receptive field is crucial.

Title: Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model

Authors: Anbang Wang, Marawan Elbatel, Keyuan Liu, Lizhuo Lin, Meng Lan, Yanqi Yang, Xiaomeng Li
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2507.04710
Pdf URL: https://arxiv.org/pdf/2507.04710
Copy Paste: [[2507.04710]] Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model(https://arxiv.org/abs/2507.04710)
Keywords: foundation model
Abstract: Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model's capacity to capture critical geometric relationships among anatomical structures. Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, outperforming the leading approach by an 8.18% higher success detection rate at a strict 0.5 mm threshold-a standard widely recognized in dental diagnostics. Code is available at: this https URL.

Title: Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet

Authors: Raz Lapid, Almog Dubin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04726
Pdf URL: https://arxiv.org/pdf/2507.04726
Copy Paste: [[2507.04726]] Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet(https://arxiv.org/abs/2507.04726)
Keywords: diffusion
Abstract: Text-to-image diffusion models have achieved remarkable success in translating textual prompts into high-fidelity images. ControlNets further extend these models by allowing precise, image-based conditioning (e.g., edge maps, depth, pose), enabling fine-grained control over structure and style. However, their dependence on large, publicly scraped datasets -- and the increasing use of community-shared data for fine-tuning -- exposes them to stealthy data poisoning attacks. In this work, we introduce a novel data poisoning method that manipulates ControlNets to generate images containing specific content without any text triggers. By injecting poisoned samples -- each pairing a subtly triggered input with an NSFW target -- the model retains clean-prompt fidelity yet reliably produces NSFW outputs when the trigger is present. On large-scale, high-quality datasets, our backdoor achieves high attack success rate while remaining imperceptible in raw inputs. These results reveal a critical vulnerability in open-source ControlNets pipelines and underscore the need for robust data sanitization and defense mechanisms.

Title: Word stress in self-supervised speech models: A cross-linguistic comparison

Authors: Martijn Bentum, Louis ten Bosch, Tomas O. Lentz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04738
Pdf URL: https://arxiv.org/pdf/2507.04738
Copy Paste: [[2507.04738]] Word stress in self-supervised speech models: A cross-linguistic comparison(https://arxiv.org/abs/2507.04738)
Keywords: self-supervised
Abstract: In this paper we study word stress representations learned by self-supervised speech models (S3M), specifically the Wav2vec 2.0 model. We investigate the S3M representations of word stress for five different languages: Three languages with variable or lexical stress (Dutch, English and German) and two languages with fixed or demarcative stress (Hungarian and Polish). We train diagnostic stress classifiers on S3M embeddings and show that they can distinguish between stressed and unstressed syllables in read-aloud short sentences with high accuracy. We also tested language-specificity effects of S3M word stress. The results indicate that the word stress representations are language-specific, with a greater difference between the set of variable versus the set of fixed stressed languages.

Title: GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation

Authors: Weilin Lai, Tie Xu, Hu Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04765
Pdf URL: https://arxiv.org/pdf/2507.04765
Copy Paste: [[2507.04765]] GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation(https://arxiv.org/abs/2507.04765)
Keywords: diffusion
Abstract: Direct B-Rep generation is increasingly important in CAD workflows, eliminating costly modeling sequence data and supporting complex features. A key challenge is modeling joint distribution of the misaligned geometry and topology. Existing methods tend to implicitly embed topology into the geometric features of edges. Although this integration ensures feature alignment, it also causes edge geometry to carry more redundant structural information compared to the original B-Rep, leading to significantly higher computational cost. To reduce redundancy, we propose GraphBrep, a B-Rep generation model that explicitly represents and learns compact topology. Following the original structure of B-Rep, we construct an undirected weighted graph to represent surface topology. A graph diffusion model is employed to learn topology conditioned on surface features, serving as the basis for determining connectivity between primitive surfaces. The explicit representation ensures a compact data structure, effectively reducing computational cost during both training and inference. Experiments on two large-scale unconditional datasets and one category-conditional dataset demonstrate the proposed method significantly reduces training and inference times (up to 31.3% and 56.3% for given datasets, respectively) while maintaining high-quality CAD generation compared with SOTA.

Title: From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach

Authors: Mihai Masala, Marius Leordeanu
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04815
Pdf URL: https://arxiv.org/pdf/2507.04815
Copy Paste: [[2507.04815]] From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach(https://arxiv.org/abs/2507.04815)
Keywords: self-supervised
Abstract: The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using both standard evaluation metrics, human annotations and consensus from ensembles of state-of-the-art VLMs.

Title: Discrete Diffusion Trajectory Alignment via Stepwise Decomposition

Authors: Jiaqi Han, Austin Wang, Minkai Xu, Wenda Chu, Meihua Dang, Yisong Yue, Stefano Ermon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04832
Pdf URL: https://arxiv.org/pdf/2507.04832
Copy Paste: [[2507.04832]] Discrete Diffusion Trajectory Alignment via Stepwise Decomposition(https://arxiv.org/abs/2507.04832)
Keywords: diffusion
Abstract: Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose a novel preference optimization method for masked discrete diffusion models through a principled diffusion trajectory alignment. Instead of applying the reward on the final output and backpropagating the gradient to the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, guarantees an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12\% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 80.7 on LLaDA-8B-Instruct for language modeling.

Title: Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling

Authors: Chinmay Prabhakar, Suprosanna Shit, Tamaz Amiranashvili, Hongwei Bran Li, Bjoern Menze
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04856
Pdf URL: https://arxiv.org/pdf/2507.04856
Copy Paste: [[2507.04856]] Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling(https://arxiv.org/abs/2507.04856)
Keywords: diffusion, generative
Abstract: 3D spatial graphs play a crucial role in biological and clinical research by modeling anatomical networks such as blood vessels,neurons, and airways. However, generating 3D biological graphs while maintaining anatomical validity remains challenging, a key limitation of existing diffusion-based methods. In this work, we propose a novel 3D biological graph generation method that adheres to structural and semantic plausibility conditions. We achieve this by using a novel projection operator during sampling that stochastically fixes inconsistencies. Further, we adopt a superior edge-deletion-based noising procedure suitable for sparse biological graphs. Our method demonstrates superior performance on two real-world datasets, human circle of Willis and lung airways, compared to previous approaches. Importantly, we demonstrate that the generated samples significantly enhance downstream graph labeling performance. Furthermore, we show that our generative model is a reasonable out-of-the-box link predictior.

Title: Fine-tuning on simulated data outperforms prompting for agent tone of voice

Authors: Ingo Marquardt, Philippe Brule
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.04889
Pdf URL: https://arxiv.org/pdf/2507.04889
Copy Paste: [[2507.04889]] Fine-tuning on simulated data outperforms prompting for agent tone of voice(https://arxiv.org/abs/2507.04889)
Keywords: in-context
Abstract: Deploying language models (LMs) in customer-facing speech applications requires conversational fluency and adherence to specific stylistic guidelines. This can be challenging to achieve reliably using complex system prompts due to issues like instruction following limitations and in-context bias. This study investigates the effectiveness of fine-tuning versus system prompting for aligning LMs with a specific behavioral target: responding in a natural, conversational tone suitable for voice interactions. We fine-tuned a small, open-weights model (`Llama3.2-1B-Instruct`) using Low-Rank Adaptation (LoRA) on a synthetically generated dataset derived from Wikipedia. Additionally, we fine-tuned two closed-source models (`gpt-4o-mini`, `gpt-4.1-mini`). Our results demonstrate that fine-tuning outperformed system prompting, achieving a high percentage of conversational responses, even when trained on only 100 data samples. Semantic similarity analysis confirmed that fine-tuning did not degrade content quality. Interestingly, fine-tuning with 8-bit integer quantization converged faster towards the target style than using bfloat16 precision, potentially due to implicit regularization effects. We conclude that fine-tuning small, open-weights LMs on simulated data is a highly effective and data-efficient method for instilling specific stylistic behaviors, offering a preferable alternative to complex system prompting for practical applications requiring nuanced response styles.

Title: Leveraging Self-Supervised Features for Efficient Flooded Region Identification in UAV Aerial Images

Authors: Dibyabha Deb, Ujjwal Verma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04915
Pdf URL: https://arxiv.org/pdf/2507.04915
Copy Paste: [[2507.04915]] Leveraging Self-Supervised Features for Efficient Flooded Region Identification in UAV Aerial Images(https://arxiv.org/abs/2507.04915)
Keywords: self-supervised
Abstract: Identifying regions affected by disasters is a vital step in effectively managing and planning relief and rescue efforts. Unlike the traditional approaches of manually assessing post-disaster damage, analyzing images of Unmanned Aerial Vehicles (UAVs) offers an objective and reliable way to assess the damage. In the past, segmentation techniques have been adopted to identify post-flood damage in UAV aerial images. However, most of these supervised learning approaches rely on manually annotated datasets. Indeed, annotating images is a time-consuming and error-prone task that requires domain expertise. This work focuses on leveraging self-supervised features to accurately identify flooded regions in UAV aerial images. This work proposes two encoder-decoder-based segmentation approaches, which integrate the visual features learned from DINOv2 with the traditional encoder backbone. This study investigates the generalization of self-supervised features for UAV aerial images. Specifically, we evaluate the effectiveness of features from the DINOv2 model, trained on non-aerial images, for segmenting aerial images, noting the distinct perspectives between the two image types. Our results demonstrate that DINOv2's self-supervised pretraining on natural images generates transferable, general-purpose visual features that streamline the development of aerial segmentation workflows. By leveraging these features as a foundation, we significantly reduce reliance on labor-intensive manual annotation processes, enabling high-accuracy segmentation with limited labeled aerial data.

Title: Object-centric Denoising Diffusion Models for Physical Reasoning

Authors: Moritz Lange, Raphael C. Engelhardt, Wolfgang Konen, Andrew Melnik, Laurenz Wiskott
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04920
Pdf URL: https://arxiv.org/pdf/2507.04920
Copy Paste: [[2507.04920]] Object-centric Denoising Diffusion Models for Physical Reasoning(https://arxiv.org/abs/2507.04920)
Keywords: diffusion
Abstract: Reasoning about the trajectories of multiple, interacting objects is integral to physical reasoning tasks in machine learning. This involves conditions imposed on the objects at different time steps, for instance initial states or desired goal states. Existing approaches in physical reasoning generally rely on autoregressive modeling, which can only be conditioned on initial states, but not on later states. In fields such as planning for reinforcement learning, similar challenges are being addressed with denoising diffusion models. In this work, we propose an object-centric denoising diffusion model architecture for physical reasoning that is translation equivariant over time, permutation equivariant over objects, and can be conditioned on arbitrary time steps for arbitrary objects. We demonstrate how this model can solve tasks with multiple conditions and examine its performance when changing object numbers and trajectory lengths during inference.

Title: RainShift: A Benchmark for Precipitation Downscaling Across Geographies

Authors: Paula Harder, Luca Schmidt, Francis Pelletier, Nicole Ludwig, Matthew Chantry, Christian Lessig, Alex Hernandez-Garcia, David Rolnick
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04930
Pdf URL: https://arxiv.org/pdf/2507.04930
Copy Paste: [[2507.04930]] RainShift: A Benchmark for Precipitation Downscaling Across Geographies(https://arxiv.org/abs/2507.04930)
Keywords: diffusion
Abstract: Earth System Models (ESM) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk-assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area-demanding high-resolution observational data, which is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches including GANs and diffusion models in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.

Title: Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation

Authors: Jianjiang Yang, Ziyan Huang
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2507.04946
Pdf URL: https://arxiv.org/pdf/2507.04946
Copy Paste: [[2507.04946]] Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation(https://arxiv.org/abs/2507.04946)
Keywords: diffusion, generative
Abstract: Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent "hallucinations", where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three key critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the \textbf{Hallucination Tri-Space} and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.

Title: DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer

Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.04947
Pdf URL: https://arxiv.org/pdf/2507.04947
Copy Paste: [[2507.04947]] DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer(https://arxiv.org/abs/2507.04947)
Keywords: diffusion
Abstract: We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers' limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.

Title: ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Authors: Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2507.04952
Pdf URL: https://arxiv.org/pdf/2507.04952
Copy Paste: [[2507.04952]] ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation(https://arxiv.org/abs/2507.04952)
Keywords: generative
Abstract: The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at this https URL, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.

Title: InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior

Authors: Minghao Wen, Shengjie Wu, Kangkan Wang, Dong Liang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04961
Pdf URL: https://arxiv.org/pdf/2507.04961
Copy Paste: [[2507.04961]] InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior(https://arxiv.org/abs/2507.04961)
Keywords: diffusion
Abstract: 3D Gaussian Splatting based 3D editing has demonstrated impressive performance in recent years. However, the multi-view editing often exhibits significant local inconsistency, especially in areas of non-rigid deformation, which lead to local artifacts, texture blurring, or semantic variations in edited 3D scenes. We also found that the existing editing methods, which rely entirely on text prompts make the editing process a "one-shot deal", making it difficult for users to control the editing degree flexibly. In response to these challenges, we present InterGSEdit, a novel framework for high-quality 3DGS editing via interactively selecting key views with users' preferences. We propose a CLIP-based Semantic Consistency Selection (CSCS) strategy to adaptively screen a group of semantically consistent reference views for each user-selected key view. Then, the cross-attention maps derived from the reference views are used in a weighted Gaussian Splatting unprojection to construct the 3D Geometry-Consistent Attention Prior ($GAP^{3D}$). We project $GAP^{3D}$ to obtain 3D-constrained attention, which are fused with 2D cross-attention via Attention Fusion Network (AFN). AFN employs an adaptive attention strategy that prioritizes 3D-constrained attention for geometric consistency during early inference, and gradually prioritizes 2D cross-attention maps in diffusion for fine-grained features during the later inference. Extensive experiments demonstrate that InterGSEdit achieves state-of-the-art performance, delivering consistent, high-fidelity 3DGS editing with improved user experience.

Title: Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading

Authors: Qinkai Yu, Wei Zhou, Hantao Liu, Yanyu Xu, Meng Wang, Yitian Zhao, Huazhu Fu, Xujiong Ye, Yalin Zheng, Yanda Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04978
Pdf URL: https://arxiv.org/pdf/2507.04978
Copy Paste: [[2507.04978]] Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading(https://arxiv.org/abs/2507.04978)
Keywords: diffusion, foundation model
Abstract: As a long-term complication of diabetes, diabetic retinopathy (DR) progresses slowly, potentially taking years to threaten vision. An accurate and robust evaluation of its severity is vital to ensure prompt management and care. Ordinal regression leverages the underlying inherent order between categories to achieve superior performance beyond traditional classification. However, there exist challenges leading to lower DR classification performance: 1) The uneven distribution of DR severity levels, characterized by a long-tailed pattern, adds complexity to the grading process. 2)The ambiguity in defining category boundaries introduces additional challenges, making the classification process more complex and prone to inconsistencies. This work proposes a novel autoregressive ordinal regression method called AOR-DR to address the above challenges by leveraging the clinical knowledge of inherent ordinal information in DR grading dataset settings. Specifically, we decompose the DR grading task into a series of ordered steps by fusing the prediction of the previous steps with extracted image features as conditions for the current prediction step. Additionally, we exploit the diffusion process to facilitate conditional probability modeling, enabling the direct use of continuous global image features for autoregression without relearning contextual information from patch-level features. This ensures the effectiveness of the autoregressive process and leverages the capabilities of pre-trained large-scale foundation models. Extensive experiments were conducted on four large-scale publicly available color fundus datasets, demonstrating our model's effectiveness and superior performance over six recent state-of-the-art ordinal regression methods. The implementation code is available at this https URL.

Title: TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation

Authors: Zonglin Lyu, Chen Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.04984
Pdf URL: https://arxiv.org/pdf/2507.04984
Copy Paste: [[2507.04984]] TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation(https://arxiv.org/abs/2507.04984)
Keywords: diffusion
Abstract: Video Frame Interpolation (VFI) aims to predict the intermediate frame $I_n$ (we use n to denote time in videos to avoid notation overload with the timestep $t$ in diffusion models) based on two consecutive neighboring frames $I_0$ and $I_1$. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves 20% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3times fewer parameters. Such a parameter reduction results in 2.3x speed up. By incorporating optical flow guidance, our method requires 9000x less training data and achieves over 20x fewer parameters than video-based diffusion models. Codes and results are available at our project page: this https URL.

Title: Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition

Authors: Britty Baby, Vinkle Srivastav, Pooja P. Jain, Kun Yuan, Pietro Mascagni, Nicolas Padoy
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05007
Pdf URL: https://arxiv.org/pdf/2507.05007
Copy Paste: [[2507.05007]] Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition(https://arxiv.org/abs/2507.05007)
Keywords: foundation model
Abstract: The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models learning with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods, that helps in analysing the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: this https URL

Title: Verified Language Processing with Hybrid Explainability: A Technical Report

Authors: Oliver Robert Fox, Giacomo Bergami, Graham Morgan
Subjects: cs.CL, cs.SC
Abstract URL: https://arxiv.org/abs/2507.05017
Pdf URL: https://arxiv.org/pdf/2507.05017
Copy Paste: [[2507.05017]] Verified Language Processing with Hybrid Explainability: A Technical Report(https://arxiv.org/abs/2507.05017)
Keywords: generative
Abstract: The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.

Title: Meta-Learning Transformers to Improve In-Context Generalization

Authors: Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05019
Pdf URL: https://arxiv.org/pdf/2507.05019
Copy Paste: [[2507.05019]] Meta-Learning Transformers to Improve In-Context Generalization(https://arxiv.org/abs/2507.05019)
Keywords: in-context
Abstract: In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.

Title: AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics

Authors: Jan Carreras Boada, Rao Muhammad Umer, Carsten Marr
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05063
Pdf URL: https://arxiv.org/pdf/2507.05063
Copy Paste: [[2507.05063]] AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics(https://arxiv.org/abs/2507.05063)
Keywords: diffusion
Abstract: Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images generated with a fine-tuned stable diffusion model using LoRA weights when guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance for limited data. When training a ResNet classifier, accuracy increased from 27.3\% to 78.4\% (+51.1\%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8\% to 76.8\% (+15.0\%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.

Title: ICAS: Detecting Training Data from Autoregressive Image Generative Models

Authors: Hongyao Yu, Yixiang Qiu, Yiheng Yang, Hao Fang, Tianqu Zhuang, Jiaxin Hong, Bin Chen, Hao Wu, Shu-Tao Xia
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2507.05068
Pdf URL: https://arxiv.org/pdf/2507.05068
Copy Paste: [[2507.05068]] ICAS: Detecting Training Data from Autoregressive Image Generative Models(https://arxiv.org/abs/2507.05068)
Keywords: foundation model, generative
Abstract: Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, sufficient experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than other autoregressive this http URL code is available at this https URL.

Title: Exploring Semantic Clustering and Similarity Search for Heterogeneous Traffic Scenario Graph

Authors: Ferdinand Mütsch, Maximilian Zipfl, Nikolai Polley, J. Marius Zöllner
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.05086
Pdf URL: https://arxiv.org/pdf/2507.05086
Copy Paste: [[2507.05086]] Exploring Semantic Clustering and Similarity Search for Heterogeneous Traffic Scenario Graph(https://arxiv.org/abs/2507.05086)
Keywords: self-supervised
Abstract: Scenario-based testing is an indispensable instrument for the comprehensive validation and verification of automated vehicles (AVs). However, finding a manageable and finite, yet representative subset of scenarios in a scalable, possibly unsupervised manner is notoriously challenging. Our work is meant to constitute a cornerstone to facilitate sample-efficient testing, while still capturing the diversity of relevant operational design domains (ODDs) and accounting for the "long tail" phenomenon in particular. To this end, we first propose an expressive and flexible heterogeneous, spatio-temporal graph model for representing traffic scenarios. Leveraging recent advances of graph neural networks (GNNs), we then propose a self-supervised method to learn a universal embedding space for scenario graphs that enables clustering and similarity search. In particular, we implement contrastive learning alongside a bootstrapping-based approach and evaluate their suitability for partitioning the scenario space. Experiments on the nuPlan dataset confirm the model's ability to capture semantics and thus group related scenarios in a meaningful way despite the absence of discrete class labels. Different scenario types materialize as distinct clusters. Our results demonstrate how variable-length traffic scenarios can be condensed into single vector representations that enable nearest-neighbor retrieval of representative candidates for distinct scenario categories. Notably, this is achieved without manual labeling or bias towards an explicit objective such as criticality. Ultimately, our approach can serve as a basis for scalable selection of scenarios to further enhance the efficiency and robustness of testing AVs in simulation.

Title: MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation

Authors: Yucheng Wang, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05092
Pdf URL: https://arxiv.org/pdf/2507.05092
Copy Paste: [[2507.05092]] MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation(https://arxiv.org/abs/2507.05092)
Keywords: diffusion
Abstract: Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering. (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improved lip synchronization using Wav2Lip results, thereby preserving identity consistency. (iii) A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.

Title: DICE: Discrete inverse continuity equation for learning population dynamics

Authors: Tobias Blickhan, Jules Berman, Andrew Stuart, Benjamin Peherstorfer
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.05107
Pdf URL: https://arxiv.org/pdf/2507.05107
Copy Paste: [[2507.05107]] DICE: Discrete inverse continuity equation for learning population dynamics(https://arxiv.org/abs/2507.05107)
Keywords: generative
Abstract: We introduce the Discrete Inverse Continuity Equation (DICE) method, a generative modeling approach that learns the evolution of a stochastic process from given sample populations at a finite number of time points. Models learned with DICE capture the typically smooth and well-behaved population dynamics, rather than the dynamics of individual sample trajectories that can exhibit complex or even chaotic behavior. The DICE loss function is developed specifically to be invariant, even in discrete time, to spatially constant but time-varying spurious constants that can emerge during training; this invariance increases training stability and robustness. Generating a trajectory of sample populations with DICE is fast because samples evolve directly in the time interval over which the stochastic process is formulated, in contrast to approaches that condition on time and then require multiple sampling steps per time step. DICE is stable to train, in situations where other methods for learning population dynamics fail, and DICE generates representative samples with orders of magnitude lower costs than methods that have to condition on time. Numerical experiments on a wide range of problems from random waves, Vlasov-Poisson instabilities and high-dimensional chaos are included to justify these assertions.

Title: VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting

Authors: Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05116
Pdf URL: https://arxiv.org/pdf/2507.05116
Copy Paste: [[2507.05116]] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting(https://arxiv.org/abs/2507.05116)
Keywords: diffusion
Abstract: Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computation overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In details, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35$\times$ faster inference and 145 Hz throughput. All the details and codes will be open-sourced.

Title: An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques

Authors: Walid Mohamed Aly, Taysir Hassan A. Soliman, Amr Mohamed AbdelAziz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05123
Pdf URL: https://arxiv.org/pdf/2507.05123
Copy Paste: [[2507.05123]] An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques(https://arxiv.org/abs/2507.05123)
Keywords: in-context
Abstract: Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For Long documents, introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.

Title: Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization

Authors: Jaewook Lee, Alexander Scarlatos, Andrew Lan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05137
Pdf URL: https://arxiv.org/pdf/2507.05137
Copy Paste: [[2507.05137]] Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization(https://arxiv.org/abs/2507.05137)
Keywords: generative
Abstract: Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.

Title: VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems

Authors: Aadi Srivastava, Vignesh Natarajkumar, Utkarsh Bheemanaboyna, Devisree Akashapu, Nagraj Gaonkar, Archit Joshi
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2507.05146
Pdf URL: https://arxiv.org/pdf/2507.05146
Copy Paste: [[2507.05146]] VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems(https://arxiv.org/abs/2507.05146)
Keywords: diffusion, generative
Abstract: The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at this https URL .

Title: AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models

Authors: Chinnappa Guggilla, Budhaditya Roy, Trupti Ramdas Chavan, Abdul Rahman, Edward Bowen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05157
Pdf URL: https://arxiv.org/pdf/2507.05157
Copy Paste: [[2507.05157]] AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models(https://arxiv.org/abs/2507.05157)
Keywords: generative
Abstract: Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content including style and topic and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM model responsible for the generation. Both of these tasks are based on fine tuning of Generative Pre-trained Transformer (GPT_4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned version of GPT_4o-mini and the BERT model has achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.

Title: 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

Authors: Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05163
Pdf URL: https://arxiv.org/pdf/2507.05163
Copy Paste: [[2507.05163]] 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture(https://arxiv.org/abs/2507.05163)
Keywords: diffusion, generative
Abstract: Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system only using low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.

Title: Critiques of World Models

Authors: Eric Xing, Mingkai Deng, Jinyu Hou, Zhiting Hu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05169
Pdf URL: https://arxiv.org/pdf/2507.05169
Copy Paste: [[2507.05169]] Critiques of World Models(https://arxiv.org/abs/2507.05169)
Keywords: generative
Abstract: World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience with and act upon, has been an emerging topic in recent years because of the rising needs to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we offer critiques of several schools of thoughts on world modeling, and argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

Title: CTA: Cross-Task Alignment for Better Test Time Training

Authors: Samuel Barbeau, Pedram Fekri, David Osowiechi, Ali Bahri, Moslem YazdanpanahMasih Aminbeidokhti, Christian Desrosiers
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.05221
Pdf URL: https://arxiv.org/pdf/2507.05221
Copy Paste: [[2507.05221]] CTA: Cross-Task Alignment for Better Test Time Training(https://arxiv.org/abs/2507.05221)
Keywords: self-supervised
Abstract: Deep learning models have demonstrated exceptional performance across a wide range of computer vision tasks. However, their performance often degrades significantly when faced with distribution shifts, such as domain or dataset changes. Test-Time Training (TTT) has emerged as an effective method to enhance model robustness by incorporating an auxiliary unsupervised task during training and leveraging it for model updates at test time. In this work, we introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. Unlike existing TTT methods, CTA does not require a specialized model architecture and instead takes inspiration from the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one. This process enforces alignment between the learned representations of both models, thereby mitigating the risk of gradient interference, preserving the intrinsic robustness of self-supervised learning and enabling more semantically meaningful updates at test-time. Experimental results demonstrate substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.

Title: Self-Supervised Real-Time Tracking of Military Vehicles in Low-FPS UAV Footage

Authors: Markiyan Kostiv, Anatolii Adamovskyi, Yevhen Cherniavskyi, Mykyta Varenyk, Ostap Viniavskyi, Igor Krashenyi, Oles Dobosevych
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05229
Pdf URL: https://arxiv.org/pdf/2507.05229
Copy Paste: [[2507.05229]] Self-Supervised Real-Time Tracking of Military Vehicles in Low-FPS UAV Footage(https://arxiv.org/abs/2507.05229)
Keywords: self-supervised
Abstract: Multi-object tracking (MOT) aims to maintain consistent identities of objects across video frames. Associating objects in low-frame-rate videos captured by moving unmanned aerial vehicles (UAVs) in actual combat scenarios is complex due to rapid changes in object appearance and position within the frame. The task becomes even more challenging due to image degradation caused by cloud video streaming and compression algorithms. We present how instance association learning from single-frame annotations can overcome these challenges. We show that global features of the scene provide crucial context for low-FPS instance association, allowing our solution to be robust to distractors and gaps in detections. We also demonstrate that such a tracking approach maintains high association quality even when reducing the input image resolution and latent representation size for faster inference. Finally, we present a benchmark dataset of annotated military vehicles collected from publicly available data sources. This paper was initially presented at the NATO Science and Technology Organization Symposium (ICMCIS) organized by the Information Systems Technology (IST)Scientific and Technical Committee, IST-209-RSY - the ICMCIS, held in Oeiras, Portugal, 13-14 May 2025.

Title: Physics-Guided Dual Implicit Neural Representations for Source Separation

Authors: Yuan Ni, Zhantao Chen, Alexander N. Petsch, Edmund Xu, Cheng Peng, Alexander I. Kolesnikov, Sugata Chowdhury, Arun Bansil, Jana B. Thayer, Joshua J. Turner
Subjects: cs.CV, cond-mat.str-el, cs.LG, physics.data-an
Abstract URL: https://arxiv.org/abs/2507.05249
Pdf URL: https://arxiv.org/pdf/2507.05249
Copy Paste: [[2507.05249]] Physics-Guided Dual Implicit Neural Representations for Source Separation(https://arxiv.org/abs/2507.05249)
Keywords: self-supervised
Abstract: Significant challenges exist in efficient data analysis of most advanced experimental and observational techniques because the collected signals often include unwanted contributions--such as background and signal distortions--that can obscure the physically relevant information of interest. To address this, we have developed a self-supervised machine-learning approach for source separation using a dual implicit neural representation framework that jointly trains two neural networks: one for approximating distortions of the physical signal of interest and the other for learning the effective background contribution. Our method learns directly from the raw data by minimizing a reconstruction-based loss function without requiring labeled data or pre-defined dictionaries. We demonstrate the effectiveness of our framework by considering a challenging case study involving large-scale simulated as well as experimental momentum-energy-dependent inelastic neutron scattering data in a four-dimensional parameter space, characterized by heterogeneous background contributions and unknown distortions to the target signal. The method is found to successfully separate physically meaningful signals from a complex or structured background even when the signal characteristics vary across all four dimensions of the parameter space. An analytical approach that informs the choice of the regularization parameter is presented. Our method offers a versatile framework for addressing source separation problems across diverse domains, ranging from superimposed signals in astronomical measurements to structural features in biomedical image reconstructions.

Title: From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving

Authors: Fabian Konstantinidis, Ariel Dallari Guerreiro, Raphael Trumpp, Moritz Sackmann, Ulrich Hofmann, Marco Caccamo, Christoph Stiller
Subjects: cs.CV, cs.AI, cs.LG, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2507.05254
Pdf URL: https://arxiv.org/pdf/2507.05254
Copy Paste: [[2507.05254]] From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving(https://arxiv.org/abs/2507.05254)
Keywords: generative
Abstract: Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at this https URL.

Title: Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing

Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.05259
Pdf URL: https://arxiv.org/pdf/2507.05259
Copy Paste: [[2507.05259]] Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing(https://arxiv.org/abs/2507.05259)
Keywords: diffusion
Abstract: Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.