2025-08-26

Title: CrystalDiT: A Diffusion Transformer for Crystal Generation

Authors: Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2508.16614
Pdf URL: https://arxiv.org/pdf/2508.16614
Copy Paste: [[2508.16614]] CrystalDiT: A Diffusion Transformer for Crystal Generation(https://arxiv.org/abs/2508.16614)
Keywords: diffusion
Abstract: We present CrystalDiT, a diffusion transformer for crystal structure generation that achieves state-of-the-art performance by challenging the trend of architectural complexity. Instead of intricate, multi-stream designs, CrystalDiT employs a unified transformer that imposes a powerful inductive bias: treating lattice and atomic properties as a single, interdependent system. Combined with a periodic table-based atomic representation and a balanced training strategy, our approach achieves 9.62% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming recent methods including FlowMM (4.38%) and MatterGen (3.42%). Notably, CrystalDiT generates 63.28% unique and novel structures while maintaining comparable stability rates, demonstrating that architectural simplicity can be more effective than complexity for materials discovery. Our results suggest that in data-limited scientific domains, carefully designed simple architectures outperform sophisticated alternatives that are prone to overfitting.

Title: A Novel Unified Extended Matrix for Graph Signal Processing: Theory and Application

Authors: Yunyan Zheng, Zhichao Zhang, Wei Yao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.16633
Pdf URL: https://arxiv.org/pdf/2508.16633
Copy Paste: [[2508.16633]] A Novel Unified Extended Matrix for Graph Signal Processing: Theory and Application(https://arxiv.org/abs/2508.16633)
Keywords: anomaly
Abstract: Graph signal processing has become an essential tool for analyzing data structured on irregular domains. While conventional graph shift operators (GSOs) are effective for certain tasks, they inherently lack flexibility in modeling dependencies between non-adjacent nodes, limiting their ability to represent complex graph structures. To address this limitation, this paper proposes the unified extended matrix (UEM) framework, which integrates the extended-adjacency matrix and the unified graph representation matrix through parametric design, so as to be able to flexibly adapt to different graph structures and reveal more graph signal information. Theoretical analysis of the UEM is conducted, demonstrating positive semi-definiteness and eigenvalue monotonicity under specific conditions. Then, we propose graph Fourier transform based on UEM (UEM-GFT), which can adaptively tune spectral properties to enhance signal processing performance. Experimental results on synthetic and real-world datasets demonstrate that the UEM-GFT outperforms existing GSO-based methods in anomaly detection tasks, achieving superior performance across varying network topologies.

Title: Enhancing Transformer-Based Foundation Models for Time Series Forecasting via Bagging, Boosting and Statistical Ensembles

Authors: Dhruv D. Modi, Rong Pan
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.16641
Pdf URL: https://arxiv.org/pdf/2508.16641
Copy Paste: [[2508.16641]] Enhancing Transformer-Based Foundation Models for Time Series Forecasting via Bagging, Boosting and Statistical Ensembles(https://arxiv.org/abs/2508.16641)
Keywords: foundation model, anomaly
Abstract: Time series foundation models (TSFMs) such as Lag-Llama, TimeGPT, Chronos, MOMENT, UniTS, and TimesFM have shown strong generalization and zero-shot capabilities for time series forecasting, anomaly detection, classification, and imputation. Despite these advantages, their predictions still suffer from variance, domain-specific bias, and limited uncertainty quantification when deployed on real operational data. This paper investigates a suite of statistical and ensemble-based enhancement techniques, including bootstrap-based bagging, regression-based stacking, prediction interval construction, statistical residual modeling, and iterative error feedback, to improve robustness and accuracy. Using the Belgium Electricity Short-Term Load Forecasting dataset as a case study, we demonstrate that the proposed hybrids consistently outperform standalone foundation models across multiple horizons. Regression-based ensembles achieve the lowest mean squared error; bootstrap aggregation markedly reduces long-context errors; residual modeling corrects systematic bias; and the resulting prediction intervals achieve near nominal coverage with widths shrinking as context length increases. The results indicate that integrating statistical reasoning with modern foundation models yields measurable gains in accuracy, reliability, and interpretability for real-world time series applications.

Title: From Classical Probabilistic Latent Variable Models to Modern Generative AI: A Unified Perspective

Authors: Tianhua Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16643
Pdf URL: https://arxiv.org/pdf/2508.16643
Copy Paste: [[2508.16643]] From Classical Probabilistic Latent Variable Models to Modern Generative AI: A Unified Perspective(https://arxiv.org/abs/2508.16643)
Keywords: diffusion, generative
Abstract: From large language models to multi-modal agents, Generative Artificial Intelligence (AI) now underpins state-of-the-art systems. Despite their varied architectures, many share a common foundation in probabilistic latent variable models (PLVMs), where hidden variables explain observed data for density estimation, latent reasoning, and structured inference. This paper presents a unified perspective by framing both classical and modern generative methods within the PLVM paradigm. We trace the progression from classical flat models such as probabilistic PCA, Gaussian mixture models, latent class analysis, item response theory, and latent Dirichlet allocation, through their sequential extensions including Hidden Markov Models, Gaussian HMMs, and Linear Dynamical Systems, to contemporary deep architectures: Variational Autoencoders as Deep PLVMs, Normalizing Flows as Tractable PLVMs, Diffusion Models as Sequential PLVMs, Autoregressive Models as Explicit Generative Models, and Generative Adversarial Networks as Implicit PLVMs. Viewing these architectures under a common probabilistic taxonomy reveals shared principles, distinct inference strategies, and the representational trade-offs that shape their strengths. We offer a conceptual roadmap that consolidates generative AI's theoretical foundations, clarifies methodological lineages, and guides future innovation by grounding emerging architectures in their probabilistic heritage.

Title: CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

Authors: Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Lladós, Xiatian Zhu, Anjan Dutta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16644
Pdf URL: https://arxiv.org/pdf/2508.16644
Copy Paste: [[2508.16644]] CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance(https://arxiv.org/abs/2508.16644)
Keywords: diffusion
Abstract: Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming layout-based and gradient-guided baselines with a score of 0.97.

Title: A Laplace diffusion-based transformer model for heart rate forecasting within daily activity context

Authors: Andrei Mateescu, Ioana Hadarau, Ionut Anghel, Tudor Cioara, Ovidiu Anchidin, Ancuta Nemes
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.16655
Pdf URL: https://arxiv.org/pdf/2508.16655
Copy Paste: [[2508.16655]] A Laplace diffusion-based transformer model for heart rate forecasting within daily activity context(https://arxiv.org/abs/2508.16655)
Keywords: diffusion
Abstract: With the advent of wearable Internet of Things (IoT) devices, remote patient monitoring (RPM) emerged as a promising solution for managing heart failure. However, the heart rate can fluctuate significantly due to various factors, and without correlating it to the patient's actual physical activity, it becomes difficult to assess whether changes are significant. Although Artificial Intelligence (AI) models may enhance the accuracy and contextual understanding of remote heart rate monitoring, the integration of activity data is still rarely addressed. In this paper, we propose a Transformer model combined with a Laplace diffusion technique to model heart rate fluctuations driven by physical activity of the patient. Unlike prior models that treat activity as secondary, our approach conditions the entire modeling process on activity context using specialized embeddings and attention mechanisms to prioritize activity specific historical patents. The model captures both long-term patterns and activity-specific heart rate dynamics by incorporating contextualized embeddings and dedicated encoder. The Transformer model was validated on a real-world dataset collected from 29 patients over a 4-month period. Experimental results show that our model outperforms current state-of-the-art methods, achieving a 43% reduction in mean absolute error compared to the considered baseline models. Moreover, the coefficient of determination R2 is 0.97 indicating the model predicted heart rate is in strong agreement with actual heart rate values. These findings suggest that the proposed model is a practical and effective tool for supporting both healthcare providers and remote patient monitoring systems.

Title: OASIS: Open-world Adaptive Self-supervised and Imbalanced-aware System

Authors: Miru Kim, Mugon Joe, Minhae Kwon
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.16656
Pdf URL: https://arxiv.org/pdf/2508.16656
Copy Paste: [[2508.16656]] OASIS: Open-world Adaptive Self-supervised and Imbalanced-aware System(https://arxiv.org/abs/2508.16656)
Keywords: self-supervised
Abstract: The expansion of machine learning into dynamic environments presents challenges in handling open-world problems where label shift, covariate shift, and unknown classes emerge. Post-training methods have been explored to address these challenges, adapting models to newly emerging data. However, these methods struggle when the initial pre-training is performed on class-imbalanced datasets, limiting generalization to minority classes. To address this, we propose a method that effectively handles open-world problems even when pre-training is conducted on imbalanced data. Our contrastive-based pre-training approach enhances classification performance, particularly for underrepresented classes. Our post-training mechanism generates reliable pseudo-labels, improving model robustness against open-world problems. We also introduce selective activation criteria to optimize the post-training process, reducing unnecessary computation. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art adaptation techniques in both accuracy and efficiency across diverse open-world scenarios.

Title: Trust but Verify! A Survey on Verification Design for Test-time Scaling

Authors: V Venktesh, Mandeep rathee, Avishek Anand
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16665
Pdf URL: https://arxiv.org/pdf/2508.16665
Copy Paste: [[2508.16665]] Trust but Verify! A Survey on Verification Design for Test-time Scaling(https://arxiv.org/abs/2508.16665)
Keywords: generative
Abstract: Test-time scaling (TTS) has emerged as a new frontier for scaling the performance of Large Language Models. In test-time scaling, by using more computational resources during inference, LLMs can improve their reasoning process and task performance. Several approaches have emerged for TTS such as distilling reasoning traces from another model or exploring the vast decoding search space by employing a verifier. The verifiers serve as reward models that help score the candidate outputs from the decoding process to diligently explore the vast solution space and select the best outcome. This paradigm commonly termed has emerged as a superior approach owing to parameter free scaling at inference time and high performance gains. The verifiers could be prompt-based, fine-tuned as a discriminative or generative model to verify process paths, outcomes or both. Despite their widespread adoption, there is no detailed collection, clear categorization and discussion of diverse verification approaches and their training mechanisms. In this survey, we cover the diverse approaches in the literature and present a unified view of verifier training, types and their utility in test-time scaling. Our repository can be found at this https URL.

Title: FAIRWELL: Fair Multimodal Self-Supervised Learning for Wellbeing Prediction

Authors: Jiaee Cheong, Abtin Mogharabin, Paul Liang, Hatice Gunes, Sinan Kalkan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16748
Pdf URL: https://arxiv.org/pdf/2508.16748
Copy Paste: [[2508.16748]] FAIRWELL: Fair Multimodal Self-Supervised Learning for Wellbeing Prediction(https://arxiv.org/abs/2508.16748)
Keywords: self-supervised
Abstract: Early efforts on leveraging self-supervised learning (SSL) to improve machine learning (ML) fairness has proven promising. However, such an approach has yet to be explored within a multimodal context. Prior work has shown that, within a multimodal setting, different modalities contain modality-unique information that can complement information of other modalities. Leveraging on this, we propose a novel subject-level loss function to learn fairer representations via the following three mechanisms, adapting the variance-invariance-covariance regularization (VICReg) method: (i) the variance term, which reduces reliance on the protected attribute as a trivial solution; (ii) the invariance term, which ensures consistent predictions for similar individuals; and (iii) the covariance term, which minimizes correlational dependence on the protected attribute. Consequently, our loss function, coined as FAIRWELL, aims to obtain subject-independent representations, enforcing fairness in multimodal prediction tasks. We evaluate our method on three challenging real-world heterogeneous healthcare datasets (i.e. D-Vlog, MIMIC and MODMA) which contain different modalities of varying length and different prediction tasks. Our findings indicate that our framework improves overall fairness performance with minimal reduction in classification performance and significantly improves on the performance-fairness Pareto frontier.

Title: A Framework for Benchmarking Fairness-Utility Trade-offs in Text-to-Image Models via Pareto Frontiers

Authors: Marco N. Bochernitsan, Rodrigo C. Barros, Lucas S. Kupssinskü
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16752
Pdf URL: https://arxiv.org/pdf/2508.16752
Copy Paste: [[2508.16752]] A Framework for Benchmarking Fairness-Utility Trade-offs in Text-to-Image Models via Pareto Frontiers(https://arxiv.org/abs/2508.16752)
Keywords: diffusion
Abstract: Achieving fairness in text-to-image generation demands mitigating social biases without compromising visual fidelity, a challenge critical to responsible AI. Current fairness evaluation procedures for text-to-image models rely on qualitative judgment or narrow comparisons, which limit the capacity to assess both fairness and utility in these models and prevent reproducible assessment of debiasing methods. Existing approaches typically employ ad-hoc, human-centered visual inspections that are both error-prone and difficult to replicate. We propose a method for evaluating fairness and utility in text-to-image models using Pareto-optimal frontiers across hyperparametrization of debiasing methods. Our method allows for comparison between distinct text-to-image models, outlining all configurations that optimize fairness for a given utility and vice-versa. To illustrate our evaluation method, we use Normalized Shannon Entropy and ClipScore for fairness and utility evaluation, respectively. We assess fairness and utility in Stable Diffusion, Fair Diffusion, SDXL, DeCoDi, and FLUX text-to-image models. Our method shows that most default hyperparameterizations of the text-to-image model are dominated solutions in the fairness-utility space, and it is straightforward to find better hyperparameters.

Title: GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs

Authors: Nitin Gupta, Pallav Koppisetti, Kausik Lakkaraju, Biplav Srivastava
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16753
Pdf URL: https://arxiv.org/pdf/2508.16753
Copy Paste: [[2508.16753]] GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs(https://arxiv.org/abs/2508.16753)
Keywords: generative
Abstract: The rapid proliferation of Generative AI (GenAI) into diverse, high-stakes domains necessitates robust and reproducible evaluation methods. However, practitioners often resort to ad-hoc, non-standardized scripts, as common metrics are often unsuitable for specialized, structured outputs (e.g., automated plans, time-series) or holistic comparison across modalities (e.g., text, audio, and image). This fragmentation hinders comparability and slows AI system development. To address this challenge, we present GAICo (Generative AI Comparator): a deployed, open-source Python library that streamlines and standardizes GenAI output comparison. GAICo provides a unified, extensible framework supporting a comprehensive suite of reference-based metrics for unstructured text, specialized structured data formats, and multimedia (images, audio). Its architecture features a high-level API for rapid, end-to-end analysis, from multi-model comparison to visualization and reporting, alongside direct metric access for granular control. We demonstrate GAICo's utility through a detailed case study evaluating and debugging complex, multi-modal AI Travel Assistant pipelines. GAICo empowers AI researchers and developers to efficiently assess system performance, make evaluation reproducible, improve development velocity, and ultimately build more trustworthy AI systems, aligning with the goal of moving faster and safer in AI deployment. Since its release on PyPI in Jun 2025, the tool has been downloaded over 13K times, across versions, by Aug 2025, demonstrating growing community interest.

Title: Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation

Authors: Arka Mukherjee, Shreya Ghosh
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2508.16762
Pdf URL: https://arxiv.org/pdf/2508.16762
Copy Paste: [[2508.16762]] Toward Socially Aware Vision-Language Models: Evaluating Cultural Competence Through Multimodal Story Generation(https://arxiv.org/abs/2508.16762)
Keywords: generative
Abstract: As Vision-Language Models (VLMs) achieve widespread deployment across diverse cultural contexts, ensuring their cultural competence becomes critical for responsible AI systems. While prior work has evaluated cultural awareness in text-only models and VLM object recognition tasks, no research has systematically assessed how VLMs adapt outputs when cultural identity cues are embedded in both textual prompts and visual inputs during generative tasks. We present the first comprehensive evaluation of VLM cultural competence through multimodal story generation, developing a novel multimodal framework that perturbs cultural identity and evaluates 5 contemporary VLMs on a downstream task: story generation. Our analysis reveals significant cultural adaptation capabilities, with rich culturally-specific vocabulary spanning names, familial terms, and geographic markers. However, we uncover concerning limitations: cultural competence varies dramatically across architectures, some models exhibit inverse cultural alignment, and automated metrics show architectural bias contradicting human assessments. Cross-modal evaluation shows that culturally distinct outputs are indeed detectable through visual-semantic similarity (28.7% within-nationality vs. 0.2% cross-nationality recall), yet visual-cultural understanding remains limited. In essence, we establish the promise and challenges of cultural competence in multimodal AI. We publicly release our codebase and data: this https URL

Title: Latent Graph Learning in Generative Models of Neural Signals

Authors: Nathan X. Kodama, Kenneth A. Loparo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.16776
Pdf URL: https://arxiv.org/pdf/2508.16776
Copy Paste: [[2508.16776]] Latent Graph Learning in Generative Models of Neural Signals(https://arxiv.org/abs/2508.16776)
Keywords: foundation model, generative
Abstract: Inferring temporal interaction graphs and higher-order structure from neural signals is a key problem in building generative models for systems neuroscience. Foundation models for large-scale neural data represent shared latent structures of neural signals. However, extracting interpretable latent graph representations in foundation models remains challenging and unsolved. Here we explore latent graph learning in generative models of neural signals. By testing against numerical simulations of neural circuits with known ground-truth connectivity, we evaluate several hypotheses for explaining learned model weights. We discover modest alignment between extracted network representations and the underlying directed graphs and strong alignment in the co-input graph representations. These findings motivate paths towards incorporating graph-based geometric constraints in the construction of large-scale foundation models for neural data.

Title: Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data

Authors: Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16783
Pdf URL: https://arxiv.org/pdf/2508.16783
Copy Paste: [[2508.16783]] Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data(https://arxiv.org/abs/2508.16783)
Keywords: diffusion
Abstract: Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% accuracy increase in the performance of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with the reduction of the underdiagnosis fairness gap by 19.3%. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open source our code, trained models, and synthetic dataset at this https URL .

Title: Assess and Prompt: A Generative RL Framework for Improving Engagement in Online Mental Health Communities

Authors: Bhagesh Gaur, Karan Gupta, Aseem Srivastava, Manish Gupta, Md Shad Akhtar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.16788
Pdf URL: https://arxiv.org/pdf/2508.16788
Copy Paste: [[2508.16788]] Assess and Prompt: A Generative RL Framework for Improving Engagement in Online Mental Health Communities(https://arxiv.org/abs/2508.16788)
Keywords: generative
Abstract: Online Mental Health Communities (OMHCs) provide crucial peer and expert support, yet many posts remain unanswered due to missing support attributes that signal the need for help. We present a novel framework that identifies these gaps and prompts users to enrich their posts, thereby improving engagement. To support this, we introduce REDDME, a new dataset of 4,760 posts from mental health subreddits annotated for the span and intensity of three key support attributes: event what happened?, effect what did the user experience?, and requirement what support they need?. Next, we devise a hierarchical taxonomy, CueTaxo, of support attributes for controlled question generation. Further, we propose MH-COPILOT, a reinforcement learning-based system that integrates (a) contextual attribute-span identification, (b) support attribute intensity classification, (c) controlled question generation via a hierarchical taxonomy, and (d) a verifier for reward modeling. Our model dynamically assesses posts for the presence/absence of support attributes, and generates targeted prompts to elicit missing information. Empirical results across four notable language models demonstrate significant improvements in attribute elicitation and user engagement. A human evaluation further validates the model's effectiveness in real-world OMHC settings.

Title: Towards Open-Vocabulary Multimodal 3D Object Detection with Attributes

Authors: Xinhao Xiang, Kuan-Chuan Peng, Suhas Lohit, Michael J. Jones, Jiawei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16812
Pdf URL: https://arxiv.org/pdf/2508.16812
Copy Paste: [[2508.16812]] Towards Open-Vocabulary Multimodal 3D Object Detection with Attributes(https://arxiv.org/abs/2508.16812)
Keywords: foundation model
Abstract: 3D object detection plays a crucial role in autonomous systems, yet existing methods are limited by closed-set assumptions and struggle to recognize novel objects and their attributes in real-world scenarios. We propose OVODA, a novel framework enabling both open-vocabulary 3D object and attribute detection with no need to know the novel class anchor size. OVODA uses foundation models to bridge the semantic gap between 3D features and texts while jointly detecting attributes, e.g., spatial relationships, motion states, etc. To facilitate such research direction, we propose OVAD, a new dataset that supplements existing 3D object detection benchmarks with comprehensive attribute annotations. OVODA incorporates several key innovations, including foundation model feature concatenation, prompt tuning strategies, and specialized techniques for attribute detection, including perspective-specified prompts and horizontal flip augmentation. Our results on both the nuScenes and Argoverse 2 datasets show that under the condition of no given anchor sizes of novel classes, OVODA outperforms the state-of-the-art methods in open-vocabulary 3D object detection while successfully recognizing object attributes. Our OVAD dataset is released here: this https URL .

Title: NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows

Authors: Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, Vladislav Kurenkov
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.16845
Pdf URL: https://arxiv.org/pdf/2508.16845
Copy Paste: [[2508.16845]] NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows(https://arxiv.org/abs/2508.16845)
Keywords: diffusion
Abstract: Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alter- native to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.

Title: Delta-SVD: Efficient Compression for Personalized Text-to-Image Models

Authors: Tangyuan Zhang, Shangyu Chen, Qixiang Chen, Jianfei Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16863
Pdf URL: https://arxiv.org/pdf/2508.16863
Copy Paste: [[2508.16863]] Delta-SVD: Efficient Compression for Personalized Text-to-Image Models(https://arxiv.org/abs/2508.16863)
Keywords: diffusion
Abstract: Personalized text-to-image models such as DreamBooth require fine-tuning large-scale diffusion backbones, resulting in significant storage overhead when maintaining many subject-specific models. We present Delta-SVD, a post-hoc, training-free compression method that targets the parameter weights update induced by DreamBooth fine-tuning. Our key observation is that these delta weights exhibit strong low-rank structure due to the sparse and localized nature of personalization. Delta-SVD first applies Singular Value Decomposition (SVD) to factorize the weight deltas, followed by an energy-based rank truncation strategy to balance compression efficiency and reconstruction fidelity. The resulting compressed models are fully plug-and-play and can be re-constructed on-the-fly during inference. Notably, the proposed approach is simple, efficient, and preserves the original model architecture. Experiments on a multiple subject dataset demonstrate that Delta-SVD achieves substantial compression with negligible loss in generation quality measured by CLIP score, SSIM and FID. Our method enables scalable and efficient deployment of personalized diffusion models, making it a practical solution for real-world applications that require storing and deploying large-scale subject customizations.

Title: Structural Energy-Guided Sampling for View-Consistent Text-to-3D

Authors: Qing Zhang, Jinguang Tong, Jie Hong, Jing Zhang, Xuesong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16917
Pdf URL: https://arxiv.org/pdf/2508.16917
Copy Paste: [[2508.16917]] Structural Energy-Guided Sampling for View-Consistent Text-to-3D(https://arxiv.org/abs/2508.16917)
Keywords: diffusion
Abstract: Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enforces multi-view consistency entirely at sampling time. SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory, steering geometry toward the intended viewpoint while preserving appearance fidelity. Integrated seamlessly into SDS/VSD pipelines, SEGS significantly reduces Janus artifacts, achieving improved geometric alignment and viewpoint consistency without retraining or weight modification.

Title: Sig-DEG for Distillation: Making Diffusion Models Faster and Lighter

Authors: Lei Jiang, Wen Ge, Niels Cariou-Kotlarek, Mingxuan Yi, Po-Yu Chen, Lingyi Yang, Francois Buet-Golfouse, Gaurav Mittal, Hao Ni
Subjects: cs.LG, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2508.16939
Pdf URL: https://arxiv.org/pdf/2508.16939
Copy Paste: [[2508.16939]] Sig-DEG for Distillation: Making Diffusion Models Faster and Lighter(https://arxiv.org/abs/2508.16939)
Keywords: diffusion, generative
Abstract: Diffusion models have achieved state-of-the-art results in generative modelling but remain computationally intensive at inference time, often requiring thousands of discretization steps. To this end, we propose Sig-DEG (Signature-based Differential Equation Generator), a novel generator for distilling pre-trained diffusion models, which can universally approximate the backward diffusion process at a coarse temporal resolution. Inspired by high-order approximations of stochastic differential equations (SDEs), Sig-DEG leverages partial signatures to efficiently summarize Brownian motion over sub-intervals and adopts a recurrent structure to enable accurate global approximation of the SDE solution. Distillation is formulated as a supervised learning task, where Sig-DEG is trained to match the outputs of a fine-resolution diffusion model on a coarse time grid. During inference, Sig-DEG enables fast generation, as the partial signature terms can be simulated exactly without requiring fine-grained Brownian paths. Experiments demonstrate that Sig-DEG achieves competitive generation quality while reducing the number of inference steps by an order of magnitude. Our results highlight the effectiveness of signature-based approximations for efficient generative modeling.

Title: RPD-Diff: Region-Adaptive Physics-Guided Diffusion Model for Visibility Enhancement under Dense and Non-Uniform Haze

Authors: Ruicheng Zhang, Puxin Yan, Zeyu Zhang, Yicheng Chang, Hongyi Chen, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16956
Pdf URL: https://arxiv.org/pdf/2508.16956
Copy Paste: [[2508.16956]] RPD-Diff: Region-Adaptive Physics-Guided Diffusion Model for Visibility Enhancement under Dense and Non-Uniform Haze(https://arxiv.org/abs/2508.16956)
Keywords: diffusion
Abstract: Single-image dehazing under dense and non-uniform haze conditions remains challenging due to severe information degradation and spatial heterogeneity. Traditional diffusion-based dehazing methods struggle with insufficient generation conditioning and lack of adaptability to spatially varying haze distributions, which leads to suboptimal restoration. To address these limitations, we propose RPD-Diff, a Region-adaptive Physics-guided Dehazing Diffusion Model for robust visibility enhancement in complex haze scenarios. RPD-Diff introduces a Physics-guided Intermediate State Targeting (PIST) strategy, which leverages physical priors to reformulate the diffusion Markov chain by generation target transitions, mitigating the issue of insufficient conditioning in dense haze scenarios. Additionally, the Haze-Aware Denoising Timestep Predictor (HADTP) dynamically adjusts patch-specific denoising timesteps employing a transmission map cross-attention mechanism, adeptly managing non-uniform haze distributions. Extensive experiments across four real-world datasets demonstrate that RPD-Diff achieves state-of-the-art performance in challenging dense and non-uniform haze scenarios, delivering high-quality, haze-free images with superior detail clarity and color fidelity.

Title: HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial-based Feature Caching

Authors: Liang Feng, Shikang Zheng, Jiacheng Liu, Yuqi Lin, Qinming Zhou, Peiliang Cai, Xinyu Wang, Junjie Chen, Chang Zou, Yue Ma, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16984
Pdf URL: https://arxiv.org/pdf/2508.16984
Copy Paste: [[2508.16984]] HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial-based Feature Caching(https://arxiv.org/abs/2508.16984)
Keywords: diffusion
Abstract: Diffusion models have achieved remarkable success in content generation but suffer from prohibitive computational costs due to iterative sampling. While recent feature caching methods tend to accelerate inference through temporal extrapolation, these methods still suffer from server quality loss due to the failure in modeling the complex dynamics of feature evolution. To solve this problem, this paper presents HiCache, a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials-the potentially theoretically optimal basis for Gaussian-correlated processes. Besides, We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy. Extensive experiments demonstrate HiCache's superiority: achieving 6.24x speedup on FLUX.1-dev while exceeding baseline quality, maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Core implementation is provided in the appendix, with complete code to be released upon acceptance.

Title: Dual Orthogonal Guidance for Robust Diffusion-based Handwritten Text Generation

Authors: Konstantina Nikolaidou, George Retsinas, Giorgos Sfikas, Silvia Cascianelli, Rita Cucchiara, Marcus Liwicki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17017
Pdf URL: https://arxiv.org/pdf/2508.17017
Copy Paste: [[2508.17017]] Dual Orthogonal Guidance for Robust Diffusion-based Handwritten Text Generation(https://arxiv.org/abs/2508.17017)
Keywords: diffusion
Abstract: Diffusion-based Handwritten Text Generation (HTG) approaches achieve impressive results on frequent, in-vocabulary words observed at training time and on regular styles. However, they are prone to memorizing training samples and often struggle with style variability and generation clarity. In particular, standard diffusion models tend to produce artifacts or distortions that negatively affect the readability of the generated text, especially when the style is hard to produce. To tackle these issues, we propose a novel sampling guidance strategy, Dual Orthogonal Guidance (DOG), that leverages an orthogonal projection of a negatively perturbed prompt onto the original positive prompt. This approach helps steer the generation away from artifacts while maintaining the intended content, and encourages more diverse, yet plausible, outputs. Unlike standard Classifier-Free Guidance (CFG), which relies on unconditional predictions and produces noise at high guidance scales, DOG introduces a more stable, disentangled direction in the latent space. To control the strength of the guidance across the denoising process, we apply a triangular schedule: weak at the start and end of denoising, when the process is most sensitive, and strongest in the middle steps. Experimental results on the state-of-the-art DiffusionPen and One-DM demonstrate that DOG improves both content clarity and style variability, even for out-of-vocabulary words and challenging writing styles.

Title: A Novel Local Focusing Mechanism for Deepfake Detection Generalization

Authors: Mingliang Li, Lin Yuanbo Wu, Changhong Liu, Hanxi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17029
Pdf URL: https://arxiv.org/pdf/2508.17029
Copy Paste: [[2508.17029]] A Novel Local Focusing Mechanism for Deepfake Detection Generalization(https://arxiv.org/abs/2508.17029)
Keywords: diffusion
Abstract: The rapid advancement of deepfake generation techniques has intensified the need for robust and generalizable detection methods. Existing approaches based on reconstruction learning typically leverage deep convolutional networks to extract differential features. However, these methods show poor generalization across object categories (e.g., from faces to cars) and generation domains (e.g., from GANs to Stable Diffusion), due to intrinsic limitations of deep CNNs. First, models trained on a specific category tend to overfit to semantic feature distributions, making them less transferable to other categories, especially as network depth increases. Second, Global Average Pooling (GAP) compresses critical local forgery cues into a single vector, thus discarding discriminative patterns vital for real-fake classification. To address these issues, we propose a novel Local Focus Mechanism (LFM) that explicitly attends to discriminative local features for differentiating fake from real images. LFM integrates a Salience Network (SNet) with a task-specific Top-K Pooling (TKP) module to select the K most informative local patterns. To mitigate potential overfitting introduced by Top-K pooling, we introduce two regularization techniques: Rank-Based Linear Dropout (RBLD) and Random-K Sampling (RKS), which enhance the model's robustness. LFM achieves a 3.7 improvement in accuracy and a 2.8 increase in average precision over the state-of-the-art Neighboring Pixel Relationships (NPR) method, while maintaining exceptional efficiency at 1789 FPS on a single NVIDIA A6000 GPU. Our approach sets a new benchmark for cross-domain deepfake detection. The source code are available in this https URL

Title: Styleclone: Face Stylization with Diffusion Based Data Augmentation

Authors: Neeraj Matiyali, Siddharth Srivastava, Gaurav Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17045
Pdf URL: https://arxiv.org/pdf/2508.17045
Copy Paste: [[2508.17045]] Styleclone: Face Stylization with Diffusion Based Data Augmentation(https://arxiv.org/abs/2508.17045)
Keywords: diffusion
Abstract: We present StyleClone, a method for training image-to-image translation networks to stylize faces in a specific style, even with limited style images. Our approach leverages textual inversion and diffusion-based guided image generation to augment small style datasets. By systematically generating diverse style samples guided by both the original style images and real face images, we significantly enhance the diversity of the style dataset. Using this augmented dataset, we train fast image-to-image translation networks that outperform diffusion-based methods in speed and quality. Experiments on multiple styles demonstrate that our method improves stylization quality, better preserves source image content, and significantly accelerates inference. Additionally, we provide a systematic evaluation of the augmentation techniques and their impact on stylization performance.

Title: PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models

Authors: Xianjing Cheng, Lintai Wu, Zuowen Wang, Junhui Hou, Jie Wen, Yong Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17050
Pdf URL: https://arxiv.org/pdf/2508.17050
Copy Paste: [[2508.17050]] PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models(https://arxiv.org/abs/2508.17050)
Keywords: diffusion
Abstract: Accurate 3D scene understanding in outdoor environments heavily relies on high-quality point clouds. However, LiDAR-scanned data often suffer from extreme sparsity, severely hindering downstream 3D perception tasks. Existing point cloud upsampling methods primarily focus on individual objects, thus demonstrating limited generalization capability for complex outdoor scenes. To address this issue, we propose PVNet, a diffusion model-based point-voxel interaction framework to perform LiDAR point cloud upsampling without dense supervision. Specifically, we adopt the classifier-free guidance-based DDPMs to guide the generation, in which we employ a sparse point cloud as the guiding condition and the synthesized point clouds derived from its nearby frames as the input. Moreover, we design a voxel completion module to refine and complete the coarse voxel features for enriching the feature representation. In addition, we propose a point-voxel interaction module to integrate features from both points and voxels, which efficiently improves the environmental perception capability of each upsampled point. To the best of our knowledge, our approach is the first scene-level point cloud upsampling method supporting arbitrary upsampling rates. Extensive experiments on various benchmarks demonstrate that our method achieves state-of-the-art performance. The source code will be available at this https URL.

Title: REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework

Authors: Stefanos Pasios, Nikos Nikolaidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17061
Pdf URL: https://arxiv.org/pdf/2508.17061
Copy Paste: [[2508.17061]] REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework(https://arxiv.org/abs/2508.17061)
Keywords: generative
Abstract: Photorealism is an important aspect of modern video games since it can shape the player experience and simultaneously impact the immersion, narrative engagement, and visual fidelity. Although recent hardware technological breakthroughs, along with state-of-the-art rendering technologies, have significantly improved the visual realism of video games, achieving true photorealism in dynamic environments at real-time frame rates still remains a major challenge due to the tradeoff between visual quality and performance. In this short paper, we present a novel approach for enhancing the photorealism of rendered game frames using generative adversarial networks. To this end, we propose Real-time photorealism Enhancement in Games via a dual-stage gEnerative Network framework (REGEN), which employs a robust unpaired image-to-image translation model to produce semantically consistent photorealistic frames that transform the problem into a simpler paired image-to-image translation task. This enables training with a lightweight method that can achieve real-time inference time without compromising visual quality. We demonstrate the effectiveness of our framework on Grand Theft Auto V, showing that the approach achieves visual results comparable to the ones produced by the robust unpaired Im2Im method while improving inference speed by 32.14 times. Our findings also indicate that the results outperform the photorealism-enhanced frames produced by directly training a lightweight unpaired Im2Im translation method to translate the video game frames towards the visual characteristics of real-world images. Code, pre-trained models, and demos for this work are available at: this https URL.

Title: SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation

Authors: Peng Hu, Yu Gu, Liang Luo, Fuji Ren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17062
Pdf URL: https://arxiv.org/pdf/2508.17062
Copy Paste: [[2508.17062]] SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation(https://arxiv.org/abs/2508.17062)
Keywords: diffusion, generative
Abstract: Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.

Title: Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages

Authors: Yuemei Xu, Kexin Xu, Jian Zhou, Ling Hu, Lin Gui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17078
Pdf URL: https://arxiv.org/pdf/2508.17078
Copy Paste: [[2508.17078]] Linguistic Neuron Overlap Patterns to Facilitate Cross-lingual Transfer on Low-resource Languages(https://arxiv.org/abs/2508.17078)
Keywords: in-context
Abstract: The current Large Language Models (LLMs) face significant challenges in improving performance on low-resource languages and urgently need data-efficient methods without costly fine-tuning. From the perspective of language-bridge, we propose BridgeX-ICL, a simple yet effective method to improve zero-shot Cross-lingual In-Context Learning (X-ICL) for low-resource languages. Unlike existing works focusing on language-specific neurons, BridgeX-ICL explores whether sharing neurons can improve cross-lingual performance in LLMs or not. We construct neuron probe data from the ground-truth MUSE bilingual dictionaries, and define a subset of language overlap neurons accordingly, to ensure full activation of these anchored neurons. Subsequently, we propose an HSIC-based metric to quantify LLMs' internal linguistic spectrum based on overlap neurons, which guides optimal bridge selection. The experiments conducted on 2 cross-lingual tasks and 15 language pairs from 7 diverse families (covering both high-low and moderate-low pairs) validate the effectiveness of BridgeX-ICL and offer empirical insights into the underlying multilingual mechanisms of LLMs.

Title: Two Birds with One Stone: Enhancing Uncertainty Quantification and Interpretability with Graph Functional Neural Process

Authors: Lingkai Kong, Haotian Sun, Yuchen Zhuang, Haorui Wang, Wenhao Mu, Chao Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17097
Pdf URL: https://arxiv.org/pdf/2508.17097
Copy Paste: [[2508.17097]] Two Birds with One Stone: Enhancing Uncertainty Quantification and Interpretability with Graph Functional Neural Process(https://arxiv.org/abs/2508.17097)
Keywords: generative
Abstract: Graph neural networks (GNNs) are powerful tools on graph data. However, their predictions are mis-calibrated and lack interpretability, limiting their adoption in critical applications. To address this issue, we propose a new uncertainty-aware and interpretable graph classification model that combines graph functional neural process and graph generative model. The core of our method is to assume a set of latent rationales which can be mapped to a probabilistic embedding space; the predictive distribution of the classifier is conditioned on such rationale embeddings by learning a stochastic correlation matrix. The graph generator serves to decode the graph structure of the rationales from the embedding space for model interpretability. For efficient model training, we adopt an alternating optimization procedure which mimics the well known Expectation-Maximization (EM) algorithm. The proposed method is general and can be applied to any existing GNN architecture. Extensive experiments on five graph classification datasets demonstrate that our framework outperforms state-of-the-art methods in both uncertainty quantification and GNN interpretability. We also conduct case studies to show that the decoded rationale structure can provide meaningful explanations.

Title: GRASP: Geospatial pixel Reasoning viA Structured Policy learning

Authors: Chengjie Jiang, Yunqi Zhou, Jiafeng Yan, Jing Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17102
Pdf URL: https://arxiv.org/pdf/2508.17102
Copy Paste: [[2508.17102]] GRASP: Geospatial pixel Reasoning viA Structured Policy learning(https://arxiv.org/abs/2508.17102)
Keywords: foundation model
Abstract: Geospatial pixel reasoning is a nascent remote-sensing task that aims to generate segmentation masks directly from natural-language instructions. Prevailing MLLM-based systems co-train a language model and a mask decoder with dense pixel supervision, which is expensive and often weak on out-of-domain (OOD) data. We introduce GRASP, a structured policy-learning framework. In our design, a multimodal large language model first emits task-relevant bounding boxes and positive points from a vision-language instruction. These outputs are then passed to a pre-trained segmentation model, which consumes them as prompts to generate the final mask. Instead of supervised fine-tuning, we optimize the system purely with reinforcement learning: the model is trained solely with GRPO, guided by format rewards and accuracy rewards computed on boxes and points (no mask supervision). This leverages strong priors in foundation models, minimizes trainable parameters, and enables learning from inexpensive annotations. We additionally curate GRASP-1k, which contains reasoning-intensive queries, detailed reasoning traces, and fine-grained segmentation annotations. Evaluations on both in-domain and out-of-domain test sets show state-of-the-art results: about 4% improvement in-domain and up to 54% on OOD benchmarks. The experiment results evidence our model's robust generalization and demonstrate that complex geospatial segmentation behaviors can be learned via RL from weak spatial cues. Code and the dataset will be released open-source.

Title: Geolocation-Aware Robust Spoken Language Identification

Authors: Qingzheng Wang, Hye-jin Shim, Jiancheng Sun, Shinji Watanabe
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2508.17148
Pdf URL: https://arxiv.org/pdf/2508.17148
Copy Paste: [[2508.17148]] Geolocation-Aware Robust Spoken Language Identification(https://arxiv.org/abs/2508.17148)
Keywords: self-supervised
Abstract: While Self-supervised Learning (SSL) has significantly improved Spoken Language Identification (LID), existing models often struggle to consistently classify dialects and accents of the same language as a unified class. To address this challenge, we propose geolocation-aware LID, a novel approach that incorporates language-level geolocation information into the SSL-based LID model. Specifically, we introduce geolocation prediction as an auxiliary task and inject the predicted vectors into intermediate representations as conditioning signals. This explicit conditioning encourages the model to learn more unified representations for dialectal and accented variations. Experiments across six multilingual datasets demonstrate that our approach improves robustness to intra-language variations and unseen domains, achieving new state-of-the-art accuracy on FLEURS (97.7%) and 9.7% relative improvement on ML-SUPERB 2.0 dialect set.

Title: SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation

Authors: Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17225
Pdf URL: https://arxiv.org/pdf/2508.17225
Copy Paste: [[2508.17225]] SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation(https://arxiv.org/abs/2508.17225)
Keywords: self-supervised
Abstract: Retrieval-Augmented Generation (RAG) systems require Large Language Models (LLMs) to generate responses that are faithful to the retrieved context. However, faithfulness hallucination remains a critical challenge, as existing methods often require costly supervision and post-training or significant inference burdens. To overcome these limitations, we introduce Self-Supervised Faithfulness Optimization (SSFO), the first self-supervised alignment approach for enhancing RAG faithfulness. SSFO constructs preference data pairs by contrasting the model's outputs generated with and without the context. Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness without incurring labeling costs or additional inference burden. We theoretically and empirically demonstrate that SSFO leverages a benign form of \emph{likelihood displacement}, transferring probability mass from parametric-based tokens to context-aligned tokens. Based on this insight, we propose a modified DPO loss function to encourage likelihood displacement. Comprehensive evaluations show that SSFO significantly outperforms existing methods, achieving state-of-the-art faithfulness on multiple context-based question-answering datasets. Notably, SSFO exhibits strong generalization, improving cross-lingual faithfulness and preserving general instruction-following capabilities. We release our code and model at the anonymous link: this https URL

Title: 4D Visual Pre-training for Robot Learning

Authors: Chengkai Hou, Yanjie Ze, Yankai Fu, Zeyu Gao, Songbo Hu, Yue Yu, Shanghang Zhang, Huazhe Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17230
Pdf URL: https://arxiv.org/pdf/2508.17230
Copy Paste: [[2508.17230]] 4D Visual Pre-training for Robot Learning(https://arxiv.org/abs/2508.17230)
Keywords: diffusion
Abstract: General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly on 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we are seeking a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model on the larger public datasets directly. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) for these tasks by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, the efficacy of FVP adapts across various point cloud encoders and datasets. Finally, we apply FVP to the RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks. Our project page is available at: https://4d- this http URL.

Title: Deep Learning-Assisted Detection of Sarcopenia in Cross-Sectional Computed Tomography Imaging

Authors: Manish Bhardwaj, Huizhi Liang, Ashwin Sivaharan, Sandip Nandhra, Vaclav Snasel, Tamer El-Sayed, Varun Ojha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17275
Pdf URL: https://arxiv.org/pdf/2508.17275
Copy Paste: [[2508.17275]] Deep Learning-Assisted Detection of Sarcopenia in Cross-Sectional Computed Tomography Imaging(https://arxiv.org/abs/2508.17275)
Keywords: self-supervised
Abstract: Sarcopenia is a progressive loss of muscle mass and function linked to poor surgical outcomes such as prolonged hospital stays, impaired mobility, and increased mortality. Although it can be assessed through cross-sectional imaging by measuring skeletal muscle area (SMA), the process is time-consuming and adds to clinical workloads, limiting timely detection and management; however, this process could become more efficient and scalable with the assistance of artificial intelligence applications. This paper presents high-quality three-dimensional cross-sectional computed tomography (CT) images of patients with sarcopenia collected at the Freeman Hospital, Newcastle upon Tyne Hospitals NHS Foundation Trust. Expert clinicians manually annotated the SMA at the third lumbar vertebra, generating precise segmentation masks. We develop deep-learning models to measure SMA in CT images and automate this task. Our methodology employed transfer learning and self-supervised learning approaches using labelled and unlabeled CT scan datasets. While we developed qualitative assessment models for detecting sarcopenia, we observed that the quantitative assessment of SMA is more precise and informative. This approach also mitigates the issue of class imbalance and limited data availability. Our model predicted the SMA, on average, with an error of +-3 percentage points against the manually measured SMA. The average dice similarity coefficient of the predicted masks was 93%. Our results, therefore, show a pathway to full automation of sarcopenia assessment and detection.

Title: Quickly Tuning Foundation Models for Image Segmentation

Authors: Breenda Das, Lennart Purucker, Timur Carstensen, Frank Hutter
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17283
Pdf URL: https://arxiv.org/pdf/2508.17283
Copy Paste: [[2508.17283]] Quickly Tuning Foundation Models for Image Segmentation(https://arxiv.org/abs/2508.17283)
Keywords: foundation model
Abstract: Foundation models like SAM (Segment Anything Model) exhibit strong zero-shot image segmentation performance, but often fall short on domain-specific tasks. Fine-tuning these models typically requires significant manual effort and domain expertise. In this work, we introduce QTT-SEG, a meta-learning-driven approach for automating and accelerating the fine-tuning of SAM for image segmentation. Built on the Quick-Tune hyperparameter optimization framework, QTT-SEG predicts high-performing configurations using meta-learned cost and performance models, efficiently navigating a search space of over 200 million possibilities. We evaluate QTT-SEG on eight binary and five multiclass segmentation datasets under tight time constraints. Our results show that QTT-SEG consistently improves upon SAM's zero-shot performance and surpasses AutoGluon Multimodal, a strong AutoML baseline, on most binary tasks within three minutes. On multiclass datasets, QTT-SEG delivers consistent gains as well. These findings highlight the promise of meta-learning in automating model adaptation for specialized segmentation tasks. Code available at: this https URL

Title: FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising

Authors: Zhihao Chen, Qi Gao, Zilong Li, Junping Zhang, Yi Zhang, Jun Zhao, Hongming Shan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17299
Pdf URL: https://arxiv.org/pdf/2508.17299
Copy Paste: [[2508.17299]] FoundDiff: Foundational Diffusion Model for Generalizable Low-Dose CT Denoising(https://arxiv.org/abs/2508.17299)
Keywords: diffusion
Abstract: Low-dose computed tomography (CT) denoising is crucial for reduced radiation exposure while ensuring diagnostically acceptable image quality. Despite significant advancements driven by deep learning (DL) in recent years, existing DL-based methods, typically trained on a specific dose level and anatomical region, struggle to handle diverse noise characteristics and anatomical heterogeneity during varied scanning conditions, limiting their generalizability and robustness in clinical scenarios. In this paper, we propose FoundDiff, a foundational diffusion model for unified and generalizable LDCT denoising across various dose levels and anatomical regions. FoundDiff employs a two-stage strategy: (i) dose-anatomy perception and (ii) adaptive denoising. First, we develop a dose- and anatomy-aware contrastive language image pre-training model (DA-CLIP) to achieve robust dose and anatomy perception by leveraging specialized contrastive learning strategies to learn continuous representations that quantify ordinal dose variations and identify salient anatomical regions. Second, we design a dose- and anatomy-aware diffusion model (DA-Diff) to perform adaptive and generalizable denoising by synergistically integrating the learned dose and anatomy embeddings from DACLIP into diffusion process via a novel dose and anatomy conditional block (DACB) based on Mamba. Extensive experiments on two public LDCT datasets encompassing eight dose levels and three anatomical regions demonstrate superior denoising performance of FoundDiff over existing state-of-the-art methods and the remarkable generalization to unseen dose levels. The codes and models are available at this https URL.

Title: PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing

Authors: Peilin Xiong, Junwen Chen, Honghui Yuan, Keiji Yanai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17302
Pdf URL: https://arxiv.org/pdf/2508.17302
Copy Paste: [[2508.17302]] PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing(https://arxiv.org/abs/2508.17302)
Keywords: diffusion, generative
Abstract: Localized subject-driven image editing aims to seamlessly integrate user-specified objects into target scenes. As generative models continue to scale, training becomes increasingly costly in terms of memory and computation, highlighting the need for training-free and scalable editing this http URL this end, we propose PosBridge an efficient and flexible framework for inserting custom objects. A key component of our method is positional embedding transplant, which guides the diffusion model to faithfully replicate the structural characteristics of reference this http URL, we introduce the Corner Centered Layout, which concatenates reference images and the background image as input to the FLUX.1-Fill model. During progressive denoising, positional embedding transplant is applied to guide the noise distribution in the target region toward that of the reference object. In this way, Corner Centered Layout effectively directs the FLUX.1-Fill model to synthesize identity-consistent content at the desired location. Extensive experiments demonstrate that PosBridge outperforms mainstream baselines in structural consistency, appearance fidelity, and computational efficiency, showcasing its practical value and potential for broad adoption.

Title: ShortListing Model: A Streamlined SimplexDiffusion for Discrete Variable Generation

Authors: Yuxuan Song, Zhe Zhang, Yu Pei, Jingjing Gong, Qiying Yu, Zheng Zhang, Mingxuan Wang, Hao Zhou, Jingjing Liu, Wei-Ying Ma
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2508.17345
Pdf URL: https://arxiv.org/pdf/2508.17345
Copy Paste: [[2508.17345]] ShortListing Model: A Streamlined SimplexDiffusion for Discrete Variable Generation(https://arxiv.org/abs/2508.17345)
Keywords: diffusion, generative
Abstract: Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex-based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier-free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, character-level and large-vocabulary language modeling demonstrate the competitive performance and strong potential of SLM. Our code can be found at this https URL

Title: No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection

Authors: Lianrui Mu, Zou Xingze, Jianhong Bai, Jiaqi Hu, Wenjie Zheng, Jiangnan Ye, Jiedong Zhuang, Mudassar Ali, Jing Wang, Haoji Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17346
Pdf URL: https://arxiv.org/pdf/2508.17346
Copy Paste: [[2508.17346]] No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection(https://arxiv.org/abs/2508.17346)
Keywords: generative
Abstract: The rapid growth of high-resolution, meticulously crafted AI-generated images poses a significant challenge to existing detection methods, which are often trained and evaluated on low-resolution, automatically generated datasets that do not align with the complexities of high-resolution scenarios. A common practice is to resize or center-crop high-resolution images to fit standard network inputs. However, without full coverage of all pixels, such strategies risk either obscuring subtle, high-frequency artifacts or discarding information from uncovered regions, leading to input information loss. In this paper, we introduce the High-Resolution Detail-Aggregation Network (HiDA-Net), a novel framework that ensures no pixel is left behind. We use the Feature Aggregation Module (FAM), which fuses features from multiple full-resolution local tiles with a down-sampled global view of the image. These local features are aggregated and fused with global representations for final prediction, ensuring that native-resolution details are preserved and utilized for detection. To enhance robustness against challenges such as localized AI manipulations and compression, we introduce Token-wise Forgery Localization (TFL) module for fine-grained spatial sensitivity and JPEG Quality Factor Estimation (QFE) module to disentangle generative artifacts from compression noise explicitly. Furthermore, to facilitate future research, we introduce HiRes-50K, a new challenging benchmark consisting of 50,568 images with up to 64 megapixels. Extensive experiments show that HiDA-Net achieves state-of-the-art, increasing accuracy by over 13% on the challenging Chameleon dataset and 10% on our HiRes-50K.

Title: DiCache: Let Diffusion Model Determine Its Own Cache

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tong Wu, Dahua Lin, Jiaqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17356
Pdf URL: https://arxiv.org/pdf/2508.17356
Copy Paste: [[2508.17356]] DiCache: Let Diffusion Model Determine Its Own Cache(https://arxiv.org/abs/2508.17356)
Keywords: diffusion
Abstract: Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine the timing of caching and utilizing handcrafted rules for leveraging multi-step caches. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail on outlier samples. In this paper, a strong correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of final model outputs. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain a stable prior for the caching error in real time, enabling the model to autonomously determine caching schedules. (2) Dynamic Cache Trajectory Alignment combines multi-step caches based on shallow-layer probe feature trajectory to better approximate the current feature, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved visual fidelity over state-of-the-art methods on various leading diffusion models including WAN 2.1, HunyuanVideo for video generation, and Flux for image generation.

Title: ShaLa: Multimodal Shared Latent Space Modelling

Authors: Jiali Cui, Yan-Ying Chen, Yanxia Zhang, Matthew Klenk
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.17376
Pdf URL: https://arxiv.org/pdf/2508.17376
Copy Paste: [[2508.17376]] ShaLa: Multimodal Shared Latent Space Modelling(https://arxiv.org/abs/2508.17376)
Keywords: diffusion, generative
Abstract: This paper presents a novel generative framework for learning shared latent representations across multimodal data. Many advanced multimodal methods focus on capturing all combinations of modality-specific details across inputs, which can inadvertently obscure the high-level semantic concepts that are shared across modalities. Notably, Multimodal VAEs with low-dimensional latent variables are designed to capture shared representations, enabling various tasks such as joint multimodal synthesis and cross-modal inference. However, multimodal VAEs often struggle to design expressive joint variational posteriors and suffer from low-quality synthesis. In this work, ShaLa addresses these challenges by integrating a novel architectural inference model and a second-stage expressive diffusion prior, which not only facilitates effective inference of shared latent representation but also significantly improves the quality of downstream multimodal synthesis. We validate ShaLa extensively across multiple benchmarks, demonstrating superior coherence and synthesis quality compared to state-of-the-art multimodal VAEs. Furthermore, ShaLa scales to many more modalities while prior multimodal VAEs have fallen short in capturing the increasing complexity of the shared latent space.

Title: Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

Authors: Jacob Portes, Connor Jennings, Erica Ji Yuen, Sasha Doubov, Michael Carbin
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.17400
Pdf URL: https://arxiv.org/pdf/2508.17400
Copy Paste: [[2508.17400]] Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs(https://arxiv.org/abs/2508.17400)
Keywords: in-context
Abstract: How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs. We also show that In-Context Learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications this has for the development of LLM-based retrievers.

Title: DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed Framework for Claim Normalization

Authors: Aleksandar Pramov, Jiangqin Ma, Bina Patel
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2508.17402
Pdf URL: https://arxiv.org/pdf/2508.17402
Copy Paste: [[2508.17402]] DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed Framework for Claim Normalization(https://arxiv.org/abs/2508.17402)
Keywords: in-context
Abstract: Claim normalization is an integral part of any automatic fact-check verification system. It parses the typically noisy claim data, such as social media posts into normalized claims, which are then fed into downstream veracity classification tasks. The CheckThat! 2025 Task 2 focuses specifically on claim normalization and spans 20 languages under monolingual and zero-shot conditions. Our proposed solution consists of a lightweight \emph{retrieval-first, LLM-backed} pipeline, in which we either dynamically prompt a GPT-4o-mini with in-context examples, or retrieve the closest normalization from the train dataset directly. On the official test set, the system ranks near the top for most monolingual tracks, achieving first place in 7 out of of the 13 languages. In contrast, the system underperforms in the zero-shot setting, highlighting the limitation of the proposed solution.

Title: Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems

Authors: Yinsong Wang, Xiao Liu, Quan Zeng, Yu Ding
Subjects: cs.LG, stat.AP
Abstract URL: https://arxiv.org/abs/2508.17403
Pdf URL: https://arxiv.org/pdf/2508.17403
Copy Paste: [[2508.17403]] Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems(https://arxiv.org/abs/2508.17403)
Keywords: anomaly
Abstract: Recent breakthroughs in autonomous experimentation have demonstrated remarkable physical capabilities, yet their cognitive control remains limited--often relying on static heuristics or classical optimization. A core limitation is the absence of a principled mechanism to detect and adapt to the unexpectedness. While traditional surprise measures--such as Shannon or Bayesian Surprise--offer momentary detection of deviation, they fail to capture whether a system is truly learning and adapting. In this work, we introduce Mutual Information Surprise (MIS), a new framework that redefines surprise not as anomaly detection, but as a signal of epistemic growth. MIS quantifies the impact of new observations on mutual information, enabling autonomous systems to reflect on their learning progression. We develop a statistical test sequence to detect meaningful shifts in estimated mutual information and propose a mutual information surprise reaction policy (MISRP) that dynamically governs system behavior through sampling adjustment and process forking. Empirical evaluations--on both synthetic domains and a dynamic pollution map estimation task--show that MISRP-governed strategies significantly outperform classical surprise-based approaches in stability, responsiveness, and predictive accuracy. By shifting surprise from reactive to reflective, MIS offers a path toward more self-aware and adaptive autonomous systems.

Title: E-BayesSAM: Efficient Bayesian Adaptation of SAM with Self-Optimizing KAN-Based Interpretation for Uncertainty-Aware Ultrasonic Segmentation

Authors: Bin Huang, Zhong Liu, Huiying Wen, Bingsheng Huang, Xin Chen, Shuo Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17408
Pdf URL: https://arxiv.org/pdf/2508.17408
Copy Paste: [[2508.17408]] E-BayesSAM: Efficient Bayesian Adaptation of SAM with Self-Optimizing KAN-Based Interpretation for Uncertainty-Aware Ultrasonic Segmentation(https://arxiv.org/abs/2508.17408)
Keywords: self-supervised
Abstract: Although the Segment Anything Model (SAM) has advanced medical image segmentation, its Bayesian adaptation for uncertainty-aware segmentation remains hindered by three key issues: (1) instability in Bayesian fine-tuning of large pre-trained SAMs; (2) high computation cost due to SAM's massive parameters; (3) SAM's black-box design limits interpretability. To overcome these, we propose E-BayesSAM, an efficient framework combining Token-wise Variational Bayesian Inference (T-VBI) for efficienty Bayesian adaptation and Self-Optimizing Kolmogorov-Arnold Network (SO-KAN) for improving interpretability. T-VBI innovatively reinterprets SAM's output tokens as dynamic probabilistic weights and reparameterizes them as latent variables without auxiliary training, enabling training-free VBI for uncertainty estimation. SO-KAN improves token prediction with learnable spline activations via self-supervised learning, providing insight to prune redundant tokens to boost efficiency and accuracy. Experiments on five ultrasound datasets demonstrated that E-BayesSAM achieves: (i) real-time inference (0.03s/image), (ii) superior segmentation accuracy (average DSC: Pruned E-BayesSAM's 89.0\% vs. E-BayesSAM's 88.0% vs. MedSAM's 88.3%), and (iii) identification of four critical tokens governing SAM's decisions. By unifying efficiency, reliability, and interpretability, E-BayesSAM bridges SAM's versatility with clinical needs, advancing deployment in safety-critical medical applications. The source code is available at this https URL.

Title: Modular MeanFlow: Towards Stable and Scalable One-Step Generative Modeling

Authors: Haochen You, Baojing Liu, Hongyang He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17426
Pdf URL: https://arxiv.org/pdf/2508.17426
Copy Paste: [[2508.17426]] Modular MeanFlow: Towards Stable and Scalable One-Step Generative Modeling(https://arxiv.org/abs/2508.17426)
Keywords: diffusion, generative
Abstract: One-step generative modeling seeks to generate high-quality data samples in a single function evaluation, significantly improving efficiency over traditional diffusion or flow-based models. In this work, we introduce Modular MeanFlow (MMF), a flexible and theoretically grounded approach for learning time-averaged velocity fields. Our method derives a family of loss functions based on a differential identity linking instantaneous and average velocities, and incorporates a gradient modulation mechanism that enables stable training without sacrificing expressiveness. We further propose a curriculum-style warmup schedule to smoothly transition from coarse supervision to fully differentiable training. The MMF formulation unifies and generalizes existing consistency-based and flow-matching methods, while avoiding expensive higher-order derivatives. Empirical results across image synthesis and trajectory modeling tasks demonstrate that MMF achieves competitive sample quality, robust convergence, and strong generalization, particularly under low-data or out-of-distribution settings.

Title: TinySR: Pruning Diffusion for Real-World Image Super-Resolution

Authors: Linwei Dong, Qingnan Fan, Yuhang Yu, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17434
Pdf URL: https://arxiv.org/pdf/2508.17434
Copy Paste: [[2508.17434]] TinySR: Pruning Diffusion for Real-World Image Super-Resolution(https://arxiv.org/abs/2508.17434)
Keywords: diffusion, generative
Abstract: Real-world image super-resolution (Real-ISR) focuses on recovering high-quality images from low-resolution inputs that suffer from complex degradations like noise, blur, and compression. Recently, diffusion models (DMs) have shown great potential in this area by leveraging strong generative priors to restore fine details. However, their iterative denoising process incurs high computational overhead, posing challenges for real-time applications. Although one-step distillation methods, such as OSEDiff and TSD-SR, offer faster inference, they remain fundamentally constrained by their large, over-parameterized model architectures. In this work, we present TinySR, a compact yet effective diffusion model specifically designed for Real-ISR that achieves real-time performance while maintaining perceptual quality. We introduce a Dynamic Inter-block Activation and an Expansion-Corrosion Strategy to facilitate more effective decision-making in depth pruning. We achieve VAE compression through channel pruning, attention removal and lightweight SepConv. We eliminate time- and prompt-related modules and perform pre-caching techniques to further speed up the model. TinySR significantly reduces computational cost and model size, achieving up to 5.68x speedup and 83% parameter reduction compared to its teacher TSD-SR, while still providing high quality results.

Title: A Synthetic Dataset for Manometry Recognition in Robotic Applications

Authors: Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2508.17468
Pdf URL: https://arxiv.org/pdf/2508.17468
Copy Paste: [[2508.17468]] A Synthetic Dataset for Manometry Recognition in Robotic Applications(https://arxiv.org/abs/2508.17468)
Keywords: foundation model
Abstract: This work addresses the challenges of data scarcity and high acquisition costs for training robust object detection models in complex industrial environments, such as offshore oil platforms. The practical and economic barriers to collecting real-world data in these hazardous settings often hamper the development of autonomous inspection systems. To overcome this, in this work we propose and validate a hybrid data synthesis pipeline that combines procedural rendering with AI-driven video generation. Our methodology leverages BlenderProc to create photorealistic images with precise annotations and controlled domain randomization, and integrates NVIDIA's Cosmos-Predict2 world-foundation model to synthesize physically plausible video sequences with temporal diversity, capturing rare viewpoints and adverse conditions. We demonstrate that a YOLO-based detection network trained on a composite dataset, blending real images with our synthetic data, achieves superior performance compared to models trained exclusively on real-world data. Notably, a 1:1 mixture of real and synthetic data yielded the highest accuracy, surpassing the real-only baseline. These findings highlight the viability of a synthetic-first approach as an efficient, cost-effective, and safe alternative for developing reliable perception systems in safety-critical and resource-constrained industrial applications.

Title: Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice

Authors: Hugo Bohy, Minh Tran, Kevin El Haddad, Thierry Dutoit, Mohammad Soleymani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17502
Pdf URL: https://arxiv.org/pdf/2508.17502
Copy Paste: [[2508.17502]] Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice(https://arxiv.org/abs/2508.17502)
Keywords: self-supervised
Abstract: Human social behaviors are inherently multimodal necessitating the development of powerful audiovisual models for their perception. In this paper, we present Social-MAE, our pre-trained audiovisual Masked Autoencoder based on an extended version of Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which is pre-trained on audiovisual social data. Specifically, we modify CAV-MAE to receive a larger number of frames as input and pre-train it on a large dataset of human social interaction (VoxCeleb2) in a self-supervised manner. We demonstrate the effectiveness of this model by finetuning and evaluating the model on different social and affective downstream tasks, namely, emotion recognition, laughter detection and apparent personality estimation. The model achieves state-of-the-art results on multimodal emotion recognition and laughter recognition and competitive results for apparent personality estimation, demonstrating the effectiveness of in-domain self-supervised pre-training. Code and model weight are available here this https URL.

Title: DinoTwins: Combining DINO and Barlow Twins for Robust, Label-Efficient Vision Transformers

Authors: Michael Podsiadly, Brendon K Lay
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17509
Pdf URL: https://arxiv.org/pdf/2508.17509
Copy Paste: [[2508.17509]] DinoTwins: Combining DINO and Barlow Twins for Robust, Label-Efficient Vision Transformers(https://arxiv.org/abs/2508.17509)
Keywords: self-supervised
Abstract: Training AI models to understand images without costly labeled data remains a challenge. We combine two techniques--DINO (teacher-student learning) and Barlow Twins (redundancy reduction)--to create a model that learns better with fewer labels and less compute. While both DINO and Barlow Twins have independently demonstrated strong performance in self-supervised learning, each comes with limitations--DINO may be sensitive to certain augmentations, and Barlow Twins often requires batch sizes too large to fit on consumer hardware. By combining the redundancy-reduction objective of Barlow Twins with the self-distillation strategy of DINO, we aim to leverage their complementary strengths. We train a hybrid model on the MS COCO dataset using only 10\% of labeled data for linear probing, and evaluate its performance against standalone DINO and Barlow Twins implementations. Preliminary results show that the combined approach achieves comparable loss and classification accuracy to DINO while maintaining strong feature representations. Attention visualizations further suggest improved semantic segmentation capability in the hybrid model. This combined method offers a scalable, label-efficient alternative for training ViTs in resource-constrained environments.

Title: Modeling Irregular Astronomical Time Series with Neural Stochastic Delay Differential Equations

Authors: YongKyung Oh, Seungsu Kam, Dong-Young Lim, Sungil Kim
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17521
Pdf URL: https://arxiv.org/pdf/2508.17521
Copy Paste: [[2508.17521]] Modeling Irregular Astronomical Time Series with Neural Stochastic Delay Differential Equations(https://arxiv.org/abs/2508.17521)
Keywords: anomaly
Abstract: Astronomical time series from large-scale surveys like LSST are often irregularly sampled and incomplete, posing challenges for classification and anomaly detection. We introduce a new framework based on Neural Stochastic Delay Differential Equations (Neural SDDEs) that combines stochastic modeling with neural networks to capture delayed temporal dynamics and handle irregular observations. Our approach integrates a delay-aware neural architecture, a numerical solver for SDDEs, and mechanisms to robustly learn from noisy, sparse sequences. Experiments on irregularly sampled astronomical data demonstrate strong classification accuracy and effective detection of novel astrophysical events, even with partial labels. This work highlights Neural SDDEs as a principled and practical tool for time series analysis under observational constraints.

Title: OmniMRI: A Unified Vision--Language Foundation Model for Generalist MRI Interpretation

Authors: Xingxin He, Aurora Rofena, Ruimin Feng, Haozhe Liao, Zhaoye Zhou, Albert Jang, Fang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17524
Pdf URL: https://arxiv.org/pdf/2508.17524
Copy Paste: [[2508.17524]] OmniMRI: A Unified Vision--Language Foundation Model for Generalist MRI Interpretation(https://arxiv.org/abs/2508.17524)
Keywords: self-supervised, foundation model
Abstract: Magnetic Resonance Imaging (MRI) is indispensable in clinical practice but remains constrained by fragmented, multi-stage workflows encompassing acquisition, reconstruction, segmentation, detection, diagnosis, and reporting. While deep learning has achieved progress in individual tasks, existing approaches are often anatomy- or application-specific and lack generalizability across diverse clinical settings. Moreover, current pipelines rarely integrate imaging data with complementary language information that radiologists rely on in routine practice. Here, we introduce OmniMRI, a unified vision-language foundation model designed to generalize across the entire MRI workflow. OmniMRI is trained on a large-scale, heterogeneous corpus curated from 60 public datasets, over 220,000 MRI volumes and 19 million MRI slices, incorporating image-only data, paired vision-text data, and instruction-response data. Its multi-stage training paradigm, comprising self-supervised vision pretraining, vision-language alignment, multimodal pretraining, and multi-task instruction tuning, progressively equips the model with transferable visual representations, cross-modal reasoning, and robust instruction-following capabilities. Qualitative results demonstrate OmniMRI's ability to perform diverse tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and radiology report generation. These findings highlight OmniMRI's potential to consolidate fragmented pipelines into a scalable, generalist framework, paving the way toward foundation models that unify imaging and clinical language for comprehensive, end-to-end MRI interpretation.

Title: In-Context Algorithm Emulation in Fixed-Weight Transformers

Authors: Jerry Yao-Chieh Hu, Hude Liu, Jennifer Yuntong Zhang, Han Liu
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2508.17550
Pdf URL: https://arxiv.org/pdf/2508.17550
Copy Paste: [[2508.17550]] In-Context Algorithm Emulation in Fixed-Weight Transformers(https://arxiv.org/abs/2508.17550)
Keywords: foundation model, in-context
Abstract: We prove that a minimal Transformer architecture with frozen weights is capable of emulating a broad class of algorithms by in-context prompting. In particular, for any algorithm implementable by a fixed-weight attention head (e.g. one-step gradient descent or linear/ridge regression), there exists a prompt that drives a two-layer softmax attention module to reproduce the algorithm's output with arbitrary precision. This guarantee extends even to a single-head attention layer (using longer prompts if necessary), achieving architectural minimality. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates. All adaptation happens through the prompt alone. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, establishing a form of algorithmic universality in modern Transformer models.

Title: IDU: Incremental Dynamic Update of Existing 3D Virtual Environments with New Imagery Data

Authors: Meida Chen, Luis Leal, Yue Hu, Rong Liu, Butian Xiong, Andrew Feng, Jiuyi Xu, Yangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17579
Pdf URL: https://arxiv.org/pdf/2508.17579
Copy Paste: [[2508.17579]] IDU: Incremental Dynamic Update of Existing 3D Virtual Environments with New Imagery Data(https://arxiv.org/abs/2508.17579)
Keywords: generative
Abstract: For simulation and training purposes, military organizations have made substantial investments in developing high-resolution 3D virtual environments through extensive imaging and 3D scanning. However, the dynamic nature of battlefield conditions-where objects may appear or vanish over time-makes frequent full-scale updates both time-consuming and costly. In response, we introduce the Incremental Dynamic Update (IDU) pipeline, which efficiently updates existing 3D reconstructions, such as 3D Gaussian Splatting (3DGS), with only a small set of newly acquired images. Our approach starts with camera pose estimation to align new images with the existing 3D model, followed by change detection to pinpoint modifications in the scene. A 3D generative AI model is then used to create high-quality 3D assets of the new elements, which are seamlessly integrated into the existing 3D model. The IDU pipeline incorporates human guidance to ensure high accuracy in object identification and placement, with each update focusing on a single new object at a time. Experimental results confirm that our proposed IDU pipeline significantly reduces update time and labor, offering a cost-effective and targeted solution for maintaining up-to-date 3D models in rapidly evolving military scenarios.

Title: HERO: Hierarchical Extrapolation and Refresh for Efficient World Models

Authors: Quanjian Song, Xinyu Wang, Donghao Zhou, Jingyu Lin, Cunjian Chen, Yue Ma, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17588
Pdf URL: https://arxiv.org/pdf/2508.17588
Copy Paste: [[2508.17588]] HERO: Hierarchical Extrapolation and Refresh for Efficient World Models(https://arxiv.org/abs/2508.17588)
Keywords: diffusion
Abstract: Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remain compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73$\times$ speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.

Title: JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on

Authors: Aowen Wang, Wei Li, Hao Luo, Mengxing Ao, Chenyu Zhu, Xinyang Li, Fan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17614
Pdf URL: https://arxiv.org/pdf/2508.17614
Copy Paste: [[2508.17614]] JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on(https://arxiv.org/abs/2508.17614)
Keywords: diffusion, self-supervised
Abstract: Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built upon a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach directly incorporates diverse control signals -- such as the reference person image and the target garment image -- into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment and improved garment-person integration. To address data scarcity and quality, we introduce a bidirectional generation strategy for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric ``Try-Off'' model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing iterative improvement in visual fidelity and diversity. Experiments demonstrate that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. Moreover, it shows strong generalization in real-world applications, surpassing commercial systems.

Title: ControlEchoSynth: Boosting Ejection Fraction Estimation Models via Controlled Video Diffusion

Authors: Nima Kondori, Hanwen Liang, Hooman Vaseli, Bingyu Xie, Christina Luong, Purang Abolmaesumi, Teresa Tsang, Renjie Liao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17631
Pdf URL: https://arxiv.org/pdf/2508.17631
Copy Paste: [[2508.17631]] ControlEchoSynth: Boosting Ejection Fraction Estimation Models via Controlled Video Diffusion(https://arxiv.org/abs/2508.17631)
Keywords: diffusion, generative
Abstract: Synthetic data generation represents a significant advancement in boosting the performance of machine learning (ML) models, particularly in fields where data acquisition is challenging, such as echocardiography. The acquisition and labeling of echocardiograms (echo) for heart assessment, crucial in point-of-care ultrasound (POCUS) settings, often encounter limitations due to the restricted number of echo views available, typically captured by operators with varying levels of experience. This study proposes a novel approach for enhancing clinical diagnosis accuracy by synthetically generating echo views. These views are conditioned on existing, real views of the heart, focusing specifically on the estimation of ejection fraction (EF), a critical parameter traditionally measured from biplane apical views. By integrating a conditional generative model, we demonstrate an improvement in EF estimation accuracy, providing a comparative analysis with traditional methods. Preliminary results indicate that our synthetic echoes, when used to augment existing datasets, not only enhance EF estimation but also show potential in advancing the development of more robust, accurate, and clinically relevant ML models. This approach is anticipated to catalyze further research in synthetic data applications, paving the way for innovative solutions in medical imaging diagnostics.

Title: Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes

Authors: Ryan Faulkner, Ian Reid, Simon Ratcliffe, Tat-Jun Chin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17634
Pdf URL: https://arxiv.org/pdf/2508.17634
Copy Paste: [[2508.17634]] Finding Outliers in a Haystack: Anomaly Detection for Large Pointcloud Scenes(https://arxiv.org/abs/2508.17634)
Keywords: anomaly
Abstract: LiDAR scanning in outdoor scenes acquires accurate distance measurements over wide areas, producing large-scale point clouds. Application examples for this data include robotics, automotive vehicles, and land surveillance. During such applications, outlier objects from outside the training data will inevitably appear. Our research contributes a novel approach to open-set segmentation, leveraging the learnings of object defect-detection research. We also draw on the Mamba architecture's strong performance in utilising long-range dependencies and scalability to large data. Combining both, we create a reconstruction based approach for the task of outdoor scene open-set segmentation. We show that our approach improves performance not only when applied to our our own open-set segmentation method, but also when applied to existing methods. Furthermore we contribute a Mamba based architecture which is competitive with existing voxel-convolution based methods on challenging, large-scale pointclouds.

Title: Longitudinal Progression Prediction of Alzheimer's Disease with Tabular Foundation Model

Authors: Yilang Ding, Jiawen Ren, Jiaying Lu, Gloria Hyunjung Kwak, Armin Iraji, Alex Fedorov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17649
Pdf URL: https://arxiv.org/pdf/2508.17649
Copy Paste: [[2508.17649]] Longitudinal Progression Prediction of Alzheimer's Disease with Tabular Foundation Model(https://arxiv.org/abs/2508.17649)
Keywords: foundation model, generative
Abstract: Alzheimer's disease is a progressive neurodegenerative disorder that remains challenging to predict due to its multifactorial etiology and the complexity of multimodal clinical data. Accurate forecasting of clinically relevant biomarkers, including diagnostic and quantitative measures, is essential for effective monitoring of disease progression. This work introduces L2C-TabPFN, a method that integrates a longitudinal-to-cross-sectional (L2C) transformation with a pre-trained Tabular Foundation Model (TabPFN) to predict Alzheimer's disease outcomes using the TADPOLE dataset. L2C-TabPFN converts sequential patient records into fixed-length feature vectors, enabling robust prediction of diagnosis, cognitive scores, and ventricular volume. Experimental results demonstrate that, while L2C-TabPFN achieves competitive performance on diagnostic and cognitive outcomes, it provides state-of-the-art results in ventricular volume prediction. This key imaging biomarker reflects neurodegeneration and progression in Alzheimer's disease. These findings highlight the potential of tabular foundational models for advancing longitudinal prediction of clinically relevant imaging markers in Alzheimer's disease.

Title: Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models

Authors: Victoria Yan, Honor Chotkowski, Fengran Wang, Alex Fedorov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17675
Pdf URL: https://arxiv.org/pdf/2508.17675
Copy Paste: [[2508.17675]] Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models(https://arxiv.org/abs/2508.17675)
Keywords: generative
Abstract: Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generate synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the "Cookie Theft" picture description task. Two distinct prompting strategies-naive prompts with basic instructions and advanced prompts enriched with contextual guidance-were evaluated. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. Superior models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.

Title: Robustness Feature Adapter for Efficient Adversarial Training

Authors: Quanwei Wu, Jun Guo, Wei Wang, Yi Wang
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.17680
Pdf URL: https://arxiv.org/pdf/2508.17680
Copy Paste: [[2508.17680]] Robustness Feature Adapter for Efficient Adversarial Training(https://arxiv.org/abs/2508.17680)
Keywords: foundation model
Abstract: Adversarial training (AT) with projected gradient descent is the most popular method to improve model robustness under adversarial attacks. However, computational overheads become prohibitively large when AT is applied to large backbone models. AT is also known to have the issue of robust overfitting. This paper contributes to solving both problems simultaneously towards building more trustworthy foundation models. In particular, we propose a new adapter-based approach for efficient AT directly in the feature space. We show that the proposed adapter-based approach can improve the inner-loop convergence quality by eliminating robust overfitting. As a result, it significantly increases computational efficiency and improves model accuracy by generalizing adversarial robustness to unseen attacks. We demonstrate the effectiveness of the new adapter-based approach in different backbone architectures and in AT at scale.

Title: Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery

Authors: Robert Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17681
Pdf URL: https://arxiv.org/pdf/2508.17681
Copy Paste: [[2508.17681]] Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery(https://arxiv.org/abs/2508.17681)
Keywords: generative
Abstract: Bold claims about AI's role in science-from "AGI will cure all diseases" to promises of radically accelerated discovery-raise a central epistemic question: do large language models (LLMs) truly generate new knowledge, or do they merely remix memorized fragments? We propose unlearning-as-ablation as a falsifiable test of constructive scientific discovery. The method systematically removes a target result and its entire forget-closure (lemmas, paraphrases, and multi-hop entailments) and then evaluates whether the model can re-derive the result from only permitted axioms and tools. Success provides evidence for genuine generative capability; failure exposes current limits. Unlike prevailing motivations for unlearning-privacy, copyright, or safety-our framing repositions it as an epistemic probe for AI-for-Science. We argue that such tests could serve as the next generation of benchmarks, much as ImageNet catalyzed progress in vision: distinguishing models that can merely recall from those that can constructively generate new scientific knowledge. We outline a minimal pilot in mathematics and algorithms, and discuss extensions to physics, chemistry, and biology. Whether models succeed or fail, unlearning-as-ablation provides a principled framework to map the true reach and limits of AI scientific discovery. This is a position paper: we advance a conceptual and methodological argument rather than new empirical results.

Title: On the Edge of Memorization in Diffusion Models

Authors: Sam Buchanan, Druv Pai, Yi Ma, Valentin De Bortoli
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.17689
Pdf URL: https://arxiv.org/pdf/2508.17689
Copy Paste: [[2508.17689]] On the Edge of Memorization in Diffusion Models(https://arxiv.org/abs/2508.17689)
Keywords: diffusion
Abstract: When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between memorization and generalization may significantly impact real-world deployments of diffusion models with respect to issues such as copyright infringement and data privacy. In this work, to disentangle the different factors that influence memorization and generalization in practical diffusion models, we introduce a scientific and mathematical "laboratory" for investigating these phenomena in diffusion models trained on fully synthetic or natural image-like structured data. Within this setting, we hypothesize that the memorization or generalization behavior of an underparameterized trained model is determined by the difference in training loss between an associated memorizing model and a generalizing model. To probe this hypothesis, we theoretically characterize a crossover point wherein the weighted training loss of a fully generalizing model becomes greater than that of an underparameterized memorizing model at a critical value of model (under)parameterization. We then demonstrate via carefully-designed experiments that the location of this crossover predicts a phase transition in diffusion models trained via gradient descent, validating our hypothesis. Ultimately, our theory enables us to analytically predict the model size at which memorization becomes predominant. Our work provides an analytically tractable and practically meaningful setting for future theoretical and empirical investigations. Code for our experiments is available at this https URL.

Title: Copyright Protection for 3D Molecular Structures with Watermarking

Authors: Runwen Hu, Peilin Chen, Keyan Ding, Shiqi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17702
Pdf URL: https://arxiv.org/pdf/2508.17702
Copy Paste: [[2508.17702]] Copyright Protection for 3D Molecular Structures with Watermarking(https://arxiv.org/abs/2508.17702)
Keywords: generative
Abstract: Artificial intelligence (AI) revolutionizes molecule generation in bioengineering and biological research, significantly accelerating discovery processes. However, this advancement introduces critical concerns regarding intellectual property protection. To address these challenges, we propose the first robust watermarking method designed for molecules, which utilizes atom-level features to preserve molecular integrity and invariant features to ensure robustness against affine transformations. Comprehensive experiments validate the effectiveness of our method using the datasets QM9 and GEOM-DRUG, and generative models GeoBFN and GeoLDM. We demonstrate the feasibility of embedding watermarks, maintaining basic properties higher than 90.00\% while achieving watermark accuracy greater than 95.00\%. Furthermore, downstream docking simulations reveal comparable performance between original and watermarked molecules, with binding affinities reaching -6.00 kcal/mol and root mean square deviations below 1.602 Å. These results confirm that our watermarking technique effectively safeguards molecular intellectual property without compromising scientific utility, enabling secure and responsible AI integration in molecular discovery and research applications.

Title: CATformer: Contrastive Adversarial Transformer for Image Super-Resolution

Authors: Qinyi Tian, Spence Cox, Laura E. Dalton
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17708
Pdf URL: https://arxiv.org/pdf/2508.17708
Copy Paste: [[2508.17708]] CATformer: Contrastive Adversarial Transformer for Image Super-Resolution(https://arxiv.org/abs/2508.17708)
Keywords: diffusion
Abstract: Super-resolution remains a promising technique to enhance the quality of low-resolution images. This study introduces CATformer (Contrastive Adversarial Transformer), a novel neural network integrating diffusion-inspired feature refinement with adversarial and contrastive learning. CATformer employs a dual-branch architecture combining a primary diffusion-inspired transformer, which progressively refines latent representations, with an auxiliary transformer branch designed to enhance robustness to noise through learned latent contrasts. These complementary representations are fused and decoded using deep Residual-in-Residual Dense Blocks for enhanced reconstruction quality. Extensive experiments on benchmark datasets demonstrate that CATformer outperforms recent transformer-based and diffusion-inspired methods both in efficiency and visual image quality. This work bridges the performance gap among transformer-, diffusion-, and GAN-based methods, laying a foundation for practical applications of diffusion-inspired transformers in super-resolution.

Title: F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

Authors: Hanbo Bi, Zhiqiang Yuan, Zexi Jia, Jiapei Zhang, Chongyang Li, Peixiang Luo, Ying Deng, Xiaoyue Duan, Jinchao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17714
Pdf URL: https://arxiv.org/pdf/2508.17714
Copy Paste: [[2508.17714]] F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model(https://arxiv.org/abs/2508.17714)
Keywords: generative
Abstract: Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, they often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.

Title: Instant Preference Alignment for Text-to-Image Diffusion Models

Authors: Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17718
Pdf URL: https://arxiv.org/pdf/2508.17718
Copy Paste: [[2508.17718]] Instant Preference Alignment for Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.17718)
Keywords: diffusion
Abstract: Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. In this paper, we highlight the need for instant preference-aligned T2I generation and propose a training-free framework grounded in multimodal large language model (MLLM) priors. Our framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image and enrich a given prompt using structured instruction design. Our approach supports broader and more fine-grained coverage of user preferences than existing methods. For preference-guided generation, we integrate global keyword-based control and local region-aware cross-attention modulation to steer the diffusion model without additional training, enabling precise alignment across both global attributes and local elements. The entire framework supports multi-round interactive refinement, facilitating real-time and context-aware image generation. Extensive experiments on the Viper dataset and our collected benchmark demonstrate that our method outperforms prior approaches in both quantitative metrics and human evaluations, and opens up new possibilities for dialog-based generation and MLLM-diffusion integration.

Title: Few-shot Human Action Anomaly Detection via a Unified Contrastive Learning Framework

Authors: Koichiro Kamide, Shunsuke Sakai, Shun Maeda, Chunzhi Gu, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17726
Pdf URL: https://arxiv.org/pdf/2508.17726
Copy Paste: [[2508.17726]] Few-shot Human Action Anomaly Detection via a Unified Contrastive Learning Framework(https://arxiv.org/abs/2508.17726)
Keywords: diffusion, foundation model, generative, anomaly
Abstract: Human Action Anomaly Detection (HAAD) aims to identify anomalous actions given only normal action data during training. Existing methods typically follow a one-model-per-category paradigm, requiring separate training for each action category and a large number of normal samples. These constraints hinder scalability and limit applicability in real-world scenarios, where data is often scarce or novel categories frequently appear. To address these limitations, we propose a unified framework for HAAD that is compatible with few-shot scenarios. Our method constructs a category-agnostic representation space via contrastive learning, enabling AD by comparing test samples with a given small set of normal examples (referred to as the support set). To improve inter-category generalization and intra-category robustness, we introduce a generative motion augmentation strategy harnessing a diffusion-based foundation model for creating diverse and realistic training samples. Notably, to the best of our knowledge, our work is the first to introduce such a strategy specifically tailored to enhance contrastive learning for action AD. Extensive experiments on the HumanAct12 dataset demonstrate the state-of-the-art effectiveness of our approach under both seen and unseen category settings, regarding training efficiency and model scalability for few-shot HAAD.

Title: SMITE: Enhancing Fairness in LLMs through Optimal In-Context Example Selection via Dynamic Validation

Authors: Garima Chhikara, Kripabandhu Ghosh, Abhijnan Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17735
Pdf URL: https://arxiv.org/pdf/2508.17735
Copy Paste: [[2508.17735]] SMITE: Enhancing Fairness in LLMs through Optimal In-Context Example Selection via Dynamic Validation(https://arxiv.org/abs/2508.17735)
Keywords: in-context
Abstract: Large Language Models (LLMs) are widely used for downstream tasks such as tabular classification, where ensuring fairness in their outputs is critical for inclusivity, equal representation, and responsible AI deployment. This study introduces a novel approach to enhancing LLM performance and fairness through the concept of a dynamic validation set, which evolves alongside the test set, replacing the traditional static validation approach. We also propose an iterative algorithm, SMITE, to select optimal in-context examples, with each example set validated against its corresponding dynamic validation set. The in-context set with the lowest total error is used as the final demonstration set. Our experiments across four different LLMs show that our proposed techniques significantly improve both predictive accuracy and fairness compared to baseline methods. To our knowledge, this is the first study to apply dynamic validation in the context of in-context learning for LLMs.

Title: Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks

Authors: Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17744
Pdf URL: https://arxiv.org/pdf/2508.17744
Copy Paste: [[2508.17744]] Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks(https://arxiv.org/abs/2508.17744)
Keywords: generative
Abstract: In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. We consistently observe across 6 state-of-the-art text encoders and 26 downstream tasks, that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance, less than 10%, in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.

Title: SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling

Authors: Fanjiang Ye, Zepeng Zhao, Yi Mu, Jucheng Shen, Renjie Li, Kaijian Wang, Desen Sun, Saurabh Agarwal, Myungjin Lee, Triston Cao, Aditya Akella, Arvind Krishnamurthy, T.S. Eugene Ng, Zhengzhong Tu, Yuke Wang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2508.17756
Pdf URL: https://arxiv.org/pdf/2508.17756
Copy Paste: [[2508.17756]] SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling(https://arxiv.org/abs/2508.17756)
Keywords: diffusion, generative
Abstract: Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains. However, generating ultra-high-resolution videos on existing standard-resolution (e.g., 720p) platforms remains challenging due to the excessive re-training requirements and prohibitively high computational and memory costs. To this end, we introduce SuperGen, an efficient tile-based framework for ultra-high-resolution video generation. SuperGen features a novel training-free algorithmic innovation with tiling to successfully support a wide range of resolutions without additional training efforts while significantly reducing both memory footprint and computational complexity. Moreover, SuperGen incorporates a tile-tailored, adaptive, region-aware caching strategy that accelerates video generation by exploiting redundancy across denoising steps and spatial regions. SuperGen also integrates cache-guided, communication-minimized tile parallelism for enhanced throughput and minimized latency. Evaluations demonstrate that SuperGen harvests the maximum performance gains while achieving high output quality across various benchmarks.

Title: CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation

Authors: Mingyue Yang, Dianxi Shi, Jialu Zhou, Xinyu Wei, Leqian Li, Shaowu Yang, Chunping Qiu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.17760
Pdf URL: https://arxiv.org/pdf/2508.17760
Copy Paste: [[2508.17760]] CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation(https://arxiv.org/abs/2508.17760)
Keywords: diffusion
Abstract: In Text-to-Image (T2I) generation, the complexity of entities and their intricate interactions pose a significant challenge for T2I method based on diffusion model: how to effectively control entity and their interactions to produce high-quality images. To address this, we propose CEIDM, a image generation method based on diffusion model with dual controls for entity and interaction. First, we propose an entity interactive relationships mining approach based on Large Language Models (LLMs), extracting reasonable and rich implicit interactive relationships through chain of thought to guide diffusion models to generate high-quality images that are closer to realistic logic and have more reasonable interactive relationships. Furthermore, We propose an interactive action clustering and offset method to cluster and offset the interactive action features contained in each text prompts. By constructing global and local bidirectional offsets, we enhance semantic understanding and detail supplementation of original actions, making the model's understanding of the concept of interactive "actions" more accurate and generating images with more accurate interactive actions. Finally, we design an entity control network which generates masks with entity semantic guidance, then leveraging multi-scale convolutional network to enhance entity feature and dynamic network to fuse feature. It effectively controls entities and significantly improves image quality. Experiments show that the proposed CEIDM method is better than the most representative existing methods in both entity control and their interaction control.

Title: Proximal Supervised Fine-Tuning

Authors: Wenhong Zhu, Ruobing Xie, Rui Wang, Xingwu Sun, Di Wang, Pengfei Liu
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.17784
Pdf URL: https://arxiv.org/pdf/2508.17784
Copy Paste: [[2508.17784]] Proximal Supervised Fine-Tuning(https://arxiv.org/abs/2508.17784)
Keywords: foundation model
Abstract: Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT). This fine-tuning objective incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.

Title: Robust Anomaly Detection in Industrial Environments via Meta-Learning

Authors: Muhammad Aqeel, Shakiba Sharifi, Marco Cristani, Francesco Setti
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17789
Pdf URL: https://arxiv.org/pdf/2508.17789
Copy Paste: [[2508.17789]] Robust Anomaly Detection in Industrial Environments via Meta-Learning(https://arxiv.org/abs/2508.17789)
Keywords: anomaly
Abstract: Anomaly detection is fundamental for ensuring quality control and operational efficiency in industrial environments, yet conventional approaches face significant challenges when training data contains mislabeled samples-a common occurrence in real-world scenarios. This paper presents RAD, a robust anomaly detection framework that integrates Normalizing Flows with Model-Agnostic Meta-Learning to address the critical challenge of label noise in industrial settings. Our approach employs a bi-level optimization strategy where meta-learning enables rapid adaptation to varying noise conditions, while uncertainty quantification guides adaptive L2 regularization to maintain model stability. The framework incorporates multiscale feature processing through pretrained feature extractors and leverages the precise likelihood estimation capabilities of Normalizing Flows for robust anomaly scoring. Comprehensive evaluation on MVTec-AD and KSDD2 datasets demonstrates superior performance, achieving I-AUROC scores of 95.4% and 94.6% respectively under clean conditions, while maintaining robust detection capabilities above 86.8% and 92.1% even when 50% of training samples are mislabeled. The results highlight RAD's exceptional resilience to noisy training conditions and its ability to detect subtle anomalies across diverse industrial scenarios, making it a practical solution for real-world anomaly detection applications where perfect data curation is challenging.

Title: Multi-domain Distribution Learning for De Novo Drug Design

Authors: Arne Schneuing, Ilia Igashov, Adrian W. Dobbelstein, Thomas Castiglione, Michael Bronstein, Bruno Correia
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2508.17815
Pdf URL: https://arxiv.org/pdf/2508.17815
Copy Paste: [[2508.17815]] Multi-domain Distribution Learning for De Novo Drug Design(https://arxiv.org/abs/2508.17815)
Keywords: generative
Abstract: We introduce DrugFlow, a generative model for structure-based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state-of-the-art performance in learning chemical, geometric, and physical aspects of three-dimensional protein-ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out-of-distribution samples. To further enhance the sampling process towards distribution regions with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.

Title: UniSino: Physics-Driven Foundational Model for Universal CT Sinogram Standardization

Authors: Xingyu Ai, Shaoyu Wang, Zhiyuan Jia, Ao Xu, Hongming Shan, Jianhua Ma, Qiegen Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17816
Pdf URL: https://arxiv.org/pdf/2508.17816
Copy Paste: [[2508.17816]] UniSino: Physics-Driven Foundational Model for Universal CT Sinogram Standardization(https://arxiv.org/abs/2508.17816)
Keywords: foundation model
Abstract: During raw-data acquisition in CT imaging, diverse factors can degrade the collected sinograms, with undersampling and noise leading to severe artifacts and noise in reconstructed images and compromising diagnostic accuracy. Conventional correction methods rely on manually designed algorithms or fixed empirical parameters, but these approaches often lack generalizability across heterogeneous artifact types. To address these limitations, we propose UniSino, a foundation model for universal CT sinogram standardization. Unlike existing foundational models that operate in image domain, UniSino directly standardizes data in the projection domain, which enables stronger generalization across diverse undersampling scenarios. Its training framework incorporates the physical characteristics of sinograms, enhancing generalization and enabling robust performance across multiple subtasks spanning four benchmark datasets. Experimental results demonstrate thatUniSino achieves superior reconstruction quality both single and mixed undersampling case, demonstrating exceptional robustness and generalization in sinogram enhancement for CT imaging. The code is available at: this https URL.

Title: A Contrastive Learning-Guided Confident Meta-learning for Zero Shot Anomaly Detection

Authors: Muhammad Aqeel, Danijel Skocaj, Marco Cristani, Francesco Setti
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17827
Pdf URL: https://arxiv.org/pdf/2508.17827
Copy Paste: [[2508.17827]] A Contrastive Learning-Guided Confident Meta-learning for Zero Shot Anomaly Detection(https://arxiv.org/abs/2508.17827)
Keywords: anomaly
Abstract: Industrial and medical anomaly detection faces critical challenges from data scarcity and prohibitive annotation costs, particularly in evolving manufacturing and healthcare settings. To address this, we propose CoZAD, a novel zero-shot anomaly detection framework that integrates soft confident learning with meta-learning and contrastive feature representation. Unlike traditional confident learning that discards uncertain samples, our method assigns confidence-based weights to all training data, preserving boundary information while emphasizing prototypical normal patterns. The framework quantifies data uncertainty through IQR-based thresholding and model uncertainty via covariance based regularization within a Model-Agnostic Meta-Learning. Contrastive learning creates discriminative feature spaces where normal patterns form compact clusters, enabling rapid domain adaptation. Comprehensive evaluation across 10 datasets spanning industrial and medical domains demonstrates state-of-the-art performance, outperforming existing methods on 6 out of 7 industrial benchmarks with notable improvements on texture-rich datasets (99.2% I-AUROC on DTD-Synthetic, 97.2% on BTAD) and pixellevel localization (96.3% P-AUROC on MVTec-AD). The framework eliminates dependence on vision-language alignments or model ensembles, making it valuable for resourceconstrained environments requiring rapid deployment.

Title: Diffusion-Based Data Augmentation for Medical Image Segmentation

Authors: Maham Nazir, Muhammad Aqeel, Francesco Setti
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17844
Pdf URL: https://arxiv.org/pdf/2508.17844
Copy Paste: [[2508.17844]] Diffusion-Based Data Augmentation for Medical Image Segmentation(https://arxiv.org/abs/2508.17844)
Keywords: diffusion
Abstract: Medical image segmentation models struggle with rare abnormalities due to scarce annotated pathological data. We propose DiffAug a novel framework that combines textguided diffusion-based generation with automatic segmentation validation to address this challenge. Our proposed approach uses latent diffusion models conditioned on medical text descriptions and spatial masks to synthesize abnormalities via inpainting on normal images. Generated samples undergo dynamic quality validation through a latentspace segmentation network that ensures accurate localization while enabling single-step inference. The text prompts, derived from medical literature, guide the generation of diverse abnormality types without requiring manual annotation. Our validation mechanism filters synthetic samples based on spatial accuracy, maintaining quality while operating efficiently through direct latent estimation. Evaluated on three medical imaging benchmarks (CVC-ClinicDB, Kvasir-SEG, REFUGE2), our framework achieves state-of-the-art performance with 8-10% Dice improvements over baselines and reduces false negative rates by up to 28% for challenging cases like small polyps and flat lesions critical for early detection in screening applications.

Title: Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs

Authors: Dingdong Wang, Junan Li, Mingyu Cui, Dongchao Yang, Xueyuan Chen, Helen Meng
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2508.17863
Pdf URL: https://arxiv.org/pdf/2508.17863
Copy Paste: [[2508.17863]] Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs(https://arxiv.org/abs/2508.17863)
Keywords: self-supervised
Abstract: With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficient comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.

Title: Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection

Authors: Dabbrata Das, Mahshar Yahan, Md Tareq Zaman, Md Rishadul Bayesh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17877
Pdf URL: https://arxiv.org/pdf/2508.17877
Copy Paste: [[2508.17877]] Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection(https://arxiv.org/abs/2508.17877)
Keywords: generative
Abstract: The rapid advancement of generative models has led to a growing prevalence of highly realistic AI-generated images, posing significant challenges for digital forensics and content authentication. Conventional detection methods mainly rely on deep learning models that extract global features, which often overlook subtle structural inconsistencies and demand substantial computational resources. To address these limitations, we propose a hybrid detection framework that combines a fine-tuned Vision Transformer (ViT) with a novel edge-based image processing module. The edge-based module computes variance from edge-difference maps generated before and after smoothing, exploiting the observation that AI-generated images typically exhibit smoother textures, weaker edges, and reduced noise compared to real images. When applied as a post-processing step on ViT predictions, this module enhances sensitivity to fine-grained structural cues while maintaining computational efficiency. Extensive experiments on the CIFAKE, Artistic, and Custom Curated datasets demonstrate that the proposed framework achieves superior detection performance across all benchmarks, attaining 97.75% accuracy and a 97.77% F1-score on CIFAKE, surpassing widely adopted state-of-the-art models. These results establish the proposed method as a lightweight, interpretable, and effective solution for both still images and video frames, making it highly suitable for real-world applications in automated content verification and digital forensics.

Title: Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs

Authors: Domenico De Cristofaro, Vincenzo Norman Vitale, Alessandro Vietti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.17914
Pdf URL: https://arxiv.org/pdf/2508.17914
Copy Paste: [[2508.17914]] Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs(https://arxiv.org/abs/2508.17914)
Keywords: self-supervised
Abstract: Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.

Title: EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images

Authors: Xinning Yao, Bo Liu, Bojian Li, Jingjing Wang, Jinghua Yue, Fugen Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17916
Pdf URL: https://arxiv.org/pdf/2508.17916
Copy Paste: [[2508.17916]] EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images(https://arxiv.org/abs/2508.17916)
Keywords: foundation model
Abstract: Depth estimation is a foundational component for 3D reconstruction in minimally invasive endoscopic surgeries. However, existing monocular depth estimation techniques often exhibit limited performance to the varying illumination and complex textures of the surgical environment. While powerful visual foundation models offer a promising solution, their training on natural images leads to significant domain adaptability limitations and semantic perception deficiencies when applied to endoscopy. In this study, we introduce EndoUFM, an unsupervised monocular depth estimation framework that innovatively integrating dual foundation models for surgical scenes, which enhance the depth estimation performance by leveraging the powerful pre-learned priors. The framework features a novel adaptive fine-tuning strategy that incorporates Random Vector Low-Rank Adaptation (RVLoRA) to enhance model adaptability, and a Residual block based on Depthwise Separable Convolution (Res-DSC) to improve the capture of fine-grained local features. Furthermore, we design a mask-guided smoothness loss to enforce depth consistency within anatomical tissue structures. Extensive experiments on the SCARED, Hamlyn, SERV-CT, and EndoNeRF datasets confirm that our method achieves state-of-the-art performance while maintaining an efficient model size. This work contributes to augmenting surgeons' spatial perception during minimally invasive procedures, thereby enhancing surgical precision and safety, with crucial implications for augmented reality and navigation systems.

Title: Generative Feature Imputing - A Technique for Error-resilient Semantic Communication

Authors: Jianhao Huang, Qunsong Zeng, Hongyang Du, Kaibin Huang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.17957
Pdf URL: https://arxiv.org/pdf/2508.17957
Copy Paste: [[2508.17957]] Generative Feature Imputing - A Technique for Error-resilient Semantic Communication(https://arxiv.org/abs/2508.17957)
Keywords: diffusion, generative
Abstract: Semantic communication (SemCom) has emerged as a promising paradigm for achieving unprecedented communication efficiency in sixth-generation (6G) networks by leveraging artificial intelligence (AI) to extract and transmit the underlying meanings of source data. However, deploying SemCom over digital systems presents new challenges, particularly in ensuring robustness against transmission errors that may distort semantically critical content. To address this issue, this paper proposes a novel framework, termed generative feature imputing, which comprises three key techniques. First, we introduce a spatial error concentration packetization strategy that spatially concentrates feature distortions by encoding feature elements based on their channel mappings, a property crucial for both the effectiveness and reduced complexity of the subsequent techniques. Second, building on this strategy, we propose a generative feature imputing method that utilizes a diffusion model to efficiently reconstruct missing features caused by packet losses. Finally, we develop a semantic-aware power allocation scheme that enables unequal error protection by allocating transmission power according to the semantic importance of each packet. Experimental results demonstrate that the proposed framework outperforms conventional approaches, such as Deep Joint Source-Channel Coding (DJSCC) and JPEG2000, under block fading conditions, achieving higher semantic accuracy and lower Learned Perceptual Image Patch Similarity (LPIPS) scores.

Title: A Novel Framework for Uncertainty Quantification via Proper Scores for Classification and Beyond

Authors: Sebastian G. Gruber
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.18001
Pdf URL: https://arxiv.org/pdf/2508.18001
Copy Paste: [[2508.18001]] A Novel Framework for Uncertainty Quantification via Proper Scores for Classification and Beyond(https://arxiv.org/abs/2508.18001)
Keywords: generative
Abstract: In this PhD thesis, we propose a novel framework for uncertainty quantification in machine learning, which is based on proper scores. Uncertainty quantification is an important cornerstone for trustworthy and reliable machine learning applications in practice. Usually, approaches to uncertainty quantification are problem-specific, and solutions and insights cannot be readily transferred from one task to another. Proper scores are loss functions minimized by predicting the target distribution. Due to their very general definition, proper scores apply to regression, classification, or even generative modeling tasks. We contribute several theoretical results, that connect epistemic uncertainty, aleatoric uncertainty, and model calibration with proper scores, resulting in a general and widely applicable framework. We achieve this by introducing a general bias-variance decomposition for strictly proper scores via functional Bregman divergences. Specifically, we use the kernel score, a kernel-based proper score, for evaluating sample-based generative models in various domains, like image, audio, and natural language generation. This includes a novel approach for uncertainty estimation of large language models, which outperforms state-of-the-art baselines. Further, we generalize the calibration-sharpness decomposition beyond classification, which motivates the definition of proper calibration errors. We then introduce a novel estimator for proper calibration errors in classification, and a novel risk-based approach to compare different estimators for squared calibration errors. Last, we offer a decomposition of the kernel spherical score, another kernel-based proper score, allowing a more fine-grained and interpretable evaluation of generative image models.

Title: Fence off Anomaly Interference: Cross-Domain Distillation for Fully Unsupervised Anomaly Detection

Authors: Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18007
Pdf URL: https://arxiv.org/pdf/2508.18007
Copy Paste: [[2508.18007]] Fence off Anomaly Interference: Cross-Domain Distillation for Fully Unsupervised Anomaly Detection(https://arxiv.org/abs/2508.18007)
Keywords: anomaly
Abstract: Fully Unsupervised Anomaly Detection (FUAD) is a practical extension of Unsupervised Anomaly Detection (UAD), aiming to detect anomalies without any labels even when the training set may contain anomalous samples. To achieve FUAD, we pioneer the introduction of Knowledge Distillation (KD) paradigm based on teacher-student framework into the FUAD setting. However, due to the presence of anomalies in the training data, traditional KD methods risk enabling the student to learn the teacher's representation of anomalies under FUAD setting, thereby resulting in poor anomaly detection performance. To address this issue, we propose a novel Cross-Domain Distillation (CDD) framework based on the widely studied reverse distillation (RD) paradigm. Specifically, we design a Domain-Specific Training, which divides the training set into multiple domains with lower anomaly ratios and train a domain-specific student for each. Cross-Domain Knowledge Aggregation is then performed, where pseudo-normal features generated by domain-specific students collaboratively guide a global student to learn generalized normal representations across all samples. Experimental results on noisy versions of the MVTec AD and VisA datasets demonstrate that our method achieves significant performance improvements over the baseline, validating its effectiveness under FUAD setting.

Title: Towards Continual Visual Anomaly Detection in the Medical Domain

Authors: Manuel Barusco, Francesco Borsatti, Nicola Beda, Davide Dalle Pezze, Gian Antonio Susto
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.18013
Pdf URL: https://arxiv.org/pdf/2508.18013
Copy Paste: [[2508.18013]] Towards Continual Visual Anomaly Detection in the Medical Domain(https://arxiv.org/abs/2508.18013)
Keywords: anomaly
Abstract: Visual Anomaly Detection (VAD) seeks to identify abnormal images and precisely localize the corresponding anomalous regions, relying solely on normal data during training. This approach has proven essential in domains such as manufacturing and, more recently, in the medical field, where accurate and explainable detection is critical. Despite its importance, the impact of evolving input data distributions over time has received limited attention, even though such changes can significantly degrade model performance. In particular, given the dynamic and evolving nature of medical imaging data, Continual Learning (CL) provides a natural and effective framework to incrementally adapt models while preserving previously acquired knowledge. This study explores for the first time the application of VAD models in a CL scenario for the medical field. In this work, we utilize a CL version of the well-established PatchCore model, called PatchCoreCL, and evaluate its performance using BMAD, a real-world medical imaging dataset with both image-level and pixel-level annotations. Our results demonstrate that PatchCoreCL is an effective solution, achieving performance comparable to the task-specific models, with a forgetting value less than a 1%, highlighting the feasibility and potential of CL for adaptive VAD in medical imaging.

Title: FCR: Investigating Generative AI models for Forensic Craniofacial Reconstruction

Authors: Ravi Shankar Prasad, Dinesh Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18031
Pdf URL: https://arxiv.org/pdf/2508.18031
Copy Paste: [[2508.18031]] FCR: Investigating Generative AI models for Forensic Craniofacial Reconstruction(https://arxiv.org/abs/2508.18031)
Keywords: generative
Abstract: Craniofacial reconstruction in forensics is one of the processes to identify victims of crime and natural disasters. Identifying an individual from their remains plays a crucial role when all other identification methods fail. Traditional methods for this task, such as clay-based craniofacial reconstruction, require expert domain knowledge and are a time-consuming process. At the same time, other probabilistic generative models like the statistical shape model or the Basel face model fail to capture the skull and face cross-domain attributes. Looking at these limitations, we propose a generic framework for craniofacial reconstruction from 2D X-ray images. Here, we used various generative models (i.e., CycleGANs, cGANs, etc) and fine-tune the generator and discriminator parts to generate more realistic images in two distinct domains, which are the skull and face of an individual. This is the first time where 2D X-rays are being used as a representation of the skull by generative models for craniofacial reconstruction. We have evaluated the quality of generated faces using FID, IS, and SSIM scores. Finally, we have proposed a retrieval framework where the query is the generated face image and the gallery is the database of real faces. By experimental results, we have found that this can be an effective tool for forensic science.

Title: Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images

Authors: Kaiyu Li, Xiangyong Cao, Ruixun Liu, Shihong Wang, Zixuan Jiang, Zhi Wang, Deyu Meng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18067
Pdf URL: https://arxiv.org/pdf/2508.18067
Copy Paste: [[2508.18067]] Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images(https://arxiv.org/abs/2508.18067)
Keywords: foundation model
Abstract: Semantic segmentation of remote sensing (RS) images is pivotal for comprehensive Earth observation, but the demand for interpreting new object categories, coupled with the high expense of manual annotation, poses significant challenges. Although open-vocabulary semantic segmentation (OVSS) offers a promising solution, existing frameworks designed for natural images are insufficient for the unique complexities of RS data. They struggle with vast scale variations and fine-grained details, and their adaptation often relies on extensive, costly annotations. To address this critical gap, this paper introduces SegEarth-OV, the first framework for annotation-free open-vocabulary segmentation of RS images. Specifically, we propose SimFeatUp, a universal upsampler that robustly restores high-resolution spatial details from coarse features, correcting distorted target shapes without any task-specific post-training. We also present a simple yet effective Global Bias Alleviation operation to subtract the inherent global context from patch features, significantly enhancing local semantic fidelity. These components empower SegEarth-OV to effectively harness the rich semantics of pre-trained VLMs, making OVSS possible in optical RS contexts. Furthermore, to extend the framework's universality to other challenging RS modalities like SAR images, where large-scale VLMs are unavailable and expensive to create, we introduce AlignEarth, which is a distillation-based strategy and can efficiently transfer semantic knowledge from an optical VLM encoder to an SAR encoder, bypassing the need to build SAR foundation models from scratch and enabling universal OVSS across diverse sensor types. Extensive experiments on both optical and SAR datasets validate that SegEarth-OV can achieve dramatic improvements over the SOTA methods, establishing a robust foundation for annotation-free and open-world Earth observation.

Title: How Quantization Shapes Bias in Large Language Models

Authors: Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2508.18088
Pdf URL: https://arxiv.org/pdf/2508.18088
Copy Paste: [[2508.18088]] How Quantization Shapes Bias in Large Language Models(https://arxiv.org/abs/2508.18088)
Keywords: generative
Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.

Title: Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study

Authors: Monica Gonzalez-Machorro, Uwe Reichel, Pascal Hecker, Helly Hammer, Hesam Sagha, Florian Eyben, Robert Hoepner, Björn W. Schuller
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18092
Pdf URL: https://arxiv.org/pdf/2508.18092
Copy Paste: [[2508.18092]] Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study(https://arxiv.org/abs/2508.18092)
Keywords: generative
Abstract: Depression commonly co-occurs with neurodegenerative disorders like Multiple Sclerosis (MS), yet the potential of speech-based Artificial Intelligence for detecting depression in such contexts remains unexplored. This study examines the transferability of speech-based depression detection methods to people with MS (pwMS) through cross-corpus and cross-lingual analysis using English data from the general population and German data from pwMS. Our approach implements supervised machine learning models using: 1) conventional speech and language features commonly used in the field, 2) emotional dimensions derived from a Speech Emotion Recognition (SER) model, and 3) exploratory speech feature analysis. Despite limited data, our models detect depressive mood in pwMS with moderate generalisability, achieving a 66% Unweighted Average Recall (UAR) on a binary task. Feature selection further improved performance, boosting UAR to 74%. Our findings also highlight the relevant role emotional changes have as an indicator of depressive mood in both the general population and within PwMS. This study provides an initial exploration into generalising speech-based depression detection, even in the presence of co-occurring conditions, such as neurodegenerative diseases.

Title: Incorporating Pre-trained Diffusion Models in Solving the Schrödinger Bridge Problem

Authors: Zhicong Tang, Tiankai Hang, Shuyang Gu, Dong Chen, Baining Guo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.18095
Pdf URL: https://arxiv.org/pdf/2508.18095
Copy Paste: [[2508.18095]] Incorporating Pre-trained Diffusion Models in Solving the Schrödinger Bridge Problem(https://arxiv.org/abs/2508.18095)
Keywords: diffusion, generative
Abstract: This paper aims to unify Score-based Generative Models (SGMs), also known as Diffusion models, and the Schrödinger Bridge (SB) problem through three reparameterization techniques: Iterative Proportional Mean-Matching (IPMM), Iterative Proportional Terminus-Matching (IPTM), and Iterative Proportional Flow-Matching (IPFM). These techniques significantly accelerate and stabilize the training of SB-based models. Furthermore, the paper introduces novel initialization strategies that use pre-trained SGMs to effectively train SB-based models. By using SGMs as initialization, we leverage the advantages of both SB-based models and SGMs, ensuring efficient training of SB-based models and further improving the performance of SGMs. Extensive experiments demonstrate the significant effectiveness and improvements of the proposed methods. We believe this work contributes to and paves the way for future research on generative models.

Title: Provable Mixed-Noise Learning with Flow-Matching

Authors: Paul Hagemann, Robert Gruhlke, Bernhard Stankewitz, Claudia Schillings, Gabriele Steidl
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2508.18122
Pdf URL: https://arxiv.org/pdf/2508.18122
Copy Paste: [[2508.18122]] Provable Mixed-Noise Learning with Flow-Matching(https://arxiv.org/abs/2508.18122)
Keywords: generative
Abstract: We study Bayesian inverse problems with mixed noise, modeled as a combination of additive and multiplicative Gaussian components. While traditional inference methods often assume fixed or known noise characteristics, real-world applications, particularly in physics and chemistry, frequently involve noise with unknown and heterogeneous structure. Motivated by recent advances in flow-based generative modeling, we propose a novel inference framework based on conditional flow matching embedded within an Expectation-Maximization (EM) algorithm to jointly estimate posterior samplers and noise parameters. To enable high-dimensional inference and improve scalability, we use simulation-free ODE-based flow matching as the generative model in the E-step of the EM algorithm. We prove that, under suitable assumptions, the EM updates converge to the true noise parameters in the population limit of infinite observations. Our numerical results illustrate the effectiveness of combining EM inference with flow matching for mixed-noise Bayesian inverse problems.

Title: Learning from Few Samples: A Novel Approach for High-Quality Malcode Generation

Authors: Haijian Ma, Daizong Liu, Xiaowen Cai, Pan Zhou, Yulai Xie
Subjects: cs.CR, cs.AI
Abstract URL: https://arxiv.org/abs/2508.18148
Pdf URL: https://arxiv.org/pdf/2508.18148
Copy Paste: [[2508.18148]] Learning from Few Samples: A Novel Approach for High-Quality Malcode Generation(https://arxiv.org/abs/2508.18148)
Keywords: generative
Abstract: Intrusion Detection Systems (IDS) play a crucial role in network security defense. However, a significant challenge for IDS in training detection models is the shortage of adequately labeled malicious samples. To address these issues, this paper introduces a novel semi-supervised framework \textbf{GANGRL-LLM}, which integrates Generative Adversarial Networks (GANs) with Large Language Models (LLMs) to enhance malicious code generation and SQL Injection (SQLi) detection capabilities in few-sample learning scenarios. Specifically, our framework adopts a collaborative training paradigm where: (1) the GAN-based discriminator improves malicious pattern recognition through adversarial learning with generated samples and limited real samples; and (2) the LLM-based generator refines the quality of malicious code synthesis using reward signals from the discriminator. The experimental results demonstrate that even with a limited number of labeled samples, our training framework is highly effective in enhancing both malicious code generation and detection capabilities. This dual enhancement capability offers a promising solution for developing adaptive defense systems capable of countering evolving cyber threats.

Title: $AutoGuardX$: A Comprehensive Cybersecurity Framework for Connected Vehicles

Authors: Muhammad Ali Nadeem, Bishwo Prakash Pokharel, Naresh Kshetri, Achyut Shankar, Gokarna Sharma
Subjects: cs.CR, cs.ET
Abstract URL: https://arxiv.org/abs/2508.18155
Pdf URL: https://arxiv.org/pdf/2508.18155
Copy Paste: [[2508.18155]] $AutoGuardX$: A Comprehensive Cybersecurity Framework for Connected Vehicles(https://arxiv.org/abs/2508.18155)
Keywords: anomaly
Abstract: The rapid integration of Internet of Things (IoT) and interconnected systems in modern vehicles not only introduced a new era of convenience, automation, and connected vehicles but also elevated their exposure to sophisticated cyber threats. This is especially evident in US and Canada, where cyber-enabled auto theft has surged in recent years, revealing the limitations of existing security measures for connected vehicles. In response, this paper proposes $AutoGuardX$, a comprehensive cybersecurity framework designed specifically for connected vehicles. $AutoGuardX$ combines key elements from existing recognized standards for vehicle security, such as ISO/SAE 21434 and ISO 26262, with advanced technologies, including machine learning-based anomaly detection, IoT security protocols, and encrypted communication channels. The framework addresses major attack vectors like relay attacks, controller area network (CAN) bus intrusions, and vulnerabilities introduced by emerging technologies such as 5G and quantum computing. $AutoGuardX$ is extensively evaluated through security simulations across a mix of Sedans and SUVs from four major vehicle brands manufactured between 2019 and 2023. The results demonstrate the framework's adaptability, scalability, and practical effectiveness against existing and emerging threats.

Title: SpotEdit: Evaluating Visually-Guided Image Editing Methods

Authors: Sara Ghazanfari, Wei-An Lin, Haitong Tian, Ersin Yumer
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.18159
Pdf URL: https://arxiv.org/pdf/2508.18159
Copy Paste: [[2508.18159]] SpotEdit: Evaluating Visually-Guided Image Editing Methods(https://arxiv.org/abs/2508.18159)
Keywords: diffusion, generative
Abstract: Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at this https URL.

Title: Amortized Sampling with Transferable Normalizing Flows

Authors: Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.18175
Pdf URL: https://arxiv.org/pdf/2508.18175
Copy Paste: [[2508.18175]] Amortized Sampling with Transferable Normalizing Flows(https://arxiv.org/abs/2508.18175)
Keywords: generative
Abstract: Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in-full for each system of interest. The widespread success of generative models has inspired interest into overcoming this limitation through learning sampling algorithms. Despite performing on par with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We prove that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 280 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve superior performance to established methods such as sequential Monte Carlo on unseen tetrapeptides. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.

Title: Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios

Authors: Luana Bulla, Gabriele Tuccio, Misael Mongiovì, Aldo Gangemi
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2508.18183
Pdf URL: https://arxiv.org/pdf/2508.18183
Copy Paste: [[2508.18183]] Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios(https://arxiv.org/abs/2508.18183)
Keywords: in-context
Abstract: Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora which align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian languages using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenario. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.

Title: Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries

Authors: Meiling Ning, Zhongbao Zhang, Junda Ye, Jiabao Guo, Qingyuan Guan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.18212
Pdf URL: https://arxiv.org/pdf/2508.18212
Copy Paste: [[2508.18212]] Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries(https://arxiv.org/abs/2508.18212)
Keywords: generative
Abstract: The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that the slot prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance compared to mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation based slot framework for prediction to fully leverage the advantages of MLMs. Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals compared to generative reward models.

Title: Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance

Authors: Ayce Idil Aytekin, Helge Rhodin, Rishabh Dabral, Christian Theobalt
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18213
Pdf URL: https://arxiv.org/pdf/2508.18213
Copy Paste: [[2508.18213]] Follow My Hold: Hand-Object Interaction Reconstruction through Geometric Guidance(https://arxiv.org/abs/2508.18213)
Keywords: diffusion
Abstract: We propose a novel diffusion-based framework for reconstructing 3D geometry of hand-held objects from monocular RGB images by leveraging hand-object interaction as geometric guidance. Our method conditions a latent diffusion model on an inpainted object appearance and uses inference-time guidance to optimize the object reconstruction, while simultaneously ensuring plausible hand-object interactions. Unlike prior methods that rely on extensive post-processing or produce low-quality reconstructions, our approach directly generates high-quality object geometry during the diffusion process by introducing guidance with an optimization-in-the-loop design. Specifically, we guide the diffusion model by applying supervision to the velocity field while simultaneously optimizing the transformations of both the hand and the object being reconstructed. This optimization is driven by multi-modal geometric cues, including normal and depth alignment, silhouette consistency, and 2D keypoint reprojection. We further incorporate signed distance field supervision and enforce contact and non-intersection constraints to ensure physical plausibility of hand-object interaction. Our method yields accurate, robust and coherent reconstructions under occlusion while generalizing well to in-the-wild scenarios.

Title: Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation

Authors: Ashwath Vaithinathan Aravindan, Abha Jha, Matthew Salaway, Atharva Sandeep Bhide, Duygu Nur Yaldiz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18235
Pdf URL: https://arxiv.org/pdf/2508.18235
Copy Paste: [[2508.18235]] Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation(https://arxiv.org/abs/2508.18235)
Keywords: diffusion, generative
Abstract: Text-to-image diffusion models have revolutionized generative AI, but their vulnerability to backdoor attacks poses significant security risks. Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. Although text-based backdoor defenses in classification models are well-explored, generative models lack effective mitigation techniques against. We address this by selectively erasing the model's learned associations between adversarial text triggers and poisoned outputs, while preserving overall generation quality. Our approach, Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), uses knowledge distillation to guide the model in correcting responses to poisoned prompts while maintaining image quality by exploiting the fact that the backdoored model still produces clean outputs in the absence of triggers. Using the cross-attention mechanism, SKD-CAG neutralizes backdoor influences at the attention level, ensuring the targeted removal of adversarial effects. Extensive experiments show that our method outperforms existing approaches, achieving removal accuracy 100\% for pixel backdoors and 93\% for style-based attacks, without sacrificing robustness or image fidelity. Our findings highlight targeted unlearning as a promising defense to secure generative models. Code and model weights can be found at this https URL .

Title: Interpretable Evaluation of AI-Generated Content with Language-Grounded Sparse Encoders

Authors: Yiming Tang, Arash Lagzian, Srinivas Anumasa, Qiran Zou, Trang Nguyen, Ehsan Adeli, Ching-Yu Cheng, Yilun Du, Dianbo Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18236
Pdf URL: https://arxiv.org/pdf/2508.18236
Copy Paste: [[2508.18236]] Interpretable Evaluation of AI-Generated Content with Language-Grounded Sparse Encoders(https://arxiv.org/abs/2508.18236)
Keywords: generative
Abstract: While the quality of AI-generated contents, such as synthetic images, has become remarkably high, current evaluation metrics provide only coarse-grained assessments, failing to identify specific strengths and weaknesses that researchers and practitioners need for model selection and development, further limiting the scientific understanding and commercial deployment of these generative models. To address this, we introduce Language-Grounded Sparse Encoders (LanSE), a novel architecture that creates interpretable evaluation metrics by identifying interpretable visual patterns and automatically describing them in natural language. Through large-scale human evaluation (more than 11,000 annotations) and large multimodal model (LMM) based analysis, LanSE demonstrates reliable capabilities to detect interpretable visual patterns in synthetic images with more than 93\% accuracy in natural images. LanSE further provides a fine-grained evaluation framework that quantifies four key dimensions of generation quality, prompt match, visual realism, physical plausibility, and content diversity. LanSE reveals nuanced model differences invisible to existing metrics, for instance, FLUX's superior physical plausibility and SDXL-medium's strong content diversity, while aligning with human judgments. By bridging interpretability with practical evaluation needs, LanSE offers all users of generative AI models a powerful tool for model selection, quality control of synthetic content, and model improvement. These capabilities directly address the need for public confidence and safety in AI-generated content, both critical for the future of generative AI applications.

Title: ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models

Authors: Haitang Feng, Jie Liu, Jie Tang, Gangshan Wu, Beiqi Chen, Jianhuang Lai, Guangcong Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18271
Pdf URL: https://arxiv.org/pdf/2508.18271
Copy Paste: [[2508.18271]] ObjFiller-3D: Consistent Multi-view 3D Inpainting via Video Diffusion Models(https://arxiv.org/abs/2508.18271)
Keywords: diffusion
Abstract: 3D inpainting often relies on multi-view 2D image inpainting, where the inherent inconsistencies across different inpainted views can result in blurred textures, spatial discontinuities, and distracting visual artifacts. These inconsistencies pose significant challenges when striving for accurate and realistic 3D object completion, particularly in applications that demand high fidelity and structural coherence. To overcome these limitations, we propose ObjFiller-3D, a novel method designed for the completion and editing of high-quality and consistent 3D objects. Instead of employing a conventional 2D image inpainting model, our approach leverages a curated selection of state-of-the-art video editing model to fill in the masked regions of 3D objects. We analyze the representation gap between 3D and videos, and propose an adaptation of a video inpainting model for 3D scene inpainting. In addition, we introduce a reference-based 3D inpainting method to further enhance the quality of reconstruction. Experiments across diverse datasets show that compared to previous methods, ObjFiller-3D produces more faithful and fine-grained reconstructions (PSNR of 26.6 vs. NeRFiller (15.9) and LPIPS of 0.19 vs. Instant3dit (0.25)). Moreover, it demonstrates strong potential for practical deployment in real-world 3D editing applications. Project page: this https URL Code: this https URL .