2025-08-07

Title: PLA: Prompt Learning Attack against Text-to-Image Generative Models

Authors: Xinqi Lyu, Yihao Liu, Yanjie Li, Bin Xiao
Subjects: cs.CR, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2508.03696
Pdf URL: https://arxiv.org/pdf/2508.03696
Copy Paste: [[2508.03696]] PLA: Prompt Learning Attack against Text-to-Image Generative Models(https://arxiv.org/abs/2508.03696)
Keywords: generative
Abstract: Text-to-Image (T2I) models have gained widespread adoption across various applications. Despite the success, the potential misuse of T2I models poses significant risks of generating Not-Safe-For-Work (NSFW) content. To investigate the vulnerability of T2I models, this paper delves into adversarial attacks to bypass the safety mechanisms under black-box settings. Most previous methods rely on word substitution to search adversarial prompts. Due to limited search space, this leads to suboptimal performance compared to gradient-based training. However, black-box settings present unique challenges to training gradient-driven attack methods, since there is no access to the internal architecture and parameters of T2I models. To facilitate the learning of adversarial prompts in black-box settings, we propose a novel prompt learning attack framework (PLA), where insightful gradient-based training tailored to black-box T2I models is designed by utilizing multimodal similarities. Experiments show that our new method can effectively attack the safety mechanisms of black-box T2I models including prompt filters and post-hoc safety checkers with a high success rate compared to state-of-the-art methods. Warning: This paper may contain offensive model-generated content.

Title: From Waveforms to Pixels: A Survey on Audio-Visual Segmentation

Authors: Jia Li, Yapeng Tian
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03724
Pdf URL: https://arxiv.org/pdf/2508.03724
Copy Paste: [[2508.03724]] From Waveforms to Pixels: A Survey on Audio-Visual Segmentation(https://arxiv.org/abs/2508.03724)
Keywords: foundation model
Abstract: Audio-Visual Segmentation (AVS) aims to identify and segment sound-producing objects in videos by leveraging both visual and audio modalities. It has emerged as a significant research area in multimodal perception, enabling fine-grained object-level understanding. In this survey, we present a comprehensive overview of the AVS field, covering its problem formulation, benchmark datasets, evaluation metrics, and the progression of methodologies. We analyze a wide range of approaches, including architectures for unimodal and multimodal encoding, key strategies for audio-visual fusion, and various decoder designs. Furthermore, we examine major training paradigms, from fully supervised learning to weakly supervised and training-free methods. Notably, we provide an extensive comparison of AVS methods across standard benchmarks, highlighting the impact of different architectural choices, fusion strategies, and training paradigms on performance. Finally, we outline the current challenges, such as limited temporal modeling, modality bias toward vision, lack of robustness in complex environments, and high computational demands, and propose promising future directions, including improving temporal reasoning and multimodal fusion, leveraging foundation models for better generalization and few-shot learning, reducing reliance on labeled data through selfand weakly supervised learning, and incorporating higher-level reasoning for more intelligent AVS systems.

Title: TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization

Authors: Tai Hyoung Rhee, Dong-guw Lee, Ayoung Kim
Subjects: cs.CV, cs.RO, eess.IV
Abstract URL: https://arxiv.org/abs/2508.03727
Pdf URL: https://arxiv.org/pdf/2508.03727
Copy Paste: [[2508.03727]] TIR-Diffusion: Diffusion-based Thermal Infrared Image Denoising via Latent and Wavelet Domain Optimization(https://arxiv.org/abs/2508.03727)
Keywords: diffusion
Abstract: Thermal infrared imaging exhibits considerable potentials for robotic perception tasks, especially in environments with poor visibility or challenging lighting conditions. However, TIR images typically suffer from heavy non-uniform fixed-pattern noise, complicating tasks such as object detection, localization, and mapping. To address this, we propose a diffusion-based TIR image denoising framework leveraging latent-space representations and wavelet-domain optimization. Utilizing a pretrained stable diffusion model, our method fine-tunes the model via a novel loss function combining latent-space and discrete wavelet transform (DWT) / dual-tree complex wavelet transform (DTCWT) losses. Additionally, we implement a cascaded refinement stage to enhance fine details, ensuring high-fidelity denoising results. Experiments on benchmark datasets demonstrate superior performance of our approach compared to state-of-the-art denoising methods. Furthermore, our method exhibits robust zero-shot generalization to diverse and challenging real-world TIR datasets, underscoring its effectiveness for practical robotic deployment.

Title: CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning

Authors: Wenjie Li, Yujie Zhang, Haoran Sun, Yueqi Li, Fanrui Zhang, Mengzhe Xu, Victoria Borja Clausich, Sade Mellin, Renhao Yang, Chenrun Wang, Jethro Zih-Shuo Wang, Shiyi Yao, Gen Li, Yidong Xu, Hanyu Wang, Yilin Huang, Angela Lin Wang, Chen Shi, Yin Zhang, Jianan Guo, Luqi Yang, Renxuan Li, Yang Xu, Jiawei Liu, Yao Zhang, Lei Liu, Carlos Gutiérrez SanRomán, Lei Wang
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2508.03733
Pdf URL: https://arxiv.org/pdf/2508.03733
Copy Paste: [[2508.03733]] CX-Mind: A Pioneering Multimodal Large Language Model for Interleaved Reasoning in Chest X-ray via Curriculum-Guided Reinforcement Learning(https://arxiv.org/abs/2508.03733)
Keywords: generative
Abstract: Chest X-ray (CXR) imaging is one of the most widely used diagnostic modalities in clinical practice, encompassing a broad spectrum of diagnostic tasks. Recent advancements have seen the extensive application of reasoning-based multimodal large language models (MLLMs) in medical imaging to enhance diagnostic efficiency and interpretability. However, existing multimodal models predominantly rely on "one-time" diagnostic approaches, lacking verifiable supervision of the reasoning process. This leads to challenges in multi-task CXR diagnosis, including lengthy reasoning, sparse rewards, and frequent hallucinations. To address these issues, we propose CX-Mind, the first generative model to achieve interleaved "think-answer" reasoning for CXR tasks, driven by curriculum-based reinforcement learning and verifiable process rewards (CuRL-VPR). Specifically, we constructed an instruction-tuning dataset, CX-Set, comprising 708,473 images and 2,619,148 samples, and generated 42,828 high-quality interleaved reasoning data points supervised by clinical reports. Optimization was conducted in two stages under the Group Relative Policy Optimization framework: initially stabilizing basic reasoning with closed-domain tasks, followed by transfer to open-domain diagnostics, incorporating rule-based conditional process rewards to bypass the need for pretrained reward models. Extensive experimental results demonstrate that CX-Mind significantly outperforms existing medical and general-domain MLLMs in visual understanding, text generation, and spatiotemporal alignment, achieving an average performance improvement of 25.1% over comparable CXR-specific models. On real-world clinical dataset (Rui-CXR), CX-Mind achieves a mean recall@1 across 14 diseases that substantially surpasses the second-best results, with multi-center expert evaluations further confirming its clinical utility across multiple dimensions.

Title: StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

Authors: Gopalji Gaur, Mohammadreza Zolfaghari, Thomas Brox
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03735
Pdf URL: https://arxiv.org/pdf/2508.03735
Copy Paste: [[2508.03735]] StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization(https://arxiv.org/abs/2508.03735)
Keywords: diffusion
Abstract: Generating a coherent sequence of images that tells a visual story, using text-to-image diffusion models, often faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. In this paper, we follow a training-free approach and propose an efficient consistent-subject-generation method. This approach works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing to dynamically align subject features across a batch of images, and Regional Feature Harmonization to refine visually similar details for improved subject consistency. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.

Title: LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation

Authors: Yongchao Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03766
Pdf URL: https://arxiv.org/pdf/2508.03766
Copy Paste: [[2508.03766]] LLM-Prior: A Framework for Knowledge-Driven Prior Elicitation and Aggregation(https://arxiv.org/abs/2508.03766)
Keywords: generative
Abstract: The specification of prior distributions is fundamental in Bayesian inference, yet it remains a significant bottleneck. The prior elicitation process is often a manual, subjective, and unscalable task. We propose a novel framework which leverages Large Language Models (LLMs) to automate and scale this process. We introduce \texttt{LLMPrior}, a principled operator that translates rich, unstructured contexts such as natural language descriptions, data or figures into valid, tractable probability distributions. We formalize this operator by architecturally coupling an LLM with an explicit, tractable generative model, such as a Gaussian Mixture Model (forming a LLM based Mixture Density Network), ensuring the resulting prior satisfies essential mathematical properties. We further extend this framework to multi-agent systems where Logarithmic Opinion Pooling is employed to aggregate prior distributions induced by decentralized knowledge. We present the federated prior aggregation algorithm, \texttt{Fed-LLMPrior}, for aggregating distributed, context-dependent priors in a manner robust to agent heterogeneity. This work provides the foundation for a new class of tools that can potentially lower the barrier to entry for sophisticated Bayesian modeling.

Title: Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings

Authors: Debamita Ghosh, George K. Atia, Yue Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03768
Pdf URL: https://arxiv.org/pdf/2508.03768
Copy Paste: [[2508.03768]] Provably Near-Optimal Distributionally Robust Reinforcement Learning in Online Settings(https://arxiv.org/abs/2508.03768)
Keywords: generative
Abstract: Reinforcement learning (RL) faces significant challenges in real-world deployments due to the sim-to-real gap, where policies trained in simulators often underperform in practice due to mismatches between training and deployment conditions. Distributionally robust RL addresses this issue by optimizing worst-case performance over an uncertainty set of environments and providing an optimized lower bound on deployment performance. However, existing studies typically assume access to either a generative model or offline datasets with broad coverage of the deployment environment -- assumptions that limit their practicality in unknown environments without prior knowledge. In this work, we study the more realistic and challenging setting of online distributionally robust RL, where the agent interacts only with a single unknown training environment while aiming to optimize its worst-case performance. We focus on general $f$-divergence-based uncertainty sets, including Chi-Square and KL divergence balls, and propose a computationally efficient algorithm with sublinear regret guarantees under minimal assumptions. Furthermore, we establish a minimax lower bound on regret of online learning, demonstrating the near-optimality of our approach. Extensive experiments across diverse environments further confirm the robustness and efficiency of our algorithm, validating our theoretical findings.

Title: SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons

Authors: Teodor Chiaburu, Vipin Singh, Frank Haußer, Felix Bießmann
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03785
Pdf URL: https://arxiv.org/pdf/2508.03785
Copy Paste: [[2508.03785]] SoilNet: A Multimodal Multitask Model for Hierarchical Classification of Soil Horizons(https://arxiv.org/abs/2508.03785)
Keywords: foundation model
Abstract: While recent advances in foundation models have improved the state of the art in many domains, some problems in empirical sciences could not benefit from this progress yet. Soil horizon classification, for instance, remains challenging because of its multimodal and multitask characteristics and a complex hierarchically structured label taxonomy. Accurate classification of soil horizons is crucial for monitoring soil health, which directly impacts agricultural productivity, food security, ecosystem stability and climate resilience. In this work, we propose $\textit{SoilNet}$ - a multimodal multitask model to tackle this problem through a structured modularized pipeline. Our approach integrates image data and geotemporal metadata to first predict depth markers, segmenting the soil profile into horizon candidates. Each segment is characterized by a set of horizon-specific morphological features. Finally, horizon labels are predicted based on the multimodal concatenated feature vector, leveraging a graph-based label representation to account for the complex hierarchical relationships among soil horizons. Our method is designed to address complex hierarchical classification, where the number of possible labels is very large, imbalanced and non-trivially structured. We demonstrate the effectiveness of our approach on a real-world soil profile dataset. All code and experiments can be found in our repository: this https URL

Title: HPSv3: Towards Wide-Spectrum Human Preference Score

Authors: Yuhang Ma, Xiaoshi Wu, Keqiang Sun, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.03789
Pdf URL: https://arxiv.org/pdf/2508.03789
Copy Paste: [[2508.03789]] HPSv3: Towards Wide-Spectrum Human Preference Score(https://arxiv.org/abs/2508.03789)
Keywords: generative
Abstract: Evaluating text-to-image generation models requires alignment with human perception, yet existing human-centric metrics are constrained by limited data coverage, suboptimal feature extraction, and inefficient loss functions. To address these challenges, we introduce Human Preference Score v3 (HPSv3). (1) We release HPDv3, the first wide-spectrum human preference dataset integrating 1.08M text-image pairs and 1.17M annotated pairwise comparisons from state-of-the-art generative models and low to high-quality real-world images. (2) We introduce a VLM-based preference model trained using an uncertainty-aware ranking loss for fine-grained ranking. Besides, we propose Chain-of-Human-Preference (CoHP), an iterative image refinement method that enhances quality without extra data, using HPSv3 to select the best image at each step. Extensive experiments demonstrate that HPSv3 serves as a robust metric for wide-spectrum image evaluation, and CoHP offers an efficient and human-aligned approach to improve image generation quality. The code and dataset are available at the HPSv3 Homepage.

Title: VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations

Authors: Yifei Zong, Alexandre M. Tartakovsky
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2508.03839
Pdf URL: https://arxiv.org/pdf/2508.03839
Copy Paste: [[2508.03839]] VAE-DNN: Energy-Efficient Trainable-by-Parts Surrogate Model For Parametric Partial Differential Equations(https://arxiv.org/abs/2508.03839)
Keywords: diffusion
Abstract: We propose a trainable-by-parts surrogate model for solving forward and inverse parameterized nonlinear partial differential equations. Like several other surrogate and operator learning models, the proposed approach employs an encoder to reduce the high-dimensional input $y(\bm{x})$ to a lower-dimensional latent space, $\bm\mu_{\bm\phi_y}$. Then, a fully connected neural network is used to map $\bm\mu_{\bm\phi_y}$ to the latent space, $\bm\mu_{\bm\phi_h}$, of the PDE solution $h(\bm{x},t)$. Finally, a decoder is utilized to reconstruct $h(\bm{x},t)$. The innovative aspect of our model is its ability to train its three components independently. This approach leads to a substantial decrease in both the time and energy required for training when compared to leading operator learning models such as FNO and DeepONet. The separable training is achieved by training the encoder as part of the variational autoencoder (VAE) for $y(\bm{x})$ and the decoder as part of the $h(\bm{x},t)$ VAE. We refer to this model as the VAE-DNN model. VAE-DNN is compared to the FNO and DeepONet models for obtaining forward and inverse solutions to the nonlinear diffusion equation governing groundwater flow in an unconfined aquifer. Our findings indicate that VAE-DNN not only demonstrates greater efficiency but also delivers superior accuracy in both forward and inverse solutions compared to the FNO and DeepONet models.

Title: Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data

Authors: John D. Kelleher, Matthew Nicholson, Rahul Agrahari, Clare Conran
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03921
Pdf URL: https://arxiv.org/pdf/2508.03921
Copy Paste: [[2508.03921]] Active Learning and Transfer Learning for Anomaly Detection in Time-Series Data(https://arxiv.org/abs/2508.03921)
Keywords: anomaly
Abstract: This paper examines the effectiveness of combining active learning and transfer learning for anomaly detection in cross-domain time-series data. Our results indicate that there is an interaction between clustering and active learning and in general the best performance is achieved using a single cluster (in other words when clustering is not applied). Also, we find that adding new samples to the training set using active learning does improve model performance but that in general, the rate of improvement is slower than the results reported in the literature suggest. We attribute this difference to an improved experimental design where distinct data samples are used for the sampling and testing pools. Finally, we assess the ceiling performance of transfer learning in combination with active learning across several datasets and find that performance does initially improve but eventually begins to tail off as more target points are selected for inclusion in training. This tail-off in performance may indicate that the active learning process is doing a good job of sequencing data points for selection, pushing the less useful points towards the end of the selection process and that this tail-off occurs when these less useful points are eventually added. Taken together our results indicate that active learning is effective but that the improvement in model performance follows a linear flat function concerning the number of points selected and labelled.

Title: Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model

Authors: Shen Zhu, Yinzhu Jin, Ifrah Zawar, P. Thomas Fletcher
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.03925
Pdf URL: https://arxiv.org/pdf/2508.03925
Copy Paste: [[2508.03925]] Point-Based Shape Representation Generation with a Correspondence-Preserving Diffusion Model(https://arxiv.org/abs/2508.03925)
Keywords: diffusion, generative
Abstract: We propose a diffusion model designed to generate point-based shape representations with correspondences. Traditional statistical shape models have considered point correspondences extensively, but current deep learning methods do not take them into account, focusing on unordered point clouds instead. Current deep generative models for point clouds do not address generating shapes with point correspondences between generated shapes. This work aims to formulate a diffusion model that is capable of generating realistic point-based shape representations, which preserve point correspondences that are present in the training data. Using shape representation data with correspondences derived from Open Access Series of Imaging Studies 3 (OASIS-3), we demonstrate that our correspondence-preserving model effectively generates point-based hippocampal shape representations that are highly realistic compared to existing methods. We further demonstrate the applications of our generative model by downstream tasks, such as conditional generation of healthy and AD subjects and predicting morphological changes of disease progression by counterfactual generation.

Title: Markov Chain Estimation with In-Context Learning

Authors: Simon Lepage, Jeremie Mary, David Picard
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.03934
Pdf URL: https://arxiv.org/pdf/2508.03934
Copy Paste: [[2508.03934]] Markov Chain Estimation with In-Context Learning(https://arxiv.org/abs/2508.03934)
Keywords: in-context
Abstract: We investigate the capacity of transformers to learn algorithms involving their context while solely being trained using next token prediction. We set up Markov chains with random transition matrices and we train transformers to predict the next token. Matrices used during training and test are different and we show that there is a threshold in transformer size and in training set size above which the model is able to learn to estimate the transition probabilities from its context instead of memorizing the training patterns. Additionally, we show that more involved encoding of the states enables more robust prediction for Markov chains with structures different than those seen during training.

Title: RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification

Authors: Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Abdenour Hadid
Subjects: cs.CV, cs.CR, cs.IR
Abstract URL: https://arxiv.org/abs/2508.03967
Pdf URL: https://arxiv.org/pdf/2508.03967
Copy Paste: [[2508.03967]] RAVID: Retrieval-Augmented Visual Detection: A Knowledge-Driven Approach for AI-Generated Image Identification(https://arxiv.org/abs/2508.03967)
Keywords: foundation model, generative
Abstract: In this paper, we introduce RAVID, the first framework for AI-generated image detection that leverages visual retrieval-augmented generation (RAG). While RAG methods have shown promise in mitigating factual inaccuracies in foundation models, they have primarily focused on text, leaving visual knowledge underexplored. Meanwhile, existing detection methods, which struggle with generalization and robustness, often rely on low-level artifacts and model-specific features, limiting their adaptability. To address this, RAVID dynamically retrieves relevant images to enhance detection. Our approach utilizes a fine-tuned CLIP image encoder, RAVID CLIP, enhanced with category-related prompts to improve representation learning. We further integrate a vision-language model (VLM) to fuse retrieved images with the query, enriching the input and improving accuracy. Given a query image, RAVID generates an embedding using RAVID CLIP, retrieves the most relevant images from a database, and combines these with the query image to form an enriched input for a VLM (e.g., Qwen-VL or Openflamingo). Experiments on the UniversalFakeDetect benchmark, which covers 19 generative models, show that RAVID achieves state-of-the-art performance with an average accuracy of 93.85%. RAVID also outperforms traditional methods in terms of robustness, maintaining high accuracy even under image degradations such as Gaussian blur and JPEG compression. Specifically, RAVID achieves an average accuracy of 80.27% under degradation conditions, compared to 63.44% for the state-of-the-art model C2P-CLIP, demonstrating consistent improvements in both Gaussian blur and JPEG compression scenarios. The code will be publicly available upon acceptance.

Title: Data and AI governance: Promoting equity, ethics, and fairness in large language models

Authors: Alok Abhishek, Lisa Erickson, Tushar Bandopadhyay
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.03970
Pdf URL: https://arxiv.org/pdf/2508.03970
Copy Paste: [[2508.03970]] Data and AI governance: Promoting equity, ethics, and fairness in large language models(https://arxiv.org/abs/2508.03970)
Keywords: generative
Abstract: In this paper, we cover approaches to systematically govern, assess and quantify bias across the complete life cycle of machine learning models, from initial development and validation to ongoing production monitoring and guardrail implementation. Building upon our foundational work on the Bias Evaluation and Assessment Test Suite (BEATS) for Large Language Models, the authors share prevalent bias and fairness related gaps in Large Language Models (LLMs) and discuss data and AI governance framework to address Bias, Ethics, Fairness, and Factuality within LLMs. The data and AI governance approach discussed in this paper is suitable for practical, real-world applications, enabling rigorous benchmarking of LLMs prior to production deployment, facilitating continuous real-time evaluation, and proactively governing LLM generated responses. By implementing the data and AI governance across the life cycle of AI development, organizations can significantly enhance the safety and responsibility of their GenAI systems, effectively mitigating risks of discrimination and protecting against potential reputational or brand-related harm. Ultimately, through this article, we aim to contribute to advancement of the creation and deployment of socially responsible and ethically aligned generative artificial intelligence powered applications.

Title: Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models

Authors: Xinyu Zhao, Zhen Tan, Maya Enisman, Minjae Seo, Marta R. Durantini, Dolores Albarracin, Tianlong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.03998
Pdf URL: https://arxiv.org/pdf/2508.03998
Copy Paste: [[2508.03998]] Transferring Expert Cognitive Models to Social Robots via Agentic Concept Bottleneck Models(https://arxiv.org/abs/2508.03998)
Keywords: foundation model
Abstract: Successful group meetings, such as those implemented in group behavioral-change programs, work meetings, and other social contexts, must promote individual goal setting and execution while strengthening the social relationships within the group. Consequently, an ideal facilitator must be sensitive to the subtle dynamics of disengagement, difficulties with individual goal setting and execution, and interpersonal difficulties that signal a need for intervention. The challenges and cognitive load experienced by facilitators create a critical gap for an embodied technology that can interpret social exchanges while remaining aware of the needs of the individuals in the group and providing transparent recommendations that go beyond powerful but "black box" foundation models (FMs) that identify social cues. We address this important demand with a social robot co-facilitator that analyzes multimodal meeting data and provides discreet cues to the facilitator. The robot's reasoning is powered by an agentic concept bottleneck model (CBM), which makes decisions based on human-interpretable concepts like participant engagement and sentiments, ensuring transparency and trustworthiness. Our core contribution is a transfer learning framework that distills the broad social understanding of an FM into our specialized and transparent CBM. This concept-driven system significantly outperforms direct zero-shot FMs in predicting the need for intervention and enables real-time human correction of its reasoning. Critically, we demonstrate robust knowledge transfer: the model generalizes across different groups and successfully transfers the expertise of senior human facilitators to improve the performance of novices. By transferring an expert's cognitive model into an interpretable robotic partner, our work provides a powerful blueprint for augmenting human capabilities in complex social domains.

Title: CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation

Authors: Zheyuan Zhou, Jiayi Han, Liang Du, Naiyu Fang, Lemiao Qiu, Shuyou Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04002
Pdf URL: https://arxiv.org/pdf/2508.04002
Copy Paste: [[2508.04002]] CAD-Judge: Toward Efficient Morphological Grading and Verification for Text-to-CAD Generation(https://arxiv.org/abs/2508.04002)
Keywords: generative
Abstract: Computer-Aided Design (CAD) models are widely used across industrial design, simulation, and manufacturing processes. Text-to-CAD systems aim to generate editable, general-purpose CAD models from textual descriptions, significantly reducing the complexity and entry barrier associated with traditional CAD workflows. However, rendering CAD models can be slow, and deploying VLMs to review CAD models can be expensive and may introduce reward hacking that degrades the systems. To address these challenges, we propose CAD-Judge, a novel, verifiable reward system for efficient and effective CAD preference grading and grammatical validation. We adopt the Compiler-as-a-Judge Module (CJM) as a fast, direct reward signal, optimizing model alignment by maximizing generative utility through prospect theory. To further improve the robustness of Text-to-CAD in the testing phase, we introduce a simple yet effective agentic CAD generation approach and adopt the Compiler-as-a-Review Module (CRM), which efficiently verifies the generated CAD models, enabling the system to refine them accordingly. Extensive experiments on challenging CAD datasets demonstrate that our method achieves state-of-the-art performance while maintaining superior efficiency.

Title: $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation

Authors: Weilun Feng, Haotong Qin, Chuanguang Yang, Xiangqi Li, Han Yang, Yuqi Li, Zhulin An, Libo Huang, Michele Magno, Yongjun Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04016
Pdf URL: https://arxiv.org/pdf/2508.04016
Copy Paste: [[2508.04016]] $\text{S}^2$Q-VDiT: Accurate Quantized Video Diffusion Transformer with Salient Data and Sparse Token Distillation(https://arxiv.org/abs/2508.04016)
Keywords: diffusion
Abstract: Diffusion transformers have emerged as the mainstream paradigm for video generation models. However, the use of up to billions of parameters incurs significant computational costs. Quantization offers a promising solution by reducing memory usage and accelerating inference. Nonetheless, we observe that the joint modeling of spatial and temporal information in video diffusion models (V-DMs) leads to extremely long token sequences, which introduces high calibration variance and learning challenges. To address these issues, we propose \textbf{$\text{S}^2$Q-VDiT}, a post-training quantization framework for V-DMs that leverages \textbf{S}alient data and \textbf{S}parse token distillation. During the calibration phase, we identify that quantization performance is highly sensitive to the choice of calibration data. To mitigate this, we introduce \textit{Hessian-aware Salient Data Selection}, which constructs high-quality calibration datasets by considering both diffusion and quantization characteristics unique to V-DMs. To tackle the learning challenges, we further analyze the sparse attention patterns inherent in V-DMs. Based on this observation, we propose \textit{Attention-guided Sparse Token Distillation}, which exploits token-wise attention distributions to emphasize tokens that are more influential to the model's output. Under W4A6 quantization, $\text{S}^2$Q-VDiT achieves lossless performance while delivering $3.9\times$ model compression and $1.3\times$ inference acceleration. Code will be available at \href{this https URL}{this https URL}.

Title: FeDaL: Federated Dataset Learning for Time Series Foundation Models

Authors: Shengchao Chen, Guodong Long, Jing Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04045
Pdf URL: https://arxiv.org/pdf/2508.04045
Copy Paste: [[2508.04045]] FeDaL: Federated Dataset Learning for Time Series Foundation Models(https://arxiv.org/abs/2508.04045)
Keywords: foundation model
Abstract: Dataset-wise heterogeneity introduces significant domain biases that fundamentally degrade generalization on Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethink the development of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a nature solution to decompose heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL`s cross-dataset generalization has been extensively evaluated in real-world datasets spanning eight tasks, including both representation learning and downstream time series analysis, against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization.

Title: Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion

Authors: Fangmin Zhao, Weichao Zeng, Zhenhang Li, Dongbao Yang, Binbin Li, Xiaojun Bi, Yu Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04055
Pdf URL: https://arxiv.org/pdf/2508.04055
Copy Paste: [[2508.04055]] Uni-DocDiff: A Unified Document Restoration Model Based on Diffusion(https://arxiv.org/abs/2508.04055)
Keywords: diffusion
Abstract: Removing various degradations from damaged documents greatly benefits digitization, downstream document analysis, and readability. Previous methods often treat each restoration task independently with dedicated models, leading to a cumbersome and highly complex document processing system. Although recent studies attempt to unify multiple tasks, they often suffer from limited scalability due to handcrafted prompts and heavy preprocessing, and fail to fully exploit inter-task synergy within a shared architecture. To address the aforementioned challenges, we propose Uni-DocDiff, a Unified and highly scalable Document restoration model based on Diffusion. Uni-DocDiff develops a learnable task prompt design, ensuring exceptional scalability across diverse tasks. To further enhance its multi-task capabilities and address potential task interference, we devise a novel \textbf{Prior \textbf{P}ool}, a simple yet comprehensive mechanism that combines both local high-frequency features and global low-frequency features. Additionally, we design the \textbf{Prior \textbf{F}usion \textbf{M}odule (PFM)}, which enables the model to adaptively select the most relevant prior information for each specific task. Extensive experiments show that the versatile Uni-DocDiff achieves performance comparable or even superior performance compared with task-specific expert models, and simultaneously holds the task scalability for seamless adaptation to new tasks.

Title: GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Authors: Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04088
Pdf URL: https://arxiv.org/pdf/2508.04088
Copy Paste: [[2508.04088]] GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning(https://arxiv.org/abs/2508.04088)
Keywords: generative
Abstract: Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM's generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset. Our code will be released upon acceptance.

Title: Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework

Authors: Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, Jia-Bin Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04090
Pdf URL: https://arxiv.org/pdf/2508.04090
Copy Paste: [[2508.04090]] Bridging Diffusion Models and 3D Representations: A 3D Consistent Super-Resolution Framework(https://arxiv.org/abs/2508.04090)
Keywords: diffusion
Abstract: We propose 3D Super Resolution (3DSR), a novel 3D Gaussian-splatting-based super-resolution framework that leverages off-the-shelf diffusion-based 2D super-resolution models. 3DSR encourages 3D consistency across views via the use of an explicit 3D Gaussian-splatting-based scene representation. This makes the proposed 3DSR different from prior work, such as image upsampling or the use of video super-resolution, which either don't consider 3D consistency or aim to incorporate 3D consistency implicitly. Notably, our method enhances visual quality without additional fine-tuning, ensuring spatial coherence within the reconstructed scene. We evaluate 3DSR on MipNeRF360 and LLFF data, demonstrating that it produces high-resolution results that are visually compelling, while maintaining structural consistency in 3D reconstructions. Code will be released.

Title: Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?

Authors: Ngoc-Bao Nguyen, Sy-Tuyen Ho, Koh Jun Hao, Ngai-Man Cheung
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04097
Pdf URL: https://arxiv.org/pdf/2508.04097
Copy Paste: [[2508.04097]] Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn?(https://arxiv.org/abs/2508.04097)
Keywords: generative
Abstract: Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. While prior works have focused on conventional unimodal DNNs, the vulnerability of vision-language models (VLMs) remains underexplored. In this paper, we conduct the first study to understand VLMs' vulnerability in leaking private visual training data. To tailored for VLMs' token-based generative nature, we propose a suite of novel token-based and sequence-based model inversion strategies. Particularly, we propose Token-based Model Inversion (TMI), Convergent Token-based Model Inversion (TMI-C), Sequence-based Model Inversion (SMI), and Sequence-based Model Inversion with Adaptive Token Weighting (SMI-AW). Through extensive experiments and user study on three state-of-the-art VLMs and multiple datasets, we demonstrate, for the first time, that VLMs are susceptible to training data leakage. The experiments show that our proposed sequence-based methods, particularly SMI-AW combined with a logit-maximization loss based on vocabulary representation, can achieve competitive reconstruction and outperform token-based methods in attack accuracy and visual similarity. Importantly, human evaluation of the reconstructed images yields an attack accuracy of 75.31\%, underscoring the severity of model inversion threats in VLMs. Notably we also demonstrate inversion attacks on the publicly released VLMs. Our study reveals the privacy vulnerability of VLMs as they become increasingly popular across many applications such as healthcare and finance.

Title: Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation

Authors: Maximilian Ulmer, Wout Boerdijk, Rudolph Triebel, Maximilian Durner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04122
Pdf URL: https://arxiv.org/pdf/2508.04122
Copy Paste: [[2508.04122]] Conditional Latent Diffusion Models for Zero-Shot Instance Segmentation(https://arxiv.org/abs/2508.04122)
Keywords: diffusion, generative
Abstract: This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks.

Title: IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control

Authors: Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04147
Pdf URL: https://arxiv.org/pdf/2508.04147
Copy Paste: [[2508.04147]] IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control(https://arxiv.org/abs/2508.04147)
Keywords: diffusion
Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control, enhancing control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly feed for downstream 3D Scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at this https URL.

Title: ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation

Authors: Yihua Shao, Xiaofeng Lin, Xinwei Long, Siyu Chen, Minxi Yan, Yang Liu, Ziyang Yan, Ao Ma, Hao Tang, Jingcai Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04153
Pdf URL: https://arxiv.org/pdf/2508.04153
Copy Paste: [[2508.04153]] ICM-Fusion: In-Context Meta-Optimized LoRA Fusion for Multi-Task Adaptation(https://arxiv.org/abs/2508.04153)
Keywords: in-context
Abstract: Enabling multi-task adaptation in pre-trained Low-Rank Adaptation (LoRA) models is crucial for enhancing their generalization capabilities. Most existing pre-trained LoRA fusion methods decompose weight matrices, sharing similar parameters while merging divergent ones. However, this paradigm inevitably induces inter-weight conflicts and leads to catastrophic domain forgetting. While incremental learning enables adaptation to multiple tasks, it struggles to achieve generalization in few-shot scenarios. Consequently, when the weight data follows a long-tailed distribution, it can lead to forgetting in the fused weights. To address this issue, we propose In-Context Meta LoRA Fusion (ICM-Fusion), a novel framework that synergizes meta-learning with in-context adaptation. The key innovation lies in our task vector arithmetic, which dynamically balances conflicting optimization directions across domains through learned manifold projections. ICM-Fusion obtains the optimal task vector orientation for the fused model in the latent space by adjusting the orientation of the task vectors. Subsequently, the fused LoRA is reconstructed by a self-designed Fusion VAE (F-VAE) to realize multi-task LoRA generation. We have conducted extensive experiments on visual and linguistic tasks, and the experimental results demonstrate that ICM-Fusion can be adapted to a wide range of architectural models and applied to various tasks. Compared to the current pre-trained LoRA fusion method, ICM-Fusion fused LoRA can significantly reduce the multi-tasking loss and can even achieve task enhancement in few-shot scenarios.

Title: AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization

Authors: Jingyi Liao, Yongyi Su, Rong-Cheng Tu, Zhao Jin, Wenhao Sun, Yiting Li, Dacheng Tao, Xun Xu, Xulei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04175
Pdf URL: https://arxiv.org/pdf/2508.04175
Copy Paste: [[2508.04175]] AD-FM: Multimodal LLMs for Anomaly Detection via Multi-Stage Reasoning and Fine-Grained Reward Optimization(https://arxiv.org/abs/2508.04175)
Keywords: anomaly
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities across diverse domains, their application to specialized anomaly detection (AD) remains constrained by domain adaptation challenges. Existing Group Relative Policy Optimization (GRPO) based approaches suffer from two critical limitations: inadequate training data utilization when models produce uniform responses, and insufficient supervision over reasoning processes that encourage immediate binary decisions without deliberative analysis. We propose a comprehensive framework addressing these limitations through two synergistic innovations. First, we introduce a multi-stage deliberative reasoning process that guides models from region identification to focused examination, generating diverse response patterns essential for GRPO optimization while enabling structured supervision over analytical workflows. Second, we develop a fine-grained reward mechanism incorporating classification accuracy and localization supervision, transforming binary feedback into continuous signals that distinguish genuine analytical insight from spurious correctness. Comprehensive evaluation across multiple industrial datasets demonstrates substantial performance improvements in adapting general vision-language models to specialized anomaly detection. Our method achieves superior accuracy with efficient adaptation of existing annotations, effectively bridging the gap between general-purpose MLLM capabilities and the fine-grained visual discrimination required for detecting subtle manufacturing defects and structural irregularities.

Title: DP-DocLDM: Differentially Private Document Image Generation using Latent Diffusion Models

Authors: Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Subjects: cs.CR
Abstract URL: https://arxiv.org/abs/2508.04208
Pdf URL: https://arxiv.org/pdf/2508.04208
Copy Paste: [[2508.04208]] DP-DocLDM: Differentially Private Document Image Generation using Latent Diffusion Models(https://arxiv.org/abs/2508.04208)
Keywords: diffusion
Abstract: As deep learning-based, data-driven information extraction systems become increasingly integrated into modern document processing workflows, one primary concern is the risk of malicious leakage of sensitive private data from these systems. While some recent works have explored Differential Privacy (DP) to mitigate these privacy risks, DP-based training is known to cause significant performance degradation and impose several limitations on standard training procedures, making its direct application to downstream tasks both difficult and costly. In this work, we aim to address the above challenges within the context of document image classification by substituting real private data with a synthetic counterpart. In particular, we propose to use conditional latent diffusion models (LDMs) in combination with differential privacy (DP) to generate class-specific synthetic document images under strict privacy constraints, which can then be utilized to train a downstream classifier following standard training procedures. We investigate our approach under various pretraining setups, including unconditional, class-conditional, and layout-conditional pretraining, in combination with multiple private training strategies such as class-conditional and per-label private fine-tuning with DPDM and DP-Promise algorithms. Additionally, we evaluate it on two well-known document benchmark datasets, RVL-CDIP and Tobacco3482, and show that it can generate useful and realistic document samples across various document types and privacy levels ($\varepsilon \in \{1, 5, 10\}$). Lastly, we show that our approach achieves substantial performance improvements in downstream evaluations on small-scale datasets, compared to the direct application of DP-Adam.

Title: LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation

Authors: Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo, Guangtao Zhai, Xiaohong Liu
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2508.04228
Pdf URL: https://arxiv.org/pdf/2508.04228
Copy Paste: [[2508.04228]] LayerT2V: Interactive Multi-Object Trajectory Layering for Video Generation(https://arxiv.org/abs/2508.04228)
Keywords: generative
Abstract: Controlling object motion trajectories in Text-to-Video (T2V) generation is a challenging and relatively under-explored area, particularly in scenarios involving multiple moving objects. Most community models and datasets in the T2V domain are designed for single-object motion, limiting the performance of current generative models in multi-object tasks. Additionally, existing motion control methods in T2V either lack support for multi-object motion scenes or experience severe performance degradation when object trajectories intersect, primarily due to the semantic conflicts in colliding regions. To address these limitations, we introduce LayerT2V, the first approach for generating video by compositing background and foreground objects layer by layer. This layered generation enables flexible integration of multiple independent elements within a video, positioning each element on a distinct "layer" and thus facilitating coherent multi-object synthesis while enhancing control over the generation process. Extensive experiments demonstrate the superiority of LayerT2V in generating complex multi-object scenarios, showcasing 1.4x and 4.5x improvements in mIoU and AP50 metrics over state-of-the-art (SOTA) methods. Project page and code are available at this https URL .

Title: Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction

Authors: Yu Liu, Zhijie Liu, Xiao Ren, You-Fu Li, He Kong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04229
Pdf URL: https://arxiv.org/pdf/2508.04229
Copy Paste: [[2508.04229]] Intention Enhanced Diffusion Model for Multimodal Pedestrian Trajectory Prediction(https://arxiv.org/abs/2508.04229)
Keywords: diffusion
Abstract: Predicting pedestrian motion trajectories is critical for path planning and motion control of autonomous vehicles. However, accurately forecasting crowd trajectories remains a challenging task due to the inherently multimodal and uncertain nature of human motion. Recent diffusion-based models have shown promising results in capturing the stochasticity of pedestrian behavior for trajectory prediction. However, few diffusion-based approaches explicitly incorporate the underlying motion intentions of pedestrians, which can limit the interpretability and precision of prediction models. In this work, we propose a diffusion-based multimodal trajectory prediction model that incorporates pedestrians' motion intentions into the prediction framework. The motion intentions are decomposed into lateral and longitudinal components, and a pedestrian intention recognition module is introduced to enable the model to effectively capture these intentions. Furthermore, we adopt an efficient guidance mechanism that facilitates the generation of interpretable trajectories. The proposed framework is evaluated on two widely used human trajectory prediction benchmarks, ETH and UCY, on which it is compared against state-of-the-art methods. The experimental results demonstrate that our method achieves competitive performance.

Title: DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification

Authors: Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04233
Pdf URL: https://arxiv.org/pdf/2508.04233
Copy Paste: [[2508.04233]] DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification(https://arxiv.org/abs/2508.04233)
Keywords: diffusion, generative
Abstract: As black-box AI-driven decision-making systems become increasingly widespread in modern document processing workflows, improving their transparency and reliability has become critical, especially in high-stakes applications where biases or spurious correlations in decision-making could lead to serious consequences. One vital component often found in such document processing workflows is document image classification, which, despite its widespread use, remains difficult to explain. While some recent works have attempted to explain the decisions of document image classification models through feature-importance maps, these maps are often difficult to interpret and fail to provide insights into the global features learned by the model. In this paper, we aim to bridge this research gap by introducing generative document counterfactuals that provide meaningful insights into the model's decision-making through actionable explanations. In particular, we propose DocVCE, a novel approach that leverages latent diffusion models in combination with classifier guidance to first generate plausible in-distribution visual counterfactual explanations, and then performs hierarchical patch-wise refinement to search for a refined counterfactual that is closest to the target factual image. We demonstrate the effectiveness of our approach through a rigorous qualitative and quantitative assessment on 3 different document classification datasets -- RVL-CDIP, Tobacco3482, and DocLayNet -- and 3 different models -- ResNet, ConvNeXt, and DiT -- using well-established evaluation criteria such as validity, closeness, and realism. To the best of the authors' knowledge, this is the first work to explore generative counterfactual explanations in document image analysis.

Title: PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction

Authors: Muhua Zhu, Xinhao Jin, Chengbo Wang, Yongcong Zhang, Yifei Xue, Tie Ji, Yizhen Lao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04236
Pdf URL: https://arxiv.org/pdf/2508.04236
Copy Paste: [[2508.04236]] PIS3R: Very Large Parallax Image Stitching via Deep 3D Reconstruction(https://arxiv.org/abs/2508.04236)
Keywords: diffusion
Abstract: Image stitching aim to align two images taken from different viewpoints into one seamless, wider image. However, when the 3D scene contains depth variations and the camera baseline is significant, noticeable parallax occurs-meaning the relative positions of scene elements differ substantially between views. Most existing stitching methods struggle to handle such images with large parallax effectively. To address this challenge, in this paper, we propose an image stitching solution called PIS3R that is robust to very large parallax based on the novel concept of deep 3D reconstruction. First, we apply visual geometry grounded transformer to two input images with very large parallax to obtain both intrinsic and extrinsic parameters, as well as the dense 3D scene reconstruction. Subsequently, we reproject reconstructed dense point cloud onto a designated reference view using the recovered camera parameters, achieving pixel-wise alignment and generating an initial stitched image. Finally, to further address potential artifacts such as holes or noise in the initial stitching, we propose a point-conditioned image diffusion module to obtain the refined this http URL with existing methods, our solution is very large parallax tolerant and also provides results that fully preserve the geometric integrity of all pixels in the 3D photogrammetric context, enabling direct applicability to downstream 3D vision tasks such as SfM. Experimental results demonstrate that the proposed algorithm provides accurate stitching results for images with very large parallax, and outperforms the existing methods qualitatively and quantitatively.

Title: WSS-CL: Weight Saliency Soft-Guided Contrastive Learning for Efficient Machine Unlearning Image Classification

Authors: Thang Duc Tran, Thai Hoang Le
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04308
Pdf URL: https://arxiv.org/pdf/2508.04308
Copy Paste: [[2508.04308]] WSS-CL: Weight Saliency Soft-Guided Contrastive Learning for Efficient Machine Unlearning Image Classification(https://arxiv.org/abs/2508.04308)
Keywords: self-supervised
Abstract: Machine unlearning, the efficient deletion of the impact of specific data in a trained model, remains a challenging problem. Current machine unlearning approaches that focus primarily on data-centric or weight-based strategies frequently encounter challenges in achieving precise unlearning, maintaining stability, and ensuring applicability across diverse domains. In this work, we introduce a new two-phase efficient machine unlearning method for image classification, in terms of weight saliency, leveraging weight saliency to focus the unlearning process on critical model parameters. Our method is called weight saliency soft-guided contrastive learning for efficient machine unlearning image classification (WSS-CL), which significantly narrows the performance gap with "exact" unlearning. First, the forgetting stage maximizes kullback-leibler divergence between output logits and aggregated pseudo-labels for efficient forgetting in logit space. Next, the adversarial fine-tuning stage introduces contrastive learning in a self-supervised manner. By using scaled feature representations, it maximizes the distance between the forgotten and retained data samples in the feature space, with the forgotten and the paired augmented samples acting as positive pairs, while the retained samples act as negative pairs in the contrastive loss computation. Experimental evaluations reveal that our proposed method yields much-improved unlearning efficacy with negligible performance loss compared to state-of-the-art approaches, indicative of its usability in supervised and self-supervised settings.

Title: A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks

Authors: Kun Gui, Hongliang Ren, Shang Shi, Jin Lu, Changqiu Yu, Quanjun Cao, Guomin Gu, Qi Xuan
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2508.04316
Pdf URL: https://arxiv.org/pdf/2508.04316
Copy Paste: [[2508.04316]] A Foundation Model for DAS Signal Recognition and Visual Prompt Tuning of the Pre-trained Model for Downstream Tasks(https://arxiv.org/abs/2508.04316)
Keywords: self-supervised, foundation model
Abstract: Distributed Acoustic Sensing (DAS) technology finds growing applications across various domains. However, data distribution disparities due to heterogeneous sensing environments pose challenges for data-driven artificial intelligence (AI) models, limiting cross-domain generalization and facing a shortage of labeled training data. To address these issues, this study proposes a foundational model for DAS signal recognition based on a Masked Autoencoder, named MAEPD. The MAEPD model is pretrained on a dataset of 635,860 samples, encompassing DAS gait spatiotemporal signals, 2D GASF images for perimeter security, 2D time-frequency images for pipeline leakage, and open-dataset signals including whale vocalizations and seismic activities, using a self-supervised mask reconstruction task to capture deep semantic features of DAS signals. Visual Prompt Tuning (VPT) is employed for downstream recognition tasks. This method freezes the pretrained backbone parameters and fine-tunes only a small set of learnable visual prompt vectors inserted into the Transformer encoder layers. Experiments on the NVIDIA GeForce RTX 4080 Super platform validate MAEPD using indoor gait recognition as a downstream task. The VPT-Deep approach achieves a classification accuracy of 96.94% with just 0.322% of parameters fine-tuned, surpassing the traditional Full Fine Tuning (FFT) method by 0.61% and reducing training time by 45%. The model also exhibits robust performance in pipeline leakage detection, confirming the generality, efficiency, and scalability of MAEPD as a foundational model. This approach offers a novel paradigm for addressing the limited generalization of signal recognition models in the DAS domain.

Title: TempFlow-GRPO: When Timing Matters for GRPO in Flow Models

Authors: Xiaoxuan He, Siming Fu, Yuke Zhao, Wanli Li, Jian Yang, Dacheng Yin, Fengyun Rao, Bo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04324
Pdf URL: https://arxiv.org/pdf/2508.04324
Copy Paste: [[2508.04324]] TempFlow-GRPO: When Timing Matters for GRPO in Flow Models(https://arxiv.org/abs/2508.04324)
Keywords: generative
Abstract: Recent flow matching models for text-to-image generation have achieved remarkable quality, yet their integration with reinforcement learning for human preference alignment remains suboptimal, hindering fine-grained reward-based optimization. We observe that the key impediment to effective GRPO training of flow models is the temporal uniformity assumption in existing approaches: sparse terminal rewards with uniform credit assignment fail to capture the varying criticality of decisions across generation timesteps, resulting in inefficient exploration and suboptimal convergence. To remedy this shortcoming, we introduce \textbf{TempFlow-GRPO} (Temporal Flow GRPO), a principled GRPO framework that captures and exploits the temporal structure inherent in flow-based generation. TempFlow-GRPO introduces two key innovations: (i) a trajectory branching mechanism that provides process rewards by concentrating stochasticity at designated branching points, enabling precise credit assignment without requiring specialized intermediate reward models; and (ii) a noise-aware weighting scheme that modulates policy optimization according to the intrinsic exploration potential of each timestep, prioritizing learning during high-impact early stages while ensuring stable refinement in later phases. These innovations endow the model with temporally-aware optimization that respects the underlying generative dynamics, leading to state-of-the-art performance in human preference alignment and standard text-to-image benchmarks.

Title: From Split to Share: Private Inference with Distributed Feature Sharing

Authors: Zihan Liu, Jiayi Wen, Shouhong Tan, Zhirun Zheng, Cheng Huang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04346
Pdf URL: https://arxiv.org/pdf/2508.04346
Copy Paste: [[2508.04346]] From Split to Share: Private Inference with Distributed Feature Sharing(https://arxiv.org/abs/2508.04346)
Keywords: diffusion
Abstract: Cloud-based Machine Learning as a Service (MLaaS) raises serious privacy concerns when handling sensitive client data. Existing Private Inference (PI) methods face a fundamental trade-off between privacy and efficiency: cryptographic approaches offer strong protection but incur high computational overhead, while efficient alternatives such as split inference expose intermediate features to inversion attacks. We propose PrivDFS, a new paradigm for private inference that replaces a single exposed representation with distributed feature sharing. PrivDFS partitions input features on the client into multiple balanced shares, which are distributed to non-colluding, non-communicating servers for independent partial inference. The client securely aggregates the servers' outputs to reconstruct the final prediction, ensuring that no single server observes sufficient information to compromise input privacy. To further strengthen privacy, we propose two key extensions: PrivDFS-AT, which uses adversarial training with a diffusion-based proxy attacker to enforce inversion-resistant feature partitioning, and PrivDFS-KD, which leverages user-specific keys to diversify partitioning policies and prevent query-based inversion generalization. Experiments on CIFAR-10 and CelebA demonstrate that PrivDFS achieves privacy comparable to deep split inference while cutting client computation by up to 100 times with no accuracy loss, and that the extensions remain robust against both diffusion-based in-distribution and adaptive attacks.

Title: Chain of Questions: Guiding Multimodal Curiosity in Language Models

Authors: Nima Iji, Kia Dashtipour
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2508.04350
Pdf URL: https://arxiv.org/pdf/2508.04350
Copy Paste: [[2508.04350]] Chain of Questions: Guiding Multimodal Curiosity in Language Models(https://arxiv.org/abs/2508.04350)
Keywords: foundation model
Abstract: Reasoning capabilities in large language models (LLMs) have substantially advanced through methods such as chain-of-thought and explicit step-by-step explanations. However, these improvements have not yet fully transitioned to multimodal contexts, where models must proactively decide which sensory modalities such as vision, audio, or spatial perception to engage when interacting with complex real-world environments. In this paper, we introduce the Chain of Questions (CoQ) framework, a curiosity-driven reasoning approach that encourages multimodal language models to dynamically generate targeted questions regarding their surroundings. These generated questions guide the model to selectively activate relevant modalities, thereby gathering critical information necessary for accurate reasoning and response generation. We evaluate our framework on a novel multimodal benchmark dataset, assembled by integrating WebGPT, ScienceQA, AVSD, and ScanQA datasets. Experimental results demonstrate that our CoQ method improves a foundation model's ability to effectively identify and integrate pertinent sensory information. This leads to improved accuracy, interpretability, and alignment of the reasoning process with diverse multimodal tasks.

Title: VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones

Authors: Lefei Shen, Mouxiang Chen, Xu Liu, Han Fu, Xiaoxue Ren, Jianling Sun, Zhuo Li, Chenghao Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04379
Pdf URL: https://arxiv.org/pdf/2508.04379
Copy Paste: [[2508.04379]] VisionTS++: Cross-Modal Time Series Foundation Model with Continual Pre-trained Visual Backbones(https://arxiv.org/abs/2508.04379)
Keywords: foundation model
Abstract: Recent studies have revealed that vision models pre-trained on images can perform well in time series forecasting by reformulating forecasting as an image reconstruction task, suggesting their potential as universal time series foundation models. However, effective cross-modal transfer from vision to time series remains challenging due to three key discrepancies: (1) data-modality gap between structured, bounded image data and unbounded, heterogeneous time series; (2) multivariate-forecasting gap between standard RGB three-channel-based vision models and the need to model time series with arbitrary numbers of variates; and (3) probabilistic-forecasting gap between the deterministic output formats of most vision models and the requirement for uncertainty-aware probabilistic predictions. To bridge these gaps, we propose VisionTS++, a vision-model-based TSFM that performs continual pre-training on large-scale time series datasets, including 3 innovations: (1) a vision-model-based filtering mechanism to identify high-quality time series data, thereby mitigating modality gap and improving pre-training stability, (2) a colorized multivariate conversion method that transforms multivariate time series into multi-subfigure RGB images, capturing complex inter-variate dependencies; and (3) a multi-quantile forecasting approach using parallel reconstruction heads to generate forecasts of different quantile levels, thus more flexibly approximating arbitrary output distributions without restrictive prior distributional assumptions. Evaluated on both in-distribution and out-of-distribution TSF benchmarks, \model achieves SOTA results, outperforming specialized TSFMs by 6%-44% in MSE reduction and ranking first in 9 out of 12 probabilistic forecasting settings. Our work establishes a new paradigm for cross-modal knowledge transfer, advancing the development of universal TSFMs.

Title: Why are LLMs' abilities emergent?

Authors: Vladimír Havlík
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04401
Pdf URL: https://arxiv.org/pdf/2508.04401
Copy Paste: [[2508.04401]] Why are LLMs' abilities emergent?(https://arxiv.org/abs/2508.04401)
Keywords: generative, in-context
Abstract: The remarkable success of Large Language Models (LLMs) in generative tasks has raised fundamental questions about the nature of their acquired capabilities, which often appear to emerge unexpectedly without explicit training. This paper examines the emergent properties of Deep Neural Networks (DNNs) through both theoretical analysis and empirical observation, addressing the epistemological challenge of "creation without understanding" that characterises contemporary AI development. We explore how the neural approach's reliance on nonlinear, stochastic processes fundamentally differs from symbolic computational paradigms, creating systems whose macro-level behaviours cannot be analytically derived from micro-level neuron activities. Through analysis of scaling laws, grokking phenomena, and phase transitions in model capabilities, I demonstrate that emergent abilities arise from the complex dynamics of highly sensitive nonlinear systems rather than simply from parameter scaling alone. My investigation reveals that current debates over metrics, pre-training loss thresholds, and in-context learning miss the fundamental ontological nature of emergence in DNNs. I argue that these systems exhibit genuine emergent properties analogous to those found in other complex natural phenomena, where systemic capabilities emerge from cooperative interactions among simple components without being reducible to their individual behaviours. The paper concludes that understanding LLM capabilities requires recognising DNNs as a new domain of complex dynamical systems governed by universal principles of emergence, similar to those operating in physics, chemistry, and biology. This perspective shifts the focus from purely phenomenological definitions of emergence to understanding the internal dynamic transformations that enable these systems to acquire capabilities that transcend their individual components.

Title: Benchmarking Foundation Models for Mitotic Figure Classification

Authors: Jonas Ammeling, Jonathan Ganz, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04441
Pdf URL: https://arxiv.org/pdf/2508.04441
Copy Paste: [[2508.04441]] Benchmarking Foundation Models for Mitotic Figure Classification(https://arxiv.org/abs/2508.04441)
Keywords: self-supervised, foundation model
Abstract: The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited data problem by providing semantically rich feature vectors that can generalize well to new tasks with minimal training effort increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. Next to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA-adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.

Title: Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI

Authors: Rohaizah Abdul Wahid, Muhamad Said Nizamuddin Nadim, Suliana Sulaiman, Syahmi Akmal Shaharudin, Muhammad Danial Jupikil, Iqqwan Jasman Su Azlan Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04442
Pdf URL: https://arxiv.org/pdf/2508.04442
Copy Paste: [[2508.04442]] Automated Generation of Curriculum-Aligned Multiple-Choice Questions for Malaysian Secondary Mathematics Using Generative AI(https://arxiv.org/abs/2508.04442)
Keywords: generative
Abstract: This paper addresses the critical need for scalable and high-quality educational assessment tools within the Malaysian education system. It highlights the potential of Generative AI (GenAI) while acknowledging the significant challenges of ensuring factual accuracy and curriculum alignment, especially for low-resource languages like Bahasa Melayu. This research introduces and compares four incremental pipelines for generating Form 1 Mathematics multiple-choice questions (MCQs) in Bahasa Melayu using OpenAI's GPT-4o. The methods range from non-grounded prompting (structured and basic) to Retrieval-Augmented Generation (RAG) approaches (one using the LangChain framework, one implemented manually). The system is grounded in official curriculum documents, including teacher-prepared notes and the yearly teaching plan (RPT). A dual-pronged automated evaluation framework is employed to assess the generated questions. Curriculum alignment is measured using Semantic Textual Similarity (STS) against the RPT, while contextual validity is verified through a novel RAG-based Question-Answering (RAG-QA) method. The results demonstrate that RAG-based pipelines significantly outperform non-grounded prompting methods, producing questions with higher curriculum alignment and factual validity. The study further analyzes the trade-offs between the ease of implementation of framework-based RAG and the fine-grained control offered by a manual pipeline. This work presents a validated methodology for generating curriculum-specific educational content in a low-resource language, introduces a symbiotic RAG-QA evaluation technique, and provides actionable insights for the development and deployment of practical EdTech solutions in Malaysia and similar regions.

Title: Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling

Authors: Biao Hu, Guoyin Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04447
Pdf URL: https://arxiv.org/pdf/2508.04447
Copy Paste: [[2508.04447]] Cloud Model Characteristic Function Auto-Encoder: Integrating Cloud Model Theory with MMD Regularization for Enhanced Generative Modeling(https://arxiv.org/abs/2508.04447)
Keywords: generative
Abstract: We introduce Cloud Model Characteristic Function Auto-Encoder (CMCFAE), a novel generative model that integrates the cloud model into the Wasserstein Auto-Encoder (WAE) framework. By leveraging the characteristic functions of the cloud model to regularize the latent space, our approach enables more accurate modeling of complex data distributions. Unlike conventional methods that rely on a standard Gaussian prior and traditional divergence measures, our method employs a cloud model prior, providing a more flexible and realistic representation of the latent space, thus mitigating the homogenization observed in reconstructed samples. We derive the characteristic function of the cloud model and propose a corresponding regularizer within the WAE framework. Extensive quantitative and qualitative evaluations on MNIST, FashionMNIST, CIFAR-10, and CelebA demonstrate that CMCFAE outperforms existing models in terms of reconstruction quality, latent space structuring, and sample diversity. This work not only establishes a novel integration of cloud model theory with MMD-based regularization but also offers a promising new perspective for enhancing autoencoder-based generative models.

Title: Automatic LLM Red Teaming

Authors: Roman Belaire, Arunesh Sinha, Pradeep Varakantham
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04451
Pdf URL: https://arxiv.org/pdf/2508.04451
Copy Paste: [[2508.04451]] Automatic LLM Red Teaming(https://arxiv.org/abs/2508.04451)
Keywords: generative
Abstract: Red teaming is critical for identifying vulnerabilities and building trust in current LLMs. However, current automated methods for Large Language Models (LLMs) rely on brittle prompt templates or single-turn attacks, failing to capture the complex, interactive nature of real-world adversarial dialogues. We propose a novel paradigm: training an AI to strategically `break' another AI. By formalizing red teaming as a Markov Decision Process (MDP) and employing a hierarchical Reinforcement Learning (RL) framework, we effectively address the inherent sparse reward and long-horizon challenges. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward, enabling it to uncover subtle vulnerabilities missed by existing baselines. This approach sets a new state-of-the-art, fundamentally reframing LLM red teaming as a dynamic, trajectory-based process (rather than a one-step test) essential for robust AI deployment.

Title: Small transformer architectures for task switching

Authors: Claudius Gros
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04461
Pdf URL: https://arxiv.org/pdf/2508.04461
Copy Paste: [[2508.04461]] Small transformer architectures for task switching(https://arxiv.org/abs/2508.04461)
Keywords: generative
Abstract: The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetics which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We enlarge our comparative study by including an extension of the standard transformer architecture to its non-translational invariant counterpart, the cisformer, and an alternative attention mechanism, extensive attention. A combination of the latter is found to be the only model able to achieve considerable performance levels, of around 95%. Our results indicate that the workings of attention can be understood better, and even improved, when comparing qualitatively different formulations in task-switching settings.

Title: 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation

Authors: Shuzhou Yang, Xiaodong Cun, Xiaoyu Li, Yaowei Li, Jian Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04467
Pdf URL: https://arxiv.org/pdf/2508.04467
Copy Paste: [[2508.04467]] 4DVD: Cascaded Dense-view Video Diffusion Model for High-quality 4D Content Generation(https://arxiv.org/abs/2508.04467)
Keywords: diffusion
Abstract: Given the high complexity of directly generating high-dimensional data such as 4D, we present 4DVD, a cascaded video diffusion model that generates 4D content in a decoupled manner. Unlike previous multi-view video methods that directly model 3D space and temporal features simultaneously with stacked cross view/temporal attention modules, 4DVD decouples this into two subtasks: coarse multi-view layout generation and structure-aware conditional generation, and effectively unifies them. Specifically, given a monocular video, 4DVD first predicts the dense view content of its layout with superior cross-view and temporal consistency. Based on the produced layout priors, a structure-aware spatio-temporal generation branch is developed, combining these coarse structural priors with the exquisite appearance content of input monocular video to generate final high-quality dense-view videos. Benefit from this, explicit 4D representation~(such as 4D Gaussian) can be optimized accurately, enabling wider practical application. To train 4DVD, we collect a dynamic 3D object dataset, called D-Objaverse, from the Objaverse benchmark and render 16 videos with 21 frames for each object. Extensive experiments demonstrate our state-of-the-art performance on both novel view synthesis and 4D generation. Our project page is this https URL

Title: FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions

Authors: Jianheng Tang, Zhirui Yang, Jingchao Wang, Kejia Fan, Jinfeng Xu, Huiping Zhuang, Anfeng Liu, Houbing Herbert Song, Leye Wang, Yunhuai Liu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04470
Pdf URL: https://arxiv.org/pdf/2508.04470
Copy Paste: [[2508.04470]] FedHiP: Heterogeneity-Invariant Personalized Federated Learning Through Closed-Form Solutions(https://arxiv.org/abs/2508.04470)
Keywords: self-supervised, foundation model
Abstract: Lately, Personalized Federated Learning (PFL) has emerged as a prevalent paradigm to deliver personalized models by collaboratively training while simultaneously adapting to each client's local applications. Existing PFL methods typically face a significant challenge due to the ubiquitous data heterogeneity (i.e., non-IID data) across clients, which severely hinders convergence and degrades performance. We identify that the root issue lies in the long-standing reliance on gradient-based updates, which are inherently sensitive to non-IID data. To fundamentally address this issue and bridge the research gap, in this paper, we propose a Heterogeneity-invariant Personalized Federated learning scheme, named FedHiP, through analytical (i.e., closed-form) solutions to avoid gradient-based updates. Specifically, we exploit the trend of self-supervised pre-training, leveraging a foundation model as a frozen backbone for gradient-free feature extraction. Following the feature extractor, we further develop an analytic classifier for gradient-free training. To support both collective generalization and individual personalization, our FedHiP scheme incorporates three phases: analytic local training, analytic global aggregation, and analytic local personalization. The closed-form solutions of our FedHiP scheme enable its ideal property of heterogeneity invariance, meaning that each personalized model remains identical regardless of how non-IID the data are distributed across all other clients. Extensive experiments on benchmark datasets validate the superiority of our FedHiP scheme, outperforming the state-of-the-art baselines by at least 5.79%-20.97% in accuracy.

Title: Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model

Authors: Hongxu Chen, Zhen Wang, Taoran Mei, Lin Li, Bowei Zhu, Runshi Li, Long Chen
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.04472
Pdf URL: https://arxiv.org/pdf/2508.04472
Copy Paste: [[2508.04472]] Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model(https://arxiv.org/abs/2508.04472)
Keywords: generative
Abstract: Concept Erasure, which aims to prevent pretrained text-to-image models from generating content associated with semantic-harmful concepts (i.e., target concepts), is getting increased attention. State-of-the-art methods formulate this task as an optimization problem: they align all target concepts with semantic-harmless anchor concepts, and apply closed-form solutions to update the model accordingly. While these closed-form methods are efficient, we argue that existing methods have two overlooked limitations: 1) They often result in incomplete erasure due to "non-zero alignment residual", especially when text prompts are relatively complex. 2) They may suffer from generation quality degradation as they always concentrate parameter updates in a few deep layers. To address these issues, we propose a novel closed-form method ErasePro: it is designed for more complete concept erasure and better preserving overall generative quality. Specifically, ErasePro first introduces a strict zero-residual constraint into the optimization objective, ensuring perfect alignment between target and anchor concept features and enabling more complete erasure. Secondly, it employs a progressive, layer-wise update strategy that gradually transfers target concept features to those of the anchor concept from shallow to deep layers. As the depth increases, the required parameter changes diminish, thereby reducing deviations in sensitive deep layers and preserving generative quality. Empirical results across different concept erasure tasks (including instance, art style, and nudity erasure) have demonstrated the effectiveness of our ErasePro.

Title: Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach

Authors: Anushka Srivastava
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2508.04481
Pdf URL: https://arxiv.org/pdf/2508.04481
Copy Paste: [[2508.04481]] Emotion Detection Using Conditional Generative Adversarial Networks (cGAN): A Deep Learning Approach(https://arxiv.org/abs/2508.04481)
Keywords: generative
Abstract: This paper presents a deep learning-based approach to emotion detection using Conditional Generative Adversarial Networks (cGANs). Unlike traditional unimodal techniques that rely on a single data type, we explore a multimodal framework integrating text, audio, and facial expressions. The proposed cGAN architecture is trained to generate synthetic emotion-rich data and improve classification accuracy across multiple modalities. Our experimental results demonstrate significant improvements in emotion recognition performance compared to baseline models. This work highlights the potential of cGANs in enhancing human-computer interaction systems by enabling more nuanced emotional understanding.

Title: QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution

Authors: Bowen Chai, Zheng Chen, Libo Zhu, Wenbo Li, Yong Guo, Yulun Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04485
Pdf URL: https://arxiv.org/pdf/2508.04485
Copy Paste: [[2508.04485]] QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution(https://arxiv.org/abs/2508.04485)
Keywords: diffusion
Abstract: Diffusion models have shown superior performance in real-world video super-resolution (VSR). However, the slow processing speeds and heavy resource consumption of diffusion models hinder their practical application and deployment. Quantization offers a potential solution for compressing the VSR model. Nevertheless, quantizing VSR models is challenging due to their temporal characteristics and high fidelity requirements. To address these issues, we propose QuantVSR, a low-bit quantization model for real-world VSR. We propose a spatio-temporal complexity aware (STCA) mechanism, where we first utilize the calibration dataset to measure both spatial and temporal complexities for each layer. Based on these statistics, we allocate layer-specific ranks to the low-rank full-precision (FP) auxiliary branch. Subsequently, we jointly refine the FP and low-bit branches to achieve simultaneous optimization. In addition, we propose a learnable bias alignment (LBA) module to reduce the biased quantization errors. Extensive experiments on synthetic and real-world datasets demonstrate that our method obtains comparable performance with the FP model and significantly outperforms recent leading low-bit quantization methods. Code is available at: this https URL.

Title: CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation

Authors: Bastien Liétard, Gabriel Loiseau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2508.04494
Pdf URL: https://arxiv.org/pdf/2508.04494
Copy Paste: [[2508.04494]] CALE : Concept-Aligned Embeddings for Both Within-Lemma and Inter-Lemma Sense Differentiation(https://arxiv.org/abs/2508.04494)
Keywords: in-context
Abstract: Lexical semantics is concerned with both the multiple senses a word can adopt in different contexts, and the semantic relations that exist between meanings of different words. To investigate them, Contextualized Language Models are a valuable tool that provides context-sensitive representations that can be used to investigate lexical meaning. Recent works like XL-LEXEME have leveraged the task of Word-in-Context to fine-tune them to get more semantically accurate representations, but Word-in-Context only compares occurrences of the same lemma, limiting the range of captured information. In this paper, we propose an extension, Concept Differentiation, to include inter-words scenarios. We provide a dataset for this task, derived from SemCor data. Then we fine-tune several representation models on this dataset. We call these models Concept-Aligned Embeddings (CALE). By challenging our models and other models on various lexical semantic tasks, we demonstrate that the proposed models provide efficient multi-purpose representations of lexical meaning that reach best performances in our experiments. We also show that CALE's fine-tuning brings valuable changes to the spatial organization of embeddings.

Title: Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis

Authors: Angang Zhang, Fang Deng, Hao Chen, Zhongjian Chen, Junyan Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04551
Pdf URL: https://arxiv.org/pdf/2508.04551
Copy Paste: [[2508.04551]] Two-Way Garment Transfer: Unified Diffusion Framework for Dressing and Undressing Synthesis(https://arxiv.org/abs/2508.04551)
Keywords: diffusion
Abstract: While recent advances in virtual try-on (VTON) have achieved realistic garment transfer to human subjects, its inverse task, virtual try-off (VTOFF), which aims to reconstruct canonical garment templates from dressed humans, remains critically underexplored and lacks systematic investigation. Existing works predominantly treat them as isolated tasks: VTON focuses on garment dressing while VTOFF addresses garment extraction, thereby neglecting their complementary symmetry. To bridge this fundamental gap, we propose the Two-Way Garment Transfer Model (TWGTM), to the best of our knowledge, the first unified framework for joint clothing-centric image synthesis that simultaneously resolves both mask-guided VTON and mask-free VTOFF through bidirectional feature disentanglement. Specifically, our framework employs dual-conditioned guidance from both latent and pixel spaces of reference images to seamlessly bridge the dual tasks. On the other hand, to resolve the inherent mask dependency asymmetry between mask-guided VTON and mask-free VTOFF, we devise a phased training paradigm that progressively bridges this modality gap. Extensive qualitative and quantitative experiments conducted across the DressCode and VITON-HD datasets validate the efficacy and competitive edge of our proposed approach.

Title: One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose

Authors: Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04559
Pdf URL: https://arxiv.org/pdf/2508.04559
Copy Paste: [[2508.04559]] One Model For All: Partial Diffusion for Unified Try-On and Try-Off in Any Pose(https://arxiv.org/abs/2508.04559)
Keywords: diffusion
Abstract: Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios-for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce \textbf{OMFA} (\emph{One Model For All}), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. For example, OMFA enables removing garments from a source person (try-off) and transferring them onto a target person (try-on), while also allowing the generated target to appear in novel poses-even without access to multi-pose images of that person. OMFA is built upon a novel \emph{partial diffusion} strategy that selectively applies noise and denoising to individual components of the joint input-such as the garment, the person image, or the face-enabling dynamic subtask control and efficient bidirectional garment-person transformation. The framework is entirely mask-free and requires only a single portrait and a target pose as input, making it well-suited for real-world applications. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. The project page is here: this https URL.

Title: Drone Detection with Event Cameras

Authors: Gabriele Magrini, Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Pietro Pala
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04564
Pdf URL: https://arxiv.org/pdf/2508.04564
Copy Paste: [[2508.04564]] Drone Detection with Event Cameras(https://arxiv.org/abs/2508.04564)
Keywords: diffusion
Abstract: The diffusion of drones presents significant security and safety challenges. Traditional surveillance systems, particularly conventional frame-based cameras, struggle to reliably detect these targets due to their small size, high agility, and the resulting motion blur and poor performance in challenging lighting conditions. This paper surveys the emerging field of event-based vision as a robust solution to these problems. Event cameras virtually eliminate motion blur and enable consistent detection in extreme lighting. Their sparse, asynchronous output suppresses static backgrounds, enabling low-latency focus on motion cues. We review the state-of-the-art in event-based drone detection, from data representation methods to advanced processing pipelines using spiking neural networks. The discussion extends beyond simple detection to cover more sophisticated tasks such as real-time tracking, trajectory forecasting, and unique identification through propeller signature analysis. By examining current methodologies, available datasets, and the distinct advantages of the technology, this work demonstrates that event-based vision provides a powerful foundation for the next generation of reliable, low-latency, and efficient counter-UAV systems.

Title: TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning

Authors: Yunbi Liu, Enqi Tang, Shiyu Li, Lei Ma, Juncheng Li, Shu Lou, Yongchu Pan, Qingshan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04565
Pdf URL: https://arxiv.org/pdf/2508.04565
Copy Paste: [[2508.04565]] TAlignDiff: Automatic Tooth Alignment assisted by Diffusion-based Transformation Learning(https://arxiv.org/abs/2508.04565)
Keywords: diffusion
Abstract: Orthodontic treatment hinges on tooth alignment, which significantly affects occlusal function, facial aesthetics, and patients' quality of life. Current deep learning approaches predominantly concentrate on predicting transformation matrices through imposing point-to-point geometric constraints for tooth alignment. Nevertheless, these matrices are likely associated with the anatomical structure of the human oral cavity and possess particular distribution characteristics that the deterministic point-to-point geometric constraints in prior work fail to capture. To address this, we introduce a new automatic tooth alignment method named TAlignDiff, which is supported by diffusion-based transformation learning. TAlignDiff comprises two main components: a primary point cloud-based regression network (PRN) and a diffusion-based transformation matrix denoising module (DTMD). Geometry-constrained losses supervise PRN learning for point cloud-level alignment. DTMD, as an auxiliary module, learns the latent distribution of transformation matrices from clinical data. We integrate point cloud-based transformation regression and diffusion-based transformation modeling into a unified framework, allowing bidirectional feedback between geometric constraints and diffusion refinement. Extensive ablation and comparative experiments demonstrate the effectiveness and superiority of our method, highlighting its potential in orthodontic treatment.

Title: Analyzing and Mitigating Object Hallucination: A Training Bias Perspective

Authors: Yifan Li, Kun Zhou, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.04567
Pdf URL: https://arxiv.org/pdf/2508.04567
Copy Paste: [[2508.04567]] Analyzing and Mitigating Object Hallucination: A Training Bias Perspective(https://arxiv.org/abs/2508.04567)
Keywords: generative
Abstract: As scaling up training data has significantly improved the general multimodal capabilities of Large Vision-Language Models (LVLMs), they still suffer from the hallucination issue, generating text that is inconsistent with the visual input. This phenomenon motivates us to systematically investigate the role of training data in hallucination. We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. Through comprehensive evaluation on POPEv2, we find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. Specifically, they perform poorly on counterfactual images, often incorrectly answering ``Yes'' to questions about masked objects. To understand this issue, we conduct probing experiments on the models' internal components, revealing that this training bias is primarily located in the language modeling (LM) head. Based on these findings, we propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning. Obliviate identifies the discrepancy between ground-truth labels and model outputs on the training data as a proxy for bias and adopts a parameter- and data-efficient fine-tuning strategy that only updates the LM head. Extensive experiments demonstrate the effectiveness of our approach. While only reusing the training data and updating approximately 2\% of the parameters, Obliviate significantly reduces hallucination across both discriminative and generative tasks. Furthermore, it demonstrates strong scalability with respect to both model size (2B to 72B) and training data volume, and exhibits promising generalization to hallucination types beyond object-level hallucination. Our code and data will be publicly released.

Title: DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling

Authors: Yijie Li, Wei Zhang, Xi Zhu, Ye Wu, Yogesh Rathi, Lauren J. O'Donnell, Fan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04568
Pdf URL: https://arxiv.org/pdf/2508.04568
Copy Paste: [[2508.04568]] DDTracking: A Deep Generative Framework for Diffusion MRI Tractography with Streamline Local-Global Spatiotemporal Modeling(https://arxiv.org/abs/2508.04568)
Keywords: diffusion, generative
Abstract: This paper presents DDTracking, a novel deep generative framework for diffusion MRI tractography that formulates streamline propagation as a conditional denoising diffusion process. In DDTracking, we introduce a dual-pathway encoding network that jointly models local spatial encoding (capturing fine-scale structural details at each streamline point) and global temporal dependencies (ensuring long-range consistency across the entire streamline). Furthermore, we design a conditional diffusion model module, which leverages the learned local and global embeddings to predict streamline propagation orientations for tractography in an end-to-end trainable manner. We conduct a comprehensive evaluation across diverse, independently acquired dMRI datasets, including both synthetic and clinical data. Experiments on two well-established benchmarks with ground truth (ISMRM Challenge and TractoInferno) demonstrate that DDTracking largely outperforms current state-of-the-art tractography methods. Furthermore, our results highlight DDTracking's strong generalizability across heterogeneous datasets, spanning varying health conditions, age groups, imaging protocols, and scanner types. Collectively, DDTracking offers anatomically plausible and robust tractography, presenting a scalable, adaptable, and end-to-end learnable solution for broad dMRI applications. Code is available at: this https URL

Title: GraphProp: Training the Graph Foundation Models using Graph Properties

Authors: Ziheng Sun, Qi Feng, Lehao Lin, Chris Ding, Jicong Fan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04594
Pdf URL: https://arxiv.org/pdf/2508.04594
Copy Paste: [[2508.04594]] GraphProp: Training the Graph Foundation Models using Graph Properties(https://arxiv.org/abs/2508.04594)
Keywords: foundation model
Abstract: This work focuses on training graph foundation models (GFMs) that have strong generalization ability in graph-level tasks such as graph classification. Effective GFM training requires capturing information consistent across different domains. We discover that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional GFMs primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce GraphProp, which emphasizes structural generalization. The training process of GraphProp consists of two main phases. First, we train a structural GFM by predicting graph invariants. Since graph invariants are properties of graphs that depend only on the abstract structure, not on particular labellings or drawings of the graph, this structural GFM has a strong ability to capture the abstract structural information and provide discriminative graph representations comparable across diverse domains. In the second phase, we use the representations given by the structural GFM as positional encodings to train a comprehensive GFM. This phase utilizes domain-specific node attributes and graph labels to further improve cross-domain node feature generalization. Our experiments demonstrate that GraphProp significantly outperforms the competitors in supervised learning and few-shot learning, especially in handling graphs without node attributes.

Title: Multitask Learning with Stochastic Interpolants

Authors: Hugo Negrel, Florentin Coeurdoux, Michael S. Albergo, Eric Vanden-Eijnden
Subjects: cs.LG, math.DS
Abstract URL: https://arxiv.org/abs/2508.04605
Pdf URL: https://arxiv.org/pdf/2508.04605
Copy Paste: [[2508.04605]] Multitask Learning with Stochastic Interpolants(https://arxiv.org/abs/2508.04605)
Keywords: diffusion, generative
Abstract: We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.

Title: CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series

Authors: Yutong Xia, Yingying Zhang, Yuxuan Liang, Lunting Fan, Qingsong Wen, Roger Zimmermann
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.04630
Pdf URL: https://arxiv.org/pdf/2508.04630
Copy Paste: [[2508.04630]] CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series(https://arxiv.org/abs/2508.04630)
Keywords: anomaly
Abstract: Time series anomaly detection has garnered considerable attention across diverse domains. While existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework, CaPulse, which tunes in to the underlying causal pulse of time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.

Title: EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts

Authors: Kushin Mukherjee, Donghao Ren, Dominik Moritz, Yannick Assogba
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.04650
Pdf URL: https://arxiv.org/pdf/2508.04650
Copy Paste: [[2508.04650]] EncQA: Benchmarking Vision-Language Models on Visual Encodings for Charts(https://arxiv.org/abs/2508.04650)
Keywords: anomaly
Abstract: Multimodal vision-language models (VLMs) continue to achieve ever-improving scores on chart understanding benchmarks. Yet, we find that this progress does not fully capture the breadth of visual reasoning capabilities essential for interpreting charts. We introduce EncQA, a novel benchmark informed by the visualization literature, designed to provide systematic coverage of visual encodings and analytic tasks that are crucial for chart understanding. EncQA provides 2,076 synthetic question-answer pairs, enabling balanced coverage of six visual encoding channels (position, length, area, color quantitative, color nominal, and shape) and eight tasks (find extrema, retrieve value, find anomaly, filter values, compute derived value exact, compute derived value relative, correlate values, and correlate values relative). Our evaluation of 9 state-of-the-art VLMs reveals that performance varies significantly across encodings within the same task, as well as across tasks. Contrary to expectations, we observe that performance does not improve with model size for many task-encoding pairs. Our results suggest that advancing chart understanding requires targeted strategies addressing specific visual reasoning gaps, rather than solely scaling up model or dataset size.

Title: HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Authors: Young D. Kwon, Rui Li, Sijia Li, Da Li, Sourav Bhattacharya, Stylianos I. Venieris
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.04663
Pdf URL: https://arxiv.org/pdf/2508.04663
Copy Paste: [[2508.04663]] HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models(https://arxiv.org/abs/2508.04663)
Keywords: diffusion
Abstract: State-of-the-art text-to-image diffusion models (DMs) achieve remarkable quality, yet their massive parameter scale (8-11B) poses significant challenges for inferences on resource-constrained devices. In this paper, we present HierarchicalPrune, a novel compression framework grounded in a key observation: DM blocks exhibit distinct functional hierarchies, where early blocks establish semantic structures while later blocks handle texture refinements. HierarchicalPrune synergistically combines three techniques: (1) Hierarchical Position Pruning, which identifies and removes less essential later blocks based on position hierarchy; (2) Positional Weight Preservation, which systematically protects early model portions that are essential for semantic structural integrity; and (3) Sensitivity-Guided Distillation, which adjusts knowledge-transfer intensity based on our discovery of block-wise sensitivity variations. As a result, our framework brings billion-scale diffusion models into a range more suitable for on-device inference, while preserving the quality of the output images. Specifically, when combined with INT4 weight quantisation, HierarchicalPrune achieves 77.5-80.4% memory footprint reduction (e.g., from 15.8 GB to 3.2 GB) and 27.9-38.0% latency reduction, measured on server and consumer grade GPUs, with the minimum drop of 2.6% in GenEval score and 7% in HPSv2 score compared to the original model. Last but not least, our comprehensive user study with 85 participants demonstrates that HierarchicalPrune maintains perceptual quality comparable to the original model while significantly outperforming prior works.