2025-08-26

Title: Towards High-Precision Depth Sensing via Monocular-Aided iToF and RGB Integration

Authors: Yansong Du, Yutong Deng, Yuting Zhou, Feiyu Jiao, Jian Song, Xun Guan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16579
Pdf URL: https://arxiv.org/pdf/2508.16579
Copy Paste: [[2508.16579]] Towards High-Precision Depth Sensing via Monocular-Aided iToF and RGB Integration(https://arxiv.org/abs/2508.16579)
Keywords: super-resolution
Abstract: This paper presents a novel iToF-RGB fusion framework designed to address the inherent limitations of indirect Time-of-Flight (iToF) depth sensing, such as low spatial resolution, limited field-of-view (FoV), and structural distortion in complex scenes. The proposed method first reprojects the narrow-FoV iToF depth map onto the wide-FoV RGB coordinate system through a precise geometric calibration and alignment module, ensuring pixel-level correspondence between modalities. A dual-encoder fusion network is then employed to jointly extract complementary features from the reprojected iToF depth and RGB image, guided by monocular depth priors to recover fine-grained structural details and perform depth super-resolution. By integrating cross-modal structural cues and depth consistency constraints, our approach achieves enhanced depth accuracy, improved edge sharpness, and seamless FoV expansion. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods in terms of accuracy, structural consistency, and visual quality.
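Note: the reprojection step described above (narrow-FoV iToF depth mapped into the wide-FoV RGB frame) can be illustrated with a plain pinhole-camera sketch. This is a minimal illustration under assumptions, not the authors' calibration module: the function name, the (fx, fy, cx, cy) intrinsics tuples, and the extrinsics R and t are all hypothetical, and lens distortion is ignored.

```python
def reproject_depth_pixel(u, v, depth, k_tof, rot, trans, k_rgb):
    """Map one iToF pixel (u, v) with metric depth into RGB image coordinates.

    k_tof / k_rgb are hypothetical (fx, fy, cx, cy) pinhole intrinsics;
    rot (3x3 nested list) and trans (length 3) take iToF-frame points
    into the RGB camera frame. Returns (u_rgb, v_rgb, z_rgb) or None.
    """
    fx, fy, cx, cy = k_tof
    # Back-project the pixel to a 3D point in the iToF camera frame.
    p_tof = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Rigid transform into the RGB camera frame: p_rgb = rot @ p_tof + trans.
    p_rgb = [
        sum(rot[i][j] * p_tof[j] for j in range(3)) + trans[i]
        for i in range(3)
    ]
    if p_rgb[2] <= 0:
        return None  # point lies behind the RGB camera; no valid projection
    fx2, fy2, cx2, cy2 = k_rgb
    # Perspective projection onto the RGB image plane.
    u_rgb = fx2 * p_rgb[0] / p_rgb[2] + cx2
    v_rgb = fy2 * p_rgb[1] / p_rgb[2] + cy2
    return u_rgb, v_rgb, p_rgb[2]
```

Applying this to every valid iToF pixel gives a sparse depth map on the RGB grid. A fusion network such as the one the abstract describes would then densify and refine it.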

Title: CrystalDiT: A Diffusion Transformer for Crystal Generation

Authors: Xiaohan Yi, Guikun Xu, Xi Xiao, Zhong Zhang, Liu Liu, Yatao Bian, Peilin Zhao
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2508.16614
Pdf URL: https://arxiv.org/pdf/2508.16614
Copy Paste: [[2508.16614]] CrystalDiT: A Diffusion Transformer for Crystal Generation(https://arxiv.org/abs/2508.16614)
Keywords: generation
Abstract: We present CrystalDiT, a diffusion transformer for crystal structure generation that achieves state-of-the-art performance by challenging the trend of architectural complexity. Instead of intricate, multi-stream designs, CrystalDiT employs a unified transformer that imposes a powerful inductive bias: treating lattice and atomic properties as a single, interdependent system. Combined with a periodic table-based atomic representation and a balanced training strategy, our approach achieves 9.62% SUN (Stable, Unique, Novel) rate on MP-20, substantially outperforming recent methods including FlowMM (4.38%) and MatterGen (3.42%). Notably, CrystalDiT generates 63.28% unique and novel structures while maintaining comparable stability rates, demonstrating that architectural simplicity can be more effective than complexity for materials discovery. Our results suggest that in data-limited scientific domains, carefully designed simple architectures outperform sophisticated alternatives that are prone to overfitting.

Title: A Retrieval Augmented Spatio-Temporal Framework for Traffic Prediction

Authors: Weilin Ruan, Xilin Dang, Ziyu Zhou, Sisuo Lyu, Yuxuan Liang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16623
Pdf URL: https://arxiv.org/pdf/2508.16623
Copy Paste: [[2508.16623]] A Retrieval Augmented Spatio-Temporal Framework for Traffic Prediction(https://arxiv.org/abs/2508.16623)
Keywords: generation
Abstract: Traffic prediction is a cornerstone of modern intelligent transportation systems and a critical task in spatio-temporal forecasting. Although advanced Spatio-temporal Graph Neural Networks (STGNNs) and pre-trained models have achieved significant progress in traffic prediction, two key challenges remain: (i) limited contextual capacity when modeling complex spatio-temporal dependencies, and (ii) low predictability at fine-grained spatio-temporal points due to heterogeneous patterns. Inspired by Retrieval-Augmented Generation (RAG), we propose RAST, a universal framework that integrates retrieval-augmented mechanisms with spatio-temporal modeling to address these challenges. Our framework consists of three key designs: 1) Decoupled Encoder and Query Generator to capture decoupled spatial and temporal features and construct a fusion query via residual fusion; 2) Spatio-temporal Retrieval Store and Retrievers to maintain and retrieve vectorized fine-grained patterns; and 3) Universal Backbone Predictor that flexibly accommodates pre-trained STGNNs or simple MLP predictors. Extensive experiments on six real-world traffic networks, including large-scale datasets, demonstrate that RAST achieves superior performance while maintaining computational efficiency.
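Note: the retrieval idea in design (2) above can be sketched as a generic nearest-neighbour lookup over stored pattern vectors. This is only an illustration of vector retrieval in general, not RAST's actual store or retriever: the function names, the cosine-similarity choice, and the (key_vector, payload) layout are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, store, k=2):
    """Return the payloads of the k stored keys most similar to the query.

    store is a list of (key_vector, payload) pairs; a hypothetical layout.
    """
    ranked = sorted(store, key=lambda item: cosine(query, item[0]), reverse=True)
    return [payload for _, payload in ranked[:k]]
```

In a pipeline like the one described, the retrieved payloads would be fused with the query features before being passed to the backbone predictor.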

Title: From Classical Probabilistic Latent Variable Models to Modern Generative AI: A Unified Perspective

Authors: Tianhua Chen
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16643
Pdf URL: https://arxiv.org/pdf/2508.16643
Copy Paste: [[2508.16643]] From Classical Probabilistic Latent Variable Models to Modern Generative AI: A Unified Perspective(https://arxiv.org/abs/2508.16643)
Keywords: generative
Abstract: From large language models to multi-modal agents, Generative Artificial Intelligence (AI) now underpins state-of-the-art systems. Despite their varied architectures, many share a common foundation in probabilistic latent variable models (PLVMs), where hidden variables explain observed data for density estimation, latent reasoning, and structured inference. This paper presents a unified perspective by framing both classical and modern generative methods within the PLVM paradigm. We trace the progression from classical flat models such as probabilistic PCA, Gaussian mixture models, latent class analysis, item response theory, and latent Dirichlet allocation, through their sequential extensions including Hidden Markov Models, Gaussian HMMs, and Linear Dynamical Systems, to contemporary deep architectures: Variational Autoencoders as Deep PLVMs, Normalizing Flows as Tractable PLVMs, Diffusion Models as Sequential PLVMs, Autoregressive Models as Explicit Generative Models, and Generative Adversarial Networks as Implicit PLVMs. Viewing these architectures under a common probabilistic taxonomy reveals shared principles, distinct inference strategies, and the representational trade-offs that shape their strengths. We offer a conceptual roadmap that consolidates generative AI's theoretical foundations, clarifies methodological lineages, and guides future innovation by grounding emerging architectures in their probabilistic heritage.
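Note: for readers unfamiliar with the PLVM framing, the two standard textbook identities behind it are the marginal likelihood of a latent variable model and the evidence lower bound used by variational autoencoders. They are given here as background and are not quoted from the paper; the symbols x (observation), z (latent), \theta (model parameters), and q_\phi (approximate posterior) follow the usual conventions.

```latex
% Marginal likelihood: the latent variable z is integrated out.
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, \mathrm{d}z

% Evidence lower bound (ELBO), valid for any approximate posterior q_\phi(z \mid x).
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```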

Title: CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance

Authors: Anindya Mondal, Ayan Banerjee, Sauradip Nag, Josep Lladós, Xiatian Zhu, Anjan Dutta
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16644
Pdf URL: https://arxiv.org/pdf/2508.16644
Copy Paste: [[2508.16644]] CountLoop: Training-Free High-Instance Image Generation via Iterative Agent Guidance(https://arxiv.org/abs/2508.16644)
Keywords: generation
Abstract: Diffusion models have shown remarkable progress in photorealistic image synthesis, yet they remain unreliable for generating scenes with a precise number of object instances, particularly in complex and high-density settings. We present CountLoop, a training-free framework that provides diffusion models with accurate instance control through iterative structured feedback. The approach alternates between image generation and multimodal agent evaluation, where a language-guided planner and critic assess object counts, spatial arrangements, and attribute consistency. This feedback is then used to refine layouts and guide subsequent generations. To further improve separation between objects, especially in occluded scenes, we introduce instance-driven attention masking and compositional generation techniques. Experiments on COCO Count, T2I CompBench, and two new high-instance benchmarks show that CountLoop achieves counting accuracy of up to 98% while maintaining spatial fidelity and visual quality, outperforming layout-based and gradient-guided baselines with a score of 0.97.

Title: QA-VLM: Providing human-interpretable quality assessment for wire-feed laser additive manufacturing parts with Vision Language Models

Authors: Qiaojie Zheng, Jiucai Zhang, Joy Gockel, Michael B. Wakin, Craig Brice, Xiaoli Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16661
Pdf URL: https://arxiv.org/pdf/2508.16661
Copy Paste: [[2508.16661]] QA-VLM: Providing human-interpretable quality assessment for wire-feed laser additive manufacturing parts with Vision Language Models(https://arxiv.org/abs/2508.16661)
Keywords: quality assessment
Abstract: Image-based quality assessment (QA) in additive manufacturing (AM) often relies heavily on the expertise and constant attention of skilled human operators. While machine learning and deep learning methods have been introduced to assist in this task, they typically provide black-box outputs without interpretable justifications, limiting their trust and adoption in real-world settings. In this work, we introduce a novel QA-VLM framework that leverages the attention mechanisms and reasoning capabilities of vision-language models (VLMs), enriched with application-specific knowledge distilled from peer-reviewed journal articles, to generate human-interpretable quality assessments. Evaluated on 24 single-bead samples produced by laser wire direct energy deposition (DED-LW), our framework demonstrates higher validity and consistency in explanation quality than off-the-shelf VLMs. These results highlight the potential of our approach to enable trustworthy, interpretable quality assessment in AM applications.

Title: Multidimensional Distributional Neural Network Output Demonstrated in Super-Resolution of Surface Wind Speed

Authors: Harrison J. Goldwyn, Mitchell Krock, Johann Rudi, Daniel Getter, Julie Bessac
Subjects: cs.LG, stat.ME, stat.ML
Abstract URL: https://arxiv.org/abs/2508.16686
Pdf URL: https://arxiv.org/pdf/2508.16686
Copy Paste: [[2508.16686]] Multidimensional Distributional Neural Network Output Demonstrated in Super-Resolution of Surface Wind Speed(https://arxiv.org/abs/2508.16686)
Keywords: super-resolution
Abstract: Accurate quantification of uncertainty in neural network predictions remains a central challenge for scientific applications involving high-dimensional, correlated data. While existing methods capture either aleatoric or epistemic uncertainty, few offer closed-form, multidimensional distributions that preserve spatial correlation while remaining computationally tractable. In this work, we present a framework for training neural networks with a multidimensional Gaussian loss, generating closed-form predictive distributions over outputs with non-identically distributed and heteroscedastic structure. Our approach captures aleatoric uncertainty by iteratively estimating the means and covariance matrices, and is demonstrated on a super-resolution example. We leverage a Fourier representation of the covariance matrix to stabilize network training and preserve spatial correlation. We introduce a novel regularization strategy -- referred to as information sharing -- that interpolates between image-specific and global covariance estimates, enabling convergence of the super-resolution downscaling network trained on image-specific distributional loss functions. This framework allows for efficient sampling, explicit correlation modeling, and extensions to more complex distribution families all without disrupting prediction performance. We demonstrate the method on a surface wind speed downscaling task and discuss its broader applicability to uncertainty-aware prediction in scientific models.
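Note: as background for the "multidimensional Gaussian loss" mentioned above, the standard negative log-likelihood of a d-dimensional Gaussian is written out below. The paper's exact loss and its Fourier parameterization of the covariance are not reproduced here. The second line is only one plausible reading of "interpolates between image-specific and global covariance estimates": the convex-combination form and the weight \lambda are assumptions.

```latex
% Negative log-likelihood of y under N(\mu, \Sigma), with y, \mu \in \mathbb{R}^d.
\mathcal{L}(\mu, \Sigma; y) = \tfrac{1}{2}\left[(y-\mu)^\top \Sigma^{-1} (y-\mu)
  + \log\det\Sigma + d\log(2\pi)\right]

% Assumed form of the information-sharing interpolation, \lambda \in [0, 1].
\Sigma_\lambda = (1-\lambda)\,\Sigma_{\text{image}} + \lambda\,\Sigma_{\text{global}}
```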

Title: A Framework for Benchmarking Fairness-Utility Trade-offs in Text-to-Image Models via Pareto Frontiers

Authors: Marco N. Bochernitsan, Rodrigo C. Barros, Lucas S. Kupssinskü
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16752
Pdf URL: https://arxiv.org/pdf/2508.16752
Copy Paste: [[2508.16752]] A Framework for Benchmarking Fairness-Utility Trade-offs in Text-to-Image Models via Pareto Frontiers(https://arxiv.org/abs/2508.16752)
Keywords: generation
Abstract: Achieving fairness in text-to-image generation demands mitigating social biases without compromising visual fidelity, a challenge critical to responsible AI. Current fairness evaluation procedures for text-to-image models rely on qualitative judgment or narrow comparisons, which limit the capacity to assess both fairness and utility in these models and prevent reproducible assessment of debiasing methods. Existing approaches typically employ ad-hoc, human-centered visual inspections that are both error-prone and difficult to replicate. We propose a method for evaluating fairness and utility in text-to-image models using Pareto-optimal frontiers across hyperparameterizations of debiasing methods. Our method allows for comparison between distinct text-to-image models, outlining all configurations that optimize fairness for a given utility and vice versa. To illustrate our evaluation method, we use Normalized Shannon Entropy and ClipScore for fairness and utility evaluation, respectively. We assess fairness and utility in Stable Diffusion, Fair Diffusion, SDXL, DeCoDi, and FLUX text-to-image models. Our method shows that most default hyperparameterizations of the text-to-image models are dominated solutions in the fairness-utility space, and it is straightforward to find better hyperparameters.
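Note: the two ingredients named above, a normalized Shannon entropy as the fairness score and a Pareto frontier over (fairness, utility) pairs, can be sketched in a few lines. This is a minimal illustration and not the authors' evaluation code: the function names are hypothetical, both metrics are assumed to be higher-is-better, and the entropy is assumed to be taken over per-group counts of generated images.

```python
import math


def normalized_shannon_entropy(counts):
    """Shannon entropy of a count distribution, divided by log(k) so it lies in [0, 1]."""
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(k)


def pareto_front(configs):
    """Return the names of configurations not dominated by any other.

    configs is a list of (name, fairness, utility) tuples. A configuration is
    dominated if another one is at least as good on both metrics and strictly
    better on at least one.
    """
    front = []
    for name, fair, util in configs:
        dominated = any(
            f2 >= fair and u2 >= util and (f2 > fair or u2 > util)
            for _, f2, u2 in configs
        )
        if not dominated:
            front.append(name)
    return front
```

A default setting that is missing from the returned list is a dominated solution in the sense used in the abstract: some other hyperparameter choice is at least as fair and at least as useful.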

Title: WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Authors: Rabiul Awal, Mahsa Massoud, Aarash Feizi, Zichao Li, Suyuchen Wang, Christopher Pal, Aishwarya Agrawal, David Vazquez, Siva Reddy, Juan A. Rodriguez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16763
Pdf URL: https://arxiv.org/pdf/2508.16763
Copy Paste: [[2508.16763]] WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation(https://arxiv.org/abs/2508.16763)
Keywords: generation
Abstract: We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing involving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

Title: Latent Graph Learning in Generative Models of Neural Signals

Authors: Nathan X. Kodama, Kenneth A. Loparo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.16776
Pdf URL: https://arxiv.org/pdf/2508.16776
Copy Paste: [[2508.16776]] Latent Graph Learning in Generative Models of Neural Signals(https://arxiv.org/abs/2508.16776)
Keywords: generative
Abstract: Inferring temporal interaction graphs and higher-order structure from neural signals is a key problem in building generative models for systems neuroscience. Foundation models for large-scale neural data represent shared latent structures of neural signals. However, extracting interpretable latent graph representations in foundation models remains challenging and unsolved. Here we explore latent graph learning in generative models of neural signals. By testing against numerical simulations of neural circuits with known ground-truth connectivity, we evaluate several hypotheses for explaining learned model weights. We discover modest alignment between extracted network representations and the underlying directed graphs and strong alignment in the co-input graph representations. These findings motivate paths towards incorporating graph-based geometric constraints in the construction of large-scale foundation models for neural data.

Title: Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data

Authors: Stefania L. Moroianu, Christian Bluethgen, Pierre Chambon, Mehdi Cherti, Jean-Benoit Delbrouck, Magdalini Paschali, Brandon Price, Judy Gichoya, Jenia Jitsev, Curtis P. Langlotz, Akshay S. Chaudhari
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16783
Pdf URL: https://arxiv.org/pdf/2508.16783
Copy Paste: [[2508.16783]] Improving Performance, Robustness, and Fairness of Radiographic AI Models with Finely-Controllable Synthetic Data(https://arxiv.org/abs/2508.16783)
Keywords: generation
Abstract: Achieving robust performance and fairness across diverse patient populations remains a challenge in developing clinically deployable deep learning models for diagnostic imaging. Synthetic data generation has emerged as a promising strategy to address limitations in dataset scale and diversity. We introduce RoentGen-v2, a text-to-image diffusion model for chest radiographs that enables fine-grained control over both radiographic findings and patient demographic attributes, including sex, age, and race/ethnicity. RoentGen-v2 is the first model to generate clinically plausible images with demographic conditioning, facilitating the creation of a large, demographically balanced synthetic dataset comprising over 565,000 images. We use this large synthetic dataset to evaluate optimal training pipelines for downstream disease classification models. In contrast to prior work that combines real and synthetic data naively, we propose an improved training strategy that leverages synthetic data for supervised pretraining, followed by fine-tuning on real data. Through extensive evaluation on over 137,000 chest radiographs from five institutions, we demonstrate that synthetic pretraining consistently improves model performance, generalization to out-of-distribution settings, and fairness across demographic subgroups. Across datasets, synthetic pretraining led to a 6.5% increase in the accuracy of downstream classification models, compared to a modest 2.7% increase when naively combining real and synthetic data. We observe this performance improvement simultaneously with a 19.3% reduction in the underdiagnosis fairness gap. These results highlight the potential of synthetic imaging to advance equitable and generalizable medical deep learning under real-world data constraints. We open-source our code, trained models, and synthetic dataset; the link is on the arXiv abstract page.

Title: Delta-SVD: Efficient Compression for Personalized Text-to-Image Models

Authors: Tangyuan Zhang, Shangyu Chen, Qixiang Chen, Jianfei Cai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16863
Pdf URL: https://arxiv.org/pdf/2508.16863
Copy Paste: [[2508.16863]] Delta-SVD: Efficient Compression for Personalized Text-to-Image Models(https://arxiv.org/abs/2508.16863)
Keywords: generation
Abstract: Personalized text-to-image models such as DreamBooth require fine-tuning large-scale diffusion backbones, resulting in significant storage overhead when maintaining many subject-specific models. We present Delta-SVD, a post-hoc, training-free compression method that targets the weight updates induced by DreamBooth fine-tuning. Our key observation is that these delta weights exhibit strong low-rank structure due to the sparse and localized nature of personalization. Delta-SVD first applies Singular Value Decomposition (SVD) to factorize the weight deltas, followed by an energy-based rank truncation strategy to balance compression efficiency and reconstruction fidelity. The resulting compressed models are fully plug-and-play and can be reconstructed on the fly during inference. Notably, the proposed approach is simple, efficient, and preserves the original model architecture. Experiments on a multi-subject dataset demonstrate that Delta-SVD achieves substantial compression with negligible loss in generation quality, as measured by CLIP score, SSIM, and FID. Our method enables scalable and efficient deployment of personalized diffusion models, making it a practical solution for real-world applications that require storing and deploying large-scale subject customizations.
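Note: the "energy-based rank truncation" step can be illustrated with a small rank-selection helper. This is a sketch under assumptions, not the authors' implementation: the abstract does not define "energy", so the common convention of summed squared singular values is assumed, and the function names, the 0.95 default threshold, and the storage layout (U_r, singular values, V_r) are hypothetical. The singular values themselves would come from an SVD of the delta W_finetuned - W_base, which is not computed here.

```python
def energy_rank(singular_values, energy=0.95):
    """Smallest rank r whose top-r squared singular values keep `energy` of the total."""
    total = sum(s * s for s in singular_values)
    if total == 0:
        return 0  # an all-zero delta needs no components
    kept = 0.0
    for r, s in enumerate(sorted(singular_values, reverse=True), start=1):
        kept += s * s
        if kept / total >= energy:
            return r
    return len(singular_values)


def compression_ratio(m, n, r):
    """Dense m x n delta size divided by rank-r factor size (U_r: m*r, s: r, V_r: n*r)."""
    return (m * n) / (r * (m + n + 1))
```

At inference time the truncated factors would be multiplied back together and added to the shared base weights, which is presumably what the abstract means by reconstructing on the fly.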

Title: AWM-Fuse: Multi-Modality Image Fusion for Adverse Weather via Global and Local Text Perception

Authors: Xilai Li, Huichun Liu, Xiaosong Li, Tao Ye, Zhenyu Kuang, Huafeng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16881
Pdf URL: https://arxiv.org/pdf/2508.16881
Copy Paste: [[2508.16881]] AWM-Fuse: Multi-Modality Image Fusion for Adverse Weather via Global and Local Text Perception(https://arxiv.org/abs/2508.16881)
Keywords: generation
Abstract: Multi-modality image fusion (MMIF) in adverse weather aims to address the loss of visual information caused by weather-related degradations, providing clearer scene representations. Although a few studies have attempted to incorporate textual information to improve semantic perception, they often lack effective categorization and thorough analysis of textual content. In response, we propose AWM-Fuse, a novel fusion method for adverse weather conditions, designed to handle multiple degradations through global and local text perception within a unified, shared-weight architecture. In particular, a global feature perception module leverages BLIP-produced captions to extract overall scene features and identify primary degradation types, thus promoting generalization across various adverse weather conditions. Complementing this, the local module employs detailed scene descriptions produced by ChatGPT to concentrate on specific degradation effects through concrete textual cues, thereby capturing finer details. Furthermore, textual descriptions are used to constrain the generation of fusion images, effectively steering the network learning process toward better alignment with real semantic labels, thereby promoting the learning of more meaningful visual features. Extensive experiments demonstrate that AWM-Fuse outperforms current state-of-the-art methods in complex weather conditions and downstream tasks. Our code is available via the link on the arXiv abstract page.

Title: MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration

Authors: Shunyu Yao, Ming Liu, Zhilu Zhang, Zhaolin Wan, Zhilong Ji, Jinfeng Bai, Wangmeng Zuo
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.16887
Pdf URL: https://arxiv.org/pdf/2508.16887
Copy Paste: [[2508.16887]] MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration(https://arxiv.org/abs/2508.16887)
Keywords: restoration, quality assessment
Abstract: Recent advancements in image quality assessment (IQA), driven by sophisticated deep neural network designs, have significantly improved the ability to approach human perceptions. However, most existing methods focus narrowly on fitting the overall score, neglecting the fact that humans typically evaluate image quality from different dimensions before arriving at an overall quality assessment. To overcome this problem, we propose a multi-dimensional image quality assessment (MDIQA) framework. Specifically, we model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions, to capture the multifaceted nature of human visual perception within distinct branches. Each branch of our MDIQA is initially trained under the guidance of a separate dimension, and the respective features are then amalgamated to generate the final IQA score. Additionally, once the MDIQA model is trained, we can deploy it for flexible training of image restoration (IR) models, enabling the restoration results to better align with varying user preferences through the adjustment of perceptual dimension weights. Extensive experiments demonstrate that our MDIQA achieves superior performance and can be effectively and flexibly applied to image restoration tasks. The code is available via the link on the arXiv abstract page.

Title: Structural Energy-Guided Sampling for View-Consistent Text-to-3D

Authors: Qing Zhang, Jinguang Tong, Jie Hong, Jing Zhang, Xuesong Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16917
Pdf URL: https://arxiv.org/pdf/2508.16917
Copy Paste: [[2508.16917]] Structural Energy-Guided Sampling for View-Consistent Text-to-3D(https://arxiv.org/abs/2508.16917)
Keywords: generation
Abstract: Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enforces multi-view consistency entirely at sampling time. SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory, steering geometry toward the intended viewpoint while preserving appearance fidelity. Integrated seamlessly into SDS/VSD pipelines, SEGS significantly reduces Janus artifacts, achieving improved geometric alignment and viewpoint consistency without retraining or weight modification.

Title: NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability

Authors: Krishna Kanth Nakka, Alexandre Alahi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16937
Pdf URL: https://arxiv.org/pdf/2508.16937
Copy Paste: [[2508.16937]] NAT: Learning to Attack Neurons for Enhanced Adversarial Transferability(https://arxiv.org/abs/2508.16937)
Keywords: generation
Abstract: The generation of transferable adversarial perturbations typically involves training a generator to maximize embedding separation between clean and adversarial images at a single mid-layer of a source model. In this work, we build on this approach and introduce Neuron Attack for Transferability (NAT), a method designed to target specific neurons within the embedding. Our approach is motivated by the observation that previous layer-level optimizations often disproportionately focus on a few neurons representing similar concepts, leaving other neurons within the attacked layer minimally affected. NAT shifts the focus from embedding-level separation to a more fundamental, neuron-specific approach. We find that targeting individual neurons effectively disrupts the core units of the neural network, providing a common basis for transferability across different models. Through extensive experiments on 41 diverse ImageNet models and 9 fine-grained models, NAT achieves fooling rates that surpass existing baselines by over 14% in cross-model and 4% in cross-domain settings. Furthermore, by leveraging the complementary attacking capabilities of the trained generators, we achieve impressive fooling rates within just 10 queries. Our code is available via the link on the arXiv abstract page.
摘要：可转移的对抗扰动的产生通常涉及训练生成器，以最大程度地在源模型的单层中层中最大化清洁和对抗图像之间的嵌入分离。在这项工作中，我们以这种方法为基础，并引入神经元攻击以进行转移性（NAT），该方法旨在针对嵌入中的特定神经元。我们的方法是由以下观察结果激发的，即以前的层级优化通常不成比例地关注代表相似概念的一些神经元，而将其他神经元留在受攻击层中受到的影响很小。 NAT将重点从嵌入级别的分离转变为更基本的，神经元特异性的方法。我们发现，针对单个神经元有效地破坏神经网络的核心单位，为跨不同模型的可传递性提供了共同的基础。通过对41种不同的成像网模型和9个细粒模型的广泛实验，NAT实现了愚蠢的率，在跨模型中超过14 \％，在跨域设置中超过14 \％。此外，通过利用训练有素的发电机的互补攻击能力，我们仅在10个查询中实现了令人印象深刻的愚弄率。我们的代码可用：此HTTPS URL

Title: Sig-DEG for Distillation: Making Diffusion Models Faster and Lighter

Authors: Lei Jiang, Wen Ge, Niels Cariou-Kotlarek, Mingxuan Yi, Po-Yu Chen, Lingyi Yang, Francois Buet-Golfouse, Gaurav Mittal, Hao Ni
Subjects: cs.LG, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2508.16939
Pdf URL: https://arxiv.org/pdf/2508.16939
Copy Paste: [[2508.16939]] Sig-DEG for Distillation: Making Diffusion Models Faster and Lighter(https://arxiv.org/abs/2508.16939)
Keywords: generation, generative
Abstract: Diffusion models have achieved state-of-the-art results in generative modelling but remain computationally intensive at inference time, often requiring thousands of discretization steps. To this end, we propose Sig-DEG (Signature-based Differential Equation Generator), a novel generator for distilling pre-trained diffusion models, which can universally approximate the backward diffusion process at a coarse temporal resolution. Inspired by high-order approximations of stochastic differential equations (SDEs), Sig-DEG leverages partial signatures to efficiently summarize Brownian motion over sub-intervals and adopts a recurrent structure to enable accurate global approximation of the SDE solution. Distillation is formulated as a supervised learning task, where Sig-DEG is trained to match the outputs of a fine-resolution diffusion model on a coarse time grid. During inference, Sig-DEG enables fast generation, as the partial signature terms can be simulated exactly without requiring fine-grained Brownian paths. Experiments demonstrate that Sig-DEG achieves competitive generation quality while reducing the number of inference steps by an order of magnitude. Our results highlight the effectiveness of signature-based approximations for efficient generative modeling.
摘要：扩散模型已经实现了最新的生成建模结果，但在推理时仍保持计算密集型，通常需要成千上万的离散步骤。为此，我们提出了SIG-DEG（基于签名的微分方程生成器），这是一种用于提炼预训练的扩散模型的新颖发电机，它可以普遍近似于粗糙的时间分辨率下的向后扩散过程。 SIG-DEG受到随机微分方程（SDE）的高阶近似启发，SIG-DEG杠杆效力是部分签名，以有效地汇总Brownian Motion在亚互相上，并采用经常性结构以实现SDE溶液的准确全局近似。蒸馏是作为监督的学习任务配制的，在粗分辨率扩散模型的粗分离网格上，SIG-DEG经过训练，以匹配精细分辨率扩散模型的输出。在推断期间，SIG-DEG可以快速生成，因为可以精确地模拟部分签名项而无需细粒度的布朗路径。实验表明，Sig-Deg可以达到竞争性的产生质量，同时将推理步骤的数量减少到数量级。我们的结果突出了基于签名的近似值对有效生成建模的有效性。

Title: Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning

Authors: Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Jiale Zhao, Jingwen Yang, Jianwei Lv, Kongcheng Zhang, Yihe Zhou, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.16949
Pdf URL: https://arxiv.org/pdf/2508.16949
Copy Paste: [[2508.16949]] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning(https://arxiv.org/abs/2508.16949)
Keywords: generation
Abstract: Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen-2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3.
摘要：大型语言模型（LLM）的最新进展强调了强化学习的潜力（RL），以促进推理能力的出现。尽管结果令人鼓舞，但由于RL的改进依赖于从高质量的样本中学习，但基本的困境仍然存在，但是对此类样本的探索仍然受到LLMS固有局限性的限制。实际上，这创造了一个不良的循环，其中无法探索的内容无法学习。在这项工作中，我们提出了划界的加固学习（Ruscarl），这是一个新颖的教学脚手架框架，旨在打破探索瓶颈的一般LLM推理。具体而言，Ruscarl将清单风格的标题引入了（1）在推出生成期间进行探索的明确脚手架，在该探索中，在该任务指令中提供了不同的专用指导，以指导各种高质量的高质量响应。随着时间的流逝，该指南逐渐衰减，鼓励模型内部化基本的推理模式。（2）在模型培训期间剥削的可验证奖励，我们可以使用标题作为参考文献获得强大的LLM-AS-A-A-A-a-a-Gudge分数，从而有效地对一般推理任务有效RL。广泛的实验证明了拟议的ruscarl在各种基准中的优越性，从而有效地扩大了在最佳N评估下的推理边界。值得注意的是，Ruscarl在HealthBench-500上显着将QWEN-2.5-7B教学从23.6提高到50.3，超过GPT-4.1。此外，我们在QWEN3-30B-A3B-INSTUCT上的微调变体在HealthBench-500上达到61.1，胜过包括OpenAI-O3在内的领先LLM。

Title: RPD-Diff: Region-Adaptive Physics-Guided Diffusion Model for Visibility Enhancement under Dense and Non-Uniform Haze

Authors: Ruicheng Zhang, Puxin Yan, Zeyu Zhang, Yicheng Chang, Hongyi Chen, Zhi Jin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16956
Pdf URL: https://arxiv.org/pdf/2508.16956
Copy Paste: [[2508.16956]] RPD-Diff: Region-Adaptive Physics-Guided Diffusion Model for Visibility Enhancement under Dense and Non-Uniform Haze(https://arxiv.org/abs/2508.16956)
Keywords: restoration, generation
Abstract: Single-image dehazing under dense and non-uniform haze conditions remains challenging due to severe information degradation and spatial heterogeneity. Traditional diffusion-based dehazing methods struggle with insufficient generation conditioning and lack of adaptability to spatially varying haze distributions, which leads to suboptimal restoration. To address these limitations, we propose RPD-Diff, a Region-adaptive Physics-guided Dehazing Diffusion Model for robust visibility enhancement in complex haze scenarios. RPD-Diff introduces a Physics-guided Intermediate State Targeting (PIST) strategy, which leverages physical priors to reformulate the diffusion Markov chain by generating target transitions, mitigating the issue of insufficient conditioning in dense haze scenarios. Additionally, the Haze-Aware Denoising Timestep Predictor (HADTP) dynamically adjusts patch-specific denoising timesteps employing a transmission map cross-attention mechanism, adeptly managing non-uniform haze distributions. Extensive experiments across four real-world datasets demonstrate that RPD-Diff achieves state-of-the-art performance in challenging dense and non-uniform haze scenarios, delivering high-quality, haze-free images with superior detail clarity and color fidelity.
摘要：由于严重的信息降解和空间异质性，在致密和不均匀的雾兹条件下的单形图像去悬去仍然具有挑战性。传统的基于扩散的去悬式方法在发电不足和缺乏对空间变化的雾霾分布的能力不足，这会导致次优恢复。为了解决这些局限性，我们提出了RPD-DIFF，这是一种区域自适应物理学引导的脱壳扩散模型，用于在复杂的雾霾场景中增强可见的可见性。 RPD-DIFF引入了物理引导的中间状态靶向（PIST）策略，该策略利用物理先验来通过产生目标过渡来重新制定Markov链的扩散，从而减轻了密集的haze场景中条件不足的问题。此外，使用传输图跨注意机制的雾化时间段预测器（HADTP）动态调整了特定于斑块的去核时间段，从而管理了不均匀的雾霾分布。四个现实世界数据集进行的广泛实验表明，RPD-DIFF在具有挑战性的密集和非均匀的雾度场景中实现了最先进的性能，提供了具有优质细节清晰度和颜色保真度的高质量，无雾的图像。

Title: HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial-based Feature Caching

Authors: Liang Feng, Shikang Zheng, Jiacheng Liu, Yuqi Lin, Qinming Zhou, Peiliang Cai, Xinyu Wang, Junjie Chen, Chang Zou, Yue Ma, Linfeng Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.16984
Pdf URL: https://arxiv.org/pdf/2508.16984
Copy Paste: [[2508.16984]] HiCache: Training-free Acceleration of Diffusion Models via Hermite Polynomial-based Feature Caching(https://arxiv.org/abs/2508.16984)
Keywords: super-resolution, generation
Abstract: Diffusion models have achieved remarkable success in content generation but suffer from prohibitive computational costs due to iterative sampling. While recent feature caching methods tend to accelerate inference through temporal extrapolation, these methods still suffer from severe quality loss due to the failure in modeling the complex dynamics of feature evolution. To solve this problem, this paper presents HiCache, a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials, the potentially theoretically optimal basis for Gaussian-correlated processes. We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy. Extensive experiments demonstrate HiCache's superiority: achieving 6.24x speedup on FLUX.1-dev while exceeding baseline quality, maintaining strong performance across text-to-image, video generation, and super-resolution tasks. Core implementation is provided in the appendix, with complete code to be released upon acceptance.
摘要：扩散模型在内容产生中取得了显着的成功，但由于迭代采样而遭受了过度的计算成本。尽管最近的功能缓存方法倾向于通过时间外推加速推理，但由于对功能演化的复杂动力学进行建模，这些方法仍然会遭受服务器质量损失。为了解决这个问题，本文提出了Hicache，这是一个无训练的加速框架，从根本上通过将数学工具与经验属性保持一致，从而改善了特征的预测。我们的关键见解是，扩散变压器中的特征衍生近似具有多元高斯特征，激发了使用Hermite多项式的使用 - 潜在的理论上是高斯相关过程的最佳基础。此外，我们进一步引入了双尺度机制，该机制可确保数值稳定性，同时保持预测精度。广泛的实验证明了Hicache的优势：在磁通量上达到6.24倍加速度，同时超过基线质量，在文本对图像，视频生成和超分辨率任务之间保持强劲的性能。附录中提供了核心实现，并在接受后发布完整的代码。

Title: Dual Orthogonal Guidance for Robust Diffusion-based Handwritten Text Generation

Authors: Konstantina Nikolaidou, George Retsinas, Giorgos Sfikas, Silvia Cascianelli, Rita Cucchiara, Marcus Liwicki
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17017
Pdf URL: https://arxiv.org/pdf/2508.17017
Copy Paste: [[2508.17017]] Dual Orthogonal Guidance for Robust Diffusion-based Handwritten Text Generation(https://arxiv.org/abs/2508.17017)
Keywords: generation
Abstract: Diffusion-based Handwritten Text Generation (HTG) approaches achieve impressive results on frequent, in-vocabulary words observed at training time and on regular styles. However, they are prone to memorizing training samples and often struggle with style variability and generation clarity. In particular, standard diffusion models tend to produce artifacts or distortions that negatively affect the readability of the generated text, especially when the style is hard to produce. To tackle these issues, we propose a novel sampling guidance strategy, Dual Orthogonal Guidance (DOG), that leverages an orthogonal projection of a negatively perturbed prompt onto the original positive prompt. This approach helps steer the generation away from artifacts while maintaining the intended content, and encourages more diverse, yet plausible, outputs. Unlike standard Classifier-Free Guidance (CFG), which relies on unconditional predictions and produces noise at high guidance scales, DOG introduces a more stable, disentangled direction in the latent space. To control the strength of the guidance across the denoising process, we apply a triangular schedule: weak at the start and end of denoising, when the process is most sensitive, and strongest in the middle steps. Experimental results on the state-of-the-art DiffusionPen and One-DM demonstrate that DOG improves both content clarity and style variability, even for out-of-vocabulary words and challenging writing styles.
摘要：基于扩散的手写文本生成（HTG）方法在训练时和常规样式上观察到的频繁的，播出的单词可取得令人印象深刻的结果。但是，它们很容易记住训练样本，并且经常在风格变异性和发电性上挣扎。特别是，标准扩散模型倾向于产生伪影或扭曲，从而对生成的文本的可读性产生负面影响，尤其是在样式难以生成时。为了解决这些问题，我们提出了一种新颖的采样指导策略，即双交指导（DOG），该策略利用了对原始积极提示的正交投影。这种方法有助于使一代人远离工件，同时保持预期的内容，并鼓励更加多样化但合理的产出。与无条件预测并在高导度尺度下产生噪音的无标准分类器指导（CFG）不同，狗在潜在空间中引入了更稳定的，散布的方向。为了控制整个denoising过程中的指导的强度，我们采用三角形时间表：在过程中最敏感和最强的步骤中，在DeNoising的开始和结束时较弱。最先进的扩散彭和One-DM的实验结果表明，即使对于播音方式和挑战性的写作风格，狗也可以提高内容的清晰度和样式变异性。
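The two ingredients named in the abstract above, a triangular guidance schedule and an orthogonal projection between the negative and positive predictions, can be illustrated with a small sketch. This is only one possible reading of the abstract, not the authors' implementation: the function names, the `w_max` parameter, and the final update rule in `guided_prediction` are all assumptions made for illustration.

```python
import numpy as np


def triangular_weight(step, num_steps, w_max=1.0):
    """Guidance weight that is 0 at the first and last step and w_max at the midpoint."""
    if num_steps < 2:
        return w_max
    t = step / (num_steps - 1)  # normalized position in [0, 1]
    return w_max * (1.0 - abs(2.0 * t - 1.0))


def orthogonal_component(neg, pos, eps=1e-12):
    """Part of `neg` that is orthogonal to `pos` (both treated as flat vectors)."""
    n, p = neg.ravel(), pos.ravel()
    parallel = (n @ p) / (p @ p + eps) * p  # projection of neg onto pos
    return (n - parallel).reshape(neg.shape)


def guided_prediction(pos, neg, step, num_steps, w_max=1.0):
    """Hypothetical update: move away from the orthogonal part of the negative prediction."""
    weight = triangular_weight(step, num_steps, w_max)
    return pos - weight * orthogonal_component(neg, pos)
```

Because the weight vanishes at both ends of the schedule, `guided_prediction` returns the positive prediction unchanged at the first and last denoising steps, which matches the abstract's description of weak guidance where the process is most sensitive.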

Title: A Novel Local Focusing Mechanism for Deepfake Detection Generalization

Authors: Mingliang Li, Lin Yuanbo Wu, Changhong Liu, Hanxi Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17029
Pdf URL: https://arxiv.org/pdf/2508.17029
Copy Paste: [[2508.17029]] A Novel Local Focusing Mechanism for Deepfake Detection Generalization(https://arxiv.org/abs/2508.17029)
Keywords: generation
Abstract: The rapid advancement of deepfake generation techniques has intensified the need for robust and generalizable detection methods. Existing approaches based on reconstruction learning typically leverage deep convolutional networks to extract differential features. However, these methods show poor generalization across object categories (e.g., from faces to cars) and generation domains (e.g., from GANs to Stable Diffusion), due to intrinsic limitations of deep CNNs. First, models trained on a specific category tend to overfit to semantic feature distributions, making them less transferable to other categories, especially as network depth increases. Second, Global Average Pooling (GAP) compresses critical local forgery cues into a single vector, thus discarding discriminative patterns vital for real-fake classification. To address these issues, we propose a novel Local Focus Mechanism (LFM) that explicitly attends to discriminative local features for differentiating fake from real images. LFM integrates a Salience Network (SNet) with a task-specific Top-K Pooling (TKP) module to select the K most informative local patterns. To mitigate potential overfitting introduced by Top-K pooling, we introduce two regularization techniques: Rank-Based Linear Dropout (RBLD) and Random-K Sampling (RKS), which enhance the model's robustness. LFM achieves a 3.7 improvement in accuracy and a 2.8 increase in average precision over the state-of-the-art Neighboring Pixel Relationships (NPR) method, while maintaining exceptional efficiency at 1789 FPS on a single NVIDIA A6000 GPU. Our approach sets a new benchmark for cross-domain deepfake detection. The source code is available at this https URL
摘要：深泡产生技术的快速发展增强了对可靠和可推广的检测方法的需求。基于重建学习的现有方法通常利用深层卷积网络来提取差异特征。但是，由于深CNN的固有局限性，这些方法的概括（例如，从面孔到汽车）和发电域（例如，从gan到稳定的扩散）的泛化差。首先，在特定类别中训练的模型倾向于过度适合语义特征分布，从而使其不可转移到其他类别，尤其是随着网络深度的增加。其次，全球平均池（GAP）将关键的局部伪造线索压缩为单个矢量，从而丢弃了对真实分类至关重要的歧视性模式。为了解决这些问题，我们提出了一种新型的本地重点机制（LFM），该机制明确地参与了歧视性的本地特征，以将伪造与真实图像区分开来。 LFM将显着网络（SNET）与特定于任务的TOP-K合并（TKP）模块集成在一起，以选择最有用的本地模式。为了减轻Top-K池池引入的潜在过度拟合，我们引入了两种正则化技术：基于等级的线性辍学（RBLD）和Random-K采样（RKS），从而增强了模型的稳健性。 LFM的准确性提高了3.7，比最新的相邻像素关系（NPR）方法的平均精度提高了2.8，同时在单个NVIDIA A6000 GPU上保持了1789 FPS的出色效率。我们的方法为跨域DeepFake检测设定了新的基准。源代码可在此HTTPS URL中找到
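The contrast the abstract draws between Global Average Pooling and Top-K pooling can be seen on a toy salience map. The sketch below is an assumption-laden illustration, not the paper's TKP module: it assumes the K selected scores are simply averaged, and the function names are hypothetical.

```python
import numpy as np


def global_average_pool(scores):
    """Average every local score, so a single strong cue is diluted by the rest."""
    return float(np.asarray(scores, dtype=float).mean())


def top_k_pool(scores, k):
    """Average only the k largest local scores (averaging is an assumption here)."""
    flat = np.asarray(scores, dtype=float).ravel()
    if k < 1:
        raise ValueError("k must be at least 1")
    k = min(k, flat.size)
    # np.partition places the k largest values in the last k positions.
    top = np.partition(flat, flat.size - k)[flat.size - k:]
    return float(top.mean())
```

On a 4x4 map where one patch carries a strong response and the rest are zero, `global_average_pool` shrinks that response by a factor of 16, while `top_k_pool` with a small `k` keeps it largely intact. With `k` equal to the number of patches, the two functions coincide.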

Title: Styleclone: Face Stylization with Diffusion Based Data Augmentation

Authors: Neeraj Matiyali, Siddharth Srivastava, Gaurav Sharma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17045
Pdf URL: https://arxiv.org/pdf/2508.17045
Copy Paste: [[2508.17045]] Styleclone: Face Stylization with Diffusion Based Data Augmentation(https://arxiv.org/abs/2508.17045)
Keywords: generation
Abstract: We present StyleClone, a method for training image-to-image translation networks to stylize faces in a specific style, even with limited style images. Our approach leverages textual inversion and diffusion-based guided image generation to augment small style datasets. By systematically generating diverse style samples guided by both the original style images and real face images, we significantly enhance the diversity of the style dataset. Using this augmented dataset, we train fast image-to-image translation networks that outperform diffusion-based methods in speed and quality. Experiments on multiple styles demonstrate that our method improves stylization quality, better preserves source image content, and significantly accelerates inference. Additionally, we provide a systematic evaluation of the augmentation techniques and their impact on stylization performance.
摘要：我们提出StyleClone，这是一种训练图像到图像翻译网络的方法，即使样式图像有限，也可以以特定的样式进行风格。我们的方法利用文本反演和基于扩散的指导图像生成来增强小型数据集。通过系统地生成以原始样式图像和真实面部图像为指导的多种样式样品，我们可以显着增强样式数据集的多样性。使用此增强数据集，我们训练快速图像到图像翻译网络，以优于基于速度和质量的基于扩散的方法。多种样式的实验表明，我们的方法可以提高风格化质量，更好地保留源图像内容并显着加速推断。此外，我们还对增强技术及其对风格化性能的影响进行系统评估。

Title: PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models

Authors: Xianjing Cheng, Lintai Wu, Zuowen Wang, Junhui Hou, Jie Wen, Yong Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17050
Pdf URL: https://arxiv.org/pdf/2508.17050
Copy Paste: [[2508.17050]] PVNet: Point-Voxel Interaction LiDAR Scene Upsampling Via Diffusion Models(https://arxiv.org/abs/2508.17050)
Keywords: generation
Abstract: Accurate 3D scene understanding in outdoor environments heavily relies on high-quality point clouds. However, LiDAR-scanned data often suffer from extreme sparsity, severely hindering downstream 3D perception tasks. Existing point cloud upsampling methods primarily focus on individual objects, thus demonstrating limited generalization capability for complex outdoor scenes. To address this issue, we propose PVNet, a diffusion model-based point-voxel interaction framework to perform LiDAR point cloud upsampling without dense supervision. Specifically, we adopt the classifier-free guidance-based DDPMs to guide the generation, in which we employ a sparse point cloud as the guiding condition and the synthesized point clouds derived from its nearby frames as the input. Moreover, we design a voxel completion module to refine and complete the coarse voxel features for enriching the feature representation. In addition, we propose a point-voxel interaction module to integrate features from both points and voxels, which efficiently improves the environmental perception capability of each upsampled point. To the best of our knowledge, our approach is the first scene-level point cloud upsampling method supporting arbitrary upsampling rates. Extensive experiments on various benchmarks demonstrate that our method achieves state-of-the-art performance. The source code will be available at this https URL.
摘要：精确的3D场景在室外环境中的理解在很大程度上依赖于高质量的点云。但是，激光扫描的数据通常遭受极端稀疏性，严重阻碍了下游3D感知任务。现有的点云上采样方法主要集中在单个对象上，从而证明了复杂室外场景的概括能力有限。为了解决此问题，我们提出了PVNET，这是一种基于扩散模型的点素交互框架，以执行LIDAR POINT CLOIN CLUBER UPSPLING，而无需密集的监督。具体来说，我们采用基于无指导的DDPM来指导生成，其中我们采用稀疏点云作为指导条件和综合点云作为其附近框架作为输入。此外，我们设计一个体素完成模块，以完善并完成粗素特征，以丰富特征表示。此外，我们提出了一个点 - 素相互作用模块，以整合来自点和体素的特征，从而有效地提高了每个UPPLEPS采样点的环境感知能力。据我们所知，我们的方法是支持任意上采样率的第一个场景级别点云上采样方法。对各种基准测试的广泛实验表明，我们的方法可以达到最新的性能。源代码将在此HTTPS URL上可用。
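For readers unfamiliar with classifier-free guidance, which the abstract above says is used to condition generation on the sparse point cloud, the standard combination step can be sketched as follows. This shows the generic technique under one common convention for the guidance scale; how PVNet itself parameterizes or schedules it is not stated in the abstract, and the function and argument names are illustrative.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions.

    Convention assumed here: scale 0 gives the unconditional prediction,
    scale 1 gives the conditional one, and larger values extrapolate
    further toward the condition.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    eps_cond = np.asarray(eps_cond, dtype=float)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

In a sampler, `eps_cond` would come from a denoiser call that sees the condition and `eps_uncond` from a call with the condition dropped; the combined prediction then replaces the raw network output in the usual DDPM update.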

Title: REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework

Authors: Stefanos Pasios, Nikos Nikolaidis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17061
Pdf URL: https://arxiv.org/pdf/2508.17061
Copy Paste: [[2508.17061]] REGEN: Real-Time Photorealism Enhancement in Games via a Dual-Stage Generative Network Framework(https://arxiv.org/abs/2508.17061)
Keywords: generative
Abstract: Photorealism is an important aspect of modern video games since it can shape the player experience and simultaneously impact the immersion, narrative engagement, and visual fidelity. Although recent hardware technological breakthroughs, along with state-of-the-art rendering technologies, have significantly improved the visual realism of video games, achieving true photorealism in dynamic environments at real-time frame rates remains a major challenge due to the tradeoff between visual quality and performance. In this short paper, we present a novel approach for enhancing the photorealism of rendered game frames using generative adversarial networks. To this end, we propose Real-time photorealism Enhancement in Games via a dual-stage gEnerative Network framework (REGEN), which employs a robust unpaired image-to-image translation model to produce semantically consistent photorealistic frames that transform the problem into a simpler paired image-to-image translation task. This enables training with a lightweight method that can achieve real-time inference without compromising visual quality. We demonstrate the effectiveness of our framework on Grand Theft Auto V, showing that the approach achieves visual results comparable to the ones produced by the robust unpaired Im2Im method while improving inference speed by 32.14 times. Our findings also indicate that the results outperform the photorealism-enhanced frames produced by directly training a lightweight unpaired Im2Im translation method to translate the video game frames towards the visual characteristics of real-world images. Code, pre-trained models, and demos for this work are available at: this https URL.
摘要：光真主是现代视频游戏的重要方面，因为它可以塑造玩家体验并同时影响沉浸式，叙事参与和视觉保真度。尽管最近的硬件技术突破以及最先进的渲染技术已经显着改善了视频游戏的视觉现实主义，但由于视觉质量和性能之间的权衡，在实时帧速率下在动态环境中实现真正的光逼现实主义仍然是一个重大挑战。在这篇简短的论文中，我们提出了一种新颖的方法，可以使用生成的对抗网络来增强渲染游戏框架的光真相。为此，我们通过双阶段生成网络框架（Regen）提出了实时的光逼现实主义增强，该框架采用了强大的未配对的图像到图像翻译模型来生成语义上一致的光真逼真的框架，从而将问题转化为简单的配对图像到图像到图像到图像对象转换任务。这可以通过轻巧的方法进行培训，该方法可以实现实时推理时间而不会损害视觉质量。我们证明了我们框架对大盗窃自动V的有效性，表明该方法可实现与可靠的未配对IM2IM方法所产生的结果相当的视觉结果，同时将推理速度提高了32.14次。我们的发现还表明，通过直接训练轻巧的未配对的IM2IM翻译方法，结果超过了光真相增强的框架，将视频游戏框架转换为现实世界图像的视觉特征。该工作的代码，预训练模型和演示可在以下网址提供：此HTTPS URL。

Title: SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation

Authors: Peng Hu, Yu Gu, Liang Luo, Fuji Ren
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17062
Pdf URL: https://arxiv.org/pdf/2508.17062
Copy Paste: [[2508.17062]] SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation(https://arxiv.org/abs/2508.17062)
Keywords: generation, generative
Abstract: Controllable video generation aims to synthesize video content that aligns precisely with user-provided conditions, such as text descriptions and initial images. However, a significant challenge persists in this domain: existing models often struggle to maintain strong semantic consistency, frequently generating videos that deviate from the nuanced details specified in the prompts. To address this issue, we propose SSG-DiT (Spatial Signal Guided Diffusion Transformer), a novel and efficient framework for high-fidelity controllable video generation. Our approach introduces a decoupled two-stage process. The first stage, Spatial Signal Prompting, generates a spatially aware visual prompt by leveraging the rich internal representations of a pre-trained multi-modal model. This prompt, combined with the original text, forms a joint condition that is then injected into a frozen video DiT backbone via our lightweight and parameter-efficient SSG-Adapter. This unique design, featuring a dual-branch attention mechanism, allows the model to simultaneously harness its powerful generative priors while being precisely steered by external spatial signals. Extensive experiments demonstrate that SSG-DiT achieves state-of-the-art performance, outperforming existing models on multiple key metrics in the VBench benchmark, particularly in spatial relationship control and overall consistency.
摘要：可控的视频生成旨在综合与用户提供的条件（例如文本说明和初始图像）完全一致的视频内容。但是，在这个领域中，一个重大的挑战仍然存在：现有模型通常难以维持强大的语义一致性，经常生成偏离提示中指定的细微细节的视频。为了解决这个问题，我们提出了SSG-DIT（空间信号引导的扩散变压器），这是一种可用于高保真可控视频生成的新颖而有效的框架。我们的方法引入了一个脱钩的两个阶段过程。第一阶段是空间信号提示，通过利用预先训练的多模式模型的丰富内部表示形式来生成空间意识到的视觉提示。该提示与原始文本结合使用，形成了一个关节条件，然后通过我们的轻质和参数效率的SSG适配器将其注入冷冻视频dit主链中。这种具有双分支注意力机制的独特设计使该模型可以同时利用其强大的生成先验，同时由外部空间信号精确地指导。广泛的实验表明，SSG-DIT达到了最先进的性能，在VBENCH基准中的多个关键指标上的现有模型优于现有模型，尤其是在空间关系控制和整体一致性方面。

Title: Two Birds with One Stone: Enhancing Uncertainty Quantification and Interpretability with Graph Functional Neural Process

Authors: Lingkai Kong, Haotian Sun, Yuchen Zhuang, Haorui Wang, Wenhao Mu, Chao Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17097
Pdf URL: https://arxiv.org/pdf/2508.17097
Copy Paste: [[2508.17097]] Two Birds with One Stone: Enhancing Uncertainty Quantification and Interpretability with Graph Functional Neural Process(https://arxiv.org/abs/2508.17097)
Keywords: generative
Abstract: Graph neural networks (GNNs) are powerful tools for graph data. However, their predictions are miscalibrated and lack interpretability, limiting their adoption in critical applications. To address this issue, we propose a new uncertainty-aware and interpretable graph classification model that combines graph functional neural process and graph generative model. The core of our method is to assume a set of latent rationales which can be mapped to a probabilistic embedding space; the predictive distribution of the classifier is conditioned on such rationale embeddings by learning a stochastic correlation matrix. The graph generator serves to decode the graph structure of the rationales from the embedding space for model interpretability. For efficient model training, we adopt an alternating optimization procedure which mimics the well-known Expectation-Maximization (EM) algorithm. The proposed method is general and can be applied to any existing GNN architecture. Extensive experiments on five graph classification datasets demonstrate that our framework outperforms state-of-the-art methods in both uncertainty quantification and GNN interpretability. We also conduct case studies to show that the decoded rationale structure can provide meaningful explanations.
摘要：图形神经网络（GNN）是图数据上的强大工具。但是，他们的预测是错误校准的，缺乏解释性，从而限制了它们在关键应用中的采用。为了解决这个问题，我们提出了一种新的不确定性感知和可解释的图形分类模型，该模型结合了图形功能神经过程和图形生成模型。我们方法的核心是假设一组潜在理由，可以映射到概率嵌入空间。分类器的预测分布通过学习随机相关矩阵来基于此类理由嵌入。该图生成器可从嵌入空间中解释理由的图形结构，以解释。对于有效的模型培训，我们采用了一种交替的优化程序，该程序模仿了众所周知的预期最大化（EM）算法。提出的方法是一般的，可以应用于任何现有的GNN体系结构。在五个图形分类数据集上进行的广泛实验表明，我们的框架在不确定性量化和GNN的可解释性方面都优于最先进的方法。我们还进行了案例研究，以表明解码的理由结构可以提供有意义的解释。

Title: PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science

Authors: Syed Nazmus Sakib, Nafiul Haque, Mohammad Zabed Hossain, Shifat E. Arman
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17117
Pdf URL: https://arxiv.org/pdf/2508.17117
Copy Paste: [[2508.17117]] PlantVillageVQA: A Visual Question Answering Dataset for Benchmarking Vision-Language Models in Plant Science(https://arxiv.org/abs/2508.17117)
Keywords: quality assessment
Abstract: PlantVillageVQA is a large-scale visual question answering (VQA) dataset derived from the widely used PlantVillage image corpus. It was designed to advance the development and evaluation of vision-language models for agricultural decision-making and analysis. The PlantVillageVQA dataset comprises 193,609 high-quality question-answer (QA) pairs grounded over 55,448 images spanning 14 crop species and 38 disease conditions. Questions are organised into 3 levels of cognitive complexity and 9 distinct categories. Each question category was phrased manually following expert guidance and generated via an automated two-stage pipeline: (1) template-based QA synthesis from image metadata and (2) multi-stage linguistic re-engineering. The dataset was iteratively reviewed by domain experts for scientific accuracy and relevancy. The final dataset was evaluated using three state-of-the-art models for quality assessment. Our objective remains to provide a publicly available, standardised and expert-verified database to enhance diagnostic accuracy for plant disease identifications and advance scientific research in the agricultural domain. Our dataset will be open-sourced at this https URL.
摘要：PlantVillageVQA是一个大规模的视觉问题回答（VQA）数据集，该数据集源自广泛使用的PlantVillage Image语料库。它旨在推进视力模型的发展和评估，以进行农业决策和分析。 PlantVillageVQA数据集包含193,609个高质量的问题解答（QA），覆盖了55,448张图像，涵盖了14种农作物物种和38种疾病状况。问题分为3个级别的认知复杂性和9个不同的类别。每个问题类别都是按照专家指导的手动措辞，并通过自动化的两阶段管道生成：（1）基于图像元数据的基于模板的QA综合，（2）多阶段的语言重新设计。该数据集经过域专家的科学准确性和相关性进行了迭代审查。最终数据集使用三种最新模型进行质量评估评估。我们的目标仍然是提供公开可用的，标准化和专家验证的数据库，以提高植物疾病识别的诊断准确性，并提高农业领域的科学研究。我们的数据集将在此HTTPS URL上开源。

Title: Structural Damage Detection Using AI Super Resolution and Visual Language Model

Authors: Catherine Hoier, Khandaker Mamun Ahmed
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17130
Pdf URL: https://arxiv.org/pdf/2508.17130
Copy Paste: [[2508.17130]] Structural Damage Detection Using AI Super Resolution and Visual Language Model(https://arxiv.org/abs/2508.17130)
Keywords: restoration, super-resolution
Abstract: Natural disasters pose significant challenges to timely and accurate damage assessment due to their sudden onset and the extensive areas they affect. Traditional assessment methods are often labor-intensive, costly, and hazardous to personnel, making them impractical for rapid response, especially in resource-limited settings. This study proposes a novel, cost-effective framework that leverages aerial drone footage, an advanced AI-based video super-resolution model, Video Restoration Transformer (VRT), and Gemma3:27b, a 27 billion parameter Visual Language Model (VLM). This integrated system is designed to improve low-resolution disaster footage, identify structural damage, and classify buildings into four damage categories, ranging from no/slight damage to total destruction, along with associated risk levels. The methodology was validated using pre- and post-event drone imagery from the 2023 Turkey earthquakes (courtesy of The Guardian) and satellite data from the 2013 Moore Tornado (xBD dataset). The framework achieved a classification accuracy of 84.5%, demonstrating its ability to provide highly accurate results. Furthermore, the system's accessibility allows non-technical users to perform preliminary analyses, thereby improving the responsiveness and efficiency of disaster management efforts.
摘要：自然灾害由于突然的发作及其影响的广泛领域而对及时，准确的损害评估提出了重大挑战。传统的评估方法通常是劳动力密集的，昂贵的，并且对人员危害，使其对快速响应不切实际，尤其是在资源有限的环境中。这项研究提出了一个新颖的，具有成本效益的框架，该框架利用空中无人机镜头，一种基于AI的高级视频超分辨率模型，视频恢复变压器（VRT）和Gemma3：27B，270亿个参数视觉语言模型（VLM）。该集成系统旨在改善低分辨率灾难录像，确定结构性损害，并将建筑物分为四个损害类别，范围从无/造成的全部破坏以及相关的风险水平。该方法使用2023年土耳其地震（由监护人提供）和2013年摩尔龙卷风（XBD数据集）的卫星数据的事前和事后无人机图像进行了验证。该框架的分类精度为84.5％，表明其提供了高度准确的结果的能力。此外，该系统的可访问性允许非技术用户进行初步分析，从而提高灾难管理工作的响应能力和效率。

Title: MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling

Authors: Hyeyeon Kim, Sungwoo Han, Jingun Kwon, Hidetaka Kamigaito, Manabu Okumura
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17199
Pdf URL: https://arxiv.org/pdf/2508.17199
Copy Paste: [[2508.17199]] MMCIG: Multimodal Cover Image Generation for Text-only Documents and Its Dataset Construction via Pseudo-labeling(https://arxiv.org/abs/2508.17199)
Keywords: generation
Abstract: In this study, we introduce a novel cover image generation task that produces both a concise summary and a visually corresponding image from a given text-only document. Because no existing datasets are available for this task, we propose a multimodal pseudo-labeling method to construct high-quality datasets at low cost. We first collect documents that contain multiple images with captions, together with their summaries, and exclude factually inconsistent instances. Our approach selects one image from the multiple images accompanying each document. Using the gold summary, we independently rank both the images and their captions. Then, we annotate a pseudo-label for an image when both the image and its corresponding caption are ranked first in their respective rankings. Finally, we remove documents that contain direct image references within texts. Experimental results demonstrate that the proposed multimodal pseudo-labeling method constructs more precise datasets and generates higher quality images than text- and image-only pseudo-labeling methods, which consider captions and images separately. We release our code at: this https URL
摘要：在这项研究中，我们介绍了一项新颖的封面图像生成任务，该任务既可以产生简明的摘要，又产生了仅给定文本文档的视觉相关图像。由于没有现有数据集用于此任务，因此我们提出了一种多模式伪标记的方法，以低成本构建高质量数据集。我们首先收集包含带有标题的多个图像的文档，并通过排除事实不一致的实例来摘要。我们的方法从随附文档的多个图像中选择一个图像。使用黄金摘要，我们独立对图像及其标题进行排名。然后，当图像及其相应的标题在其各自的排名中排名第一时，我们对图像的伪标签进行注释。最后，我们删除包含文本中包含直接图像参考的文档。实验结果表明，所提出的多模式伪标记方法构建了比文本和仅图像和仅图像的伪标记方法更精确的数据集，并产生更高的图像，这些方法分别考虑字幕和图像。我们在：此HTTPS URL上发布代码
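The agreement rule described in the abstract above (an image is pseudo-labeled only when both the image and its caption rank first against the gold summary) is simple enough to sketch. The scores are assumed to be precomputed similarities to the gold summary, with higher meaning more relevant; how the authors actually compute them is not stated in the abstract, and the function name is hypothetical.

```python
def pseudo_label(image_scores, caption_scores):
    """Return the index of the image to use as cover, or None if the rankings disagree.

    image_scores[i] and caption_scores[i] are assumed similarity scores of
    image i and its caption to the gold summary (higher is better).
    """
    if not image_scores or len(image_scores) != len(caption_scores):
        raise ValueError("need one caption score per image, and at least one image")
    best_image = max(range(len(image_scores)), key=image_scores.__getitem__)
    best_caption = max(range(len(caption_scores)), key=caption_scores.__getitem__)
    # Label only when the top-ranked image and the top-ranked caption coincide.
    return best_image if best_image == best_caption else None
```

Documents for which the function returns `None` would receive no pseudo-label under this reading, which is how requiring agreement between the two rankings trades dataset size for precision.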

Title: How to make Medical AI Systems safer? Simulating Vulnerabilities, and Threats in Multimodal Medical RAG System

Authors: Kaiwen Zuo, Zelin Liu, Raman Dutt, Ziyang Wang, Zhongtian Sun, Yeming Wang, Fan Mo, Pietro Liò
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.17215
Pdf URL: https://arxiv.org/pdf/2508.17215
Copy Paste: [[2508.17215]] How to make Medical AI Systems safer? Simulating Vulnerabilities, and Threats in Multimodal Medical RAG System(https://arxiv.org/abs/2508.17215)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) augmented with Retrieval-Augmented Generation (RAG) are increasingly employed in medical AI to enhance factual grounding through external clinical image-text retrieval. However, this reliance creates a significant attack surface. We propose MedThreatRAG, a novel multimodal poisoning framework that systematically probes vulnerabilities in medical RAG systems by injecting adversarial image-text pairs. A key innovation of our approach is the construction of a simulated semi-open attack environment, mimicking real-world medical systems that permit periodic knowledge base updates via user or pipeline contributions. Within this setting, we introduce and emphasize Cross-Modal Conflict Injection (CMCI), which embeds subtle semantic contradictions between medical images and their paired reports. These mismatches degrade retrieval and generation by disrupting cross-modal alignment while remaining sufficiently plausible to evade conventional filters. While basic textual and visual attacks are included for completeness, CMCI demonstrates the most severe degradation. Evaluations on IU-Xray and MIMIC-CXR QA tasks show that MedThreatRAG reduces answer F1 scores by up to 27.66% and lowers LLaVA-Med-1.5 F1 rates to as low as 51.36%. Our findings expose fundamental security gaps in clinical RAG systems and highlight the urgent need for threat-aware design and robust multimodal consistency checks. Finally, we conclude with a concise set of guidelines to inform the safe development of future multimodal medical RAG systems.
摘要：随着检索功能增强（RAG）的增强，大型视觉模型（LVLM）越来越多地在医学AI中使用，以通过外部临床图像文本检索来增强事实基础。但是，这种依赖会产生重要的攻击表面。我们提出了MedthreaTrag，这是一种新型的多模式中毒框架，该框架通过注入对抗性图像文本对来系统地探测医疗破布系统中的脆弱性。我们方法的关键创新是建造模拟的半开放攻击环境，模仿现实世界中的医疗系统，该系统允许通过用户或管道贡献进行定期知识库更新。在这种情况下，我们介绍并强调了跨模式冲突注射（CMCI），该冲突嵌入了医学图像及其配对报告之间的微妙的语义矛盾。这些不匹配通过破坏跨模式的对准而降低检索和产生，同时保持足够合理以逃避常规过滤器。尽管包括基本的文本和视觉攻击以进行完整性，但CMCI证明了最严重的降解。对IU-XRAR和MIMIC-CXR QA任务的评估表明，Medthrag将F1的得分降低了27.66％，Llava-Med-1.5 F1率降低到低至51.36％。我们的发现揭示了临床抹布系统中的基本安全差距，并强调了对威胁感知设计和强大的多模式一致性检查的迫切需求。最后，我们采用了一套简洁的指南，以告知未来多模式医学抹布系统的安全开发。

Title: Explain Before You Answer: A Survey on Compositional Visual Reasoning

Authors: Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, Hamid Rezatofighi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17298
Pdf URL: https://arxiv.org/pdf/2508.17298
Copy Paste: [[2508.17298]] Explain Before You Answer: A Survey on Compositional Visual Reasoning(https://arxiv.org/abs/2508.17298)
Keywords: generation
Abstract: Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
摘要：构图视觉推理已成为多模式AI的关键研究前沿，旨在赋予人类样的能力分解视觉场景，地面中间概念并执行多步逻辑推断。虽然早期的调查专注于整体视觉模型或一般的多模式推理，但仍缺少迅速扩展的构图视觉推理文献的专门合成。我们通过2023年至2025年的全面调查填补了这一空白，该调查系统地审查了顶级场所（CVPR，ICCV，Neurips，ICML，ACL等）的260多篇论文。我们首先将核心定义形式化，并描述组成方法为何在认知一致性，语义忠诚，鲁棒性，可解释性和数据效率方面具有优势。接下来，我们追踪五阶段的范式转变：从以语言为中心的迅速增强的管道，到工具增强的LLM和工具增强的VLM到最近铸造的Theark Chaungend推理链和统一的代理VLM，并强调其建筑设计，优势和限制。然后，我们对60多个基准和相应的指标进行编目，这些指标沿沿尺寸探测诸如接地准确性，经过思考的忠实链和高分辨率感知等方面的组成视觉推理。利用这些分析，我们提炼了关键的见解，确定开放的挑战（例如，基于LLM的推理的局限性，幻觉，对演绎推理的偏见，可扩展的监督，工具集成和基准测试限制），并概述未来的方向，包括世界模式集成，人类 - ai-ai-ai协作推理，人为合作推理和Richerer评估协议和Richerer评估协议。通过提供统一的分类法，历史路线图和批判性前景，该调查旨在作为基础参考，并激发下一代构图视觉推理研究。

Title: PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing

Authors: Peilin Xiong, Junwen Chen, Honghui Yuan, Keiji Yanai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17302
Pdf URL: https://arxiv.org/pdf/2508.17302
Copy Paste: [[2508.17302]] PosBridge: Multi-View Positional Embedding Transplant for Identity-Aware Image Editing(https://arxiv.org/abs/2508.17302)
Keywords: generative
Abstract: Localized subject-driven image editing aims to seamlessly integrate user-specified objects into target scenes. As generative models continue to scale, training becomes increasingly costly in terms of memory and computation, highlighting the need for training-free and scalable editing. To this end, we propose PosBridge, an efficient and flexible framework for inserting custom objects. A key component of our method is positional embedding transplant, which guides the diffusion model to faithfully replicate the structural characteristics of the reference object. We also introduce the Corner Centered Layout, which concatenates reference images and the background image as input to the FLUX.1-Fill model. During progressive denoising, positional embedding transplant is applied to guide the noise distribution in the target region toward that of the reference object. In this way, Corner Centered Layout effectively directs the FLUX.1-Fill model to synthesize identity-consistent content at the desired location. Extensive experiments demonstrate that PosBridge outperforms mainstream baselines in structural consistency, appearance fidelity, and computational efficiency, showcasing its practical value and potential for broad adoption.
摘要：本地化主题驱动的图像编辑旨在将用户指定的对象无缝整合到目标场景中。随着生成模型继续扩展，在记忆和计算方面，培训变得越来越成本，强调了对本HTTP URL进行无训练和可扩展编辑的需求，我们建议Posbridge一个有效且灵活的框架，用于插入自定义对象。我们方法的一个关键组成部分是位置嵌入移植，它指导扩散模型忠实地复制参考本http URL的结构特征，我们介绍了角落中心的布局，该布局将参考图像和背景图像作为输入，以将其作为输入到FOLUX.1-FILL.1填充模型。在进行性降级期间，应用位置嵌入移植以指导目标区域的噪声分布向参考对象的噪声分布。这样，居中的布局有效地指导了Flux.1填充模型在所需位置综合身份符合的内容。广泛的实验表明，Posbridge在结构一致性，外观保真度和计算效率方面的表现优于主流基线，展示了其实际价值和广泛采用的潜力。

Title: Defending Deepfake via Texture Feature Perturbation

Authors: Xiao Zhang, Changfang Chen, Tianyi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17315
Pdf URL: https://arxiv.org/pdf/2508.17315
Copy Paste: [[2508.17315]] Defending Deepfake via Texture Feature Perturbation(https://arxiv.org/abs/2508.17315)
Keywords: generation
Abstract: The rapid development of Deepfake technology poses severe challenges to social trust and information security. Most existing detection methods rely primarily on passive analysis, which struggles with high-quality Deepfake content; proactive defense, which inserts invisible signals before image editing takes place, has therefore recently emerged. In this paper, we introduce a proactive Deepfake detection approach based on facial texture features. Since human eyes are more sensitive to perturbations in smooth regions, we invisibly insert perturbations within texture regions that have low perceptual saliency, applying localized perturbations to key texture regions while minimizing unwanted noise in non-textured areas. Our texture-guided perturbation framework first extracts preliminary texture features via Local Binary Patterns (LBP), and then introduces a dual-model attention strategy to generate and optimize texture perturbations. Experiments on the CelebA-HQ and LFW datasets demonstrate the promising performance of our method in distorting Deepfake generation and producing obvious visual defects under multiple attack models, providing an efficient and scalable solution for proactive Deepfake detection.
摘要：DeepFake技术的快速发展对社会信任和信息安全构成了严重的挑战。尽管大多数现有的检测方法主要依赖于被动分析，但由于无法辨认的高质量深层含量，因此在图像编辑之前插入无形的信号，积极主动的防御已出现。在本文中，我们根据面部纹理特征介绍了一种主动的深层检测方法。由于人的眼睛对平滑区域的扰动更为敏感，因此我们在具有低感知显着性的纹理区域内插入扰动，将局部扰动应用于关键纹理区域，同时最大程度地减少无纹理区域中不需要的噪声。我们的纹理引导的扰动框架首先通过本地二进制模式（LBP）提取初步纹理特征，然后引入双模型注意策略以生成和优化纹理扰动。 Celeba-HQ和LFW数据集的实验证明了我们方法在扭曲深层生成并在多个攻击模型下产生明显的视觉缺陷的有希望的表现，从而为主动的深层检测提供了有效且可扩展的解决方案。
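For readers unfamiliar with Local Binary Patterns, the texture descriptor named in the abstract above, here is a minimal NumPy sketch of the basic 8-neighbour variant. It is not the authors' implementation: the function name `lbp_8`, the neighbour ordering, and the `>=` comparison are assumptions, and the paper may use a different LBP variant.

```python
import numpy as np


def lbp_8(gray: np.ndarray) -> np.ndarray:
    """Basic 8-neighbour LBP codes for the interior pixels of a 2-D image.

    Each interior pixel gets an 8-bit code: bit k is set when the k-th
    neighbour is at least as bright as the centre pixel.
    """
    g = gray.astype(np.int32)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.int32) << bit
    return codes.astype(np.uint8)
```

Flat regions produce the constant code 255 under this convention, while textured regions produce varied codes, which is what makes the code map usable as a rough texture mask.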

Title: SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation

Authors: Zhenyu Jin, Wenjie Li, Zhanyu Ma, Heng Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17316
Pdf URL: https://arxiv.org/pdf/2508.17316
Copy Paste: [[2508.17316]] SpecGen: Neural Spectral BRDF Generation via Spectral-Spatial Tri-plane Aggregation(https://arxiv.org/abs/2508.17316)
Keywords: generation
Abstract: Synthesizing spectral images across different wavelengths is essential for photorealistic rendering. Unlike conventional spectral uplifting methods that convert RGB images into spectral ones, we introduce SpecGen, a novel method that generates spectral bidirectional reflectance distribution functions (BRDFs) from a single RGB image of a sphere. This enables spectral image rendering under arbitrary illuminations and shapes covered by the corresponding material. A key challenge in spectral BRDF generation is the scarcity of measured spectral BRDF data. To address this, we propose the Spectral-Spatial Tri-plane Aggregation (SSTA) network, which models reflectance responses across wavelengths and incident-outgoing directions, allowing the training strategy to leverage abundant RGB BRDF data to enhance spectral BRDF generation. Experiments show that our method accurately reconstructs spectral BRDFs from limited spectral data and surpasses state-of-the-art methods in hyperspectral image reconstruction, achieving an improvement of 8 dB in PSNR. Codes and data will be released upon acceptance.
摘要：跨不同波长的综合光谱图像对于感性渲染至关重要。与传统的光谱升高方法将RGB图像转换为光谱图像，我们引入了Specgen，这是一种新的方法，该方法从球体的单个RGB图像中生成光谱双向反射分布函数（BRDF）。这可以使光谱图像在相应材料覆盖的任意照明和形状下呈现。光谱BRDF生成中的一个关键挑战是测量光谱BRDF数据的稀缺性。为了解决这个问题，我们提出了频谱空间三平面聚集（SSTA）网络，该网络对波长和出现爆发方向的反射响应进行建模，从而使训练策略利用丰富的RGB BRDF数据来增强光谱BRDF的生成。实验表明，我们的方法准确地从有限的光谱数据中重建光谱BRDF，并超过高光谱图像重建中最先进的方法，从而在PSNR中改善了8 dB。接受代码和数据将在接受后发布。

Title: ShortListing Model: A Streamlined SimplexDiffusion for Discrete Variable Generation

Authors: Yuxuan Song, Zhe Zhang, Yu Pei, Jingjing Gong, Qiying Yu, Zheng Zhang, Mingxuan Wang, Hao Zhou, Jingjing Liu, Wei-Ying Ma
Subjects: cs.LG, q-bio.GN
Abstract URL: https://arxiv.org/abs/2508.17345
Pdf URL: https://arxiv.org/pdf/2508.17345
Copy Paste: [[2508.17345]] ShortListing Model: A Streamlined SimplexDiffusion for Discrete Variable Generation(https://arxiv.org/abs/2508.17345)
Keywords: generation, generative
Abstract: Generative modeling of discrete variables is challenging yet crucial for applications in natural language processing and biological sequence design. We introduce the Shortlisting Model (SLM), a novel simplex-based diffusion model inspired by progressive candidate pruning. SLM operates on simplex centroids, reducing generation complexity and enhancing scalability. Additionally, SLM incorporates a flexible implementation of classifier-free guidance, enhancing unconditional generation performance. Extensive experiments on DNA promoter and enhancer design, protein design, character-level and large-vocabulary language modeling demonstrate the competitive performance and strong potential of SLM. Our code can be found at this https URL
摘要：离散变量的生成建模对于在自然语言处理和生物序列设计中的应用中具有挑战性但至关重要。我们介绍了入围模型（SLM），这是一种新型基于单纯的扩散模型，灵感来自渐进候选候选模型。 SLM在单纯质心上运行，降低了产生的复杂性和增强可伸缩性。此外，SLM还结合了无分类器指导的灵活实现，从而提高了无条件的生成性能。关于DNA启动子和增强剂设计，蛋白质设计，角色级别和大型摄影性语言建模的广泛实验证明了SLM的竞争性能和强大潜力。我们的代码可以在此HTTPS URL上找到

Title: No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection

Authors: Lianrui Mu, Zou Xingze, Jianhong Bai, Jiaqi Hu, Wenjie Zheng, Jiangnan Ye, Jiedong Zhuang, Mudassar Ali, Jing Wang, Haoji Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17346
Pdf URL: https://arxiv.org/pdf/2508.17346
Copy Paste: [[2508.17346]] No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection(https://arxiv.org/abs/2508.17346)
Keywords: generative
Abstract: The rapid growth of high-resolution, meticulously crafted AI-generated images poses a significant challenge to existing detection methods, which are often trained and evaluated on low-resolution, automatically generated datasets that do not align with the complexities of high-resolution scenarios. A common practice is to resize or center-crop high-resolution images to fit standard network inputs. However, without full coverage of all pixels, such strategies risk either obscuring subtle, high-frequency artifacts or discarding information from uncovered regions, leading to input information loss. In this paper, we introduce the High-Resolution Detail-Aggregation Network (HiDA-Net), a novel framework that ensures no pixel is left behind. We use the Feature Aggregation Module (FAM), which fuses features from multiple full-resolution local tiles with a down-sampled global view of the image. These local features are aggregated and fused with global representations for final prediction, ensuring that native-resolution details are preserved and utilized for detection. To enhance robustness against challenges such as localized AI manipulations and compression, we introduce a Token-wise Forgery Localization (TFL) module for fine-grained spatial sensitivity and a JPEG Quality Factor Estimation (QFE) module to explicitly disentangle generative artifacts from compression noise. Furthermore, to facilitate future research, we introduce HiRes-50K, a new challenging benchmark consisting of 50,568 images with up to 64 megapixels. Extensive experiments show that HiDA-Net achieves state-of-the-art performance, increasing accuracy by over 13% on the challenging Chameleon dataset and by 10% on our HiRes-50K.
摘要：高分辨率，精心制作的AI生成的图像的快速增长对现有检测方法构成了重大挑战，这些方法通常在低分辨率上受过训练和评估，自动生成的数据集与高分辨率方案的复杂性不符。一种常见的做法是调整或中心折叠高分辨率图像以适合标准网络输入。但是，如果没有全部覆盖所有像素，这种策略可能会掩盖微妙，高频伪像或从未发现地区丢弃信息，从而导致输入信息损失。在本文中，我们介绍了高分辨率的细节 - 聚集网络（HIDA-NET），这是一个新颖的框架，可确保不会留下像素。我们使用功能聚合模块（FAM），该模块（FAM）融合了来自多个全分辨率本地瓷砖的功能，并具有向下采样的图像全局视图。这些局部特征汇总并与最终预测的全局表示形式融合在一起，以确保保留并使用本地分辨率的细节进行检测。为了增强针对诸如局部AI操作和压缩等挑战的鲁棒性，我们引入了令牌伪造的定位（TFL）模块，以实现细粒度的空间灵敏度和JPEG质量因子估计（QFE）模块，以明确地从压缩噪声中解散产生的生成物品。此外，为了促进未来的研究，我们介绍了Hires-50k，这是一种新的具有挑战性的基准测试，由50,568张图像组成，最高64兆像素。广泛的实验表明，HIDA-NET可以实现最先进的实验，在具有挑战性的变色龙数据集上，精度提高了13％，而我们的雇员50K的精度则增加了10％。
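The "no pixel left behind" idea above depends on splitting an image into full-resolution tiles that together cover every pixel. The sketch below shows one way to compute such tile boxes; it is an assumption-laden illustration, not HiDA-Net's actual tiling scheme. The function name `tile_boxes` and the choice to overlap tiles (rather than pad the border) are hypothetical.

```python
from typing import List, Tuple


def tile_boxes(height: int, width: int, tile: int) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes of size tile x tile that
    together cover every pixel of a height x width image.

    Tiles overlap slightly when the image size is not a multiple of `tile`;
    images smaller than `tile` yield a single clipped box.
    """
    def starts(size: int) -> List[int]:
        if size <= tile:
            return [0]
        n = -(-size // tile)  # ceil(size / tile) tiles along this axis
        # Spread the starts evenly so the last tile ends exactly at the border.
        return [(i * (size - tile)) // (n - 1) for i in range(n)]

    return [(y, x, min(y + tile, height), min(x + tile, width))
            for y in starts(height) for x in starts(width)]
```

Each box could then be cropped at native resolution and encoded separately, with a down-sampled copy of the whole image supplying the global view that the abstract mentions.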

Title: DiCache: Let Diffusion Model Determine Its Own Cache

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tong Wu, Dahua Lin, Jiaqi Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17356
Pdf URL: https://arxiv.org/pdf/2508.17356
Copy Paste: [[2508.17356]] DiCache: Let Diffusion Model Determine Its Own Cache(https://arxiv.org/abs/2508.17356)
Keywords: generation
Abstract: Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine the timing of caching and utilizing handcrafted rules for leveraging multi-step caches. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail on outlier samples. In this paper, a strong correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of final model outputs. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain a stable prior for the caching error in real time, enabling the model to autonomously determine caching schedules. (2) Dynamic Cache Trajectory Alignment combines multi-step caches based on shallow-layer probe feature trajectory to better approximate the current feature, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved visual fidelity over state-of-the-art methods on various leading diffusion models including WAN 2.1, HunyuanVideo for video generation, and Flux for image generation.
摘要：近年来见证了扩散模型的加速技术的快速发展，尤其是基于缓存的加速方法。这些研究试图回答两个基本问题：“何时要缓存”和“如何使用缓存”，通常依靠预定义的经验定律或数据集级别的先验来确定缓存和利用手工制作的规则来利用多步火的时间。但是，鉴于扩散过程的高度动态性质，它们通常表现出有限的概括性，并且在异常样品中失败。在本文中，在扩散模型中的浅层特征差异与最终模型输出的变化模式之间揭示了强相关性。此外，我们已经观察到来自不同模型层的特征形成了相似的轨迹。基于这些观察结果，我们提出了Dicache，这是一种在运行时加速扩散模型的新型自适应缓存策略，回答了何时以及如何在统一框架内缓存。具体而言，DICACHE由两个主要组件组成：（1）在线探针分析方案利用浅层在线探测器实时获得缓存错误的稳定先验，使该模型能够自主确定缓存时间表。（2）动态缓存轨迹对齐结合了基于浅层探针特征轨迹的多步中心，以更好地近似当前功能，从而促进更高的视觉质量。广泛的实验验证了Dicache在提高效率上的能力，并改善了在各种领先扩散模型（包括WAN 2.1，用于视频生成的Hunyuanvideo）和图像生成的磁通量的各种领先扩散模型上的视觉保真度。

Title: Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation

Authors: Guoqing Zhang, Xingtong Ge, Lu Shi, Xin Zhang, Muqing Xue, Wanru Xu, Yigang Cen
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17364
Pdf URL: https://arxiv.org/pdf/2508.17364
Copy Paste: [[2508.17364]] Condition Weaving Meets Expert Modulation: Towards Universal and Controllable Image Generation(https://arxiv.org/abs/2508.17364)
Keywords: generation
Abstract: The image-to-image generation task aims to produce controllable images by leveraging conditional inputs and prompt instructions. However, existing methods often train separate control branches for each type of condition, leading to redundant model structures and inefficient use of computational resources. To address this, we propose a Unified image-to-image Generation (UniGen) framework that supports diverse conditional inputs while enhancing generation efficiency and expressiveness. Specifically, to tackle the widely existing parameter redundancy and computational inefficiency in controllable conditional generation architectures, we propose the Condition Modulated Expert (CoMoE) module. This module aggregates semantically similar patch features and assigns them to dedicated expert modules for visual representation and conditional modeling. By enabling independent modeling of foreground features under different conditions, CoMoE effectively mitigates feature entanglement and redundant computation in multi-condition scenarios. Furthermore, to bridge the information gap between the backbone and control branches, we propose WeaveNet, a dynamic, snake-like connection mechanism that enables effective interaction between global text-level control from the backbone and fine-grained control from conditional branches. Extensive experiments on the Subjects-200K and MultiGen-20M datasets across various conditional image generation tasks demonstrate that our method consistently achieves state-of-the-art performance, validating its advantages in both versatility and effectiveness. The code has been uploaded to this https URL.
摘要：图像到图像生成任务旨在通过利用条件输入和提示说明来产生可控的图像。但是，现有方法经常为每种条件训练单独的控制分支，从而导致冗余模型结构和计算资源的效率低下。为了解决这个问题，我们提出了一个统一的图像到图像生成（Unigen）框架，该框架支持各种条件输入，同时提高产生效率和表现力。具体而言，为了解决可控条件生成体系结构中广泛现有的参数冗余和计算效率低下，我们提出了条件调制专家（COMOE）模块。该模块汇总了语义上相似的补丁功能，并将其分配给专门的专家模块，以进行视觉表示和条件建模。通过在不同条件下对前景特征进行独立建模，ComoE有效地减轻了多条件方案中的特征纠缠和冗余计算。此外，为了弥合主链和控制分支之间的信息差距，我们提出了Weavenet，这是一种动态的，类似蛇的连接机制，可以从骨干线和从条件分支的细粒度控制中实现全局文本级控制之间的有效相互作用。对各种有条件图像生成任务的主题-200K和多计划-20M数据集进行了广泛的实验表明，我们的方法始终达到最新的性能，从而验证了其在多功能性和有效性方面的优势。该代码已上传到此HTTPS URL。

Title: ShaLa: Multimodal Shared Latent Space Modelling

Authors: Jiali Cui, Yan-Ying Chen, Yanxia Zhang, Matthew Klenk
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.17376
Pdf URL: https://arxiv.org/pdf/2508.17376
Copy Paste: [[2508.17376]] ShaLa: Multimodal Shared Latent Space Modelling(https://arxiv.org/abs/2508.17376)
Keywords: generative
Abstract: This paper presents a novel generative framework for learning shared latent representations across multimodal data. Many advanced multimodal methods focus on capturing all combinations of modality-specific details across inputs, which can inadvertently obscure the high-level semantic concepts that are shared across modalities. Notably, Multimodal VAEs with low-dimensional latent variables are designed to capture shared representations, enabling various tasks such as joint multimodal synthesis and cross-modal inference. However, multimodal VAEs often struggle to design expressive joint variational posteriors and suffer from low-quality synthesis. In this work, ShaLa addresses these challenges by integrating a novel architectural inference model and a second-stage expressive diffusion prior, which not only facilitates effective inference of shared latent representation but also significantly improves the quality of downstream multimodal synthesis. We validate ShaLa extensively across multiple benchmarks, demonstrating superior coherence and synthesis quality compared to state-of-the-art multimodal VAEs. Furthermore, ShaLa scales to many more modalities while prior multimodal VAEs have fallen short in capturing the increasing complexity of the shared latent space.
摘要：本文介绍了一个新颖的生成框架，用于学习多模式数据的共享潜在表示。许多先进的多模式方法着重于捕获跨输入的所有特定于模态细节的组合，这可能会无意中掩盖跨模态共享的高级语义概念。值得注意的是，具有低维度变量的多模式VAE旨在捕获共享表示形式，从而实现了各种任务，例如联合多模式合成和跨模式推断。然而，多模式的VAE通常很难设计表现性的关节变异后代并遭受低质量合成。在这项工作中，Shala通过整合一种新型的建筑推论模型和第二阶段表达性扩散的先验来解决这些挑战，这不仅促进了共享潜在表示的有效推理，而且还显着提高了下游多型合成的质量。我们在多个基准测试中广泛验证了沙拉，与最先进的多模态VAE相比，表明了优势相干性和合成质量。此外，Shala量表达到了更多的模式，而先前的多模式VAE在捕获共享潜在空间的复杂性日益增加方面却缺乏。

Title: Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches

Authors: Aoqi Li, Yanghui Song, Jichao Dao, Chengfu Yang
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2508.17397
Pdf URL: https://arxiv.org/pdf/2508.17397
Copy Paste: [[2508.17397]] Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches(https://arxiv.org/abs/2508.17397)
Keywords: quality assessment
Abstract: This paper addresses the challenging problem of image enhancement in complex underwater scenes by proposing a solution based on deep learning. The proposed method integrates two deep convolutional neural network models, VGG19 and ResNet50, leveraging their powerful feature extraction capabilities to perform multi-scale and multi-level deep feature analysis of underwater images. By constructing a unified model, the complementary advantages of the two models are effectively combined, achieving more comprehensive and accurate image enhancement. To objectively evaluate the enhancement effect, this paper introduces image quality assessment metrics such as PSNR, UCIQE, and UIQM to quantitatively compare images before and after enhancement, and analyzes in depth the performance of the different models under different conditions. To improve the practicality and stability of the underwater visual enhancement system, this paper also provides practical suggestions on model optimization, multi-model fusion, and hardware selection, aiming to provide strong technical support for visual enhancement tasks in complex underwater environments.
摘要：本文通过提出基于深度学习的解决方案来解决复杂水下场景中图像增强的挑战性问题。提出的方法巧妙地整合了两个深卷积神经网络模型VGG19和Resnet50，利用其强大的功能提取能力来执行水下图像的多尺度和多级深度特征分析。通过构建一个统一的模型，这两个模型的互补优势有效地整合了，实现了更全面和准确的图像增强，此HTTP URL客观地评估增强效果，本文介绍了图像质量评估指标，例如PSNR，UCIQE和UIQM，并在数量上进行了更深入的图像，并在Quantipery上进行了更深入的分析。 URL，为了提高水下视觉增强系统的实用性和稳定性，本文还提供了诸如模型优化，多模型融合和硬件选择等方面的实践建议，旨在为复杂的水下环境中的视觉增强任务提供强大的技术支持。
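Of the metrics named in the abstract above, PSNR has a short closed form: PSNR = 10 * log10(MAX^2 / MSE). The sketch below is a generic NumPy version for 8-bit images, not code from the paper. The function name `psnr` and the default `max_value=255.0` are assumptions. UCIQE and UIQM are no-reference underwater metrics with longer definitions and are not sketched here.

```python
import numpy as np


def psnr(reference: np.ndarray, test: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = reference.astype(np.float64)
    tst = test.astype(np.float64)
    mse = float(np.mean((ref - tst) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * float(np.log10(max_value ** 2 / mse))
```

PSNR needs a reference image, so it suits paired comparisons; where no clean ground truth exists, the no-reference metrics carry the evaluation.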

Title: MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling

Authors: Haoyu Wang, Hao Tang, Donglin Di, Zhilu Zhang, Wangmeng Zuo, Feng Gao, Siwei Ma, Shiliang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17404
Pdf URL: https://arxiv.org/pdf/2508.17404
Copy Paste: [[2508.17404]] MoCo: Motion-Consistent Human Video Generation via Structure-Appearance Decoupling(https://arxiv.org/abs/2508.17404)
Keywords: generation
Abstract: Generating human videos with consistent motion from text prompts remains a significant challenge, particularly for whole-body or long-range motion. Existing video generation models prioritize appearance fidelity, resulting in unrealistic or physically implausible human movements with poor structural coherence. Additionally, most existing human video datasets primarily focus on facial or upper-body motions, or consist of vertically oriented dance videos, limiting the scope of corresponding generation methods to simple movements. To overcome these challenges, we propose MoCo, which decouples the process of human video generation into two components: structure generation and appearance generation. Specifically, our method first employs an efficient 3D structure generator to produce a human motion sequence from a text prompt. The remaining video appearance is then synthesized under the guidance of the generated structural sequence. To improve fine-grained control over sparse human structures, we introduce Human-Aware Dynamic Control modules and integrate dense tracking constraints during training. Furthermore, recognizing the limitations of existing datasets, we construct a large-scale whole-body human video dataset featuring complex and diverse motions. Extensive experiments demonstrate that MoCo outperforms existing approaches in generating realistic and structurally coherent human videos.
摘要：从文本提示中生成一致动作的人类视频仍然是一个重大挑战，尤其是对于全身或远程运动。现有的视频生成模型优先考虑外观保真度，导致结构连贯性差的不现实或物理上难以置信的人类运动。此外，大多数现有的人类视频数据集主要集中于面部或上身运动，或者由垂直面向舞蹈视频组成，将相应生成方法的范围限制在简单运动中。为了克服这些挑战，我们提出了Moco，该挑战将人类视频生成的过程分解为两个组成部分：结构产生和外观产生。具体而言，我们的方法首先采用高效的3D结构发生器来从文本提示中产生人体运动序列。然后在生成的结构序列的指导下合成其余的视频外观。为了改善对稀疏人类结构的细粒度控制，我们引入了人类感知的动态控制模块，并在训练过程中整合了密集的跟踪约束。此外，我们识别现有数据集的局限性，我们构建了一个大规模的全身视频数据集，具有复杂而多样的动作。广泛的实验表明，Moco在生成现实和结构连贯的人类视频方面的表现优于现有方法。

Title: Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models

Authors: Xiaojie Yin, Qilong Wang, Qinghua Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17417
Pdf URL: https://arxiv.org/pdf/2508.17417
Copy Paste: [[2508.17417]] Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models(https://arxiv.org/abs/2508.17417)
Keywords: generation
Abstract: Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). Textually, to address the issue of incomplete semantic expression in textual prompts, our TGSSG first generates synonymous semantic set for each category via large language models, and constructs comprehensive textual prompts based on semantic ambiguity entropy and persistent homology analysis. Visually, to mitigate the irrelevant visual noise introduced by random cropping, our CADRS identifies discriminative regions with activation maps outputted by a pre-trained vision model, effectively filtering out noisy regions and generating compact visual prompts. Given the comprehensive set of textual prompts and compact set of visual prompts, we introduce two set-to-set matching strategies based on test-time adaptation (TTA) and optimal transport (OT) to achieve effective visual-textual alignment, and so improve zero-shot generalization of VLMs.
摘要：在Web级数据上进行训练的视觉语言模型（VLM）表现出有希望的零弹性概括，但由于训练和下游任务之间的域间隙而经常遭受语义不一致。现有方法主要集中于文本提示，并通过将裁剪的图像区域与文本描述对齐，并通过特定于类的描述和视觉文本进行适应。但是，他们仍然面临不完整的文本提示和嘈杂的视觉提示的问题。在本文中，我们提出了一种新颖的限制提示增强（CPE）方法，以从语义角度构造全面的文本提示和紧凑的视觉提示来改善视觉文本对齐。具体而言，我们的方法由两个关键组成部分组成：拓扑引导的同义语义生成（TGSSG）和类别不可吻合的区分区域选择（CADRS）。在文本上，为了解决文本提示中不完整的语义表达问题的问题，我们的TGSSG首先通过大型语言模型生成了为每个类别的同义语义集，并基于语义歧义性熵和持续的同源性分析构建了全面的文本提示。从视觉上讲，为了减轻随机裁剪引入的无关视觉噪声，我们的CADRS通过预先训练的视觉模型输出的激活图确定了判别区域，有效地过滤了嘈杂的区域并产生紧凑的视觉提示。考虑到一组全面的文本提示和紧凑的视觉提示集，我们基于测试时间适应（TTA）和最佳传输（OT）引入了两种设定的匹配策略，以实现有效的视觉文本对齐，从而改善了VLMS的零击球概括。
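The abstract above mentions optimal transport (OT) for matching a set of textual prompts to a set of visual prompts. A common way to compute such a matching is entropy-regularized OT solved with Sinkhorn iterations, sketched below. This swaps in a standard technique and is not necessarily what the paper uses: the function name `sinkhorn_plan`, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions.

```python
import numpy as np


def sinkhorn_plan(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropy-regularized OT plan between two uniformly weighted sets.

    cost[i, j] is the matching cost between text prompt i and visual prompt j,
    for example 1 - cosine similarity of their embeddings (an assumption).
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on the text prompts
    b = np.full(m, 1.0 / m)  # uniform mass on the visual prompts
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]
```

A set-to-set similarity score can then be read off as `(plan * (1 - cost)).sum()`, so well-matched prompt pairs contribute more than mismatched ones.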

Title: Modular MeanFlow: Towards Stable and Scalable One-Step Generative Modeling

Authors: Haochen You, Baojing Liu, Hongyang He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17426
Pdf URL: https://arxiv.org/pdf/2508.17426
Copy Paste: [[2508.17426]] Modular MeanFlow: Towards Stable and Scalable One-Step Generative Modeling(https://arxiv.org/abs/2508.17426)
Keywords: generative
Abstract: One-step generative modeling seeks to generate high-quality data samples in a single function evaluation, significantly improving efficiency over traditional diffusion or flow-based models. In this work, we introduce Modular MeanFlow (MMF), a flexible and theoretically grounded approach for learning time-averaged velocity fields. Our method derives a family of loss functions based on a differential identity linking instantaneous and average velocities, and incorporates a gradient modulation mechanism that enables stable training without sacrificing expressiveness. We further propose a curriculum-style warmup schedule to smoothly transition from coarse supervision to fully differentiable training. The MMF formulation unifies and generalizes existing consistency-based and flow-matching methods, while avoiding expensive higher-order derivatives. Empirical results across image synthesis and trajectory modeling tasks demonstrate that MMF achieves competitive sample quality, robust convergence, and strong generalization, particularly under low-data or out-of-distribution settings.
摘要：一步生成建模旨在在单个功能评估中生成高质量的数据样本，从而显着提高了传统扩散或基于流的模型的效率。在这项工作中，我们介绍了模块化平均流（MMF），这是一种灵活的，理论上接地的方法，用于学习时间平均速度场。我们的方法基于连接瞬时和平均速度的差分身份的损失功能系列，并结合了梯度调制机制，该机制可以实现稳定的训练而无需牺牲表现力。我们进一步提出了课程风格的热身时间表，以平稳地从粗糙的监督到完全可区分的培训。 MMF公式统一并概括了现有的基于一致性的和流匹配的方法，同时避免了昂贵的高级衍生品。图像综合和轨迹建模任务之间的经验结果表明，MMF可实现竞争性样本质量，稳健的收敛和强有力的概括，尤其是在低数据或分发设置下。
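The abstract above does not state the "differential identity linking instantaneous and average velocities". As a hedged reference point, this is the identity from the earlier MeanFlow formulation, which MMF presumably builds on. The symbols $u$ (average velocity), $v$ (instantaneous velocity), $z_t$, $r$, and $t$ follow that formulation and are assumptions here.

```latex
% Average velocity over [r, t] along the flow path z_\tau:
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Multiplying by (t - r) and differentiating with respect to t gives the identity:
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt}\, u(z_t, r, t)
```

A loss built on the second line regresses $u$ toward a target that needs only a first-order total derivative of $u$, which is consistent with the abstract's claim of avoiding expensive higher-order derivatives.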

Title: TinySR: Pruning Diffusion for Real-World Image Super-Resolution

Authors: Linwei Dong, Qingnan Fan, Yuhang Yu, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17434
Pdf URL: https://arxiv.org/pdf/2508.17434
Copy Paste: [[2508.17434]] TinySR: Pruning Diffusion for Real-World Image Super-Resolution(https://arxiv.org/abs/2508.17434)
Keywords: super-resolution, generative
Abstract: Real-world image super-resolution (Real-ISR) focuses on recovering high-quality images from low-resolution inputs that suffer from complex degradations like noise, blur, and compression. Recently, diffusion models (DMs) have shown great potential in this area by leveraging strong generative priors to restore fine details. However, their iterative denoising process incurs high computational overhead, posing challenges for real-time applications. Although one-step distillation methods, such as OSEDiff and TSD-SR, offer faster inference, they remain fundamentally constrained by their large, over-parameterized model architectures. In this work, we present TinySR, a compact yet effective diffusion model specifically designed for Real-ISR that achieves real-time performance while maintaining perceptual quality. We introduce a Dynamic Inter-block Activation and an Expansion-Corrosion Strategy to facilitate more effective decision-making in depth pruning. We achieve VAE compression through channel pruning, attention removal and lightweight SepConv. We eliminate time- and prompt-related modules and perform pre-caching techniques to further speed up the model. TinySR significantly reduces computational cost and model size, achieving up to 5.68x speedup and 83% parameter reduction compared to its teacher TSD-SR, while still providing high quality results.
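Summary: Real-world image super-resolution (Real-ISR) focuses on recovering high-quality images from low-resolution inputs degraded by complex factors such as noise, blur, and compression. Diffusion models (DMs) have recently shown great potential in this area by leveraging strong generative priors to restore fine details. However, their iterative denoising process incurs high computational overhead, which is a challenge for real-time applications. Although one-step distillation methods such as OSEDiff and TSD-SR offer faster inference, they remain fundamentally constrained by their large, over-parameterized architectures. In this work, we present TinySR, a compact yet effective diffusion model designed specifically for Real-ISR that achieves real-time performance while maintaining perceptual quality. We introduce Dynamic Inter-block Activation and an Expansion-Corrosion Strategy to support more effective decisions in depth pruning. We compress the VAE through channel pruning, attention removal, and lightweight SepConv. We eliminate time- and prompt-related modules and apply pre-caching techniques to speed up the model further. TinySR significantly reduces computational cost and model size, achieving up to a 5.68x speedup and an 83% parameter reduction compared with its teacher TSD-SR while still delivering high-quality results.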

Title: An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Authors: Zihan Liang, Jiahao Sun, Haoran Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17435
Pdf URL: https://arxiv.org/pdf/2508.17435
Copy Paste: [[2508.17435]] An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing(https://arxiv.org/abs/2508.17435)
Keywords: generation
Abstract: Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework designed to address these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent leverages the powerful planning capabilities of Large Language Models (LLMs) and the advanced visual understanding and evaluation prowess of Vision-Language Large Models (LVLMs) within a closed-loop system. Our framework comprises an LVLM-driven instruction parser and scene understanding module, a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation, an iterative image editing module, and a crucial LVLM-driven feedback and evaluation loop. To rigorously evaluate RefineEdit-Agent, we propose LongBench-T2I-Edit, a new benchmark featuring 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments demonstrate that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared to 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the efficacy of our agentic design in delivering superior edit fidelity and context preservation.
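Summary: Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction understanding, robust context preservation during modifications, and the lack of intelligent feedback mechanisms for iterative refinement. This paper introduces RefineEdit-Agent, a novel, training-free intelligent agent framework that addresses these limitations by enabling complex, iterative, and context-aware image editing. RefineEdit-Agent combines the planning capabilities of Large Language Models (LLMs) with the visual understanding and evaluation abilities of Vision-Language Large Models (LVLMs) in a closed-loop system. The framework comprises an LVLM-driven instruction parser and scene understanding module; a multi-level LLM-driven editing planner for goal decomposition, tool selection, and sequence generation; an iterative image editing module; and an LVLM-driven feedback and evaluation loop. To evaluate RefineEdit-Agent rigorously, we propose LongBench-T2I-Edit, a new benchmark of 500 initial images with complex, multi-turn editing instructions across nine visual dimensions. Extensive experiments show that RefineEdit-Agent significantly outperforms state-of-the-art baselines, achieving an average score of 3.67 on LongBench-T2I-Edit, compared with 2.29 for Direct Re-Prompting, 2.91 for InstructPix2Pix, 3.16 for GLIGEN-based Edit, and 3.39 for ControlNet-XL. Ablation studies, human evaluations, and analyses of iterative refinement, backbone choices, tool usage, and robustness to instruction complexity further validate the agentic design's edit fidelity and context preservation.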

Title: TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Authors: Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, Wenhao Huang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2508.17445
Pdf URL: https://arxiv.org/pdf/2508.17445
Copy Paste: [[2508.17445]] TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling(https://arxiv.org/abs/2508.17445)
Keywords: generation
Abstract: Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) analysis on the effectiveness of probability and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks and the efficiency saving of GPU hours from 22% up to 43% of the sampling design for the trained models, meanwhile showing up to 40% reduction at trajectory-level and 35% at token-level sampling compute for the existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The home page is located at this https URL.
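Summary: Recent advances in aligning large language models via reinforcement learning have brought remarkable gains on complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, which involves a self-guided rollout algorithm that treats sequence generation as a tree-structured search process. Built from a dynamic tree sampling policy and fixed-length segment decoding, TreePO uses local uncertainty to decide when additional branches are warranted. By amortizing computation across common prefixes and pruning low-value paths early, TreePO reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that eases the KV cache burden through contiguous segments and spawns new branches together with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate TreePO's performance gains on a set of reasoning benchmarks and GPU-hour savings of 22% to 43% from the sampling design for the trained models, along with reductions in sampling compute of up to 40% at the trajectory level and 35% at the token level for existing models. While offering a free lunch in inference efficiency, TreePO points to a practical path toward scaling RL-based post-training with fewer samples and less compute. The home page is at this https URL.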
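
The idea of "amortizing computation across common prefixes" can be illustrated with a small counting sketch. This is not the TreePO algorithm: it ignores KV caching details, segment decoding, and pruning, and the token-id sequences are hypothetical. It only compares how many tokens must be decoded when rollouts are generated independently versus when shared prefixes are stored once in a trie.

```python
from typing import Dict, List, Sequence


def independent_cost(rollouts: List[Sequence[int]]) -> int:
    """Tokens decoded when every rollout is generated from scratch."""
    return sum(len(r) for r in rollouts)


def shared_prefix_cost(rollouts: List[Sequence[int]]) -> int:
    """Tokens decoded when rollouts that share a prefix reuse it.

    Each trie node corresponds to one token that has to be decoded once.
    """
    root: Dict[int, dict] = {}
    nodes = 0
    for rollout in rollouts:
        node = root
        for token in rollout:
            if token not in node:
                node[token] = {}
                nodes += 1  # a new token that must actually be decoded
            node = node[token]
    return nodes


# Hypothetical token-id sequences: three rollouts branching from one prefix.
rollouts = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 6, 7],
    [1, 2, 8, 9, 10],
]
saving = 1 - shared_prefix_cost(rollouts) / independent_cost(rollouts)
print(f"fraction of decoding work saved: {saving:.2f}")
```

The saving grows with the length of the shared prefixes and the number of branches, which is why branching late in long reasoning traces tends to be cheaper than sampling each trace independently.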

Title: A Synthetic Dataset for Manometry Recognition in Robotic Applications

Authors: Pedro Antonio Rabelo Saraiva, Enzo Ferreira de Souza, Joao Manoel Herrera Pinheiro, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2508.17468
Pdf URL: https://arxiv.org/pdf/2508.17468
Copy Paste: [[2508.17468]] A Synthetic Dataset for Manometry Recognition in Robotic Applications(https://arxiv.org/abs/2508.17468)
Keywords: generation
Abstract: This work addresses the challenges of data scarcity and high acquisition costs for training robust object detection models in complex industrial environments, such as offshore oil platforms. The practical and economic barriers to collecting real-world data in these hazardous settings often hamper the development of autonomous inspection systems. To overcome this, in this work we propose and validate a hybrid data synthesis pipeline that combines procedural rendering with AI-driven video generation. Our methodology leverages BlenderProc to create photorealistic images with precise annotations and controlled domain randomization, and integrates NVIDIA's Cosmos-Predict2 world-foundation model to synthesize physically plausible video sequences with temporal diversity, capturing rare viewpoints and adverse conditions. We demonstrate that a YOLO-based detection network trained on a composite dataset, blending real images with our synthetic data, achieves superior performance compared to models trained exclusively on real-world data. Notably, a 1:1 mixture of real and synthetic data yielded the highest accuracy, surpassing the real-only baseline. These findings highlight the viability of a synthetic-first approach as an efficient, cost-effective, and safe alternative for developing reliable perception systems in safety-critical and resource-constrained industrial applications.
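Summary: This work addresses the challenges of data scarcity and high acquisition costs when training robust object detection models in complex industrial environments such as offshore oil platforms. The practical and economic barriers to collecting real-world data in these hazardous settings often hamper the development of autonomous inspection systems. To overcome this, we propose and validate a hybrid data synthesis pipeline that combines procedural rendering with AI-driven video generation. Our methodology uses BlenderProc to create photorealistic images with precise annotations and controlled domain randomization, and integrates NVIDIA's Cosmos-Predict2 world-foundation model to synthesize physically plausible video sequences with temporal diversity, capturing rare viewpoints and adverse conditions. We show that a YOLO-based detection network trained on a composite dataset that blends real images with our synthetic data outperforms models trained exclusively on real-world data. Notably, a 1:1 mixture of real and synthetic data yielded the highest accuracy, surpassing the real-only baseline. These findings highlight the viability of a synthetic-first approach as an efficient, cost-effective, and safe alternative for developing reliable perception systems in safety-critical and resource-constrained industrial applications.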

Title: T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation

Authors: Kaiyue Sun, Rongyao Fang, Chengqi Duan, Xian Liu, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17472
Pdf URL: https://arxiv.org/pdf/2508.17472
Copy Paste: [[2508.17472]] T2I-ReasonBench: Benchmarking Reasoning-Informed Text-to-Image Generation(https://arxiv.org/abs/2508.17472)
Keywords: generation
Abstract: We propose T2I-ReasonBench, a benchmark evaluating reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess the reasoning accuracy and image quality. We benchmark various T2I generation models, and provide comprehensive analysis on their performances.
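Summary: We propose T2I-ReasonBench, a benchmark that evaluates the reasoning capabilities of text-to-image (T2I) models. It consists of four dimensions: Idiom Interpretation, Textual Image Design, Entity-Reasoning, and Scientific-Reasoning. We propose a two-stage evaluation protocol to assess reasoning accuracy and image quality. We benchmark a range of T2I generation models and provide a comprehensive analysis of their performance.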

Title: OmniMRI: A Unified Vision--Language Foundation Model for Generalist MRI Interpretation

Authors: Xingxin He, Aurora Rofena, Ruimin Feng, Haozhe Liao, Zhaoye Zhou, Albert Jang, Fang Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17524
Pdf URL: https://arxiv.org/pdf/2508.17524
Copy Paste: [[2508.17524]] OmniMRI: A Unified Vision--Language Foundation Model for Generalist MRI Interpretation(https://arxiv.org/abs/2508.17524)
Keywords: generation
Abstract: Magnetic Resonance Imaging (MRI) is indispensable in clinical practice but remains constrained by fragmented, multi-stage workflows encompassing acquisition, reconstruction, segmentation, detection, diagnosis, and reporting. While deep learning has achieved progress in individual tasks, existing approaches are often anatomy- or application-specific and lack generalizability across diverse clinical settings. Moreover, current pipelines rarely integrate imaging data with complementary language information that radiologists rely on in routine practice. Here, we introduce OmniMRI, a unified vision-language foundation model designed to generalize across the entire MRI workflow. OmniMRI is trained on a large-scale, heterogeneous corpus curated from 60 public datasets, over 220,000 MRI volumes and 19 million MRI slices, incorporating image-only data, paired vision-text data, and instruction-response data. Its multi-stage training paradigm, comprising self-supervised vision pretraining, vision-language alignment, multimodal pretraining, and multi-task instruction tuning, progressively equips the model with transferable visual representations, cross-modal reasoning, and robust instruction-following capabilities. Qualitative results demonstrate OmniMRI's ability to perform diverse tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and radiology report generation. These findings highlight OmniMRI's potential to consolidate fragmented pipelines into a scalable, generalist framework, paving the way toward foundation models that unify imaging and clinical language for comprehensive, end-to-end MRI interpretation.
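Summary: Magnetic Resonance Imaging (MRI) is indispensable in clinical practice but remains constrained by fragmented, multi-stage workflows spanning acquisition, reconstruction, segmentation, detection, diagnosis, and reporting. Although deep learning has made progress on individual tasks, existing approaches are often anatomy- or application-specific and generalize poorly across diverse clinical settings. Moreover, current pipelines rarely integrate imaging data with the complementary language information that radiologists rely on in routine practice. Here, we introduce OmniMRI, a unified vision-language foundation model designed to generalize across the entire MRI workflow. OmniMRI is trained on a large-scale, heterogeneous corpus curated from 60 public datasets, comprising over 220,000 MRI volumes and 19 million MRI slices and incorporating image-only data, paired vision-text data, and instruction-response data. Its multi-stage training paradigm, consisting of self-supervised vision pretraining, vision-language alignment, multimodal pretraining, and multi-task instruction tuning, progressively equips the model with transferable visual representations, cross-modal reasoning, and robust instruction-following capabilities. Qualitative results show that OmniMRI can perform diverse tasks within a single architecture, including MRI reconstruction, anatomical and pathological segmentation, abnormality detection, diagnostic suggestion, and radiology report generation. These findings highlight OmniMRI's potential to consolidate fragmented pipelines into a scalable, generalist framework, paving the way toward foundation models that unify imaging and clinical language for comprehensive, end-to-end MRI interpretation.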

Title: MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation

Authors: Liane Makatura, Benjamin Jones, Siyuan Bian, Wojciech Matusik
Subjects: cs.CV, cs.AI, cs.CE, cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2508.17568
Pdf URL: https://arxiv.org/pdf/2508.17568
Copy Paste: [[2508.17568]] MetaGen: A DSL, Database, and Benchmark for VLM-Assisted Metamaterial Generation(https://arxiv.org/abs/2508.17568)
Keywords: generation
Abstract: Metamaterials are micro-architected structures whose geometry imparts highly tunable, often counter-intuitive, bulk properties. Yet their design is difficult because of geometric complexity and a non-trivial mapping from architecture to behaviour. We address these challenges with three complementary contributions. (i) MetaDSL: a compact, semantically rich domain-specific language that captures diverse metamaterial designs in a form that is both human-readable and machine-parsable. (ii) MetaDB: a curated repository of more than 150,000 parameterized MetaDSL programs together with their derivatives: three-dimensional geometry, multi-view renderings, and simulated elastic properties. (iii) MetaBench: benchmark suites that test three core capabilities of vision-language metamaterial assistants: structure reconstruction, property-driven inverse design, and performance prediction. We establish baselines by fine-tuning state-of-the-art vision-language models and deploy an omni-model within an interactive, CAD-like interface. Case studies show that our framework provides a strong first step toward integrated design and understanding of structure-representation-property relationships.
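Summary: Metamaterials are micro-architected structures whose geometry imparts highly tunable, often counter-intuitive, bulk properties. Their design is difficult, however, because of geometric complexity and a non-trivial mapping from architecture to behaviour. We address these challenges with three complementary contributions. (i) MetaDSL: a compact, semantically rich domain-specific language that captures diverse metamaterial designs in a form that is both human-readable and machine-parsable. (ii) MetaDB: a curated repository of more than 150,000 parameterized MetaDSL programs together with their derivatives: three-dimensional geometry, multi-view renderings, and simulated elastic properties. (iii) MetaBench: benchmark suites that test three core capabilities of vision-language metamaterial assistants: structure reconstruction, property-driven inverse design, and performance prediction. We establish baselines by fine-tuning state-of-the-art vision-language models and deploy an omni-model within an interactive, CAD-like interface. Case studies show that our framework is a strong first step toward integrated design and understanding of structure-representation-property relationships.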

Title: IDU: Incremental Dynamic Update of Existing 3D Virtual Environments with New Imagery Data

Authors: Meida Chen, Luis Leal, Yue Hu, Rong Liu, Butian Xiong, Andrew Feng, Jiuyi Xu, Yangming Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17579
Pdf URL: https://arxiv.org/pdf/2508.17579
Copy Paste: [[2508.17579]] IDU: Incremental Dynamic Update of Existing 3D Virtual Environments with New Imagery Data(https://arxiv.org/abs/2508.17579)
Keywords: generative
Abstract: For simulation and training purposes, military organizations have made substantial investments in developing high-resolution 3D virtual environments through extensive imaging and 3D scanning. However, the dynamic nature of battlefield conditions, where objects may appear or vanish over time, makes frequent full-scale updates both time-consuming and costly. In response, we introduce the Incremental Dynamic Update (IDU) pipeline, which efficiently updates existing 3D reconstructions, such as 3D Gaussian Splatting (3DGS), with only a small set of newly acquired images. Our approach starts with camera pose estimation to align new images with the existing 3D model, followed by change detection to pinpoint modifications in the scene. A 3D generative AI model is then used to create high-quality 3D assets of the new elements, which are seamlessly integrated into the existing 3D model. The IDU pipeline incorporates human guidance to ensure high accuracy in object identification and placement, with each update focusing on a single new object at a time. Experimental results confirm that our proposed IDU pipeline significantly reduces update time and labor, offering a cost-effective and targeted solution for maintaining up-to-date 3D models in rapidly evolving military scenarios.
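Summary: For simulation and training purposes, military organizations have invested substantially in building high-resolution 3D virtual environments through extensive imaging and 3D scanning. However, the dynamic nature of battlefield conditions, where objects may appear or vanish over time, makes frequent full-scale updates both time-consuming and costly. In response, we introduce the Incremental Dynamic Update (IDU) pipeline, which efficiently updates existing 3D reconstructions, such as 3D Gaussian Splatting (3DGS), using only a small set of newly acquired images. Our approach starts with camera pose estimation to align new images with the existing 3D model, followed by change detection to pinpoint modifications in the scene. A 3D generative AI model is then used to create high-quality 3D assets of the new elements, which are seamlessly integrated into the existing 3D model. The IDU pipeline incorporates human guidance to ensure high accuracy in object identification and placement, with each update focusing on a single new object. Experimental results confirm that the proposed IDU pipeline significantly reduces update time and labor, offering a cost-effective and targeted solution for keeping 3D models up to date in rapidly evolving military scenarios.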

Title: HERO: Hierarchical Extrapolation and Refresh for Efficient World Models

Authors: Quanjian Song, Xinyu Wang, Donghao Zhou, Jingyu Lin, Cunjian Chen, Yue Ma, Xiu Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17588
Pdf URL: https://arxiv.org/pdf/2508.17588
Copy Paste: [[2508.17588]] HERO: Hierarchical Extrapolation and Refresh for Efficient World Models(https://arxiv.org/abs/2508.17588)
Keywords: generation
Abstract: Generation-driven world models create immersive virtual environments but suffer slow inference due to the iterative nature of diffusion models. While recent advances have improved diffusion model efficiency, directly applying these techniques to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon, wherein shallow layers exhibit high temporal variability, while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) In shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation. With patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention. (ii) In deeper layers, a linear extrapolation scheme directly estimates intermediate features. This completely bypasses the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73x speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.
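Summary: Generation-driven world models create immersive virtual environments but suffer from slow inference because of the iterative nature of diffusion models. Although recent advances have improved diffusion model efficiency, applying these techniques directly to world models introduces limitations such as quality degradation. In this paper, we present HERO, a training-free hierarchical acceleration framework tailored for efficient world models. Owing to the multi-modal nature of world models, we identify a feature coupling phenomenon in which shallow layers exhibit high temporal variability while deeper layers yield more stable feature representations. Motivated by this, HERO adopts hierarchical strategies to accelerate inference: (i) in shallow layers, a patch-wise refresh mechanism efficiently selects tokens for recomputation; with patch-wise sampling and frequency-aware tracking, it avoids extra metric computation and remains compatible with FlashAttention; (ii) in deeper layers, a linear extrapolation scheme directly estimates intermediate features, completely bypassing the computations in attention modules and feed-forward networks. Our experiments show that HERO achieves a 1.73x speedup with minimal quality degradation, significantly outperforming existing diffusion acceleration methods.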
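
The deeper-layer strategy, "a linear extrapolation scheme directly estimates intermediate features," can be sketched in a few lines. This is a minimal illustration under stated assumptions, not HERO's implementation: it assumes features are cached at two earlier, equally spaced denoising steps, and the function name and feature values are hypothetical.

```python
from typing import List


def extrapolate(prev2: List[float], prev1: List[float]) -> List[float]:
    """Linearly extrapolate the next feature vector from the two most recent ones.

    Assumes equally spaced steps: f_next = f_prev1 + (f_prev1 - f_prev2).
    """
    if len(prev2) != len(prev1):
        raise ValueError("feature vectors must have the same length")
    return [2.0 * b - a for a, b in zip(prev2, prev1)]


# Hypothetical deep-layer features cached at two earlier denoising steps.
f_step0 = [0.10, 0.50, -0.20]
f_step1 = [0.12, 0.45, -0.20]

# Estimate the features at the next step without running attention or feed-forward blocks.
f_step2_estimate = extrapolate(f_step0, f_step1)
print(f_step2_estimate)
```

The estimate is only as good as the assumption that deep-layer features change slowly and roughly linearly between steps, which matches the stability the abstract attributes to deeper layers.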

Title: ChartMaster: Advancing Chart-to-Code Generation with Real-World Charts and Chart Similarity Reinforcement Learning

Authors: Wentao Tan, Qiong Cao, Chao Xue, Yibing Zhan, Changxing Ding, Xiaodong He
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17608
Pdf URL: https://arxiv.org/pdf/2508.17608
Copy Paste: [[2508.17608]] ChartMaster: Advancing Chart-to-Code Generation with Real-World Charts and Chart Similarity Reinforcement Learning(https://arxiv.org/abs/2508.17608)
Keywords: generation
Abstract: The chart-to-code generation task requires MLLMs to convert chart images into executable code. This task faces two major challenges: limited data diversity and insufficient maintenance of visual consistency between generated and original charts during training. Existing datasets mainly rely on seed data to prompt GPT models for code generation, resulting in homogeneous samples. To address this, we propose ReChartPrompt, which leverages real-world, human-designed charts from arXiv papers as prompts instead of synthetic seeds. Using the diverse styles and rich content of arXiv charts, we construct ReChartPrompt-240K, a large-scale and highly diverse dataset. Another challenge is that although SFT effectively improves code understanding, it often fails to ensure that generated charts are visually consistent with the originals. To address this, we propose ChartSimRL, a GRPO-based reinforcement learning algorithm guided by a novel chart similarity reward. This reward consists of attribute similarity, which measures the overlap of chart attributes such as layout and color between the generated and original charts, and visual similarity, which assesses similarity in texture and other overall visual features using convolutional neural networks. Unlike traditional text-based rewards such as accuracy or format rewards, our reward considers the multimodal nature of the chart-to-code task and effectively enhances the model's ability to accurately reproduce charts. By integrating ReChartPrompt and ChartSimRL, we develop the ChartMaster model, which achieves state-of-the-art results among 7B-parameter models and even rivals GPT-4o on various chart-to-code generation benchmarks. All resources are available at this https URL.
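Summary: The chart-to-code generation task requires MLLMs to convert chart images into executable code. The task faces two major challenges: limited data diversity and insufficient preservation of visual consistency between generated and original charts during training. Existing datasets rely mainly on seed data to prompt GPT models for code generation, which yields homogeneous samples. To address this, we propose ReChartPrompt, which uses real-world, human-designed charts from arXiv papers as prompts instead of synthetic seeds. Drawing on the diverse styles and rich content of arXiv charts, we construct ReChartPrompt-240K, a large-scale and highly diverse dataset. The second challenge is that although SFT effectively improves code understanding, it often fails to ensure that generated charts are visually consistent with the originals. To address this, we propose ChartSimRL, a GRPO-based reinforcement learning algorithm guided by a novel chart similarity reward. The reward consists of attribute similarity, which measures the overlap of chart attributes such as layout and color between the generated and original charts, and visual similarity, which assesses similarity in texture and other overall visual features using convolutional neural networks. Unlike traditional text-based rewards such as accuracy or format rewards, our reward accounts for the multimodal nature of the chart-to-code task and effectively improves the model's ability to reproduce charts accurately. By integrating ReChartPrompt and ChartSimRL, we develop the ChartMaster model, which achieves state-of-the-art results among 7B-parameter models and even rivals GPT-4o on various chart-to-code generation benchmarks. All resources are available at this https URL.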
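
A two-part reward of the kind the abstract describes (attribute overlap plus visual similarity) could look roughly like the sketch below. Everything concrete here is an assumption for illustration: Jaccard overlap stands in for attribute similarity, cosine similarity over pre-computed feature vectors stands in for the CNN-based visual term, and the equal weighting is arbitrary. The abstract does not specify how ChartSimRL computes or combines the two terms.

```python
import math
from typing import Sequence, Set


def attribute_similarity(generated: Set[str], original: Set[str]) -> float:
    """Jaccard overlap between two sets of chart attributes (assumed metric)."""
    if not generated and not original:
        return 1.0
    return len(generated & original) / len(generated | original)


def visual_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors, e.g. CNN embeddings."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def chart_reward(
    gen_attrs: Set[str],
    orig_attrs: Set[str],
    gen_feat: Sequence[float],
    orig_feat: Sequence[float],
    weight: float = 0.5,  # illustrative assumption, not a value from the paper
) -> float:
    """Weighted sum of attribute and visual similarity."""
    return (weight * attribute_similarity(gen_attrs, orig_attrs)
            + (1.0 - weight) * visual_similarity(gen_feat, orig_feat))


# Hypothetical attributes and embeddings for a generated and an original chart.
reward = chart_reward(
    {"type:bar", "color:blue", "legend:top"},
    {"type:bar", "color:red", "legend:top"},
    [1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0],
)
print(f"reward = {reward:.2f}")
```

In this example the charts share two of four distinct attributes and have identical embeddings, so the attribute term pulls the reward down even though the visual term is at its maximum.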

Title: JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on

Authors: Aowen Wang, Wei Li, Hao Luo, Mengxing Ao, Chenyu Zhu, Xinyang Li, Fan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17614
Pdf URL: https://arxiv.org/pdf/2508.17614
Copy Paste: [[2508.17614]] JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on(https://arxiv.org/abs/2508.17614)
Keywords: generation
Abstract: Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built upon a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach directly incorporates diverse control signals, such as the reference person image and the target garment image, into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment and improved garment-person integration. To address data scarcity and quality, we introduce a bidirectional generation strategy for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric "Try-Off" model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing iterative improvement in visual fidelity and diversity. Experiments demonstrate that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. Moreover, it shows strong generalization in real-world applications, surpassing commercial systems.
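Summary: Virtual try-on systems have long been held back by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built on a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach feeds diverse control signals, such as the reference person image and the target garment image, directly into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. The fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment and better garment-person integration. To address data scarcity and quality, we introduce a bidirectional generation strategy for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric "Try-Off" model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing iterative improvement in visual fidelity and diversity. Experiments show that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. It also generalizes strongly in real-world applications, surpassing commercial systems.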

Title: ControlEchoSynth: Boosting Ejection Fraction Estimation Models via Controlled Video Diffusion

Authors: Nima Kondori, Hanwen Liang, Hooman Vaseli, Bingyu Xie, Christina Luong, Purang Abolmaesumi, Teresa Tsang, Renjie Liao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17631
Pdf URL: https://arxiv.org/pdf/2508.17631
Copy Paste: [[2508.17631]] ControlEchoSynth: Boosting Ejection Fraction Estimation Models via Controlled Video Diffusion(https://arxiv.org/abs/2508.17631)
Keywords: generation, generative
Abstract: Synthetic data generation represents a significant advancement in boosting the performance of machine learning (ML) models, particularly in fields where data acquisition is challenging, such as echocardiography. The acquisition and labeling of echocardiograms (echo) for heart assessment, crucial in point-of-care ultrasound (POCUS) settings, often encounter limitations due to the restricted number of echo views available, typically captured by operators with varying levels of experience. This study proposes a novel approach for enhancing clinical diagnosis accuracy by synthetically generating echo views. These views are conditioned on existing, real views of the heart, focusing specifically on the estimation of ejection fraction (EF), a critical parameter traditionally measured from biplane apical views. By integrating a conditional generative model, we demonstrate an improvement in EF estimation accuracy, providing a comparative analysis with traditional methods. Preliminary results indicate that our synthetic echoes, when used to augment existing datasets, not only enhance EF estimation but also show potential in advancing the development of more robust, accurate, and clinically relevant ML models. This approach is anticipated to catalyze further research in synthetic data applications, paving the way for innovative solutions in medical imaging diagnostics.
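Summary: Synthetic data generation is a significant advance for boosting the performance of machine learning (ML) models, particularly in fields where data acquisition is challenging, such as echocardiography. The acquisition and labeling of echocardiograms (echo) for heart assessment, which is crucial in point-of-care ultrasound (POCUS) settings, is often limited by the restricted number of echo views available, typically captured by operators with varying levels of experience. This study proposes a novel approach to improving clinical diagnosis accuracy by synthetically generating echo views. These views are conditioned on existing, real views of the heart and focus specifically on estimating ejection fraction (EF), a critical parameter traditionally measured from biplane apical views. By integrating a conditional generative model, we demonstrate an improvement in EF estimation accuracy and provide a comparative analysis against traditional methods. Preliminary results indicate that our synthetic echoes, when used to augment existing datasets, not only improve EF estimation but also show potential for advancing more robust, accurate, and clinically relevant ML models. This approach is expected to stimulate further research on synthetic data applications, paving the way for innovative solutions in medical imaging diagnostics.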

Title: Longitudinal Progression Prediction of Alzheimer's Disease with Tabular Foundation Model

Authors: Yilang Ding, Jiawen Ren, Jiaying Lu, Gloria Hyunjung Kwak, Armin Iraji, Alex Fedorov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17649
Pdf URL: https://arxiv.org/pdf/2508.17649
Copy Paste: [[2508.17649]] Longitudinal Progression Prediction of Alzheimer's Disease with Tabular Foundation Model(https://arxiv.org/abs/2508.17649)
Keywords: generation, generative
Abstract: Alzheimer's disease is a progressive neurodegenerative disorder that remains challenging to predict due to its multifactorial etiology and the complexity of multimodal clinical data. Accurate forecasting of clinically relevant biomarkers, including diagnostic and quantitative measures, is essential for effective monitoring of disease progression. This work introduces L2C-TabPFN, a method that integrates a longitudinal-to-cross-sectional (L2C) transformation with a pre-trained Tabular Foundation Model (TabPFN) to predict Alzheimer's disease outcomes using the TADPOLE dataset. L2C-TabPFN converts sequential patient records into fixed-length feature vectors, enabling robust prediction of diagnosis, cognitive scores, and ventricular volume. Experimental results demonstrate that, while L2C-TabPFN achieves competitive performance on diagnostic and cognitive outcomes, it provides state-of-the-art results in ventricular volume prediction. This key imaging biomarker reflects neurodegeneration and progression in Alzheimer's disease. These findings highlight the potential of tabular foundational models for advancing longitudinal prediction of clinically relevant imaging markers in Alzheimer's disease.
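Summary: Alzheimer's disease is a progressive neurodegenerative disorder that remains challenging to predict because of its multifactorial etiology and the complexity of multimodal clinical data. Accurate forecasting of clinically relevant biomarkers, including diagnostic and quantitative measures, is essential for effective monitoring of disease progression. This work introduces L2C-TabPFN, a method that integrates a longitudinal-to-cross-sectional (L2C) transformation with a pre-trained Tabular Foundation Model (TabPFN) to predict Alzheimer's disease outcomes using the TADPOLE dataset. L2C-TabPFN converts sequential patient records into fixed-length feature vectors, enabling robust prediction of diagnosis, cognitive scores, and ventricular volume. Experimental results show that L2C-TabPFN achieves competitive performance on diagnostic and cognitive outcomes and state-of-the-art results in ventricular volume prediction. This key imaging biomarker reflects neurodegeneration and progression in Alzheimer's disease. These findings highlight the potential of tabular foundation models for advancing longitudinal prediction of clinically relevant imaging markers in Alzheimer's disease.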
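
The step in which L2C "converts sequential patient records into fixed-length feature vectors" can be illustrated with a simple sketch. The specific summary features below (last value, mean, range, slope, visit count) are assumptions chosen for illustration and are not the feature set used by the authors; the visit history is hypothetical.

```python
from typing import Dict, List, Tuple

Visit = Tuple[float, float]  # (months since baseline, measurement value)


def l2c_features(visits: List[Visit]) -> Dict[str, float]:
    """Collapse a variable-length visit history into a fixed-length feature dict."""
    if not visits:
        raise ValueError("at least one visit is required")
    ordered = sorted(visits)  # chronological order by time
    times = [t for t, _ in ordered]
    values = [v for _, v in ordered]
    span = times[-1] - times[0]
    return {
        "last": values[-1],
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        # Average rate of change between first and last visit (0 if only one time point).
        "slope": (values[-1] - values[0]) / span if span > 0 else 0.0,
        "n_visits": float(len(ordered)),
    }


# Hypothetical cognitive-score history for one patient.
history = [(0.0, 28.0), (6.0, 27.0), (12.0, 25.0)]
features = l2c_features(history)
print(features)
```

Because every patient maps to the same set of keys regardless of how many visits they have, the resulting rows can be stacked into an ordinary table and passed to a tabular model.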

Title: Hierarchical Vision-Language Learning for Medical Out-of-Distribution Detection

Authors: Runhe Lai, Xinhua Lu, Kanghao Chen, Qichao Chen, Wei-Shi Zheng, Ruixuan Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17667
Pdf URL: https://arxiv.org/pdf/2508.17667
Copy Paste: [[2508.17667]] Hierarchical Vision-Language Learning for Medical Out-of-Distribution Detection(https://arxiv.org/abs/2508.17667)
Keywords: generation
Abstract: In trustworthy medical diagnosis systems, integrating out-of-distribution (OOD) detection aims to identify unknown diseases in samples, thereby mitigating the risk of misdiagnosis. In this study, we propose a novel OOD detection framework based on vision-language models (VLMs), which integrates hierarchical visual information to cope with challenging unknown diseases that resemble known diseases. Specifically, a cross-scale visual fusion strategy is proposed to couple visual embeddings from multiple scales. This enriches the detailed representation of medical images and thus improves the discrimination of unknown diseases. Moreover, a cross-scale hard pseudo-OOD sample generation strategy is proposed to benefit OOD detection maximally. Experimental evaluations on three public medical datasets support that the proposed framework achieves superior OOD detection performance compared to existing methods. The source code is available at this https URL.
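Summary: In trustworthy medical diagnosis systems, integrating out-of-distribution (OOD) detection aims to identify unknown diseases in samples and thereby reduce the risk of misdiagnosis. In this study, we propose a novel OOD detection framework based on vision-language models (VLMs) that integrates hierarchical visual information to cope with challenging unknown diseases that resemble known ones. Specifically, a cross-scale visual fusion strategy is proposed to couple visual embeddings from multiple scales. This enriches the detailed representation of medical images and thus improves the discrimination of unknown diseases. In addition, a cross-scale hard pseudo-OOD sample generation strategy is proposed to maximize the benefit to OOD detection. Experimental evaluations on three public medical datasets indicate that the proposed framework achieves better OOD detection performance than existing methods. The source code is available at this https URL.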

Title: Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models

Authors: Victoria Yan, Honor Chotkowski, Fengran Wang, Alex Fedorov
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17675
Pdf URL: https://arxiv.org/pdf/2508.17675
Copy Paste: [[2508.17675]] Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models(https://arxiv.org/abs/2508.17675)
Keywords: generative
Abstract: Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generate synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the "Cookie Theft" picture description task. Two distinct prompting strategies were evaluated: naive prompts with basic instructions and advanced prompts enriched with contextual guidance. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. Superior models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.

Title: Characterizing the Behavior of Training Mamba-based State Space Models on GPUs

Authors: Trinayan Baruah, Kaustubh Shivdikar, Sara Prescott, David Kaeli
Subjects: cs.LG, cs.AR, cs.CL
Abstract URL: https://arxiv.org/abs/2508.17679
Pdf URL: https://arxiv.org/pdf/2508.17679
Copy Paste: [[2508.17679]] Characterizing the Behavior of Training Mamba-based State Space Models on GPUs(https://arxiv.org/abs/2508.17679)
Keywords: generation
Abstract: Mamba-based State Space Models (SSM) have emerged as a promising alternative to the ubiquitous transformers. Despite the expressive power of transformers, the quadratic complexity of computing attention is a major impediment to scaling performance as we increase the sequence length. SSMs provide an alternative path that addresses this problem, reducing the computational complexity requirements of self-attention with novel model architectures for different domains and fields such as video, text generation and graphs. Thus, it is important to characterize the behavior of these emerging workloads on GPUs and understand their requirements during GPU microarchitectural design. In this work we evaluate Mamba-based SSMs and characterize their behavior during training on GPUs. We construct a workload suite that offers representative models that span different model architectures. We then use this suite to analyze the architectural implications of running Mamba-based SSMs on GPUs. Our work sheds new light on potential optimizations to continue scaling the performance for such models.

Title: Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery

Authors: Robert Yang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17681
Pdf URL: https://arxiv.org/pdf/2508.17681
Copy Paste: [[2508.17681]] Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery(https://arxiv.org/abs/2508.17681)
Keywords: generation, generative
Abstract: Bold claims about AI's role in science, from "AGI will cure all diseases" to promises of radically accelerated discovery, raise a central epistemic question: do large language models (LLMs) truly generate new knowledge, or do they merely remix memorized fragments? We propose unlearning-as-ablation as a falsifiable test of constructive scientific discovery. The method systematically removes a target result and its entire forget-closure (lemmas, paraphrases, and multi-hop entailments) and then evaluates whether the model can re-derive the result from only permitted axioms and tools. Success provides evidence for genuine generative capability; failure exposes current limits. Unlike prevailing motivations for unlearning (privacy, copyright, or safety), our framing repositions it as an epistemic probe for AI-for-Science. We argue that such tests could serve as the next generation of benchmarks, much as ImageNet catalyzed progress in vision: distinguishing models that can merely recall from those that can constructively generate new scientific knowledge. We outline a minimal pilot in mathematics and algorithms, and discuss extensions to physics, chemistry, and biology. Whether models succeed or fail, unlearning-as-ablation provides a principled framework to map the true reach and limits of AI scientific discovery. This is a position paper: we advance a conceptual and methodological argument rather than new empirical results.

Title: Copyright Protection for 3D Molecular Structures with Watermarking

Authors: Runwen Hu, Peilin Chen, Keyan Ding, Shiqi Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17702
Pdf URL: https://arxiv.org/pdf/2508.17702
Copy Paste: [[2508.17702]] Copyright Protection for 3D Molecular Structures with Watermarking(https://arxiv.org/abs/2508.17702)
Keywords: generation, generative
Abstract: Artificial intelligence (AI) revolutionizes molecule generation in bioengineering and biological research, significantly accelerating discovery processes. However, this advancement introduces critical concerns regarding intellectual property protection. To address these challenges, we propose the first robust watermarking method designed for molecules, which utilizes atom-level features to preserve molecular integrity and invariant features to ensure robustness against affine transformations. Comprehensive experiments validate the effectiveness of our method using the datasets QM9 and GEOM-DRUG, and generative models GeoBFN and GeoLDM. We demonstrate the feasibility of embedding watermarks, maintaining basic properties higher than 90.00% while achieving watermark accuracy greater than 95.00%. Furthermore, downstream docking simulations reveal comparable performance between original and watermarked molecules, with binding affinities reaching -6.00 kcal/mol and root mean square deviations below 1.602 Å. These results confirm that our watermarking technique effectively safeguards molecular intellectual property without compromising scientific utility, enabling secure and responsible AI integration in molecular discovery and research applications.

Title: CATformer: Contrastive Adversarial Transformer for Image Super-Resolution

Authors: Qinyi Tian, Spence Cox, Laura E. Dalton
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17708
Pdf URL: https://arxiv.org/pdf/2508.17708
Copy Paste: [[2508.17708]] CATformer: Contrastive Adversarial Transformer for Image Super-Resolution(https://arxiv.org/abs/2508.17708)
Keywords: super-resolution
Abstract: Super-resolution remains a promising technique to enhance the quality of low-resolution images. This study introduces CATformer (Contrastive Adversarial Transformer), a novel neural network integrating diffusion-inspired feature refinement with adversarial and contrastive learning. CATformer employs a dual-branch architecture combining a primary diffusion-inspired transformer, which progressively refines latent representations, with an auxiliary transformer branch designed to enhance robustness to noise through learned latent contrasts. These complementary representations are fused and decoded using deep Residual-in-Residual Dense Blocks for enhanced reconstruction quality. Extensive experiments on benchmark datasets demonstrate that CATformer outperforms recent transformer-based and diffusion-inspired methods both in efficiency and visual image quality. This work bridges the performance gap among transformer-, diffusion-, and GAN-based methods, laying a foundation for practical applications of diffusion-inspired transformers in super-resolution.

Title: F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model

Authors: Hanbo Bi, Zhiqiang Yuan, Zexi Jia, Jiapei Zhang, Chongyang Li, Peixiang Luo, Ying Deng, Xiaoyue Duan, Jinchao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17714
Pdf URL: https://arxiv.org/pdf/2508.17714
Copy Paste: [[2508.17714]] F2RVLM: Boosting Fine-grained Fragment Retrieval for Multi-Modal Long-form Dialogue with Vision Language Model(https://arxiv.org/abs/2508.17714)
Keywords: generation, generative
Abstract: Traditional dialogue retrieval aims to select the most appropriate utterance or image from recent dialogue history. However, they often fail to meet users' actual needs for revisiting semantically coherent content scattered across long-form conversations. To fill this gap, we define the Fine-grained Fragment Retrieval (FFR) task, requiring models to locate query-relevant fragments, comprising both utterances and images, from multimodal long-form dialogues. As a foundation for FFR, we construct MLDR, the longest-turn multimodal dialogue retrieval dataset to date, averaging 25.45 turns per dialogue, with each naturally spanning three distinct topics. To evaluate generalization in real-world scenarios, we curate and annotate a WeChat-based test set comprising real-world multimodal dialogues with an average of 75.38 turns. Building on these resources, we explore existing generation-based Vision-Language Models (VLMs) on FFR and observe that they often retrieve incoherent utterance-image fragments. While optimized for generating responses from visual-textual inputs, these models lack explicit supervision to ensure semantic coherence within retrieved fragments. To this end, we propose F2RVLM, a generative retrieval model trained in a two-stage paradigm: (1) supervised fine-tuning to inject fragment-level retrieval knowledge, and (2) GRPO-based reinforcement learning with multi-objective rewards promoting semantic precision, relevance, and contextual coherence. To handle varying intra-fragment complexity, from locally dense to sparsely distributed, we introduce difficulty-aware curriculum sampling that ranks training instances by model-predicted difficulty and gradually exposes the model to harder samples. This boosts reasoning ability in long, multi-turn contexts. F2RVLM outperforms popular VLMs in both in-domain and real-domain settings, demonstrating superior retrieval performance.
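The difficulty-aware curriculum sampling mentioned in the F2RVLM abstract (rank training instances by predicted difficulty, then gradually expose harder ones) can be illustrated with a small sketch. This is not the authors' implementation: the function name `curriculum_batches`, the linear widening of the sample pool, and the toy difficulty scores are all assumptions made for illustration.

```python
import random

def curriculum_batches(samples, difficulty, num_stages, batch_size, seed=0):
    """Yield (stage, batch) pairs, widening the pool from easy to hard.

    `difficulty` is a hypothetical scoring function standing in for a
    model-predicted difficulty; lower means easier.
    """
    rng = random.Random(seed)
    ranked = sorted(samples, key=difficulty)  # easiest first
    for stage in range(1, num_stages + 1):
        # Assumed schedule: stage k draws from the easiest k/num_stages share.
        cutoff = max(1, len(ranked) * stage // num_stages)
        pool = ranked[:cutoff]
        yield stage, rng.sample(pool, min(batch_size, len(pool)))

# Toy data: (sample id, made-up difficulty score)
toy = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.3)]
schedule = list(
    curriculum_batches(toy, difficulty=lambda s: s[1], num_stages=2, batch_size=2)
)
```

In this toy run the first stage can only draw from the two easiest samples, while the second stage draws from the full set.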

Title: Instant Preference Alignment for Text-to-Image Diffusion Models

Authors: Yang Li, Songlin Yang, Xiaoxuan Han, Wei Wang, Jing Dong, Yueming Lyu, Ziyu Xue
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17718
Pdf URL: https://arxiv.org/pdf/2508.17718
Copy Paste: [[2508.17718]] Instant Preference Alignment for Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.17718)
Keywords: generation
Abstract: Text-to-image (T2I) generation has greatly enhanced creative expression, yet achieving preference-aligned generation in a real-time and training-free manner remains challenging. Previous methods often rely on static, pre-collected preferences or fine-tuning, limiting adaptability to evolving and nuanced user intents. In this paper, we highlight the need for instant preference-aligned T2I generation and propose a training-free framework grounded in multimodal large language model (MLLM) priors. Our framework decouples the task into two components: preference understanding and preference-guided generation. For preference understanding, we leverage MLLMs to automatically extract global preference signals from a reference image and enrich a given prompt using structured instruction design. Our approach supports broader and more fine-grained coverage of user preferences than existing methods. For preference-guided generation, we integrate global keyword-based control and local region-aware cross-attention modulation to steer the diffusion model without additional training, enabling precise alignment across both global attributes and local elements. The entire framework supports multi-round interactive refinement, facilitating real-time and context-aware image generation. Extensive experiments on the Viper dataset and our collected benchmark demonstrate that our method outperforms prior approaches in both quantitative metrics and human evaluations, and opens up new possibilities for dialog-based generation and MLLM-diffusion integration.

Title: Few-shot Human Action Anomaly Detection via a Unified Contrastive Learning Framework

Authors: Koichiro Kamide, Shunsuke Sakai, Shun Maeda, Chunzhi Gu, Chao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17726
Pdf URL: https://arxiv.org/pdf/2508.17726
Copy Paste: [[2508.17726]] Few-shot Human Action Anomaly Detection via a Unified Contrastive Learning Framework(https://arxiv.org/abs/2508.17726)
Keywords: generative
Abstract: Human Action Anomaly Detection (HAAD) aims to identify anomalous actions given only normal action data during training. Existing methods typically follow a one-model-per-category paradigm, requiring separate training for each action category and a large number of normal samples. These constraints hinder scalability and limit applicability in real-world scenarios, where data is often scarce or novel categories frequently appear. To address these limitations, we propose a unified framework for HAAD that is compatible with few-shot scenarios. Our method constructs a category-agnostic representation space via contrastive learning, enabling AD by comparing test samples with a given small set of normal examples (referred to as the support set). To improve inter-category generalization and intra-category robustness, we introduce a generative motion augmentation strategy harnessing a diffusion-based foundation model for creating diverse and realistic training samples. Notably, to the best of our knowledge, our work is the first to introduce such a strategy specifically tailored to enhance contrastive learning for action AD. Extensive experiments on the HumanAct12 dataset demonstrate the state-of-the-art effectiveness of our approach under both seen and unseen category settings, regarding training efficiency and model scalability for few-shot HAAD.
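The abstract above describes detecting anomalies by comparing a test sample with a small support set of normal examples in a learned representation space. A minimal sketch of one such comparison follows. The scoring rule (one minus the highest cosine similarity to any support embedding) is an assumption, since the abstract does not state the exact rule, and the two-dimensional vectors are stand-ins for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def anomaly_score(query, support):
    """Assumed rule: 1 - max cosine similarity to any support embedding."""
    return 1.0 - max(cosine(query, s) for s in support)

# Toy embeddings (hypothetical): two normal examples form the support set.
support_set = [[1.0, 0.0], [0.9, 0.1]]
normal_query = [1.0, 0.05]   # close to the support set
odd_query = [0.0, 1.0]       # far from every support example

normal_score = anomaly_score(normal_query, support_set)
odd_score = anomaly_score(odd_query, support_set)
```

A query near the support set gets a score close to 0, and a query far from every support example gets a larger score; a threshold on this score would then flag anomalies.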

Title: Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks

Authors: Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17744
Pdf URL: https://arxiv.org/pdf/2508.17744
Copy Paste: [[2508.17744]] Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks(https://arxiv.org/abs/2508.17744)
Keywords: generative
Abstract: In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. We consistently observe across 6 state-of-the-art text encoders and 26 downstream tasks, that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance, less than 10%, in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.
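The operation studied in this abstract, randomly removing a share of embedding dimensions, is simple to sketch. The version below assumes that the same randomly chosen index subset is applied to every vector, so that reduced vectors stay comparable with each other. The function name `drop_dimensions` and the Gaussian toy vectors are illustrative and not taken from the paper.

```python
import random

def drop_dimensions(embeddings, keep_fraction=0.5, seed=0):
    """Keep a random subset of dimensions, shared by all vectors.

    Returns the reduced vectors and the sorted list of kept indices.
    """
    dim = len(embeddings[0])
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(dim), int(dim * keep_fraction)))
    reduced = [[vec[i] for i in kept] for vec in embeddings]
    return reduced, kept

# Toy 8-dimensional "embeddings" (random values, for illustration only)
toy_rng = random.Random(1)
vectors = [[toy_rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]
reduced, kept = drop_dimensions(vectors, keep_fraction=0.5)
```

Downstream retrieval or classification would then be run on `reduced` instead of `vectors` to measure how much performance changes.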

Title: Multi-layer Abstraction for Nested Generation of Options (MANGO) in Hierarchical Reinforcement Learning

Authors: Alessio Arcudi, Davide Sartor, Alberto Sinigaglia, Vincent François-Lavet, Gian Antonio Susto
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.17751
Pdf URL: https://arxiv.org/pdf/2508.17751
Copy Paste: [[2508.17751]] Multi-layer Abstraction for Nested Generation of Options (MANGO) in Hierarchical Reinforcement Learning(https://arxiv.org/abs/2508.17751)
Keywords: generation
Abstract: This paper introduces MANGO (Multilayer Abstraction for Nested Generation of Options), a novel hierarchical reinforcement learning framework designed to address the challenges of long-term sparse reward environments. MANGO decomposes complex tasks into multiple layers of abstraction, where each layer defines an abstract state space and employs options to modularize trajectories into macro-actions. These options are nested across layers, allowing for efficient reuse of learned movements and improved sample efficiency. The framework introduces intra-layer policies that guide the agent's transitions within the abstract state space, and task actions that integrate task-specific components such as reward functions. Experiments conducted in procedurally-generated grid environments demonstrate substantial improvements in both sample efficiency and generalization capabilities compared to standard RL methods. MANGO also enhances interpretability by making the agent's decision-making process transparent across layers, which is particularly valuable in safety-critical and industrial applications. Future work will explore automated discovery of abstractions and abstract actions, adaptation to continuous or fuzzy environments, and more robust multi-layer training strategies.

Title: SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling

Authors: Fanjiang Ye, Zepeng Zhao, Yi Mu, Jucheng Shen, Renjie Li, Kaijian Wang, Desen Sun, Saurabh Agarwal, Myungjin Lee, Triston Cao, Aditya Akella, Arvind Krishnamurthy, T.S. Eugene Ng, Zhengzhong Tu, Yuke Wang
Subjects: cs.LG, eess.SY
Abstract URL: https://arxiv.org/abs/2508.17756
Pdf URL: https://arxiv.org/pdf/2508.17756
Copy Paste: [[2508.17756]] SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling(https://arxiv.org/abs/2508.17756)
Keywords: generation, generative
Abstract: Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains. However, generating ultra-high-resolution videos on existing standard-resolution (e.g., 720p) platforms remains challenging due to the excessive re-training requirements and prohibitively high computational and memory costs. To this end, we introduce SuperGen, an efficient tile-based framework for ultra-high-resolution video generation. SuperGen features a novel training-free algorithmic innovation with tiling to successfully support a wide range of resolutions without additional training efforts while significantly reducing both memory footprint and computational complexity. Moreover, SuperGen incorporates a tile-tailored, adaptive, region-aware caching strategy that accelerates video generation by exploiting redundancy across denoising steps and spatial regions. SuperGen also integrates cache-guided, communication-minimized tile parallelism for enhanced throughput and minimized latency. Evaluations demonstrate that SuperGen harvests the maximum performance gains while achieving high output quality across various benchmarks.

Title: CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation

Authors: Mingyue Yang, Dianxi Shi, Jialu Zhou, Xinyu Wei, Leqian Li, Shaowu Yang, Chunping Qiu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.17760
Pdf URL: https://arxiv.org/pdf/2508.17760
Copy Paste: [[2508.17760]] CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation(https://arxiv.org/abs/2508.17760)
Keywords: generation
Abstract: In Text-to-Image (T2I) generation, the complexity of entities and their intricate interactions poses a significant challenge for T2I methods based on diffusion models: how to effectively control entities and their interactions to produce high-quality images. To address this, we propose CEIDM, an image generation method based on a diffusion model with dual controls for entity and interaction. First, we propose an entity interactive relationship mining approach based on Large Language Models (LLMs), extracting reasonable and rich implicit interactive relationships through chain of thought to guide diffusion models to generate high-quality images that are closer to realistic logic and have more reasonable interactive relationships. Furthermore, we propose an interactive action clustering and offset method to cluster and offset the interactive action features contained in each text prompt. By constructing global and local bidirectional offsets, we enhance semantic understanding and detail supplementation of original actions, making the model's understanding of the concept of interactive "actions" more accurate and generating images with more accurate interactive actions. Finally, we design an entity control network which generates masks with entity semantic guidance, then leverages a multi-scale convolutional network to enhance entity features and a dynamic network to fuse features. It effectively controls entities and significantly improves image quality. Experiments show that the proposed CEIDM method is better than the most representative existing methods in both entity control and interaction control.

Title: Multi-domain Distribution Learning for De Novo Drug Design

Authors: Arne Schneuing, Ilia Igashov, Adrian W. Dobbelstein, Thomas Castiglione, Michael Bronstein, Bruno Correia
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2508.17815
Pdf URL: https://arxiv.org/pdf/2508.17815
Copy Paste: [[2508.17815]] Multi-domain Distribution Learning for De Novo Drug Design(https://arxiv.org/abs/2508.17815)
Keywords: generative
Abstract: We introduce DrugFlow, a generative model for structure-based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state-of-the-art performance in learning chemical, geometric, and physical aspects of three-dimensional protein-ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out-of-distribution samples. To further enhance the sampling process towards distribution regions with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.

Title: HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation

Authors: Xiping Wang, Yuxi Wang, Mengqi Zhou, Junsong Fan, Zhaoxiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17832
Pdf URL: https://arxiv.org/pdf/2508.17832
Copy Paste: [[2508.17832]] HLG: Comprehensive 3D Room Construction via Hierarchical Layout Generation(https://arxiv.org/abs/2508.17832)
Keywords: generation
Abstract: Realistic 3D indoor scene generation is crucial for virtual reality, interior design, embodied intelligence, and scene understanding. While existing methods have made progress in coarse-scale furniture arrangement, they struggle to capture fine-grained object placements, limiting the realism and utility of generated environments. This gap hinders immersive virtual experiences and detailed scene comprehension for embodied AI applications. To address these issues, we propose Hierarchical Layout Generation (HLG), a novel method for fine-grained 3D scene generation. HLG is the first to adopt a coarse-to-fine hierarchical approach, refining scene layouts from large-scale furniture placement to intricate object arrangements. Specifically, our fine-grained layout alignment module constructs a hierarchical layout through vertical and horizontal decoupling, effectively decomposing complex 3D indoor scenes into multiple levels of granularity. Additionally, our trainable layout optimization network addresses placement issues, such as incorrect positioning, orientation errors, and object intersections, ensuring structurally coherent and physically plausible scene generation. We demonstrate the effectiveness of our approach through extensive experiments, showing superior performance in generating realistic indoor scenes compared to existing methods. This work advances the field of scene generation and opens new possibilities for applications requiring detailed 3D environments. We will release our code upon publication to encourage future research.

Title: Diffusion-Based Data Augmentation for Medical Image Segmentation

Authors: Maham Nazir, Muhammad Aqeel, Francesco Setti
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.17844
Pdf URL: https://arxiv.org/pdf/2508.17844
Copy Paste: [[2508.17844]] Diffusion-Based Data Augmentation for Medical Image Segmentation(https://arxiv.org/abs/2508.17844)
Keywords: generation
Abstract: Medical image segmentation models struggle with rare abnormalities due to scarce annotated pathological data. We propose DiffAug, a novel framework that combines text-guided diffusion-based generation with automatic segmentation validation to address this challenge. Our proposed approach uses latent diffusion models conditioned on medical text descriptions and spatial masks to synthesize abnormalities via inpainting on normal images. Generated samples undergo dynamic quality validation through a latent-space segmentation network that ensures accurate localization while enabling single-step inference. The text prompts, derived from medical literature, guide the generation of diverse abnormality types without requiring manual annotation. Our validation mechanism filters synthetic samples based on spatial accuracy, maintaining quality while operating efficiently through direct latent estimation. Evaluated on three medical imaging benchmarks (CVC-ClinicDB, Kvasir-SEG, REFUGE2), our framework achieves state-of-the-art performance with 8-10% Dice improvements over baselines and reduces false negative rates by up to 28% for challenging cases like small polyps and flat lesions, which are critical for early detection in screening applications.
摘要：由于缺乏注释的病理数据，医疗图像分割模型因罕见异常而挣扎。我们提出了一个新型框架，将基于文本的扩散生成与自动分割验证相结合以应对这一挑战。我们提出的方法使用以医学文本描述和空间面膜为条件的潜在扩散模型，通过对正常图像的介入来综合异常。生成的样品通过潜在空间分割网络进行动态质量验证，该网络可确保在启用单步推理的同时准确定位。文本提示，源自医学文献，指导产生不同的异常类型，而无需手动注释。我们的验证机制根据空间精度过滤综合样品，在直接潜在估计中有效地保持质量，同时保持质量。在三个医学成像基准（CVC-ClinicDB，Kvasir-Seg，Rebuge2）上进行了评估，我们的框架可实现最先进的性能，对基线的骰子提高了8-10％，并将较小的小静态案例和诸如较早发现的较早发现的较高的疾病案例降低了28％，对筛查筛查的较早发现，高达28％。
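
The validation step in the DiffAug abstract above filters synthetic samples by spatial accuracy. A minimal sketch of one way such a filter could look, assuming the check is a Dice overlap between the conditioning mask and the mask predicted for the generated sample. The function names and the 0.7 threshold are hypothetical and are not taken from the paper.

```python
def dice(mask_a, mask_b):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_a = [v for row in mask_a for v in row]
    flat_b = [v for row in mask_b for v in row]
    intersection = sum(a * b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def keep_sample(target_mask, predicted_mask, threshold=0.7):
    """Keep a synthetic sample only if the predicted abnormality mask
    overlaps the requested (conditioning) mask well enough.
    The threshold value is an illustrative assumption."""
    return dice(target_mask, predicted_mask) >= threshold
```

A sample whose predicted mask covers only half of a two-pixel target scores 2/3 and would be discarded at this threshold.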

Title: Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection

Authors: Dabbrata Das, Mahshar Yahan, Md Tareq Zaman, Md Rishadul Bayesh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.17877
Pdf URL: https://arxiv.org/pdf/2508.17877
Copy Paste: [[2508.17877]] Edge-Enhanced Vision Transformer Framework for Accurate AI-Generated Image Detection(https://arxiv.org/abs/2508.17877)
Keywords: generative
Abstract: The rapid advancement of generative models has led to a growing prevalence of highly realistic AI-generated images, posing significant challenges for digital forensics and content authentication. Conventional detection methods mainly rely on deep learning models that extract global features, which often overlook subtle structural inconsistencies and demand substantial computational resources. To address these limitations, we propose a hybrid detection framework that combines a fine-tuned Vision Transformer (ViT) with a novel edge-based image processing module. The edge-based module computes variance from edge-difference maps generated before and after smoothing, exploiting the observation that AI-generated images typically exhibit smoother textures, weaker edges, and reduced noise compared to real images. When applied as a post-processing step on ViT predictions, this module enhances sensitivity to fine-grained structural cues while maintaining computational efficiency. Extensive experiments on the CIFAKE, Artistic, and Custom Curated datasets demonstrate that the proposed framework achieves superior detection performance across all benchmarks, attaining 97.75% accuracy and a 97.77% F1-score on CIFAKE, surpassing widely adopted state-of-the-art models. These results establish the proposed method as a lightweight, interpretable, and effective solution for both still images and video frames, making it highly suitable for real-world applications in automated content verification and digital forensics.
摘要：生成模型的快速发展导致高度现实的AI生成图像的流行率越来越高，对数字取证和内容认证提出了重大挑战。传统的检测方法主要依赖于提取全球特征的深度学习模型，这些模型通常忽略细微的结构不一致并需要大量的计算资源。为了解决这些局限性，我们提出了一个混合检测框架，该框架将微调视觉变压器（VIT）与新型的基于边缘的图像处理模块结合在一起。基于边缘的模块计算平滑之前和之后产生的边缘差异图的方差，并利用了与真实图像相比，AI生成的图像通常表现出更平滑的纹理，较弱的边缘和噪声减少。当在VIT预测上应用后处理步骤时，该模块会增强对细粒结构线索的敏感性，同时保持计算效率。关于CIFAKE，艺术和定制策划数据集的广泛实验表明，所提出的框架在所有基准测试中都达到了卓越的检测性能，在CIFAKE上达到97.75％的准确性和97.77％的F1-SCORE，超过了广泛采用的Sate Sate-tar-Art模型。这些结果将提出的方法确定为对静止图像和视频帧的轻巧，可解释和有效的解决方案，使其非常适合于自动内容验证和数字取证中的现实世界应用。
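
The edge-based module in the abstract above computes a variance from edge-difference maps taken before and after smoothing. The sketch below illustrates that idea under stated assumptions: a forward-difference gradient as the edge detector and a 3x3 mean filter as the smoother. Both choices, and all function names, are hypothetical; the abstract does not specify the paper's operators or how the score is combined with the ViT predictions.

```python
def edge_map(img):
    """L1 gradient magnitude from forward differences (borders clamped)."""
    h, w = len(img), len(img[0])
    return [
        [
            abs(img[y][min(x + 1, w - 1)] - img[y][x])
            + abs(img[min(y + 1, h - 1)][x] - img[y][x])
            for x in range(w)
        ]
        for y in range(h)
    ]


def box_blur(img):
    """3x3 mean filter with clamped borders (assumed smoothing operator)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(sum(window) / 9.0)
        out.append(row)
    return out


def edge_difference_variance(img):
    """Variance of (edges before smoothing) minus (edges after smoothing)."""
    before = edge_map(img)
    after = edge_map(box_blur(img))
    diff = [b - a for rb, ra in zip(before, after) for b, a in zip(rb, ra)]
    mean = sum(diff) / len(diff)
    return sum((d - mean) ** 2 for d in diff) / len(diff)
```

On a grayscale image given as a nested list, a textureless input yields a variance of zero, while strongly textured inputs yield larger values, which matches the intuition that smoother images lose less edge energy under blurring.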

Title: UniAPO: Unified Multimodal Automated Prompt Optimization

Authors: Qipeng Zhu, Yanzhe Chen, Huasong Zhong, Yan Li, Jie Chen, Zhixin Zhang, Junping Zhang, Zhenheng Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.17890
Pdf URL: https://arxiv.org/pdf/2508.17890
Copy Paste: [[2508.17890]] UniAPO: Unified Multimodal Automated Prompt Optimization(https://arxiv.org/abs/2508.17890)
Keywords: generation
Abstract: Prompting is fundamental to unlocking the full potential of large language models. To automate and enhance this process, automatic prompt optimization (APO) has been developed, demonstrating effectiveness primarily in text-only input scenarios. However, extending existing APO methods to multimodal tasks, such as video-language generation, introduces two core challenges: (i) visual token inflation, where long visual token sequences restrict context capacity and result in insufficient feedback signals; (ii) a lack of process-level supervision, as existing methods focus on outcome-level supervision and overlook intermediate supervision, limiting prompt optimization. We present UniAPO: Unified Multimodal Automated Prompt Optimization, the first framework tailored for multimodal APO. UniAPO adopts an EM-inspired optimization process that decouples feedback modeling and prompt refinement, making the optimization more stable and goal-driven. To further address the aforementioned challenges, we introduce a short-long term memory mechanism: historical feedback mitigates context limitations, while historical prompts provide directional guidance for effective prompt optimization. UniAPO achieves consistent gains across text, image, and video benchmarks, establishing a unified framework for efficient and transferable prompt optimization.
摘要：提示对于解锁大型语言模型的全部潜力至关重要。为了自动化和增强此过程，已经开发了自动及时优化（APO），主要在仅在文本输入方案中证明有效性。但是，将现有的APO方法扩展到多模式任务，例如视频语言生成引入了两个核心挑战：（i）视觉令牌通货膨胀，其中长时间的视觉令牌序列限制了上下文容量，并导致反馈信号不足；（ii）缺乏过程级别的监督，因为现有方法着重于结果级的监督和忽略中间监督，从而限制了及时的优化。我们提出Uniapo：统一的多模式自动化及时优化，这是针对多模式APO量身定制的第一个框架。 Uniapo采用了EM启发的优化过程，该过程将解除反馈建模和及时改进，从而使优化更加稳定和目标驱动。为了进一步解决上述挑战，我们引入了长期的术语记忆机制：历史反馈减轻上下文限制，而历史提示为有效的迅速优化提供了方向指导。 Uniapo在文本，图像和视频基准中实现了一致的收益，建立了一个统一的框架，以进行有效且可转让的及时优化。

Title: Generative Feature Imputing - A Technique for Error-resilient Semantic Communication

Authors: Jianhao Huang, Qunsong Zeng, Hongyang Du, Kaibin Huang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.17957
Pdf URL: https://arxiv.org/pdf/2508.17957
Copy Paste: [[2508.17957]] Generative Feature Imputing - A Technique for Error-resilient Semantic Communication(https://arxiv.org/abs/2508.17957)
Keywords: generation, generative
Abstract: Semantic communication (SemCom) has emerged as a promising paradigm for achieving unprecedented communication efficiency in sixth-generation (6G) networks by leveraging artificial intelligence (AI) to extract and transmit the underlying meanings of source data. However, deploying SemCom over digital systems presents new challenges, particularly in ensuring robustness against transmission errors that may distort semantically critical content. To address this issue, this paper proposes a novel framework, termed generative feature imputing, which comprises three key techniques. First, we introduce a spatial error concentration packetization strategy that spatially concentrates feature distortions by encoding feature elements based on their channel mappings, a property crucial for both the effectiveness and reduced complexity of the subsequent techniques. Second, building on this strategy, we propose a generative feature imputing method that utilizes a diffusion model to efficiently reconstruct missing features caused by packet losses. Finally, we develop a semantic-aware power allocation scheme that enables unequal error protection by allocating transmission power according to the semantic importance of each packet. Experimental results demonstrate that the proposed framework outperforms conventional approaches, such as Deep Joint Source-Channel Coding (DJSCC) and JPEG2000, under block fading conditions, achieving higher semantic accuracy and lower Learned Perceptual Image Patch Similarity (LPIPS) scores.
摘要：语义通信（SEMCOM）已成为通过利用人工智能（AI）提取和传输源数据的潜在含义来实现第六代（6G）网络中前所未有的通信效率的有希望的范式。但是，在数字系统上部署SEMCOM提出了新的挑战，尤其是在确保可能扭曲语义上关键内容的传输错误的鲁棒性方面。为了解决这个问题，本文提出了一个新颖的框架，称为“生成特征”，其中包括三个关键技术。首先，我们引入了一种空间误差浓度包装策略，该策略通过基于其通道映射编码特征元素来在空间上浓缩畸变，这是对后续技术的有效性和降低复杂性的至关重要的属性。其次，基于此策略，我们提出了一种生成功能归档方法，该方法利用扩散模型有效地重建由数据包丢失引起的缺失特征。最后，我们开发了一种语义感知的功率分配方案，该方案通过根据每个数据包的语义重要性来分配传输功率来实现不等的错误保护。实验结果表明，在块褪色条件下，提出的框架优于传统方法，例如深关节源通道编码（DJSCC）和JPEG2000，达到了较高的语义准确性，并且具有较低的学习感知觉图像贴片相似性（LPIPS）。

Title: A Novel Framework for Uncertainty Quantification via Proper Scores for Classification and Beyond

Authors: Sebastian G. Gruber
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2508.18001
Pdf URL: https://arxiv.org/pdf/2508.18001
Copy Paste: [[2508.18001]] A Novel Framework for Uncertainty Quantification via Proper Scores for Classification and Beyond(https://arxiv.org/abs/2508.18001)
Keywords: generation, generative
Abstract: In this PhD thesis, we propose a novel framework for uncertainty quantification in machine learning, which is based on proper scores. Uncertainty quantification is an important cornerstone for trustworthy and reliable machine learning applications in practice. Usually, approaches to uncertainty quantification are problem-specific, and solutions and insights cannot be readily transferred from one task to another. Proper scores are loss functions minimized by predicting the target distribution. Due to their very general definition, proper scores apply to regression, classification, or even generative modeling tasks. We contribute several theoretical results that connect epistemic uncertainty, aleatoric uncertainty, and model calibration with proper scores, resulting in a general and widely applicable framework. We achieve this by introducing a general bias-variance decomposition for strictly proper scores via functional Bregman divergences. Specifically, we use the kernel score, a kernel-based proper score, for evaluating sample-based generative models in various domains, like image, audio, and natural language generation. This includes a novel approach for uncertainty estimation of large language models, which outperforms state-of-the-art baselines. Further, we generalize the calibration-sharpness decomposition beyond classification, which motivates the definition of proper calibration errors. We then introduce a novel estimator for proper calibration errors in classification, and a novel risk-based approach to compare different estimators for squared calibration errors. Last, we offer a decomposition of the kernel spherical score, another kernel-based proper score, allowing a more fine-grained and interpretable evaluation of generative image models.
摘要：在本博士学位论文中，我们为机器学习中的不确定性定量提出了一个新的框架，该框架基于适当的分数。不确定性量化是实践中值得信赖且可靠的机器学习应用程序的重要基石。通常，不确定性量化的方法是特定于问题的，解决方案和见解不能轻易从一个任务转移到另一个任务。适当的分数是通过预测目标分布来最小化损失函数。由于其一般定义，适当的分数适用于回归，分类甚至生成建模任务。我们贡献了几种理论结果，这些结果将认识论不确定性，不确定性和模型校准与适当的分数联系起来，从而导致一般且广泛适用的框架。我们通过引入一般的偏见差异分解来实现这一目标，以通过功能性的Bregman Diverence进行严格适当的分数。具体来说，我们使用基于内核的适当分数内核分数来评估各个领域的基于样本的生成模型，例如图像，音频和自然语言生成。这包括一种新颖的方法，用于对大语言模型的不确定性估计，这表现优于最先进的基准。此外，我们概括了分类超出分类的校准分解分解，这激发了适当的校准错误的定义。然后，我们引入了一个新颖的估计器，以实现分类中正确的校准误差，以及一种基于风险的新方法，以比较平方校准误差的不同估计器。最后，我们提供了内核球形分数的分解，这是另一个基于内核的适当分数，可以对生成图像模型进行更细粒度和可解释的评估。
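
The abstract above uses the kernel score to evaluate sample-based generative models. As a rough illustration, the sketch below estimates a kernel score from samples in one common loss-oriented convention, 0.5 * E[k(X, X')] - E[k(X, y)], where lower is better. The Gaussian kernel, the bandwidth, and the sign and scaling convention are assumptions here; the thesis may define the score differently.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the bandwidth is an illustrative choice."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def kernel_score(samples, y, kernel=gaussian_kernel):
    """Sample-based estimate of 0.5 * E[k(X, X')] - E[k(X, y)] (lower is better).

    `samples` are draws from the model; `y` is the observed target.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples for the pairwise term")
    # Agreement between model samples and the observation.
    cross = sum(kernel(x, y) for x in samples) / n
    # Unbiased pairwise term: average over distinct sample pairs.
    pair = sum(
        kernel(samples[i], samples[j])
        for i in range(n)
        for j in range(n)
        if i != j
    ) / (n * (n - 1))
    return 0.5 * pair - cross
```

Because the score needs only model samples and a kernel, the same estimator applies to any domain where a suitable kernel on outputs is available.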

Title: AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration

Authors: Aditri Paul, Archan Paul
Subjects: cs.LG, cs.AI, cs.CV, cs.ET, eess.SY
Abstract URL: https://arxiv.org/abs/2508.18025
Pdf URL: https://arxiv.org/pdf/2508.18025
Copy Paste: [[2508.18025]] AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration(https://arxiv.org/abs/2508.18025)
Keywords: generation
Abstract: Autonomous planetary exploration missions are critically dependent on real-time, accurate environmental perception for navigation and hazard avoidance. However, deploying deep learning models on the resource-constrained computational hardware of planetary exploration platforms remains a significant challenge. This paper introduces the Adaptive Quantized Planetary Crater Detection System (AQ-PCDSys), a novel framework specifically engineered for real-time, onboard deployment in the computationally constrained environments of space exploration missions. AQ-PCDSys synergistically integrates a Quantized Neural Network (QNN) architecture, trained using Quantization-Aware Training (QAT), with an Adaptive Multi-Sensor Fusion (AMF) module. The QNN architecture significantly optimizes model size and inference latency suitable for real-time onboard deployment in space exploration missions, while preserving high accuracy. The AMF module intelligently fuses data from Optical Imagery (OI) and Digital Elevation Models (DEMs) at the feature level, utilizing an Adaptive Weighting Mechanism (AWM) to dynamically prioritize the most relevant and reliable sensor modality based on planetary ambient conditions. This approach enhances detection robustness across diverse planetary landscapes. Paired with Multi-Scale Detection Heads specifically designed for robust and efficient detection of craters across a wide range of sizes, AQ-PCDSys provides a computationally efficient, reliable and accurate solution for planetary crater detection, a critical capability for enabling the next generation of autonomous planetary landing, navigation, and scientific exploration.
摘要：自主行星探索任务至关取决于实时，准确的环境感知，以避免导航和危险。但是，在行星勘探平台的资源受限计算硬件上部署深度学习模型仍然是一个重大挑战。本文介绍了自适应量化的行星火山口检测系统（AQ-PCDSYS），这是一个专门设计用于实时的新型框架，在空间探索任务的计算约束环境中进行了船上部署。 AQ-PCDSYS协同整合了使用量化感知训练（QAT）训练的量化神经网络（QNN）体系结构，并具有自适应的多传感器融合（AMF）模块。 QNN体系结构可显着优化模型大小和推理潜伏期，适用于空间探索任务中的实时部署，同时保持高精度。 AMF模块在功能级别上智能融合了来自光学图像（OI）和数字高程模型（DEM）的数据，并利用自适应加权机制（AWM），根据行星环境条件动态优先考虑最相关和最可靠的传感器方式。这种方法增强了各种行星景观的鲁棒性。 AQ-PCDSYS与专门设计的多尺度检测头配对，专门设计用于对各种尺寸的陨石坑的稳健和有效检测，AQ-PCDSYS为计算高效，可靠和准确的解决方案提供了用于行星火山口检测的关键能力，这是实现下一代自主行星降落，导航，导航和科学探索的关键能力。

Title: FCR: Investigating Generative AI models for Forensic Craniofacial Reconstruction

Authors: Ravi Shankar Prasad, Dinesh Singh
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18031
Pdf URL: https://arxiv.org/pdf/2508.18031
Copy Paste: [[2508.18031]] FCR: Investigating Generative AI models for Forensic Craniofacial Reconstruction(https://arxiv.org/abs/2508.18031)
Keywords: generative
Abstract: Craniofacial reconstruction in forensics is one of the processes used to identify victims of crime and natural disasters. Identifying an individual from their remains plays a crucial role when all other identification methods fail. Traditional methods for this task, such as clay-based craniofacial reconstruction, require expert domain knowledge and are time-consuming. At the same time, other probabilistic generative models like the statistical shape model or the Basel face model fail to capture the skull and face cross-domain attributes. Considering these limitations, we propose a generic framework for craniofacial reconstruction from 2D X-ray images. Here, we used various generative models (e.g., CycleGANs and cGANs) and fine-tuned the generator and discriminator parts to generate more realistic images in two distinct domains, namely the skull and face of an individual. This is the first time that 2D X-rays have been used as a representation of the skull by generative models for craniofacial reconstruction. We have evaluated the quality of generated faces using FID, IS, and SSIM scores. Finally, we have proposed a retrieval framework where the query is the generated face image and the gallery is the database of real faces. Our experimental results indicate that this can be an effective tool for forensic science.
摘要：取证中的颅面重建是确定犯罪和自然灾害受害者的过程之一。当所有其他识别方法失败时，从遗体中识别个人的遗体起着至关重要的作用。该任务的传统方法，例如基于粘土的颅面重建，需要专家领域知识，并且是一个耗时的过程。同时，其他概率生成模型（例如统计形状模型或巴塞尔面模型）无法捕获头骨和脸部交叉域属性。从这些限制来看，我们提出了一个从2D X射线图像中的颅面重建的通用框架。在这里，我们使用了各种生成模型（即自行车，CGAN等），并微调发电机和鉴别零件来在两个不同的域中生成更逼真的图像，这些域是个体的头骨和表面。这是第一次将2D X射线用作颅面重建的生成模型来表示头骨的表示。我们已经评估了使用FID，IS和SSIM分数生成的面孔的质量。最后，我们提出了一个检索框架，其中查询是生成的面部图像，而画廊是真实面孔的数据库。通过实验结果，我们发现这可以是法医科学的有效工具。

Title: Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation

Authors: Yaqi Li, Peng Chen, Mingyang Han, Bu Pi, Haoxiang Shi, Runzhou Zhao, Yang Yao, Xuan Zhang, Jun Song
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18032
Pdf URL: https://arxiv.org/pdf/2508.18032
Copy Paste: [[2508.18032]] Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation(https://arxiv.org/abs/2508.18032)
Keywords: generation
Abstract: Despite the promising progress of recent autoregressive models in text-to-image (T2I) generation, their ability to handle multi-attribute and ambiguous prompts remains limited. To address these limitations, existing works have applied chain-of-thought (CoT) to enable stage-aware visual synthesis and employed reinforcement learning (RL) to improve reasoning capabilities. However, most models provide reward signals only at the end of the generation stage. This monolithic final-only guidance makes it difficult to identify which stages contribute positively to the final outcome and may lead to suboptimal policies. To tackle this issue, we propose a Visual-Chain of Guidance (Visual-CoG) paradigm consisting of three stages: semantic reasoning, process refining, and outcome evaluation, with stage-aware rewards providing immediate guidance throughout the image generation pipeline. We further construct a visual cognition benchmark, VisCog-Bench, which comprises four subtasks to evaluate the effectiveness of semantic reasoning. Comprehensive evaluations on GenEval, T2I-CompBench, and the proposed VisCog-Bench show improvements of 15%, 5%, and 19%, respectively, demonstrating the superior performance of the proposed Visual-CoG. We will release all the resources soon.
摘要：尽管文本到图像（T2I）的最新自回归模型取得了希望，但它们处理多属性和模棱两可的提示的能力仍然有限。为了解决这些局限性，现有的作品应用了思想链（COT），以实现舞台感知的视觉合成和使用的强化学习（RL）以提高推理能力。但是，大多数模型仅在生成阶段结束时提供奖励信号。这个单层的最终指导使很难确定哪些阶段对最终结果产生了积极的贡献，并可能导致次优政策。为了解决这个问题，我们提出了一个由三个阶段组成的指导链（Visual-cog）范式：语义推理，过程完善和结果评估，并在整个图像生成管道中提供了舞台感知的奖励。我们进一步构建了一个视觉认知基准Viscog Bench，该基准包括四个子任务以评估语义推理的有效性。对Geneval，T2i-Compbench和拟议的Viscog板凳的全面评估分别提高了15％，5％和19％的改善，证明了所提出的视觉库的出色表现。我们将尽快发布所有资源。

Title: Incorporating Pre-trained Diffusion Models in Solving the Schrödinger Bridge Problem

Authors: Zhicong Tang, Tiankai Hang, Shuyang Gu, Dong Chen, Baining Guo
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.18095
Pdf URL: https://arxiv.org/pdf/2508.18095
Copy Paste: [[2508.18095]] Incorporating Pre-trained Diffusion Models in Solving the Schrödinger Bridge Problem(https://arxiv.org/abs/2508.18095)
Keywords: generative
Abstract: This paper aims to unify Score-based Generative Models (SGMs), also known as Diffusion models, and the Schrödinger Bridge (SB) problem through three reparameterization techniques: Iterative Proportional Mean-Matching (IPMM), Iterative Proportional Terminus-Matching (IPTM), and Iterative Proportional Flow-Matching (IPFM). These techniques significantly accelerate and stabilize the training of SB-based models. Furthermore, the paper introduces novel initialization strategies that use pre-trained SGMs to effectively train SB-based models. By using SGMs as initialization, we leverage the advantages of both SB-based models and SGMs, ensuring efficient training of SB-based models and further improving the performance of SGMs. Extensive experiments demonstrate the significant effectiveness and improvements of the proposed methods. We believe this work contributes to and paves the way for future research on generative models.
摘要：本文旨在通过三种重新聚体化技术统一基于得分的生成模型（SGMS），也称为扩散模型，以及Schrödinger桥（SB）问题：迭代成比例均值匹配（IPMM），迭代性比例终末匹配（IPTM），IPTM）和迭代效果及以比例的流量流程（IPFM）。这些技术显着加速并稳定基于SB的模型的训练。此外，本文介绍了使用预训练的SGM有效训练基于SB的模型的新型初始化策略。通过使用SGM作为初始化，我们利用基于SB的模型和SGM的优势，确保对基于SB的模型有效培训并进一步改善SGM的性能。广泛的实验证明了所提出方法的显着有效性和改进。我们认为，这项工作为生成模型的未来研究做出了贡献，并为未来的研究铺平了道路。

Title: Provable Mixed-Noise Learning with Flow-Matching

Authors: Paul Hagemann, Robert Gruhlke, Bernhard Stankewitz, Claudia Schillings, Gabriele Steidl
Subjects: cs.LG, math.OC
Abstract URL: https://arxiv.org/abs/2508.18122
Pdf URL: https://arxiv.org/pdf/2508.18122
Copy Paste: [[2508.18122]] Provable Mixed-Noise Learning with Flow-Matching(https://arxiv.org/abs/2508.18122)
Keywords: generative
Abstract: We study Bayesian inverse problems with mixed noise, modeled as a combination of additive and multiplicative Gaussian components. While traditional inference methods often assume fixed or known noise characteristics, real-world applications, particularly in physics and chemistry, frequently involve noise with unknown and heterogeneous structure. Motivated by recent advances in flow-based generative modeling, we propose a novel inference framework based on conditional flow matching embedded within an Expectation-Maximization (EM) algorithm to jointly estimate posterior samplers and noise parameters. To enable high-dimensional inference and improve scalability, we use simulation-free ODE-based flow matching as the generative model in the E-step of the EM algorithm. We prove that, under suitable assumptions, the EM updates converge to the true noise parameters in the population limit of infinite observations. Our numerical results illustrate the effectiveness of combining EM inference with flow matching for mixed-noise Bayesian inverse problems.
摘要：我们研究了混合噪声的贝叶斯反问题，该噪声构建为添加剂和乘法高斯组件的组合。尽管传统的推理方法通常假定固定或已知的噪声特征，但现实世界的应用，尤其是在物理和化学方面，经常涉及未知和异质结构的噪声。在基于流量的生成建模方面的最新进展中，我们提出了一个基于嵌入在期望最大化（EM）算法中的条件流量匹配的新型推理框架，以共同估计后验样本和噪声参数。为了启用高维推断并提高可扩展性，我们使用基于无模拟的流动流匹配作为EM算法的E步骤中的生成模型。我们证明，在适当的假设下，EM更新会在无限观察的总体限制中收敛到真实的噪声参数。我们的数值结果说明了将EM推断与混合噪声贝叶斯逆问题相结合的有效性。

Title: SpotEdit: Evaluating Visually-Guided Image Editing Methods

Authors: Sara Ghazanfari, Wei-An Lin, Haitong Tian, Ersin Yumer
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.18159
Pdf URL: https://arxiv.org/pdf/2508.18159
Copy Paste: [[2508.18159]] SpotEdit: Evaluating Visually-Guided Image Editing Methods(https://arxiv.org/abs/2508.18159)
Keywords: generation, generative
Abstract: Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at this https URL.
摘要：视觉引导的图像编辑在视觉提示和文本提示下进行了调节，已成为一种强大的范式，可用于细粒度，可控制的内容生成。尽管最近的生成模型显示出了显着的功能，但现有的评估仍然简单而不足以代表现实世界编辑挑战。我们提出了SpotEdit，这是一种综合基准，旨在系统地评估各种扩散，自回归和混合生成模型的视觉引导的图像编辑方法，从而发现了实质性的性能差异。为了应对一个关键但毫无争议的挑战，我们的基准包括一个专门的幻觉组成部分，强调了诸如GPT-4O之类的领先模型如何经常幻觉，通常会幻觉地存在视觉提示并错误地执行编辑任务。我们的代码和基准在此HTTPS URL上公开发布。

Title: Amortized Sampling with Transferable Normalizing Flows

Authors: Charlie B. Tan, Majdi Hassan, Leon Klein, Saifuddin Syed, Dominique Beaini, Michael M. Bronstein, Alexander Tong, Kirill Neklyudov
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.18175
Pdf URL: https://arxiv.org/pdf/2508.18175
Copy Paste: [[2508.18175]] Amortized Sampling with Transferable Normalizing Flows(https://arxiv.org/abs/2508.18175)
Keywords: generative
Abstract: Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in full for each system of interest. The widespread success of generative models has inspired interest in overcoming this limitation through learning sampling algorithms. Despite performing on par with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We prove that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 280 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve superior performance to established methods such as sequential Monte Carlo on unseen tetrapeptides. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.
摘要：分子构象的有效平衡采样仍然是计算化学和统计推断的核心挑战。经典方法，例如分子动力学或马尔可夫链蒙特卡洛固有地缺乏摊销。必须为每个感兴趣系统支付抽样的计算成本。生成模型的广泛成功激发了人们通过学习采样算法克服这一限制的兴趣。尽管在单个系统上接受培训时，采用常规方法进行了表现，但到目前为止，学到的采样器表明，跨系统传输的能力有限。我们证明，深度学习可以通过引入散文来设计可扩展和可传递的采样器，这是一个2.8亿个参数全原子的可转移归一化流，该流程在肽分子动力学轨迹上训练了长度高达8个残基。散文为任意肽系统的零射击建议样品零摄，从而实现了跨序列长度的先前棘手的可传递性，同时保留了对归一化流量的有效可能性评估。通过广泛的经验评估，我们证明了散文作为各种抽样算法的提议的疗效，找到了一种简单的重要性基于采样的芬太尼程序，以实现优于既定方法的效果，例如顺序蒙特卡洛（Sequention Monte Carlo），对未见的四肽。我们开源的散文代码库，模型权重和培训数据集，以进一步刺激对摊销采样方法和填充目标的研究。
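
The abstract above relies on a model with tractable likelihoods serving as a proposal for importance sampling. The sketch below shows self-normalized importance sampling in one dimension under loud assumptions: a fixed Gaussian stands in for the normalizing-flow proposal, and the target is a toy unnormalized density rather than a molecular Boltzmann distribution. None of the names come from the Prose codebase.

```python
import math
import random


def log_proposal(x, mu=0.0, sigma=2.0):
    """Log-density of the stand-in proposal N(mu, sigma^2).
    A real flow would supply this via its change-of-variables likelihood."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def log_target_unnorm(x):
    """Toy unnormalized log-target: a unit-variance Gaussian centred at 1."""
    return -0.5 * (x - 1.0) ** 2


def snis_mean_and_ess(n, seed=0):
    """Self-normalized importance sampling estimate of E[x] under the target,
    together with the effective sample size of the weights."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 2.0) for _ in range(n)]  # draws from the proposal
    log_w = [log_target_unnorm(x) - log_proposal(x) for x in xs]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]  # normalized weights sum to 1
    mean = sum(wi * xi for wi, xi in zip(w, xs))
    ess = 1.0 / sum(wi * wi for wi in w)
    return mean, ess
```

The effective sample size indicates how well the proposal covers the target: values close to `n` mean the weights are nearly uniform, while small values signal a poor match.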

Title: Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation

Authors: Ashwath Vaithinathan Aravindan, Abha Jha, Matthew Salaway, Atharva Sandeep Bhide, Duygu Nur Yaldiz
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18235
Pdf URL: https://arxiv.org/pdf/2508.18235
Copy Paste: [[2508.18235]] Sealing The Backdoor: Unlearning Adversarial Text Triggers In Diffusion Models Using Knowledge Distillation(https://arxiv.org/abs/2508.18235)
Keywords: generation, generative
Abstract: Text-to-image diffusion models have revolutionized generative AI, but their vulnerability to backdoor attacks poses significant security risks. Adversaries can inject imperceptible textual triggers into training data, causing models to generate manipulated outputs. Although text-based backdoor defenses in classification models are well-explored, generative models lack effective mitigation techniques. We address this by selectively erasing the model's learned associations between adversarial text triggers and poisoned outputs, while preserving overall generation quality. Our approach, Self-Knowledge Distillation with Cross-Attention Guidance (SKD-CAG), uses knowledge distillation to guide the model in correcting responses to poisoned prompts while maintaining image quality by exploiting the fact that the backdoored model still produces clean outputs in the absence of triggers. Using the cross-attention mechanism, SKD-CAG neutralizes backdoor influences at the attention level, ensuring the targeted removal of adversarial effects. Extensive experiments show that our method outperforms existing approaches, achieving 100% removal accuracy for pixel backdoors and 93% for style-based attacks, without sacrificing robustness or image fidelity. Our findings highlight targeted unlearning as a promising defense to secure generative models. Code and model weights can be found at this https URL.
摘要：文本到图像扩散模型已彻底改变了生成的AI，但是它们对后门攻击的脆弱性会带来很大的安全风险。对手可以将不可察觉的文本触发器注入训练数据中，从而导致模型生成受操纵的输出。尽管分类模型中基于文本的后门防御措施是经过充分探索的，但生成模型缺乏有效的缓解技术。我们通过选择性地擦除模型在对抗文本触发器和中毒输出之间的学习相关性来解决这一问题，同时保持整体发电质量。我们的方法，即交叉注意指导（SKD-CAG）的自我知识蒸馏，使用知识蒸馏来指导模型纠正对中毒提示的响应，同时通过利用在没有触发因素的情况下仍会产生干净的输出来维持图像质量。使用跨注意机制，SKD-CAG中和在注意水平上的后门影响，以确保靶向去除对抗性效应。广泛的实验表明，我们的方法的表现优于现有方法，在不牺牲鲁棒性或图像保真度的情况下，对像素后门的删除精度为100 \％，基于样式的攻击的方法为93 \％。我们的发现重点阐明了针对性的统治，作为确保生成模型的有前途的防御。代码和模型权重可以在此HTTPS URL上找到。

Title: Interpretable Evaluation of AI-Generated Content with Language-Grounded Sparse Encoders

Authors: Yiming Tang, Arash Lagzian, Srinivas Anumasa, Qiran Zou, Trang Nguyen, Ehsan Adeli, Ching-Yu Cheng, Yilun Du, Dianbo Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.18236
Pdf URL: https://arxiv.org/pdf/2508.18236
Copy Paste: [[2508.18236]] Interpretable Evaluation of AI-Generated Content with Language-Grounded Sparse Encoders(https://arxiv.org/abs/2508.18236)
Keywords: generation, generative
Abstract: While the quality of AI-generated content, such as synthetic images, has become remarkably high, current evaluation metrics provide only coarse-grained assessments, failing to identify specific strengths and weaknesses that researchers and practitioners need for model selection and development, further limiting the scientific understanding and commercial deployment of these generative models. To address this, we introduce Language-Grounded Sparse Encoders (LanSE), a novel architecture that creates interpretable evaluation metrics by identifying interpretable visual patterns and automatically describing them in natural language. Through large-scale human evaluation (more than 11,000 annotations) and large multimodal model (LMM) based analysis, LanSE demonstrates reliable capabilities to detect interpretable visual patterns in synthetic images with more than 93% accuracy in natural images. LanSE further provides a fine-grained evaluation framework that quantifies four key dimensions of generation quality: prompt match, visual realism, physical plausibility, and content diversity. LanSE reveals nuanced model differences invisible to existing metrics, for instance, FLUX's superior physical plausibility and SDXL-medium's strong content diversity, while aligning with human judgments. By bridging interpretability with practical evaluation needs, LanSE offers all users of generative AI models a powerful tool for model selection, quality control of synthetic content, and model improvement. These capabilities directly address the need for public confidence and safety in AI-generated content, both critical for the future of generative AI applications.
摘要：尽管AI生成的内容（例如合成图像）的质量已变得非常高，但目前的评估指标仅提供粗粒度的评估，无法确定研究人员和从业者需要进行模型选择和开发的特定优势和劣势，进一步限制了这些生成生成模型的科学理解和商业部署。为了解决这个问题，我们介绍了具有语言基础的稀疏编码器（LANSE），这是一种新颖的体系结构，通过识别可解释的视觉模式并以自然语言自动描述它们，从而创建可解释的评估指标。通过大规模的人类评估（超过11,000个注释）和大型多模型模型（LMM）分析，Lanse展示了可靠的功能，可在自然图像中具有超过93 \％精度的合成图像中检测可解释的视觉模式。 Lanse进一步提供了一个精细的评估框架，该框架量化了发电质量，及时匹配，视觉现实主义，身体合理性和内容多样性的四个关键维度。 Lanse揭示了现有指标不可见的细微模型差异，例如，Flux的出色物理合理性和SDXL-MEDIUM的强大内容多样性，同时与人类判断保持一致。通过将可解释性与实际评估需求桥接，LANSE为所有生成AI模型的用户提供了一种强大的模型选择工具，合成内容的质量控制以及改进模型。这些功能直接解决了对AI生成内容的公众信心和安全的需求，这对于生成AI应用程序的未来至关重要。