2025-08-05

Title: PCS Workflow for Veridical Data Science in the Age of AI

Authors: Zachary T. Rewolinski, Bin Yu
Subjects: cs.LG, cs.AI, stat.ME
Abstract URL: https://arxiv.org/abs/2508.00835
Pdf URL: https://arxiv.org/pdf/2508.00835
Copy Paste: [[2508.00835]] PCS Workflow for Veridical Data Science in the Age of AI(https://arxiv.org/abs/2508.00835)
Keywords: generative
Abstract: Data science is a pillar of artificial intelligence (AI), which is transforming nearly every domain of human activity, from the social and physical sciences to engineering and medicine. While data-driven findings in AI offer unprecedented power to extract insights and guide decision-making, many are difficult or impossible to replicate. A key reason for this challenge is the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC). Traditional statistical frameworks often fail to account for this uncertainty. The Predictability-Computability-Stability (PCS) framework for veridical (truthful) data science offers a principled approach to addressing this challenge throughout the DSLC. This paper presents an updated and streamlined PCS workflow, tailored for practitioners and enhanced with guided use of generative AI. We include a running example to display the PCS framework in action, and conduct a related case study which showcases the uncertainty in downstream predictions caused by judgment calls in the data cleaning stage.

Title: A Residual Guided strategy with Generative Adversarial Networks in training Physics-Informed Transformer Networks

Authors: Ziyang Zhang, Feifan Zhang, Weidong Tang, Lei Shi, Tailai Chen
Subjects: cs.LG, cs.CE, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2508.00855
Pdf URL: https://arxiv.org/pdf/2508.00855
Copy Paste: [[2508.00855]] A Residual Guided strategy with Generative Adversarial Networks in training Physics-Informed Transformer Networks(https://arxiv.org/abs/2508.00855)
Keywords: generative
Abstract: Nonlinear partial differential equations (PDEs) are pivotal in modeling complex physical systems, yet traditional Physics-Informed Neural Networks (PINNs) often struggle with unresolved residuals in critical spatiotemporal regions and violations of temporal causality. To address these limitations, we propose a novel Residual Guided Training strategy for Physics-Informed Transformer via Generative Adversarial Networks (GAN). Our framework integrates a decoder-only Transformer to inherently capture temporal correlations through autoregressive processing, coupled with a residual-aware GAN that dynamically identifies and prioritizes high-residual regions. By introducing a causal penalty term and an adaptive sampling mechanism, the method enforces temporal causality while refining accuracy in problematic domains. Extensive numerical experiments on the Allen-Cahn, Klein-Gordon, and Navier-Stokes equations demonstrate significant improvements, achieving relative MSE reductions of up to three orders of magnitude compared to baseline methods. This work bridges the gap between deep learning and physics-driven modeling, offering a robust solution for multiscale and time-dependent PDE systems.

Title: Phase-fraction guided denoising diffusion model for augmenting multiphase steel microstructure segmentation via micrograph image-mask pair synthesis

Authors: Hoang Hai Nam Nguyen, Minh Tien Tran, Hoheok Kim, Ho Won Lee
Subjects: cs.CV, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2508.00896
Pdf URL: https://arxiv.org/pdf/2508.00896
Copy Paste: [[2508.00896]] Phase-fraction guided denoising diffusion model for augmenting multiphase steel microstructure segmentation via micrograph image-mask pair synthesis(https://arxiv.org/abs/2508.00896)
Keywords: generation, generative
Abstract: The effectiveness of machine learning in metallographic microstructure segmentation is often constrained by the lack of human-annotated phase masks, particularly for rare or compositionally complex morphologies within the metal alloy. We introduce PF-DiffSeg, a phase-fraction controlled, one-stage denoising diffusion framework that jointly synthesizes microstructure images and their corresponding segmentation masks in a single generative trajectory to further improve segmentation accuracy. By conditioning on global phase-fraction vectors, augmented to represent the real data distribution and emphasize minority classes, our model generates compositionally valid and structurally coherent microstructure image and mask samples that improve both data diversity and training efficiency. Evaluated on the MetalDAM benchmark for additively manufactured multiphase steel, our synthetic augmentation method yields notable improvements in segmentation accuracy compared to standard augmentation strategies, especially in minority classes, and further outperforms two-stage mask-guided diffusion and generative adversarial network (GAN) baselines, while also reducing inference time compared to the conventional approach. The method integrates generation and conditioning into a unified framework, offering a scalable solution for data augmentation in metallographic applications.
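The abstract does not say how a global phase-fraction vector is computed. A minimal sketch, assuming it is the per-phase share of pixels in an integer-labelled segmentation mask (the function name `phase_fractions` and the toy mask are illustrative, not taken from the paper):

```python
from collections import Counter

def phase_fractions(mask, num_phases):
    """Return the fraction of pixels belonging to each phase label 0..num_phases-1.

    `mask` is a 2D list of integer phase labels (an assumed encoding).
    """
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return [counts.get(phase, 0) / total for phase in range(num_phases)]

# Toy 2x3 mask with three phases; phase 2 is the minority class.
toy_mask = [
    [0, 0, 1],
    [2, 1, 0],
]
fractions = phase_fractions(toy_mask, num_phases=3)
print(fractions)
```

Under this reading, a vector like the one above would be the conditioning signal, and raising the entry for a minority phase would be one way to request more samples of that phase.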

Title: Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models

Authors: Jiazhen Pan, Bailiang Jian, Paul Hager, Yundi Zhang, Che Liu, Friedrike Jungmann, Hongwei Bran Li, Chenyu You, Junde Wu, Jiayuan Zhu, Fenglin Liu, Yuyuan Liu, Niklas Bubeck, Christian Wachinger, Chen (Cherise) Chen, Zhenyu Gong, Cheng Ouyang, Georgios Kaissis, Benedikt Wiestler, Daniel Rueckert
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.00923
Pdf URL: https://arxiv.org/pdf/2508.00923
Copy Paste: [[2508.00923]] Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models(https://arxiv.org/abs/2508.00923)
Keywords: generation
Abstract: Ensuring the safety and reliability of large language models (LLMs) in clinical practice is critical to prevent patient harm and promote trustworthy healthcare applications of AI. However, LLMs are advancing so rapidly that static safety benchmarks often become obsolete upon publication, yielding only an incomplete and sometimes misleading picture of model trustworthiness. We demonstrate that a Dynamic, Automatic, and Systematic (DAS) red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses of current LLMs across four safety-critical domains: robustness, privacy, bias/fairness, and hallucination. A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses, uncovering vulnerabilities in real time without human intervention. Applying DAS to 15 proprietary and open-source LLMs revealed a stark contrast between static benchmark performance and vulnerability under adversarial pressure. Despite a median MedQA accuracy exceeding 80%, 94% of previously correct answers failed our dynamic robustness tests. We observed similarly high failure rates across other domains: privacy leaks were elicited in 86% of scenarios, cognitive-bias priming altered clinical recommendations in 81% of fairness tests, and we identified hallucination rates exceeding 66% in widely used models. Such profound residual risks are incompatible with routine clinical practice. By converting red-teaming from a static checklist into a dynamic stress-test audit, DAS red-teaming offers the surveillance that hospitals/regulators/technology vendors require as LLMs become embedded in patient chatbots, decision-support dashboards, and broader healthcare workflows. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.

Title: From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Authors: Yeong-Joon Ju, Seong-Whan Lee
Subjects: cs.LG, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2508.00955
Pdf URL: https://arxiv.org/pdf/2508.00955
Copy Paste: [[2508.00955]] From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model(https://arxiv.org/abs/2508.00955)
Keywords: generative
Abstract: Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, yet adapting their generative nature for discriminative representation learning remains a significant challenge. The dominant paradigm of large-scale contrastive pre-training suffers from critical inefficiencies, including prohibitive computational costs and a failure to leverage the intrinsic, instruction-following capabilities of MLLMs. To overcome these limitations, we propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components. First, our hierarchical embedding prompt template employs a two-level instruction architecture that forces the model to produce discriminative representations. Building on this strong foundation, our second component, self-aware hard negative sampling, redefines the fine-tuning process by leveraging the model's own understanding to efficiently mine challenging negatives while actively filtering out potential false negatives. Our comprehensive experiments show that our hierarchical prompt achieves zero-shot performance competitive with contrastively trained baselines and enhances the fine-tuning process by lifting a simple in-batch negative baseline by 4.8 points on the MMEB benchmark. We further boost the performance via our self-aware hard negative sampling, achieving state-of-the-art performance without contrastive pre-training. Our work presents an effective and efficient pathway to adapt MLLMs for universal embedding tasks, significantly reducing training time.

Title: VAULT: Vigilant Adversarial Updates via LLM-Driven Retrieval-Augmented Generation for NLI

Authors: Roie Kazoom, Ofir Cohen, Rami Puzis, Asaf Shabtai, Ofer Hadar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00965
Pdf URL: https://arxiv.org/pdf/2508.00965
Copy Paste: [[2508.00965]] VAULT: Vigilant Adversarial Updates via LLM-Driven Retrieval-Augmented Generation for NLI(https://arxiv.org/abs/2508.00965)
Keywords: generation
Abstract: We introduce VAULT, a fully automated adversarial RAG pipeline that systematically uncovers and remedies weaknesses in NLI models through three stages: retrieval, adversarial generation, and iterative retraining. First, we perform balanced few-shot retrieval by embedding premises with both semantic (BGE) and lexical (BM25) similarity. Next, we assemble these contexts into LLM prompts to generate adversarial hypotheses, which are then validated by an LLM ensemble for label fidelity. Finally, the validated adversarial examples are injected back into the training set at increasing mixing ratios, progressively fortifying a zero-shot RoBERTa-base model. On standard benchmarks, VAULT elevates RoBERTa-base accuracy from 88.48% to 92.60% on SNLI (+4.12%), from 75.04% to 80.95% on ANLI (+5.91%), and from 54.67% to 71.99% on MultiNLI (+17.32%). It also consistently outperforms prior in-context adversarial methods by up to 2.0% across datasets. By automating high-quality adversarial data curation at scale, VAULT enables rapid, human-independent robustness improvements in NLI inference tasks.
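The three gains quoted in the abstract can be checked as simple differences, which also shows that they are absolute gains in percentage points rather than relative improvements. A small sketch using only the before/after accuracies stated above (the helper name `absolute_gain` is illustrative):

```python
# Accuracy figures in percent, copied from the abstract as (before, after).
reported = {
    "SNLI": (88.48, 92.60),
    "ANLI": (75.04, 80.95),
    "MultiNLI": (54.67, 71.99),
}

def absolute_gain(before, after):
    """Absolute accuracy gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)

gains = {name: absolute_gain(before, after) for name, (before, after) in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

The differences agree with the +4.12, +5.91, and +17.32 figures given in the abstract.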

Title: Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles

Authors: Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.00969
Pdf URL: https://arxiv.org/pdf/2508.00969
Copy Paste: [[2508.00969]] Masked Omics Modeling for Multimodal Representation Learning across Histopathology and Molecular Profiles(https://arxiv.org/abs/2508.00969)
Keywords: generation
Abstract: Self-supervised learning has driven major advances in computational pathology by enabling models to learn rich representations from hematoxylin and eosin (H&E)-stained cancer tissue. However, histopathology alone often falls short for molecular characterization and understanding clinical outcomes, as important information is contained in high-dimensional omics profiles like transcriptomics, methylomics, or genomics. In this work, we introduce MORPHEUS, a unified transformer-based pre-training framework that encodes both histopathology and multi-omics data into a shared latent space. At its core, MORPHEUS relies on a masked modeling objective applied to randomly selected omics portions, encouraging the model to learn biologically meaningful cross-modal relationships. The same pre-trained network can be applied to histopathology alone or in combination with any subset of omics modalities, seamlessly adapting to the available inputs. Additionally, MORPHEUS enables any-to-any omics generation, enabling one or more omics profiles to be inferred from any subset of modalities, including H&E alone. Pre-trained on a large pan-cancer cohort, MORPHEUS consistently outperforms state-of-the-art methods across diverse modality combinations and tasks, positioning itself as a promising framework for developing multimodal foundation models in oncology. The code is available at: this https URL

Title: ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation

Authors: Cihang Peng, Qiming Hou, Zhong Ren, Kun Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01008
Pdf URL: https://arxiv.org/pdf/2508.01008
Copy Paste: [[2508.01008]] ROVI: A VLM-LLM Re-Captioned Dataset for Open-Vocabulary Instance-Grounded Text-to-Image Generation(https://arxiv.org/abs/2508.01008)
Keywords: generation
Abstract: We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a text-to-image model GLIGEN trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. Our dataset and reproducible pipeline are available at this https URL.

Title: v-PuNNs: van der Put Neural Networks for Transparent Ultrametric Representation Learning

Authors: Gnankan Landry Regis N'guessan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01010
Pdf URL: https://arxiv.org/pdf/2508.01010
Copy Paste: [[2508.01010]] v-PuNNs: van der Put Neural Networks for Transparent Ultrametric Representation Learning(https://arxiv.org/abs/2508.01010)
Keywords: generative
Abstract: Conventional deep learning models embed data in Euclidean space $\mathbb{R}^d$, a poor fit for strictly hierarchical objects such as taxa, word senses, or file systems. We introduce van der Put Neural Networks (v-PuNNs), the first architecture whose neurons are characteristic functions of p-adic balls in $\mathbb{Z}_p$. Under our Transparent Ultrametric Representation Learning (TURL) principle every weight is itself a p-adic number, giving exact subtree semantics. A new Finite Hierarchical Approximation Theorem shows that a depth-K v-PuNN with $\sum_{j=0}^{K-1}p^{\,j}$ neurons universally represents any K-level tree. Because gradients vanish in this discrete space, we propose Valuation-Adaptive Perturbation Optimization (VAPO), with a fast deterministic variant (HiPaN-DS) and a moment-based one (HiPaN / Adam-VAPO). On three canonical benchmarks our CPU-only implementation sets new state-of-the-art: WordNet nouns (52,427 leaves) 99.96% leaf accuracy in 16 min; GO molecular-function 96.9% leaf / 100% root in 50 s; NCBI Mammalia Spearman $\rho = -0.96$ with true taxonomic distance. The learned metric is perfectly ultrametric (zero triangle violations), and its fractal and information-theoretic properties are analyzed. Beyond classification we derive structural invariants for quantum systems (HiPaQ) and controllable generative codes for tabular data (Tab-HiPaN). v-PuNNs therefore bridge number theory and deep learning, offering exact, interpretable, and efficient models for hierarchical data.
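To make the number-theoretic terms concrete, here is a minimal sketch restricted to ordinary integers. It computes the neuron count $\sum_{j=0}^{K-1} p^j$ from the abstract, the standard p-adic distance, and the characteristic function of a p-adic ball. The function names are illustrative and this is not the authors' implementation.

```python
def neuron_count(p, depth):
    """Number of neurons sum_{j=0}^{K-1} p^j quoted for a depth-K network."""
    return sum(p ** j for j in range(depth))

def padic_valuation(n, p):
    """Largest v such that p**v divides n (infinite for n == 0)."""
    if n == 0:
        return float("inf")
    v = 0
    while n % p == 0:
        n //= p
        v += 1
    return v

def padic_distance(x, y, p):
    """Standard p-adic distance |x - y|_p = p ** (-v_p(x - y))."""
    if x == y:
        return 0.0
    return float(p) ** (-padic_valuation(x - y, p))

def ball_indicator(x, center, k, p):
    """Characteristic function of the ball {x : |x - center|_p <= p**(-k)}."""
    return 1 if (x - center) % (p ** k) == 0 else 0

print(neuron_count(2, 3))        # 1 + 2 + 4 nodes of a binary tree with 3 levels
print(padic_distance(1, 9, 2))   # 8 = 2**3 divides 9 - 1
print(ball_indicator(9, 1, 3, 2), ball_indicator(5, 1, 3, 2))
```

Any two such balls are either nested or disjoint, which is why they line up with the subtrees of a hierarchy. The strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)) is the "zero triangle violations" property the abstract refers to.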

Title: Structured Spectral Graph Learning for Anomaly Classification in 3D Chest CT Scans

Authors: Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2508.01045
Pdf URL: https://arxiv.org/pdf/2508.01045
Copy Paste: [[2508.01045]] Structured Spectral Graph Learning for Anomaly Classification in 3D Chest CT Scans(https://arxiv.org/abs/2508.01045)
Keywords: generation
Abstract: With the increasing number of CT scan examinations, there is a need for automated methods such as organ segmentation, anomaly detection and report generation to assist radiologists in managing their increasing workload. Multi-label classification of 3D CT scans remains a critical yet challenging task due to the complex spatial relationships within volumetric data and the variety of observed anomalies. Existing approaches based on 3D convolutional networks have limited abilities to model long-range dependencies while Vision Transformers suffer from high computational costs and often require extensive pre-training on large-scale datasets from the same domain to achieve competitive performance. In this work, we propose an alternative by introducing a new graph-based approach that models CT scans as structured graphs, leveraging axial slice triplet nodes processed through spectral domain convolution to enhance multi-label anomaly classification performance. Our method exhibits strong cross-dataset generalization and competitive performance while achieving robustness to z-axis translation. An ablation study evaluates the contribution of each proposed component.

Title: Flow Matching for Probabilistic Learning of Dynamical Systems from Missing or Noisy Data

Authors: Siddharth Rout, Eldad Haber, Stephane Gaudreault
Subjects: cs.LG, math.DS, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2508.01101
Pdf URL: https://arxiv.org/pdf/2508.01101
Copy Paste: [[2508.01101]] Flow Matching for Probabilistic Learning of Dynamical Systems from Missing or Noisy Data(https://arxiv.org/abs/2508.01101)
Keywords: generative
Abstract: Learning dynamical systems is crucial across many fields, yet applying machine learning techniques remains challenging due to missing variables and noisy data. Classical mathematical models often struggle in these scenarios due to the resulting ill-posedness of the physical systems. Stochastic machine learning techniques address this challenge by enabling the modeling of such ill-posed problems. Thus, a single known input to the trained machine learning model may yield multiple plausible outputs, and all of the outputs are correct. In such scenarios, probabilistic forecasting is inherently meaningful. In this study, we introduce a variant of flow matching for probabilistic forecasting which estimates possible future states as a distribution over possible outcomes rather than a single-point prediction. Perturbation of complex dynamical states is not trivial. The community typically applies Gaussian or uniform perturbations to crucial variables to model uncertainty. However, not all variables behave in a Gaussian fashion. So, we also propose a generative machine learning approach to physically and logically perturb the states of complex high-dimensional dynamical systems. Finally, we establish the mathematical foundations of our method and demonstrate its effectiveness on several challenging dynamical systems, including a variant of the high-dimensional WeatherBench dataset, which models the global weather at a 5.625° meridional resolution.
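The abstract does not specify the authors' variant of flow matching. As background only, here is a minimal sketch of the common linear-path formulation: a point on the straight line between a source sample `x0` and a target sample `x1` is paired with the constant velocity `x1 - x0`, and a model is trained to regress that velocity. All names are illustrative assumptions.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for the linear case."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(x0, x1, t)
    target = target_velocity(x0, x1)
    predicted = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

# Toy check: a predictor that already knows the true velocity has zero loss.
x0, x1 = [0.0, 0.0], [2.0, 4.0]
oracle = lambda x_t, t: [2.0, 4.0]
print(interpolate(x0, x1, 0.5))
print(flow_matching_loss(oracle, x0, x1, 0.5))
```

In a forecasting setting, `x1` would be a future state and sampling repeatedly from the source distribution would give an ensemble of plausible futures. How the paper chooses and perturbs the source states is not covered by this sketch.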

Title: A hierarchy tree data structure for behavior-based user segment representation

Authors: Yang Liu, Xuejiao Kang, Sathya Iyer, Idris Malik, Ruixuan Li, Juan Wang, Xinchen Lu, Xiangxue Zhao, Dayong Wang, Menghan Liu, Isaac Liu, Feng Liang, Yinzhe Yu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.01115
Pdf URL: https://arxiv.org/pdf/2508.01115
Copy Paste: [[2508.01115]] A hierarchy tree data structure for behavior-based user segment representation(https://arxiv.org/abs/2508.01115)
Keywords: generation
Abstract: User attributes are essential in multiple stages of modern recommendation systems and are particularly important for mitigating the cold-start problem and improving the experience of new or infrequent users. We propose Behavior-based User Segmentation (BUS), a novel tree-based data structure that hierarchically segments the user universe with various users' categorical attributes based on the users' product-specific engagement behaviors. During the BUS tree construction, we use Normalized Discounted Cumulative Gain (NDCG) as the objective function to maximize the behavioral representativeness of marginal users relative to active users in the same segment. The constructed BUS tree undergoes further processing and aggregation across the leaf nodes and internal nodes, allowing the generation of popular social content and behavioral patterns for each node in the tree. To further mitigate bias and improve fairness, we use the social graph to derive the user's connection-based BUS segments, enabling the combination of behavioral patterns extracted from both the user's own segment and connection-based segments as the connection-aware BUS-based recommendation. Our offline analysis shows that the BUS-based retrieval significantly outperforms traditional user cohort-based aggregation on ranking quality. We have successfully deployed our data structure and machine learning algorithm and tested it with various production traffic serving billions of users daily, achieving statistically significant improvements in the online product metrics, including music ranking and email notifications. To the best of our knowledge, our study represents the first list-wise learning-to-rank framework for tree-based recommendation that effectively integrates diverse user categorical attributes while preserving real-world semantic interpretability at a large industrial scale.
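For readers unfamiliar with the objective named above, this is a minimal sketch of NDCG in its common form with linear gain and a log2 rank discount. The abstract does not state which gain or discount variant the authors use, so the exact formula here is an assumption.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance of items in the order a candidate segment would rank them.
print(ndcg([3, 2, 1]))  # already in ideal order
print(ndcg([1, 2, 3]))  # reversed order scores lower
```

As a tree-splitting objective, one plausible reading is that a split is preferred when the item ranking derived from a segment's active users scores a high NDCG against what its marginal users actually engaged with.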

Title: UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Authors: Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi, Kazuki Kozuka, Juan Carlos Niebles, Ehsan Adeli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01126
Pdf URL: https://arxiv.org/pdf/2508.01126
Copy Paste: [[2508.01126]] UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation(https://arxiv.org/abs/2508.01126)
Keywords: generation
Abstract: Egocentric human motion generation and forecasting with scene-context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.

Title: Transformers in Pseudo-Random Number Generation: A Dual Perspective on Theory and Practice

Authors: Ran Li, Lingshu Zeng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.01134
Pdf URL: https://arxiv.org/pdf/2508.01134
Copy Paste: [[2508.01134]] Transformers in Pseudo-Random Number Generation: A Dual Perspective on Theory and Practice(https://arxiv.org/abs/2508.01134)
Keywords: generation
Abstract: Pseudo-random number generators (PRNGs) are highly nonlinear processes, and they are key building blocks in the optimization of large language models. Transformers excel at processing complex nonlinear relationships. Thus it is reasonable to generate high-quality pseudo-random numbers based on Transformers. In this paper, we explore this question from both theoretical and practical perspectives, highlighting the potential benefits and implications of Transformers in PRNGs. We theoretically demonstrate that decoder-only Transformer models with Chain-of-Thought can simulate both the Linear Congruential Generator (LCG) and Mersenne Twister (MT) PRNGs. Based on this, we conclude that the log-precision decoder-only Transformer can represent non-uniform $\text{AC}^0$. Our simulative theoretical findings are validated through experiments. The random numbers generated by Transformer-based PRNGs successfully pass the majority of NIST tests, whose heat maps exhibit clear statistical randomness. Finally, we assess their capability in prediction attacks.
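For reference, the Linear Congruential Generator mentioned above follows the recurrence x_{n+1} = (a * x_n + c) mod m. Below is a minimal sketch of that recurrence, which is the computation a Transformer would need to reproduce step by step. The constants are the widely published Numerical Recipes parameters and are an assumption here, since the abstract does not say which parameters the paper uses. The bit-balance helper is a crude illustration, not a NIST test.

```python
from itertools import islice

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield successive LCG states x_{n+1} = (a * x_n + c) mod m."""
    state = seed % m
    while True:
        state = (a * state + c) % m
        yield state

def ones_fraction(values, bits=32):
    """Fraction of 1-bits across the values (a rough monobit-style sanity check)."""
    ones = sum(bin(v).count("1") for v in values)
    return ones / (bits * len(values))

sample = list(islice(lcg(seed=0), 1000))
print(sample[:3])
print(ones_fraction(sample))
```

Each step is one multiplication, one addition, and one modular reduction on a fixed-width integer. The Mersenne Twister has a much larger state and more involved bit operations.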
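
For readers unfamiliar with the Linear Congruential Generator named in the abstract above, the recurrence it refers to is x_{n+1} = (a * x_n + c) mod m. The following is a minimal sketch of that recurrence only; it does not reproduce the paper's Transformer construction. The function name `lcg` and the default constants (the widely used Numerical Recipes values) are assumptions chosen for illustration.

```python
from itertools import islice
from typing import Iterator


def lcg(seed: int, a: int = 1664525, c: int = 1013904223, m: int = 2**32) -> Iterator[int]:
    """Yield the sequence x_{n+1} = (a * x_n + c) mod m.

    The default constants are the common Numerical Recipes values.
    They are an assumption for illustration, not taken from the paper.
    """
    state = seed % m
    while True:
        # One LCG step: multiply, add the increment, reduce modulo m.
        state = (a * state + c) % m
        yield state


# Draw the first three values from a generator seeded with 0.
first_values = list(islice(lcg(seed=0), 3))
print(first_values)
```

Because the next state depends only on the current state, the whole sequence is determined by the seed, which is what makes such generators a natural target for the prediction attacks the abstract mentions.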

Title: Personalized Safety Alignment for Text-to-Image Diffusion Models

Authors: Yu Lei, Jinbin Bai, Qingyu Shi, Aosong Feng, Kaidong Yu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01151
Pdf URL: https://arxiv.org/pdf/2508.01151
Copy Paste: [[2508.01151]] Personalized Safety Alignment for Text-to-Image Diffusion Models(https://arxiv.org/abs/2508.01151)
Keywords: generation, generative
Abstract: Text-to-image diffusion models have revolutionized visual content generation, but current safety mechanisms apply uniform standards that often fail to account for individual user preferences. These models overlook the diverse safety boundaries shaped by factors like age, mental health, and personal beliefs. To address this, we propose Personalized Safety Alignment (PSA), a framework that allows user-specific control over safety behaviors in generative models. PSA integrates personalized user profiles into the diffusion process, adjusting the model's behavior to match individual safety preferences while preserving image quality. We introduce a new dataset, Sage, which captures user-specific safety preferences and incorporates these profiles through a cross-attention mechanism. Experiments show that PSA outperforms existing methods in harmful content suppression and aligns generated content better with user constraints, achieving higher Win Rate and Pass Rate scores. Our code, data, and models are publicly available at this https URL.

Title: LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

Authors: Xinyu Yan, Meijun Sun, Ge-Peng Ji, Fahad Shahbaz Khan, Salman Khan, Deng-Ping Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01152
Pdf URL: https://arxiv.org/pdf/2508.01152
Copy Paste: [[2508.01152]] LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation(https://arxiv.org/abs/2508.01152)
Keywords: generation
Abstract: We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_\beta^\omega$ gains of 4.6\% with both the LS and WR strategies and 3.6\% gains with only the LS strategy on DIS-TE. Codes will be made available at this https URL.

Title: RSPO: Risk-Seeking Policy Optimization for Pass@k and Max@k Metrics in Large Language Models

Authors: Kaichen Zhang, Shenghao Gao, Yuzhong Hong, Haipeng Sun, Junwei Bao, Hongfei Jiang, Yang Song, Hong Dingqian, Hui Xiong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01174
Pdf URL: https://arxiv.org/pdf/2508.01174
Copy Paste: [[2508.01174]] RSPO: Risk-Seeking Policy Optimization for Pass@k and Max@k Metrics in Large Language Models(https://arxiv.org/abs/2508.01174)
Keywords: generation
Abstract: Current large language model post-training optimizes a risk-neutral objective that maximizes expected reward, yet evaluation relies heavily on risk-seeking metrics like Pass@k (at least one success in k trials) and Max@k (maximum reward across k responses). This mismatch in risk preferences can lead to suboptimal performance. To bridge this gap, we propose Risk-Seeking Policy Optimization (RSPO), a novel method that directly targets Pass@k and Max@k during training. A key challenge in optimizing these metrics is the "hitchhiking" problem: low-reward responses are inadvertently reinforced if they co-occur with a high-reward response within a sample of k generations, resulting in inefficient optimization. RSPO addresses this problem by leveraging the closed-form probability that a given response is the maximum among k samples. Despite the complexity of nested gradients over multiple responses, RSPO produces efficient, unbiased gradient estimators for both metrics. We validate our approach with both rigorous theoretical analysis and comprehensive experimental results.
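
The two metrics defined in the abstract above, Pass@k and Max@k, can be estimated from n sampled responses with standard combinatorial identities. The sketch below shows those generic estimators only; it is not RSPO's gradient estimator, and the function names `pass_at_k` and `max_at_k` are hypothetical. It assumes the k responses are drawn without replacement from the n available samples.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a random k-subset of n samples contains at least
    one of the c correct samples: 1 - C(n - c, k) / C(n, k)."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def max_at_k(rewards: Sequence[float], k: int) -> float:
    """Expected maximum reward over a random k-subset of the rewards.

    After sorting in ascending order, the element at 0-based index i is the
    maximum of exactly C(i, k - 1) subsets (choose the other k - 1 members
    from the i elements ranked below it).
    """
    n = len(rewards)
    if not 0 < k <= n:
        raise ValueError("require 0 < k <= len(rewards)")
    ordered = sorted(rewards)
    weighted = sum(r * comb(i, k - 1) for i, r in enumerate(ordered))
    return weighted / comb(n, k)


# With binary rewards the two metrics coincide.
print(pass_at_k(n=3, c=1, k=2), max_at_k([0.0, 0.0, 1.0], k=2))
```

The weight C(i, k - 1) / C(n, k) is the closed-form probability that a given response is the maximum of the subset, which is the kind of quantity the abstract says RSPO builds on.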

Title: BSL: A Unified and Generalizable Multitask Learning Platform for Virtual Drug Discovery from Design to Synthesis

Authors: Kun Li, Zhennan Wu, Yida Xiong, Hongzhi Zhang, Longtao Hu, Zhonglie Liu, Junqi Zeng, Wenjie Wu, Mukun Chen, Jiameng Chen, Wenbin Hu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01195
Pdf URL: https://arxiv.org/pdf/2508.01195
Copy Paste: [[2508.01195]] BSL: A Unified and Generalizable Multitask Learning Platform for Virtual Drug Discovery from Design to Synthesis(https://arxiv.org/abs/2508.01195)
Keywords: generative
Abstract: Drug discovery is of great social significance in safeguarding human health, prolonging life, and addressing the challenges of major diseases. In recent years, artificial intelligence has demonstrated remarkable advantages in key tasks across bioinformatics and pharmacology, owing to its efficient data processing and data representation capabilities. However, most existing computational platforms cover only a subset of core tasks, leading to fragmented workflows and low efficiency. In addition, they often lack algorithmic innovation and show poor generalization to out-of-distribution (OOD) data, which greatly hinders the progress of drug discovery. To address these limitations, we propose Baishenglai (BSL), a deep learning-enhanced, open-access platform designed for virtual drug discovery. BSL integrates seven core tasks within a unified and modular framework, incorporating advanced technologies such as generative models and graph neural networks. In addition to achieving state-of-the-art (SOTA) performance on multiple benchmark datasets, the platform emphasizes evaluation mechanisms that focus on generalization to OOD molecular structures. Comparative experiments with existing platforms and baseline methods demonstrate that BSL provides a comprehensive, scalable, and effective solution for virtual drug discovery, offering both algorithmic innovation and high-precision prediction for real-world pharmaceutical research. In addition, BSL demonstrated its practical utility by discovering novel modulators of the GluN1/GluN3A NMDA receptor, successfully identifying three compounds with clear bioactivity in in-vitro electrophysiological assays. These results highlight BSL as a promising and comprehensive platform for accelerating biomedical research and drug discovery. The platform is accessible at this https URL.

Title: StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling

Authors: Yuanlin Yang, Quanjian Song, Zhexian Gao, Ge Wang, Shanshan Li, Xiaoyan Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01215
Pdf URL: https://arxiv.org/pdf/2508.01215
Copy Paste: [[2508.01215]] StyDeco: Unsupervised Style Transfer with Distilling Priors and Semantic Decoupling(https://arxiv.org/abs/2508.01215)
Keywords: generative
Abstract: Diffusion models have emerged as the dominant paradigm for style transfer, but their text-driven mechanism is hindered by a core limitation: it treats textual descriptions as uniform, monolithic guidance. This limitation overlooks the semantic gap between the non-spatial nature of textual descriptions and the spatially-aware attributes of visual style, often leading to the loss of semantic structure and fine-grained details during stylization. In this paper, we propose StyDeco, an unsupervised framework that resolves this limitation by learning text representations specifically tailored for the style transfer task. Our framework first employs Prior-Guided Data Distillation (PGD), a strategy designed to distill stylistic knowledge without human supervision. It leverages a powerful frozen generative model to automatically synthesize pseudo-paired data. Subsequently, we introduce Contrastive Semantic Decoupling (CSD), a task-specific objective that adapts a text encoder using domain-specific weights. CSD performs a two-class clustering in the semantic space, encouraging source and target representations to form distinct clusters. Extensive experiments on three classic benchmarks demonstrate that our framework outperforms several existing approaches in both stylistic fidelity and structural preservation, highlighting its effectiveness in style transfer with semantic preservation. In addition, our framework supports a unique de-stylization process, further demonstrating its extensibility. Our code is available at this https URL.

Title: NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

Authors: Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, Zhangjie Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01248
Pdf URL: https://arxiv.org/pdf/2508.01248
Copy Paste: [[2508.01248]] NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2508.01248)
Keywords: generation, generative
Abstract: The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP's visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP's visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4\% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.
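
The NULL-Space projection mentioned in the abstract above can be illustrated generically: given a set of directions that span a "semantic" subspace, a feature is projected onto the orthogonal complement of that subspace, so the result has no component along any of those directions. The sketch below is a plain Gram-Schmidt version of that idea. How NS-Net actually obtains its semantic directions is not stated in the abstract, so the toy vectors and the names `orthonormal_basis` and `project_to_null_space` are assumptions.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def orthonormal_basis(directions: List[Vector], tol: float = 1e-10) -> List[Vector]:
    """Gram-Schmidt: orthonormalize the directions, dropping dependent ones."""
    basis: List[Vector] = []
    for d in directions:
        r = list(d)
        for b in basis:
            coeff = dot(r, b)
            r = [ri - coeff * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:
            basis.append([ri / norm for ri in r])
    return basis


def project_to_null_space(feature: Vector, semantic_directions: List[Vector]) -> Vector:
    """Remove from `feature` every component lying in the span of the
    semantic directions, leaving the part in their null space."""
    out = list(feature)
    for b in orthonormal_basis(semantic_directions):
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


# Toy example (hypothetical data): the semantic subspace is the x-y plane.
semantic = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
feature = [3.0, -2.0, 5.0]
residual = project_to_null_space(feature, semantic)
print(residual)
```

In the toy example only the third coordinate survives, because the first two coordinates lie entirely inside the semantic subspace.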

Title: SpatioTemporal Difference Network for Video Depth Super-Resolution

Authors: Zhengxue Wang, Yuan Wu, Xiang Li, Zhiqiang Yan, Jian Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01259
Pdf URL: https://arxiv.org/pdf/2508.01259
Copy Paste: [[2508.01259]] SpatioTemporal Difference Network for Video Depth Super-Resolution(https://arxiv.org/abs/2508.01259)
Keywords: super-resolution
Abstract: Depth super-resolution has achieved impressive performance, and the incorporation of multi-frame information further enhances reconstruction quality. Nevertheless, statistical analyses reveal that video depth super-resolution remains affected by pronounced long-tailed distributions, with the long-tailed effects primarily manifesting in spatial non-smooth regions and temporal variation zones. To address these challenges, we propose a novel SpatioTemporal Difference Network (STDNet) comprising two core branches: a spatial difference branch and a temporal difference branch. In the spatial difference branch, we introduce a spatial difference mechanism to mitigate the long-tailed issues in spatial non-smooth regions. This mechanism dynamically aligns RGB features with learned spatial difference representations, enabling intra-frame RGB-D aggregation for depth calibration. In the temporal difference branch, we further design a temporal difference strategy that preferentially propagates temporal variation information from adjacent RGB and depth frames to the current depth frame, leveraging temporal difference representations to achieve precise motion compensation in temporal long-tailed areas. Extensive experimental results across multiple datasets demonstrate the effectiveness of our STDNet, outperforming existing approaches.

Title: Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling

Authors: Lexiao Zou, Gongwei Chen, Yanda Chen, Miao Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01264
Pdf URL: https://arxiv.org/pdf/2508.01264
Copy Paste: [[2508.01264]] Enhancing Diffusion-based Dataset Distillation via Adversary-Guided Curriculum Sampling(https://arxiv.org/abs/2508.01264)
Keywords: generative
Abstract: Dataset distillation aims to encapsulate the rich information contained in a dataset into a compact distilled dataset, but it faces performance degradation as the images-per-class (IPC) setting or image resolution grows. Recent advances demonstrate that integrating diffusion generative models can effectively facilitate the compression of large-scale datasets while maintaining efficiency, owing to their superiority in matching data distributions and summarizing representative patterns. However, images sampled from diffusion models are often criticized for a lack of diversity, which may lead to information redundancy when multiple independently sampled images are aggregated into a distilled dataset. To address this issue, we propose Adversary-guided Curriculum Sampling (ACS), which partitions the distilled dataset into multiple curricula. To generate each curriculum, ACS guides the diffusion sampling process with an adversarial loss that challenges a discriminator trained on sampled images, thus mitigating information overlap between curricula and fostering a more diverse distilled dataset. Additionally, as the discriminator evolves with the progression of curricula, ACS generates images from simpler to more complex, ensuring efficient and systematic coverage of the target data's informational spectrum. Extensive experiments demonstrate the effectiveness of ACS, which achieves substantial improvements of 4.1\% on Imagewoof and 2.1\% on ImageNet-1k over the state of the art.

Title: PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation

Authors: Zonglei Jing, Xiao Yang, Xiaoqian Li, Siyuan Liang, Aishan Liu, Mingchuan Zhang, Xianglong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01272
Pdf URL: https://arxiv.org/pdf/2508.01272
Copy Paste: [[2508.01272]] PromptSafe: Gated Prompt Tuning for Safe Text-to-Image Generation(https://arxiv.org/abs/2508.01272)
Keywords: generation, generative
Abstract: Text-to-image (T2I) models have demonstrated remarkable generative capabilities but remain vulnerable to producing not-safe-for-work (NSFW) content, such as violent or explicit imagery. While recent moderation efforts have introduced soft prompt-guided tuning by appending defensive tokens to the input, these approaches often rely on large-scale curated image-text datasets and apply static, one-size-fits-all defenses at inference time. However, this results not only in high computational cost and degraded benign image quality, but also in limited adaptability to the diverse and nuanced safety requirements of real-world prompts. To address these challenges, we propose PromptSafe, a gated prompt tuning framework that combines a lightweight, text-only supervised soft embedding with an inference-time gated control network. Instead of training on expensive image-text datasets, we first rewrite unsafe prompts into semantically aligned but safe alternatives using an LLM, constructing an efficient text-only training corpus. Based on this, we optimize a universal soft prompt that repels unsafe and attracts safe embeddings during the diffusion denoising process. To avoid over-suppressing benign prompts, we introduce a gated mechanism that adaptively adjusts the defensive strength based on estimated prompt toxicity, thereby aligning defense intensity with prompt risk and ensuring strong protection for harmful inputs while preserving benign generation quality. Extensive experiments across multiple benchmarks and T2I models show that PromptSafe achieves a SOTA unsafe generation rate (2.36%), while preserving high benign fidelity. Furthermore, PromptSafe demonstrates strong generalization to unseen harmful categories, robust transferability across diffusion model architectures, and resilience under adaptive adversarial attacks, highlighting its practical value for safe and scalable deployment.

Title: GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification

Authors: Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01293
Pdf URL: https://arxiv.org/pdf/2508.01293
Copy Paste: [[2508.01293]] GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification(https://arxiv.org/abs/2508.01293)
Keywords: generation
Abstract: Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.

Title: Zero-shot Segmentation of Skin Conditions: Erythema with Edit-Friendly Inversion

Authors: Konstantinos Moutselos, Ilias Maglogiannis
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01334
Pdf URL: https://arxiv.org/pdf/2508.01334
Copy Paste: [[2508.01334]] Zero-shot Segmentation of Skin Conditions: Erythema with Edit-Friendly Inversion(https://arxiv.org/abs/2508.01334)
Keywords: generative
Abstract: This study proposes a zero-shot image segmentation framework for detecting erythema (redness of the skin) using edit-friendly inversion in diffusion models. The method synthesizes reference images of the same patient that are free from erythema via generative editing and then accurately aligns these references with the original images. Color-space analysis is performed with minimal user intervention to identify erythematous regions. This approach significantly reduces the reliance on labeled dermatological datasets while providing a scalable and flexible diagnostic support tool by avoiding the need for any annotated training masks. In our initial qualitative experiments, the pipeline successfully isolated facial erythema in diverse cases, demonstrating performance improvements over baseline threshold-based techniques. These results highlight the potential of combining generative diffusion models and statistical color segmentation for computer-aided dermatology, enabling efficient erythema detection without prior training data.

Title: Effective Damage Data Generation by Fusing Imagery with Human Knowledge Using Vision-Language Models

Authors: Jie Wei, Erika Ardiles-Cruz, Aleksey Panasyuk, Erik Blasch
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01380
Pdf URL: https://arxiv.org/pdf/2508.01380
Copy Paste: [[2508.01380]] Effective Damage Data Generation by Fusing Imagery with Human Knowledge Using Vision-Language Models(https://arxiv.org/abs/2508.01380)
Keywords: generation
Abstract: It is of crucial importance to assess damages promptly and accurately in humanitarian assistance and disaster response (HADR). Current deep learning approaches struggle to generalize effectively due to the imbalance of data classes, scarcity of moderate damage examples, and human inaccuracy in pixel labeling during HADR situations. To accommodate these limitations, there is an opportunity to exploit state-of-the-art techniques in vision-language models (VLMs) to fuse imagery with human knowledge and effectively generate a diversified set of image-based damage data. Our initial experimental results suggest encouraging data generation quality, which demonstrates an improvement in classifying scenes with different levels of structural damage to buildings, roads, and infrastructures.

Title: A Full-Stage Refined Proposal Algorithm for Suppressing False Positives in Two-Stage CNN-Based Detection Methods

Authors: Qiang Guo, Rubo Zhang, Bingbing Zhang, Junjie Liu, Jianqing Liu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01382
Pdf URL: https://arxiv.org/pdf/2508.01382
Copy Paste: [[2508.01382]] A Full-Stage Refined Proposal Algorithm for Suppressing False Positives in Two-Stage CNN-Based Detection Methods(https://arxiv.org/abs/2508.01382)
Keywords: generation
Abstract: False positives in pedestrian detection remain a challenge that has yet to be effectively resolved. To address this issue, this paper proposes a Full-stage Refined Proposal (FRP) algorithm aimed at eliminating these false positives within a two-stage CNN-based pedestrian detection framework. The main innovation of this work lies in employing various pedestrian feature re-evaluation strategies to filter out low-quality pedestrian proposals during both the training and testing stages. Specifically, in the training phase, the Training mode FRP algorithm (TFRP) introduces a novel approach for validating pedestrian proposals to effectively guide the model training process, thereby constructing a model with strong capabilities for false positive suppression. During the inference phase, two innovative strategies are implemented: the Classifier-guided FRP (CFRP) algorithm integrates a pedestrian classifier into the proposal generation pipeline to yield high-quality proposals through pedestrian feature evaluation, and the Split-proposal FRP (SFRP) algorithm vertically divides all proposals, sending both the original and the sub-region proposals to the subsequent subnetwork to evaluate their confidence scores, filtering out those with lower sub-region pedestrian confidence scores. As a result, the proposed algorithm enhances the model's ability to suppress pedestrian false positives across all stages. Various experiments conducted on multiple benchmarks and the SY-Metro datasets demonstrate that the model, supported by different combinations of the FRP algorithm, can effectively eliminate false positives to varying extents. Furthermore, experiments conducted on embedded platforms underscore the algorithm's effectiveness in enhancing the comprehensive pedestrian detection capabilities of the small pedestrian detector in resource-constrained edge devices.

Title: ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models

Authors: Chuangchuang Tan, Jinglu Wang, Xiang Ming, Renshuai Tao, Yunchao Wei, Yao Zhao, Yan Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01402
Pdf URL: https://arxiv.org/pdf/2508.01402
Copy Paste: [[2508.01402]] ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models(https://arxiv.org/abs/2508.01402)
Keywords: generative
Abstract: Advances in generative models have led to AI-generated images visually indistinguishable from authentic ones. Despite numerous studies on detecting AI-generated images with classifiers, a gap persists between such methods and human cognitive forensic analysis. We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts. ForenX employs powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues. Furthermore, we overcome the limitations of standard MLLMs in detecting forgeries by incorporating a specialized forensic prompt that directs the MLLMs' attention to forgery-indicative attributes. This approach not only enhances the generalization of forgery detection but also empowers the MLLMs to provide explanations that are accurate, relevant, and comprehensive. Additionally, we introduce ForgReason, a dataset dedicated to descriptions of forgery evidence in AI-generated images. Curated through collaboration between an LLM-based agent and a team of human annotators, this process provides refined data that further enhances our model's performance. We demonstrate that even limited manual annotations significantly improve explanation quality. We evaluate the effectiveness of ForenX on two major benchmarks. The model's explainability is verified by comprehensive subjective evaluations.

Title: Uncertainty-Aware Segmentation Quality Prediction via Deep Learning Bayesian Modeling: Comprehensive Evaluation and Interpretation on Skin Cancer and Liver Segmentation

Authors: Sikha O K, Meritxell Riera-Marín, Adrian Galdran, Javier García Lopez, Julia Rodríguez-Comas, Gemma Piella, Miguel A. González Ballester
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01460
Pdf URL: https://arxiv.org/pdf/2508.01460
Copy Paste: [[2508.01460]] Uncertainty-Aware Segmentation Quality Prediction via Deep Learning Bayesian Modeling: Comprehensive Evaluation and Interpretation on Skin Cancer and Liver Segmentation(https://arxiv.org/abs/2508.01460)
Keywords: quality assessment
Abstract: Image segmentation is a critical step in computational biomedical image analysis, typically evaluated using metrics like the Dice coefficient during training and validation. However, in clinical settings without manual annotations, assessing segmentation quality becomes challenging, and models lacking reliability indicators face adoption barriers. To address this gap, we propose a novel framework for predicting segmentation quality without requiring ground truth annotations at test time. Our approach introduces two complementary frameworks: one leveraging predicted segmentation and uncertainty maps, and another integrating the original input image, uncertainty maps, and predicted segmentation maps. We present Bayesian adaptations of two benchmark segmentation models (SwinUNet and Feature Pyramid Network with ResNet50) using Monte Carlo Dropout, Ensemble, and Test Time Augmentation to quantify uncertainty. We evaluate four uncertainty estimates (confidence map, entropy, mutual information, and expected pairwise Kullback-Leibler divergence) on 2D skin lesion and 3D liver segmentation datasets, analyzing their correlation with segmentation quality metrics. Our framework achieves an R2 score of 93.25 and Pearson correlation of 96.58 on the HAM10000 dataset, outperforming previous segmentation quality assessment methods. For 3D liver segmentation, Test Time Augmentation with entropy achieves an R2 score of 85.03 and a Pearson correlation of 65.02, demonstrating cross-modality robustness. Additionally, we propose an aggregation strategy that combines multiple uncertainty estimates into a single score per image, offering a more robust and comprehensive assessment of segmentation quality. Finally, we use Grad-CAM and UMAP-based embedding analysis to interpret the model's behavior and reliability, highlighting the impact of uncertainty integration.
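
Two of the uncertainty estimates named in the abstract above, entropy and mutual information, can be illustrated for a single pixel. The following is a minimal sketch, not the authors' implementation: it assumes binary segmentation and that each stochastic pass (Monte Carlo Dropout, an ensemble member, or a test-time augmentation) yields one foreground probability per pixel. The function names `binary_entropy` and `pixel_uncertainty` are hypothetical.

```python
import math

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) distribution; zero at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def pixel_uncertainty(samples):
    """Summarize T stochastic foreground probabilities for one pixel.

    Returns (mean probability, predictive entropy, mutual information).
    Mutual information is the entropy of the mean prediction minus the
    mean entropy of the individual predictions.
    """
    mean_p = sum(samples) / len(samples)
    predictive_entropy = binary_entropy(mean_p)
    expected_entropy = sum(binary_entropy(p) for p in samples) / len(samples)
    mutual_information = predictive_entropy - expected_entropy
    return mean_p, predictive_entropy, mutual_information

# Passes that agree on an ambiguous pixel: high entropy, no disagreement.
agree = pixel_uncertainty([0.5, 0.5, 0.5, 0.5])
# Passes that confidently disagree: same entropy, high mutual information.
disagree = pixel_uncertainty([0.0, 1.0, 0.0, 1.0])
```

In this sketch, entropy alone cannot separate the two cases, while mutual information isolates disagreement between passes. Averaging such per-pixel values over an image would be one simple way to obtain an image-level score; the aggregation strategy used in the paper is not specified in the abstract.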

Title: Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians

Authors: Quankai Gao, Iliyan Georgiev, Tuanfeng Y. Wang, Krishna Kumar Singh, Ulrich Neumann, Jae Shin Yoon
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01464
Pdf URL: https://arxiv.org/pdf/2508.01464
Copy Paste: [[2508.01464]] Can3Tok: Canonical 3D Tokenization and Latent Modeling of Scene-Level 3D Gaussians(https://arxiv.org/abs/2508.01464)
Keywords: generation, generative
Abstract: 3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has rarely been explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generation operates on 3D scenes represented by 3D Gaussian Splatting (3DGS), which are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we showcase image-to-3DGS and text-to-3DGS generation as applications, demonstrating its ability to facilitate downstream generation tasks.

Title: ESM: A Framework for Building Effective Surrogate Models for Hardware-Aware Neural Architecture Search

Authors: Azaz-Ur-Rehman Nasir, Samroz Ahmad Shoaib, Muhammad Abdullah Hanif, Muhammad Shafique
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.01505
Pdf URL: https://arxiv.org/pdf/2508.01505
Copy Paste: [[2508.01505]] ESM: A Framework for Building Effective Surrogate Models for Hardware-Aware Neural Architecture Search(https://arxiv.org/abs/2508.01505)
Keywords: generation
Abstract: Hardware-aware Neural Architecture Search (NAS) is one of the most promising techniques for designing efficient Deep Neural Networks (DNNs) for resource-constrained devices. Surrogate models play a crucial role in hardware-aware NAS as they enable efficient prediction of performance characteristics (e.g., inference latency and energy consumption) of different candidate models on the target hardware device. In this paper, we focus on building hardware-aware latency prediction models. We study different types of surrogate models and highlight their strengths and weaknesses. We perform a systematic analysis to understand the impact of different factors that can influence the prediction accuracy of these models, aiming to assess the importance of each stage involved in the model design process and identify methods and policies necessary for designing and training an effective estimation model, specifically for GPU-powered devices. Based on the insights gained from the analysis, we present a holistic framework that enables reliable dataset generation and efficient model generation, considering the overall costs of different stages of the model generation pipeline.

Title: A Reward-Directed Diffusion Framework for Generative Design Optimization

Authors: Hadi Keramati, Patrick Kirchen, Mohammed Hannan, Rajeev K. Jaiman
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2508.01509
Pdf URL: https://arxiv.org/pdf/2508.01509
Copy Paste: [[2508.01509]] A Reward-Directed Diffusion Framework for Generative Design Optimization(https://arxiv.org/abs/2508.01509)
Keywords: generative
Abstract: This study presents a generative optimization framework that builds on a fine-tuned diffusion model and reward-directed sampling to generate high-performance engineering designs. The framework adopts a parametric representation of the design geometry and produces new parameter sets corresponding to designs with enhanced performance metrics. A key advantage of the reward-directed approach is its suitability for scenarios in which performance metrics rely on costly engineering simulations or on surrogate models (e.g., graph-based, ensemble, or tree-based models) that are non-differentiable or prohibitively expensive to differentiate. This work introduces the iterative use of a soft value function within a Markov decision process framework to achieve reward-guided decoding in the diffusion model. By incorporating soft-value guidance during both the training and inference phases, the proposed approach reduces the computational and memory costs of achieving high-reward designs, even beyond the training data. Empirical results indicate that this iterative reward-directed method substantially improves the ability of the diffusion models to generate samples with reduced resistance in 3D ship hull design and enhanced hydrodynamic performance in 2D airfoil design tasks. The proposed framework generates samples that extend beyond the training data distribution, resulting in a reduction in resistance of more than 25 percent for ship design and an improvement of over 10 percent in the lift-to-drag ratio for the 2D airfoil design. Successful integration of this model into the engineering design life cycle can enhance both designer productivity and overall design performance.

Title: Canoe Paddling Quality Assessment Using Smart Devices: Preliminary Machine Learning Study

Authors: S. Parab, A. Lamelas, A. Hassan, P. Bhote
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.01511
Pdf URL: https://arxiv.org/pdf/2508.01511
Copy Paste: [[2508.01511]] Canoe Paddling Quality Assessment Using Smart Devices: Preliminary Machine Learning Study(https://arxiv.org/abs/2508.01511)
Keywords: quality assessment
Abstract: Over 22 million Americans participate in paddling-related activities annually, contributing to a global paddlesports market valued at 2.4 billion US dollars in 2020. Despite its popularity, the sport has seen limited integration of machine learning (ML) and remains hindered by the cost of coaching and specialized equipment. This study presents a novel AI-based coaching system that uses ML models trained on motion data and delivers stroke feedback via a large language model (LLM). Participants were recruited through a collaboration with the NYU Concrete Canoe Team. Motion data were collected across two sessions, one with suboptimal form and one with corrected technique, using Apple Watches and smartphones secured in sport straps. The data underwent stroke segmentation and feature extraction. ML models, including Support Vector Classifier, Random Forest, Gradient Boosting, and Extremely Randomized Trees, were trained on both raw and engineered features. A web-based interface was developed to visualize stroke quality and deliver LLM-based feedback. Across four participants, eight trials yielded 66 stroke samples. The Extremely Randomized Trees model achieved the highest performance, with an F score of 0.9496 under five-fold cross-validation. The web interface successfully provided both quantitative metrics and qualitative feedback. Sensor placement near the wrists improved data quality. Preliminary results indicate that smartwatches and smartphones can enable low-cost, accessible alternatives to traditional paddling instruction. While limited by sample size, the study demonstrates the feasibility of using consumer devices and ML to support stroke refinement and technique improvement.
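
The evaluation protocol mentioned in the abstract above (an F score under five-fold cross-validation on 66 samples) can be sketched in plain Python. This is an illustrative sketch only: the abstract does not say which F variant was used, so F1 is assumed here, and the helpers `f1_score` and `k_fold_indices` are hypothetical names rather than the study's code. The split below is contiguous and unshuffled; a real evaluation would typically shuffle or stratify first.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# With 66 samples, five folds hold 14, 13, 13, 13, and 13 test samples.
folds = list(k_fold_indices(66, k=5))
fold_sizes = [len(test_idx) for _, test_idx in folds]
example_f1 = f1_score([1, 1, 0, 0], [1, 0, 0, 1])
```

With folds this small, a single misclassified stroke shifts the per-fold score noticeably, which is consistent with the abstract's caveat about sample size.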

Title: MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection

Authors: Kuo Shi, Jie Lu, Shanshan Ye, Guangquan Zhang, Zhen Fang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01525
Pdf URL: https://arxiv.org/pdf/2508.01525
Copy Paste: [[2508.01525]] MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection(https://arxiv.org/abs/2508.01525)
Keywords: generative
Abstract: Recent advances in generative models have highlighted the need for robust detectors capable of distinguishing real images from AI-generated images. While existing methods perform well on known generators, their performance often declines when tested with newly emerging or unseen generative models due to overlapping feature embeddings that hinder accurate cross-generator classification. In this paper, we propose Multimodal Discriminative Representation Learning for Generalizable AI-generated Image Detection (MiraGe), a method designed to learn generator-invariant features. Motivated by theoretical insights on intra-class variation minimization and inter-class separation, MiraGe tightly aligns features within the same class while maximizing separation between classes, enhancing feature discriminability. Moreover, we apply multimodal prompt learning to further refine these principles into CLIP, leveraging text embeddings as semantic anchors for effective discriminative representation learning, thereby improving generalizability. Comprehensive experiments across multiple benchmarks show that MiraGe achieves state-of-the-art performance, maintaining robustness even against unseen generators like Sora.

Title: E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation

Authors: Zeyu Xu, Junkang Zhang, Qiang Wang, Yi Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01546
Pdf URL: https://arxiv.org/pdf/2508.01546
Copy Paste: [[2508.01546]] E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation(https://arxiv.org/abs/2508.01546)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level. We then employ a lightweight VLM for frame scoring, further reducing computational costs at the model level. Additionally, we propose a frame retrieval strategy that leverages the global statistical distribution of inter-frame scores to mitigate the potential performance degradation from using a lightweight VLM. Finally, we introduce a multi-view question answering scheme for the retrieved frames, enhancing the VLM's capability to extract and comprehend information from long video contexts. Experiments on four public benchmarks show that E-VRAG achieves about 70% reduction in computational cost and higher accuracy compared to baseline methods, all without additional training. These results demonstrate the effectiveness of E-VRAG in improving both efficiency and accuracy for video RAG tasks.

Title: A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

Authors: Quan-Sheng Zeng, Yunheng Li, Qilong Wang, Peng-Tao Jiang, Zuxuan Wu, Ming-Ming Cheng, Qibin Hou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01548
Pdf URL: https://arxiv.org/pdf/2508.01548
Copy Paste: [[2508.01548]] A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models(https://arxiv.org/abs/2508.01548)
Keywords: generation
Abstract: Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our work paves a new way for building more powerful and efficient LVLMs.
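
The contrast the abstract above draws between fixed compression ratios and dynamic pruning can be shown with a toy example. This is a hedged sketch, not the GlimpsePrune method: it assumes each visual token already has a relevance score in [0, 1] (how such scores are obtained is not described in the abstract), and the functions `prune_fixed_ratio` and `prune_dynamic` and the threshold value are hypothetical.

```python
def prune_fixed_ratio(scores, keep_ratio):
    """Keep the top fraction of tokens regardless of how many are relevant."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def prune_dynamic(scores, threshold):
    """Keep every token whose score reaches the threshold, so the number
    of kept tokens adapts to the content of each input."""
    return [i for i, s in enumerate(scores) if s >= threshold]

# A simple scene with one relevant token and a complex scene with three.
simple_scene = [0.90, 0.05, 0.02, 0.01]
complex_scene = [0.80, 0.70, 0.60, 0.10]

fixed_simple = prune_fixed_ratio(simple_scene, keep_ratio=0.25)
fixed_complex = prune_fixed_ratio(complex_scene, keep_ratio=0.25)
dynamic_simple = prune_dynamic(simple_scene, threshold=0.5)
dynamic_complex = prune_dynamic(complex_scene, threshold=0.5)
```

Here the fixed ratio keeps one token for both scenes and so drops two informative tokens from the complex scene, while the threshold-based variant keeps one token for the simple scene and three for the complex one.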

Title: EvoVLMA: Evolutionary Vision-Language Model Adaptation

Authors: Kun Ding, Ying Wang, Shiming Xiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01558
Pdf URL: https://arxiv.org/pdf/2508.01558
Copy Paste: [[2508.01558]] EvoVLMA: Evolutionary Vision-Language Model Adaptation(https://arxiv.org/abs/2508.01558)
Keywords: generation
Abstract: Pre-trained Vision-Language Models (VLMs) have been exploited in various Computer Vision tasks (e.g., few-shot recognition) via model adaptation, such as prompt tuning and adapters. However, existing adaptation methods are designed by human experts, requiring significant time and experience. Inspired by recent advances in Large Language Model (LLM)-based code generation, we propose an Evolutionary Vision-Language Model Adaptation (EvoVLMA) method to automatically search for training-free, efficient adaptation algorithms for VLMs. We recognize feature selection and logits computation as the key functions in training-free VLM adaptation, and propose a two-stage LLM-assisted evolutionary algorithm for optimizing these parts in a sequential manner, effectively addressing the challenge posed by the expansive search space through a divide-and-conquer strategy. In addition, to enhance the stability and efficiency of the search process, we propose low-precision code conversion, web-based code execution, and process monitoring, leading to a highly effective automatic algorithm design system. Extensive experiments demonstrate that the algorithms found by EvoVLMA can obtain promising results compared to previous manually designed ones. More specifically, in the 8-shot image classification setting, the classical APE algorithm can be improved by 1.91 points in recognition accuracy. This research opens new possibilities for automating the optimization of adaptation algorithms of pre-trained multimodal models. Code is available at: this https URL

Title: A Spatio-temporal Continuous Network for Stochastic 3D Human Motion Prediction

Authors: Hua Yu, Yaqing Hou, Xu Gui, Shanshan Feng, Dongsheng Zhou, Qiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01585
Pdf URL: https://arxiv.org/pdf/2508.01585
Copy Paste: [[2508.01585]] A Spatio-temporal Continuous Network for Stochastic 3D Human Motion Prediction(https://arxiv.org/abs/2508.01585)
Keywords: generative
Abstract: Stochastic Human Motion Prediction (HMP) has received increasing attention due to its wide applications. Despite the rapid progress in generative fields, existing methods often face challenges in learning continuous temporal dynamics and predicting stochastic motion sequences. They tend to overlook the flexibility inherent in complex human motions and are prone to mode collapse. To alleviate these issues, we propose a novel two-stage method called STCN for stochastic and continuous human motion prediction. Specifically, in the first stage, we propose a spatio-temporal continuous network to generate smoother human motion sequences. In addition, an anchor set, which represents potential human motion patterns, is introduced into the stochastic HMP task to prevent mode collapse. In the second stage, STCN endeavors to acquire the Gaussian mixture distribution (GMM) of observed motion sequences with the aid of the anchor set. It also focuses on the probability associated with each anchor, and employs the strategy of sampling multiple sequences from each anchor to alleviate intra-class differences in human motions. Experimental results on two widely used datasets (Human3.6M and HumanEva-I) demonstrate that our model obtains competitive performance on both diversity and accuracy.

Title: Diffusion Models for Future Networks and Communications: A Comprehensive Survey

Authors: Nguyen Cong Luong, Nguyen Duc Hai, Duc Van Le, Huy T. Nguyen, Thai-Hoc Vu, Thien Huynh-The, Ruichen Zhang, Nguyen Duc Duy Anh, Dusit Niyato, Marco Di Renzo, Dong In Kim, Quoc-Viet Pham
Subjects: cs.LG, cs.AI, cs.ET, cs.IT, cs.NI
Abstract URL: https://arxiv.org/abs/2508.01586
Pdf URL: https://arxiv.org/pdf/2508.01586
Copy Paste: [[2508.01586]] Diffusion Models for Future Networks and Communications: A Comprehensive Survey(https://arxiv.org/abs/2508.01586)
Keywords: generative
Abstract: The rise of Generative AI (GenAI) in recent years has catalyzed transformative advances in wireless communications and networks. Among the members of the GenAI family, Diffusion Models (DMs) have risen to prominence as a powerful option, capable of handling complex, high-dimensional data distributions and delivering consistent, noise-robust performance. In this survey, we aim to provide a comprehensive overview of the theoretical foundations and practical applications of DMs across future communication systems. We first provide an extensive tutorial on DMs and demonstrate how they can be applied to enhance optimizers, reinforcement learning, and incentive mechanisms, which are popular approaches for problems in wireless networks. Then, we review and discuss the DM-based methods proposed for emerging issues in future networks and communications, including channel modeling and estimation, signal detection and data reconstruction, integrated sensing and communication, resource management in edge computing networks, semantic communications, and other notable issues. We conclude the survey by highlighting technical limitations of DMs and their applications, as well as discussing future research directions.

Title: Censored Sampling for Topology Design: Guiding Diffusion with Human Preferences

Authors: Euihyun Kim, Keun Park, Yeoneung Kim
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01589
Pdf URL: https://arxiv.org/pdf/2508.01589
Copy Paste: [[2508.01589]] Censored Sampling for Topology Design: Guiding Diffusion with Human Preferences(https://arxiv.org/abs/2508.01589)
Keywords: generation, generative
Abstract: Recent advances in denoising diffusion models have enabled rapid generation of optimized structures for topology optimization. However, these models often rely on surrogate predictors to enforce physical constraints, which may fail to capture subtle yet critical design flaws such as floating components or boundary discontinuities that are obvious to human experts. In this work, we propose a novel human-in-the-loop diffusion framework that steers the generative process using a lightweight reward model trained on minimal human feedback. Inspired by preference alignment techniques in generative modeling, our method learns to suppress unrealistic outputs by modulating the reverse diffusion trajectory using gradients of human-aligned rewards. Specifically, we collect binary human evaluations of generated topologies and train classifiers to detect floating material and boundary violations. These reward models are then integrated into the sampling loop of a pre-trained diffusion generator, guiding it to produce designs that are not only structurally performant but also physically plausible and manufacturable. Our approach is modular and requires no retraining of the diffusion model. Preliminary results show substantial reductions in failure modes and improved design realism across diverse test conditions. This work bridges the gap between automated design generation and expert judgment, offering a scalable solution to trustworthy generative design.

Title: Enhancing Zero-Shot Brain Tumor Subtype Classification via Fine-Grained Patch-Text Alignment

Authors: Lubin Gan, Jing Zhang, Linhao Qu, Yijun Wang, Siying Wu, Xiaoyan Sun
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01602
Pdf URL: https://arxiv.org/pdf/2508.01602
Copy Paste: [[2508.01602]] Enhancing Zero-Shot Brain Tumor Subtype Classification via Fine-Grained Patch-Text Alignment(https://arxiv.org/abs/2508.01602)
Keywords: generation
Abstract: The fine-grained classification of brain tumor subtypes from histopathological whole slide images is highly challenging due to subtle morphological variations and the scarcity of annotated data. Although vision-language models have enabled promising zero-shot classification, their ability to capture fine-grained pathological features remains limited, resulting in suboptimal subtype discrimination. To address these challenges, we propose the Fine-Grained Patch Alignment Network (FG-PAN), a novel zero-shot framework tailored for digital pathology. FG-PAN consists of two key modules: (1) a local feature refinement module that enhances patch-level visual features by modeling spatial relationships among representative patches, and (2) a fine-grained text description generation module that leverages large language models to produce pathology-aware, class-specific semantic prototypes. By aligning refined visual features with LLM-generated fine-grained descriptions, FG-PAN effectively increases class separability in both visual and semantic spaces. Extensive experiments on multiple public pathology datasets, including EBRAINS and TCGA, demonstrate that FG-PAN achieves state-of-the-art performance and robust generalization in zero-shot brain tumor subtype classification.

Title: TCDiff: Triplex Cascaded Diffusion for High-fidelity Multimodal EHRs Generation with Incomplete Clinical Data

Authors: Yandong Yan, Chenxi Li, Yu Huang, Dexuan Xu, Jiaqi Zhu, Zhongyan Chai, Huamin Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01615
Pdf URL: https://arxiv.org/pdf/2508.01615
Copy Paste: [[2508.01615]] TCDiff: Triplex Cascaded Diffusion for High-fidelity Multimodal EHRs Generation with Incomplete Clinical Data(https://arxiv.org/abs/2508.01615)
Keywords: generation, generative
Abstract: The scarcity of large-scale and high-quality electronic health records (EHRs) remains a major bottleneck in biomedical research, especially as large foundation models become increasingly data-hungry. Synthesizing substantial volumes of de-identified and high-fidelity data from existing datasets has emerged as a promising solution. However, existing methods suffer from a series of limitations: they struggle to model the intrinsic properties of heterogeneous multimodal EHR data (e.g., continuous, discrete, and textual modalities), capture the complex dependencies among them, and robustly handle pervasive data incompleteness. These challenges are particularly acute in Traditional Chinese Medicine (TCM). To this end, we propose TCDiff (Triplex Cascaded Diffusion Network), a novel EHR generation framework that cascades three diffusion networks to learn the features of real-world EHR data, forming a multi-stage generative process: Reference Modalities Diffusion, Cross-Modal Bridging, and Target Modality Diffusion. Furthermore, to validate our proposed framework, besides two public datasets, we also construct and introduce TCM-SZ1, a novel multimodal EHR dataset for benchmarking. Experimental results show that TCDiff consistently outperforms state-of-the-art baselines by an average of 10% in data fidelity under various missing rates, while maintaining competitive privacy guarantees. This highlights the effectiveness, robustness, and generalizability of our approach in real-world healthcare scenarios.

Title: Privacy-Preserving Inference for Quantized BERT Models

Authors: Tianpei Lu, Bingsheng Zhang, Lekun Peng, Bowen Zheng, Lichun Li, Kui Ren
Subjects: cs.LG, cs.CR
Abstract URL: https://arxiv.org/abs/2508.01636
Pdf URL: https://arxiv.org/pdf/2508.01636
Copy Paste: [[2508.01636]] Privacy-Preserving Inference for Quantized BERT Models(https://arxiv.org/abs/2508.01636)
Keywords: generative
Abstract: With the increasing deployment of generative machine learning models in privacy-sensitive domains such as healthcare and personalized services, ensuring secure inference has become a critical challenge. Secure multi-party computation (MPC) enables privacy-preserving model inference but suffers from high communication and computation overhead. The main bottleneck lies in the expensive secure evaluation of floating-point operations. Quantization offers a promising solution by converting floating-point operations into lower-precision integer computations, significantly reducing overhead. However, existing MPC-based quantized inference methods either rely on public quantization parameters, which poses privacy risks, or suffer from inefficiencies, particularly in handling nonlinear functions such as activations and softmax. In this work, we propose a fine-grained, layer-wise quantization scheme and support 1-bit weight fully connected layers in a secure setting. We design a multi-input lookup table protocol to evaluate softmax efficiently and securely. Furthermore, we use dual secret sharing schemes and perform precision conversions via lookup tables, eliminating truncation overhead entirely. Experimental evaluation on BERT-base models demonstrates that our approach achieves up to 8× speedup compared to Lu et al. (NDSS 25), 9× speedup compared to Gupta et al. (PETS 24), and 22× speedup compared to Knott et al. (NeurIPS 21).
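Illustrative note (not from the paper): the quantization idea in this abstract can be sketched in plaintext as follows. The sketch maps floating-point logits to 8-bit integer codes and evaluates softmax through a precomputed lookup table. It contains no secret sharing or MPC and is not the paper's protocol; the function names, the symmetric per-tensor scale, and the single-input table are assumptions made for this example, whereas the paper describes a multi-input lookup table and layer-wise quantization.

```python
import math

def quantize(values, num_bits=8):
    """Symmetric uniform quantization: floats -> signed integer codes plus one scale."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

def build_exp_table(scale, num_bits=8):
    """Precompute exp(-i * scale) for every possible gap i between two codes."""
    qmax = 2 ** (num_bits - 1) - 1
    # After subtracting the maximum code, each gap (top - code) lies in [0, 2 * qmax].
    return [math.exp(-i * scale) for i in range(2 * qmax + 1)]

def softmax_via_table(codes, table):
    """Softmax over quantized logits using table lookups instead of calling exp()."""
    top = max(codes)                                    # shift by the max for stability
    exps = [table[top - c] for c in codes]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
codes, scale = quantize(logits)
probs = softmax_via_table(codes, build_exp_table(scale))
```

The table has one entry per possible integer gap, so the nonlinear function is evaluated by indexing alone. That property is what makes lookup tables attractive in a secure setting, where evaluating exp() directly is expensive.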

Title: StrandDesigner: Towards Practical Strand Generation with Sketch Guidance

Authors: Na Zhang, Moran Li, Chengming Xu, Han Feng, Xiaobin Hu, Jiangning Zhang, Weijian Cao, Chengjie Wang, Yanwei Fu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01650
Pdf URL: https://arxiv.org/pdf/2508.01650
Copy Paste: [[2508.01650]] StrandDesigner: Towards Practical Strand Generation with Sketch Guidance(https://arxiv.org/abs/2508.01650)
Keywords: generation
Abstract: Realistic hair strand generation is crucial for applications like computer graphics and virtual reality. While diffusion models can generate hairstyles from text or images, these inputs lack precision and user-friendliness. Instead, we propose the first sketch-based strand generation model, which offers finer control while remaining user-friendly. Our framework tackles key challenges, such as modeling complex strand interactions and diverse sketch patterns, through two main innovations: a learnable strand upsampling strategy that encodes 3D strands into multi-scale latent spaces, and a multi-scale adaptive conditioning mechanism using a transformer with diffusion heads to ensure consistency across granularity levels. Experiments on several benchmark datasets show our method outperforms existing approaches in realism and precision. Qualitative results further confirm its effectiveness. Code will be released at [GitHub](this https URL).

Title: DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing

Authors: Yufeng Chi, Huimin Ma, Kafeng Wang, Jianmin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01684
Pdf URL: https://arxiv.org/pdf/2508.01684
Copy Paste: [[2508.01684]] DisCo3D: Distilling Multi-View Consistency for 3D Scene Editing(https://arxiv.org/abs/2508.01684)
Keywords: generation
Abstract: While diffusion models have demonstrated remarkable progress in 2D image generation and editing, extending these capabilities to 3D editing remains challenging, particularly in maintaining multi-view consistency. Classical approaches typically update 3D representations through iterative refinement based on a single editing view. However, these methods often suffer from slow convergence and blurry artifacts caused by cross-view inconsistencies. Recent methods improve efficiency by propagating 2D editing attention features, yet still exhibit fine-grained inconsistencies and failure modes in complex scenes due to insufficient constraints. To address this, we propose DisCo3D, a novel framework that distills 3D consistency priors into a 2D editor. Our method first fine-tunes a 3D generator using multi-view inputs for scene adaptation, then trains a 2D editor through consistency distillation. The edited multi-view outputs are finally optimized into 3D representations via Gaussian Splatting. Experimental results show DisCo3D achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality.

Title: Versatile Transition Generation with Image-to-Video Diffusion

Authors: Zuhao Yang, Jiahui Zhang, Yingchen Yu, Shijian Lu, Song Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01698
Pdf URL: https://arxiv.org/pdf/2508.01698
Copy Paste: [[2508.01698]] Versatile Transition Generation with Image-to-Video Diffusion(https://arxiv.org/abs/2508.01698)
Keywords: generation
Abstract: Leveraging text, images, structure maps, or motion trajectories as conditional guidance, diffusion models have achieved great success in automated and high-quality video generation. However, generating smooth and rational transition videos given the first and last video frames as well as descriptive text prompts is far underexplored. We present VTG, a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions. VTG introduces interpolation-based initialization that helps preserve object identity and handle abrupt content changes effectively. In addition, it incorporates dual-directional motion fine-tuning and representation alignment regularization to mitigate the limitations of pre-trained image-to-video diffusion models in motion smoothness and generation fidelity, respectively. To evaluate VTG and facilitate future studies on unified transition generation, we collected TransitBench, a comprehensive benchmark for transition generation covering two representative transition tasks: concept blending and scene transition. Extensive experiments show that VTG achieves superior transition performance consistently across all four tasks.

Title: TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

Authors: Zuhao Yang, Yingchen Yu, Yunqing Zhao, Shijian Lu, Song Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01699
Pdf URL: https://arxiv.org/pdf/2508.01699
Copy Paste: [[2508.01699]] TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding(https://arxiv.org/abs/2508.01699)
Keywords: generation
Abstract: Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical and static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation represent fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.

Title: Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization

Authors: Xin Ding, Yun Chen, Yongwei Wang, Kao Zhang, Sen Zhang, Peibei Cao, Xiangxue Wang
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2508.01725
Pdf URL: https://arxiv.org/pdf/2508.01725
Copy Paste: [[2508.01725]] Imbalance-Robust and Sampling-Efficient Continuous Conditional GANs via Adaptive Vicinity and Auxiliary Regularization(https://arxiv.org/abs/2508.01725)
Keywords: generation, generative
Abstract: Recent advances in conditional generative modeling have introduced the Continuous conditional Generative Adversarial Network (CcGAN) and the Continuous Conditional Diffusion Model (CCDM) for estimating high-dimensional data distributions conditioned on scalar, continuous regression labels (e.g., angles, ages, or temperatures). However, these approaches face fundamental limitations: CcGAN suffers from data imbalance due to fixed-size vicinity constraints, while CCDM requires computationally expensive iterative sampling. We present CcGAN-AVAR, an enhanced CcGAN framework that addresses both challenges. It (1) leverages the GAN framework's native one-step generation to overcome CCDM's sampling bottleneck (achieving 300x-2000x faster inference), and (2) introduces two novel components that specifically target data imbalance, namely an adaptive vicinity mechanism that dynamically adjusts the vicinity's size and a multi-task discriminator that constructs two regularization terms (through auxiliary regression and density ratio estimation) to significantly improve generator training. Extensive experiments on four benchmark datasets (64x64 to 192x192 resolution) across eight challenging imbalanced settings demonstrate that CcGAN-AVAR achieves state-of-the-art generation quality while maintaining sampling efficiency.

Title: Improving Noise Efficiency in Privacy-preserving Dataset Distillation

Authors: Runkai Zheng, Vishnu Asutosh Dasu, Yinong Oliver Wang, Haohan Wang, Fernando De la Torre
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01749
Pdf URL: https://arxiv.org/pdf/2508.01749
Copy Paste: [[2508.01749]] Improving Noise Efficiency in Privacy-preserving Dataset Distillation(https://arxiv.org/abs/2508.01749)
Keywords: generation
Abstract: Modern machine learning models heavily rely on large datasets that often include sensitive and private information, raising serious privacy concerns. Differentially private (DP) data generation offers a solution by creating synthetic datasets that limit the leakage of private information within a predefined privacy budget; however, it requires a substantial amount of data to achieve performance comparable to models trained on the original data. To mitigate the significant expense incurred with synthetic data generation, Dataset Distillation (DD) stands out for its remarkable training and storage efficiency. This efficiency is particularly advantageous when integrated with DP mechanisms, curating compact yet informative synthetic datasets without compromising privacy. However, current state-of-the-art private DD methods suffer from a synchronized sampling-optimization process and the dependency on noisy training signals from randomly initialized networks. This results in the inefficient utilization of private information due to the addition of excessive noise. To address these issues, we introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality by mitigating the impact of DP noise through matching in an informative subspace. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods, demonstrating significant potential to advance privacy-preserving DD.

Title: DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion

Authors: Zhigang Sun, Yiru Wang, Anqing Jiang, Shuo Wang, Yu Gao, Yuwen Heng, Shouyi Zhang, An He, Hao Jiang, Jinhao Chai, Zichong Gu, Wang Jijun, Shichen Tang, Lavdim Halilaj, Juergen Luettin, Hao Sun
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.01778
Pdf URL: https://arxiv.org/pdf/2508.01778
Copy Paste: [[2508.01778]] DiffSemanticFusion: Semantic Raster BEV Fusion for Autonomous Driving via Online HD Map Diffusion(https://arxiv.org/abs/2508.01778)
Keywords: generation
Abstract: Autonomous driving requires accurate scene understanding, including road geometry, traffic agents, and their semantic relationships. In online HD map generation scenarios, raster-based representations are well-suited to vision models but lack geometric precision, while graph-based representations retain structural detail but become unstable without precise maps. To harness the complementary strengths of both, we propose DiffSemanticFusion, a fusion framework for multimodal trajectory prediction and planning. Our approach reasons over a semantic raster-fused BEV space, enhanced by a map diffusion module that improves both the stability and expressiveness of online HD map representations. We validate our framework on two downstream tasks: trajectory prediction and planning-oriented end-to-end autonomous driving. Experiments on real-world autonomous driving benchmarks, nuScenes and NAVSIM, demonstrate improved performance over several state-of-the-art methods. For the prediction task on nuScenes, we integrate DiffSemanticFusion with the online HD map informed QCNet, achieving a 5.1% performance improvement. For end-to-end autonomous driving in NAVSIM, DiffSemanticFusion achieves state-of-the-art results, with a 15% performance gain in NavHard scenarios. In addition, extensive ablation and sensitivity studies show that our map diffusion module can be seamlessly integrated into other vector-based approaches to enhance performance. All artifacts are available at this https URL.

Title: Beyond Vulnerabilities: A Survey of Adversarial Attacks as Both Threats and Defenses in Computer Vision Systems

Authors: Zhongliang Guo, Yifei Qian, Yanli Li, Weiye Li, Chun Tong Lei, Shuai Zhao, Lei Fang, Ognjen Arandjelović, Chun Pong Lau
Subjects: cs.CV, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2508.01845
Pdf URL: https://arxiv.org/pdf/2508.01845
Copy Paste: [[2508.01845]] Beyond Vulnerabilities: A Survey of Adversarial Attacks as Both Threats and Defenses in Computer Vision Systems(https://arxiv.org/abs/2508.01845)
Keywords: generative
Abstract: Adversarial attacks against computer vision systems have emerged as a critical research area that challenges the fundamental assumptions about neural network robustness and security. This comprehensive survey examines the evolving landscape of adversarial techniques, revealing their dual nature as both sophisticated security threats and valuable defensive tools. We provide a systematic analysis of adversarial attack methodologies across three primary domains: pixel-space attacks, physically realizable attacks, and latent-space attacks. Our investigation traces the technical evolution from early gradient-based methods such as FGSM and PGD to sophisticated optimization techniques incorporating momentum, adaptive step sizes, and advanced transferability mechanisms. We examine how physically realizable attacks have successfully bridged the gap between digital vulnerabilities and real-world threats through adversarial patches, 3D textures, and dynamic optical perturbations. Additionally, we explore the emergence of latent-space attacks that leverage semantic structure in internal representations to create more transferable and meaningful adversarial examples. Beyond traditional offensive applications, we investigate the constructive use of adversarial techniques for vulnerability assessment in biometric authentication systems and protection against malicious generative models. Our analysis reveals critical research gaps, particularly in neural style transfer protection and computational efficiency requirements. This survey contributes a comprehensive taxonomy, evolution analysis, and identification of future research directions, aiming to advance understanding of adversarial vulnerabilities and inform the development of more robust and trustworthy computer vision systems.

Title: DiffusionFF: Face Forgery Detection via Diffusion-based Artifact Localization

Authors: Siran Peng, Haoyuan Zhang, Li Gao, Tianshuo Zhang, Bao Li, Zhen Lei
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.01873
Pdf URL: https://arxiv.org/pdf/2508.01873
Copy Paste: [[2508.01873]] DiffusionFF: Face Forgery Detection via Diffusion-based Artifact Localization(https://arxiv.org/abs/2508.01873)
Keywords: generation
Abstract: The rapid evolution of deepfake generation techniques demands robust and accurate face forgery detection algorithms. While determining whether an image has been manipulated remains essential, the ability to precisely localize forgery artifacts has become increasingly important for improving model explainability and fostering user trust. To address this challenge, we propose DiffusionFF, a novel framework that enhances face forgery detection through diffusion-based artifact localization. Our method utilizes a denoising diffusion model to generate high-quality Structural Dissimilarity (DSSIM) maps, which effectively capture subtle traces of manipulation. These DSSIM maps are then fused with high-level semantic features extracted by a pretrained forgery detector, leading to significant improvements in detection accuracy. Extensive experiments on both cross-dataset and intra-dataset benchmarks demonstrate that DiffusionFF not only achieves superior detection performance but also offers precise and fine-grained artifact localization, highlighting its overall effectiveness.
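Illustrative note (not from the paper): structural dissimilarity is commonly defined as DSSIM = (1 - SSIM) / 2. Assuming that definition, the sketch below computes a coarse DSSIM map between a real image and a manipulated copy using non-overlapping blocks. In the paper such maps are generated by a denoising diffusion model; this sketch only shows what a reference map of that kind could look like. The block size, the toy images, and the function names are assumptions.

```python
def ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equally sized flat lists of pixel values."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2          # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def dssim_map(img_a, img_b, block=2, data_range=255.0):
    """DSSIM per non-overlapping block; larger values mark stronger local differences."""
    rows = []
    for r in range(0, len(img_a) - block + 1, block):
        row = []
        for c in range(0, len(img_a[0]) - block + 1, block):
            pa = [img_a[r + i][c + j] for i in range(block) for j in range(block)]
            pb = [img_b[r + i][c + j] for i in range(block) for j in range(block)]
            row.append((1.0 - ssim(pa, pb, data_range)) / 2.0)
        rows.append(row)
    return rows

# Toy 4x4 grayscale images: the "fake" differs only in the bottom-right 2x2 block.
real = [[10, 20, 30, 40], [20, 30, 40, 50], [30, 40, 50, 60], [40, 50, 60, 70]]
fake = [row[:] for row in real]
fake[2][2], fake[2][3], fake[3][2], fake[3][3] = 200, 0, 0, 200
artifact_map = dssim_map(real, fake)
```

Only the entry for the altered block is far from zero, which is the kind of localization signal the abstract describes fusing with the detector's semantic features.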

Title: Optimizing Day-Ahead Energy Trading with Proximal Policy Optimization and Blockchain

Authors: Navneet Verma, Ying Xie
Subjects: cs.LG, cs.CR, cs.MA
Abstract URL: https://arxiv.org/abs/2508.01888
Pdf URL: https://arxiv.org/pdf/2508.01888
Copy Paste: [[2508.01888]] Optimizing Day-Ahead Energy Trading with Proximal Policy Optimization and Blockchain(https://arxiv.org/abs/2508.01888)
Keywords: generation
Abstract: The increasing penetration of renewable energy sources in day-ahead energy markets introduces challenges in balancing supply and demand, ensuring grid resilience, and maintaining trust in decentralized trading systems. This paper proposes a novel framework that integrates the Proximal Policy Optimization (PPO) algorithm, a state-of-the-art reinforcement learning method, with blockchain technology to optimize automated trading strategies for prosumers in day-ahead energy markets. We introduce a comprehensive framework that employs an RL agent for multi-objective energy optimization and blockchain for tamper-proof data and transaction management. Simulations using real-world data from the Electricity Reliability Council of Texas (ERCOT) demonstrate the effectiveness of our approach. The RL agent achieves demand-supply balancing within 2% and maintains near-optimal supply costs for the majority of the operating hours. Moreover, it generates robust battery storage policies capable of handling variability in solar and wind generation. All decisions are recorded on an Algorand-based blockchain, ensuring transparency, auditability, and security, which are key enablers for trustworthy multi-agent energy trading. Our contributions include a novel system architecture, curriculum learning for robust agent development, and actionable policy insights for practical deployment.
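Illustrative note (not from the paper): the abstract names PPO without describing it. For readers unfamiliar with the algorithm, the sketch below computes the standard clipped surrogate objective, the mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy and A is the advantage. Nothing here reflects the paper's trading environment, reward design, or hyperparameters; the function name and eps = 0.2 are assumptions.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantages, eps=0.2):
    """Mean clipped surrogate objective over a batch of (state, action) samples."""
    total = 0.0
    for nl, ol, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nl - ol)                       # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        # Taking the minimum removes any incentive to move the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

For example, with a ratio of 2 and an advantage of 1 the clipped term caps the contribution at 1.2, so a single update gains nothing from moving the policy further in that direction.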

Title: How Does Controllability Emerge In Language Models During Pretraining?

Authors: Jianshu She, Xinyue Li, Eric Xing, Zhengzhong Liu, Qirong Ho
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01892
Pdf URL: https://arxiv.org/pdf/2508.01892
Copy Paste: [[2508.01892]] How Does Controllability Emerge In Language Models During Pretraining?(https://arxiv.org/abs/2508.01892)
Keywords: generation
Abstract: Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the "Intervention Detector" (ID), which is designed to reveal how linear steerability evolves over the course of training through hidden state and representation analysis. ID reveals that concepts become increasingly linearly separable in the hidden space as training progresses, which strongly correlates with the emergence of linear steerability. We further introduce ID-based metrics, such as heatmaps, entropy trends, and cosine similarity, to help interpret how linear steerability evolves throughout training. In addition, we apply ID across different model families to ensure the generality of our findings on steerability dynamics.
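Illustrative note (not from the paper): linear steering, as described in this abstract, adjusts a hidden state along a direction in representation space. The sketch below uses a difference-of-means direction between hidden states with and without a concept. That is one common choice and an assumption here, since the abstract does not say how directions are obtained. The toy two-dimensional vectors and all function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_direction(with_concept, without_concept):
    """Difference of class means: points from 'concept absent' toward 'concept present'."""
    mu_pos, mu_neg = mean_vector(with_concept), mean_vector(without_concept)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def steer(hidden, direction, alpha):
    """Linear intervention: shift the hidden state by alpha times the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy hidden states for inputs with and without some concept (made-up numbers).
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 0.0], [0.0, 2.0]]
direction = steering_direction(pos, neg)
hidden = [1.0, 1.0]
steered = steer(hidden, direction, alpha=0.5)
```

Cosine similarity between the steered state and the concept direction is one way to check that the intervention moved the representation as intended; the abstract lists cosine similarity among its ID-based metrics.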

Title: Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense

Authors: Kyle Stein, Andrew A. Mahyari, Guillermo Francia III, Eman El-Sheikh
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01932
Pdf URL: https://arxiv.org/pdf/2508.01932
Copy Paste: [[2508.01932]] Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense(https://arxiv.org/abs/2508.01932)
Keywords: generative
Abstract: Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks, where adversaries embed triggers into inputs to cause models to misclassify or misinterpret target labels. Beyond traditional single-trigger scenarios, attackers may inject multiple triggers across various object classes, forming unseen backdoor-object configurations that evade standard detection pipelines. In this paper, we introduce DBOM (Disentangled Backdoor-Object Modeling), a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats at the dataset level. Specifically, DBOM factorizes input image representations by modeling triggers and objects as independent primitives in the embedding space through the use of Vision-Language Models (VLMs). By leveraging the frozen, pre-trained encoders of VLMs, our approach decomposes the latent representations into distinct components through a learnable visual prompt repository and prompt prefix tuning, ensuring that the relationships between triggers and objects are explicitly captured. To separate trigger and object representations in the visual prompt repository, we introduce trigger-object separation and diversity losses, which aid in disentangling trigger and object visual features. Next, by aligning image features with feature decomposition and fusion, as well as learned contextual prompt tokens in a shared multimodal space, DBOM enables zero-shot generalization to novel trigger-object pairings that were unseen during training, thereby offering deeper insights into adversarial attack patterns. Experimental results on CIFAR-10 and GTSRB demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of DNN training pipelines.

Title: Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling

Authors: Seyyed Saeid Cheshmi, Azal Ahmad Khan, Xinran Wang, Zirui Liu, Ali Anwar
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.01969
Pdf URL: https://arxiv.org/pdf/2508.01969
Copy Paste: [[2508.01969]] Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling(https://arxiv.org/abs/2508.01969)
Keywords: generation
Abstract: Large Language Models (LLMs) are increasingly relied upon for solving complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing line of work seeks to improve reasoning quality by scaling inference-time compute, particularly through Process Reward Models (PRMs), which are used to reward the reasoning at intermediate steps. While effective, these methods introduce substantial computational overhead, especially when generating large numbers of solutions in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals that enable the rejection of suboptimal candidates before the full generation of a step is complete. We introduce the hypothesis that PRMs are also Partial Reward Models, meaning that the scores they assign to partially completed reasoning steps are predictive of final output quality. This allows for principled early rejection based on intermediate token-level signals. We support this hypothesis both theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length, and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves up to a 1.4×-9× reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute-efficiency of reasoning in LLMs.
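Abstract (translated from Chinese): Large language models (LLMs) are increasingly relied upon to solve complex reasoning tasks in domains such as mathematics, logic, and multi-step question answering. A growing body of work aims to improve reasoning quality by scaling inference-time compute, particularly through process reward models (PRMs), which reward the reasoning at intermediate steps. Although effective, these methods introduce substantial computational overhead, especially when many solutions are generated in parallel. In this paper, we investigate whether PRMs can be used mid-generation to provide early signals, so that suboptimal candidates can be rejected before a full step has been generated. We introduce the hypothesis that PRMs are also partial reward models, meaning that the scores they assign to partially completed reasoning steps are predictive of final output quality. This permits principled early rejection based on intermediate token-level signals. We support the hypothesis theoretically, by proving that the risk of discarding optimal beams decreases exponentially with generation length, and empirically, by demonstrating a strong correlation between partial and final rewards across multiple reward models. On math reasoning benchmarks, our method achieves up to 1.4$\times$-9$\times$ reduction in inference FLOPs without degrading final performance. These results suggest that early rejection is a powerful mechanism for improving the compute efficiency of LLM reasoning.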
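
The early-rejection idea in this abstract can be sketched roughly as follows: score each candidate after only a prefix of a reasoning step has been generated, and keep the top-scoring fraction before spending compute on the rest. This is a minimal illustration, not the authors' implementation; `early_reject`, `keep_fraction`, and the toy reward are hypothetical names, and a real system would call a PRM instead of summing token ids.

```python
from typing import Callable, List, Sequence


def early_reject(
    candidates: Sequence[List[int]],
    partial_reward: Callable[[List[int]], float],
    keep_fraction: float = 0.5,
) -> List[List[int]]:
    """Keep the top `keep_fraction` of partially generated candidates.

    `partial_reward` is a stand-in for a PRM scoring an incomplete step.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Always keep at least one candidate so generation can continue.
    n_keep = max(1, int(len(candidates) * keep_fraction))
    ranked = sorted(candidates, key=partial_reward, reverse=True)
    return ranked[:n_keep]


def toy_reward(tokens: List[int]) -> float:
    """Hypothetical reward: the sum of token ids (illustration only)."""
    return float(sum(tokens))


# Four partially generated candidates (token-id prefixes).
prefixes = [[1, 2], [5, 5], [0, 1], [3, 4]]
survivors = early_reject(prefixes, toy_reward, keep_fraction=0.5)
print(survivors)
```

Only the survivors would be extended to a full step, which is where the compute saving would come from under the abstract's hypothesis that partial scores predict final quality.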

Title: Generative Large-Scale Pre-trained Models for Automated Ad Bidding Optimization

Authors: Yu Lei, Jiayang Zhao, Yilei Zhao, Zhaoqi Zhang, Linyou Cai, Qianlong Xie, Xingxing Wang
Subjects: cs.LG, cs.CE
Abstract URL: https://arxiv.org/abs/2508.02002
Pdf URL: https://arxiv.org/pdf/2508.02002
Copy Paste: [[2508.02002]] Generative Large-Scale Pre-trained Models for Automated Ad Bidding Optimization(https://arxiv.org/abs/2508.02002)
Keywords: generation, generative
Abstract: Modern auto-bidding systems are required to balance overall performance with diverse advertiser goals and real-world constraints, reflecting the dynamic and evolving needs of the industry. Recent advances in conditional generative models, such as transformers and diffusers, have enabled direct trajectory generation tailored to advertiser preferences, offering a promising alternative to traditional Markov Decision Process-based methods. However, these generative methods face significant challenges, such as the distribution shift between offline and online environments, limited exploration of the action space, and the necessity to meet constraints like marginal Cost-per-Mille (CPM) and Return on Investment (ROI). To tackle these challenges, we propose GRAD (Generative Reward-driven Ad-bidding with Mixture-of-Experts), a scalable foundation model for auto-bidding that combines an Action-Mixture-of-Experts module for diverse bidding action exploration with the Value Estimator of Causal Transformer for constraint-aware optimization. Extensive offline and online experiments demonstrate that GRAD significantly enhances platform revenue, highlighting its effectiveness in addressing the evolving and diverse requirements of modern advertisers. Furthermore, GRAD has been implemented in multiple marketing scenarios at Meituan, one of the world's largest online food delivery platforms, leading to a 2.18% increase in Gross Merchandise Value (GMV) and 10.68% increase in ROI.
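Abstract (translated from Chinese): Modern auto-bidding systems must balance overall performance against diverse advertiser goals and real-world constraints, reflecting the dynamic and evolving needs of the industry. Recent advances in conditional generative models, such as transformers and diffusers, have enabled direct trajectory generation tailored to advertiser preferences, offering a promising alternative to traditional methods based on Markov decision processes. However, these generative methods face significant challenges, such as the distribution shift between offline and online environments, limited exploration of the action space, and the need to satisfy constraints such as marginal Cost-per-Mille (CPM) and Return on Investment (ROI). To address these challenges, we propose GRAD (Generative Reward-driven Ad-bidding with Mixture-of-Experts), a scalable foundation model for auto-bidding that combines an Action-Mixture-of-Experts module for diverse bidding action exploration with the Value Estimator of Causal Transformer for constraint-aware optimization. Extensive offline and online experiments show that GRAD significantly increases platform revenue, highlighting its effectiveness in meeting the evolving and diverse requirements of modern advertisers. In addition, GRAD has been deployed in multiple marketing scenarios at Meituan, one of the world's largest online food delivery platforms, leading to a 2.18% increase in Gross Merchandise Value (GMV) and a 10.68% increase in ROI.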

Title: Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention

Authors: Kyungmin Jo, Jooyeol Yun, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02004
Pdf URL: https://arxiv.org/pdf/2508.02004
Copy Paste: [[2508.02004]] Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention(https://arxiv.org/abs/2508.02004)
Keywords: generation
Abstract: While large-scale text-to-image diffusion models enable the generation of high-quality, diverse images from text prompts, these prompts struggle to capture intricate details, such as textures, preventing the user intent from being reflected. This limitation has led to efforts to generate images conditioned on user-provided images, referred to as image prompts. Recent work modifies the self-attention mechanism to impose image conditions in generated images by replacing or concatenating the keys and values from the image prompt. This enables the self-attention layer to work like a cross-attention layer, generally used to incorporate text prompts. In this paper, we identify two common issues in existing methods of modifying self-attention to generate images that reflect the details of image prompts. First, existing approaches neglect the importance of image prompts in classifier-free guidance. Specifically, current methods use image prompts as both desired and undesired conditions in classifier-free guidance, causing conflicting signals. To resolve this, we propose conflict-free guidance by using image prompts only as desired conditions, ensuring that the generated image faithfully reflects the image prompt. In addition, we observe that the two most common self-attention modifications involve a trade-off between the realism of the generated image and alignment with the image prompt. Specifically, selecting more keys and values from the image prompt improves alignment, while selecting more from the generated image enhances realism. To balance both, we propose a new self-attention modification method, Stratified Attention, which jointly uses keys and values from both images rather than selecting between them. Through extensive experiments across three image generation tasks, we show that the proposed method outperforms existing image-prompting models in faithfully reflecting the image prompt.
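Abstract (translated from Chinese): Although large-scale text-to-image diffusion models can generate high-quality, diverse images from text prompts, such prompts struggle to capture intricate details such as textures, which prevents the user's intent from being reflected. This limitation has led to efforts to generate images conditioned on user-provided images, referred to as image prompts. Recent work modifies the self-attention mechanism to impose image conditions on generated images by replacing or concatenating the keys and values from the image prompt. This lets the self-attention layer work like a cross-attention layer, which is generally used to incorporate text prompts. In this paper, we identify two common issues in existing methods that modify self-attention to generate images reflecting the details of image prompts. First, existing approaches neglect the importance of image prompts in classifier-free guidance. Specifically, current methods use image prompts as both desired and undesired conditions in classifier-free guidance, which causes conflicting signals. To resolve this, we propose conflict-free guidance, which uses image prompts only as desired conditions and thereby ensures that the generated image faithfully reflects the image prompt. In addition, we observe that the two most common self-attention modifications involve a trade-off between the realism of the generated image and its alignment with the image prompt. Specifically, selecting more keys and values from the image prompt improves alignment, while selecting more from the generated image enhances realism. To balance the two, we propose a new self-attention modification method, Stratified Attention, which jointly uses keys and values from both images rather than selecting between them. Through extensive experiments on three image generation tasks, we show that the proposed method outperforms existing image-prompting models in faithfully reflecting the image prompt.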
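
To make the guidance issue in this abstract concrete, the sketch below applies the standard classifier-free guidance combination, `eps = eps_undesired + scale * (eps_desired - eps_undesired)`, to a toy noise predictor. The additive predictor `toy_eps` and the two-component vectors are assumptions chosen purely for illustration; real denoisers are not additive, and the paper's exact formulation may differ.

```python
from typing import List


def cfg_combine(
    eps_desired: List[float], eps_undesired: List[float], scale: float
) -> List[float]:
    """Standard classifier-free guidance combination of two predictions."""
    return [u + scale * (d - u) for d, u in zip(eps_desired, eps_undesired)]


def toy_eps(text_on: bool, image_on: bool) -> List[float]:
    """Hypothetical additive predictor: component 0 carries the text
    condition, component 1 carries the image prompt."""
    text = 1.0 if text_on else 0.0
    image = 1.0 if image_on else 0.0
    return [text, image]


scale = 3.0

# Image prompt present in BOTH branches: its contribution cancels in the
# guidance direction, so guidance does not amplify it.
shared = cfg_combine(toy_eps(True, True), toy_eps(False, True), scale)

# Image prompt present ONLY in the desired branch: guidance amplifies it
# together with the text condition.
desired_only = cfg_combine(toy_eps(True, True), toy_eps(False, False), scale)

print(shared, desired_only)
```

In this toy setting the image component is scaled by the guidance weight only when the image prompt is kept out of the undesired branch, which is one way to read the abstract's "image prompts only as desired conditions".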

Title: An Evolving Scenario Generation Method based on Dual-modal Driver Model Trained by Multi-Agent Reinforcement Learning

Authors: Xinzheng Wu, Junyi Chen, Shaolingfeng Ye, Wei Jiang, Yong Shen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.02027
Pdf URL: https://arxiv.org/pdf/2508.02027
Copy Paste: [[2508.02027]] An Evolving Scenario Generation Method based on Dual-modal Driver Model Trained by Multi-Agent Reinforcement Learning(https://arxiv.org/abs/2508.02027)
Keywords: generation
Abstract: In autonomous driving testing methods based on evolving scenarios, the construction method of the driver model, which determines the driving maneuvers of background vehicles (BVs) in the scenario, plays a critical role in generating safety-critical scenarios. In particular, the cooperative adversarial driving characteristics between BVs can contribute to the efficient generation of safety-critical scenarios with high testing value. In this paper, a multi-agent reinforcement learning (MARL) method is used to train and generate a dual-modal driver model (Dual-DM) with non-adversarial and adversarial driving modalities. The model is then connected to a continuous simulated traffic environment to generate complex, diverse, and strongly interactive safety-critical scenarios through an evolving scenario generation method. After that, the generated evolving scenarios are evaluated in terms of fidelity, test efficiency, complexity, and diversity. Results show that, without performance degradation in scenario fidelity (>85% similarity to real-world scenarios) and complexity (complexity metric: 0.45, +32.35% and +12.5% over two baselines), Dual-DM achieves a substantial enhancement in the efficiency of generating safety-critical scenarios (efficiency metric: 0.86, +195% over two baselines). Furthermore, statistical analysis and case studies demonstrate the diversity of the safety-critical evolving scenarios generated by Dual-DM in terms of adversarial interaction patterns. Therefore, Dual-DM can greatly improve the generation of safety-critical scenarios through the evolving scenario generation method.
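Abstract (translated from Chinese): In autonomous driving testing methods based on evolving scenarios, the way the driver model is constructed determines the driving maneuvers of background vehicles (BVs) in the scenario and plays a critical role in generating safety-critical scenarios. In particular, cooperative adversarial driving characteristics between BVs can help generate safety-critical scenarios with high testing value efficiently. In this paper, a multi-agent reinforcement learning (MARL) method is used to train and generate a dual-modal driver model (Dual-DM) with non-adversarial and adversarial driving modalities. The model is then connected to a continuous simulated traffic environment to produce complex, diverse, and strongly interactive safety-critical scenarios through an evolving scenario generation method. The generated evolving scenarios are then evaluated in terms of fidelity, test efficiency, complexity, and diversity. Results show that, without performance degradation in scenario fidelity (>85% similarity to real-world scenarios) and complexity (complexity metric: 0.45, +32.35% and +12.5% over two baselines), Dual-DM substantially improves the efficiency of generating safety-critical scenarios (efficiency metric: 0.86, +195% over two baselines). In addition, statistical analysis and case studies demonstrate the diversity of the safety-critical evolving scenarios generated by Dual-DM in terms of adversarial interaction patterns. Dual-DM can therefore greatly improve the generation of safety-critical scenarios through the evolving scenario generation method.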

Title: Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving

Authors: Tianyuan Zhang, Ting Jin, Lu Wang, Jiangfan Liu, Siyuan Liang, Mingchuan Zhang, Aishan Liu, Xianglong Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02028
Pdf URL: https://arxiv.org/pdf/2508.02028
Copy Paste: [[2508.02028]] Bench2ADVLM: A Closed-Loop Benchmark for Vision-language Models in Autonomous Driving(https://arxiv.org/abs/2508.02028)
Keywords: generation
Abstract: Vision-Language Models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are predominantly confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling, for the first time, closed-loop testing of ADVLMs on physical vehicles. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework, revealing that existing ADVLMs still exhibit limited performance under closed-loop conditions.
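Abstract (translated from Chinese): Vision-language models (VLMs) have recently emerged as a promising paradigm in autonomous driving (AD). However, current performance evaluation protocols for VLM-based AD systems (ADVLMs) are largely confined to open-loop settings with static inputs, neglecting the more realistic and informative closed-loop setting that captures interactive behavior, feedback resilience, and real-world safety. To address this, we introduce Bench2ADVLM, a unified hierarchical closed-loop evaluation framework for real-time, interactive assessment of ADVLMs across both simulation and physical platforms. Inspired by dual-process theories of cognition, we first adapt diverse ADVLMs to simulation environments via a dual-system adaptation architecture. In this design, heterogeneous high-level driving commands generated by the target ADVLMs (fast system) are interpreted by a general-purpose VLM (slow system) into standardized mid-level control actions suitable for execution in simulation. To bridge the gap between simulation and reality, we design a physical control abstraction layer that translates these mid-level actions into low-level actuation signals, enabling closed-loop testing of ADVLMs on physical vehicles for the first time. To enable more comprehensive evaluation, Bench2ADVLM introduces a self-reflective scenario generation module that automatically explores model behavior and uncovers potential failure modes for safety-critical scenario generation. Overall, Bench2ADVLM establishes a hierarchical evaluation pipeline that seamlessly integrates high-level abstract reasoning, mid-level simulation actions, and low-level real-world execution. Experiments on diverse scenarios across multiple state-of-the-art ADVLMs and physical platforms validate the diagnostic strength of our framework and reveal that existing ADVLMs still exhibit limited performance under closed-loop conditions.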

Title: Conditional Diffusion Model with Anatomical-Dose Dual Constraints for End-to-End Multi-Tumor Dose Prediction

Authors: Hui Xie, Haiqin Hu, Lijuan Ding, Qing Li, Yue Sun, Tao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02043
Pdf URL: https://arxiv.org/pdf/2508.02043
Copy Paste: [[2508.02043]] Conditional Diffusion Model with Anatomical-Dose Dual Constraints for End-to-End Multi-Tumor Dose Prediction(https://arxiv.org/abs/2508.02043)
Keywords: generation
Abstract: Radiotherapy treatment planning often relies on time-consuming, trial-and-error adjustments that heavily depend on the expertise of specialists, while existing deep learning methods face limitations in generalization, prediction accuracy, and clinical applicability. To tackle these challenges, we propose ADDiff-Dose, an Anatomical-Dose Dual Constraints Conditional Diffusion Model for end-to-end multi-tumor dose prediction. The model employs LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs, including target and organ-at-risk (OAR) masks and beam parameters, within a progressive noise addition and denoising framework. It incorporates conditional features via a multi-head attention mechanism and utilizes a composite loss function combining MSE, conditional terms, and KL divergence to ensure both dosimetric accuracy and compliance with clinical constraints. Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) demonstrates that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models), a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Ablation studies confirm that the structural encoder enhances compliance with clinical dose constraints by 28.5%. To our knowledge, this is the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction, offering a generalizable and efficient solution for automated treatment planning across diverse tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency.
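Abstract (translated from Chinese): Radiotherapy treatment planning often relies on time-consuming, trial-and-error adjustments that depend heavily on the expertise of specialists, while existing deep learning methods face limitations in generalization, prediction accuracy, and clinical applicability. To tackle these challenges, we propose ADDiff-Dose, an Anatomical-Dose Dual Constraints Conditional Diffusion Model for end-to-end multi-tumor dose prediction. The model employs LightweightVAE3D to compress high-dimensional CT data and integrates multimodal inputs, including target and organ-at-risk (OAR) masks and beam parameters, within a progressive noise addition and denoising framework. It incorporates conditional features via a multi-head attention mechanism and uses a composite loss function combining MSE, conditional terms, and KL divergence to ensure both dosimetric accuracy and compliance with clinical constraints. Evaluation on a large-scale public dataset (2,877 cases) and three external institutional cohorts (450 cases in total) shows that ADDiff-Dose significantly outperforms traditional baselines, achieving an MAE of 0.101-0.154 (compared to 0.316 for UNet and 0.169 for GAN models) and a DICE coefficient of 0.927 (a 6.8% improvement), and limiting spinal cord maximum dose error to within 0.1 Gy. The average plan generation time per case is reduced to 22 seconds. Ablation studies confirm that the structural encoder improves compliance with clinical dose constraints by 28.5%. To our knowledge, this is the first study to introduce a conditional diffusion model framework for radiotherapy dose prediction, offering a generalizable and efficient solution for automated treatment planning across diverse tumor sites, with the potential to substantially reduce planning time and improve clinical workflow efficiency.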

Title: StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion

Authors: Haoxin Yang, Weihong Chen, Xuemiao Xu, Cheng Xu, Peng Xiao, Cuifeng Sun, Shaoyu Huang, Shengfeng He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02056
Pdf URL: https://arxiv.org/pdf/2508.02056
Copy Paste: [[2508.02056]] StarPose: 3D Human Pose Estimation via Spatial-Temporal Autoregressive Diffusion(https://arxiv.org/abs/2508.02056)
Keywords: generation
Abstract: Monocular 3D human pose estimation remains a challenging task due to inherent depth ambiguities and occlusions. Compared to traditional methods based on Transformers or Convolutional Neural Networks (CNNs), recent diffusion-based approaches have shown superior performance, leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly enhance both the accuracy and temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process in an iterative manner, which further enforces spatial anatomical plausibility and temporal motion dynamics, rendering robust and realistic pose estimates. Extensive experiments on benchmark datasets demonstrate that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at this https URL.
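Abstract (translated from Chinese): Monocular 3D human pose estimation remains a challenging task because of inherent depth ambiguities and occlusions. Compared with traditional methods based on Transformers or convolutional neural networks (CNNs), recent diffusion-based approaches have shown superior performance by leveraging their probabilistic nature and high-fidelity generation capabilities. However, these methods often fail to account for the spatial and temporal correlations across predicted frames, resulting in limited temporal consistency and inferior accuracy in predicted 3D pose sequences. To address these shortcomings, this paper proposes StarPose, an autoregressive diffusion framework that effectively incorporates historical 3D pose predictions and spatial-temporal physical guidance to significantly improve both the accuracy and the temporal coherence of pose predictions. Unlike existing approaches, StarPose models the 2D-to-3D pose mapping as an autoregressive diffusion process. By synergistically integrating previously predicted 3D poses with 2D pose inputs via a Historical Pose Integration Module (HPIM), the framework generates rich and informative historical pose embeddings that guide subsequent denoising steps, ensuring temporally consistent predictions. In addition, a fully plug-and-play Spatial-Temporal Physical Guidance (STPG) mechanism is tailored to refine the denoising process iteratively, which further enforces spatial anatomical plausibility and temporal motion dynamics and yields robust and realistic pose estimates. Extensive experiments on benchmark datasets show that StarPose outperforms state-of-the-art methods, achieving superior accuracy and temporal consistency in 3D human pose estimation. Code is available at this https URL.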

Title: S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework

Authors: Yingshu Li, Yunyi Liu, Zhanyu Wang, Xinyu Liang, Lingqiao Liu, Lei Wang, Luping Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02082
Pdf URL: https://arxiv.org/pdf/2508.02082
Copy Paste: [[2508.02082]] S-RRG-Bench: Structured Radiology Report Generation with Fine-Grained Evaluation Framework(https://arxiv.org/abs/2508.02082)
Keywords: generation
Abstract: Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, complicating the extraction of critical clinical details. Structured radiology report generation (S-RRG) offers a promising solution by organizing information into standardized, concise formats. However, existing approaches often rely on classification or visual question answering (VQA) pipelines that require predefined label sets and produce only fragmented outputs. Template-based approaches, which generate reports by replacing keywords within fixed sentence patterns, further compromise expressiveness and often omit clinically important details. In this work, we present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework. We first create a robust chest X-ray dataset (MIMIC-STRUC) that includes disease names, severity levels, probabilities, and anatomical locations, ensuring that the dataset is both clinically relevant and well-structured. We train an LLM-based model to generate standardized, high-quality reports. To assess the generated reports, we propose a specialized evaluation metric (S-Score) that not only measures disease prediction accuracy but also evaluates the precision of disease-specific details, thus offering a clinically meaningful metric for report quality that focuses on elements critical to clinical decision-making and demonstrates a stronger alignment with human assessments. Our approach highlights the effectiveness of structured reports and the importance of a tailored evaluation metric for S-RRG, providing a more clinically relevant measure of report quality.
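Abstract (translated from Chinese): Radiology report generation (RRG) for diagnostic images, such as chest X-rays, plays a pivotal role in both clinical practice and AI. Traditional free-text reports suffer from redundancy and inconsistent language, which complicates the extraction of critical clinical details. Structured radiology report generation (S-RRG) offers a promising solution by organizing information into standardized, concise formats. However, existing approaches often rely on classification or visual question answering (VQA) pipelines that require predefined label sets and produce only fragmented outputs. Template-based approaches, which generate reports by replacing keywords within fixed sentence patterns, further compromise expressiveness and often omit clinically important details. In this work, we present a novel approach to S-RRG that includes dataset construction, model training, and the introduction of a new evaluation framework. We first create a robust chest X-ray dataset (MIMIC-STRUC) that includes disease names, severity levels, probabilities, and anatomical locations, ensuring that the dataset is both clinically relevant and well structured. We train an LLM-based model to generate standardized, high-quality reports. To assess the generated reports, we propose a specialized evaluation metric (S-Score) that not only measures disease prediction accuracy but also evaluates the precision of disease-specific details, thus offering a clinically meaningful metric for report quality that focuses on elements critical to clinical decision-making and shows stronger alignment with human assessments. Our approach highlights the effectiveness of structured reports and the importance of a tailored evaluation metric for S-RRG, providing a more clinically relevant measure of report quality.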

Title: CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search

Authors: Xiaoya Li, Xiaofei Sun, Albert Wang, Chris Shum, Jiwei Li
Subjects: cs.LG, cs.AI, cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2508.02091
Pdf URL: https://arxiv.org/pdf/2508.02091
Copy Paste: [[2508.02091]] CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search(https://arxiv.org/abs/2508.02091)
Keywords: generation
Abstract: Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves the best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and ties for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach well beyond ANNS optimization: it validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at this https URL.
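Abstract (translated from Chinese): Approximate nearest-neighbor search (ANNS) algorithms have become increasingly important for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem in which execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely used NNS benchmark datasets. Compared with state-of-the-art open-source ANNS algorithms, CRINN achieves the best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular) and ties for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach beyond ANNS optimization: it validates that LLMs augmented with reinforcement learning can serve as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at this https URL.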

Title: Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis

Authors: Kaiyang Ji, Ye Shi, Zichen Jin, Kangyi Chen, Lan Xu, Yuexin Ma, Jingyi Yu, Jingya Wang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2508.02106
Pdf URL: https://arxiv.org/pdf/2508.02106
Copy Paste: [[2508.02106]] Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis(https://arxiv.org/abs/2508.02106)
Keywords: generation
Abstract: Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods demonstrate progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real-time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to interaction partners' movements while avoiding artifacts like foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets demonstrate significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.
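Abstract (translated from Chinese): Real-time synthesis of physically plausible human interactions remains a critical challenge for immersive VR/AR systems and humanoid robotics. While existing methods show progress in kinematic motion generation, they often fail to address the fundamental tension between real-time responsiveness, physical feasibility, and safety requirements in dynamic human-machine interactions. We introduce Human-X, a novel framework designed to enable immersive and physically plausible human interactions across diverse entities, including human-avatar, human-humanoid, and human-robot systems. Unlike existing approaches that focus on post-hoc alignment or simplified physics, our method jointly predicts actions and reactions in real time using an auto-regressive reaction diffusion planner, ensuring seamless synchronization and context-aware responses. To enhance physical realism and safety, we integrate an actor-aware motion tracking policy trained with reinforcement learning, which dynamically adapts to the interaction partner's movements while avoiding artifacts such as foot sliding and penetration. Extensive experiments on the Inter-X and InterHuman datasets show significant improvements in motion quality, interaction continuity, and physical plausibility over state-of-the-art methods. Our framework is validated in real-world applications, including a virtual reality interface for human-robot interaction, showcasing its potential for advancing human-robot collaboration.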

Title: AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation

Authors: Zhiwen Li, Zhongjie Duan, Die Chen, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02107
Pdf URL: https://arxiv.org/pdf/2508.02107
Copy Paste: [[2508.02107]] AutoLoRA: Automatic LoRA Retrieval and Fine-Grained Gated Fusion for Text-to-Image Generation(https://arxiv.org/abs/2508.02107)
Keywords: generation
Abstract: Despite recent advances in photorealistic image generation through large-scale models like FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. While low-rank adaptation (LoRA) has demonstrated efficacy in enabling model customization with minimal parameter overhead, the effective utilization of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal strategies for multi-LoRA fusion. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) a weight-encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, eliminating dependence on original training data, and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to optimally integrate multiple LoRA modules during generation. Our approach achieves significant improvement in image generation performance, thereby facilitating scalable and data-efficient enhancement of foundational models. This work establishes a critical bridge between the fragmented landscape of community-developed LoRAs and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration.
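Abstract (translated from Chinese): Despite recent advances in photorealistic image generation through large-scale models such as FLUX and Stable Diffusion v3, the practical deployment of these architectures remains constrained by their inherent intractability to parameter fine-tuning. Although low-rank adaptation (LoRA) has proven effective at enabling model customization with minimal parameter overhead, the effective use of distributed open-source LoRA modules faces three critical challenges: sparse metadata annotation, the requirement for zero-shot adaptation capabilities, and suboptimal strategies for multi-LoRA fusion. To address these limitations, we introduce a novel framework that enables semantic-driven LoRA retrieval and dynamic aggregation through two key components: (1) a weight-encoding-based LoRA retriever that establishes a shared semantic space between LoRA parameter matrices and text prompts, eliminating dependence on the original training data, and (2) a fine-grained gated fusion mechanism that computes context-specific fusion weights across network layers and diffusion timesteps to optimally integrate multiple LoRA modules during generation. Our approach achieves a significant improvement in image generation performance, thereby facilitating scalable and data-efficient enhancement of foundation models. This work builds a critical bridge between the fragmented landscape of community-developed LoRAs and practical deployment requirements, enabling collaborative model evolution through standardized adapter integration.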

Title: Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models

Authors: Tai An, Ruwu Cai, Yanzhe Zhang, Yang Liu, Hao Chen, Pengcheng Xie, Sheng Chang, Yiwu Yao, Gongyi Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02128
Pdf URL: https://arxiv.org/pdf/2508.02128
Copy Paste: [[2508.02128]] Amber Pruner: Leveraging N:M Activation Sparsity for Efficient Prefill in Large Language Models(https://arxiv.org/abs/2508.02128)
Keywords: generation, generative
Abstract: In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique critical for accelerating inference. While prior work has primarily focused on weight sparsity, it often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) demonstrate that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further enhance generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.
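Abstract (translated from Chinese): In the era of large language models (LLMs), N:M sparsity has emerged as a structured compression technique that is critical for accelerating inference. Prior work has focused primarily on weight sparsity, which often suffers from significant accuracy degradation. Activation sparsity, though promising, is typically training-dependent and faces challenges in generalization. To address these limitations, we introduce Amber Pruner, a training-free N:M activation sparsity method designed specifically for the prefill stage, targeting the acceleration of linear projection layers in LLMs. Extensive experiments across multiple models and sparsity ratios (2:4, 4:8, and 8:16) show that Amber Pruner can effectively sparsify and accelerate more than 55% of linear computations without requiring model retraining. To further improve generality and efficiency, we propose Outstanding-sparse, a unified framework that integrates Amber Pruner with post-training W8A8 quantization. Our approach preserves strong performance across a range of downstream tasks, with notable advantages in generative tasks. This work pioneers a new frontier in activation sparsity, providing foundational insights that are poised to guide the co-evolution of algorithms and architectures in the design of next-generation AI systems.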
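
For readers unfamiliar with the N:M pattern named in this abstract, the sketch below keeps at most N non-zero values in every group of M consecutive activations. Selecting survivors by absolute magnitude is an assumption made for illustration; the abstract does not state Amber Pruner's selection criterion, and `nm_sparsify` is a hypothetical helper, not the authors' code.

```python
from typing import List


def nm_sparsify(x: List[float], n: int, m: int) -> List[float]:
    """Zero all but the n largest-magnitude values in each group of m.

    Magnitude-based selection is an illustrative assumption.
    """
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(x) % m != 0:
        raise ValueError("length of x must be a multiple of m")
    out: List[float] = []
    for start in range(0, len(x), m):
        group = x[start:start + m]
        # Indices of the n entries with the largest absolute value.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


# One row of activations, pruned to the 2:4 pattern.
acts = [0.1, -2.0, 0.5, 3.0, 1.0, 0.0, -0.2, 0.3]
sparse = nm_sparsify(acts, n=2, m=4)
print(sparse)
```

The fixed group structure is what makes this pattern hardware-friendly: every group of M has the same number of retained entries, so the layout is regular even though the retained positions vary.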

Title: A Neural Quality Metric for BRDF Models

Authors: Behnaz Kavoosighafi, Rafal K. Mantiuk, Saghi Hajisharif, Ehsan Miandji, Jonas Unger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02131
Pdf URL: https://arxiv.org/pdf/2508.02131
Copy Paste: [[2508.02131]] A Neural Quality Metric for BRDF Models(https://arxiv.org/abs/2508.02131)
Keywords: quality assessment
Abstract: Accurately evaluating the quality of bidirectional reflectance distribution function (BRDF) models is essential for photo-realistic rendering. Traditional BRDF-space metrics often employ numerical error measures that fail to capture perceptual differences evident in rendered images. In this paper, we introduce the first perceptually informed neural quality metric for BRDF evaluation that operates directly in BRDF space, eliminating the need for rendering during quality assessment. Our metric is implemented as a compact multi-layer perceptron (MLP), trained on a dataset of measured BRDFs supplemented with synthetically generated data and labelled using a perceptually validated image-space metric. The network takes as input paired samples of reference and approximated BRDFs and predicts their perceptual quality in terms of just-objectionable-difference (JOD) scores. We show that our neural metric achieves significantly higher correlation with human judgments than existing BRDF-space metrics. While its performance as a loss function for BRDF fitting remains limited, the proposed metric offers a perceptually grounded alternative for evaluating BRDF models.
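Abstract (translated from Chinese): Accurately evaluating the quality of bidirectional reflectance distribution function (BRDF) models is essential for photo-realistic rendering. Traditional BRDF-space metrics often rely on numerical error measures that fail to capture the perceptual differences evident in rendered images. In this paper, we introduce the first perceptually informed neural quality metric for BRDF evaluation that operates directly in BRDF space, eliminating the need for rendering during quality assessment. Our metric is implemented as a compact multi-layer perceptron (MLP), trained on a dataset of measured BRDFs supplemented with synthetically generated data and labelled using a perceptually validated image-space metric. The network takes paired samples of reference and approximated BRDFs as input and predicts their perceptual quality in terms of just-objectionable-difference (JOD) scores. We show that our neural metric achieves significantly higher correlation with human judgments than existing BRDF-space metrics. Although its performance as a loss function for BRDF fitting remains limited, the proposed metric offers a perceptually grounded alternative for evaluating BRDF models.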

Title: AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models

Authors: Die Chen, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, Yinda Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02151
Pdf URL: https://arxiv.org/pdf/2508.02151
Copy Paste: [[2508.02151]] AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models(https://arxiv.org/abs/2508.02151)
Keywords: generation
Abstract: Recent breakthroughs in text-to-image diffusion models have significantly enhanced both the visual fidelity and semantic controllability of generated images. However, fine-grained control over aesthetic attributes remains challenging, especially when users require continuous and intensity-specific adjustments. Existing approaches often rely on vague textual prompts, which are inherently ambiguous in expressing both the aesthetic semantics and the desired intensity, or depend on costly human preference data for alignment, limiting their scalability and practicality. To address these limitations, we propose AttriCtrl, a plug-and-play framework for precise and continuous control of aesthetic attributes. Specifically, we quantify abstract aesthetics by leveraging semantic similarity from pre-trained vision-language models, and employ a lightweight value encoder that maps scalar intensities in $[0,1]$ to learnable embeddings within diffusion-based generation. This design enables intuitive and customizable aesthetic manipulation, with minimal training overhead and seamless integration into existing generation pipelines. Extensive experiments demonstrate that AttriCtrl achieves accurate control over individual attributes as well as flexible multi-attribute composition. Moreover, it is fully compatible with popular open-source controllable generation frameworks, showcasing strong integration capability and practical utility across diverse generation scenarios.
摘要：文本到图像扩散模型的最新突破显着增强了生成图像的视觉保真度和语义可控性。但是，对美学属性的细粒度控制仍然具有挑战性，尤其是当用户需要进行连续和强度的调整时。现有的方法通常依赖于模糊的文本提示，这些提示本质上是表达美学语义和所需强度的含糊不清的，或者依赖于昂贵的人类偏好数据来对齐，限制了它们的可扩展性和实用性。为了解决这些限制，我们提出了Attrictrl，这是一个插件框架，以精确，连续地控制美学属性。具体而言，我们通过利用预训练的视觉语言模型的语义相似性来量化抽象美学，并采用轻巧的值编码器，该编码器将$ [0,1] $的标量强度映射到基于扩散的生成中的可学习嵌入。该设计可实现直观和可定制的美学操作，其开销最少，并无缝集成到现有的一代管道中。广泛的实验表明，Attrictrl可以准确控制单个属性以及灵活的多属性组成。此外，它与流行的开源可控生成框架完全兼容，展示了各种一代情景的强大集成能力和实用性。

Title: DreamPainter: Image Background Inpainting for E-commerce Scenarios

Authors: Sijie Zhao, Jing Cheng, Yaoyao Wu, Hao Xu, Shaohui Jiao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02155
Pdf URL: https://arxiv.org/pdf/2508.02155
Copy Paste: [[2508.02155]] DreamPainter: Image Background Inpainting for E-commerce Scenarios(https://arxiv.org/abs/2508.02155)
Keywords: generation
Abstract: Although diffusion-based image generation has been widely explored and applied, background generation tasks in e-commerce scenarios still face significant challenges. The first challenge is to ensure that the generated products are consistent with the given product inputs while maintaining a reasonable spatial arrangement, harmonious shadows, and reflections between foreground products and backgrounds. Existing inpainting methods fail to address this due to the lack of domain-specific data. The second challenge involves the limitation of relying solely on text prompts for image control, as effectively integrating visual information to achieve precise control in inpainting tasks remains underexplored. To address these challenges, we introduce DreamEcom-400K, a high-quality e-commerce dataset containing accurate product instance masks, background reference images, text prompts, and aesthetically pleasing product images. Based on this dataset, we propose DreamPainter, a novel framework that not only utilizes text prompts for control but also flexibly incorporates reference image information as an additional control signal. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, maintaining high product consistency while effectively integrating both text prompt and reference image information.
摘要：尽管基于扩散的图像晶体已被广泛探索和应用，但电子商务场景中的背景生成任务仍然面临重大挑战。第一个挑战是确保生成的产品与给定的产品输入一致，同时保持合理的空间布置，和谐的阴影以及前景产品和背景之间的反射。由于缺乏特定于域的数据，现有的涂料方法无法解决此问题。第二个挑战涉及仅依靠文本提示进行图像控制的局限性，因为有效整合视觉信息以实现在填充任务中的精确控制仍然没有被忽视。为了应对这些挑战，我们介绍了DreameCom-400K，这是一个高质量的电子商务数据集，其中包含准确的产品实例掩码，背景参考图像，文本提示和美观的产品图像。基于此数据集，我们提出了DreamPainter，这是一个新颖的框架，不仅利用文本提示进行控制，而且还可以灵活地将参考图像信息作为附加控制信号。广泛的实验表明，我们的方法显着胜过最先进的方法，保持较高的产品一致性，同时有效地整合了文本提示和参考图像信息。

Title: Subject or Style: Adaptive and Training-Free Mixture of LoRAs

Authors: Jia-Chen Zhang, Yu-Jie Xiong
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2508.02165
Pdf URL: https://arxiv.org/pdf/2508.02165
Copy Paste: [[2508.02165]] Subject or Style: Adaptive and Training-Free Mixture of LoRAs(https://arxiv.org/abs/2508.02165)
Keywords: generation
Abstract: Fine-tuning models via Low-Rank Adaptation (LoRA) demonstrates remarkable performance in subject-driven or style-driven generation tasks. Studies have explored combinations of different LoRAs to jointly generate learned styles and content. However, current methods struggle to balance the original subject and style, and often require additional training. Recently, K-LoRA proposed a training-free LoRA fusion method, but it involves multiple hyperparameters, making it difficult to adapt to all styles and subjects. In this paper, we propose EST-LoRA, a training-free adaptive LoRA fusion method. It comprehensively considers three critical factors: Energy of matrix, Style discrepancy scores, and Time steps (the E, S, and T in EST-LoRA). Analogous to the Mixture of Experts (MoE) architecture, the model adaptively selects between subject LoRA and style LoRA within each attention layer. This integrated selection mechanism ensures balanced contributions from both components during the generation process. Experimental results show that EST-LoRA outperforms state-of-the-art methods in both qualitative and quantitative evaluations and achieves faster generation speed compared to other efficient fusion approaches. Our code is publicly available at: this https URL.
摘要：通过低秩适应（LoRA）微调的模型在主题驱动或风格驱动的生成任务中表现出色。已有研究探索了不同LoRA的组合，以共同生成所学的风格和内容。然而，现有方法难以平衡原始主题与风格，并且通常需要额外的训练。最近，K-LoRA提出了一种免训练的LoRA融合方法，但它涉及多个超参数，难以适应所有风格和主题。在本文中，我们提出EST-LoRA，一种免训练的自适应LoRA融合方法。它综合考虑了三个关键因素：矩阵能量（Energy）、风格差异分数（Style discrepancy scores）和时间步（Time steps）。类似于专家混合（MoE）架构，该模型在每个注意力层中自适应地在主题LoRA和风格LoRA之间进行选择。这种集成的选择机制确保了两个组件在生成过程中的均衡贡献。实验结果表明，EST-LoRA在定性和定量评估中均优于最先进的方法，并且与其他高效融合方法相比实现了更快的生成速度。我们的代码公开于：此 https URL。

Title: After the Party: Navigating the Mapping From Color to Ambient Lighting

Authors: Florin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02168
Pdf URL: https://arxiv.org/pdf/2508.02168
Copy Paste: [[2508.02168]] After the Party: Navigating the Mapping From Color to Ambient Lighting(https://arxiv.org/abs/2508.02168)
Keywords: restoration
Abstract: Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts, such as illumination inconsistencies, texture leakage, and color distortion, primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity and luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. The benchmark, codes, and models are available at this http URL.
摘要：实际场景中的光照本质上是复杂的，涉及彩色光源、遮挡以及多种材质相互作用，从而产生复杂的反射和阴影效果。然而，现有方法通常假设单一光源或均匀的白平衡光照，从而过度简化了这一挑战，使其中许多复杂性未得到解决。在本文中，我们介绍了CL3AN，这是首个同类的大规模高分辨率数据集，旨在促进将多种彩色光源下拍摄的图像恢复为其环境光归一化的对应图像。通过基准测试，我们发现领先的方法经常产生伪影，例如光照不一致、纹理泄漏和颜色失真，这主要是由于它们将光照与反射率精确解耦的能力有限。受此启发，我们借鉴Retinex模型的原理，通过一个利用显式色度和亮度分量引导的新颖学习框架实现了这种理想的分解。在现有基准和我们的数据集上的广泛评估证明了我们方法的有效性，在非均匀彩色光照和材质相关的反射率变化下展现出更强的鲁棒性，同时保持了极具竞争力的计算成本。基准、代码和模型可在此 http URL 获取。

Title: CAAD: Context-Aware Adaptive Decoding for Truthful Text Generation

Authors: Manh Nguyen, Sunil Gupta, Hung Le
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.02184
Pdf URL: https://arxiv.org/pdf/2508.02184
Copy Paste: [[2508.02184]] CAAD: Context-Aware Adaptive Decoding for Truthful Text Generation(https://arxiv.org/abs/2508.02184)
Keywords: generation
Abstract: Ensuring truthfulness in large language models remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require substantial amounts of annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives without model retraining. However, existing decoding strategies often face issues like prompt sensitivity, limited generalization, or dependence on internal model states. We propose a context-aware adaptive decoding method that leverages a compact reference grounding space, built from as few as 10 annotated examples and comprising pairs of context embeddings and next token logits from truthful responses, to enable retrieval-based logit shaping during inference. At each decoding step, our method retrieves top-N semantically similar contexts and aggregates their associated next token logits to modify the LLM's logits. Across three open-ended question-answering benchmarks, our approach achieves a 2.8 percent average improvement on TruthfulQA and further outperforms existing baselines on both Biographies and WikiQA. Experimental results also demonstrate cross-task generalization, with TruthfulQA-derived grounding enhancing biography generation. Our model-agnostic, scalable, and efficient method requires only a single generation pass, highlighting the potential of context-aware decoding for factual reliability in LLMs.
摘要：确保大语模型中的真实性仍然是可靠的文本生成的关键挑战。虽然对人类反馈的监督微调和加强学习有望，但它们需要大量注释的数据和计算资源，从而限制了可扩展性。相比之下，解码时间干预措施提供了轻巧的替代方案，而无需模型再培训。但是，现有的解码策略通常会面临诸如迅速灵敏度，有限的概括或对内部模型状态的依赖等问题。我们提出了一种情境感知的自适应解码方法，该方法利用了一个紧凑的参考接地空间，该方法由少于10个带注释的示例构建，并包括成对的上下文嵌入和接下来的标记logits，从真实的响应中启用基于检验的基于检验的logit在推理过程中的塑造。在每个解码步骤中，我们的方法都会检索顶级N的语义相似上下文，并汇总其关联的隔壁logits以修改LLM的logits。在三个开放式的提问基准测试中，我们的方法在真实性方面取得了2.8％的平均改善，并且进一步优于传记和Wikiqa的现有基线。实验结果还证明了交叉任务的概括，真实的接地增强了传记产生。我们的模型不稳定，可扩展和高效的方法仅需要一代通行证，突出了上下文感知的解码对LLM中事实可靠性的潜力。
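
Illustrative sketch (not the authors' implementation): the retrieval-based logit shaping described in the CAAD abstract can be pictured roughly as below. The cosine similarity measure, the plain averaging of retrieved logits, the additive mixing weight `alpha`, and all function names are assumptions made for illustration; the abstract only states that top-N similar contexts are retrieved and their next-token logits aggregated to modify the LLM's logits.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def shape_logits(llm_logits, context_emb, grounding, top_n=2, alpha=0.5):
    """Shift the LLM's next-token logits toward retrieved reference logits.

    grounding: list of (context_embedding, next_token_logits) pairs taken
    from truthful responses (hypothetical data layout).
    alpha: assumed mixing weight; the paper's actual rule may differ.
    """
    # Rank stored contexts by similarity to the current decoding context.
    ranked = sorted(grounding,
                    key=lambda pair: cosine(context_emb, pair[0]),
                    reverse=True)
    nearest = ranked[:top_n]
    if not nearest:
        return list(llm_logits)
    # Aggregate the retrieved logits (assumption: simple mean).
    mean_ref = [sum(ref[i] for _, ref in nearest) / len(nearest)
                for i in range(len(llm_logits))]
    # Modify the model's logits (assumption: additive shift).
    return [l + alpha * r for l, r in zip(llm_logits, mean_ref)]
```

In this toy form, a token favored by the retrieved truthful contexts gains logit mass even when the base model prefers another token.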

Title: Balancing Information Accuracy and Response Timeliness in Networked LLMs

Authors: Yigit Turkmen, Baturalp Buyukates, Melih Bastopcu
Subjects: cs.LG, cs.AI, cs.IT, cs.NI
Abstract URL: https://arxiv.org/abs/2508.02209
Pdf URL: https://arxiv.org/pdf/2508.02209
Copy Paste: [[2508.02209]] Balancing Information Accuracy and Response Timeliness in Networked LLMs(https://arxiv.org/abs/2508.02209)
Keywords: generation
Abstract: Recent advancements in Large Language Models (LLMs) have transformed many fields including scientific discovery, content generation, biomedical text mining, and educational technology. However, the substantial requirements for training data, computational resources, and energy consumption pose significant challenges for their practical deployment. A promising alternative is to leverage smaller, specialized language models and aggregate their outputs to improve overall response quality. In this work, we investigate a networked LLM system composed of multiple users, a central task processor, and clusters of topic-specialized LLMs. Each user submits categorical binary (true/false) queries, which are routed by the task processor to a selected cluster of $m$ LLMs. After gathering individual responses, the processor returns a final aggregated answer to the user. We characterize both the information accuracy and response timeliness in this setting, and formulate a joint optimization problem to balance these two competing objectives. Our extensive simulations demonstrate that the aggregated responses consistently achieve higher accuracy than those of individual LLMs. Notably, this improvement is more significant when the participating LLMs exhibit similar standalone performance.
摘要：大型语言模型（LLM）的最新进展改变了许多领域，包括科学发现，内容产生，生物医学文本挖掘和教育技术。但是，对培训数据，计算资源和能源消耗的实质要求对其实际部署构成了重大挑战。一个有希望的替代方法是利用较小的专业语言模型并汇总其输出以提高整体响应质量。在这项工作中，我们研究了一个由多个用户，中央任务处理器和主题特殊LLMS组成的网络LLM系统。每个用户都会提交分类二进制（true/false）查询，这些查询由任务处理器路由到所选$ m $ llms的群集。收集个人响应后，处理器向用户返回最终的汇总答案。我们在这种情况下表征了信息的准确性和响应及时性，并制定了一个联合优化问题，以平衡这两个竞争目标。我们广泛的模拟表明，汇总响应始终比单个LLM的响应始终获得更高的精度。值得注意的是，当参与的LLM表现出类似的独立性能时，这种改进更为重要。
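
Illustrative sketch (assumptions stated, not taken from the paper): to see why aggregating binary answers from a cluster of $m$ LLMs can beat any single model, suppose the processor uses a simple majority vote and that each LLM answers correctly with the same probability `p`, independently of the others. The abstract does not specify the aggregation rule or these modelling assumptions; `majority_accuracy` is a hypothetical helper.

```python
from math import comb


def majority_accuracy(m, p):
    """Probability that a majority of m independent voters is correct.

    Assumes an odd m (no ties) and an identical per-model accuracy p.
    """
    if m < 1 or m % 2 == 0:
        raise ValueError("m must be a positive odd integer")
    # Sum the binomial probabilities of having more than m/2 correct answers.
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m // 2 + 1, m + 1))
```

Under these assumptions, three models at 70% accuracy each give an aggregated accuracy of about 78.4%. A larger cluster raises accuracy further but typically means waiting for more responses, which is the accuracy versus timeliness tension the abstract describes.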

Title: Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor

Authors: Xiaoliu Guan, Lielin Jiang, Hanqi Chen, Xu Zhang, Jiaxing Yan, Guanzhong Wang, Yi Liu, Zetao Zhang, Yu Wu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02240
Pdf URL: https://arxiv.org/pdf/2508.02240
Copy Paste: [[2508.02240]] Forecasting When to Forecast: Accelerating Diffusion Models with Confidence-Gated Taylor(https://arxiv.org/abs/2508.02240)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable performance in visual generation tasks. However, their low inference speed limits their deployment in low-resource applications. Recent training-free approaches exploit the redundancy of features across timesteps by caching and reusing past representations to accelerate inference. Building on this idea, TaylorSeer instead uses cached features to predict future ones via Taylor expansion. However, its module-level prediction across all transformer blocks (e.g., attention or feedforward modules) requires storing fine-grained intermediate features, leading to notable memory and computation overhead. Moreover, it adopts a fixed caching schedule without considering the varying accuracy of predictions across timesteps, which can lead to degraded outputs when prediction fails. To address these limitations, we propose a novel approach to better leverage Taylor-based acceleration. First, we shift the Taylor prediction target from the module level to the last block level, significantly reducing the number of cached features. Furthermore, observing strong sequential dependencies among Transformer blocks, we propose to use the error between the Taylor-estimated and actual outputs of the first block as an indicator of prediction reliability. If the error is small, we trust the Taylor prediction for the last block; otherwise, we fall back to full computation, thereby enabling a dynamic caching mechanism. Empirical results show that our method achieves a better balance between speed and quality, achieving a 3.17x acceleration on FLUX, 2.36x on DiT, and 4.14x on Wan Video with negligible quality drop. The project page is available at this https URL.
摘要：扩散变压器（DIT）在视觉生成任务中表现出了显着的性能。但是，他们的低推理速度限制了他们在低资源应用程序中的部署。最近的无培训方法通过缓存和重复使用过去的表述加速推理，利用了时间步中的功能的冗余。在这个想法的基础上，Taylorseer使用缓存的功能通过Taylor扩展来预测未来的功能。但是，其在所有变压器块（例如注意力或前馈模块）中的模块级预测需要存储细粒的中间特征，从而导致值得注意的内存和计算开销。此外，它采用了固定的缓存时间表，而无需考虑跨时间段的预测的不同准确性，这在预测失败时可能会导致输出退化。为了解决这些局限性，我们提出了一种新型方法，以更好地利用泰勒的加速度。首先，我们将泰勒预测目标从模块级别转移到最后一个块级别，从而大大减少了缓存特征的数量。此外，在变压器块之间观察强烈的顺序依赖性，我们建议在第一个块的泰勒估计和实际输出之间使用误差作为预测可靠性的指标。如果错误很小，我们相信最后一个块的泰勒预测。否则，我们回到了完整的计算中，从而实现了动态的缓存机制。经验结果表明，我们的方法在速度和质量之间取得了更好的平衡，在通量上达到了3.17倍加速度，DIT上的2.36倍和4.14倍的WAN视频，质量降低可忽略不计。项目页面为\ href {this https url} {there。}
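
Illustrative sketch (a toy model, not the authors' code): the confidence-gated mechanism in the abstract can be pictured as follows. Features are plain lists of floats, the Taylor expansion is truncated to first order with a finite-difference derivative over unit-spaced timesteps, and the relative L2 error threshold `tau` is an assumed gating rule. The names `taylor_predict`, `gated_step`, `compute_first_block`, and `compute_full` are hypothetical.

```python
def taylor_predict(prev, curr, steps_ahead=1):
    """First-order extrapolation: f(t + k) ~ f(t) + k * (f(t) - f(t - 1))."""
    return [c + steps_ahead * (c - p) for p, c in zip(prev, curr)]


def relative_error(pred, actual):
    """Relative L2 error between a prediction and the actual output."""
    num = sum((a - b) ** 2 for a, b in zip(pred, actual)) ** 0.5
    den = sum(b * b for b in actual) ** 0.5
    return num / den if den else num


def gated_step(first_cache, last_cache, compute_first_block, compute_full,
               tau=0.1):
    """Return (last_block_output, used_taylor).

    first_cache / last_cache: (previous, current) cached outputs of the
    first and last transformer blocks. The cheap first block is always
    computed; its Taylor error decides whether to trust the last-block
    prediction or to fall back to the full forward pass.
    """
    actual_first = compute_first_block()
    predicted_first = taylor_predict(*first_cache)
    if relative_error(predicted_first, actual_first) <= tau:
        return taylor_predict(*last_cache), True
    return compute_full(), False
```

The design point is that only the first block's output is needed to decide, so a skipped step costs one block plus an extrapolation instead of the whole network.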

Title: Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning

Authors: Wenchuan Zhang, Jingru Guo, Hengzhe Zhang, Penghao Zhang, Jie Chen, Shuwan Zhang, Zhang Zhang, Yuhao Yi, Hong Bu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02258
Pdf URL: https://arxiv.org/pdf/2508.02258
Copy Paste: [[2508.02258]] Patho-AgenticRAG: Towards Multimodal Agentic Retrieval-Augmented Generation for Pathology VLMs via Reinforcement Learning(https://arxiv.org/abs/2508.02258)
Keywords: generation
Abstract: Although Vision Language Models (VLMs) have shown strong generalization in medical imaging, pathology presents unique challenges due to ultra-high resolution, complex tissue structures, and nuanced clinical semantics. These factors make pathology VLMs prone to hallucinations, i.e., generating outputs inconsistent with visual evidence, which undermines clinical trust. Existing RAG approaches in this domain largely depend on text-based knowledge bases, limiting their ability to leverage diagnostic visual cues. To address this, we propose Patho-AgenticRAG, a multimodal RAG framework with a database built on page-level embeddings from authoritative pathology textbooks. Unlike traditional text-only retrieval systems, it supports joint text-image search, enabling direct retrieval of textbook pages that contain both the queried text and relevant visual cues, thus avoiding the loss of critical image-based information. Patho-AgenticRAG also supports reasoning, task decomposition, and multi-turn search interactions, improving accuracy in complex diagnostic scenarios. Experiments show that Patho-AgenticRAG significantly outperforms existing multimodal models in complex pathology tasks like multiple-choice diagnosis and visual question answering. Our project is available at the Patho-AgenticRAG repository: this https URL.
摘要：尽管视觉语言模型（VLM）在医学成像中表现出强烈的概括，但由于超高分辨率，复杂的组织结构和细微的临床语义，病理学提出了独特的挑战。这些因素使病理VLM容易产生幻觉，即产生与视觉证据不一致的产出，这破坏了临床信任。该领域中现有的破布方法很大程度上取决于基于文本的知识库，从而限制了它们利用诊断视觉提示的能力。为了解决这个问题，我们提出了Patho-Agenticrag，这是一个多模式的RAG框架，其数据库构建在权威病理学教科书中的页面级嵌入。与传统的纯文本检索系统不同，它支持联合文本图像搜索，从而直接检索包含查询文本和相关视觉提示的教科书页面，从而避免了基于批判图像的信息的丢失。 Patho-Agenticrag还支持推理，任务分解和多转弯搜索交互，从而提高了复杂诊断方案的准确性。实验表明，在复杂的病理任务中，诸如多项选择诊断和视觉问题的回答等复杂的病理任务中的病原体质量显着优于现有多模型的现有模型。我们的项目可在Patho-Agenticrag存储库中获得：此HTTPS URL。

Title: CellForge: Agentic Design of Virtual Cell Models

Authors: Xiangru Tang, Zhuoyun Yu, Jiapeng Chen, Yan Cui, Daniel Shao, Weixu Wang, Fang Wu, Yuchen Zhuang, Wenqi Shi, Zhi Huang, Arman Cohan, Xihong Lin, Fabian Theis, Smita Krishnaswamy, Mark Gerstein
Subjects: cs.LG, cs.AI, cs.CL, q-bio.QM
Abstract URL: https://arxiv.org/abs/2508.02276
Pdf URL: https://arxiv.org/pdf/2508.02276
Copy Paste: [[2508.02276]] CellForge: Agentic Design of Virtual Cell Models(https://arxiv.org/abs/2508.02276)
Keywords: generation
Abstract: Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations quantitatively. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and inference. The framework integrates three core modules: Task Analysis for presented dataset characterization and relevant literature retrieval, Method Design, where specialized agents collaboratively develop optimized modeling strategies, and Experiment Execution for automated generation of code. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and have to collaboratively exchange solutions until they achieve a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at this https URL.
摘要：虚拟细胞建模代表了人工智能和生物学交集的新兴领域，旨在预测数量，例如定量对各种扰动的响应。但是，由于生物系统的复杂性，数据模式的异质性以及对多个学科的特定领域专业知识的需求，因此为虚拟细胞的自主构建计算模型具有挑战性。在这里，我们介绍了Cellforge，这是一种利用多代理框架的代理系统，该框架将呈现的生物学数据集和研究目标直接转换为虚拟单元的优化计算模型。更具体地说，只有原始的单细胞多媒体数据和任务描述为输入，Cellforge既输出了优化的模型架构，也可以输出可执行的代码，以训练虚拟单元格模型和推理。该框架集成了三个核心模块：用于提出的数据集特征和相关文献检索的任务分析，方法设计，专门代理协作开发优化的建模策略，以及对自动生成代码的执行。设计模块中的代理分为具有不同观点和中央主持人的专家，必须协作交换解决方案，直到达成合理的共识为止。我们使用六个不同的数据集展示了Cellforge在单细胞扰动预测中的功能，这些数据集涵盖了多种模态的基因敲除，药物治疗和细胞因子刺激。 Cellforge始终优于特定于任务的最先进方法。总体而言，Cellforge展示了与直接应对建模挑战相比，具有不同观点的LLM代理之间的迭代相互作用如何提供更好的解决方案。我们的代码在此HTTPS URL上公开可用。

Title: CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment

Authors: Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2508.02298
Pdf URL: https://arxiv.org/pdf/2508.02298
Copy Paste: [[2508.02298]] CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment(https://arxiv.org/abs/2508.02298)
Keywords: generative
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback, helping to mitigate reward hacking. However, current RLVR methods typically treat whole responses as single actions, assigning the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies and inefficient learning. Methods like PPO provide credit assignment through value estimation, but often yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-by-step judgments for each reasoning step, but they require high-quality process supervision labels and are time-consuming when applied in online reinforcement learning (RL). To overcome these limitations, we introduce a simple but efficient method, Credit Assignment Policy Optimization (CAPO). Given a reasoning response rollout from the policy model, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critiques in one pass, thereby providing verifiable token-level rewards to refine the tokens that were originally assigned identical rule-based rewards. This enables more fine-grained credit assignment in an effective way. Furthermore, to enhance the accuracy and robustness of CAPO, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments using different backbones, such as Llama and Qwen models of different sizes, show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across six challenging mathematical benchmarks and three out-of-domain benchmarks.
摘要：通过可验证的奖励（RLVR）的增强学习通过使用基于规则的二进制反馈来提高大语言模型（LLMS）的推理能力，从而有助于减轻奖励黑客攻击。但是，当前的RLVR方法通常将整个响应视为单个动作，为每个令牌分配相同的奖励。这种粗粒的反馈会阻碍精确的信用分配，使模型难以确定哪些推理步骤导致成功或失败，并且通常会导致次优政策和效率低下的学习。 PPO之类的方法通过价值估计提供了信用分配，但由于采样有限，通常会产生不准确和无法验证的信号。另一方面，使用流程奖励模型的方法可以为每个推理步骤提供逐步判断，但是它们需要高质量的过程监督标签，并且在在线强化学习（RL）中应用时会耗时。为了克服这些限制，我们引入了一个简单但有效的方法信用分配政策优化（CAPO）。鉴于策略模型的推理响应推出，Capo直接利用了现成的通用LLM作为生成过程奖励模型（LLM-AS-GENPRM），以一个通行证产生所有渐进式批评，从而提供可验证的可验证的令牌级别的奖励，以优化基于统治的统治统治统治的奖励。这可以有效地使更多细粒度的信贷分配。此外，为了提高CAPO的准确性和鲁棒性，我们采用了随着产生的批评数量扩展的投票机制。使用不同型甲板和QWEN模型以及不同尺寸的不同骨架的广泛实验表明，CAPO始终在六个具有挑战性的数学基准和三个室外基准的基于学习的基于学习和基于RL的基于学习的微调方法上。
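
Illustrative sketch (heavily simplified, not the CAPO algorithm itself): one way to picture voting over several generated critiques and turning the result into token-level rewards. Each critique is represented here as the set of step indices it flags as wrong; a step counts as wrong when a strict majority of critiques flag it, and tokens in wrong steps receive the outcome reward minus a fixed `penalty`. The majority rule, the penalty value, and both function names are assumptions for illustration only.

```python
def vote_wrong_steps(critiques, num_steps):
    """Return the set of steps flagged as wrong by a strict majority of critiques."""
    threshold = len(critiques) / 2
    return {step for step in range(num_steps)
            if sum(step in critique for critique in critiques) > threshold}


def token_rewards(step_lengths, outcome_reward, critiques, penalty=0.5):
    """Expand step-level votes into one reward per token.

    step_lengths: number of tokens in each reasoning step.
    outcome_reward: the rule-based reward originally shared by all tokens.
    """
    wrong = vote_wrong_steps(critiques, len(step_lengths))
    rewards = []
    for step, n_tokens in enumerate(step_lengths):
        reward = outcome_reward - penalty if step in wrong else outcome_reward
        rewards.extend([reward] * n_tokens)
    return rewards
```

For example, with steps of 2, 1, and 3 tokens, an outcome reward of 1.0, and three critiques of which two flag step 1, only the single token of step 1 is penalized; adding more critiques makes the vote less sensitive to any one faulty critique.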

Title: Qwen-Image Technical Report

Authors: Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, Deqing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingkun Wen, Wensen Feng, Xiaoxiao Xu, Yi Wang, Yichang Zhang, Yongqiang Zhu, Yujia Wu, Yuxuan Cai, Zenan Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02324
Pdf URL: https://arxiv.org/pdf/2508.02324
Copy Paste: [[2508.02324]] Qwen-Image Technical Report(https://arxiv.org/abs/2508.02324)
Keywords: generation
Abstract: We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing across multiple benchmarks.
摘要：我们提出了QWEN-IMAGE，这是QWEN系列中图像生成基础模型，在复杂的文本渲染和精确的图像编辑中取得了重大进展。为了应对复杂文本渲染的挑战，我们设计了一条全面的数据管道，其中包括大规模数据收集，过滤，注释，综合和平衡。此外，我们采用了一种渐进培训策略，该策略从非文本到文本渲染开始，从简单到复杂的文本输入演变，并逐渐扩展到段落级的描述。这种课程学习方法显着增强了该模型的本地文本渲染功能。结果，QWEN图像不仅在英语等字母语言中表现出色，而且在更具挑战性的逻辑语言（如中文）上取得了惊人的进步。为了增强图像编辑的一致性，我们引入了改进的多任务培训范式，该范围不仅包含传统的文本图像（T2I）和文本图像图像图像（TI2I）任务，还包括图像到图像图像（I2I）重建，有效地使QWEN2.5-d-fl和mmdit之间的延伸表示有效。此外，我们将原始图像分别馈送到QWEN2.5-VL和VAE编码器中，分别获得语义和重建表示。这种双重编码机制使编辑模块能够在保持语义一致性和保持视觉保真度之间取得平衡。 Qwen-Image达到了最先进的性能，表明了其在图像生成和跨多个基准测试中的强大功能。

Title: MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models

Authors: Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, Xindian Ma
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02343
Pdf URL: https://arxiv.org/pdf/2508.02343
Copy Paste: [[2508.02343]] MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models(https://arxiv.org/abs/2508.02343)
Keywords: generation
Abstract: Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and matrix multiplication kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision formats (MXFP4 or MXFP6) incur excessive quantization error. Our algorithm selectively allocates higher-precision channels to preserve accuracy while maintaining compute efficiency. MicroMix achieves competitive or superior performance across diverse downstream tasks, including zero-shot and few-shot learning, language modeling, code generation, and mathematical reasoning. On both consumer-grade (RTX 5070Ti laptop) and server-grade (RTX 5090) GPUs, our kernel delivers at least 20% faster execution than TensorRT-FP8. Furthermore, when applied to various Llama and Qwen models, MicroMix consistently improves prefill latency and memory efficiency across a range of batch sizes compared to TensorRT baselines. Our code is available at this https URL.
摘要：量化通过低精度对应物替换原始的高精度矩阵，可以显着加速大语言模型（LLMS）的推断。重量激活量化的最新进展主要集中在将权重和激活映射到INT4格式上。尽管NVIDIA的Blackwell体系结构中的新型FP4张量核心在FP16上提供了高达4倍的速度，但由于不匹配的数据格式，现有的基于INT4的内核无法完全利用此功能。为了弥合此差距，我们提出了基于显微镜（MX）数据格式的共同设计的混合精液量化算法和矩阵乘法内核。 Micromix内核为Blackwell体系结构量身定制，支持MXFP4，MXFP6和MXFP8通道的任意组合，并产生BFLOAT16输出。为了在每个线性层的准确性和效率之间实现良好的权衡，我们引入了量化阈值，以识别较低精确格式（MXFP4或MXFP6）的激活元素会导致过多的量化误差。我们的算法有选择地分配了更高精确的通道，以保持准确性，同时保持计算效率。 Micromix在各种下游任务中实现竞争性或卓越的表现，包括零击和几乎没有射击的学习，语言建模，代码生成和数学推理。在消费者级（RTX 5070TI笔记本电脑）和服务器级（RTX 5090）GPU上，我们的内核可以比Tensorrt-FP8快20％。此外，与Tensorrt基线相比，MicroMix应用于各种骆驼和QWEN模型时，始终可以提高一系列批次尺寸的预填充潜伏期和记忆效率。我们的代码可在此HTTPS URL上找到。

Title: Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering

Authors: Xu Wang, Shengeng Tang, Fei Wang, Lechao Cheng, Dan Guo, Feng Xue, Richang Hong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02362
Pdf URL: https://arxiv.org/pdf/2508.02362
Copy Paste: [[2508.02362]] Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering(https://arxiv.org/abs/2508.02362)
Keywords: generation
Abstract: Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio visual data and the inherent ambiguity in mapping acoustics to lip motion pose significant challenges in terms of scalability and robustness. To address these issues, we propose Text2Lip, a viseme-centric framework that constructs an interpretable phonetic-visual bridge by embedding textual input into structured viseme sequences. These mid-level units serve as a linguistically grounded prior for lip motion prediction. Furthermore, we design a progressive viseme-audio replacement strategy based on curriculum learning, enabling the model to gradually transition from real audio to pseudo-audio reconstructed from enhanced viseme features via cross-modal attention. This allows for robust generation in both audio-present and audio-free scenarios. Finally, a landmark-guided renderer synthesizes photorealistic facial videos with accurate lip synchronization. Extensive evaluations show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness, establishing a new paradigm for controllable and flexible talking face generation. Our project homepage is this https URL.
摘要：产生语义相干和视觉上准确的说话面孔需要弥合语言意义与面部表情之间的差距。尽管音频驱动的方法仍然普遍存在，但它们对高质量的配对视觉数据的依赖以及将声学绘制声学绘制到唇部运动的固有歧义在可扩展性和鲁棒性方面构成了重大挑战。为了解决这些问题，我们提出了Text2Lip，这是一个以Viseme为中心的框架，该框架通过将文本输入嵌入结构化的观察序列中来构建可解释的语音 - 视觉桥梁。这些中层单元作为唇部运动预测的语言基础。此外，我们根据课程学习设计了一种渐进的Viseme-Audio替换策略，使该模型能够通过交叉模式的注意从增强的观察特征重建，从真实音频转变为伪audio。这允许在音频出现和无音频场景中生成强大的生成。最后，具有里程碑意义的引导者通过准确的唇部同步合成了逼真的面部视频。广泛的评估表明，Text2LIP在语义忠诚，视觉现实主义和模态稳健性中的现有方法优于现有方法，为可控且灵活的说话面部生成建立了新的范式。我们的项目主页是此HTTPS URL。

Title: Uni-Layout: Integrating Human Feedback in Unified Layout Generation and Evaluation

Authors: Shuo Lu, Yanyin Chen, Wei Feng, Jiahao Fan, Fengheng Li, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Jian Liang
Subjects: cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2508.02374
Pdf URL: https://arxiv.org/pdf/2508.02374
Copy Paste: [[2508.02374]] Uni-Layout: Integrating Human Feedback in Unified Layout Generation and Evaluation(https://arxiv.org/abs/2508.02374)
Keywords: generation
Abstract: Layout generation plays a crucial role in enhancing both user experience and design efficiency. However, current approaches suffer from task-specific generation capabilities and perceptually misaligned evaluation metrics, leading to limited applicability and ineffective measurement. In this paper, we propose Uni-Layout, a novel framework that achieves unified generation, human-mimicking evaluation, and alignment between the two. For universal generation, we incorporate various layout tasks into a single taxonomy and develop a unified generator that handles background-constrained or element-content-constrained tasks via natural language prompts. To introduce human feedback for the effective evaluation of layouts, we build Layout-HF100k, the first large-scale human feedback dataset with 100,000 expertly annotated layouts. Based on Layout-HF100k, we introduce a human-mimicking evaluator that integrates visual and geometric information, employing a Chain-of-Thought mechanism to conduct qualitative assessments alongside a confidence estimation module to yield quantitative measurements. For better alignment between the generator and the evaluator, we integrate them into a cohesive system by adopting Dynamic-Margin Preference Optimization (DMPO), which dynamically adjusts margins based on preference strength to better align with human judgments. Extensive experiments show that Uni-Layout significantly outperforms both task-specific and general-purpose methods. Our code is publicly available at this https URL.
摘要：布局生成在提高用户体验和设计效率方面起着至关重要的作用。然而，当前的方法具有特定于任务的生成能力和感知不一致的评估指标，从而导致适用性有限和无效测量。在本文中，我们提出了\ textit {uni-layout}，这是一个新的框架，可以实现统一的产生，模仿人类的评估和两者之间的对齐。对于通用生成，我们将各种布局任务纳入单个分类法中，并开发统一的发电机，该统一发电机通过自然语言提示来处理背景或元素内容的约束任务。为了介绍人类反馈以进行有效评估布局，我们构建\ textit {layout-hf100k}，这是第一个具有100,000个专业注释布局的大规模人类反馈数据集。基于\ textIt {layout-hf100k}，我们引入了一个模仿人类的评估者，该评估者集成了视觉和几何信息，采用了经过思考的机制来进行定性评估以及置信估计模块，以产生定量测量。为了更好地对齐生成器和评估器，我们通过采用动态 - 利润优先优化（DMPO）将它们整合到凝聚系统中，该优化（DMPO）会根据偏好强度动态调整边缘，以更好地与人类判断更好地保持一致。广泛的实验表明，\ textit {uni-layout}显着优于特定于任务和通用方法。我们的代码在此HTTPS URL上公开可用。
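
The abstract describes DMPO only at a high level: margins grow with preference strength. The following is a minimal sketch of that idea, assuming a DPO-style logistic loss with an additive margin. The function name, the parameters `beta` and `margin_scale`, and the linear margin rule are illustrative assumptions, not the paper's formulation.

```python
import math


def dmpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose,
              strength, beta=0.1, margin_scale=1.0):
    """DPO-style loss with a margin that grows with preference strength.

    Assumption: the margin is margin_scale * strength, where strength >= 0
    encodes how strongly annotators preferred the winning layout.
    """
    # Implicit reward gap between the preferred and the rejected layout,
    # measured relative to a frozen reference model.
    reward_gap = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    x = reward_gap - margin_scale * strength
    # Numerically stable -log(sigmoid(x)) = softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# A strongly preferred pair must clear a larger margin, so it is penalized more
# for the same reward gap.
weak = dmpo_loss(-1.0, -2.0, -1.5, -1.5, strength=0.2)
strong = dmpo_loss(-1.0, -2.0, -1.5, -1.5, strength=1.0)
print(weak < strong)
```

With `strength=0` the sketch reduces to the plain DPO logistic loss, which makes the margin the only moving part.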

Title: MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding

Authors: Wenwen Zeng, Yonghuang Wu, Yifan Chen, Xuan Xie, Chengqian Zhao, Feiyu Yin, Guoqing Wu, Jinhua Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02480
Pdf URL: https://arxiv.org/pdf/2508.02480
Copy Paste: [[2508.02480]] MindShot: Multi-Shot Video Reconstruction from fMRI with LLM Decoding(https://arxiv.org/abs/2508.02480)
Keywords: generative
Abstract: Reconstructing dynamic videos from fMRI is important for understanding visual cognition and enabling vivid brain-computer interfaces. However, current methods are limited to single-shot clips and fail to address the multi-shot nature of real-world experiences. Multi-shot reconstruction faces fundamental challenges: fMRI signal mixing across shots, the temporal resolution mismatch between fMRI and video obscuring rapid scene changes, and the lack of dedicated multi-shot fMRI-video datasets. To overcome these limitations, we propose a novel divide-and-decode framework for multi-shot fMRI video reconstruction. Our core innovations are: (1) A shot boundary predictor module that explicitly decomposes mixed fMRI signals into shot-specific segments. (2) Generative keyframe captioning using LLMs, which decodes robust textual descriptions from each segment, overcoming temporal blur by leveraging high-level semantics. (3) Novel large-scale data synthesis (20k samples) from existing datasets. Experimental results demonstrate that our framework outperforms state-of-the-art methods in multi-shot reconstruction fidelity. Ablation studies confirm the critical role of fMRI decomposition and semantic captioning, with decomposition significantly improving decoded caption CLIP similarity by 71.8%. This work establishes a new paradigm for multi-shot fMRI reconstruction, enabling accurate recovery of complex visual narratives through explicit decomposition and semantic prompting.
摘要：从功能磁共振成像中重建动态视频对于理解视觉认知和实现生动的脑部计算机界面非常重要。但是，当前的方法至关限于单次剪辑，无法解决现实世界体验的多弹性性质。多拍的重建面临着基本挑战：跨镜头的fMRI信号混合，fMRI和视频之间的时间分辨率不匹配，掩盖了快速场景的变化，以及缺乏专用的多拍fmri-video数据集。为了克服这些局限性，我们为多弹药fMRI视频重建提供了一个新颖的划分和数字框架。我们的核心创新是：（1）Shot Boundare预测器模块将混合fMRI信号明确分解为特定于SHOT特定的段。（2）使用LLMS的生成键帧字幕，该字幕解码每个段的强大文本描述，通过利用高级语义来克服时间模糊。（3）现有数据集中的新型大规模数据合成（20K样本）。实验结果表明，我们的框架在多弹性重建保真度中优于最先进的方法。消融研究证实了fMRI分解和语义字幕的关键作用，分解将解码的字幕夹相似性显着提高了71.8％。这项工作为多弹药fMRI重建建立了一个新的范式，从而通过明确的分解和语义提示来准确恢复复杂的视觉叙事。
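
The divide-and-decode idea in this abstract can be pictured as "split the signal at predicted shot boundaries, then decode each segment on its own". The sketch below shows only that control flow. The boundary indices and `caption_fn` are hypothetical stand-ins for the paper's shot boundary predictor and LLM captioner, whose interfaces the abstract does not give.

```python
def split_by_shot_boundaries(frames, boundaries):
    """Split a sequence of fMRI frames into per-shot segments.

    boundaries: indices at which a new shot starts (assumed predictor output).
    Out-of-range and duplicate indices are ignored.
    """
    if not frames:
        return []
    inner = [b for b in sorted(set(boundaries)) if 0 < b < len(frames)]
    cuts = [0] + inner + [len(frames)]
    return [frames[start:end] for start, end in zip(cuts, cuts[1:])]


def decode_shots(frames, boundaries, caption_fn):
    """Decode each shot-specific segment independently."""
    return [caption_fn(segment) for segment in split_by_shot_boundaries(frames, boundaries)]


# Toy usage: six frames, new shots starting at indices 2 and 4.
frames = list(range(6))
print(split_by_shot_boundaries(frames, [2, 4]))
print(decode_shots(frames, [2, 4], caption_fn=lambda seg: f"{len(seg)} frames"))
```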

Title: Toward Using Machine Learning as a Shape Quality Metric for Liver Point Cloud Generation

Authors: Khoa Tuan Nguyen, Gaeun Oh, Ho-min Park, Francesca Tozzi, Wouter Willaert, Joris Vankerschaver, Niki Rashidian, Wesley De Neve
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.02482
Pdf URL: https://arxiv.org/pdf/2508.02482
Copy Paste: [[2508.02482]] Toward Using Machine Learning as a Shape Quality Metric for Liver Point Cloud Generation(https://arxiv.org/abs/2508.02482)
Keywords: generation, generative
Abstract: While 3D medical shape generative models such as diffusion models have shown promise in synthesizing diverse and anatomically plausible structures, the absence of ground truth makes quality evaluation challenging. Existing evaluation metrics commonly measure distributional distances between training and generated sets, while the medical field requires assessing quality at the individual level for each generated shape, which demands labor-intensive expert review. In this paper, we investigate the use of classical machine learning (ML) methods and PointNet as an alternative, interpretable approach for assessing the quality of generated liver shapes. We sample point clouds from the surfaces of the generated liver shapes, extract handcrafted geometric features, and train a group of supervised ML and PointNet models to classify liver shapes as good or bad. These trained models are then used as proxy discriminators to assess the quality of synthetic liver shapes produced by generative models. Our results show that ML-based shape classifiers provide not only interpretable feedback but also complementary insights compared to expert evaluation. This suggests that ML classifiers can serve as lightweight, task-relevant quality metrics in 3D organ shape generation, supporting more transparent and clinically aligned evaluation protocols in medical shape modeling.
摘要：尽管3D医学形状生成模型（例如扩散模型）在综合多样化和解剖学上合理的结构方面表现出了希望，但缺乏地面真相使质量评估具有挑战性。现有的评估指标通常衡量培训和生成集之间的分布距离，而医疗领域则需要评估每个生成形状的单个水平的质量，这需要劳动密集型的专家审查。在本文中，我们研究了使用古典机器学习（ML）方法和PointNet作为评估产生肝形状质量的替代方法的方法。我们从生成的肝脏形状的表面采样点云，提取手工制作的几何特征，并训练一组监督的ML和PointNet模型，以将肝形状分类为好或坏。然后，这些训练有素的模型被用作替代歧视者，以评估生成模型产生的合成肝形状的质量。我们的结果表明，与专家评估相比，基于ML的形状分类器不仅提供了可解释的反馈，还提供了互补的见解。这表明ML分类器可以作为3D器官形状生成的轻量级，与任务相关的质量指标，从而支持医学形状建模中更透明和临床上的评估方案。
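
As a rough illustration of the pipeline in this abstract (sample a point cloud, extract handcrafted geometric features, classify the shape as good or bad), here is a minimal sketch. The specific features and the threshold rule are assumptions chosen for brevity. The paper trains supervised ML and PointNet models, and the abstract does not list its feature set.

```python
import math


def geometric_features(points):
    """Compute simple handcrafted descriptors for a list of (x, y, z) points."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    extents = [max(p[i] for p in points) - min(p[i] for p in points) for i in range(3)]
    radii = [math.dist(p, centroid) for p in points]
    mean_radius = sum(radii) / n
    std_radius = math.sqrt(sum((r - mean_radius) ** 2 for r in radii) / n)
    return {
        "extent_x": extents[0],
        "extent_y": extents[1],
        "extent_z": extents[2],
        "mean_radius": mean_radius,
        "std_radius": std_radius,
    }


def classify_shape(features, max_radius_cv=0.5):
    """Toy stand-in for a trained classifier (hypothetical threshold rule)."""
    if features["mean_radius"] == 0:
        return "bad"
    radius_cv = features["std_radius"] / features["mean_radius"]
    return "good" if radius_cv <= max_radius_cv else "bad"


# Toy usage: the eight corners of a cube are perfectly regular.
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
print(classify_shape(geometric_features(cube)))
```

Because the decision depends on named features, each prediction can be traced back to a geometric property, which is the kind of interpretability the abstract refers to.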

Title: Federated Graph Unlearning

Authors: Yuming Ai, Xunkai Li, Jiaqi Chao, Bowen Fan, Zhengyu Wu, Yinlin Zhu, Rong-Hua Li, Guoren Wang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.02485
Pdf URL: https://arxiv.org/pdf/2508.02485
Copy Paste: [[2508.02485]] Federated Graph Unlearning(https://arxiv.org/abs/2508.02485)
Keywords: generation
Abstract: The demand for data privacy has led to the development of frameworks like Federated Graph Learning (FGL), which facilitate decentralized model training. However, a significant operational challenge in such systems is adhering to the right to be forgotten. This principle necessitates robust mechanisms for two distinct types of data removal: the selective erasure of specific entities and their associated knowledge from local subgraphs, and the wholesale removal of a user's entire dataset and influence. Existing methods often struggle to fully address both unlearning requirements, frequently resulting in incomplete data removal or the persistence of residual knowledge within the system. This work introduces a unified framework, conceived to provide a comprehensive solution to these challenges. The proposed framework employs a bifurcated strategy tailored to the specific unlearning request. For fine-grained Meta Unlearning, it uses prototype gradients to direct the initial local forgetting process, which is then refined by generating adversarial graphs to eliminate any remaining data traces among affected clients. In the case of complete client unlearning, the framework utilizes adversarial graph generation exclusively to purge the departed client's contributions from the remaining network. Extensive experiments on multiple benchmark datasets validate the proposed approach. The framework achieves substantial improvements in model prediction accuracy across both client unlearning and Meta Unlearning scenarios when compared to existing methods. Furthermore, additional studies confirm its utility as a plug-in module, where it materially enhances the predictive capabilities and unlearning effectiveness of other established methods.
摘要：对数据隐私的需求导致了联合图学习（FGL）等框架的发展，该框架促进了分散的模型培训。但是，这种系统中的重大运营挑战是遵守被遗忘的权利。该原理需要对两种不同类型的数据删除类型的鲁棒机制：选择性擦除特定实体及其从本地子图中的相关知识以及用户的整个数据集和影响的批发去除。现有的方法通常难以完全满足未学习的要求，通常导致删除数据不完整或系统内剩余知识的持久性。这项工作引入了一个统一的框架，旨在为这些挑战提供全面的解决方案。拟议的框架采用了针对特定未学习请求的分叉策略。对于细粒度的元学习，它使用原型梯度来指导初始的本地遗忘过程，然后通过生成对抗图以消除受影响客户端的任何剩余数据跟踪。在完整的客户端学习的情况下，该框架使用对抗图生成，专门从其余网络中清除已故客户端的贡献。多个基准数据集的广泛实验验证了所提出的方法。与现有方法相比，该框架在客户端和元研究方案的模型预测准确性方面取得了重大改进。此外，其他研究证实了其作为插件模块的实用性，在该模块中，它实质上增强了其他已建立方法的预测能力和未学习效率。

Title: AnalogCoder-Pro: Unifying Analog Circuit Generation and Optimization via Multi-modal LLMs

Authors: Yao Lai, Souradip Poddar, Sungyoung Lee, Guojin Chen, Mengkang Hu, Bei Yu, Ping Luo, David Z. Pan
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.02518
Pdf URL: https://arxiv.org/pdf/2508.02518
Copy Paste: [[2508.02518]] AnalogCoder-Pro: Unifying Analog Circuit Generation and Optimization via Multi-modal LLMs(https://arxiv.org/abs/2508.02518)
Keywords: generation, generative
Abstract: Despite advances in analog design automation, analog front-end design still heavily depends on expert intuition and iterative simulations, underscoring critical gaps in fully automated optimization for performance-critical applications. Recently, the rapid development of Large Language Models (LLMs) has brought new promise to analog design automation. However, existing work remains in its early stages, and holistic joint optimization for practical end-to-end solutions remains largely unexplored. We propose AnalogCoder-Pro, a unified multimodal LLM-based framework that integrates generative capabilities and optimization techniques to jointly explore circuit topologies and optimize device sizing, automatically generating performance-specific, fully sized schematic netlists. AnalogCoder-Pro employs rejection sampling for fine-tuning LLMs on high-quality synthesized circuit data and introduces a multimodal diagnosis and repair workflow based on functional specifications and waveform images. By leveraging LLMs to interpret generated circuit netlists, AnalogCoder-Pro automates the extraction of critical design parameters and the formulation of parameter spaces, establishing an end-to-end workflow for simultaneous topology generation and device sizing optimization. Extensive experiments demonstrate that these orthogonal approaches significantly improve the success rate of analog circuit design and enhance circuit performance.
摘要：尽管模拟设计自动化取得了进步，但模拟前端设计仍然在很大程度上取决于专家的直觉和迭代模拟，从而强调了针对绩效至关重要应用的全自动优化的关键差距。最近，大型语言模型（LLM）的快速发展为模拟设计自动化带来了新的希望。但是，现有的工作仍处于早期阶段，并且对实用的端到端解决方案的整体关节优化基本上仍未得到探索。我们提出了AnalogCoder-Pro，这是一种基于统一的多模式LLM的框架，该框架集成了生成能力和优化技术，以共同探索电路拓扑并优化设备尺寸，自动生成性能特定于特定于特定于性能的示意图。 AnalogCoder-Pro在高质量的合成电路数据上采用了拒绝采样来调整LLM，并根据功能规格和波形图像引入了多模式诊断和修复工作流程。通过利用LLM来解释生成的电路网列，AnalogCoder-Pro可以自动提取关键设计参数和参数空间的配方，从而为同时拓扑生成和设备大小优化建立了端到端工作流程。广泛的实验表明，这些正交方法可显着提高模拟电路设计的成功率并增强电路性能。
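
The abstract mentions rejection sampling for building fine-tuning data: generate candidates and keep only those that pass a quality check. A minimal sketch of that filtering loop follows. `generate` and `passes_check` are hypothetical callables standing in for the LLM sampler and the circuit verification step, which the abstract does not specify.

```python
def rejection_sample(generate, passes_check, num_keep, max_tries=1000):
    """Collect up to num_keep candidates that pass the check.

    generate: zero-argument callable returning one candidate (assumed).
    passes_check: callable returning True for acceptable candidates (assumed).
    """
    kept = []
    for _ in range(max_tries):
        if len(kept) >= num_keep:
            break
        candidate = generate()
        if passes_check(candidate):
            kept.append(candidate)
    return kept


# Toy usage: "candidates" are integers and only even ones are accepted.
stream = iter(range(10))
print(rejection_sample(lambda: next(stream), lambda c: c % 2 == 0, num_keep=3))
```

The `max_tries` cap is a design choice of this sketch so that a checker that rarely accepts cannot loop forever.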

Title: StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes

Authors: Siyi Liu, Yujia Zheng, Yongqi Zhang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2508.02601
Pdf URL: https://arxiv.org/pdf/2508.02601
Copy Paste: [[2508.02601]] StructSynth: Leveraging LLMs for Structure-Aware Tabular Data Synthesis in Low-Data Regimes(https://arxiv.org/abs/2508.02601)
Keywords: generation, generative
Abstract: The application of machine learning on tabular data in specialized domains is severely limited by data scarcity. While generative models offer a solution, traditional methods falter in low-data regimes, and recent Large Language Models (LLMs) often ignore the explicit dependency structure of tabular data, leading to low-fidelity synthetics. To address these limitations, we introduce StructSynth, a novel framework that integrates the generative power of LLMs with robust structural control. StructSynth employs a two-stage architecture. First, it performs explicit structure discovery to learn a Directed Acyclic Graph (DAG) from the available data. Second, this learned structure serves as a high-fidelity blueprint to steer the LLM's generation process, forcing it to adhere to the learned feature dependencies and thereby ensuring the generated data respects the underlying structure by design. Our extensive experiments demonstrate that StructSynth produces synthetic data with significantly higher structural integrity and downstream utility than state-of-the-art methods. It proves especially effective in challenging low-data scenarios, successfully navigating the trade-off between privacy preservation and statistical fidelity.
摘要：在专用域中，机器学习在表格数据上的应用受到数据稀缺的严重限制。尽管生成模型提供了一种解决方案，但传统方法在低数据制度中步履蹒跚，而最近的大型语言模型（LLMS）通常忽略了表格数据的明确依赖性结构，从而导致低保真合成。为了解决这些局限性，我们介绍了结构系统，这是一个新颖的框架，将LLM的生成力与稳健的结构控制整合在一起。 structsynth采用了两个阶段的体系结构。首先，它执行明确的结构发现，以从可用数据中学习有向的无环图（DAG）。其次，这种学识渊博的结构是一种高保真的蓝图，可以引导LLM的生成过程，迫使其遵守学习的特征依赖性，从而确保生成的数据通过设计尊重基础结构。我们的广泛实验表明，与最先进的方法相比，结构符合结构性数据具有明显更高的结构完整性和下游效用。事实证明，它在挑战低数据表面的情况下特别有效，成功地在隐私保存和统计忠诚度之间进行了权衡。
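
The two-stage idea in this abstract (learn a DAG, then let it steer generation) implies generating features in topological order so that each feature is conditioned on its parents. The sketch below shows that ordering with Python's standard `graphlib`. The prompt wording, `sample_fn`, and the example feature names are illustrative assumptions, not the paper's prompts or API.

```python
from graphlib import TopologicalSorter


def generation_order(parents):
    """Return features so that every feature comes after all of its parents.

    parents: dict mapping each feature to the set of its parent features,
    i.e. the learned DAG (here assumed to be given).
    """
    return list(TopologicalSorter(parents).static_order())


def build_prompt(feature, parents, row):
    """Build a hypothetical prompt that exposes only the parent values."""
    context = ", ".join(f"{p}={row[p]}" for p in sorted(parents.get(feature, ())))
    return f"Given {context or 'no parents'}, generate a value for {feature}."


def synthesize_row(parents, sample_fn):
    """Generate one synthetic row, feature by feature, following the DAG."""
    row = {}
    for feature in generation_order(parents):
        row[feature] = sample_fn(build_prompt(feature, parents, row))
    return row


# Toy usage with a three-feature chain; sample_fn stands in for an LLM call.
dag = {"age": set(), "income": {"age"}, "loan": {"age", "income"}}
print(generation_order(dag))
print(synthesize_row(dag, sample_fn=len))
```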

Title: ReMoMask: Retrieval-Augmented Masked Motion Generation

Authors: Zhengdao Li, Siheng Wang, Zeyu Zhang, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2508.02605
Pdf URL: https://arxiv.org/pdf/2508.02605
Copy Paste: [[2508.02605]] ReMoMask: Retrieval-Augmented Masked Motion Generation(https://arxiv.org/abs/2508.02605)
Keywords: generation, generative
Abstract: Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) A Bidirectional Momentum Text-Motion Model decouples negative sample scale from batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) A Semantic Spatio-temporal Attention mechanism enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance incorporates minor unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in minimal steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, achieving a 3.88% and 10.97% improvement in FID scores on HumanML3D and KIT-ML, respectively, compared to the previous SOTA method RAG-T2M. Code: this https URL. Website: this https URL.
摘要：文本到动作（T2M）的生成旨在从自然语言描述中综合现实和语义上的人类运动序列。然而，当前的方法面临双重挑战：生成模型（例如扩散模型）的多样性，错误积累和身体上的不可能有限，而检索功能增强的产生（RAG）方法表现出扩散的惯性，部分模式崩溃和异步人工。为了解决这些局限性，我们提出了一个集成了三个关键创新的统一框架：1）双向动量文本移动模型通过动量排队将负样本量表从批次大小中解脱出来，从而实质上提高了交叉模式的检索精度； 2）语义时空注意机制在零件级融合过程中强制执行生物力学约束，以消除异步伪像； 3）无用的无阶层指导结合了次要的无条件产生，以增强概括。 Remomask建立在Momask的RVQ-VAE上，有效地以最小的步骤产生了时间连贯的动作。与先前的SOTA方法RAG-T2M相比，对标准基准测试的广泛实验表明了Remomask的最新性能，分别在HumanML3D和Kit-ML上提高了3.88％和10.97％。代码：此HTTPS URL。网站：此HTTPS URL。
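
The phrase "decouples negative sample scale from batch size via momentum queues" suggests a MoCo-style design: a fixed-size queue of past embeddings supplies negatives, while a slowly updated key encoder keeps them consistent. The sketch below assumes that reading. The class, its parameters, and the plain-list "weights" are illustrative, not the paper's implementation.

```python
from collections import deque


class MomentumQueue:
    """Fixed-size FIFO of negatives plus an EMA-updated key encoder (assumed design)."""

    def __init__(self, max_size, momentum=0.99):
        self.queue = deque(maxlen=max_size)  # oldest entries drop out automatically
        self.momentum = momentum
        self.key_weights = None

    def update_key_encoder(self, query_weights):
        """Move the key weights toward the query weights by exponential averaging."""
        if self.key_weights is None:
            self.key_weights = list(query_weights)
            return
        m = self.momentum
        self.key_weights = [m * k + (1 - m) * q
                            for k, q in zip(self.key_weights, query_weights)]

    def enqueue(self, batch_embeddings):
        """Add the current batch; the queue length never exceeds max_size."""
        self.queue.extend(batch_embeddings)

    def negatives(self):
        return list(self.queue)


# Toy usage: batches of 3 embeddings, yet 4 negatives are always available.
q = MomentumQueue(max_size=4, momentum=0.5)
q.enqueue([1, 2, 3])
q.enqueue([4, 5, 6])
print(q.negatives())
```

The number of negatives is set by `max_size` rather than by the batch, which is the decoupling the abstract refers to.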

Title: DeepKoopFormer: A Koopman Enhanced Transformer Based Architecture for Time Series Forecasting

Authors: Ali Forootani, Mohammad Khosravi, Masoud Barati
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2508.02616
Pdf URL: https://arxiv.org/pdf/2508.02616
Copy Paste: [[2508.02616]] DeepKoopFormer: A Koopman Enhanced Transformer Based Architecture for Time Series Forecasting(https://arxiv.org/abs/2508.02616)
Keywords: generation
Abstract: Time series forecasting plays a vital role across scientific, industrial, and environmental domains, especially when dealing with high-dimensional and nonlinear systems. While Transformer-based models have recently achieved state-of-the-art performance in long-range forecasting, they often suffer from interpretability issues and instability in the presence of noise or dynamical uncertainty. In this work, we propose DeepKoopFormer, a principled forecasting framework that combines the representational power of Transformers with the theoretical rigor of Koopman operator theory. Our model features a modular encoder-propagator-decoder structure, where temporal dynamics are learned via a spectrally constrained, linear Koopman operator in a latent space. We impose structural guarantees, such as a bounded spectral radius, Lyapunov-based energy regularization, and orthogonal parameterization, to ensure stability and interpretability. Comprehensive evaluations are conducted on synthetic dynamical systems, a real-world climate dataset (wind speed and surface pressure), financial time series (cryptocurrency), and an electricity generation dataset, using a Python package prepared for this purpose. Across all experiments, DeepKoopFormer consistently outperforms standard LSTM and baseline Transformer models in terms of accuracy, robustness to noise, and long-term forecasting stability. These results establish DeepKoopFormer as a flexible, interpretable, and robust framework for forecasting in high-dimensional and dynamical settings.
摘要：时间序列预测在科学，工业和环境领域中起着至关重要的作用，尤其是在处理高维和非线性系统时。尽管基于变压器的模型最近在远程预测中实现了最先进的性能，但在存在噪声或动态不确定性的情况下，它们通常会遭受可解释性问题和不稳定的困扰。在这项工作中，我们提出了DeepKoopFormer，这是一个原则上的预测框架，将变压器的代表力与Koopman操作员理论的理论严格相结合。我们的模型具有模块化的编码器 - 驱动器 - 模块结构，其中通过在潜在空间中的频谱限制的线性koopman操作员来学习时间动力学。我们强加了结构性保证，例如有界光谱半径，基于Lyapunov的能量正则化和正交参数化，以确保稳定性和解释性。使用为此目的准备的Python软件包，对合成动力学系统，现实世界气候数据集（风速和表面压力），财务时间序列（加密货币）和发电数据集进行了全面评估。在所有实验中，DeepKoopFormer在准确性，稳健性和长期预测稳定性方面始终优于标准LSTM和基线变压器模型。这些结果将DeepKoopFormer建立为在高维和动态设置中进行预测的灵活，可解释且可靠的框架。
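
One common way to obtain a linear latent operator with a bounded spectral radius is to build it as K = U diag(sigma) V^T with orthogonal U and V and singular values squashed below a chosen bound. The 2-D sketch below assumes that construction, using rotation matrices for U and V. It illustrates the stability property the abstract mentions and is not taken from the paper's package.

```python
import math


def rotation(theta):
    """2x2 rotation matrix, which is orthogonal by construction."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def koopman_operator(theta_u, theta_v, raw_sigma, rho_max=0.99):
    """Build K = U diag(sigma) V^T with every sigma in (0, rho_max).

    The largest singular value bounds the spectral radius, so rho(K) < rho_max.
    """
    sigma = [rho_max / (1.0 + math.exp(-r)) for r in raw_sigma]  # scaled sigmoid
    u, v = rotation(theta_u), rotation(theta_v)
    v_t = [[v[0][0], v[1][0]], [v[0][1], v[1][1]]]
    diag = [[sigma[0], 0.0], [0.0, sigma[1]]]
    return matmul(matmul(u, diag), v_t)


def propagate(k, z, steps):
    """Roll the latent state forward with z_{t+1} = K z_t."""
    trajectory = [z]
    for _ in range(steps):
        z = [k[0][0] * z[0] + k[0][1] * z[1], k[1][0] * z[0] + k[1][1] * z[1]]
        trajectory.append(z)
    return trajectory


# Toy usage: the latent norm shrinks at every step, so rollouts cannot blow up.
K = koopman_operator(0.3, -1.1, raw_sigma=[2.0, -0.5])
norms = [math.hypot(*z) for z in propagate(K, [1.0, -2.0], steps=20)]
print(all(later < earlier for earlier, later in zip(norms, norms[1:])))
```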