2025-07-04

Title: GeoAda: Efficiently Finetune Geometric Diffusion Models with Equivariant Adapters

Authors: Wanjia Zhao, Jiaqi Han, Siyi Gu, Mingjian Jiang, James Zou, Stefano Ermon
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02085
Pdf URL: https://arxiv.org/pdf/2507.02085
Copy Paste: [[2507.02085]] GeoAda: Efficiently Finetune Geometric Diffusion Models with Equivariant Adapters(https://arxiv.org/abs/2507.02085)
Keywords: generation, generative
Abstract: Geometric diffusion models have shown remarkable success in molecular dynamics and structure generation. However, efficiently fine-tuning them for downstream tasks with varying geometric controls remains underexplored. In this work, we propose an SE(3)-equivariant adapter framework (GeoAda) that enables flexible and parameter-efficient fine-tuning for controlled generative tasks without modifying the original model architecture. GeoAda introduces a structured adapter design: control signals are first encoded through coupling operators, then processed by a trainable copy of selected pretrained model layers, and finally projected back via decoupling operators followed by an equivariant zero-initialized convolution. By fine-tuning only these lightweight adapter modules, GeoAda preserves the model's geometric consistency while mitigating overfitting and catastrophic forgetting. We theoretically prove that the proposed adapters maintain SE(3)-equivariance, ensuring that the geometric inductive biases of the pretrained diffusion model remain intact during adaptation. We demonstrate the wide applicability of GeoAda across diverse geometric control types, including frame control, global control, subgraph control, and a broad range of application domains such as particle dynamics, molecular dynamics, human motion prediction, and molecule generation. Empirical results show that GeoAda achieves state-of-the-art fine-tuning performance while preserving original task accuracy, whereas other baselines experience significant performance degradation due to overfitting and catastrophic forgetting.
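The adapter structure described in the GeoAda abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the "frozen layer" is a hand-written distance-weighted update, "coupling" is modelled as appending control points as extra neighbours, "decoupling" as keeping only the original nodes, and a scalar `gate` stands in for the equivariant zero-initialized convolution.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def _scale(a, s):
    return tuple(x * s for x in a)

def displacement(nodes, neighbors, weight):
    """Per-node update built from relative vectors scaled by a function of
    distance, so it rotates with the input and ignores translations."""
    updates = []
    for x in nodes:
        total = (0.0, 0.0, 0.0)
        for p in neighbors:
            diff = _sub(x, p)
            sq_dist = sum(c * c for c in diff)
            total = _add(total, _scale(diff, weight / (1.0 + sq_dist)))
        updates.append(total)
    return updates

def frozen_layer(x, weight=0.3):
    """Hypothetical stand-in for one pretrained (frozen) equivariant layer."""
    return [_add(xi, di) for xi, di in zip(x, displacement(x, x, weight))]

def adapted_layer(x, control, gate, weight=0.3, copy_weight=0.3):
    """Frozen layer plus a gated adapter branch (all names are hypothetical)."""
    base = frozen_layer(x, weight)
    coupled = list(x) + list(control)               # "coupling": add control nodes
    branch = displacement(x, coupled, copy_weight)  # "decoupling": original nodes only
    return [_add(b, _scale(d, gate)) for b, d in zip(base, branch)]

def rigid_transform(points, theta, shift):
    """Rotate about the z-axis by theta, then translate by shift."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * px - s * py + shift[0], s * px + c * py + shift[1], pz + shift[2])
            for px, py, pz in points]

x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
control = [(1.0, 1.0, 1.0)]
```

With `gate=0.0` (the zero initialization), `adapted_layer` returns exactly the frozen layer's output, so fine-tuning starts from the pretrained behaviour. For any gate value, applying the same rotation and translation to `x` and `control` transforms the output in the same way.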

Title: Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model

Authors: Xingtu Liu, Lin F. Yang, Sharan Vaswani
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.02089
Pdf URL: https://arxiv.org/pdf/2507.02089
Copy Paste: [[2507.02089]] Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model(https://arxiv.org/abs/2507.02089)
Keywords: generative
Abstract: We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (MDVI; Kitamura et al., 2023) as an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $\epsilon$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\epsilon^2}\right)$ samples. We note that these results exhibit a near-optimal dependence on both $d$ and $\epsilon$. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\epsilon^2\zeta^2}\right)$ samples, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.
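To get a feel for how the two bounds in this abstract scale, the sketch below evaluates their leading terms. It is only an illustration: the function names are made up here, and constants and logarithmic factors hidden by the $\tilde{O}$ notation are ignored, so the numbers are relative scalings rather than actual sample counts.

```python
def relaxed_feasibility_samples(d, gamma, eps):
    """Leading term d^2 / ((1 - gamma)^4 * eps^2); constants and logs dropped."""
    return d ** 2 / ((1 - gamma) ** 4 * eps ** 2)

def strict_feasibility_samples(d, gamma, eps, zeta):
    """Leading term d^2 / ((1 - gamma)^6 * eps^2 * zeta^2); constants and logs dropped."""
    return d ** 2 / ((1 - gamma) ** 6 * eps ** 2 * zeta ** 2)

def strict_over_relaxed(gamma, zeta):
    """Ratio of the two leading terms: 1 / ((1 - gamma)^2 * zeta^2)."""
    return 1.0 / ((1 - gamma) ** 2 * zeta ** 2)
```

For example, with `gamma = 0.9` and `zeta = 0.5`, the strict-feasibility term is about 400 times the relaxed one, independent of `d` and `eps`.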

Title: CROP: Circuit Retrieval and Optimization with Parameter Guidance using LLMs

Authors: Jingyu Pan, Isaac Jacobson, Zheng Zhao, Tung-Chieh Chen, Guanglei Zhou, Chen-Chia Chang, Vineet Rashingkar, Yiran Chen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.02128
Pdf URL: https://arxiv.org/pdf/2507.02128
Copy Paste: [[2507.02128]] CROP: Circuit Retrieval and Optimization with Parameter Guidance using LLMs(https://arxiv.org/abs/2507.02128)
Keywords: generation
Abstract: Modern very large-scale integration (VLSI) design requires the implementation of integrated circuits using electronic design automation (EDA) tools. Due to the complexity of EDA algorithms, the vast parameter space poses a huge challenge to chip design optimization, as the combination of even moderate numbers of parameters creates an enormous solution space to explore. Manual parameter selection remains industrial practice despite being excessively laborious and limited by expert experience. To address this issue, we present CROP, the first large language model (LLM)-powered automatic VLSI design flow tuning framework. Our approach includes: (1) a scalable methodology for transforming RTL source code into dense vector representations, (2) an embedding-based retrieval system for matching designs with semantically similar circuits, and (3) a retrieval-augmented generation (RAG)-enhanced LLM-guided parameter search system that constrains the search process with prior knowledge from similar designs. Experiment results demonstrate CROP's ability to achieve superior quality-of-results (QoR) with fewer iterations than existing approaches on industrial designs, including a 9.9% reduction in power consumption.
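The embedding-based retrieval step (2) in the CROP abstract can be pictured with a generic nearest-neighbour lookup. This is a minimal sketch under stated assumptions, not CROP's actual system: the design names and three-dimensional vectors are invented, and cosine similarity is an assumed choice of similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_similar(query, library, k=2):
    """Return the names of the k library designs most similar to the query."""
    ranked = sorted(library.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings of previously tuned designs.
library = {
    "alu_core": [0.9, 0.1, 0.0],
    "fifo_ctrl": [0.1, 0.9, 0.2],
    "mac_unit": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]  # hypothetical embedding of a new design
nearest = retrieve_similar(query, library, k=2)
```

In a retrieval-augmented setup, the tuning history of the designs in `nearest` would then be supplied to the LLM as prior knowledge for the parameter search.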

Title: Generative Latent Diffusion for Efficient Spatiotemporal Data Reduction

Authors: Xiao Li, Liangji Zhu, Anand Rangarajan, Sanjay Ranka
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2507.02129
Pdf URL: https://arxiv.org/pdf/2507.02129
Copy Paste: [[2507.02129]] Generative Latent Diffusion for Efficient Spatiotemporal Data Reduction(https://arxiv.org/abs/2507.02129)
Keywords: generative
Abstract: Generative models have demonstrated strong performance in conditional settings and can be viewed as a form of data compression, where the condition serves as a compact representation. However, their limited controllability and reconstruction accuracy restrict their practical application to data compression. In this work, we propose an efficient latent diffusion framework that bridges this gap by combining a variational autoencoder with a conditional diffusion model. Our method compresses only a small number of keyframes into latent space and uses them as conditioning inputs to reconstruct the remaining frames via generative interpolation, eliminating the need to store latent representations for every frame. This approach enables accurate spatiotemporal reconstruction while significantly reducing storage costs. Experimental results across multiple datasets show that our method achieves up to 10 times higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63 percent better performance than leading learning-based methods under the same reconstruction error.

Title: ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

Authors: Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2507.02200
Pdf URL: https://arxiv.org/pdf/2507.02200
Copy Paste: [[2507.02200]] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning(https://arxiv.org/abs/2507.02200)
Keywords: generation
Abstract: Event stream based scene text recognition is a newly emerging research topic that performs better than widely used RGB cameras in extremely challenging scenarios, especially low illumination and fast motion. Existing works adopt either an end-to-end encoder-decoder framework or large language models for enhanced recognition; however, they are still limited by insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision tokens to the pre-trained large language model Vicuna-7B and output both the answer and the chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we propose a large-scale CoT dataset to train our framework via a three-stage process (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validate the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on this https URL.

Title: SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

Authors: Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2507.02212
Pdf URL: https://arxiv.org/pdf/2507.02212
Copy Paste: [[2507.02212]] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers(https://arxiv.org/abs/2507.02212)
Keywords: generation
Abstract: Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

Title: Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du
Subjects: cs.CV, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2507.02271
Pdf URL: https://arxiv.org/pdf/2507.02271
Copy Paste: [[2507.02271]] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation(https://arxiv.org/abs/2507.02271)
Keywords: generation
Abstract: Video-to-Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

Title: DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation

Authors: Yunhan Yang, Shuo Chen, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Edmund Y. Lam, Hengshuang Zhao, Tong He, Xihui Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02299
Pdf URL: https://arxiv.org/pdf/2507.02299
Copy Paste: [[2507.02299]] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation(https://arxiv.org/abs/2507.02299)
Keywords: generation
Abstract: Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of target view through the multi-view feature fusion module. Finally, the obtained features of target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.

Title: MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation

Authors: JaeHyuck Choi, MinJun Kim, JeHyeong Hong
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02314
Pdf URL: https://arxiv.org/pdf/2507.02314
Copy Paste: [[2507.02314]] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation(https://arxiv.org/abs/2507.02314)
Keywords: generation
Abstract: Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC (Mask-guided inpainting with multi-level perturbations and Context-aware alignment) to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. Under a consistent evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-art methods in downstream anomaly tasks.

Title: Improving Constrained Generation in Language Models via Self-Distilled Twisted Sequential Monte Carlo

Authors: Sooyeon Kim, Giung Nam, Juho Lee
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2507.02315
Pdf URL: https://arxiv.org/pdf/2507.02315
Copy Paste: [[2507.02315]] Improving Constrained Generation in Language Models via Self-Distilled Twisted Sequential Monte Carlo(https://arxiv.org/abs/2507.02315)
Keywords: generation
Abstract: Recent work has framed constrained text generation with autoregressive language models as a probabilistic inference problem. Among these, Zhao et al. (2024) introduced a promising approach based on twisted Sequential Monte Carlo, which incorporates learned twist functions and twist-induced proposals to guide the generation process. However, in constrained generation settings where the target distribution concentrates on outputs that are unlikely under the base model, learning becomes challenging due to sparse and uninformative reward signals. We show that iteratively refining the base model through self-distillation alleviates this issue by making the model progressively more aligned with the target, leading to substantial gains in generation quality.

Title: Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

Authors: Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq, Yadan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02316
Pdf URL: https://arxiv.org/pdf/2507.02316
Copy Paste: [[2507.02316]] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos(https://arxiv.org/abs/2507.02316)
Keywords: quality assessment
Abstract: Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from the MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at this https URL.

Title: Transformer-based EEG Decoding: A Survey

Authors: Haodong Zhang, Hongqi Li
Subjects: cs.LG, cs.HC
Abstract URL: https://arxiv.org/abs/2507.02320
Pdf URL: https://arxiv.org/pdf/2507.02320
Copy Paste: [[2507.02320]] Transformer-based EEG Decoding: A Survey(https://arxiv.org/abs/2507.02320)
Keywords: generative
Abstract: Electroencephalography (EEG) is one of the most common signals used to capture the electrical activity of the brain, and the decoding of EEG to acquire user intent has been at the forefront of brain-computer/machine interface (BCI/BMI) research. Compared to traditional EEG analysis methods with machine learning, the advent of deep learning approaches has gradually revolutionized the field by providing an end-to-end long-cascaded architecture, which can learn more discriminative features automatically. Among these, the Transformer is renowned for its strong capability to handle sequential data through the attention mechanism, and the application of Transformers in various EEG processing tasks is increasingly prevalent. This article presents a survey summarizing the latest applications of Transformer models in EEG decoding since the architecture appeared. We follow the evolution of the model architecture to sort and organize the related advances. We first elucidate the fundamentals of the Transformer that benefit EEG decoding and its direct application. Then, the common hybrid architectures that integrate the basic Transformer with other deep learning techniques (convolutional/recurrent/graph/spiking neural networks, generative adversarial networks, diffusion models, etc.) are overviewed in detail. The research advances in applying modified intrinsic structures of customized Transformers are also introduced. Finally, the current challenges and future development prospects in this rapidly evolving field are discussed. This paper aims to help readers gain a clear understanding of the current state of Transformer applications in EEG decoding and to provide valuable insights for future research endeavors.

Title: Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

Authors: Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02321
Pdf URL: https://arxiv.org/pdf/2507.02321
Copy Paste: [[2507.02321]] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback(https://arxiv.org/abs/2507.02321)
Keywords: generation
Abstract: Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).

Title: Holistic Tokenizer for Autoregressive Image Generation

Authors: Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02358
Pdf URL: https://arxiv.org/pdf/2507.02358
Copy Paste: [[2507.02358]] Holistic Tokenizer for Autoregressive Image Generation(https://arxiv.org/abs/2507.02358)
Keywords: generation
Abstract: The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at this https URL.
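The token arrangement in strategy 1) of the Hita abstract can be pictured with a small sketch. It is only an illustration under assumptions: the token labels, the counts, and the helper names are invented here, and the mask is the standard lower-triangular causal mask rather than anything taken from the Hita code.

```python
def build_sequence(holistic_tokens, patch_tokens):
    """Place holistic tokens first, followed by patch-level tokens."""
    return list(holistic_tokens) + list(patch_tokens)

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

# Hypothetical token labels: 2 holistic queries, 4 local patches.
holistic = ["H0", "H1"]
patches = ["P0", "P1", "P2", "P3"]

sequence = build_sequence(holistic, patches)
mask = causal_mask(len(sequence))
```

Because the holistic tokens come first, every patch position can attend to all of them under the causal mask, while the holistic positions never see patch tokens.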

Title: PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration

Authors: Ayantika Das, Moitreya Chaudhuri, Koushik Bhat, Keerthi Ram, Mihail Bota, Mohanasankar Sivaprakasam
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02405
Pdf URL: https://arxiv.org/pdf/2507.02405
Copy Paste: [[2507.02405]] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration(https://arxiv.org/abs/2507.02405)
Keywords: restoration, generation
Abstract: Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applications, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, we first devise a mechanism to structure the latent space of a diffusion auto-encoding model, towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference. Third, through representational guidance and leveraging the inference time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.

Title: Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk

Authors: Gaochao Song, Zibo Zhao, Haohan Weng, Jingbo Zeng, Rongfei Jia, Shenghua Gao
Subjects: cs.CV, cs.GR
Abstract URL: https://arxiv.org/abs/2507.02477
Pdf URL: https://arxiv.org/pdf/2507.02477
Copy Paste: [[2507.02477]] Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk(https://arxiv.org/abs/2507.02477)
Keywords: generation
Abstract: We introduce Mesh Silksong, a compact and efficient mesh representation tailored to generate the polygon mesh in an auto-regressive manner akin to silk weaving. Existing mesh tokenization methods always produce token sequences with repeated vertex tokens, wasting the network capability. Therefore, our approach tokenizes mesh vertices by accessing each mesh vertex only once, reduces the token sequence's redundancy by 50%, and achieves a state-of-the-art compression rate of approximately 22%. Furthermore, Mesh Silksong produces polygon meshes with superior geometric properties, including manifold topology, watertight detection, and consistent face normals, which are critical for practical applications. Experimental results demonstrate the effectiveness of our approach, showcasing not only intricate mesh generation but also significantly improved geometric integrity.

Title: RetrySQL: text-to-SQL training with retry data for self-correcting query generation

Authors: Alicja Rączkowska, Riccardo Belluzzo, Piotr Zieliński, Joanna Baran, Paweł Olszewski
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.02529
Pdf URL: https://arxiv.org/pdf/2507.02529
Copy Paste: [[2507.02529]] RetrySQL: text-to-SQL training with retry data for self-correcting query generation(https://arxiv.org/abs/2507.02529)
Keywords: generation, generative
Abstract: The text-to-SQL task is an active challenge in Natural Language Processing. Many existing solutions focus on using black-box language models extended with specialized components within customized end-to-end text-to-SQL pipelines. While these solutions use both closed-source proprietary language models and coding-oriented open-source models, there is a lack of research regarding SQL-specific generative models. At the same time, recent advancements in self-correcting generation strategies show promise for improving the capabilities of existing architectures. The application of these concepts to the text-to-SQL task remains unexplored. In this paper, we introduce RetrySQL, a new approach to training text-to-SQL generation models. We prepare reasoning steps for reference SQL queries and then corrupt them to create retry data that contains both incorrect and corrected steps, divided with a special token. We continuously pre-train an open-source coding model with this data and demonstrate that retry steps yield an improvement of up to 4 percentage points in both overall and challenging execution accuracy metrics, compared to pre-training without retry data. Additionally, we confirm that supervised fine-tuning with LoRA is ineffective for learning from retry data and that full-parameter pre-training is a necessary requirement for that task. We showcase that the self-correcting behavior is learned by the model and the increase in downstream accuracy metrics is a result of this additional skill. Finally, we incorporate RetrySQL-trained models into the full text-to-SQL pipeline and showcase that they are competitive in terms of execution accuracy with proprietary models that contain orders of magnitude more parameters. RetrySQL demonstrates that self-correction can be learned in the text-to-SQL task and provides a novel way of improving generation accuracy for SQL-oriented language models.
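Illustrative note (not from the paper): the retry-data construction described in the abstract, an incorrect step followed by a special token and then the corrected step, might look like the minimal sketch below. The token name `[BACK]`, the corruption function, and the error probability are assumptions made for illustration; the abstract does not specify any of them.

```python
import random

RETRY_TOKEN = "[BACK]"  # hypothetical name; the abstract only says "a special token"


def make_retry_data(steps, corrupt, p_error, rng):
    """With probability p_error, emit a corrupted step and the retry token
    before each correct reasoning step."""
    out = []
    for step in steps:
        if rng.random() < p_error:
            out.append(corrupt(step))   # incorrect step
            out.append(RETRY_TOKEN)     # marks the preceding step as retracted
        out.append(step)                # corrected step
    return out


def swap_table(step):
    """Toy corruption: point the step at the wrong table (illustrative only)."""
    return step.replace("orders", "customers")


steps = ["select table orders", "filter orders by year", "count rows"]
retry = make_retry_data(steps, swap_table, p_error=0.5, rng=random.Random(0))
```

Dropping every retry token together with the step immediately before it recovers the original reference steps, which is the invariant a construction like this is meant to preserve.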

Title: Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation

Authors: François Rozet, Ruben Ohana, Michael McCabe, Gilles Louppe, François Lanusse, Shirley Ho
Subjects: cs.LG, physics.flu-dyn
Abstract URL: https://arxiv.org/abs/2507.02608
Pdf URL: https://arxiv.org/pdf/2507.02608
Copy Paste: [[2507.02608]] Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation(https://arxiv.org/abs/2507.02608)
Keywords: generation, generative
Abstract: The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.

Title: Medical Data Pecking: A Context-Aware Approach for Automated Quality Evaluation of Structured Medical Data

Authors: Irena Girshovitz, Atai Ambus, Moni Shahar, Ran Gilad-Bachrach
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.02628
Pdf URL: https://arxiv.org/pdf/2507.02628
Copy Paste: [[2507.02628]] Medical Data Pecking: A Context-Aware Approach for Automated Quality Evaluation of Structured Medical Data(https://arxiv.org/abs/2507.02628)
Keywords: quality assessment
Abstract: Background: The use of Electronic Health Records (EHRs) for epidemiological studies and artificial intelligence (AI) training is increasing rapidly. The reliability of the results depends on the accuracy and completeness of EHR data. However, EHR data often contain significant quality issues, including misrepresentations of subpopulations, biases, and systematic errors, as they are primarily collected for clinical and billing purposes. Existing quality assessment methods remain insufficient, lacking systematic procedures to assess data fitness for research. Methods: We present the Medical Data Pecking approach, which adapts unit testing and coverage concepts from software engineering to identify data quality concerns. We demonstrate our approach using the Medical Data Pecking Tool (MDPT), which consists of two main components: (1) an automated test generator that uses large language models and grounding techniques to create a test suite from data and study descriptions, and (2) a data testing framework that executes these tests, reporting potential errors and coverage. Results: We evaluated MDPT on three datasets: All of Us (AoU), MIMIC-III, and SyntheticMass, generating 55-73 tests per cohort across four conditions. These tests correctly identified 20-43 non-aligned or non-conforming data issues. We present a detailed analysis of the LLM-generated test suites in terms of reference grounding and value accuracy. Conclusion: Our approach incorporates external medical knowledge to enable context-sensitive data quality testing as part of the data analysis workflow to improve the validity of its outcomes. Our approach tackles these challenges from a quality assurance perspective, laying the foundation for further development such as additional data modalities and improved grounding methods.
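Illustrative note (not from the paper): the idea of applying unit tests and coverage to tabular data can be sketched as follows. The row schema, the two tests, and the coverage definition (share of columns touched by at least one test) are assumptions for illustration; MDPT generates its tests with an LLM and may define coverage differently.

```python
def run_tests(rows, tests):
    """Run each (name, columns, predicate) test on every row.

    Returns the indices of failing rows per test, and the fraction of
    columns that at least one test looks at."""
    failures = {}
    covered = set()
    for name, columns, predicate in tests:
        covered.update(columns)
        bad = [i for i, row in enumerate(rows) if not predicate(row)]
        if bad:
            failures[name] = bad
    all_columns = set().union(*(row.keys() for row in rows))
    coverage = len(covered & all_columns) / len(all_columns) if all_columns else 0.0
    return failures, coverage


# Hypothetical toy cohort with one implausible age and one inconsistent record.
rows = [
    {"age": 34, "sex": "F", "diagnosis": "pregnancy", "weight_kg": 61},
    {"age": 210, "sex": "M", "diagnosis": "asthma", "weight_kg": 80},
    {"age": 51, "sex": "M", "diagnosis": "pregnancy", "weight_kg": 77},
]
tests = [
    ("age_in_range", ["age"], lambda r: 0 <= r["age"] <= 120),
    ("pregnancy_implies_female", ["sex", "diagnosis"],
     lambda r: r["diagnosis"] != "pregnancy" or r["sex"] == "F"),
]
failures, coverage = run_tests(rows, tests)
```

Here `weight_kg` is never tested, so the coverage figure flags a part of the data that the test suite says nothing about.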

Title: High-Order Deep Meta-Learning with Category-Theoretic Interpretation

Authors: David H. Mguni
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2507.02634
Pdf URL: https://arxiv.org/pdf/2507.02634
Copy Paste: [[2507.02634]] High-Order Deep Meta-Learning with Category-Theoretic Interpretation(https://arxiv.org/abs/2507.02634)
Keywords: generation, generative
Abstract: We introduce a new hierarchical deep learning framework for recursive higher-order meta-learning that enables neural networks (NNs) to construct, solve, and generalise across hierarchies of tasks. Central to this approach is a generative mechanism that creates virtual tasks, that is, synthetic problem instances designed to enable the meta-learner to learn soft constraints and unknown generalisable rules across related tasks. Crucially, this enables the framework to generate its own informative, task-grounded datasets, thereby freeing machine learning (ML) training from the limitations of relying entirely on human-generated data. By actively exploring the virtual point landscape and seeking out tasks lower-level learners find difficult, the meta-learner iteratively refines constraint regions. This enhances inductive biases, regularises the adaptation process, and produces novel, unanticipated tasks and constraints required for generalisation. Each meta-level of the hierarchy corresponds to a progressively abstracted generalisation of problems solved at lower levels, enabling a structured and interpretable learning progression. By interpreting meta-learners as category-theoretic functors that generate and condition a hierarchy of subordinate learners, we establish a compositional structure that supports abstraction and knowledge transfer across progressively generalised tasks. The category-theoretic perspective unifies existing meta-learning models and reveals how learning processes can be transformed and compared through functorial relationships, while offering practical design principles for structuring meta-learning. We speculate this architecture may underpin the next generation of NNs capable of autonomously generating novel, instructive tasks and their solutions, thereby advancing ML towards general artificial intelligence.

Title: OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Authors: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2507.02659
Pdf URL: https://arxiv.org/pdf/2507.02659
Copy Paste: [[2507.02659]] OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding(https://arxiv.org/abs/2507.02659)
Keywords: generation
Abstract: Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the "one drafter for all" paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.
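Illustrative note (not from the paper): the generic draft-then-verify loop behind speculative decoding can be sketched as below for greedy decoding. It does not cover OmniDraft's n-gram cache, cross-vocabulary handling, or online adaptation. The two toy "models" are hypothetical next-token functions, and the target is called token by token for clarity, whereas a real system would verify all proposals in a single forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round of greedy speculative decoding.

    The draft proposes k tokens; the target keeps the longest prefix it
    agrees with and then contributes one token of its own."""
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted = []
    ctx = list(prefix)
    for tok in proposed:
        if target_next(ctx) != tok:
            break                      # first disagreement ends acceptance
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # target's correction (or bonus token)
    return accepted


def target_next(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    """Toy draft model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


new_tokens = speculative_step([1], draft_next, target_next, k=4)
```

In this toy run the draft's first two proposals are accepted and the third is replaced by the target's token, so the output matches what the target alone would have produced.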

Title: AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Authors: Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji, Ke Yan, Shouhong Ding, Xiaoshuai Sun, Yunsheng Wu, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02664
Pdf URL: https://arxiv.org/pdf/2507.02664
Copy Paste: [[2507.02664]] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models(https://arxiv.org/abs/2507.02664)
Keywords: generation
Abstract: The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.

Title: Guided Generation for Developable Antibodies

Authors: Siqi Zhao, Joshua Moller, Porfi Quintero-Cadena, Lood van Niekerk
Subjects: cs.LG, q-bio.BM
Abstract URL: https://arxiv.org/abs/2507.02670
Pdf URL: https://arxiv.org/pdf/2507.02670
Copy Paste: [[2507.02670]] Guided Generation for Developable Antibodies(https://arxiv.org/abs/2507.02670)
Keywords: generation
Abstract: Therapeutic antibodies require not only high-affinity target engagement, but also favorable manufacturability, stability, and safety profiles for clinical effectiveness. These properties are collectively called 'developability'. To enable a computational framework for optimizing antibody sequences for favorable developability, we introduce a guided discrete diffusion model trained on natural paired heavy- and light-chain sequences from the Observed Antibody Space (OAS) and quantitative developability measurements for 246 clinical-stage antibodies. To steer generation toward biophysically viable candidates, we integrate a Soft Value-based Decoding in Diffusion (SVDD) Module that biases sampling without compromising naturalness. In unconstrained sampling, our model reproduces global features of both the natural repertoire and approved therapeutics, and under SVDD guidance we achieve significant enrichment in predicted developability scores over unguided baselines. When combined with high-throughput developability assays, this framework enables an iterative, ML-driven pipeline for designing antibodies that satisfy binding and biophysical criteria in tandem.

Title: Embedding-Based Federated Data Sharing via Differentially Private Conditional VAEs

Authors: Francesco Di Salvo, Hanh Huyen My Nguyen, Christian Ledig
Subjects: cs.LG, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2507.02671
Pdf URL: https://arxiv.org/pdf/2507.02671
Copy Paste: [[2507.02671]] Embedding-Based Federated Data Sharing via Differentially Private Conditional VAEs(https://arxiv.org/abs/2507.02671)
Keywords: generative
Abstract: Deep Learning (DL) has revolutionized medical imaging, yet its adoption is constrained by data scarcity and privacy regulations, limiting access to diverse datasets. Federated Learning (FL) enables decentralized training but suffers from high communication costs and is often restricted to a single downstream task, reducing flexibility. We propose a data-sharing method via Differentially Private (DP) generative models. By adopting foundation models, we extract compact, informative embeddings, reducing redundancy and lowering computational overhead. Clients collaboratively train a Differentially Private Conditional Variational Autoencoder (DP-CVAE) to model a global, privacy-aware data distribution, supporting diverse downstream tasks. Our approach, validated across multiple feature extractors, enhances privacy, scalability, and efficiency, outperforming traditional FL classifiers while ensuring differential privacy. Additionally, DP-CVAE produces higher-fidelity embeddings than DP-CGAN while requiring 5× fewer parameters.
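Illustrative note (not from the paper): the abstract does not say how differential privacy is enforced during training. A common choice for models like this is DP-SGD, in which each per-sample gradient is clipped to a fixed norm and Gaussian noise is added to the sum. The sketch below assumes that mechanism; the function name and parameters are hypothetical.

```python
import math
import random


def dp_aggregate(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient to L2 norm <= clip_norm, sum them,
    add Gaussian noise with std noise_multiplier * clip_norm, then average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


# Toy gradients: the first has norm 5 and gets clipped, the second has norm 0.5.
grads = [[3.0, 4.0], [0.3, 0.4]]
update = dp_aggregate(grads, clip_norm=1.0, noise_multiplier=1.1,
                      rng=random.Random(0))
```

Clipping bounds how much any single record can move the update, and the noise scale is tied to that bound; the resulting privacy guarantee depends on the noise multiplier, sampling rate, and number of steps, none of which this sketch accounts for.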

Title: UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

Authors: Qin Guo, Ailing Zeng, Dongxu Yue, Ceyuan Yang, Yang Cao, Hanzhong Guo, Fei Shen, Wei Liu, Xihui Liu, Dan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02713
Pdf URL: https://arxiv.org/pdf/2507.02713
Copy Paste: [[2507.02713]] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation(https://arxiv.org/abs/2507.02713)
Keywords: generation
Abstract: Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.

Title: FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models

Authors: Yuxuan Wang, Tianwei Cao, Huayu Zhang, Zhongjiang He, Kongming Liang, Zhanyu Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2507.02714
Pdf URL: https://arxiv.org/pdf/2507.02714
Copy Paste: [[2507.02714]] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models(https://arxiv.org/abs/2507.02714)
Keywords: generation
Abstract: Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.

Title: Prompt learning with bounding box constraints for medical image segmentation

Authors: Mélanie Gaillochet, Mehrdad Noori, Sahar Dastani, Christian Desrosiers, Hervé Lombaert
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02743
Pdf URL: https://arxiv.org/pdf/2507.02743
Copy Paste: [[2507.02743]] Prompt learning with bounding box constraints for medical image segmentation(https://arxiv.org/abs/2507.02743)
Keywords: generation
Abstract: Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations (much easier to acquire) offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at this https URL
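Illustrative note (not from the paper): the Dice score reported above is the standard overlap measure 2|A∩B| / (|A| + |B|) between a predicted and a reference mask. The sketch below computes it on toy binary masks, together with one simple box-derived quantity, the fraction of predicted foreground falling outside the annotated box. That second function is a hypothetical example of a box constraint, not the paper's actual loss.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as lists of rows."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    size = sum(map(sum, pred)) + sum(map(sum, target))
    return 2.0 * inter / size if size else 1.0


def outside_box_fraction(pred, box):
    """Fraction of predicted foreground pixels outside box = (r0, c0, r1, c1),
    with both corners inclusive."""
    r0, c0, r1, c1 = box
    fg = outside = 0
    for r, row in enumerate(pred):
        for c, v in enumerate(row):
            if v:
                fg += 1
                if not (r0 <= r <= r1 and c0 <= c <= c1):
                    outside += 1
    return outside / fg if fg else 0.0


# Toy 3x3 masks: the prediction misses one pixel and has one stray pixel.
pred = [[0, 1, 1],
        [0, 1, 0],
        [1, 0, 0]]
target = [[0, 1, 1],
          [0, 1, 1],
          [0, 0, 0]]
score = dice(pred, target)                          # 2*3 / (4+4)
leak = outside_box_fraction(pred, (0, 1, 1, 2))     # 1 of 4 pixels outside
```

A box annotation guarantees that no true foreground lies outside it, so a nonzero outside-box fraction is a signal that can be penalised without any pixel-level labels.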

Title: RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

Authors: Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02792
Pdf URL: https://arxiv.org/pdf/2507.02792
Copy Paste: [[2507.02792]] RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation(https://arxiv.org/abs/2507.02792)
Keywords: generation
Abstract: Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.

Title: No time to train! Training-Free Reference-Based Instance Segmentation

Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02798
Pdf URL: https://arxiv.org/pdf/2507.02798
Copy Paste: [[2507.02798]] No time to train! Training-Free Reference-Based Instance Segmentation(https://arxiv.org/abs/2507.02798)
Keywords: generation
Abstract: The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic segmentation paradigm, yet it still requires manual visual prompts or complex domain-dependent prompt-generation rules to process a new image. To reduce this new burden, our work investigates object segmentation when only a small set of reference images is provided instead. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks, and we instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction, (2) representation aggregation, and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP) and PASCAL VOC Few-Shot (71.2% nAP50), and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).
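The three stages named in the abstract (memory bank construction, representation aggregation, feature matching) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes features have already been extracted as NumPy vectors (the paper relies on foundation-model features), it uses plain cosine similarity, and the function names `build_memory_bank`, `aggregate_prototypes`, and `match_features` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_memory_bank(ref_features, ref_labels):
    """Stage 1 (assumed form): group reference feature vectors by class label."""
    bank = {}
    for feat, label in zip(ref_features, ref_labels):
        bank.setdefault(label, []).append(feat)
    return {label: np.stack(feats) for label, feats in bank.items()}

def aggregate_prototypes(bank):
    """Stage 2 (assumed form): average each class into one unit-norm prototype."""
    labels = sorted(bank)
    prototypes = np.stack([bank[label].mean(axis=0) for label in labels])
    return labels, l2_normalize(prototypes)

def match_features(target_features, labels, prototypes):
    """Stage 3 (assumed form): assign each target feature to its most similar prototype."""
    sims = l2_normalize(target_features) @ prototypes.T
    best = sims.argmax(axis=1)
    return [labels[i] for i in best], sims.max(axis=1)

# Toy usage with hand-made 2-D "features".
ref_features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ref_labels = ["cat", "cat", "dog"]
labels, prototypes = aggregate_prototypes(build_memory_bank(ref_features, ref_labels))
predicted, scores = match_features(np.array([[0.8, 0.2], [0.1, 0.9]]), labels, prototypes)
```

In a real pipeline the matched regions would still have to be turned into instance masks, a step this sketch leaves out.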

Title: LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Authors: Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02813
Pdf URL: https://arxiv.org/pdf/2507.02813
Copy Paste: [[2507.02813]] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion(https://arxiv.org/abs/2507.02813)
Keywords: generative
Abstract: Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: this https URL.

Title: AnyI2V: Animating Any Conditional Image with Motion Control

Authors: Ziye Li, Hao Luo, Xincheng Shuai, Henghui Ding
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02857
Pdf URL: https://arxiv.org/pdf/2507.02857
Copy Paste: [[2507.02857]] AnyI2V: Animating Any Conditional Image with Motion Control(https://arxiv.org/abs/2507.02857)
Keywords: generation
Abstract: Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional image with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective on spatial- and motion-controlled video generation. Code is available at this https URL.

Title: Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

Authors: Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02860
Pdf URL: https://arxiv.org/pdf/2507.02860
Copy Paste: [[2507.02860]] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching(https://arxiv.org/abs/2507.02860)
Keywords: generation
Abstract: Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by 2.1-3.3$\times$ compared to the original baselines while maintaining high visual fidelity, with a PSNR improvement of up to 36% over the previous SOTA method. This improvement makes EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at this https URL.
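The idea of reusing a previously computed transformation vector at runtime can be sketched as below. This is an illustration under stated assumptions, not EasyCache's actual rule: the class name `AdaptiveCache`, the relative-drift criterion, and the `threshold` value are all hypothetical, and `model_fn` stands in for one expensive denoising call.

```python
import numpy as np

class AdaptiveCache:
    """Reuse a cached transformation vector (output - input) while the input drifts little.

    The drift test and threshold are illustrative assumptions only.
    """

    def __init__(self, model_fn, threshold=0.05):
        self.model_fn = model_fn
        self.threshold = threshold
        self.ref_input = None   # input seen at the last full computation
        self.delta = None       # cached transformation vector
        self.full_calls = 0     # number of times model_fn was actually run

    def __call__(self, x):
        if self.delta is not None:
            # Relative change of the input since the last full computation.
            drift = np.linalg.norm(x - self.ref_input) / (np.linalg.norm(self.ref_input) + 1e-8)
            if drift < self.threshold:
                return x + self.delta  # cheap path: reuse the cached vector
        out = self.model_fn(x)         # expensive path: full computation
        self.full_calls += 1
        self.ref_input = x.copy()
        self.delta = out - x
        return out

# Toy usage: a stand-in "model" that shrinks its input by 10%.
cache = AdaptiveCache(lambda x: 0.9 * x, threshold=0.05)
x0 = np.ones(4)
y0 = cache(x0)          # full computation
y1 = cache(1.01 * x0)   # small drift, cached vector reused
y2 = cache(2.0 * x0)    # large drift, recomputed
```

The decision is made at runtime from quantities already at hand, which matches the abstract's claim of needing no offline profiling; how the real method measures drift is not specified in the abstract.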

Title: RefTok: Reference-Based Tokenization for Video Generation

Authors: Xiang Fan, Xiaohang Sun, Kushan Thakkar, Zhu Liu, Vimal Bhat, Ranjay Krishna, Xiang Hao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2507.02862
Pdf URL: https://arxiv.org/pdf/2507.02862
Copy Paste: [[2507.02862]] RefTok: Reference-Based Tokenization for Video Generation(https://arxiv.org/abs/2507.02862)
Keywords: generation
Abstract: Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained using RefTok's latents on the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.
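For readers unfamiliar with the reconstruction metrics cited above, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below implements that textbook formula only; the 8-bit default `max_value=255.0` is an assumption, and the paper's exact evaluation protocol (per-frame averaging, color space, and so on) is not given in the abstract.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs: no error to measure
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Higher PSNR means a closer reconstruction, whereas LPIPS is a distance and improves as it decreases.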