2025-06-06

Title: DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience

Authors: Runxiang Wang, Boxiao Wang, Kai Li, Yifan Zhang, Jian Cheng
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04282
Pdf URL: https://arxiv.org/pdf/2506.04282
Copy Paste: [[2506.04282]] DrSR: LLM based Scientific Equation Discovery with Dual Reasoning from Data and Experience(https://arxiv.org/abs/2506.04282)
Keywords: generation
Abstract: Symbolic regression is a fundamental tool for discovering interpretable mathematical expressions from data, with broad applications across scientific and engineering domains. Recently, large language models (LLMs) have demonstrated strong performance in this task, leveraging embedded scientific priors and reasoning capabilities to surpass traditional methods. However, existing LLM-based approaches, such as LLM-SR, often over-rely on internal priors, lacking explicit data understanding and systematic reflection during equation generation. To address these limitations, we propose DrSR (Dual Reasoning Symbolic Regression), a framework that combines data-driven insight with reflective learning to enhance both robustness and discovery capability. Specifically, DrSR guides LLMs to analyze structural relationships (e.g., monotonicity, nonlinearity, and correlation) within the data to generate structured descriptions. Simultaneously, it monitors equation performance and establishes a feedback loop to refine subsequent generations. By integrating data understanding and generation reflection in a closed loop, DrSR enables more efficient exploration of the symbolic expression space. Experiments across interdisciplinary datasets in physics, chemistry, biology, and materials science demonstrate that DrSR substantially improves the valid equation rate and consistently outperforms both classical and recent LLM-based methods in terms of accuracy, generalization, and search efficiency. These results underscore its potential for scientific equation discovery.
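
Note (illustrative sketch, not from the paper): the abstract says DrSR has the LLM analyze structural relationships such as monotonicity, nonlinearity, and correlation and turn them into structured descriptions. The snippet below shows one way such statistics could be computed numerically before being put into a prompt. The function name `describe_relationship`, the `gap` threshold, and the output keys are assumptions made for illustration; the paper's actual analysis procedure is not described in this listing.

```python
import numpy as np


def _ranks(values):
    """0-based ranks of a 1-D array (ties broken by order of appearance)."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def describe_relationship(x, y, gap=0.05):
    """Summarize how y varies with x as a small dictionary of statistics."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pearson measures linear association; Spearman (Pearson on ranks)
    # measures monotonic association.
    pearson = float(np.corrcoef(x, y)[0, 1])
    spearman = float(np.corrcoef(_ranks(x), _ranks(y))[0, 1])
    # Successive changes in y once the samples are ordered by x.
    steps = np.diff(y[np.argsort(x, kind="stable")])
    if np.all(steps >= 0):
        monotonicity = "increasing"
    elif np.all(steps <= 0):
        monotonicity = "decreasing"
    else:
        monotonicity = "non-monotonic"
    return {
        "pearson": pearson,
        "spearman": spearman,
        "monotonicity": monotonicity,
        # Hypothetical heuristic: a rank correlation noticeably stronger than
        # the linear one suggests a monotonic but nonlinear relationship.
        "nonlinear_hint": bool(abs(spearman) - abs(pearson) > gap),
    }


x = np.linspace(0.1, 5.0, 50)
summary = describe_relationship(x, np.exp(x))
```

A dictionary like `summary` could then be serialized into text for the model; how DrSR phrases or uses such descriptions is not specified here.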

Title: Backbone Augmented Training for Adaptations

Authors: Jae Wan Park, Junhyeok Kim, Youngjun Jun, Hyunah Ko, Seong Jae Hwang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04288
Pdf URL: https://arxiv.org/pdf/2506.04288
Copy Paste: [[2506.04288]] Backbone Augmented Training for Adaptations(https://arxiv.org/abs/2506.04288)
Keywords: generation
Abstract: Adaptations facilitate efficient training of large backbone models, including diffusion models for image generation and transformer-based language models. While various adaptation techniques enhance performance with minimal computational resources, limited adaptation data often leads to challenges in training. To address this, we focus on the enormous amount of backbone data used to pre-train the backbone models. We propose Backbone Augmented Training (BAT), a method that leverages backbone data to augment the adaptation dataset. First, we formulate and prove two key mathematical propositions: one establishes the validity of BAT, while the other identifies a condition under which BAT benefits adaptation. Furthermore, we introduce an advanced data selection scheme that satisfies these propositions and present the ALBAT algorithm to implement this approach. ALBAT efficiently enhances adaptation training in both personalization and language generation tasks with scarce data.

Title: Softlog-Softmax Layers and Divergences Contribute to a Computationally Dependable Ensemble Learning

Authors: Abdourrahmane Mahamane Atto (LISTIC)
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04297
Pdf URL: https://arxiv.org/pdf/2506.04297
Copy Paste: [[2506.04297]] Softlog-Softmax Layers and Divergences Contribute to a Computationally Dependable Ensemble Learning(https://arxiv.org/abs/2506.04297)
Keywords: generation
Abstract: The paper proposes a 4-step process for highlighting that softlog-softmax cascades can improve both consistency and dependability of the next generation ensemble learning systems. The first process is anatomical in nature: the target ensemble model under consideration is composed of canonical elements relating to the definition of a convolutional frustum. No a priori assumption is made in the choice of canonical forms. Diversity is the main criterion for selecting these forms. It is shown that the more complex the problem, the more useful this ensemble diversity is. The second process is physiological and relates to neural engineering: a softlog is derived to both make weak logarithmic operations consistent and lead, through multiple softlog-softmax layers, to intermediate decisions in the sense of respecting the same class logic as that faced by the output layer. The third process concerns neural information theory: softlog-based entropy and divergence are proposed for the sake of constructing information measures yielding consistent values on closed intervals. These information measures are used to determine the relationships between individual and sub-community decisions in frustum diversity-based ensemble learning. The concluding process addresses the derivation of an informative performance tensor for the purpose of a reliable ensemble evaluation.

Title: HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting

Authors: Maksym Ivashechkin, Oscar Mendez, Richard Bowden
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04351
Pdf URL: https://arxiv.org/pdf/2506.04351
Copy Paste: [[2506.04351]] HuGeDiff: 3D Human Generation via Diffusion with Gaussian Splatting(https://arxiv.org/abs/2506.04351)
Keywords: generation, generative
Abstract: 3D human generation is an important problem with a wide range of applications in computer vision and graphics. Despite recent progress in generative AI such as diffusion models or rendering methods like Neural Radiance Fields or Gaussian Splatting, controlling the generation of accurate 3D humans from text prompts remains an open challenge. Current methods struggle with fine detail, accurate rendering of hands and faces, human realism, and controllability over appearance. The lack of diversity, realism, and annotation in human image data also remains a challenge, hindering the development of a foundational 3D human model. We present a weakly supervised pipeline that tries to address these challenges. In the first step, we generate a photorealistic human image dataset with controllable attributes such as appearance, race, gender, etc., using a state-of-the-art image diffusion model. Next, we propose an efficient mapping approach from image features to 3D point clouds using a transformer-based architecture. Finally, we close the loop by training a point-cloud diffusion model that is conditioned on the same text prompts used to generate the original samples. We demonstrate orders-of-magnitude speed-ups in 3D human generation compared to the state-of-the-art approaches, along with significantly improved text-prompt alignment, realism, and rendering quality. We will make the code and dataset available.

Title: ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding

Authors: Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, Pranav Rajpurkar
Subjects: cs.CV, cs.AI, cs.CE, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04353
Pdf URL: https://arxiv.org/pdf/2506.04353
Copy Paste: [[2506.04353]] ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding(https://arxiv.org/abs/2506.04353)
Keywords: generation
Abstract: We present ReXVQA, the largest and most comprehensive benchmark for visual question answering (VQA) in chest radiology, comprising approximately 696,000 questions paired with 160,000 chest X-ray studies across training, validation, and test sets. Unlike prior efforts that rely heavily on template-based queries, ReXVQA introduces a diverse and clinically authentic task suite reflecting five core radiological reasoning skills: presence assessment, location analysis, negation detection, differential diagnosis, and geometric reasoning. We evaluate eight state-of-the-art multimodal large language models, including MedGemma-4B-it, Qwen2.5-VL, Janus-Pro-7B, and Eagle2-9B. The best-performing model (MedGemma) achieves 83.24% overall accuracy. To bridge the gap between AI performance and clinical expertise, we conducted a comprehensive human reader study involving 3 radiology residents on 200 randomly sampled cases. Our evaluation demonstrates that MedGemma achieved superior performance (83.84% accuracy) compared to human readers (best radiology resident: 77.27%), representing a significant milestone where AI performance exceeds expert human evaluation on chest X-ray interpretation. The reader study reveals distinct performance patterns between AI models and human experts, with strong inter-reader agreement among radiologists while showing more variable agreement patterns between human readers and AI models. ReXVQA establishes a new standard for evaluating generalist radiological AI systems, offering public leaderboards, fine-grained evaluation splits, structured explanations, and category-level breakdowns. This benchmark lays the foundation for next-generation AI systems capable of mimicking expert-level clinical reasoning beyond narrow pathology classification. Our dataset will be open-sourced at this https URL

Title: WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning

Authors: Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, Pascale Fung
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04363
Pdf URL: https://arxiv.org/pdf/2506.04363
Copy Paste: [[2506.04363]] WorldPrediction: A Benchmark for High-level World Modeling and Long-horizon Procedural Planning(https://arxiv.org/abs/2506.04363)
Keywords: generative
Abstract: Humans are known to have an internal "world model" that enables us to carry out action planning based on world states. AI agents need to have such a world model for action planning as well. It is not clear how current AI models, especially generative models, are able to learn such world models and carry out procedural planning in diverse environments. We introduce WorldPrediction, a video-based benchmark for evaluating world modeling and procedural planning capabilities of different AI models. In contrast to prior benchmarks that focus primarily on low-level world modeling and robotic motion planning, WorldPrediction is the first benchmark that emphasizes actions with temporal and semantic abstraction. Given initial and final world states, the task is to distinguish the proper action (WorldPrediction-WM) or the properly ordered sequence of actions (WorldPrediction-PP) from a set of counterfactual distractors. This discriminative task setup enables us to evaluate different types of world models and planners and realize a thorough comparison across different hypotheses. The benchmark represents states and actions using visual observations. In order to prevent models from exploiting low-level continuity cues in background scenes, we provide "action equivalents" - identical actions observed in different contexts - as candidates for selection. This benchmark is grounded in a formal framework of a partially observable semi-MDP, ensuring better reliability and robustness of the evaluation. We conduct extensive human filtering and validation on our benchmark and show that current frontier models barely achieve 57% accuracy on WorldPrediction-WM and 38% on WorldPrediction-PP, whereas humans are able to solve both tasks perfectly.

Title: Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization

Authors: Matthew W. Shinkle, Mark D. Lescroart
Subjects: cs.CV, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.04379
Pdf URL: https://arxiv.org/pdf/2506.04379
Copy Paste: [[2506.04379]] Visualizing and Controlling Cortical Responses Using Voxel-Weighted Activation Maximization(https://arxiv.org/abs/2506.04379)
Keywords: generative
Abstract: Deep neural networks (DNNs) trained on visual tasks develop feature representations that resemble those in the human visual system. Although DNN-based encoding models can accurately predict brain responses to visual stimuli, they offer limited insight into the specific features driving these responses. Here, we demonstrate that activation maximization -- a technique designed to interpret vision DNNs -- can be applied to DNN-based encoding models of the human brain. We extract and adaptively downsample activations from multiple layers of a pretrained Inception V3 network, then use linear regression to predict fMRI responses. This yields a full image-computable model of brain responses. Next, we apply activation maximization to generate images optimized for predicted responses in individual cortical voxels. We find that these images contain visual characteristics that qualitatively correspond with known selectivity and enable exploration of selectivity across the visual cortex. We further extend our method to whole regions of interest (ROIs) of the brain and validate its efficacy by presenting these images to human participants in an fMRI study. We find that the generated images reliably drive activity in targeted regions across both low- and high-level visual areas and across subjects. These results demonstrate that activation maximization can be successfully applied to DNN-based encoding models. By addressing key limitations of alternative approaches that require natively generative models, our approach enables flexible characterization and modulation of responses across the human visual system.
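
Note (illustrative sketch, not from the paper): activation maximization, as described in the abstract, optimizes an input image so that a model's predicted response grows. The toy below stands in for the real pipeline: instead of Inception V3 features and a fitted fMRI regression, it uses a random `tanh` feature layer (`feature_weights`) and random regression weights (`voxel_weights`), both of which are assumptions, and runs plain gradient ascent with a hand-derived gradient.

```python
import numpy as np


def predicted_response(image, feature_weights, voxel_weights):
    """Toy encoding model: linear readout of tanh features of the image."""
    return float(voxel_weights @ np.tanh(feature_weights @ image))


def response_gradient(image, feature_weights, voxel_weights):
    """Gradient of the predicted response with respect to the image."""
    hidden = np.tanh(feature_weights @ image)
    # d tanh(z) / dz = 1 - tanh(z)^2, chained through the linear feature map.
    return feature_weights.T @ (voxel_weights * (1.0 - hidden ** 2))


def maximize_activation(feature_weights, voxel_weights, steps=50, step_size=0.1):
    """Gradient ascent on the image, starting from an all-zero image."""
    image = np.zeros(feature_weights.shape[1])
    history = [predicted_response(image, feature_weights, voxel_weights)]
    for _ in range(steps):
        image = image + step_size * response_gradient(
            image, feature_weights, voxel_weights
        )
        history.append(predicted_response(image, feature_weights, voxel_weights))
    return image, history


rng = np.random.default_rng(0)
feature_weights = 0.1 * rng.standard_normal((8, 16))  # stand-in for DNN features
voxel_weights = rng.standard_normal(8)                # stand-in for regression fit
image, history = maximize_activation(feature_weights, voxel_weights)
```

In practice the gradient would come from automatic differentiation through the pretrained network, and image priors or regularizers are commonly added; none of that is modeled here.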

Title: Is Perturbation-Based Image Protection Disruptive to Image Editing?

Authors: Qiuyu Tang, Bonor Ayambem, Mooi Choo Chuah, Aparna Bharati
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04394
Pdf URL: https://arxiv.org/pdf/2506.04394
Copy Paste: [[2506.04394]] Is Perturbation-Based Image Protection Disruptive to Image Editing?(https://arxiv.org/abs/2506.04394)
Keywords: generation
Abstract: The remarkable image generation capabilities of state-of-the-art diffusion models, such as Stable Diffusion, can also be misused to spread misinformation and plagiarize copyrighted materials. To mitigate the potential risks associated with image editing, current image protection methods rely on adding imperceptible perturbations to images to obstruct diffusion-based editing. A fully successful protection for an image implies that the output of editing attempts is an undesirable, noisy image which is completely unrelated to the reference image. In our experiments with various perturbation-based image protection methods across multiple domains (natural scene images and artworks) and editing tasks (image-to-image generation and style editing), we discover that such protection does not achieve this goal completely. In most scenarios, diffusion-based editing of protected images generates a desirable output image which adheres precisely to the guidance prompt. Our findings suggest that adding noise to images may paradoxically increase their association with given text prompts during the generation process, leading to unintended consequences such as better resultant edits. Hence, we argue that perturbation-based methods may not provide a sufficient solution for robust image protection against diffusion-based editing.

Title: HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

Authors: Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher Ré, David W. Romero
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.04421
Pdf URL: https://arxiv.org/pdf/2506.04421
Copy Paste: [[2506.04421]] HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation(https://arxiv.org/abs/2506.04421)
Keywords: generation
Abstract: Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor instead of the tokens in all predecessor resolutions. When predicting a resolution scale, HMAR uses a controllable multi-step masked generation procedure to generate a subset of the tokens in each step. On ImageNet 256x256 and 512x512 benchmarks, HMAR models match or outperform parameter-matched VAR, diffusion, and autoregressive baselines. We develop efficient IO-aware block-sparse attention kernels that allow HMAR to achieve faster training and inference times over VAR by over 2.5x and 1.75x respectively, as well as over 3x lower inference memory footprint. Finally, HMAR yields additional flexibility over VAR; its sampling schedule can be changed without further training, and it can be applied to image editing tasks in a zero-shot manner.
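
Note (illustrative sketch, not from the paper): the abstract contrasts VAR, where each resolution scale is conditioned on all tokens in all previous scales, with HMAR's Markovian formulation, where each scale is conditioned only on its immediate predecessor. The snippet counts how many conditioning tokens each scheme sees per scale. The scale side lengths are made-up values, and the count ignores any attention within the scale being predicted.

```python
def conditioning_sizes(side_lengths):
    """Return (full_history, markovian) conditioning-token counts per scale.

    side_lengths: side length of the square token map at each scale,
    ordered from coarsest to finest. The first scale has no predecessor,
    so both lists start at the second scale.
    """
    tokens = [side * side for side in side_lengths]
    # VAR-style: condition on every token from every earlier scale.
    full_history = [sum(tokens[:k]) for k in range(1, len(tokens))]
    # Markovian: condition only on the tokens of the previous scale.
    markovian = [tokens[k - 1] for k in range(1, len(tokens))]
    return full_history, markovian


full_history, markovian = conditioning_sizes([1, 2, 4, 8, 16])
```

For these hypothetical scales the finest level conditions on 85 tokens under the full-history scheme and 64 under the Markovian one; the gap widens as more scales are added.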

Title: RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis

Authors: Robin Yadav, Qi Yan, Guy Wolf, Avishek Joey Bose, Renjie Liao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04439
Pdf URL: https://arxiv.org/pdf/2506.04439
Copy Paste: [[2506.04439]] RETRO SYNFLOW: Discrete Flow Matching for Accurate and Diverse Single-Step Retrosynthesis(https://arxiv.org/abs/2506.04439)
Keywords: generation, generative
Abstract: A fundamental problem in organic chemistry is identifying and predicting the series of reactions that synthesize a desired target product molecule. Due to the combinatorial nature of the chemical search space, single-step reactant prediction -- i.e. single-step retrosynthesis -- remains challenging even for existing state-of-the-art template-free generative approaches to produce an accurate yet diverse set of feasible reactions. In this paper, we model single-step retrosynthesis planning and introduce RETRO SYNFLOW (RSF), a discrete flow-matching framework that builds a Markov bridge between the prescribed target product molecule and the reactant molecule. In contrast to past approaches, RSF employs a reaction center identification step to produce intermediate structures known as synthons as a more informative source distribution for the discrete flow. To further enhance diversity and feasibility of generated samples, we employ Feynman-Kac steering with Sequential Monte Carlo based resampling to steer promising generations at inference using a new reward oracle that relies on a forward-synthesis model. Empirically, we demonstrate RSF achieves 60.0% top-1 accuracy, which outperforms the previous SOTA by 20%. We also substantiate the benefits of steering at inference and demonstrate that FK-steering improves top-5 round-trip accuracy by 19% over prior template-free SOTA methods, all while preserving competitive top-k accuracy results.
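
Note (illustrative sketch, not from the paper): the abstract mentions Sequential Monte Carlo based resampling driven by a reward oracle. The snippet shows a generic multinomial resampling step in which candidates ("particles") are redrawn with probability proportional to `exp(reward / temperature)`. The function name, the `temperature` parameter, and the exponential weighting are assumptions; the paper's potentials and reward model are not given in this listing.

```python
import numpy as np


def resample_by_reward(particles, rewards, temperature=1.0, seed=0):
    """Redraw particles with probability proportional to exp(reward / T)."""
    rewards = np.asarray(rewards, dtype=float)
    logits = rewards / temperature
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(particles), size=len(particles), p=weights)
    return [particles[i] for i in indices], weights


# Hypothetical candidate labels and rewards, for illustration only.
survivors, weights = resample_by_reward(
    ["candidate_a", "candidate_b", "candidate_c"], [0.0, 0.0, 100.0]
)
```

With one reward far above the others, essentially all of the probability mass lands on that candidate, so the resampled population collapses onto it; a higher `temperature` keeps more diversity.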

Title: AuthGuard: Generalizable Deepfake Detection via Language Guidance

Authors: Guangyu Shen, Zhihua Li, Xiang Xu, Tianchen Zhao, Zheng Zhang, Dongsheng An, Zhuowen Tu, Yifan Xing, Qin Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04501
Pdf URL: https://arxiv.org/pdf/2506.04501
Copy Paste: [[2506.04501]] AuthGuard: Generalizable Deepfake Detection via Language Guidance(https://arxiv.org/abs/2506.04501)
Keywords: generation
Abstract: Existing deepfake detection techniques struggle to keep up with ever-evolving, novel, unseen forgery methods. This limitation stems from their reliance on statistical artifacts learned during training, which are often tied to specific generation processes that may not be representative of samples from new, unseen deepfake generation methods encountered at test time. We propose that incorporating language guidance can improve deepfake detection generalization by integrating human-like commonsense reasoning -- such as recognizing logical inconsistencies and perceptual anomalies -- alongside statistical cues. To achieve this, we train an expert deepfake vision encoder by combining discriminative classification with image-text contrastive learning, where the text is generated by generalist MLLMs using few-shot prompting. This allows the encoder to extract both language-describable, commonsense deepfake artifacts and statistical forgery artifacts from pixel-level distributions. To further enhance robustness, we integrate data uncertainty learning into vision-language contrastive learning, mitigating noise in image-text supervision. Our expert vision encoder seamlessly interfaces with an LLM, further enabling more generalized and interpretable deepfake detection while also boosting accuracy. The resulting framework, AuthGuard, achieves state-of-the-art deepfake detection accuracy in both in-distribution and out-of-distribution settings, achieving AUC gains of 6.15% on the DFDC dataset and 16.68% on the DF40 dataset. Additionally, AuthGuard significantly enhances deepfake reasoning, improving performance by 24.69% on the DDVQA dataset.

Title: EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention

Authors: Shuo Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04526
Pdf URL: https://arxiv.org/pdf/2506.04526
Copy Paste: [[2506.04526]] EECD-Net: Energy-Efficient Crack Detection with Spiking Neural Networks and Gated Attention(https://arxiv.org/abs/2506.04526)
Keywords: super-resolution
Abstract: Crack detection on road surfaces is a critical measurement technology in the instrumentation domain, essential for ensuring infrastructure safety and transportation reliability. However, due to limited energy and low-resolution imaging, smart terminal devices struggle to maintain real-time monitoring performance. To overcome these challenges, this paper proposes a multi-stage detection approach for road crack detection, EECD-Net, to enhance accuracy and energy efficiency of instrumentation. Specifically, the sophisticated Super-Resolution Convolutional Neural Network (SRCNN) is employed to address the inherent challenges of low-quality images, which effectively enhance image resolution while preserving critical structural details. Meanwhile, a Spike Convolution Unit (SCU) with Continuous Integrate-and-Fire (CIF) neurons is proposed to convert these images into sparse pulse sequences, significantly reducing power consumption. Additionally, a Gated Attention Transformer (GAT) module is designed to strategically fuse multi-scale feature representations through adaptive attention mechanisms, effectively capturing both long-range dependencies and intricate local crack patterns, and significantly enhancing detection robustness across varying crack morphologies. The experiments on the CrackVision12K benchmark demonstrate that EECD-Net achieves a remarkable 98.6% detection accuracy, surpassing state-of-the-art counterparts such as Hybrid-Segmentor by a significant 1.5%. Notably, the EECD-Net maintains exceptional energy efficiency, consuming merely 5.6 mJ, which is a substantial 33% reduction compared to baseline implementations. This work pioneers a transformative approach in instrumentation-based crack detection, offering a scalable, low-power solution for real-time, large-scale infrastructure monitoring in resource-constrained environments.
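
Note (illustrative sketch, not from the paper): the abstract describes integrate-and-fire neurons converting images into sparse pulse sequences. The snippet shows a textbook integrate-and-fire encoder with reset by subtraction, applied per pixel over a few time steps. It is a generic stand-in, not the paper's Continuous Integrate-and-Fire (CIF) neuron, whose exact dynamics are not given in this listing; `num_steps` and `threshold` are assumed values.

```python
import numpy as np


def integrate_and_fire(intensity, num_steps=4, threshold=1.0):
    """Encode per-pixel intensities as binary spike trains.

    Each step adds the pixel intensity to a membrane potential. A pixel
    emits a spike when its potential reaches the threshold, and the
    threshold is then subtracted from that pixel's potential.
    Returns an array of shape (num_steps, *intensity.shape).
    """
    intensity = np.asarray(intensity, dtype=float)
    potential = np.zeros_like(intensity)
    spikes = np.zeros((num_steps,) + intensity.shape, dtype=np.uint8)
    for t in range(num_steps):
        potential = potential + intensity
        fired = potential >= threshold
        spikes[t] = fired
        potential = np.where(fired, potential - threshold, potential)
    return spikes


spikes = integrate_and_fire(np.array([0.0, 0.5, 1.0]))
```

Dark pixels produce no spikes and bright pixels fire often, so the spike count over time approximates intensity while most entries stay zero.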

Title: NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models

Authors: Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar
Subjects: cs.LG, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.04536
Pdf URL: https://arxiv.org/pdf/2506.04536
Copy Paste: [[2506.04536]] NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models(https://arxiv.org/abs/2506.04536)
Keywords: generation
Abstract: Characterizing the diverse computational properties of human neurons via multimodal electrophysiological, transcriptomic, and morphological data provides the foundation for constructing and validating bio-realistic neuron models that can advance our understanding of fundamental mechanisms underlying brain function. However, current modeling approaches remain constrained by the limited availability and intrinsic variability of experimental neuronal data. To capture variability, ensembles of deterministic models are often used, but are difficult to scale as model generation requires repeating computationally expensive optimization for each neuron. While deep learning is becoming increasingly relevant in this space, it fails to capture the full biophysical complexity of neurons, their nonlinear voltage dynamics, and variability. To address these shortcomings, we introduce NOBLE, a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. Trained on data generated from biophysically realistic neuron models, NOBLE predicts distributions of neural dynamics accounting for the intrinsic experimental variability. Unlike conventional bio-realistic neuron models, interpolating within the embedding space offers models whose dynamics are consistent with experimentally observed responses. NOBLE is the first scaled-up deep learning framework validated on real experimental data, enabling efficient generation of synthetic neurons that exhibit trial-to-trial variability and achieve a 4200x speedup over numerical solvers. To this end, NOBLE captures fundamental neural properties, opening the door to a better understanding of cellular composition and computations, neuromorphic architectures, large-scale brain circuits, and general neuroAI applications.

Title: Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels

Authors: Heng Tian
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.04555
Pdf URL: https://arxiv.org/pdf/2506.04555
Copy Paste: [[2506.04555]] Enhancing Frequency for Single Image Super-Resolution with Learnable Separable Kernels(https://arxiv.org/abs/2506.04555)
Keywords: super-resolution
Abstract: Existing approaches often enhance the performance of single-image super-resolution (SISR) methods by incorporating auxiliary structures, such as specialized loss functions, to indirectly boost the quality of low-resolution images. In this paper, we propose a plug-and-play module called Learnable Separable Kernels (LSKs), which are formally rank-one matrices designed to directly enhance image frequency components. We begin by explaining why LSKs are particularly suitable for SISR tasks from a frequency perspective. Baseline methods incorporating LSKs demonstrate a significant reduction of over 60% in both the number of parameters and computational requirements. This reduction is achieved through the decomposition of LSKs into orthogonal and mergeable one-dimensional kernels. Additionally, we perform an interpretable analysis of the feature maps generated by LSKs. Visualization results reveal the capability of LSKs to enhance image frequency components effectively. Extensive experiments show that incorporating LSKs not only reduces the number of parameters and computational load but also improves overall model performance. Moreover, these experiments demonstrate that models utilizing LSKs exhibit superior performance, particularly as the upscaling factor increases.
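The abstract above describes LSKs as rank-one matrices that decompose into one-dimensional kernels. The following is a minimal sketch of that general idea in plain Python, not the authors' implementation: a rank-one 2D kernel is applied once as a dense filter and once as two 1D passes, and the two results are compared. All function names, the test image, and the kernel values are hypothetical, and the parameter counts shown are for a single 3x3 kernel only; they are not meant to reproduce the reduction reported in the paper.

```python
# Hypothetical illustration of a rank-one (separable) kernel.
# None of these names come from the paper.

def outer(col, row):
    """Build the rank-one 2D kernel K[i][j] = col[i] * row[j]."""
    return [[c * r for r in row] for c in col]

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def separable_valid(image, col, row):
    """Same result via a horizontal 1D pass, then a vertical 1D pass."""
    horizontal = correlate2d_valid(image, [row])               # 1 x k kernel
    return correlate2d_valid(horizontal, [[c] for c in col])   # k x 1 kernel

# Small integer test image and two illustrative 1D kernels (assumed values).
image = [[(3 * y + 5 * x) % 7 for x in range(6)] for y in range(6)]
col, row = [1, 2, 1], [-1, 0, 1]

full = correlate2d_valid(image, outer(col, row))
split = separable_valid(image, col, row)

params_full = len(col) * len(row)    # k * k weights for a dense kernel
params_split = len(col) + len(row)   # 2k weights for the separable form

print(full == split, params_full, params_split)
```

The two paths agree because the double sum over a rank-one kernel factors into a sum over rows of a sum over columns, so the dense filter never needs to be materialized.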

Title: Follow-Your-Creation: Empowering 4D Creation through Video Inpainting

Authors: Yue Ma, Kunyu Feng, Xinhua Zhang, Hongyu Liu, David Junhao Zhang, Jinbo Xing, Yinhan Zhang, Ayden Yang, Zeyu Wang, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04590
Pdf URL: https://arxiv.org/pdf/2506.04590
Copy Paste: [[2506.04590]] Follow-Your-Creation: Empowering 4D Creation through Video Inpainting(https://arxiv.org/abs/2506.04590)
Keywords: generation, generative
Abstract: We introduce Follow-Your-Creation, a novel 4D video creation framework capable of both generating and editing 4D content from a single monocular video input. By leveraging a powerful video inpainting foundation model as a generative prior, we reformulate 4D video creation as a video inpainting task, enabling the model to fill in missing content caused by camera trajectory changes or user edits. To facilitate this, we generate composite masked inpainting video data to effectively fine-tune the model for 4D video generation. Given an input video and its associated camera trajectory, we first perform depth-based point cloud rendering to obtain invisibility masks that indicate the regions that should be completed. Simultaneously, editing masks are introduced to specify user-defined modifications, and these are combined with the invisibility masks to create a composite masks dataset. During training, we randomly sample different types of masks to construct diverse and challenging inpainting scenarios, enhancing the model's generalization and robustness in various 4D editing and generation tasks. To handle temporal consistency under large camera motion, we design a self-iterative tuning strategy that gradually increases the viewing angles during training, where the model is used to generate the next-stage training data after each fine-tuning iteration. Moreover, we introduce a temporal packaging module during inference to enhance generation quality. Our method effectively leverages the prior knowledge of the base model without degrading its original performance, enabling the generation of 4D videos with consistent multi-view coherence. In addition, our approach supports prompt-based content editing, demonstrating strong flexibility and significantly outperforming state-of-the-art methods in both quality and versatility.

Title: Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Authors: Marianna Nezhurina, Tomer Porian, Giovanni Pucceti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04598
Pdf URL: https://arxiv.org/pdf/2506.04598
Copy Paste: [[2506.04598]] Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets(https://arxiv.org/abs/2506.04598)
Keywords: generative
Abstract: In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, allowing to decide which procedure is to be preferred for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model and samples seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, that use either contrastive only or contrastive and captioning text generative loss. Ensuring sufficient prediction accuracy for held out points, we use derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling laws for various downstream tasks, classification, retrieval, and segmentation, and for different open datasets, DataComp, DFN and Re-LAION, observing consistently the same trends. We show that comparison can also be performed when deriving scaling laws with a constant learning rate schedule, reducing compute cost. Accurate derivation of scaling laws provides thus means to perform model and dataset comparison across scale spans, avoiding misleading conclusions based on measurements from single reference scales only, paving the road for systematic comparison and improvement of open foundation models and datasets for their creation. We release all the pre-trained models with their intermediate checkpoints, including openMaMMUT-L/14, which achieves $80.3\%$ zero-shot ImageNet-1k accuracy, trained on 12.8B samples from DataComp-1.4B. Code for reproducing experiments in the paper and raw experiments data can be found at this https URL.
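As a rough illustration of what deriving a scaling law from measurements involves, the sketch below fits a simple power law y = a * x^(-b) by least squares in log-log space. This is an assumption-laden example: the functional form, the function name `fit_power_law`, and the synthetic data are all hypothetical and are not taken from the paper, which may use a different parameterization and fitting procedure.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    # Ordinary least-squares slope and intercept of log(y) against log(x).
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free data generated from a known power law (assumed values).
samples_seen = [1e6, 1e7, 1e8, 1e9]
errors = [2.0 * s ** -0.1 for s in samples_seen]

a, b = fit_power_law(samples_seen, errors)
print(a, b)
```

Fitting the same form to two procedures and comparing the recovered exponents is one simple way to compare how each improves with scale, in the spirit of the comparison the abstract describes.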

Title: SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

Authors: Alexander Huang-Menders, Xinhang Liu, Andy Xu, Yuyao Zhang, Chi-Keung Tang, Yu-Wing Tai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04606
Pdf URL: https://arxiv.org/pdf/2506.04606
Copy Paste: [[2506.04606]] SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents(https://arxiv.org/abs/2506.04606)
Keywords: generation
Abstract: SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.

Title: Exploring bidirectional bounds for minimax-training of Energy-based models

Authors: Cong Geng, Jia Wang, Li Chen, Zhiyong Gao, Jes Frellsen, Søren Hauberg
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.04609
Pdf URL: https://arxiv.org/pdf/2506.04609
Copy Paste: [[2506.04609]] Exploring bidirectional bounds for minimax-training of Energy-based models(https://arxiv.org/abs/2506.04609)
Keywords: generation, generative
Abstract: Energy-based models (EBMs) estimate unnormalized densities in an elegant framework, but they are generally difficult to train. Recent work has linked EBMs to generative adversarial networks, by noting that they can be trained through a minimax game using a variational lower bound. To avoid the instabilities caused by minimizing a lower bound, we propose to instead work with bidirectional bounds, meaning that we maximize a lower bound and minimize an upper bound when training the EBM. We investigate four different bounds on the log-likelihood derived from different perspectives. We derive lower bounds based on the singular values of the generator Jacobian and on mutual information. To upper bound the negative log-likelihood, we consider a gradient penalty-like bound, as well as one based on diffusion processes. In all cases, we provide algorithms for evaluating the bounds. We compare the different bounds to investigate the pros and cons of the different approaches. Finally, we demonstrate that the use of bidirectional bounds stabilizes EBM training and yields high-quality density estimation and sample generation.

Title: Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

Authors: Qiming Hu, Linlong Fan, Yiyan Luo, Yuhang Yu, Xiaojie Guo, Qingnan Fan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04641
Pdf URL: https://arxiv.org/pdf/2506.04641
Copy Paste: [[2506.04641]] Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders(https://arxiv.org/abs/2506.04641)
Keywords: super-resolution, generative
Abstract: The introduction of generative models has significantly advanced image super-resolution (SR) in handling real-world degradations. However, they often incur fidelity-related issues, particularly distorting textual structures. In this paper, we introduce a novel diffusion-based SR framework, namely TADiSR, which integrates text-aware attention and joint segmentation decoders to recover not only natural details but also the structural fidelity of text regions in degraded real-world images. Moreover, we propose a complete pipeline for synthesizing high-quality images with fine-grained full-image text masks, combining realistic foreground text regions with detailed background content. Extensive experiments demonstrate that our approach substantially enhances text legibility in super-resolved images, achieving state-of-the-art performance across multiple evaluation metrics and exhibiting strong generalization to real-world scenarios. Our code is available at this https URL.

Title: Inference economics of language models

Authors: Ege Erdil
Subjects: cs.LG, cs.DC
Abstract URL: https://arxiv.org/abs/2506.04645
Pdf URL: https://arxiv.org/pdf/2506.04645
Copy Paste: [[2506.04645]] Inference economics of language models(https://arxiv.org/abs/2506.04645)
Keywords: generation
Abstract: We develop a theoretical model that addresses the economic trade-off between cost per token and serial token generation speed when deploying LLMs for inference at scale. Our model takes into account arithmetic, memory bandwidth, network bandwidth, and latency constraints, and optimizes over different parallelism setups and batch sizes to find the ones that optimize serial inference speed at a given cost per token. We use the model to compute Pareto frontiers of serial speed versus cost per token for popular language models.
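To make the notion of a Pareto frontier of serial speed versus cost per token concrete, here is a minimal sketch that filters a set of candidate deployment configurations down to the non-dominated ones. The configuration names and the speed and cost numbers are invented for illustration only; they are not results from the paper, and the function `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(configs):
    """Return configs not dominated in (higher speed, lower cost)."""
    # Sort by cost ascending, breaking ties by higher speed first.
    ordered = sorted(configs, key=lambda c: (c["cost"], -c["speed"]))
    frontier, best_speed = [], float("-inf")
    for config in ordered:
        # Keep a config only if it is faster than every cheaper one.
        if config["speed"] > best_speed:
            frontier.append(config)
            best_speed = config["speed"]
    return frontier

# Made-up setups: speed in tokens/s per request, cost in arbitrary units per token.
configs = [
    {"name": "A", "speed": 20, "cost": 1.0},
    {"name": "B", "speed": 50, "cost": 2.0},
    {"name": "C", "speed": 40, "cost": 2.5},  # slower and pricier than B
    {"name": "D", "speed": 90, "cost": 6.0},
    {"name": "E", "speed": 90, "cost": 7.0},  # same speed as D, pricier
]

frontier_names = [c["name"] for c in pareto_frontier(configs)]
print(frontier_names)  # ['A', 'B', 'D']
```

Each point on the resulting frontier is the fastest option available at or below its cost, which is the quantity the abstract says the model optimizes over parallelism setups and batch sizes.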

Title: FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion

Authors: Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, Bohan Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04648
Pdf URL: https://arxiv.org/pdf/2506.04648
Copy Paste: [[2506.04648]] FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion(https://arxiv.org/abs/2506.04648)
Keywords: generation, generative
Abstract: Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although both quantization and sparsity can independently accelerate inference while maintaining generation quality, naively combining these techniques in existing training-free approaches leads to significant performance degradation due to the lack of joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity; 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps; 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features for highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09x kernel speedup for attention operations and a 4.96x end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution, without sacrificing generation quality.

Title: Gen-n-Val: Agentic Image Data Generation and Validation

Authors: Jing-En Huang, I-Sheng Fang, Tzuhsuan Huang, Chih-Yu Wang, Jun-Cheng Chen
Subjects: cs.CV, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2506.04676
Pdf URL: https://arxiv.org/pdf/2506.04676
Copy Paste: [[2506.04676]] Gen-n-Val: Agentic Image Data Generation and Validation(https://arxiv.org/abs/2506.04676)
Keywords: generation
Abstract: Recently, Large Language Models (LLMs) and Vision Large Language Models (VLLMs) have demonstrated impressive performance as agents across various tasks while data scarcity and label noise remain significant challenges in computer vision tasks, such as object detection and instance segmentation. A common solution for resolving these issues is to generate synthetic data. However, current synthetic data generation methods struggle with issues, such as multiple objects per mask, inaccurate segmentation, and incorrect category labels, limiting their effectiveness. To address these issues, we introduce Gen-n-Val, a novel agentic data generation framework that leverages Layer Diffusion (LD), LLMs, and VLLMs to produce high-quality, single-object masks and diverse backgrounds. Gen-n-Val consists of two agents: (1) The LD prompt agent, an LLM, optimizes prompts for LD to generate high-quality foreground instance images and segmentation masks. These optimized prompts ensure the generation of single-object synthetic data with precise instance masks and clean backgrounds. (2) The data validation agent, a VLLM, filters out low-quality synthetic instance images. The system prompts for both agents are refined through TextGrad. Additionally, we use image harmonization to combine multiple instances within scenes. Compared to state-of-the-art synthetic data approaches like MosaicFusion, our approach reduces invalid synthetic data from 50% to 7% and improves performance by 1% mAP on rare classes in COCO instance segmentation with YOLOv9c and YOLO11m. Furthermore, Gen-n-Val shows significant improvements (7.1% mAP) over YOLO-Worldv2-M in open-vocabulary object detection benchmarks with YOLO11m. Moreover, Gen-n-Val improves the performance of YOLOv9 and YOLO11 families in instance segmentation and object detection.

Title: MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements

Authors: Chuyun Deng, Na Liu, Wei Xie, Lianming Xu, Li Wang
Subjects: cs.CV, eess.SP
Abstract URL: https://arxiv.org/abs/2506.04682
Pdf URL: https://arxiv.org/pdf/2506.04682
Copy Paste: [[2506.04682]] MARS: Radio Map Super-resolution and Reconstruction Method under Sparse Channel Measurements(https://arxiv.org/abs/2506.04682)
Keywords: super-resolution
Abstract: Radio maps reflect the spatial distribution of signal strength and are essential for applications like smart cities, IoT, and wireless network planning. However, reconstructing accurate radio maps from sparse measurements remains challenging. Traditional interpolation and inpainting methods lack environmental awareness, while many deep learning approaches depend on detailed scene data, limiting generalization. To address this, we propose MARS, a Multi-scale Aware Radiomap Super-resolution method that combines CNNs and Transformers with multi-scale feature fusion and residual connections. MARS focuses on both global and local feature extraction, enhancing feature representation across different receptive fields and improving reconstruction accuracy. Experiments across different scenes and antenna locations show that MARS outperforms baseline models in both MSE and SSIM, while maintaining low computational cost, demonstrating strong practical potential.

Title: Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence

Authors: José Manuel de Frutos, Manuel A. Vázquez, Pablo M. Olmos, Joaquín Míguez
Subjects: cs.LG, cs.AI, math.PR, stat.ML
Abstract URL: https://arxiv.org/abs/2506.04700
Pdf URL: https://arxiv.org/pdf/2506.04700
Copy Paste: [[2506.04700]] Explicit Density Approximation for Neural Implicit Samplers Using a Bernstein-Based Convex Divergence(https://arxiv.org/abs/2506.04700)
Keywords: generative
Abstract: Rank-based statistical metrics, such as the invariant statistical loss (ISL), have recently emerged as robust and practically effective tools for training implicit generative models. In this work, we introduce dual-ISL, a novel likelihood-free objective for training implicit generative models that interchanges the roles of the target and model distributions in the ISL framework, yielding a convex optimization problem in the space of model densities. We prove that the resulting rank-based discrepancy $d_K$ is i) continuous under weak convergence and with respect to the $L^1$ norm, and ii) convex in its first argument, properties not shared by classical divergences such as KL or Wasserstein distances. Building on this, we develop a theoretical framework that interprets $d_K$ as an $L^2$-projection of the density ratio $q = p/\tilde p$ onto a Bernstein polynomial basis, from which we derive exact bounds on the truncation error, precise convergence rates, and a closed-form expression for the truncated density approximation. We further extend our analysis to the multivariate setting via random one-dimensional projections, defining a sliced dual-ISL divergence that retains both convexity and continuity. We empirically show that these theoretical advantages translate into practical ones. Specifically, across several benchmarks dual-ISL converges more rapidly, delivers markedly smoother and more stable training, and more effectively prevents mode collapse than classical ISL and other leading implicit generative methods, while also providing an explicit density approximation.

Title: UNO: Unlearning via Orthogonalization in Generative models

Authors: Pinak Mandal, Georg A. Gottwald
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04712
Pdf URL: https://arxiv.org/pdf/2506.04712
Copy Paste: [[2506.04712]] UNO: Unlearning via Orthogonalization in Generative models(https://arxiv.org/abs/2506.04712)
Keywords: generation, generative
Abstract: As generative models become increasingly powerful and pervasive, the ability to unlearn specific data, whether due to privacy concerns, legal requirements, or the correction of harmful content, has become increasingly important. Unlike in conventional training, where data are accumulated and knowledge is reinforced, unlearning aims to selectively remove the influence of particular data points without costly retraining from scratch. To be effective and reliable, such algorithms need to achieve (i) forgetting of the undesired data, (ii) preservation of the quality of the generation, (iii) preservation of the influence of the desired training data on the model parameters, and (iv) a small number of training steps. We propose fast unlearning algorithms based on loss gradient orthogonalization. We show that our algorithms are able to forget data while maintaining the fidelity of the original model. Using MNIST and CelebA data, we demonstrate that our algorithms achieve orders of magnitude faster unlearning times than their predecessors, such as gradient surgery.
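The abstract above mentions unlearning based on loss gradient orthogonalization without giving the update rule. As a generic sketch of what orthogonalizing one gradient against another can look like, the code below removes from a "forget" gradient its component along a "retain" gradient using a standard vector projection. This is an assumed, textbook-style projection and not necessarily the authors' algorithm; the function names and the gradient values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(forget_grad, retain_grad, eps=1e-12):
    """Remove from forget_grad its component along retain_grad."""
    denom = dot(retain_grad, retain_grad)
    if denom < eps:
        # Retain gradient is (numerically) zero: nothing to project out.
        return list(forget_grad)
    scale = dot(forget_grad, retain_grad) / denom
    return [f - scale * r for f, r in zip(forget_grad, retain_grad)]

# Toy flattened gradients (assumed values, chosen for easy arithmetic).
forget_grad = [3.0, 1.0, -2.0]
retain_grad = [1.0, 0.0, 1.0]

update = orthogonalize(forget_grad, retain_grad)
print(update, dot(update, retain_grad))
```

A step along the projected direction leaves the retain loss unchanged to first order, which is one way to pursue forgetting while preserving the influence of the desired training data.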

Title: Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model

Authors: Zelu Qi, Ping Shi, Chaoyang Zhang, Shuqi Wang, Fei Zhao, Da Pan, Zefeng Ying
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04715
Pdf URL: https://arxiv.org/pdf/2506.04715
Copy Paste: [[2506.04715]] Towards Holistic Visual Quality Assessment of AI-Generated Videos: A LLM-Based Multi-Dimensional Evaluation Model(https://arxiv.org/abs/2506.04715)
Keywords: generative, quality assessment
Abstract: The development of AI-Generated Video (AIGV) technology has been remarkable in recent years, significantly transforming the paradigm of video content production. However, AIGVs still suffer from noticeable visual quality defects, such as noise, blurriness, frame jitter and low dynamic degree, which severely impact the user's viewing experience. Therefore, an effective automatic visual quality assessment is of great importance for AIGV content regulation and generative model improvement. In this work, we decompose the visual quality of AIGVs into three dimensions: technical quality, motion quality, and video semantics. For each dimension, we design a corresponding encoder to achieve effective feature representation. Moreover, considering the outstanding performance of large language models (LLMs) in various vision and language tasks, we introduce an LLM as the quality regression module. To better enable the LLM to establish reasoning associations between multi-dimensional features and visual quality, we propose a specially designed multi-modal prompt engineering framework. Additionally, we incorporate LoRA fine-tuning technology during the training phase, allowing the LLM to better adapt to specific tasks. Our proposed method achieved second place in the NTIRE 2025 Quality Assessment of AI-Generated Content Challenge: Track 2 AI Generated video, demonstrating its effectiveness. Codes can be obtained at this https URL.

Title: SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs

Authors: Shuhan Xu, Siyuan Liang, Hongling Zheng, Yong Luo, Aishan Liu, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04743
Pdf URL: https://arxiv.org/pdf/2506.04743
Copy Paste: [[2506.04743]] SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs(https://arxiv.org/abs/2506.04743)
Keywords: generative
Abstract: Vision-Language Models (VLMs) have achieved remarkable performance in image captioning, but recent studies show they are vulnerable to backdoor attacks. Attackers can inject imperceptible perturbations, such as local pixel triggers or global semantic phrases, into the training data, causing the model to generate malicious, attacker-controlled captions for specific inputs. These attacks are hard to detect and defend against due to their stealthiness and cross-modal nature. By analyzing attack samples, we identify two key vulnerabilities: (1) abnormal attention concentration on specific image regions, and (2) semantic drift and incoherence in generated captions. To counter this, we propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations (e.g., occlusion, color masking) to sensitive image regions, aiming to disrupt the activation of malicious pathways. We design a semantic fidelity score as the reward signal, which jointly evaluates semantic consistency and linguistic fluency of the output, guiding the agent toward generating robust yet faithful captions. Experiments across mainstream VLMs and datasets show SRD reduces attack success rates to 5.6%, while preserving caption quality on clean inputs with less than 10% performance drop. SRD offers a trigger-agnostic, interpretable defense paradigm against stealthy backdoor threats in multimodal generative models.

Title: DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation

Authors: Shuo Cao, Yihao Liu, Xiaohui Li, Yuanting Gao, Yu Zhou, Chao Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04830
Pdf URL: https://arxiv.org/pdf/2506.04830
Copy Paste: [[2506.04830]] DualX-VSR: Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation(https://arxiv.org/abs/2506.04830)
Keywords: super-resolution, generation
Abstract: Transformer-based models like ViViT and TimeSformer have advanced video understanding by effectively modeling spatiotemporal dependencies. Recent video generation models, such as Sora and Vidu, further highlight the power of transformers in long-range feature extraction and holistic spatiotemporal modeling. However, directly applying these models to real-world video super-resolution (VSR) is challenging, as VSR demands pixel-level precision, which can be compromised by tokenization and sequential attention mechanisms. While recent transformer-based VSR models attempt to address these issues using smaller patches and local attention, they still face limitations such as restricted receptive fields and dependence on optical flow-based alignment, which can introduce inaccuracies in real-world settings. To overcome these issues, we propose Dual Axial Spatial$\times$Temporal Transformer for Real-World Video Super-Resolution (DualX-VSR), which introduces a novel dual axial spatial$\times$temporal attention mechanism that integrates spatial and temporal information along orthogonal directions. DualX-VSR eliminates the need for motion compensation, offering a simplified structure that provides a cohesive representation of spatiotemporal information. As a result, DualX-VSR achieves high fidelity and superior performance in real-world VSR tasks.

Title: OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model

Authors: Kunshen Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04837
Pdf URL: https://arxiv.org/pdf/2506.04837
Copy Paste: [[2506.04837]] OpenMaskDINO3D : Reasoning 3D Segmentation via Large Language Model(https://arxiv.org/abs/2506.04837)
Keywords: generation
Abstract: Although perception systems have made remarkable advancements in recent years, particularly in 2D reasoning segmentation, these systems still rely on explicit human instruction or pre-defined categories to identify target objects before executing visual recognition tasks. Such systems have matured significantly, demonstrating the ability to reason and comprehend implicit user intentions in two-dimensional contexts, producing accurate segmentation masks based on complex and implicit query text. However, a comparable framework and structure for 3D reasoning segmentation remain absent. This paper introduces OpenMaskDINO3D, an LLM designed for comprehensive 3D understanding and segmentation. OpenMaskDINO3D processes point cloud data and text prompts to produce instance segmentation masks, excelling in many 3D tasks. By introducing a SEG token and object identifier, we achieve high-precision 3D segmentation mask generation, enabling the model to directly produce accurate point cloud segmentation results from natural language instructions. Experimental results on large-scale ScanNet datasets validate the effectiveness of OpenMaskDINO3D across various tasks.

Title: Geological Field Restoration through the Lens of Image Inpainting

Authors: Vladislav Trifonov, Ivan Oseledets, Ekaterina Muravleva
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04869
Pdf URL: https://arxiv.org/pdf/2506.04869
Copy Paste: [[2506.04869]] Geological Field Restoration through the Lens of Image Inpainting(https://arxiv.org/abs/2506.04869)
Keywords: restoration
Abstract: We present a new viewpoint on reconstructing multidimensional geological fields from sparse observations. Drawing inspiration from deterministic image inpainting techniques, we model a partially observed spatial field as a multidimensional tensor and recover missing values by enforcing a global low-rank structure. Our approach combines ideas from tensor completion and geostatistics, providing a robust optimization framework. Experiments on synthetic geological fields demonstrate that the tensor completion method used yields significant improvements in reconstruction accuracy over ordinary kriging across various percentages of observed data.
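Illustration (not from the paper): the idea of recovering missing values by enforcing a low-rank structure can be sketched in its simplest form, a rank-1 matrix fitted by alternating least squares on the observed entries only. This is a 2D toy analogue under stated assumptions; the function name `complete_rank1`, the toy field, and the choice of hidden cells are all hypothetical, and the paper's actual tensor completion method is not reproduced here.

```python
def complete_rank1(M, mask, iters=200):
    """Fit M ~ u v^T using only entries where mask[i][j] is True.

    Alternating least squares: with v fixed, each u[i] has a closed-form
    least-squares solution over the observed cells of row i, and vice versa.
    """
    n_rows, n_cols = len(M), len(M[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iters):
        for i in range(n_rows):
            num = sum(M[i][j] * v[j] for j in range(n_cols) if mask[i][j])
            den = sum(v[j] ** 2 for j in range(n_cols) if mask[i][j])
            u[i] = num / den
        for j in range(n_cols):
            num = sum(M[i][j] * u[i] for i in range(n_rows) if mask[i][j])
            den = sum(u[i] ** 2 for i in range(n_rows) if mask[i][j])
            v[j] = num / den
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Toy "field" that is exactly rank 1 (hypothetical data).
u_true = [1.0, 2.0, 3.0]
v_true = [1.0, 0.5, 2.0, -1.0]
truth = [[a * b for b in v_true] for a in u_true]

# Hide three cells; the solver never reads the hidden values.
hidden = {(0, 1), (1, 3), (2, 0)}
mask = [[(i, j) not in hidden for j in range(4)] for i in range(3)]
observed = [[truth[i][j] if mask[i][j] else 0.0 for j in range(4)]
            for i in range(3)]

recovered = complete_rank1(observed, mask)
max_err = max(abs(recovered[i][j] - truth[i][j])
              for i in range(3) for j in range(4))
print("largest reconstruction error:", max_err)
```

The hidden cells are recoverable here because every row and column still has observed entries linking it to the rest of the field; real fields are only approximately low-rank, so the fit would be approximate as well.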

Title: Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking

Authors: Yu-Feng Chen, Tzuhsuan Huang, Pin-Yen Chiu, Jun-Cheng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04879
Pdf URL: https://arxiv.org/pdf/2506.04879
Copy Paste: [[2506.04879]] Invisible Backdoor Triggers in Image Editing Model via Deep Watermarking(https://arxiv.org/abs/2506.04879)
Keywords: generation
Abstract: Diffusion models have achieved remarkable progress in both image generation and editing. However, recent studies have revealed their vulnerability to backdoor attacks, in which specific patterns embedded in the input can manipulate the model's behavior. Most existing research in this area has proposed attack frameworks focused on the image generation pipeline, leaving backdoor attacks in image editing relatively unexplored. Among the few studies targeting image editing, most utilize visible triggers, which are impractical because they introduce noticeable alterations to the input image before editing. In this paper, we propose a novel attack framework that embeds invisible triggers into the image editing process via poisoned training data. We leverage off-the-shelf deep watermarking models to encode imperceptible watermarks as backdoor triggers. Our goal is to make the model produce the predefined backdoor target when it receives watermarked inputs, while editing clean images normally according to the given prompt. With extensive experiments across different watermarking models, the proposed method achieves promising attack success rates. In addition, the analysis of watermark characteristics in terms of backdoor attacks further supports the effectiveness of our approach. The code is available at: this https URL

Title: Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer

Authors: Filip Slezak, Magnus K. Gjerde, Joakim B. Haurum, Ivan Nikolov, Morten S. Laursen, Thomas B. Moeslund
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04908
Pdf URL: https://arxiv.org/pdf/2506.04908
Copy Paste: [[2506.04908]] Generating Synthetic Stereo Datasets using 3D Gaussian Splatting and Expert Knowledge Transfer(https://arxiv.org/abs/2506.04908)
Keywords: generation
Abstract: In this paper, we introduce a 3D Gaussian Splatting (3DGS)-based pipeline for stereo dataset generation, offering an efficient alternative to Neural Radiance Fields (NeRF)-based methods. To obtain useful geometry estimates, we explore utilizing the reconstructed geometry from the explicit 3D representations as well as depth estimates from the FoundationStereo model in an expert knowledge transfer setup. We find that stereo models fine-tuned on 3DGS-generated datasets achieve competitive performance on zero-shot generalization benchmarks. When using the reconstructed geometry directly, we observe that it is often noisy and contains artifacts, which propagate noise to the trained model. In contrast, we find that the disparity estimates from FoundationStereo are cleaner and consequently result in better performance on the zero-shot generalization benchmarks. Our method highlights the potential for low-cost, high-fidelity dataset creation and fast fine-tuning for deep stereo models. Moreover, we also reveal that while the latest Gaussian Splatting-based methods have achieved superior performance on established benchmarks, their robustness falls short in challenging in-the-wild settings, warranting further exploration.

Title: Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining

Authors: Yong Sun, Yipeng Wang, Junyu Shi, Zhiyuan Zhang, Yanmei Xiao, Lei Zhu, Manxi Jiang, Qiang Nie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04950
Pdf URL: https://arxiv.org/pdf/2506.04950
Copy Paste: [[2506.04950]] Time-Lapse Video-Based Embryo Grading via Complementary Spatial-Temporal Pattern Mining(https://arxiv.org/abs/2506.04950)
Keywords: quality assessment
Abstract: Artificial intelligence has recently shown promise in automated embryo selection for In-Vitro Fertilization (IVF). However, current approaches either address partial embryo evaluation lacking holistic quality assessment or target clinical outcomes inevitably confounded by extra-embryonic factors, both limiting clinical utility. To bridge this gap, we propose a new task called Video-Based Embryo Grading, the first paradigm that directly utilizes full-length time-lapse monitoring (TLM) videos to predict embryologists' overall quality assessments. To support this task, we curate a real-world clinical dataset comprising over 2,500 TLM videos, each annotated with a grading label indicating the overall quality of embryos. Grounded in clinical decision-making principles, we propose a Complementary Spatial-Temporal Pattern Mining (CoSTeM) framework that conceptually replicates embryologists' evaluation process. CoSTeM comprises two branches: (1) a morphological branch using a Mixture of Cross-Attentive Experts layer and a Temporal Selection Block to select discriminative local structural features, and (2) a morphokinetic branch employing a Temporal Transformer to model global developmental trajectories, synergistically integrating static and dynamic determinants for grading embryos. Extensive experimental results demonstrate the superiority of our design. This work provides a valuable methodological framework for AI-assisted embryo selection. The dataset and source code will be publicly available upon acceptance.

Title: Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations

Authors: Igor Meleshin, Anna Chistyakova, Anastasia Antsiferova, Dmitriy Vatolin
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.04951
Pdf URL: https://arxiv.org/pdf/2506.04951
Copy Paste: [[2506.04951]] Robustness as Architecture: Designing IQA Models to Withstand Adversarial Perturbations(https://arxiv.org/abs/2506.04951)
Keywords: generation, quality assessment
Abstract: Image Quality Assessment (IQA) models are increasingly relied upon to evaluate image quality in real-world systems -- from compression and enhancement to generation and streaming. Yet their adoption brings a fundamental risk: these models are inherently unstable. Adversarial manipulations can easily fool them, inflating scores and undermining trust. Traditionally, such vulnerabilities are addressed through data-driven defenses -- adversarial retraining, regularization, or input purification. But what if this is the wrong lens? What if robustness in perceptual models is not something to learn but something to design? In this work, we propose a provocative idea: robustness as an architectural prior. Rather than training models to resist perturbations, we reshape their internal structure to suppress sensitivity from the ground up. We achieve this by enforcing orthogonal information flow, constraining the network to norm-preserving operations -- and further stabilizing the system through pruning and fine-tuning. The result is a robust IQA architecture that withstands adversarial attacks without requiring adversarial training or significant changes to the original model. This approach suggests a shift in perspective: from optimizing robustness through data to engineering it through design.
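Illustration (not from the paper): the appeal of norm-preserving operations can be seen on a single orthogonal map. The sketch below uses a Householder reflection, chosen here only because it is a simple orthogonal transform; it is an assumption for illustration, not the architecture the authors describe. The vectors `v`, `x`, and `delta` are hypothetical.

```python
import math


def householder_apply(v, x):
    """Apply the orthogonal reflection H = I - 2 v v^T / (v^T v) to x."""
    vv = sum(a * a for a in v)
    vx = sum(a * b for a, b in zip(v, x))
    scale = 2.0 * vx / vv
    return [b - scale * a for a, b in zip(v, x)]


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in x))


v = [1.0, -2.0, 0.5]          # defines the reflection (hypothetical)
x = [0.3, 0.7, -1.1]          # clean feature vector (hypothetical)
delta = [1e-2, -2e-2, 5e-3]   # small adversarial-style perturbation
x_adv = [a + d for a, d in zip(x, delta)]

y = householder_apply(v, x)
y_adv = householder_apply(v, x_adv)

# An orthogonal map neither shrinks nor amplifies the perturbation.
out_shift = norm([a - b for a, b in zip(y_adv, y)])
print("input shift :", norm(delta))
print("output shift:", out_shift)
```

Because the map is linear and orthogonal, the output shift equals the input shift up to floating-point error, so a stack of such layers cannot magnify a small input perturbation on its own.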

Title: FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Authors: Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04956
Pdf URL: https://arxiv.org/pdf/2506.04956
Copy Paste: [[2506.04956]] FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation(https://arxiv.org/abs/2506.04956)
Keywords: generation
Abstract: Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at this https URL.

Title: Physical Annotation for Automated Optical Inspection: A Concept for In-Situ, Pointer-Based Trainingdata Generation

Authors: Oliver Krumpek, Oliver Heimann, Jörg Krüger
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05026
Pdf URL: https://arxiv.org/pdf/2506.05026
Copy Paste: [[2506.05026]] Physical Annotation for Automated Optical Inspection: A Concept for In-Situ, Pointer-Based Trainingdata Generation(https://arxiv.org/abs/2506.05026)
Keywords: generation
Abstract: This paper introduces a novel physical annotation system designed to generate training data for automated optical inspection. The system uses pointer-based in-situ interaction to transfer the valuable expertise of trained inspection personnel directly into a machine learning (ML) training pipeline. Unlike conventional screen-based annotation methods, our system captures physical trajectories and contours directly on the object, providing a more intuitive and efficient way to label data. The core technology uses calibrated, tracked pointers to accurately record user input and transform these spatial interactions into standardised annotation formats that are compatible with open-source annotation software. Additionally, a simple projector-based interface projects visual guidance onto the object to assist users during the annotation process, ensuring greater accuracy and consistency. The proposed concept bridges the gap between human expertise and automated data generation, enabling non-IT experts to contribute to the ML training pipeline and preventing the loss of valuable training samples. Preliminary evaluation results confirm the feasibility of capturing detailed annotation trajectories and demonstrate that integration with CVAT streamlines the workflow for subsequent ML tasks. This paper details the system architecture, calibration procedures and interface design, and discusses its potential contribution to future ML data generation for automated optical inspection.

Title: SeedEdit 3.0: Fast and High-Quality Generative Image Editing

Authors: Peng Wang, Yichun Shi, Xiaochen Lian, Zhonghua Zhai, Xin Xia, Xuefeng Xiao, Weilin Huang, Jianchao Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05083
Pdf URL: https://arxiv.org/pdf/2506.05083
Copy Paste: [[2506.05083]] SeedEdit 3.0: Fast and High-Quality Generative Image Editing(https://arxiv.org/abs/2506.05083)
Keywords: generative
Abstract: We introduce SeedEdit 3.0, a companion to our T2I model Seedream 3.0 [22], which significantly improves over our previous version [27] in both edit instruction following and image content (e.g., ID/IP) preservation on real image inputs. In addition to the model upgrade with T2I, in this report we present several key improvements. First, we develop an enhanced data curation pipeline with a meta-info paradigm and meta-info embedding strategy that help mix images from multiple data sources. This allows us to scale editing data effectively, and meta information is helpful for connecting the VLM with the diffusion model more closely. Second, we introduce a joint learning pipeline for computing a diffusion loss and a reward loss. Finally, we evaluate SeedEdit 3.0 on our testing benchmarks for real image editing, where it achieves the best trade-off between multiple aspects, yielding a high usability rate of 56.1%, compared to SeedEdit 1.6 (38.4%), GPT4o (37.1%) and Gemini 2.0 (30.3%).

Title: Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers

Authors: Haosong Liu, Yuge Cheng, Zihan Liu, Aiyue Chen, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, Minyi Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05096
Pdf URL: https://arxiv.org/pdf/2506.05096
Copy Paste: [[2506.05096]] Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers(https://arxiv.org/abs/2506.05096)
Keywords: generation
Abstract: Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment. While existing acceleration methods reduce workload at various granularities, they often rely on heuristics, limiting their applicability. We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation. At its core, ASTRAEA proposes a lightweight token selection mechanism and a memory-efficient, GPU-parallel sparse attention strategy, enabling linear reductions in execution time with minimal impact on generation quality. To determine optimal token reduction for different timesteps, we further design a search framework that leverages a classic evolutionary algorithm to automatically determine the distribution of the token budget effectively. Together, ASTRAEA achieves up to 2.4x inference speedup on a single GPU with great scalability (up to 13.2x speedup on 8 GPUs) while retaining better video quality compared to the state-of-the-art methods (<0.5% loss on the VBench score compared to the baseline vDiT models).

Title: Privacy Amplification Through Synthetic Data: Insights from Linear Regression

Authors: Clément Pierquin, Aurélien Bellet, Marc Tommasi, Matthieu Boussard
Subjects: cs.LG, cs.CR, stat.ML
Abstract URL: https://arxiv.org/abs/2506.05101
Pdf URL: https://arxiv.org/pdf/2506.05101
Copy Paste: [[2506.05101]] Privacy Amplification Through Synthetic Data: Insights from Linear Regression(https://arxiv.org/abs/2506.05101)
Keywords: generative
Abstract: Synthetic data inherits the differential privacy guarantees of the model used to generate it. Additionally, synthetic data may benefit from privacy amplification when the generative model is kept hidden. While empirical studies suggest this phenomenon, a rigorous theoretical understanding is still lacking. In this paper, we investigate this question through the well-understood framework of linear regression. First, we establish negative results showing that if an adversary controls the seed of the generative model, a single synthetic data point can leak as much information as releasing the model itself. Conversely, we show that when synthetic data is generated from random inputs, releasing a limited number of synthetic data points amplifies privacy beyond the model's inherent guarantees. We believe our findings in linear regression can serve as a foundation for deriving more general bounds in the future.

Title: DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models

Authors: Revant Teotia, Candace Ross, Karen Ullrich, Sumit Chopra, Adriana Romero-Soriano, Melissa Hall, Matthew J. Muckley
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05108
Pdf URL: https://arxiv.org/pdf/2506.05108
Copy Paste: [[2506.05108]] DIMCIM: A Quantitative Evaluation Framework for Default-mode Diversity and Generalization in Text-to-Image Generative Models(https://arxiv.org/abs/2506.05108)
Keywords: generative
Abstract: Recent advances in text-to-image (T2I) models have achieved impressive quality and consistency. However, this has come at the cost of representation diversity. While automatic evaluation methods exist for benchmarking model diversity, they either require reference image datasets or lack specificity about the kind of diversity measured, limiting their adaptability and interpretability. To address this gap, we introduce the Does-it/Can-it framework, DIM-CIM, a reference-free measurement of default-mode diversity ("Does" the model generate images with expected attributes?) and generalization capacity ("Can" the model generate diverse attributes for a particular concept?). We construct the COCO-DIMCIM benchmark, which is seeded with COCO concepts and captions and augmented by a large language model. With COCO-DIMCIM, we find that widely-used models improve in generalization at the cost of default-mode diversity when scaling from 1.5B to 8.1B parameters. DIMCIM also identifies fine-grained failure cases, such as attributes that are generated with generic prompts but are rarely generated when explicitly requested. Finally, we use DIMCIM to evaluate the training data of a T2I model and observe a correlation of 0.85 between diversity in training images and default-mode diversity. Our work provides a flexible and interpretable framework for assessing T2I model diversity and generalization, enabling a more comprehensive understanding of model performance.

Title: Practical Manipulation Model for Robust Deepfake Detection

Authors: Benedikt Hopf, Radu Timofte
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05119
Pdf URL: https://arxiv.org/pdf/2506.05119
Copy Paste: [[2506.05119]] Practical Manipulation Model for Robust Deepfake Detection(https://arxiv.org/abs/2506.05119)
Keywords: super-resolution
Abstract: Modern deepfake detection models have achieved strong performance even on the challenging cross-dataset task. However, detection performance under non-ideal conditions remains very unstable, limiting success on some benchmark datasets and making it easy to circumvent detection. Inspired by the move to a more real-world degradation model in the area of image super-resolution, we have developed a Practical Manipulation Model (PMM) that covers a larger set of possible forgeries. We extend the space of pseudo-fakes by using Poisson blending, more diverse masks, generator artifacts, and distractors. Additionally, we improve the detectors' generality and robustness by adding strong degradations to the training images. We demonstrate that these changes not only significantly enhance the model's robustness to common image degradations but also improve performance on standard benchmark datasets. Specifically, we show clear increases of $3.51\%$ and $6.21\%$ AUC on the DFDC and DFDCP datasets, respectively, over the state-of-the-art LAA backbone. Furthermore, we highlight the lack of robustness in previous detectors and our improvements in this regard. Code can be found at this https URL.

Title: Associative Memory and Generative Diffusion in the Zero-noise Limit

Authors: Joshua Hess, Quaid Morris
Subjects: cs.LG, cond-mat.dis-nn, math.DS, nlin.AO, q-bio.NC
Abstract URL: https://arxiv.org/abs/2506.05178
Pdf URL: https://arxiv.org/pdf/2506.05178
Copy Paste: [[2506.05178]] Associative Memory and Generative Diffusion in the Zero-noise Limit(https://arxiv.org/abs/2506.05178)
Keywords: generation, generative
Abstract: Connections between generative diffusion and continuous-state associative memory models are studied. Morse-Smale dynamical systems are emphasized as universal approximators of gradient-based associative memory models and diffusion models as white-noise perturbed systems thereof. Universal properties of associative memory that follow from this description are described and used to characterize a generic transition from generation to memory as noise levels diminish. Structural stability inherited by Morse-Smale flows is shown to imply a notion of stability for diffusions at vanishing noise levels. Applied to one- and two-parameter families of gradients, this indicates stability at all but isolated points of associative memory learning landscapes and the learning and generation landscapes of diffusion models with gradient drift in the zero-noise limit, at which small sets of generic bifurcations characterize qualitative transitions between stable systems. Examples illustrating the characterization of these landscapes by sequences of these bifurcations are given, along with structural stability criteria for classic and modern Hopfield networks (equivalently, the attention mechanism).
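Illustration (not from the paper): the abstract's remark that modern Hopfield networks are equivalent to the attention mechanism can be made concrete with the standard retrieval update from the modern Hopfield literature, x_new = sum_k softmax(beta * <p_k, x>)_k * p_k, which is a single attention step with the stored patterns as keys and values. The patterns, the noisy query, and the inverse temperature `beta` below are hypothetical choices for this sketch; the paper's bifurcation analysis is not reproduced.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(a - m) for a in z]
    s = sum(e)
    return [a / s for a in e]


def hopfield_update(patterns, query, beta):
    """One modern-Hopfield retrieval step, i.e. one attention read-out.

    Scores are scaled dot products between the query and each stored
    pattern; the output is the softmax-weighted average of the patterns.
    """
    scores = [beta * sum(p_i * q_i for p_i, q_i in zip(p, query))
              for p in patterns]
    weights = softmax(scores)
    dim = len(query)
    return [sum(weights[k] * patterns[k][i] for k in range(len(patterns)))
            for i in range(dim)]


patterns = [
    [1.0, 1.0, -1.0, -1.0],
    [-1.0, 1.0, 1.0, -1.0],
    [1.0, -1.0, 1.0, 1.0],
]
noisy_query = [0.9, 0.8, -1.0, -0.4]   # corrupted copy of the first pattern

retrieved = hopfield_update(patterns, noisy_query, beta=8.0)
print([round(r, 6) for r in retrieved])
```

With a large `beta` the softmax weights concentrate on the closest stored pattern, so the update behaves like memory retrieval; smaller values blend patterns together.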

Title: OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View

Authors: Yanbo Wang, Ziyi Wang, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05204
Pdf URL: https://arxiv.org/pdf/2506.05204
Copy Paste: [[2506.05204]] OGGSplat: Open Gaussian Growing for Generalizable Reconstruction with Expanded Field-of-View(https://arxiv.org/abs/2506.05204)
Keywords: generative
Abstract: Reconstructing semantic-aware 3D scenes from sparse views is a challenging yet essential research direction, driven by the demands of emerging applications such as virtual reality and embodied AI. Existing per-scene optimization methods require dense input views and incur high computational costs, while generalizable approaches often struggle to reconstruct regions outside the input view cone. In this paper, we propose OGGSplat, an open Gaussian growing method that expands the field-of-view in generalizable 3D reconstruction. Our key insight is that the semantic attributes of open Gaussians provide strong priors for image extrapolation, enabling both semantic consistency and visual plausibility. Specifically, once open Gaussians are initialized from sparse views, we introduce an RGB-semantic consistent inpainting module applied to selected rendered views. This module enforces bidirectional control between an image diffusion model and a semantic diffusion model. The inpainted regions are then lifted back into 3D space for efficient and progressive Gaussian parameter optimization. To evaluate our method, we establish a Gaussian Outpainting (GO) benchmark that assesses both semantic and generative quality of reconstructed open-vocabulary scenes. OGGSplat also demonstrates promising semantic-aware scene reconstruction capabilities when provided with two view images captured directly from a smartphone camera.

Title: Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning

Authors: Yue Ma, Yulong Liu, Qiyuan Zhu, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li, Sirui Han, Chenyang Qi, Qifeng Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05207
Pdf URL: https://arxiv.org/pdf/2506.05207
Copy Paste: [[2506.05207]] Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning(https://arxiv.org/abs/2506.05207)
Keywords: generation
Abstract: Recently, breakthroughs in the video diffusion transformer have shown remarkable capabilities in diverse motion generation. For the motion-transfer task, current methods mainly use two-stage Low-Rank Adaptation (LoRA) finetuning to obtain better performance. However, existing adaptation-based motion transfer still suffers from motion inconsistency and tuning inefficiency when applied to large video diffusion transformers. Naive two-stage LoRA tuning struggles to maintain motion consistency between generated and input videos due to the inherent spatial-temporal coupling in the 3D attention operator. Additionally, it requires time-consuming fine-tuning processes in both stages. To tackle these issues, we propose Follow-Your-Motion, an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. Specifically, we propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed. To address the lack of a benchmark for this field, we introduce MotionBench, a comprehensive benchmark comprising diverse motion, including creative camera motion, single object motion, multiple object motion, and complex human motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.

Title: Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation

Authors: Jan Ackermann, Kiyohiro Nakayama, Guandao Yang, Tong Wu, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05210
Pdf URL: https://arxiv.org/pdf/2506.05210
Copy Paste: [[2506.05210]] Towards Vision-Language-Garment Models For Web Knowledge Garment Understanding and Generation(https://arxiv.org/abs/2506.05210)
Keywords: generation
Abstract: Multimodal foundation models have demonstrated strong generalization, yet their ability to transfer knowledge to specialized domains such as garment generation remains underexplored. We introduce VLG, a vision-language-garment model that synthesizes garments from textual descriptions and visual imagery. Our experiments assess VLG's zero-shot generalization, investigating its ability to transfer web-scale reasoning to unseen garment styles and prompts. Preliminary results indicate promising transfer capabilities, highlighting the potential for multimodal foundation models to adapt effectively to specialized domains like fashion design.

Title: DSG-World: Learning a 3D Gaussian World Model from Dual State Videos

Authors: Wenhao Hu, Xuexiang Wen, Xi Li, Gaoang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05217
Pdf URL: https://arxiv.org/pdf/2506.05217
Copy Paste: [[2506.05217]] DSG-World: Learning a 3D Gaussian World Model from Dual State Videos(https://arxiv.org/abs/2506.05217)
Keywords: generative
Abstract: Building an efficient and physically consistent world model from limited observations is a long-standing challenge in vision and robotics. Many existing world modeling pipelines are based on implicit generative models, which are hard to train and often lack 3D or physical consistency. On the other hand, explicit 3D methods built from a single state often require multi-stage processing, such as segmentation, background completion, and inpainting, due to occlusions. To address this, we leverage two perturbed observations of the same scene under different object configurations. These dual states offer complementary visibility, alleviating occlusion issues during state transitions and enabling more stable and complete reconstruction. In this paper, we present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian World model from Dual State observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency. We further introduce a pseudo intermediate state for symmetric alignment and design collaborative co-pruning strategies to refine geometric completeness. DSG-World enables efficient real-to-simulation transfer purely in the explicit Gaussian representation space, supporting high-fidelity rendering and object-level scene manipulation without relying on dense observations or multi-stage pipelines. Extensive experiments demonstrate strong generalization to novel views and scene states, highlighting the effectiveness of our approach for real-world 3D reconstruction and simulation.

Title: Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit

Authors: Valérie Costa, Thomas Fel, Ekdeep Singh Lubana, Bahareh Tolooshams, Demba Ba
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05239
Pdf URL: https://arxiv.org/pdf/2506.05239
Copy Paste: [[2506.05239]] Evaluating Sparse Autoencoders: From Shallow Design to Matching Pursuit(https://arxiv.org/abs/2506.05239)
Keywords: generation
Abstract: Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. This paper evaluates SAEs in a controlled setting using MNIST, which reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption that limits the ability to extract correlated features. To move beyond this, we introduce a multi-iteration SAE by unrolling Matching Pursuit (MP-SAE), enabling the residual-guided extraction of correlated features that arise in hierarchical settings such as handwritten digit generation while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
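
The residual-guided selection and the monotonic-improvement guarantee mentioned in the abstract can be illustrated with classical matching pursuit. The following is a minimal sketch of that textbook algorithm, not the authors' MP-SAE: the function names, the fixed two-atom dictionary, and the input vector are all hypothetical, and the atoms are assumed to have unit norm.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(x, atoms, n_steps):
    """Greedy sparse coding of x over unit-norm atoms (classical matching pursuit).

    Returns (codes, norms), where norms[k] is the residual norm after k selections.
    """
    residual = list(x)
    codes = [0.0] * len(atoms)
    norms = [math.sqrt(dot(residual, residual))]
    for _ in range(n_steps):
        # Select the atom most correlated with the current residual.
        corrs = [dot(residual, a) for a in atoms]
        j = max(range(len(atoms)), key=lambda i: abs(corrs[i]))
        # Accumulate its coefficient and subtract its contribution from the residual.
        codes[j] += corrs[j]
        residual = [r - corrs[j] * a for r, a in zip(residual, atoms[j])]
        norms.append(math.sqrt(dot(residual, residual)))
    return codes, norms

# Toy dictionary (assumption): two correlated, non-orthogonal unit-norm atoms.
s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [s, s]]
codes, norms = matching_pursuit([2.0, 1.0], atoms, n_steps=4)
```

For unit-norm atoms each step reduces the squared residual norm by the square of the selected correlation, so `norms` never increases, even though the two atoms here are far from orthogonal.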

Title: Aligning Latent Spaces with Flow Priors

Authors: Yizhuo Li, Yuying Ge, Yixiao Ge, Ying Shan, Ping Luo
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.05240
Pdf URL: https://arxiv.org/pdf/2506.05240
Copy Paste: [[2506.05240]] Aligning Latent Spaces with Flow Priors(https://arxiv.org/abs/2506.05240)
Keywords: generation, generative
Abstract: This paper presents a novel framework for aligning learnable latent spaces to arbitrary target distributions by leveraging flow-based generative models as priors. Our method first pretrains a flow model on the target features to capture the underlying distribution. This fixed flow model subsequently regularizes the latent space via an alignment loss, which reformulates the flow matching objective to treat the latents as optimization targets. We formally prove that minimizing this alignment loss establishes a computationally tractable surrogate objective for maximizing a variational lower bound on the log-likelihood of latents under the target distribution. Notably, the proposed method eliminates computationally expensive likelihood evaluations and avoids ODE solving during optimization. As a proof of concept, we demonstrate in a controlled setting that the alignment loss landscape closely approximates the negative log-likelihood of the target distribution. We further validate the effectiveness of our approach through large-scale image generation experiments on ImageNet with diverse target distributions, accompanied by detailed discussions and ablation studies. With both theoretical and empirical validation, our framework paves a new way for latent space alignment.
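
The idea of reusing a flow matching objective with the latents as optimization targets can be sketched on a toy problem. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes the linear interpolation `x_t = (1 - t) * noise + t * z`, a hypothetical point-mass target `MU` for which the exact velocity field is known in closed form, and made-up names (`toy_velocity`, `alignment_loss`).

```python
import random

MU = [1.0, -2.0]  # hypothetical toy target: all probability mass at MU

def toy_velocity(x, t):
    # Exact flow-matching velocity for a point-mass target at MU
    # under the path x_t = (1 - t) * noise + t * data.
    return [(m - xi) / (1.0 - t) for m, xi in zip(MU, x)]

def alignment_loss(z, velocity, n_samples=256, seed=0):
    """Monte Carlo estimate of a flow-matching loss evaluated at latent z.

    The velocity field is held fixed; z plays the role of the data sample.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(0.0, 0.9)  # avoid t = 1, where the toy field is singular
        noise = [rng.gauss(0.0, 1.0) for _ in z]
        x_t = [(1.0 - t) * n + t * zi for n, zi in zip(noise, z)]
        target = [zi - n for zi, n in zip(z, noise)]  # velocity of the straight path
        pred = velocity(x_t, t)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
    return total / n_samples

loss_on_target = alignment_loss(MU, toy_velocity)
loss_off_target = alignment_loss([0.0, 0.0], toy_velocity)
```

In this toy setting the loss vanishes (up to rounding) when the latent coincides with the target and grows with the squared distance from it, and evaluating it requires neither a likelihood computation nor an ODE solve.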

Title: Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning

Authors: Dravyansh Sharma, Alec Sun
Subjects: cs.LG, cs.GT, cs.MA
Abstract URL: https://arxiv.org/abs/2506.05252
Pdf URL: https://arxiv.org/pdf/2506.05252
Copy Paste: [[2506.05252]] Conservative classifiers do consistently well with improving agents: characterizing statistical and online learning(https://arxiv.org/abs/2506.05252)
Keywords: generative
Abstract: Machine learning is now ubiquitous in societal decision-making, for example in evaluating job candidates or loan applications, and it is increasingly important to take into account how classified agents will react to the learning algorithms. The majority of recent literature on strategic classification has focused on reducing and countering deceptive behaviors by the classified agents, but recent work of Attias et al. identifies surprising properties of learnability when the agents genuinely improve in order to attain the desirable classification, such as smaller generalization error than standard PAC-learning. In this paper we characterize so-called learnability with improvements across multiple new axes. We introduce an asymmetric variant of minimally consistent concept classes and use it to provide an exact characterization of proper learning with improvements in the realizable setting. While prior work studies learnability only under general, arbitrary agent improvement regions, we give positive results for more natural Euclidean ball improvement sets. In particular, we characterize improper learning under a mild generative assumption on the data distribution. We further show how to learn in more challenging settings, achieving lower generalization error under well-studied bounded noise models and obtaining mistake bounds in realizable and agnostic online learning. We resolve open questions posed by Attias et al. for both proper and improper learning.

Title: From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos

Authors: Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05274
Pdf URL: https://arxiv.org/pdf/2506.05274
Copy Paste: [[2506.05274]] From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos(https://arxiv.org/abs/2506.05274)
Keywords: generation
Abstract: Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks focusing on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.

Title: How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control

Authors: Hao Yu, Chu Xin Cheng, Runlong Yu, Yuyang Ye, Shiwei Tong, Zhaofeng Liu, Defu Lian
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05276
Pdf URL: https://arxiv.org/pdf/2506.05276
Copy Paste: [[2506.05276]] How to Unlock Time Series Editing? Diffusion-Driven Approach with Multi-Grained Control(https://arxiv.org/abs/2506.05276)
Keywords: generation, generative
Abstract: Recent advances in time series generation have shown promise, yet controlling properties in generated sequences remains challenging. Time Series Editing (TSE) - making precise modifications while preserving temporal coherence - must consider both point-level constraints and segment-level controls that current methods struggle to provide. We introduce the CocktailEdit framework to enable simultaneous, flexible control across different types of constraints. This framework combines two key mechanisms: a confidence-weighted anchor control for point-wise constraints and a classifier-based control for managing statistical properties such as sums and averages over segments. Our methods achieve precise local control during the denoising inference stage while maintaining temporal coherence and integrating seamlessly with any conditionally trained diffusion-based time series model. Extensive experiments across diverse datasets and models demonstrate its effectiveness. Our work bridges the gap between pure generative modeling and real-world time series editing needs, offering a flexible solution for human-in-the-loop time series generation and editing. The code and demo are provided for validation.

Title: Rectified Point Flow: Generic Point Cloud Pose Estimation

Authors: Tao Sun, Liyuan Zhu, Shengyu Huang, Shuran Song, Iro Armeni
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.05282
Pdf URL: https://arxiv.org/pdf/2506.05282
Copy Paste: [[2506.05282]] Rectified Point Flow: Generic Point Cloud Pose Estimation(https://arxiv.org/abs/2506.05282)
Keywords: generative
Abstract: We introduce Rectified Point Flow, a unified parameterization that formulates pairwise point cloud registration and multi-part shape assembly as a single conditional generative problem. Given unposed point clouds, our method learns a continuous point-wise velocity field that transports noisy points toward their target positions, from which part poses are recovered. In contrast to prior work that regresses part-wise poses with ad-hoc symmetry handling, our method intrinsically learns assembly symmetries without symmetry labels. Together with a self-supervised encoder focused on overlapping points, our method achieves a new state-of-the-art performance on six benchmarks spanning pairwise registration and shape assembly. Notably, our unified formulation enables effective joint training on diverse datasets, facilitating the learning of shared geometric priors and consequently boosting accuracy. Project page: this https URL.

Title: Video World Models with Long-term Spatial Memory

Authors: Tong Wu, Shuai Yang, Ryan Po, Yinghao Xu, Ziwei Liu, Dahua Lin, Gordon Wetzstein
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05284
Pdf URL: https://arxiv.org/pdf/2506.05284
Copy Paste: [[2506.05284]] Video World Models with Long-term Spatial Memory(https://arxiv.org/abs/2506.05284)
Keywords: generation
Abstract: Emerging world models autoregressively generate video frames in response to actions, such as camera movements and text prompts, among other control signals. Due to limited temporal context window sizes, these models often struggle to maintain scene consistency during revisits, leading to severe forgetting of previously generated environments. Inspired by the mechanisms of human memory, we introduce a novel framework for enhancing long-term consistency of video world models through a geometry-grounded long-term spatial memory. Our framework includes mechanisms to store and retrieve information from the long-term spatial memory and we curate custom datasets to train and evaluate world models with explicitly stored 3D memory mechanisms. Our evaluations show improved quality, consistency, and context length compared to relevant baselines, paving the way towards long-term consistent world generation.

Title: AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

Authors: Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05289
Pdf URL: https://arxiv.org/pdf/2506.05289
Copy Paste: [[2506.05289]] AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model(https://arxiv.org/abs/2506.05289)
Keywords: generation
Abstract: Autoregressive image generation aims to predict the next token based on previous ones. However, existing image tokenizers encode tokens with bidirectional dependencies during the compression process, which hinders effective modeling by autoregressive models. In this paper, we propose a novel Aligned Tokenizer (AliTok), which utilizes a causal decoder to establish unidirectional dependencies among encoded tokens, thereby aligning the token modeling approach between the tokenizer and autoregressive model. Furthermore, by incorporating prefix tokens and employing two-stage tokenizer training to enhance reconstruction consistency, AliTok achieves great reconstruction performance while being generation-friendly. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator with only 177M parameters, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed. The code and weights are available at this https URL.

Title: Power Law Guided Dynamic Sifting for Efficient Attention

Authors: Nirav Koley, Prajwal Singhania, Abhinav Bhatele
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05300
Pdf URL: https://arxiv.org/pdf/2506.05300
Copy Paste: [[2506.05300]] Power Law Guided Dynamic Sifting for Efficient Attention(https://arxiv.org/abs/2506.05300)
Keywords: generation
Abstract: Efficient inference on GPUs using large language models remains challenging due to memory bandwidth limitations, particularly during data transfers between High Bandwidth Memory (HBM) and SRAM in attention computations. Approximate attention methods address this issue by reducing computational and memory overhead but often rely on expensive top-$k$ operations, which perform poorly on GPUs. We propose SiftAttention, a novel approximate attention method that replaces the top-$k$ step with a computationally efficient element-wise filtering operation based on a threshold value. Our intuition for doing this is based on our empirical observation that the $\tau$-th quantile of attention scores follows a predictable power-law over sequential generation steps. Exploiting this insight, our approach dynamically estimates a threshold value per prompt at each generation step. Only attention scores above this threshold and their corresponding value vectors are loaded/used to compute the attention output, reducing data movement between HBM and SRAM. Our evaluation demonstrates that SiftAttention preserves model quality better than existing approximate attention methods while reducing memory bandwidth usage when loading value vectors.
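
The contrast between a top-$k$ selection and an element-wise threshold filter can be sketched as follows. This is a hedged illustration rather than the SiftAttention kernel: it assumes the threshold is applied to post-softmax scores, that the surviving scores are not renormalized, and that the threshold follows a power law `a * step ** (-b)` in the generation step. The function names and the constants `a` and `b` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def power_law_threshold(step, a, b):
    # Hypothetical power-law fit of the score quantile over generation steps.
    return a * step ** (-b)

def sifted_attention(scores, values, threshold):
    """Attention output using only keys whose probability exceeds the threshold.

    Value vectors of filtered-out keys are never read, which stands in for the
    reduced data movement described in the abstract.
    """
    probs = softmax(scores)
    kept = [i for i, p in enumerate(probs) if p > threshold]  # element-wise filter, no sort
    out = [0.0] * len(values[0])
    for i in kept:
        for d, v in enumerate(values[i]):
            out[d] += probs[i] * v
    return out, kept

# Toy single-query example with made-up scores and value vectors.
scores = [4.0, 0.0, 3.0, -2.0]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
threshold = power_law_threshold(step=4, a=0.2, b=0.5)
out, kept = sifted_attention(scores, values, threshold)
```

The filter is a single comparison per score, so it needs no sorting or partial selection across the sequence, which is the property that makes it attractive relative to top-$k$ on GPUs.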

Title: SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Authors: Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05301
Pdf URL: https://arxiv.org/pdf/2506.05301
Copy Paste: [[2506.05301]] SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training(https://arxiv.org/abs/2506.05301)
Keywords: restoration
Abstract: Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet incur a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency observed under high-resolution VR using window attention with a predefined window size. To stabilize and improve the adversarial post-training towards VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 can achieve comparable or even better performance compared with existing VR approaches in a single step.

Title: Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Authors: Weifeng Lin, Xinyu Wei, Ruichuan An, Tianhe Ren, Tingwei Chen, Renrui Zhang, Ziyu Guo, Wentao Zhang, Lei Zhang, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05302
Pdf URL: https://arxiv.org/pdf/2506.05302
Copy Paste: [[2506.05302]] Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos(https://arxiv.org/abs/2506.05302)
Keywords: generation
Abstract: We present Perceive Anything Model (PAM), a conceptually straightforward and efficient framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation with the generation of diverse, region-specific semantic outputs, including categories, label definition, functional explanations, and detailed captions. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features, which inherently carry general vision, localization, and semantic priors, into multi-modal tokens for LLM comprehension. To support robust multi-granularity understanding, we also develop a dedicated data refinement and augmentation pipeline, yielding a high-quality dataset of 1.5M image and 0.6M video region-semantic annotations, including novel region-level streaming video caption data. PAM is designed to be lightweight and efficient, while also demonstrating strong performance across a diverse range of region understanding tasks. It runs 1.2-2.4x faster and consumes less GPU memory than prior approaches, offering a practical solution for real-world applications. We believe that our effective approach will serve as a strong baseline for future research in region-level visual understanding.

Title: Learning normalized image densities via dual score matching

Authors: Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05310
Pdf URL: https://arxiv.org/pdf/2506.05310
Copy Paste: [[2506.05310]] Learning normalized image densities via dual score matching(https://arxiv.org/abs/2506.05310)
Keywords: generative
Abstract: Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning \emph{normalized} energy (log probability) models that is inspired by diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model \emph{strongly generalizes}: estimated log probabilities are nearly independent of the specific images in the training set. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary significantly with image content, in contrast with traditional assumptions such as concentration of measure or support on a low-dimensional manifold.

Title: LSM-2: Learning from Incomplete Wearable Sensor Data

Authors: Maxwell A. Xu, Girish Narayanswamy, Kumar Ayush, Dimitris Spathis, Shun Liao, Shyam A. Tailor, Ahmed Metwally, A. Ali Heydari, Yuwei Zhang, Jake Garrison, Samy Abdel-Ghaffar, Xuhai Xu, Ken Gu, Jacob Sunshine, Ming-Zher Poh, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Mark Malhotra, Shwetak Patel, Yuzhe Yang, James M. Rehg, Xin Liu, Daniel McDuff
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.05321
Pdf URL: https://arxiv.org/pdf/2506.05321
Copy Paste: [[2506.05321]] LSM-2: Learning from Incomplete Wearable Sensor Data(https://arxiv.org/abs/2506.05321)
Keywords: generation, generative
Abstract: Foundation models, a cornerstone of recent advancements in machine learning, have predominantly thrived on complete and well-structured data. Wearable sensor data frequently suffers from significant missingness, posing a substantial challenge for self-supervised learning (SSL) models that typically assume complete data inputs. This paper introduces the second generation of Large Sensor Model (LSM-2) with Adaptive and Inherited Masking (AIM), a novel SSL approach that learns robust representations directly from incomplete data without requiring explicit imputation. AIM's core novelty lies in its use of learnable mask tokens to model both existing ("inherited") and artificially introduced missingness, enabling it to robustly handle fragmented real-world data during inference. Pre-trained on an extensive dataset of 40M hours of day-long multimodal sensor data, our LSM-2 with AIM achieves the best performance across a diverse range of tasks, including classification, regression and generative modeling. Furthermore, LSM-2 with AIM exhibits superior scaling performance, and critically, maintains high performance even under targeted missingness scenarios, reflecting clinically coherent patterns, such as the diagnostic value of nighttime biosignals for hypertension prediction. This makes AIM a more reliable choice for real-world wearable data applications.
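The masking idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `artificial_ratio` parameter, and the choice to score reconstruction only on artificially hidden positions (the only ones with ground truth) are assumptions made for illustration.

```python
import numpy as np

def build_aim_mask(inherited_missing, artificial_ratio, rng):
    """Combine inherited and artificial missingness (hypothetical helper).

    inherited_missing: bool array, True where the sensor reading is already missing.
    artificial_ratio: fraction of observed positions to hide on purpose (assumed knob).
    """
    observed = ~inherited_missing
    # Hide a random subset of the positions that were actually observed.
    artificial = observed & (rng.random(inherited_missing.shape) < artificial_ratio)
    # The encoder would see a mask token at every position in the union.
    combined = inherited_missing | artificial
    return combined, artificial

def masked_reconstruction_loss(pred, target, artificial):
    """Mean squared error restricted to artificially masked positions."""
    diff = (pred - target)[artificial]
    return float(np.mean(diff ** 2)) if diff.size else 0.0

rng = np.random.default_rng(0)
inherited = np.array([[True, False, False, True],
                      [False, False, True, False]])
combined, artificial = build_aim_mask(inherited, 0.5, rng)
print(combined.sum(), artificial.sum())
```

Inherited gaps stay masked at both training and inference time in this sketch, while the artificial mask would be used only during training.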

Title: MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning

Authors: Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05331
Pdf URL: https://arxiv.org/pdf/2506.05331
Copy Paste: [[2506.05331]] MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning(https://arxiv.org/abs/2506.05331)
Keywords: generation
Abstract: Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but extending it to multimodal domains remains challenging. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shape within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which yields our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at this https URL

Title: Kinetics: Rethinking Test-Time Scaling Laws

Authors: Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05333
Pdf URL: https://arxiv.org/pdf/2506.05333
Copy Paste: [[2506.05333]] Kinetics: Rethinking Test-Time Scaling Laws(https://arxiv.org/abs/2506.05333)
Keywords: generation
Abstract: We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-$N$, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than on smaller ones. A key reason is that in test-time scaling (TTS), attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve through increased generation. The code is available at this https URL.

Title: Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning

Authors: Xingjian Ran, Yixuan Li, Linning Xu, Mulin Yu, Bo Dai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.05341
Pdf URL: https://arxiv.org/pdf/2506.05341
Copy Paste: [[2506.05341]] Direct Numerical Layout Generation for 3D Indoor Scene Synthesis via Spatial Reasoning(https://arxiv.org/abs/2506.05341)
Keywords: generation, generative
Abstract: Realistic 3D indoor scene synthesis is vital for embodied AI and digital content creation. It can be naturally divided into two subtasks: object generation and layout generation. While recent generative models have significantly advanced object-level quality and controllability, layout generation remains challenging due to limited datasets. Existing methods either overfit to these datasets or rely on predefined constraints to optimize numerical layouts, which sacrifices flexibility. As a result, they fail to generate scenes that are both open-vocabulary and aligned with fine-grained user instructions. We introduce DirectLayout, a framework that directly generates numerical 3D layouts from text descriptions using the generalizable spatial reasoning of large language models (LLMs). DirectLayout decomposes the generation into three stages: producing a Bird's-Eye View (BEV) layout, lifting it into 3D space, and refining object placements. To enable explicit spatial reasoning and help the model grasp basic principles of object placement, we employ Chain-of-Thought (CoT) Activation based on the 3D-Front dataset. Additionally, we design a CoT-Grounded Generative Layout Reward to enhance generalization and spatial planning. During inference, DirectLayout addresses asset-layout mismatches via Iterative Asset-Layout Alignment through in-context learning. Extensive experiments demonstrate that DirectLayout achieves impressive semantic consistency, generalization and physical plausibility.

Title: ContentV: Efficient Training of Video Generation Models with Limited Compute

Authors: Wenfeng Lin, Renjie Chen, Boyuan Liu, Shiyue Yan, Ruoyu Feng, Jiangchuan Wei, Yichen Zhang, Yimeng Zhou, Chao Feng, Jiao Ran, Qi Wu, Zuotao Liu, Mingyu Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05343
Pdf URL: https://arxiv.org/pdf/2506.05343
Copy Paste: [[2506.05343]] ContentV: Efficient Training of Video Generation Models with Limited Compute(https://arxiv.org/abs/2506.05343)
Keywords: generation
Abstract: Recent advances in video generation demand increasingly efficient training recipes to mitigate escalating computational costs. In this report, we present ContentV, an 8B-parameter text-to-video model that achieves state-of-the-art performance (85.14 on VBench) after training on 256 x 64GB Neural Processing Units (NPUs) for merely four weeks. ContentV generates diverse, high-quality videos across multiple resolutions and durations from text prompts, enabled by three key innovations: (1) A minimalist architecture that maximizes reuse of pre-trained image generation models for video generation; (2) A systematic multi-stage training strategy leveraging flow matching for enhanced efficiency; and (3) A cost-effective reinforcement learning with human feedback framework that improves generation quality without requiring additional human annotations. All the code and models are available at: this https URL.

Title: SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs

Authors: Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05344
Pdf URL: https://arxiv.org/pdf/2506.05344
Copy Paste: [[2506.05344]] SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs(https://arxiv.org/abs/2506.05344)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than approximately 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual information, SparseMM prioritizes and retains visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on the efficiency test. Our project is open sourced at this https URL.

Title: Inference-Time Hyper-Scaling with KV Cache Compression

Authors: Adrian Łańcucki, Konrad Staniszewski, Piotr Nawrot, Edoardo M. Ponti
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.05345
Pdf URL: https://arxiv.org/pdf/2506.05345
Copy Paste: [[2506.05345]] Inference-Time Hyper-Scaling with KV Cache Compression(https://arxiv.org/abs/2506.05345)
Keywords: generation
Abstract: Inference-time scaling trades efficiency for increased reasoning accuracy by generating longer or more parallel sequences. However, in Transformer LLMs, generation cost is bottlenecked by the size of the key-value (KV) cache, rather than the number of generated tokens. Hence, we explore inference-time hyper-scaling: by compressing the KV cache, we can generate more tokens within the same compute budget and further improve the accuracy of scaled inference. The success of this approach, however, hinges on the ability of compression methods to preserve accuracy even at high compression ratios. To make hyper-scaling practical, we introduce Dynamic Memory Sparsification (DMS), a novel method for sparsifying KV caches that only requires 1K training steps to achieve 8$\times$ compression, while maintaining better accuracy than training-free sparse attention. Instead of prematurely discarding cached tokens, DMS delays token eviction, implicitly merging representations and preserving critical information. We demonstrate the effectiveness of inference-time hyper-scaling with DMS on multiple families of LLMs, showing that it boosts accuracy for comparable inference runtime and memory load. For instance, we enhance Qwen-R1 32B by an average of 9.1 points on AIME 24, 7.6 on GPQA, and 9.6 on LiveCodeBench across compute budgets.
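A back-of-the-envelope calculation shows why compressing the KV cache allows more generated tokens, as the abstract above argues. This sketch treats the budget as a fixed amount of cache memory. The model dimensions, the fp16 storage assumption, and both function names are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Uncompressed KV cache size: one key and one value vector per layer, head and token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

def tokens_within_budget(budget_bytes, compression_ratio, **model_cfg):
    """Number of cached tokens that fit in budget_bytes at a given compression ratio."""
    per_token = kv_cache_bytes(1, **model_cfg) / compression_ratio
    return int(budget_bytes // per_token)

# Hypothetical configuration, chosen only to make the arithmetic concrete.
cfg = dict(num_layers=64, num_kv_heads=8, head_dim=128)
budget = 8 * 2**30  # 8 GiB reserved for the cache (assumed)

print(kv_cache_bytes(1, **cfg))               # 262144 bytes (256 KiB) per token
print(tokens_within_budget(budget, 1, **cfg)) # 32768 tokens without compression
print(tokens_within_budget(budget, 8, **cfg)) # 262144 tokens at 8x compression
```

Under these assumptions an 8x compression ratio lets the same cache hold eight times as many tokens, which can be spent on longer or more parallel reasoning sequences.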

Title: Contrastive Flow Matching

Authors: George Stoica, Vivek Ramanujan, Xiang Fan, Ali Farhadi, Ranjay Krishna, Judy Hoffman
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.05350
Pdf URL: https://arxiv.org/pdf/2506.05350
Copy Paste: [[2506.05350]] Contrastive Flow Matching(https://arxiv.org/abs/2506.05350)
Keywords: generation
Abstract: Unconditional flow-matching trains diffusion models to transport samples from a source distribution to a target distribution by enforcing that the flows between sample pairs are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed--flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying model architectures on both class-conditioned (ImageNet-1k) and text-to-image (CC3M) benchmarks. Notably, we find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We release our code at: this https URL.
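One way to read the objective described in the abstract above is as a standard flow-matching loss minus a weighted term that pushes each predicted flow away from the target flow of a different sample. The sketch below follows that reading. The linear interpolation path, the weight `lam`, the use of a shifted batch as negatives, and the function name are all assumptions for illustration, not the authors' released code.

```python
import numpy as np

def contrastive_fm_loss(v_pred, x0, x1, lam=0.05):
    """Flow-matching loss with a contrastive penalty (illustrative sketch).

    v_pred: (B, D) velocities predicted at x_t = (1 - t) * x0 + t * x1.
    x0:     (B, D) source (noise) samples.
    x1:     (B, D) target (data) samples.
    lam:    weight of the contrastive term (hypothetical hyperparameter).
    """
    target = x1 - x0                         # velocity of the straight-line path
    # Negatives: the target flow of the next sample in the batch (needs B > 1).
    neg_target = np.roll(target, shift=1, axis=0)
    fm = np.mean(np.sum((v_pred - target) ** 2, axis=1))
    contrast = np.mean(np.sum((v_pred - neg_target) ** 2, axis=1))
    # Minimizing this pulls v_pred toward its own target and away from the negative.
    return fm - lam * contrast

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3))
x1 = rng.normal(size=(4, 3))
print(contrastive_fm_loss(x1 - x0, x0, x1, lam=0.0))   # plain flow matching: 0.0
print(contrastive_fm_loss(x1 - x0, x0, x1, lam=0.05))  # negative: negatives differ
```

Setting `lam=0` recovers plain flow matching in this sketch, which makes the contrastive term easy to ablate.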