2025-06-27

Title: Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models

Authors: Vineet Jain, Kusha Sareen, Mohammad Pedramfar, Siamak Ravanbakhsh
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.20701
Pdf URL: https://arxiv.org/pdf/2506.20701
Copy Paste: [[2506.20701]] Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models(https://arxiv.org/abs/2506.20701)
Keywords: generation, generative
Abstract: Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. Existing steering methods suffer from inaccurate value estimation, especially at high noise levels, which biases guidance. Moreover, information from past runs is not reused to improve sample quality, resulting in inefficient use of compute. Inspired by the success of Monte Carlo Tree Search, we address these limitations by casting inference-time alignment as a search problem that reuses past computations. We introduce a tree-based approach that samples from the reward-aligned target density by propagating terminal rewards back through the diffusion chain and iteratively refining value estimates with each additional generation. Our proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant, Diffusion Tree Search (DTS$^\star$), performs a global search for high reward samples. On MNIST and CIFAR-10 class-conditional generation, DTS matches the FID of the best-performing baseline with up to $10\times$ less compute. In text-to-image generation and language completion tasks, DTS$^\star$ effectively searches for high reward samples that match best-of-N with up to $5\times$ less compute. By reusing information from previous generations, we get an anytime algorithm that turns additional compute into steadily better samples, providing a scalable approach for inference-time alignment of diffusion models.
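
The best-of-N baseline that the abstract compares against can be sketched in a few lines. This is only an illustration of that baseline, not of DTS itself: `sample_fn` and `reward_fn` are hypothetical stand-ins for one full diffusion rollout and for the reward model.

```python
import random

def best_of_n(sample_fn, reward_fn, n, rng):
    """Draw n independent samples and keep the one with the highest reward.

    Each call starts from scratch, so nothing learned in earlier rollouts is
    reused; that is the inefficiency the abstract attributes to existing methods.
    """
    candidates = [sample_fn(rng) for _ in range(n)]
    return max(candidates, key=reward_fn)

# Toy stand-ins (assumptions): a 1-D Gaussian "generator" and a reward peaked at 2.
def toy_sample(rng):
    return rng.gauss(0.0, 1.0)

def toy_reward(x):
    return -(x - 2.0) ** 2

best = best_of_n(toy_sample, toy_reward, n=64, rng=random.Random(0))
```

Cost here grows linearly with N, which is the compute axis on which the abstract reports savings.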

Title: On Convolutions, Intrinsic Dimension, and Diffusion Models

Authors: Kin Kwan Leung, Rasa Hosseinzadeh, Gabriel Loaiza-Ganem
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.20705
Pdf URL: https://arxiv.org/pdf/2506.20705
Copy Paste: [[2506.20705]] On Convolutions, Intrinsic Dimension, and Diffusion Models(https://arxiv.org/abs/2506.20705)
Keywords: generative
Abstract: The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. Diffusion models (DMs) -- which operate by convolving data with progressively larger amounts of Gaussian noise and then learning to revert this process -- have risen to prominence as the most performant generative models, and are known to be able to learn distributions with low-dimensional support. For a given datum in one of these submanifolds, we should thus intuitively expect DMs to have implicitly learned its corresponding local intrinsic dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari et al. (2024b) recently showed that this is indeed the case by linking this LID to the rate of change of the log marginal densities of the DM with respect to the amount of added noise, resulting in an LID estimator known as FLIPD. LID estimators such as FLIPD have a plethora of uses: among others, they quantify the complexity of a given datum, and can be used to detect outliers, adversarial examples and AI-generated text. FLIPD achieves state-of-the-art performance at LID estimation, yet its theoretical underpinnings are incomplete since Kamkari et al. (2024b) only proved its correctness under the highly unrealistic assumption of affine submanifolds. In this work we bridge this gap by formally proving the correctness of FLIPD under realistic assumptions. Additionally, we show that an analogous result holds when Gaussian convolutions are replaced with uniform ones, and discuss the relevance of this result.
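
The link between LID and the rate of change of the log marginal density can be checked by hand in the affine case the abstract mentions. The sketch below uses assumed notation that is not taken from the paper: $D$ is the ambient dimension, $d$ the dimension of the affine submanifold, $p$ a smooth density on it, and $\delta = \log \sigma$ the log noise scale.

```latex
% Data on a d-dimensional affine subspace of R^D with smooth density p;
% Gaussian noise with standard deviation \sigma = e^{\delta} is added.
\begin{align*}
  \varrho(x, \delta)
    &= \int p(y)\, \mathcal{N}\!\left(x;\, y,\, e^{2\delta} I_D\right) \mathrm{d}y
     \;\approx\; p(x)\, \left(2\pi e^{2\delta}\right)^{-(D-d)/2}
     \quad \text{as } \delta \to -\infty, \\
  \log \varrho(x, \delta)
    &\approx (d - D)\,\delta + \log p(x) - \tfrac{D-d}{2}\log(2\pi), \\
  \frac{\partial}{\partial \delta} \log \varrho(x, \delta)
    &\approx d - D
  \quad\Longrightarrow\quad
  d \approx D + \frac{\partial}{\partial \delta} \log \varrho(x, \delta).
\end{align*}
```

The integral runs over the subspace, so the $d$ in-subspace directions integrate out and only the $D - d$ normal directions contribute the $\sigma$-dependent factor.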

Title: Characterization and Mitigation of Training Instabilities in Microscaling Formats

Authors: Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand
Subjects: cs.LG, cs.AR
Abstract URL: https://arxiv.org/abs/2506.20752
Pdf URL: https://arxiv.org/pdf/2506.20752
Copy Paste: [[2506.20752]] Characterization and Mitigation of Training Instabilities in Microscaling Formats(https://arxiv.org/abs/2506.20752)
Keywords: generation
Abstract: Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch -- spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations -- we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits similar behavior as the language model, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through in situ intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at this https URL.
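
The idea of a shared scale within blocks can be illustrated with a deliberately simplified sketch. It is not the MX specification: the block size of 32, the power-of-two scale, and the signed-integer element grid (`qmax=127`) are assumptions chosen for readability, and all function names are hypothetical.

```python
import math

def quantize_block(block, qmax=127):
    """Quantize one block with a single shared power-of-two scale.

    Returns integer codes plus the shared exponent; value ~= code * 2**exponent.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # Smallest power-of-two scale that keeps the largest element within +/- qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    # Clamp as a guard against floating-point edge cases in log2.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return codes, exponent

def block_quantize(values, block_size=32, qmax=127):
    """Quantize then dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), block_size):
        codes, exponent = quantize_block(values[start:start + block_size], qmax)
        out.extend(code * 2.0 ** exponent for code in codes)
    return out

# One outlier sets the scale for its whole block, so small neighbours round to zero.
print(block_quantize([1000.0, 0.01], block_size=2))
```

Because every element in a block shares one scale, a single large value coarsens the grid for its neighbours. The toy call shows that effect in isolation and says nothing about the specific instability mechanism the paper studies.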

Title: Stochastic and Non-local Closure Modeling for Nonlinear Dynamical Systems via Latent Score-based Generative Models

Authors: Xinghao Dong, Huchen Yang, Jin-Long Wu
Subjects: cs.LG, math.DS, physics.comp-ph
Abstract URL: https://arxiv.org/abs/2506.20771
Pdf URL: https://arxiv.org/pdf/2506.20771
Copy Paste: [[2506.20771]] Stochastic and Non-local Closure Modeling for Nonlinear Dynamical Systems via Latent Score-based Generative Models(https://arxiv.org/abs/2506.20771)
Keywords: generative
Abstract: We propose a latent score-based generative AI framework for learning stochastic, non-local closure models and constitutive laws in nonlinear dynamical systems of computational mechanics. This work addresses a key challenge of modeling complex multiscale dynamical systems without a clear scale separation, for which numerically resolving all scales is prohibitively expensive, e.g., for engineering turbulent flows. While classical closure modeling methods leverage domain knowledge to approximate subgrid-scale phenomena, their deterministic and local assumptions can be too restrictive in regimes lacking a clear scale separation. Recent developments of diffusion-based stochastic models have shown promise in the context of closure modeling, but their prohibitive computational inference cost limits their use in many real-world applications. This work addresses this limitation by jointly training convolutional autoencoders with conditional diffusion models in the latent spaces, significantly reducing the dimensionality of the sampling process while preserving essential physical characteristics. Numerical results demonstrate that the joint training approach helps discover a proper latent space that not only guarantees small reconstruction errors but also ensures good performance of the diffusion model in the latent space. When integrated into numerical simulations, the proposed stochastic modeling framework via latent conditional diffusion models achieves significant computational acceleration while maintaining comparable predictive accuracy to standard diffusion models in physical spaces.

Title: Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models

Authors: Cansu Korkmaz, Ahmet Murat Tekalp, Zafer Dogan
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20832
Pdf URL: https://arxiv.org/pdf/2506.20832
Copy Paste: [[2506.20832]] Leveraging Vision-Language Models to Select Trustworthy Super-Resolution Samples Generated by Diffusion Models(https://arxiv.org/abs/2506.20832)
Keywords: super-resolution, generative
Abstract: Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference in both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics like PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.

Title: MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans

Authors: Shubhankar Borse, Seokeon Choi, Sunghyun Park, Jeongho Kim, Shreya Kadambi, Risheek Garrepalli, Sungrack Yun, Munawar Hayat, Fatih Porikli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20879
Pdf URL: https://arxiv.org/pdf/2506.20879
Copy Paste: [[2506.20879]] MultiHuman-Testbench: Benchmarking Image Generation for Multiple Humans(https://arxiv.org/abs/2506.20879)
Keywords: generation, generative
Abstract: Generation of images containing multiple humans, performing complex actions, while preserving their facial identities, is a significant challenge. A major factor contributing to this is the lack of a dedicated benchmark. To address this, we introduce MultiHuman-Testbench, a novel benchmark for rigorously evaluating generative models for multi-human generation. The benchmark comprises 1800 samples, including carefully curated text prompts, describing a range of simple to complex human actions. These prompts are matched with a total of 5,550 unique human face images, sampled uniformly to ensure diversity across age, ethnic background, and gender. Alongside captions, we provide human-selected pose conditioning images which accurately match the prompt. We propose a multi-faceted evaluation suite employing four key metrics to quantify face count, ID similarity, prompt alignment, and action detection. We conduct a thorough evaluation of a diverse set of models, including zero-shot approaches and training-based methods, with and without regional priors. We also propose novel techniques to incorporate image and region isolation using human segmentation and Hungarian matching, significantly improving ID similarity. Our proposed benchmark and key findings provide valuable insights and a standardized tool for advancing research in multi-human image generation.
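
Hungarian matching solves a one-to-one assignment problem, for example pairing each generated face with a reference identity so that total similarity is maximal. The sketch below solves the same problem by brute force over permutations rather than with the Hungarian algorithm itself, which is only practical for a handful of faces. The similarity values and the function name are made up for illustration.

```python
from itertools import permutations

def best_assignment(similarity):
    """Return (assignment, total) maximizing the summed similarity.

    similarity[i][j] is the score between generated face i and reference
    identity j; assignment[i] is the reference index chosen for face i.
    Brute force over all permutations: O(n!), fine only for very small n.
    """
    n = len(similarity)
    best_perm, best_total = None, float("-inf")
    for perm in permutations(range(n)):
        total = sum(similarity[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total

# Hypothetical similarity scores for three faces and three reference identities.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]
assignment, total = best_assignment(sim)
```

A greedy pass would give face 0 its top match (reference 0) and leave face 1 with a poor one. The global optimum instead pairs face 0 with reference 1 and face 1 with reference 0, which is why a global matcher is used rather than per-face argmax.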

Title: LLM-guided Chemical Process Optimization with a Multi-Agent Approach

Authors: Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, Amir Barati Farimani
Subjects: cs.LG, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2506.20921
Pdf URL: https://arxiv.org/pdf/2506.20921
Copy Paste: [[2506.20921]] LLM-guided Chemical Process Optimization with a Multi-Agent Approach(https://arxiv.org/abs/2506.20921)
Keywords: generation
Abstract: Chemical process optimization is crucial to maximize production efficiency and economic performance. Traditional methods, including gradient-based solvers, evolutionary algorithms, and parameter grid searches, become impractical when operating constraints are ill-defined or unavailable, requiring engineers to rely on subjective heuristics to estimate feasible parameter ranges. To address this constraint definition bottleneck, we present a multi-agent framework of large language model (LLM) agents that autonomously infer operating constraints from minimal process descriptions, then collaboratively guide optimization using the inferred constraints. Our AutoGen-based agentic framework employs OpenAI's o3 model, with specialized agents for constraint generation, parameter validation, simulation execution, and optimization guidance. Through two phases - autonomous constraint generation using embedded domain knowledge, followed by iterative multi-agent optimization - the framework eliminates the need for predefined operational bounds. Validated on the hydrodealkylation process across cost, yield, and yield-to-cost ratio metrics, the framework demonstrated competitive performance with conventional optimization methods while achieving better computational efficiency, requiring fewer iterations to converge. Our approach converged in under 20 minutes, achieving a 31-fold speedup over grid search. Beyond computational efficiency, the framework's reasoning-guided search demonstrates sophisticated process understanding, correctly identifying utility trade-offs, and applying domain-informed heuristics. This approach shows significant potential for optimization scenarios where operational constraints are poorly characterized or unavailable, particularly for emerging processes and retrofit applications.

Title: PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling

Authors: Hao Zhang, Haolan Xu, Chun Feng, Varun Jampani, Narendra Ahuja
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20936
Pdf URL: https://arxiv.org/pdf/2506.20936
Copy Paste: [[2506.20936]] PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling(https://arxiv.org/abs/2506.20936)
Keywords: generation
Abstract: Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the pose transfer task, highlighting its versatility for articulated object modeling.
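
For readers unfamiliar with the baseline, Linear Blend Skinning moves each vertex to a weighted average of where each bone's rigid transform would send it. The 2-D sketch below is a minimal illustration with hypothetical helper names. It shows the LBS baseline only, not PhysRig.

```python
import math

def rotation(theta):
    """2x2 rotation matrix for an angle in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s), (s, c))

def apply_bone(rot, trans, v):
    """Apply one bone's rigid transform (rotate, then translate) to vertex v."""
    x = rot[0][0] * v[0] + rot[0][1] * v[1] + trans[0]
    y = rot[1][0] * v[0] + rot[1][1] * v[1] + trans[1]
    return (x, y)

def linear_blend_skinning(vertex, bones, weights):
    """Blend the per-bone transformed positions with weights that sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    x = y = 0.0
    for (rot, trans), w in zip(bones, weights):
        bx, by = apply_bone(rot, trans, vertex)
        x += w * bx
        y += w * by
    return (x, y)

# One bone at rest, one twisted by 180 degrees about the origin.
bones = [(rotation(0.0), (0.0, 0.0)), (rotation(math.pi), (0.0, 0.0))]
collapsed = linear_blend_skinning((1.0, 0.0), bones, [0.5, 0.5])
```

With equal weights the vertex lands at (numerically) the origin, because averaging positions is not the same as averaging rotations. This collapse is a small instance of the volume-loss artifact the abstract refers to.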

Title: DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing

Authors: Lingling Cai, Kang Zhao, Hangjie Yuan, Xiang Wang, Yingya Zhang, Kejie Huang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.20967
Pdf URL: https://arxiv.org/pdf/2506.20967
Copy Paste: [[2506.20967]] DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing(https://arxiv.org/abs/2506.20967)
Keywords: generation
Abstract: The advent of Video Diffusion Transformers (Video DiTs) marks a milestone in video generation. However, directly applying existing video editing methods to Video DiTs often incurs substantial computational overhead, due to resource-intensive attention modification or fine-tuning. To alleviate this problem, we present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by directly operating on clean latents via flow transformation. To be more specific, we observe that editing and sampling can be unified under the continuous flow perspective. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) -- a theoretically unbiased estimation of DFV -- and integrate Implicit Cross Attention (ICA) guidance as well as Embedding Reinforcement (ER) to further enhance editing quality. DFVEdit excels in practical efficiency, offering at least 20x inference speed-up and 85% memory reduction on Video DiTs compared to attention-engineering-based editing methods. Extensive quantitative and qualitative experiments demonstrate that DFVEdit can be seamlessly applied to popular Video DiTs (e.g., CogVideoX and Wan2.1), attaining state-of-the-art performance on structural fidelity, spatial-temporal consistency, and editing quality.

Title: Rethink Sparse Signals for Pose-guided Text-to-image Generation

Authors: Wenjie Xuan, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.20983
Pdf URL: https://arxiv.org/pdf/2506.20983
Copy Paste: [[2506.20983]] Rethink Sparse Signals for Pose-guided Text-to-image Generation(https://arxiv.org/abs/2506.20983)
Keywords: generation
Abstract: Recent works favored dense signals (e.g., depth, DensePose), as an alternative to sparse signals (e.g., OpenPose), to provide detailed spatial guidance for pose-guided text-to-image generation. However, dense representations raised new challenges, including editing difficulties and potential inconsistencies with textual prompts. This fact motivates us to revisit sparse signals for pose guidance, owing to their simplicity and shape-agnostic nature, which remains underexplored. This paper proposes a novel Spatial-Pose ControlNet (SP-Ctrl), equipping sparse signals with robust controllability for pose-guided image generation. Specifically, we extend OpenPose to a learnable spatial representation, making keypoint embeddings discriminative and expressive. Additionally, we introduce keypoint concept learning, which encourages keypoint tokens to attend to the spatial positions of each keypoint, thus improving pose alignment. Experiments on animal- and human-centric image generation tasks demonstrate that our method outperforms recent spatially controllable T2I generation approaches under sparse-pose guidance and even matches the performance of dense signal-based methods. Moreover, SP-Ctrl shows promising capabilities in diverse and cross-species generation through sparse signals. Code will be available at this https URL.

Title: Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

Authors: Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
Subjects: cs.CV, cs.LG, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.20995
Pdf URL: https://arxiv.org/pdf/2506.20995
Copy Paste: [[2506.20995]] Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance(https://arxiv.org/abs/2506.20995)
Keywords: generation
Abstract: We propose a novel step-by-step video-to-audio generation method that sequentially produces individual audio tracks, each corresponding to a specific sound event in the video. Our approach mirrors traditional Foley workflows, aiming to capture all sound events induced by a given video comprehensively. Each generation step is formulated as a guided video-to-audio synthesis task, conditioned on a target text prompt and previously generated audio tracks. This design is inspired by the idea of concept negation from prior compositional generation frameworks. To enable this guided generation, we introduce a training framework that leverages pre-trained video-to-audio models and eliminates the need for specialized paired datasets, allowing training on more accessible data. Experimental results demonstrate that our method generates multiple semantically distinct audio tracks for a single input video, leading to higher-quality composite audio synthesis than existing baselines.

Title: Distilling Normalizing Flows

Authors: Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey Shi
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21003
Pdf URL: https://arxiv.org/pdf/2506.21003
Copy Paste: [[2506.21003]] Distilling Normalizing Flows(https://arxiv.org/abs/2506.21003)
Keywords: generative
Abstract: Explicit density learners are becoming an increasingly popular technique for generative models because of their ability to better model probability distributions. They have advantages over Generative Adversarial Networks due to their ability to perform density estimation and having exact latent-variable inference. This has many advantages, including: being able to simply interpolate, calculate sample likelihood, and analyze the probability distribution. The downside of these models is that they are often more difficult to train and have lower sampling quality. Normalizing flows are explicit density models that use composable bijective functions to turn an intractable probability function into a tractable one. In this work, we present novel knowledge distillation techniques to increase sampling quality and density estimation of smaller student normalizing flows. We seek to study the capacity of knowledge distillation in Compositional Normalizing Flows to understand the benefits and weaknesses provided by these architectures. Normalizing flows have unique properties that allow for non-traditional forms of knowledge transfer, where we can transfer that knowledge within intermediate layers. We find that through this distillation, we can make students significantly smaller while making substantial performance gains over a non-distilled student. With smaller models there is a proportionally increased throughput as this is dependent upon the number of bijectors, and thus parameters, in the network.
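
The phrase "composable bijective functions" refers to the change-of-variables rule: the log-density of a data point is the base log-density of its latent plus the summed log-determinants of each bijector. The sketch below assumes 1-D affine bijectors and a standard normal base, with hypothetical class and function names. It illustrates the general mechanism only, not the distillation method.

```python
import math

class Affine1D:
    """Bijector z = scale * x + shift, mapping data toward the latent space."""

    def __init__(self, scale, shift):
        assert scale != 0.0, "scale must be nonzero for the map to be invertible"
        self.scale, self.shift = scale, shift

    def forward(self, x):
        # Returns the transformed value and log |dz/dx|.
        return self.scale * x + self.shift, math.log(abs(self.scale))

def log_prob(x, bijectors):
    """Exact log-density of x under a flow with a standard normal base."""
    z, total_log_det = x, 0.0
    for b in bijectors:  # cost grows with the number of bijectors
        z, log_det = b.forward(z)
        total_log_det += log_det
    base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    return base + total_log_det

# z = 0.5 * x - 1 means x = 2 * z + 2, i.e. x ~ Normal(mean=2, std=2).
flow = [Affine1D(0.5, 0.0), Affine1D(1.0, -1.0)]
value = log_prob(2.0, flow)
```

Evaluation walks through every bijector once, which is consistent with the abstract's remark that throughput depends on the number of bijectors.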

Title: Bridging Video Quality Scoring and Justification via Large Multimodal Models

Authors: Qizhi Xie, Kun Yuan, Yunpeng Qu, Jiachao Gong, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21011
Pdf URL: https://arxiv.org/pdf/2506.21011
Copy Paste: [[2506.21011]] Bridging Video Quality Scoring and Justification via Large Multimodal Models(https://arxiv.org/abs/2506.21011)
Keywords: generation, quality assessment
Abstract: Classical video quality assessment (VQA) methods generate a numerical score to judge a video's perceived visual fidelity and clarity. Yet, a score fails to describe the video's complex quality dimensions, restricting its applicability. Benefiting from the linguistic output, adapting video large multimodal models (LMMs) to VQA via instruction tuning has the potential to address this issue. The core of the approach lies in the video quality-centric instruction data. Previous explorations mainly focus on the image domain, and their data generation processes heavily rely on human quality annotations and proprietary systems, limiting data scalability and effectiveness. To address these challenges, we propose the Score-based Instruction Generation (SIG) pipeline. Specifically, SIG first scores multiple quality dimensions of an unlabeled video and maps scores to text-defined levels. It then explicitly incorporates a hierarchical Chain-of-Thought (CoT) to model the correlation between specific dimensions and overall quality, mimicking the human visual system's reasoning process. The automated pipeline eliminates the reliance on expert-written quality descriptions and proprietary systems, ensuring data scalability and generation efficiency. To this end, the resulting Score2Instruct (S2I) dataset contains over 320K diverse instruction-response pairs, laying the basis for instruction tuning. Moreover, to advance video LMMs' quality scoring and justification abilities simultaneously, we devise a progressive tuning strategy to fully unleash the power of S2I. Built upon SIG, we further curate a benchmark termed S2I-Bench with 400 open-ended questions to better evaluate the quality justification capacity of video LMMs. Experimental results on the S2I-Bench and existing benchmarks indicate that our method consistently improves quality scoring and justification capabilities across multiple video LMMs.
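
The step in which SIG "maps scores to text-defined levels" can be pictured with a minimal sketch. The five level names, the 0-100 score range, and the cut points below are assumptions chosen for illustration; the abstract does not state the scale the authors use.

```python
from bisect import bisect_right

# Hypothetical five-level scale and cut points on a 0-100 score range.
# Neither the names nor the thresholds come from the paper.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]
CUTS = [20.0, 40.0, 60.0, 80.0]


def score_to_level(score: float) -> str:
    """Map a numeric quality score in [0, 100] to a text-defined level."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must lie in [0, 100]")
    # bisect_right counts how many cut points are <= score,
    # which is exactly the index of the matching level.
    return LEVELS[bisect_right(CUTS, score)]


def describe(dimension_scores: dict[str, float]) -> dict[str, str]:
    """Convert per-dimension scores into per-dimension text levels."""
    return {name: score_to_level(s) for name, s in dimension_scores.items()}
```

For example, `describe({"sharpness": 85.0, "noise": 35.0})` labels sharpness as "excellent" and noise as "poor"; text levels of this kind could then be slotted into instruction-response templates.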

Title: HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation

Authors: Qingyue Jiao, Kangyu Zheng, Yiyu Shi, Zhiding Liang
Subjects: cs.CV, cs.LG, quant-ph
Abstract URL: https://arxiv.org/abs/2506.21015
Pdf URL: https://arxiv.org/pdf/2506.21015
Copy Paste: [[2506.21015]] HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation(https://arxiv.org/abs/2506.21015)
Keywords: generation, generative
Abstract: Machine learning-assisted diagnosis is gaining traction in skin disease detection, but training effective models requires large amounts of high-quality data. Skin disease datasets often suffer from class imbalance, privacy concerns, and object bias, making data augmentation essential. While classical generative models are widely used, they demand extensive computational resources and lengthy training time. Quantum computing offers a promising alternative, but existing quantum-based image generation methods can only yield low-quality grayscale images. Through a novel classical-quantum latent space fusion technique, our work overcomes this limitation and introduces the first classical-quantum generative adversarial network (GAN) capable of generating color medical images. Our model outperforms classical deep convolutional GANs and existing hybrid classical-quantum GANs in both image generation quality and classification performance boost when used as data augmentation. Moreover, the performance boost is comparable with that achieved using state-of-the-art classical generative models, yet with over 25 times fewer parameters and 10 times fewer training epochs. Such results suggest a promising future for quantum image generation as quantum hardware advances. Finally, we demonstrate the robust performance of our model on a real IBM quantum machine with hardware noise.

Title: Multimodal Prompt Alignment for Facial Expression Recognition

Authors: Fuyan Ma, Yiran He, Bin Sun, Shutao Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21017
Pdf URL: https://arxiv.org/pdf/2506.21017
Copy Paste: [[2506.21017]] Multimodal Prompt Alignment for Facial Expression Recognition(https://arxiv.org/abs/2506.21017)
Keywords: generation
Abstract: Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.

Title: LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection

Authors: Lei Hao, Lina Xu, Chang Liu, Yanni Dong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21018
Pdf URL: https://arxiv.org/pdf/2506.21018
Copy Paste: [[2506.21018]] LASFNet: A Lightweight Attention-Guided Self-Modulation Feature Fusion Network for Multimodal Object Detection(https://arxiv.org/abs/2506.21018)
Keywords: generation
Abstract: Effective deep feature extraction via feature-level fusion is crucial for multimodal object detection. However, previous studies often involve complex training processes that integrate modality-specific features by stacking multiple feature-level fusion units, leading to significant computational overhead. To address this issue, we propose a new fusion detection baseline that uses a single feature-level fusion unit to enable high-performance detection, thereby simplifying the training process. Based on this approach, we propose a lightweight attention-guided self-modulation feature fusion network (LASFNet), which introduces a novel attention-guided self-modulation feature fusion (ASFF) module that adaptively adjusts the responses of fusion features at both global and local levels based on attention information from different modalities, thereby promoting comprehensive and enriched feature generation. Additionally, a lightweight feature attention transformation module (FATM) is designed at the neck of LASFNet to enhance the focus on fused features and minimize information loss. Extensive experiments on three representative datasets demonstrate that, compared to state-of-the-art methods, our approach achieves a favorable efficiency-accuracy trade-off, reducing the number of parameters and computational cost by as much as 90% and 85%, respectively, while improving detection accuracy (mAP) by 1%-3%. The code will be open-sourced at this https URL.

Title: Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation

Authors: Ze Wang, Hao Chen, Benran Hu, Jiang Liu, Ximeng Sun, Jialian Wu, Yusheng Su, Xiaodong Yu, Emad Barsoum, Zicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21022
Pdf URL: https://arxiv.org/pdf/2506.21022
Copy Paste: [[2506.21022]] Instella-T2I: Pushing the Limits of 1D Discrete Latent Space Image Generation(https://arxiv.org/abs/2506.21022)
Keywords: generation
Abstract: Image tokenization plays a critical role in reducing the computational demands of modeling high-resolution images, significantly improving the efficiency of image and multimodal understanding and generation. Recent advances in 1D latent spaces have reduced the number of tokens required by eliminating the need for a 2D grid structure. In this paper, we further advance compact discrete image representation by introducing 1D binary image latents. By representing each image as a sequence of binary vectors, rather than using traditional one-hot codebook tokens, our approach preserves high-resolution details while maintaining the compactness of 1D latents. To the best of our knowledge, our text-to-image models are the first to achieve competitive performance in both diffusion and auto-regressive generation using just 128 discrete tokens for images up to 1024x1024, demonstrating up to a 32-fold reduction in token numbers compared to standard VQ-VAEs. The proposed 1D binary latent space, coupled with simple model architectures, achieves marked improvements in training and inference speed. Our text-to-image models allow for a global batch size of 4096 on a single GPU node with 8 AMD MI300X GPUs, and the training can be completed within 200 GPU days. Our models achieve competitive performance compared to modern image generation models without any in-house private training data or post-training refinements, offering a scalable and efficient alternative to conventional tokenization methods.
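
The "32-fold reduction" figure can be checked with a short worked calculation. The sketch below assumes the baseline is a 2D-grid VQ-VAE with a 16x spatial downsampling factor; the abstract does not name the baseline configuration, so this is only one setting that is consistent with the quoted numbers.

```python
def grid_tokens(resolution: int, downsample: int) -> int:
    """Number of tokens in a square 2D latent grid."""
    if resolution % downsample != 0:
        raise ValueError("resolution must be divisible by the downsampling factor")
    side = resolution // downsample
    return side * side


# Assumed baseline: 16x downsampling (hypothetical, not stated in the abstract).
baseline_tokens = grid_tokens(1024, 16)   # 64 x 64 grid
compact_tokens = 128                      # token count quoted in the abstract
reduction = baseline_tokens / compact_tokens

print(baseline_tokens, reduction)
```

Under that assumption a 1024x1024 image costs 64 x 64 = 4096 grid tokens, and 4096 / 128 = 32, which matches the stated reduction.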

Title: Efficient Skill Discovery via Regret-Aware Optimization

Authors: He Zhang, Ming Zhou, Shaopeng Zhai, Ying Sun, Hui Xiong
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21044
Pdf URL: https://arxiv.org/pdf/2506.21044
Copy Paste: [[2506.21044]] Efficient Skill Discovery via Regret-Aware Optimization(https://arxiv.org/abs/2506.21044)
Keywords: generation
Abstract: Unsupervised skill discovery aims to learn diverse and distinguishable behaviors in open-ended reinforcement learning. Existing methods focus on improving diversity through pure exploration, mutual information optimization, and learning temporal representation. Although they perform well at exploration, they remain limited in efficiency, especially in high-dimensional settings. In this work, we frame skill discovery as a min-max game of skill generation and policy learning, proposing a regret-aware method on top of temporal representation learning that expands the discovered skill space along the direction of upgradable policy strength. The key insight behind the proposed method is that skill discovery is adversarial to policy learning, i.e., skills with weak strength should be explored further, while skills with converged strength need less exploration. As an implementation, we score the degree of strength convergence with regret, and guide the skill discovery with a learnable skill generator. To avoid degeneration, skill generation comes from an upgradable population of skill generators. We conduct experiments on environments with varying complexities and dimension sizes. Empirical results show that our method outperforms baselines in both efficiency and diversity. Moreover, our method achieves a 15% zero-shot improvement in high-dimensional environments, compared to existing methods.

Title: Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

Authors: Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, Yuanzhang Li
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2506.21046
Pdf URL: https://arxiv.org/pdf/2506.21046
Copy Paste: [[2506.21046]] Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features(https://arxiv.org/abs/2506.21046)
Keywords: generative
Abstract: The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. In previous work, these features ubiquitously come from supervised learning. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures, outperforming state-of-the-art methods. Code available at this https URL.

Title: PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image

Authors: Hongyu Yan, Kunming Luo, Weiyu Li, Yixun Liang, Shengming Li, Jingwei Huang, Chunchao Guo, Ping Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21076
Pdf URL: https://arxiv.org/pdf/2506.21076
Copy Paste: [[2506.21076]] PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image(https://arxiv.org/abs/2506.21076)
Keywords: generation
Abstract: 3D characters play a crucial role in our daily entertainment. To improve the efficiency of 3D character modeling, recent image-based methods use two separate models to achieve pose standardization and 3D reconstruction of the A-pose character. However, these methods are prone to generating distorted and degraded images in the pose standardization stage due to self-occlusion and viewpoints, which further affects the geometric quality of the subsequent reconstruction process. To tackle these problems, we propose PoseMaster, an end-to-end controllable 3D character generation framework. Specifically, we unify pose transformation and 3D character generation into a flow-based 3D native generation framework. To achieve accurate arbitrary-pose control, we propose to leverage the 3D body bones existing in the skeleton of an animatable character as the pose condition. Furthermore, considering the specificity of multi-condition control, we randomly empty the pose condition and the image condition during training to improve the effectiveness and generalizability of pose control. Finally, we create a high-quality pose-control dataset derived from realistic character animation data so that the model learns the implicit relationships between skeleton and skinning weights. Extensive experiments show that PoseMaster outperforms current state-of-the-art techniques in both qualitative and quantitative evaluations for A-pose character generation while demonstrating its powerful ability to achieve precise control for arbitrary poses.

Title: OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

Authors: Caoshuo Li, Zengmao Ding, Xiaobin Hu, Bang Li, Donghao Luo, AndyPian Wu, Chaoyang Wang, Chengjie Wang, Taisong Jin, SevenShu, Yunsheng Wu, Yongge Liu, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21101
Pdf URL: https://arxiv.org/pdf/2506.21101
Copy Paste: [[2506.21101]] OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography(https://arxiv.org/abs/2506.21101)
Keywords: generation
Abstract: As one of the earliest ancient languages, Oracle Bone Script (OBS) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named OracleFusion. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (OSVF), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.

Title: Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges

Authors: Changxi Chi, Jun Xia, Yufei Huang, Jingbo Zhou, Siyuan Li, Yunfan Liu, Chang Yu, Stan Z. Li
Subjects: cs.LG, q-bio.MN
Abstract URL: https://arxiv.org/abs/2506.21107
Pdf URL: https://arxiv.org/pdf/2506.21107
Copy Paste: [[2506.21107]] Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges(https://arxiv.org/abs/2506.21107)
Keywords: generation
Abstract: Estimating single-cell responses across various perturbations facilitates the identification of key genes and enhances drug screening, significantly boosting experimental efficiency. However, single-cell sequencing is a destructive process, making it impossible to capture the same cell's phenotype before and after perturbation. Consequently, data collected under perturbed and unperturbed conditions are inherently unpaired. Existing methods either attempt to forcibly pair unpaired data using random sampling, or neglect the inherent relationship between unperturbed and perturbed cells during the modeling. In this work, we propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions, effectively addressing the challenge of unpaired data. We further interpret this framework as a form of data augmentation. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way, and further incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles. Moreover, gene expression under the same perturbation often varies significantly across cells, frequently exhibiting a bimodal distribution that reflects intrinsic heterogeneity. To capture this, we introduce a more suitable evaluation metric. We propose Unlasting, dual conditional diffusion models that overcome the problem of unpaired single-cell perturbation data and strengthen the model's insight into perturbations under the guidance of the GRN, with a dedicated mask model designed to improve generation quality by predicting silent genes. In addition, we introduce a biologically grounded evaluation metric that better reflects the inherent heterogeneity in single-cell responses.

Title: Learning to See in the Extremely Dark

Authors: Hai Jiang, Binhao Guan, Zhen Liu, Xiaohong Liu, Jian Yu, Zheng Liu, Songchen Han, Shuaicheng Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21132
Pdf URL: https://arxiv.org/pdf/2506.21132
Copy Paste: [[2506.21132]] Learning to See in the Extremely Dark(https://arxiv.org/abs/2506.21132)
Keywords: restoration, generative
Abstract: Learning-based methods have made promising advances in low-light RAW image enhancement, but their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at this https URL.

Title: Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks

Authors: Deepak Kumar Panda, Weisi Guo
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21142
Pdf URL: https://arxiv.org/pdf/2506.21142
Copy Paste: [[2506.21142]] Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks(https://arxiv.org/abs/2506.21142)
Keywords: generative
Abstract: The growing integration of UAVs into civilian airspace underscores the need for resilient and intelligent intrusion detection systems (IDS), as traditional anomaly detection methods often fail to identify novel threats. A common approach treats unfamiliar attacks as out-of-distribution (OOD) samples; however, this leaves systems vulnerable when mitigation is inadequate. Moreover, conventional OOD detectors struggle to distinguish stealthy adversarial attacks from genuine OOD events. This paper introduces a conditional generative adversarial network (cGAN)-based framework for crafting stealthy adversarial attacks that evade IDS mechanisms. We first design a robust multi-class IDS classifier trained on benign UAV telemetry and known cyber-attacks, including Denial of Service (DoS), false data injection (FDI), man-in-the-middle (MiTM), and replay attacks. Using this classifier, our cGAN perturbs known attacks to generate adversarial samples that are misclassified as benign while retaining statistical resemblance to OOD distributions. These adversarial samples are iteratively refined to achieve high stealth and success rates. To detect such perturbations, we implement a conditional variational autoencoder (CVAE), leveraging negative log-likelihood to separate adversarial inputs from authentic OOD samples. Comparative evaluation shows that CVAE-based regret scores significantly outperform traditional Mahalanobis distance-based detectors in identifying stealthy adversarial threats. Our findings emphasize the importance of advanced probabilistic modeling to strengthen IDS capabilities against adaptive, generative-model-based cyber intrusions.
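
For readers unfamiliar with the "Mahalanobis distance-based detectors" named as the baseline, the following is a minimal generic sketch of such a detector. It is not the authors' implementation: the single-Gaussian fit, the covariance regularizer, and the 99th-percentile threshold are all assumptions made for illustration.

```python
import numpy as np


def fit_gaussian(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Estimate the mean and precision matrix of in-distribution features."""
    mean = x.mean(axis=0)
    # A small ridge term keeps the covariance invertible (assumed value).
    cov = np.cov(x, rowvar=False) + 1e-6 * np.eye(x.shape[1])
    return mean, np.linalg.inv(cov)


def mahalanobis(x: np.ndarray, mean: np.ndarray, precision: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each row of x to the fitted Gaussian."""
    diff = x - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))


rng = np.random.default_rng(0)
train = rng.normal(size=(500, 4))             # stand-in for benign telemetry features
mean, precision = fit_gaussian(train)

# Assumed rule: flag anything beyond the 99th percentile of training distances.
threshold = np.percentile(mahalanobis(train, mean, precision), 99)

queries = np.vstack([np.zeros(4), 10.0 * np.ones(4)])
is_ood = mahalanobis(queries, mean, precision) > threshold
```

Here the query at the origin falls inside the threshold and the query far from the training data is flagged. A detector of this kind scores only distance from the training distribution, which is one plausible reason it would struggle to tell crafted adversarial inputs apart from genuine OOD events.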

Title: Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image

Authors: Pufan Li, Bi'an Du, Wei Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21152
Pdf URL: https://arxiv.org/pdf/2506.21152
Copy Paste: [[2506.21152]] Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image(https://arxiv.org/abs/2506.21152)
Keywords: generation
Abstract: Generating realistic 3D objects from single-view images requires natural appearance, 3D consistency, and the ability to capture multiple plausible interpretations of unseen regions. Existing approaches often rely on fine-tuning pretrained 2D diffusion models or directly generating 3D information through fast network inference or 3D Gaussian Splatting, but their results generally suffer from poor multiview consistency and lack geometric detail. To tackle these issues, we present a novel method that seamlessly integrates geometry and perception priors without requiring additional model training to reconstruct detailed 3D objects from a single image. Specifically, we train three different Gaussian branches initialized from the geometry prior, perception prior and Gaussian noise, respectively. The geometry prior captures the rough 3D shapes, while the perception prior utilizes the 2D pretrained diffusion model to enhance multiview information. Subsequently, we refine 3D Gaussian branches through mutual interaction between geometry and perception priors, further enhanced by a reprojection-based strategy that enforces depth consistency. Experiments demonstrate the higher-fidelity reconstruction results of our method, outperforming existing methods on novel view synthesis and 3D reconstruction, demonstrating robust and consistent 3D object generation.

Title: Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design

Authors: Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21158
Pdf URL: https://arxiv.org/pdf/2506.21158
Copy Paste: [[2506.21158]] Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design(https://arxiv.org/abs/2506.21158)
Keywords: generation, generative
Abstract: In many real-world applications, evaluating the goodness of instances (e.g., through human feedback or physics simulations) is often costly and time-consuming, in contrast to proposing new instances. This is even more critical in reinforcement learning, as new interactions with the environment (i.e., new instances) need to be evaluated to provide a reward signal to learn from. As sufficient exploration is crucial, learning from a diverse mini-batch can have a large impact and help mitigate mode collapse. In this paper, we introduce diverse mini-batch selection for reinforcement learning and propose to use determinantal point processes for this task. We study this framework in the context of a real-world problem, namely drug discovery. We experimentally study how our proposed framework can improve the effectiveness of chemical exploration in de novo drug design, where finding diverse and high-quality solutions is essential. We conduct a comprehensive evaluation with three well-established molecular generation oracles over numerous generative steps. Our experiments conclude that our diverse mini-batch selection framework can substantially improve the diversity of the solutions, while still obtaining solutions of high quality. In drug discovery, such an outcome can potentially lead to fulfilling unmet medication needs faster.
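
As a rough illustration of how a determinantal point process (DPP) can favor a diverse, high-quality mini-batch, here is a minimal sketch. The abstract says only that DPPs are used; the quality-times-cosine-similarity kernel and the naive greedy log-determinant selection below are assumptions, not the authors' procedure.

```python
import numpy as np


def build_kernel(features: np.ndarray, quality: np.ndarray) -> np.ndarray:
    """L = diag(q) @ S @ diag(q), with S the cosine similarity of the features.

    Assumes every feature vector is non-zero.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T
    return quality[:, None] * similarity * quality[None, :]


def greedy_dpp(kernel: np.ndarray, k: int) -> list[int]:
    """Greedily pick up to k items, each time maximizing log det of the chosen submatrix."""
    selected: list[int] = []
    candidates = list(range(kernel.shape[0]))
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in candidates:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:  # every remaining candidate is redundant
            break
        selected.append(best)
        candidates.remove(best)
    return selected


# Items 0 and 1 are near-duplicates; item 2 is different but lower quality.
features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
quality = np.array([1.0, 0.9, 0.8])
batch = greedy_dpp(build_kernel(features, quality), k=2)
```

In this toy case the selection keeps item 0 and then prefers the dissimilar item 2 over the higher-quality near-duplicate item 1, because the determinant shrinks toward zero when two chosen items are almost identical.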

Title: BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models

Authors: Louis Kerner, Michel Meintz, Bihe Zhao, Franziska Boenisch, Adam Dziedzic
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21209
Pdf URL: https://arxiv.org/pdf/2506.21209
Copy Paste: [[2506.21209]] BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models(https://arxiv.org/abs/2506.21209)
Keywords: generation, generative
Abstract: State-of-the-art text-to-image models like Infinity generate photorealistic images at an unprecedented speed. These models operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework for Infinity. Our method embeds a watermark directly at the bit level of the token stream across multiple scales (also referred to as resolutions) during Infinity's image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs.

Title: Video Virtual Try-on with Conditional Diffusion Transformer Inpainter

Authors: Cheng Zou, Senlin Cheng, Bolei Xu, Dandan Zheng, Xiaobo Li, Jingdong Chen, Ming Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21270
Pdf URL: https://arxiv.org/pdf/2506.21270
Copy Paste: [[2506.21270]] Video Virtual Try-on with Conditional Diffusion Transformer Inpainter(https://arxiv.org/abs/2506.21270)
Keywords: generation
Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all frames. Naively applying image-based try-on methods frame by frame can give poor results due to severe inconsistency. Recent diffusion-based video try-on methods, though very few, converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it for the video try-on task. This has shown improvements, but inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start from a video generation problem instead of an image-based try-on problem, which gives better spatial-temporal consistency from the beginning. Specifically, we first build a video inpainting framework based on a Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to make sure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.

Title: HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation

Authors: Diego Biagini, Nassir Navab, Azade Farshad
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21287
Pdf URL: https://arxiv.org/pdf/2506.21287
Copy Paste: [[2506.21287]] HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation(https://arxiv.org/abs/2506.21287)
Keywords: generation
Abstract: Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.

Title: DynamicBench: Evaluating Real-Time Report Generation in Large Language Models

Authors: Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, Jiaya Jia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21343
Pdf URL: https://arxiv.org/pdf/2506.21343
Copy Paste: [[2506.21343]] DynamicBench: Evaluating Real-Time Report Generation in Large Language Models(https://arxiv.org/abs/2506.21343)
Keywords: generation
Abstract: Traditional benchmarks for large language models (LLMs) typically rely on static evaluations through storytelling or opinion expression, which fail to capture the dynamic requirements of real-time information processing in contemporary applications. To address this limitation, we present DynamicBench, a benchmark designed to evaluate the proficiency of LLMs in storing and processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval pipeline, integrating web searches with local report databases. It necessitates domain-specific knowledge, ensuring accurate report generation within specialized fields. By evaluating models in scenarios that either provide or withhold external documents, DynamicBench effectively measures their capability to independently process recent information or leverage contextual enhancements. Additionally, we introduce an advanced report generation system adept at managing dynamic information synthesis. Our experimental results confirm the efficacy of our approach, with our method achieving state-of-the-art performance, surpassing GPT4o in document-free and document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data will be made publicly available.

Title: ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Authors: Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21356
Pdf URL: https://arxiv.org/pdf/2506.21356
Copy Paste: [[2506.21356]] ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models(https://arxiv.org/abs/2506.21356)
Keywords: generation
Abstract: Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.

Title: CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations

Authors: Julian Lorenz, Mrunmai Phatak, Robin Schön, Katja Ludwig, Nico Hörmann, Annemarie Friedrich, Rainer Lienhart
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21357
Pdf URL: https://arxiv.org/pdf/2506.21357
Copy Paste: [[2506.21357]] CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations(https://arxiv.org/abs/2506.21357)
Keywords: generation
Abstract: 2D scene graphs provide a structural and explainable framework for scene understanding. However, current work still struggles with the lack of accurate scene graph data. To overcome this data bottleneck, we present CoPa-SG, a synthetic scene graph dataset with highly precise ground truth and exhaustive relation annotations between all objects. Moreover, we introduce parametric and proto-relations, two new fundamental concepts for scene graphs. The former provides a much more fine-grained representation than its traditional counterpart by enriching relations with additional parameters such as angles or distances. The latter encodes hypothetical relations in a scene graph and describes how relations would form if new objects are placed in the scene. Using CoPa-SG, we compare the performance of various scene graph generation models. We demonstrate how our new relation types can be integrated in downstream applications to enhance planning and reasoning capabilities.

Title: GenFlow: Interactive Modular System for Image Generation

Authors: Duc-Hung Nguyen, Huu-Phuc Huynh, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21369
Pdf URL: https://arxiv.org/pdf/2506.21369
Copy Paste: [[2506.21369]] GenFlow: Interactive Modular System for Image Generation(https://arxiv.org/abs/2506.21369)
Keywords: generation, generative
Abstract: Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present GenFlow, a novel modular framework that empowers users of all skill levels to generate images with precision and ease. Featuring a node-based editor for seamless customization and an intelligent assistant powered by natural language processing, GenFlow transforms the complexity of workflow creation into an intuitive and accessible experience. By automating deployment processes and minimizing technical barriers, our framework makes cutting-edge generative art tools available to everyone. A user study demonstrated GenFlow's ability to optimize workflows, reduce task completion times, and enhance user understanding through its intuitive interface and adaptive features. These results position GenFlow as a groundbreaking solution that redefines accessibility and efficiency in the realm of generative art.

Title: XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation

Authors: Bowen Chen, Mengyi Zhao, Haomiao Sun, Li Chen, Xu Wang, Kang Du, Xinglong Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21416
Pdf URL: https://arxiv.org/pdf/2506.21416
Copy Paste: [[2506.21416]] XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation(https://arxiv.org/abs/2506.21416)
Keywords: generation
Abstract: Achieving fine-grained control over subject identity and semantic attributes (pose, style, lighting) in text-to-image generation, particularly for multiple subjects, often undermines the editability and coherence of Diffusion Transformers (DiTs). Many approaches introduce artifacts or suffer from attribute entanglement. To overcome these challenges, we propose XVerse, a novel multi-subject controlled generation model. By transforming reference images into offsets for token-specific text-stream modulation, XVerse allows for precise and independent control of specific subjects without disrupting image latents or features. Consequently, XVerse offers high-fidelity, editable multi-subject image synthesis with robust control over individual subject characteristics and semantic attributes. This advancement significantly improves personalized and complex scene generation capabilities.

Title: Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

Authors: Prajwal Koirala, Cody Fleming
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21427
Pdf URL: https://arxiv.org/pdf/2506.21427
Copy Paste: [[2506.21427]] Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning(https://arxiv.org/abs/2506.21427)
Keywords: generation, generative
Abstract: Generative models such as diffusion and flow-matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL, enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and behavior cloning benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.

Title: Controllable 3D Placement of Objects with Scene-Aware Diffusion Models

Authors: Mohamed Omran, Dimitris Kalatzis, Jens Petersen, Amirhossein Habibian, Auke Wiggers
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21446
Pdf URL: https://arxiv.org/pdf/2506.21446
Copy Paste: [[2506.21446]] Controllable 3D Placement of Objects with Scene-Aware Diffusion Models(https://arxiv.org/abs/2506.21446)
Keywords: generative
Abstract: Image editing approaches have become more powerful and flexible with the advent of powerful text-conditioned generative models. However, placing objects in an environment with a precise location and orientation still remains a challenge, as this typically requires carefully crafted inpainting masks or prompts. In this work, we show that a carefully designed visual map, combined with coarse object masks, is sufficient for high quality object placement. We design a conditioning signal that resolves ambiguities, while being flexible enough to allow for changing of shapes or object orientations. By building on an inpainting model, we leave the background intact by design, in contrast to methods that model objects and background jointly. We demonstrate the effectiveness of our method in the automotive setting, where we compare different conditioning signals in novel object placement tasks. These tasks are designed to measure edit quality not only in terms of appearance, but also in terms of pose and location accuracy, including cases that require non-trivial shape changes. Lastly, we show that fine location control can be combined with appearance control to place existing objects in precise locations in a scene.

Title: Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration

Authors: Jiahe Chen, Jiaying He, Qian Shao, Qiyuan Chen, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21509
Pdf URL: https://arxiv.org/pdf/2506.21509
Copy Paste: [[2506.21509]] Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration(https://arxiv.org/abs/2506.21509)
Keywords: generation
Abstract: Large Vision-Language Models (LVLMs) have demonstrated significant advancements in multimodal understanding, yet they are frequently hampered by hallucination, the generation of text that contradicts visual input. Existing training-free decoding strategies exhibit critical limitations, including the use of static constraints that do not adapt to semantic drift during generation, inefficiency stemming from the need for multiple forward passes, and degradation of detail due to overly rigid intervention rules. To overcome these challenges, this paper introduces Dynamic Logits Calibration (DLC), a novel training-free decoding framework designed to dynamically align text generation with visual evidence at inference time. At each decoding step, DLC employs CLIP to assess the semantic alignment between the input image and the generated text sequence. Then, the Relative Visual Advantage (RVA) of candidate tokens is evaluated against a dynamically updated contextual baseline, adaptively adjusting output logits to favor tokens that are visually grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time context alignment score, carefully balances the visual guidance while ensuring the overall quality of the textual output. Extensive experiments conducted across diverse benchmarks and various LVLM architectures (such as LLaVA, InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces hallucinations, outperforming current methods while maintaining high inference efficiency by avoiding multiple forward passes. Overall, we present an effective and efficient decoding-time solution to mitigate hallucinations, thereby enhancing the reliability of LVLMs in practical use. Code will be released on GitHub.
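Illustration (not from the paper): the abstract describes shifting output logits toward candidate tokens whose image-text alignment exceeds a contextual baseline, with a weight that depends on how well the context is already aligned. The sketch below is one possible reading of that idea. The function name `calibrate_logits`, the additive update, the weight schedule, and the alignment scores (which would come from CLIP in the described method) are all assumptions for this example, not the authors' formulas.

```python
import numpy as np

def calibrate_logits(logits, candidate_scores, baseline, max_weight=2.0):
    """Shift logits by each candidate's advantage over a contextual baseline.

    logits           : model logits for the candidate tokens
    candidate_scores : image-text alignment score if each candidate is appended
                       (placeholder numbers here; assumed to lie in [0, 1])
    baseline         : alignment score of the current context (assumed in [0, 1])
    max_weight       : hypothetical cap on the strength of the visual guidance
    """
    logits = np.asarray(logits, dtype=float)
    scores = np.asarray(candidate_scores, dtype=float)
    # Relative advantage of each candidate over the current context.
    advantage = scores - baseline
    # Assumed schedule: guide more strongly when the context is poorly aligned.
    weight = max_weight * (1.0 - np.clip(baseline, 0.0, 1.0))
    return logits + weight * advantage

# Two candidates with equal logits; the first is better aligned with the image.
adjusted = calibrate_logits([2.0, 2.0], [0.30, 0.20], baseline=0.25)
print(adjusted)
```

With equal starting logits, the candidate whose alignment score is above the baseline ends up with the higher adjusted logit, and the one below the baseline is pushed down.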

Title: DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion

Authors: Yansong Qu, Shaohui Dai, Xinyang Li, Yuze Wang, You Shen, Liujuan Cao, Rongrong Ji
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21544
Pdf URL: https://arxiv.org/pdf/2506.21544
Copy Paste: [[2506.21544]] DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion(https://arxiv.org/abs/2506.21544)
Keywords: generation
Abstract: Reconstructing 3D objects from a single image is a long-standing challenge, especially under real-world occlusions. While recent diffusion-based view synthesis models can generate consistent novel views from a single RGB image, they generally assume fully visible inputs and fail when parts of the object are occluded. This leads to inconsistent views and degraded 3D reconstruction quality. To overcome this limitation, we propose an end-to-end framework for occlusion-aware multi-view generation. Our method directly synthesizes six structurally consistent novel views from a single partially occluded image, enabling downstream 3D reconstruction without requiring prior inpainting or manual annotations. We construct a self-supervised training pipeline using the Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and pseudo-ground-truth views to teach the model structure-aware completion and view consistency. Without modifying the original architecture, we fully fine-tune the view synthesis model to jointly learn completion and multi-view generation. Additionally, we introduce the first benchmark for occlusion-aware reconstruction, encompassing diverse occlusion levels, object categories, and mask patterns. This benchmark provides a standardized protocol for evaluating future methods under partial occlusions. Our code is available at this https URL.

Title: Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Authors: Ziyue Li, Chenrui Fan, Tianyi Zhou
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21551
Pdf URL: https://arxiv.org/pdf/2506.21551
Copy Paste: [[2506.21551]] Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test(https://arxiv.org/abs/2506.21551)
Keywords: generation
Abstract: Grokking, i.e., test performance that keeps improving long after the training loss has converged, has recently been observed in neural network training, making the mechanism of generalization and other emerging capabilities such as reasoning mysterious. While prior studies usually train small models on a few toy or highly specific tasks for thousands of epochs, we conduct the first study of grokking on checkpoints during one-pass pretraining of a 7B large language model (LLM), i.e., OLMoE. We compute the training loss and evaluate generalization on diverse benchmark tasks, including math reasoning, code generation, and commonsense/domain-specific knowledge retrieval tasks. Our study, for the first time, verifies that grokking still happens in the pretraining of large-scale foundation models, though different data may enter grokking stages asynchronously. We further demystify grokking's "emergence of generalization" by investigating LLM internal dynamics. Specifically, we find that training samples' pathways (i.e., expert choices across layers) evolve from random and instance-specific to more structured and shareable between samples during grokking. Also, the complexity of a sample's pathway decreases despite the converged loss. These findings indicate a memorization-to-generalization conversion, providing a mechanistic explanation of delayed generalization. In the study, we develop two novel metrics to quantify pathway distance and the complexity of a single pathway. We show their ability to predict the generalization improvement on diverse downstream tasks. They are efficient, simple to compute, and solely dependent on training data. Hence, they have practical value for pretraining, enabling us to monitor the generalization performance without finetuning or testing. Theoretically, we show that more structured pathways reduce model complexity and improve the generalization bound.