2025-06-03

Title: On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

Authors: Magdalena Proszewska, Nikolay Malkin, N. Siddharth
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00136
Pdf URL: https://arxiv.org/pdf/2506.00136
Copy Paste: [[2506.00136]] On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning(https://arxiv.org/abs/2506.00136)
Keywords: generation, generative
Abstract: Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for tasks such as downstream classification, controllable generation, and interpolation. However, the generative performance of DAs relies heavily on how well the latent variables can be modelled and subsequently sampled from. Better generative modelling is also the primary goal of another class of diffusion models -- those that learn their forward (noising) process. While effective at adjusting the noise process in an input-dependent manner, they must satisfy additional constraints derived from the terminal conditions of the diffusion process. Here, we draw a connection between these two classes of models and show that certain design decisions (latent variable choice, conditioning method, etc.) in the DA framework -- leading to a model we term DMZ -- allow us to obtain the best of both worlds: effective representations as evaluated on downstream tasks, including domain transfer, as well as more efficient modelling and generation with fewer denoising steps compared to standard DMs.
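Summary: Diffusion autoencoders (DAs) are diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process, and their generative performance depends heavily on how well that latent variable can be modelled and sampled. The paper connects DAs with diffusion models that learn their forward (noising) process and shows that specific design decisions in the DA framework (latent variable choice, conditioning method, etc.), yielding a model called DMZ, give both effective representations on downstream tasks, including domain transfer, and more efficient modelling and generation with fewer denoising steps than standard DMs.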

Title: MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models

Authors: Srivathsan Badrinarayanan, Rishikesh Magar, Akshay Antony, Radheesh Sharma Meda, Amir Barati Farimani
Subjects: cs.LG, cond-mat.mtrl-sci, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00198
Pdf URL: https://arxiv.org/pdf/2506.00198
Copy Paste: [[2506.00198]] MOFGPT: Generative Design of Metal-Organic Frameworks using Language Models(https://arxiv.org/abs/2506.00198)
Keywords: generation, generative
Abstract: The discovery of Metal-Organic Frameworks (MOFs) with application-specific properties remains a central challenge in materials chemistry, owing to the immense size and complexity of their structural design space. Conventional computational screening techniques such as molecular simulations and density functional theory (DFT), while accurate, are computationally prohibitive at scale. Machine learning offers an exciting alternative by leveraging data-driven approaches to accelerate materials discovery. The complexity of MOFs, with their extended periodic structures and diverse topologies, creates both opportunities and challenges for generative modeling approaches. To address these challenges, we present a reinforcement learning-enhanced, transformer-based framework for the de novo design of MOFs. Central to our approach is MOFid, a chemically-informed string representation encoding both connectivity and topology, enabling scalable generative modeling. Our pipeline comprises three components: (1) a generative GPT model trained on MOFid sequences, (2) MOFormer, a transformer-based property predictor, and (3) a reinforcement learning (RL) module that optimizes generated candidates via property-guided reward functions. By integrating property feedback into sequence generation, our method drives the model toward synthesizable, topologically valid MOFs with desired functional attributes. This work demonstrates the potential of large language models, when coupled with reinforcement learning, to accelerate inverse design in reticular chemistry and unlock new frontiers in computational MOF discovery.
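Summary: Discovering metal-organic frameworks (MOFs) with application-specific properties remains a central challenge in materials chemistry because the structural design space is vast and complex, and conventional screening with molecular simulations or density functional theory (DFT) is accurate but computationally prohibitive at scale. The paper presents a reinforcement learning-enhanced, transformer-based framework for de novo MOF design built on MOFid, a chemically informed string representation that encodes connectivity and topology. The pipeline has three components: (1) a generative GPT model trained on MOFid sequences, (2) MOFormer, a transformer-based property predictor, and (3) a reinforcement learning (RL) module that optimizes generated candidates via property-guided reward functions, steering generation toward synthesizable, topologically valid MOFs with desired functional attributes.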

Title: Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Authors: Anthony Gosselin, Ge Ya Luo, Luis Lara, Florian Golemo, Derek Nowrouzezahrai, Liam Paull, Alexia Jolicoeur-Martineau, Christopher Pal
Subjects: cs.CV, cs.AI, cs.RO
Abstract URL: https://arxiv.org/abs/2506.00227
Pdf URL: https://arxiv.org/pdf/2506.00227
Copy Paste: [[2506.00227]] Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes(https://arxiv.org/abs/2506.00227)
Keywords: generation
Abstract: Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes due to the scarcity of accident events in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To tackle the problem, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation where minor variations in input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with independently tunable scales for each conditioning signal. Ctrl-Crash achieves state-of-the-art performance across quantitative video quality metrics (e.g., FVD and JEDi) and qualitative measurements based on a human evaluation of physical realism and video quality, compared to prior diffusion-based methods.
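Summary: Video diffusion techniques have advanced significantly in recent years but struggle to generate realistic car-crash imagery because accident events are scarce in most driving datasets. Ctrl-Crash is a controllable car crash video generation model conditioned on signals such as bounding boxes, crash types, and an initial image frame; it supports counterfactual scenario generation, where small input changes can lead to dramatically different crash outcomes. For fine-grained control at inference time it uses classifier-free guidance with an independently tunable scale for each conditioning signal, and it achieves state-of-the-art results on quantitative video quality metrics (e.g., FVD and JEDi) and in human evaluation of physical realism and video quality compared to prior diffusion-based methods.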
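Illustrative sketch (not from the paper): one common way to apply classifier-free guidance with several conditioning signals is to add, for each signal, a scaled difference between its conditional noise prediction and the unconditional one. The abstract does not state Ctrl-Crash's exact composition rule, so the formula, the function name `multi_cfg`, and the toy vectors are assumptions made for illustration only.

```python
def multi_cfg(eps_uncond, eps_conds, scales):
    """Combine noise predictions with one guidance scale per conditioning signal.

    eps_uncond: unconditional noise prediction (list of floats).
    eps_conds:  one conditional prediction per signal (hypothetically: boxes, crash type, ...).
    scales:     one independently tunable guidance scale per signal.
    Assumed rule: eps = eps_uncond + sum_i scales[i] * (eps_conds[i] - eps_uncond).
    """
    guided = list(eps_uncond)
    for eps_c, scale in zip(eps_conds, scales):
        for i, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            guided[i] += scale * (c - u)
    return guided


# Toy usage: two conditioning signals with different scales.
guided = multi_cfg([0.0, 0.0], [[1.0, 0.0], [0.0, 2.0]], [2.0, 0.5])
```

Setting a scale to zero removes the influence of that signal, which is what makes the scales independently tunable in this formulation.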

Title: Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity

Authors: Dang Nguyen, Ali Payani, Baharan Mirzasoleiman
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.00245
Pdf URL: https://arxiv.org/pdf/2506.00245
Copy Paste: [[2506.00245]] Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity(https://arxiv.org/abs/2506.00245)
Keywords: generation
Abstract: Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured using entropy. Semantic entropy (SE) enhances traditional entropy estimation by quantifying uncertainty at the semantic cluster level. However, as modern LLMs generate longer one-sentence responses, SE becomes less effective because it overlooks two crucial factors: intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). To address these limitations, we propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities. Additionally, we provide theoretical results showing that our method generalizes semantic entropy. Extensive empirical results demonstrate its effectiveness compared to semantic entropy across two recent LLMs (Phi3 and Llama3) and three common text generation tasks: question answering, text summarization, and machine translation. Our code is available at this https URL.
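Summary: Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured with entropy; semantic entropy (SE) improves on traditional entropy estimation by quantifying uncertainty at the level of semantic clusters. As modern LLMs generate longer one-sentence responses, SE becomes less effective because it ignores intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). The paper proposes a simple black-box uncertainty quantification method inspired by nearest-neighbor estimates of entropy, which extends to white-box settings by incorporating token probabilities, provably generalizes semantic entropy, and is evaluated against SE on two recent LLMs (Phi3 and Llama3) and three text generation tasks: question answering, text summarization, and machine translation.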
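Illustrative sketch (not from the paper): the semantic entropy baseline mentioned in the abstract can be approximated as the entropy of the empirical distribution over semantic clusters of sampled responses. The sketch assumes the cluster assignments are already available (for example from an entailment model, which is not shown) and does not implement the paper's nearest-neighbor method; `semantic_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids: one cluster label per sampled response; responses that share
    a label are assumed to have the same meaning.
    """
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy usage: four sampled answers that fall into two meaning clusters.
uncertainty = semantic_entropy(["paris", "paris", "lyon", "paris"])
```

Because only cluster counts enter the formula, two clusters that are nearly identical and two that are far apart give the same score, which is the limitation the abstract points to.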

Title: Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Authors: Oliver Mortensen, Mohammad Sadegh Talebi
Subjects: cs.LG, cs.AI, math.OC, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00286
Pdf URL: https://arxiv.org/pdf/2506.00286
Copy Paste: [[2506.00286]] Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model(https://arxiv.org/abs/2506.00286)
Keywords: generative
Abstract: In this paper we analyze the sample complexities of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter $\beta\neq 0$ and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI), which leads to $(\epsilon,\delta)$-PAC-bounds on $\|Q^*-Q_k\|$ and $\|V^*-V^{\pi_k}\|$, where $Q_k$ is the output of MB-RS-QVI after $k$ iterations and $\pi_k$ is the greedy policy with respect to $Q_k$. Both PAC-bounds have exponential dependence on the effective horizon $\frac{1}{1-\gamma}$, and the strength of this dependence grows with the learner's risk-sensitivity $|\beta|$. We also provide two lower bounds which show that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are both tight in $\epsilon$ and $\delta$, that the PAC-bound on $Q$-learning is tight in the number of actions $A$, and that the PAC-bound on policy-learning is nearly tight in $A$.
Summary: The paper analyzes the sample complexity of learning the optimal state-action value function $Q^*$ and an optimal policy $\pi^*$ in a discounted Markov decision process (MDP) where the agent has recursive entropic risk preferences with risk parameter $\beta\neq 0$ and a generative model of the MDP is available. A simple model-based approach, model-based risk-sensitive $Q$-value-iteration (MB-RS-QVI), yields $(\epsilon,\delta)$-PAC bounds that depend exponentially on the effective horizon $\frac{1}{1-\gamma}$, with the dependence growing in the learner's risk sensitivity $|\beta|$. Two lower bounds show that exponential dependence on $|\beta|\frac{1}{1-\gamma}$ is unavoidable, that the PAC bounds are tight in $\epsilon$ and $\delta$, that the bound for $Q$-learning is tight in the number of actions $A$, and that the bound for policy learning is nearly tight in $A$.
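Illustrative sketch (not from the paper): a risk-sensitive value iteration can be written by replacing the expectation over next states with the entropic risk measure $\frac{1}{\beta}\log\mathbb{E}[e^{\beta X}]$. The sign convention, the use of a given transition table `P` (standing in for an empirical model built from generative-model samples), and the names `entropic_risk` and `rs_qvi` are assumptions; the paper's exact MB-RS-QVI definition may differ.

```python
import math


def entropic_risk(values, probs, beta):
    """Entropic risk (1/beta) * log E[exp(beta * X)] for a discrete X, beta != 0."""
    return math.log(sum(p * math.exp(beta * v) for v, p in zip(values, probs))) / beta


def rs_qvi(P, R, gamma, beta, iters):
    """Risk-sensitive Q-value iteration on a tabular model (assumed recursion).

    P[s][a] is a list of next-state probabilities, R[s][a] a reward.
    Q(s, a) <- R(s, a) + gamma * entropic_risk(V, P[s][a], beta), V(s) = max_a Q(s, a).
    """
    n_states = len(P)
    Q = [[0.0] * len(P[s]) for s in range(n_states)]
    for _ in range(iters):
        V = [max(q) for q in Q]
        Q = [
            [R[s][a] + gamma * entropic_risk(V, P[s][a], beta) for a in range(len(P[s]))]
            for s in range(n_states)
        ]
    return Q


# Toy usage: one state, one action, reward 1, so Q approaches 1 / (1 - gamma).
Q = rs_qvi(P=[[[1.0]]], R=[[1.0]], gamma=0.5, beta=1.0, iters=60)
```

Under this convention a negative $\beta$ gives a value below the mean for uncertain outcomes (risk aversion over rewards), and the exponential inside the risk measure hints at why the bounds scale exponentially with $|\beta|\frac{1}{1-\gamma}$. The sketch does not guard against overflow for large $|\beta|$.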

Title: Improving Protein Sequence Design through Designability Preference Optimization

Authors: Fanglei Xue, Andrew Kubaney, Zhichun Guo, Joseph K. Min, Ge Liu, Yi Yang, David Baker
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2506.00297
Pdf URL: https://arxiv.org/pdf/2506.00297
Copy Paste: [[2506.00297]] Improving Protein Sequence Design through Designability Preference Optimization(https://arxiv.org/abs/2506.00297)
Keywords: generation
Abstract: Protein sequence design methods have demonstrated strong performance in sequence generation for de novo protein design. However, because their training objective is sequence recovery, they do not guarantee designability, that is, the likelihood that a designed sequence folds into the desired structure. To bridge this gap, we redefine the training objective by steering sequence generation toward high designability. To do this, we integrate Direct Preference Optimization (DPO), using AlphaFold pLDDT scores as the preference signal, which significantly improves the in silico design success rate. To further refine sequence generation at a finer, residue-level granularity, we introduce Residue-level Designability Preference Optimization (ResiDPO), which applies residue-level structural rewards and decouples optimization across residues. This enables direct improvement in designability while preserving regions that already perform well. Using a curated dataset with residue-level annotations, we fine-tune LigandMPNN with ResiDPO to obtain EnhancedMPNN, which achieves a nearly 3-fold increase in in silico design success rate (from 6.56% to 17.57%) on a challenging enzyme design benchmark.
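Summary: Protein sequence design methods perform strongly at sequence generation for de novo protein design, but a sequence-recovery training objective does not guarantee designability, the likelihood that a designed sequence folds into the desired structure. The paper redefines the objective by applying Direct Preference Optimization (DPO) with AlphaFold pLDDT scores as the preference signal, and introduces Residue-level Designability Preference Optimization (ResiDPO), which applies residue-level structural rewards and decouples optimization across residues so that regions that already perform well are preserved. Fine-tuning LigandMPNN with ResiDPO on a curated dataset with residue-level annotations gives EnhancedMPNN, which raises the in silico design success rate from 6.56% to 17.57% on a challenging enzyme design benchmark.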
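Illustrative sketch (not from the paper): the standard sequence-level DPO loss compares how much the fine-tuned model and a frozen reference model prefer a chosen sequence over a rejected one. In the setting described above, the chosen sequence would be the one with the higher AlphaFold pLDDT score; the residue-level ResiDPO variant is not shown, and the function name and toy numbers are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. The loss is -log(sigmoid(margin)), written here as
    log(1 + exp(-margin)).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))


# Toy usage: the policy already favors the chosen sequence more than the reference does.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0, beta=0.1)
```

When the policy equals the reference the margin is zero and the loss is log 2; raising the chosen sequence's log-probability relative to the reference lowers the loss.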

Title: Inference-Time Alignment of Diffusion Models with Evolutionary Algorithms

Authors: Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, Yung-Hsiang Lu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00299
Pdf URL: https://arxiv.org/pdf/2506.00299
Copy Paste: [[2506.00299]] Inference-Time Alignment of Diffusion Models with Evolutionary Algorithms(https://arxiv.org/abs/2506.00299)
Keywords: generative
Abstract: Diffusion models are state-of-the-art generative models in various domains, yet their samples often fail to satisfy downstream objectives such as safety constraints or domain-specific validity. Existing techniques for alignment require gradients, internal model access, or large computational budgets. We introduce an inference-time alignment framework based on evolutionary algorithms. We treat diffusion models as black boxes and search their latent space to maximize alignment objectives. Our method enables efficient inference-time alignment for both differentiable and non-differentiable alignment objectives across a range of diffusion models. On the DrawBench and Open Image Preferences benchmarks, our EA methods outperform state-of-the-art gradient-based and gradient-free inference-time methods. In terms of memory consumption, we require 55% to 76% less GPU memory than gradient-based methods. In terms of running time, we are 72% to 80% faster than gradient-based methods. We achieve higher alignment scores over 50 optimization steps on Open Image Preferences than gradient-based and gradient-free methods.
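Summary: Diffusion models are state-of-the-art generative models in many domains, yet their samples often fail to satisfy downstream objectives such as safety constraints or domain-specific validity, and existing alignment techniques require gradients, internal model access, or large computational budgets. The paper introduces an inference-time alignment framework based on evolutionary algorithms that treats diffusion models as black boxes and searches their latent space to maximize alignment objectives, whether differentiable or non-differentiable. On DrawBench and Open Image Preferences the EA methods outperform state-of-the-art gradient-based and gradient-free inference-time methods, use 55% to 76% less GPU memory and run 72% to 80% faster than gradient-based methods, and reach higher alignment scores over 50 optimization steps on Open Image Preferences.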
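Illustrative sketch (not from the paper): a gradient-free search over latent vectors can be written as a small evolutionary loop that only calls the generator and the reward as black boxes. The selection scheme (keep the top quarter, mutate with Gaussian noise), the hyperparameters, and the toy generator and reward are assumptions; the abstract does not say which evolutionary algorithms the authors use.

```python
import random


def evolve_latents(generate, reward, dim, pop_size=16, steps=50, sigma=0.3, seed=0):
    """Search latent space with elitist selection and Gaussian mutation.

    generate: black-box map from a latent vector to a sample.
    reward:   black-box alignment score of a sample (may be non-differentiable).
    """
    rng = random.Random(seed)
    pop = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pop_size)]

    def score(z):
        return reward(generate(z))

    for _ in range(steps):
        ranked = sorted(pop, key=score, reverse=True)
        elites = ranked[: pop_size // 4]  # elites survive unchanged
        children = [
            [x + sigma * rng.gauss(0, 1) for x in rng.choice(elites)]
            for _ in range(pop_size - len(elites))
        ]
        pop = elites + children
    return max(pop, key=score)


# Toy usage: identity "generator" and a reward that prefers samples near the origin.
best = evolve_latents(lambda z: z, lambda x: -sum(v * v for v in x), dim=4)
```

Because the elites are carried over unchanged, the best score in the population never decreases from one step to the next, and no gradient or internal model state is needed.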

Title: Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework

Authors: Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban
Subjects: cs.LG, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2506.00302
Pdf URL: https://arxiv.org/pdf/2506.00302
Copy Paste: [[2506.00302]] Beyond Atomic Geometry Representations in Materials Science: A Human-in-the-Loop Multimodal Framework(https://arxiv.org/abs/2506.00302)
Keywords: generation
Abstract: Most materials science datasets are limited to atomic geometries (e.g., XYZ files), restricting their utility for multimodal learning and comprehensive data-centric analysis. These constraints have historically impeded the adoption of advanced machine learning techniques in the field. This work introduces MultiCrystalSpectrumSet (MCS-Set), a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. MCS-Set enables two key tasks: (1) multimodal property and summary prediction, and (2) constrained crystal generation with partial cluster supervision. Leveraging a human-in-the-loop pipeline, MCS-Set combines domain expertise with standardized descriptors for high-quality annotation. Evaluations using state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization. MCS-Set offers a foundation for benchmarking multimodal models, advancing annotation practices, and promoting accessible, versatile materials science datasets. The dataset and implementations are available at this https URL.
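Summary: Most materials science datasets are limited to atomic geometries (e.g., XYZ files), which restricts their use for multimodal learning and comprehensive data-centric analysis and has historically impeded the adoption of advanced machine learning techniques in the field. MultiCrystalSpectrumSet (MCS-Set) is a curated framework that expands materials datasets by integrating atomic structures with 2D projections and structured textual annotations, including lattice parameters and coordination metrics. It enables (1) multimodal property and summary prediction and (2) constrained crystal generation with partial cluster supervision, and uses a human-in-the-loop pipeline that combines domain expertise with standardized descriptors for high-quality annotation. Evaluations with state-of-the-art language and vision-language models reveal substantial modality-specific performance gaps and highlight the importance of annotation quality for generalization.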

Title: Latent Guidance in Diffusion Models for Perceptual Evaluations

Authors: Shreshth Saini, Ru-Ling Liao, Yan Ye, Alan C. Bovik
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00327
Pdf URL: https://arxiv.org/pdf/2506.00327
Copy Paste: [[2506.00327]] Latent Guidance in Diffusion Models for Perceptual Evaluations(https://arxiv.org/abs/2506.00327)
Keywords: quality assessment
Abstract: Despite recent advancements in latent diffusion models that generate high-dimensional image data and perform various downstream tasks, there has been little exploration into perceptual consistency within these models on the task of No-Reference Image Quality Assessment (NR-IQA). In this paper, we hypothesize that latent diffusion models implicitly exhibit perceptually consistent local regions within the data manifold. We leverage this insight to guide on-manifold sampling using perceptual features and input measurements. Specifically, we propose Perceptual Manifold Guidance (PMG), an algorithm that utilizes pretrained latent diffusion models and perceptual quality features to obtain perceptually consistent multi-scale and multi-timestep feature maps from the denoising U-Net. We empirically demonstrate that these hyperfeatures exhibit high correlation with human perception in IQA tasks. Our method can be applied to any existing pretrained latent diffusion model and is straightforward to integrate. To the best of our knowledge, this paper is the first work on guiding diffusion models with perceptual features for NR-IQA. Extensive experiments on IQA datasets show that our method, LGDM, achieves state-of-the-art performance, underscoring the superior generalization capabilities of diffusion models for NR-IQA tasks.
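Summary: Despite recent advances in latent diffusion models, perceptual consistency within these models has been little explored for No-Reference Image Quality Assessment (NR-IQA). The paper hypothesizes that latent diffusion models implicitly exhibit perceptually consistent local regions within the data manifold, and uses this to guide on-manifold sampling with perceptual features and input measurements. It proposes Perceptual Manifold Guidance (PMG), an algorithm that uses pretrained latent diffusion models and perceptual quality features to obtain perceptually consistent multi-scale, multi-timestep feature maps from the denoising U-Net; these hyperfeatures correlate highly with human perception in IQA tasks. The method, LGDM, applies to any existing pretrained latent diffusion model and achieves state-of-the-art performance on IQA datasets.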

Title: Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

Authors: Muhammad Adnan, Nithesh Kurella, Akhil Arunkumar, Prashant J. Nair
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.00329
Pdf URL: https://arxiv.org/pdf/2506.00329
Copy Paste: [[2506.00329]] Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation(https://arxiv.org/abs/2506.00329)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves up to 1.63x end-to-end speedup, while maintaining video quality. The source code of Foresight is available at this https URL.
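Summary: Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image generation, text-to-video generation, and editing, but their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching reuses features across fixed steps but does not adapt to generation dynamics, which leads to suboptimal trade-offs between speed and quality. Foresight is an adaptive layer-reuse technique that dynamically identifies and reuses DiT block outputs for all layers across denoising steps, adapting to generation parameters such as resolution and denoising schedule. Applied to OpenSora, Latte, and CogVideoX, it achieves up to 1.63x end-to-end speedup while maintaining video quality.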
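Illustrative sketch (not from the paper): adaptive reuse can be pictured as a cache that returns a block's previous output whenever the block's input has changed little since the last real computation. The reuse criterion (maximum absolute input change below a threshold), the function name `run_with_reuse`, and the toy block are assumptions; the abstract does not describe how Foresight decides when to reuse.

```python
def run_with_reuse(block, inputs, threshold):
    """Apply `block` over a sequence of step inputs, reusing cached outputs.

    A cached output is reused when the current input differs from the input
    of the last real computation by less than `threshold` in every coordinate.
    Returns the per-step outputs and the number of real block evaluations.
    """
    cached_in, cached_out = None, None
    outputs, recomputed = [], 0
    for x in inputs:
        close = cached_in is not None and max(
            abs(a - b) for a, b in zip(x, cached_in)
        ) < threshold
        if not close:
            cached_in, cached_out = x, block(x)  # real computation
            recomputed += 1
        outputs.append(cached_out)
    return outputs, recomputed


# Toy usage: a "block" that doubles its input, evaluated over three denoising steps.
outs, n_evals = run_with_reuse(lambda x: [2 * v for v in x], [[0.0], [0.01], [1.0]], 0.1)
```

Comparing against the input of the last real computation, rather than the previous step, keeps small per-step changes from accumulating unnoticed. A static cache would instead reuse on a fixed schedule regardless of how much the input moved.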

Title: iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection

Authors: Huahui Yi, Wei Xu, Ziyuan Qin, Xi Chen, Xiaohu Wu, Kang Li, Qicheng Lao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00406
Pdf URL: https://arxiv.org/pdf/2506.00406
Copy Paste: [[2506.00406]] iDPA: Instance Decoupled Prompt Attention for Incremental Medical Object Detection(https://arxiv.org/abs/2506.00406)
Keywords: generation
Abstract: Existing prompt-based approaches have demonstrated impressive performance in continual learning, leveraging pre-trained large-scale models for classification tasks; however, the tight coupling between foreground-background information and the coupled attention between prompts and image-text tokens present significant challenges in incremental medical object detection tasks, due to the conceptual gap between medical and natural domains. To overcome these challenges, we introduce the iDPA framework, which comprises two main components: 1) Instance-level Prompt Generation (IPG), which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and 2) Decoupled Prompt Attention (DPA), which decouples the original prompt attention, enabling a more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. We collect 13 clinical, cross-modal, multi-organ, and multi-category datasets, referred to as \dataset, and experiments demonstrate that iDPA outperforms existing SOTA methods, with FAP improvements of 5.44%, 4.83%, 12.88%, and 4.59% in full data, 1-shot, 10-shot, and 50-shot settings, respectively.
Summary: Prompt-based approaches that leverage pre-trained large-scale models perform impressively in continual learning for classification, but the tight coupling of foreground and background information and the coupled attention between prompts and image-text tokens pose significant challenges for incremental medical object detection, given the conceptual gap between medical and natural domains. The iDPA framework has two main components: (1) Instance-level Prompt Generation, which decouples fine-grained instance-level knowledge from images and generates prompts that focus on dense predictions, and (2) Decoupled Prompt Attention, which decouples the original prompt attention for more direct and efficient transfer of prompt information while reducing memory usage and mitigating catastrophic forgetting. On a collection of 13 clinical, cross-modal, multi-organ, and multi-category datasets, it outperforms existing SOTA methods, with FAP improvements of 5.44%, 4.83%, 12.88%, and 4.59% in the full-data, 1-shot, 10-shot, and 50-shot settings, respectively.

Title: Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free

Authors: Luigi Sigillo, Shengfeng He, Danilo Comminiello
Subjects: cs.CV, cs.LG, eess.IV
Abstract URL: https://arxiv.org/abs/2506.00433
Pdf URL: https://arxiv.org/pdf/2506.00433
Copy Paste: [[2506.00433]] Latent Wavelet Diffusion: Enabling 4K Image Synthesis for Free(https://arxiv.org/abs/2506.00433)
Keywords: generation, generative
Abstract: High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present Latent Wavelet Diffusion (LWD), a lightweight framework that enables any latent diffusion model to scale to ultra-high-resolution image generation (2K to 4K) for free. LWD introduces three key components: (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations; (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space; and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and incurs no additional computational overhead. Despite its simplicity, it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models. These results highlight the effectiveness of frequency-aware, signal-driven supervision as a principled and efficient approach for high-resolution generative modeling.
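Summary: High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency against the preservation of fine-grained visual detail. Latent Wavelet Diffusion (LWD) is a lightweight framework that lets any latent diffusion model scale to ultra-high-resolution image generation (2K to 4K) for free. It introduces (1) a scale-consistent variational autoencoder objective that enhances the spectral fidelity of latent representations, (2) wavelet energy maps that identify and localize detail-rich spatial regions within the latent space, and (3) a time-dependent masking strategy that focuses denoising supervision on high-frequency components during training. LWD requires no architectural modifications and no additional computational overhead, and it consistently improves perceptual quality and reduces FID in ultra-high-resolution image synthesis, outperforming strong baseline models.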

Title: RLAE: Reinforcement Learning-Assisted Ensemble for LLMs

Authors: Yuqian Fu, Yuanheng Zhu, Jiajun Chai, Guojun Yin, Wei Lin, Qichao Zhang, Dongbin Zhao
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00439
Pdf URL: https://arxiv.org/pdf/2506.00439
Copy Paste: [[2506.00439]] RLAE: Reinforcement Learning-Assisted Ensemble for LLMs(https://arxiv.org/abs/2506.00439)
Keywords: generation
Abstract: Ensembling large language models (LLMs) can effectively combine diverse strengths of different models, offering a promising approach to enhance performance across various tasks. However, existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. In this work, we propose Reinforcement Learning-Assisted Ensemble for LLMs (RLAE), a novel framework that reformulates LLM ensemble through the lens of a Markov Decision Process (MDP). Our approach introduces an RL agent that dynamically adjusts ensemble weights by considering both input context and intermediate generation states, with the agent being trained using rewards that directly correspond to the quality of final outputs. We implement RLAE using both single-agent and multi-agent reinforcement learning algorithms ($\text{RLAE}_\text{PPO}$ and $\text{RLAE}_\text{MAPPO}$), demonstrating substantial improvements over conventional ensemble methods. Extensive evaluations on a diverse set of tasks show that RLAE outperforms existing approaches by up to 3.3% accuracy points, offering a more effective framework for LLM ensembling. Furthermore, our method exhibits superior generalization capabilities across different tasks without the need for retraining, while simultaneously achieving lower time latency.
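Summary: Ensembling large language models (LLMs) can combine the diverse strengths of different models, but existing methods typically rely on fixed weighting strategies that fail to adapt to the dynamic, context-dependent characteristics of LLM capabilities. Reinforcement Learning-Assisted Ensemble for LLMs (RLAE) reformulates LLM ensembling as a Markov Decision Process (MDP): an RL agent dynamically adjusts ensemble weights based on the input context and intermediate generation states, and is trained with rewards that correspond directly to the quality of the final outputs. Implemented with single-agent and multi-agent RL algorithms ($\text{RLAE}_\text{PPO}$ and $\text{RLAE}_\text{MAPPO}$), RLAE outperforms existing approaches by up to 3.3% accuracy points across a diverse set of tasks, generalizes across tasks without retraining, and achieves lower time latency.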
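Illustrative sketch (not from the paper): one simple way to apply ensemble weights is to mix the models' next-token distributions with a normalized weighted sum. Here the weights are plain inputs; in RLAE they would come from the RL agent at each generation step. Mixing at the token-distribution level, the function name, and the toy distributions are assumptions.

```python
def ensemble_next_token(dists, weights):
    """Weighted mixture of next-token distributions.

    dists:   one dict per model mapping token -> probability.
    weights: one non-negative weight per model (normalized inside).
    """
    total = sum(weights)
    vocab = set().union(*dists)  # tokens proposed by any model
    return {
        tok: sum(w * d.get(tok, 0.0) for d, w in zip(dists, weights)) / total
        for tok in vocab
    }


# Toy usage: the first model is trusted three times as much as the second.
mixed = ensemble_next_token(
    [{"yes": 0.8, "no": 0.2}, {"yes": 0.2, "no": 0.8}], weights=[3.0, 1.0]
)
```

A fixed-weight ensemble would call this with the same `weights` at every step; a context-dependent scheme changes the weights as generation proceeds.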

Title: Comparing Traditional and Reinforcement-Learning Methods for Energy Storage Control

Authors: Elinor Ginzburg, Itay Segev, Yoash Levron, Sarah Keren
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00459
Pdf URL: https://arxiv.org/pdf/2506.00459
Copy Paste: [[2506.00459]] Comparing Traditional and Reinforcement-Learning Methods for Energy Storage Control(https://arxiv.org/abs/2506.00459)
Keywords: generative
Abstract: We aim to better understand the tradeoffs between traditional and reinforcement learning (RL) approaches for energy storage management. More specifically, we wish to better understand the performance loss incurred when using a generative RL policy instead of using a traditional approach to find optimal control policies for specific instances. Our comparison is based on a simplified micro-grid model, which includes a load component, a photovoltaic source, and a storage device. Based on this model, we examine three use cases of increasing complexity: ideal storage with convex cost functions, lossy storage devices, and lossy storage devices with convex transmission losses. With the aim of promoting the principled use of RL-based methods in this challenging and important domain, we provide a detailed formulation of each use case and a detailed description of the optimization challenges. We then compare the performance of traditional and RL methods, discuss settings in which it is beneficial to use each method, and suggest avenues for future investigation.
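Summary: The paper aims to better understand the tradeoffs between traditional and reinforcement learning (RL) approaches to energy storage management, and in particular the performance loss incurred when using a generative RL policy instead of a traditional approach that finds optimal control policies for specific instances. The comparison is based on a simplified micro-grid model with a load component, a photovoltaic source, and a storage device, and covers three use cases of increasing complexity: ideal storage with convex cost functions, lossy storage devices, and lossy storage devices with convex transmission losses. The authors give a detailed formulation of each use case and its optimization challenges, compare the performance of traditional and RL methods, discuss the settings in which each method is beneficial, and suggest avenues for future investigation.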

Title: Imputation of Missing Data in Smooth Pursuit Eye Movements Using a Self-Attention-based Deep Learning Approach

Authors: Mehdi Bejani, Guillermo Perez-de-Arenaza-Pozo, Julián D. Arias-Londoño, Juan I. Godino-LLorente
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00545
Pdf URL: https://arxiv.org/pdf/2506.00545
Copy Paste: [[2506.00545]] Imputation of Missing Data in Smooth Pursuit Eye Movements Using a Self-Attention-based Deep Learning Approach(https://arxiv.org/abs/2506.00545)
Keywords: generative
Abstract: Missing data is a relevant issue in time series, especially in biomedical sequences such as those corresponding to smooth pursuit eye movements, which often contain gaps due to eye blinks and track losses, complicating the analysis and extraction of meaningful biomarkers. In this paper, a novel imputation framework is proposed using Self-Attention-based Imputation networks for time series, which leverages the power of deep learning and self-attention mechanisms to impute missing data. We further refine the imputed data using a custom-made autoencoder, tailored to represent smooth pursuit eye movement sequences. The proposed approach was implemented using 5,504 sequences from 172 Parkinsonian patients and healthy controls. Results show a significant improvement in the accuracy of reconstructed eye movement sequences with respect to other state-of-the-art techniques, substantially reducing the values for common time domain error metrics such as the mean absolute error, mean relative error, and root mean square error, while also preserving the signal's frequency domain characteristics. Moreover, it demonstrates robustness when large intervals of data are missing. This method offers an alternative solution for robustly handling missing data in time series, enhancing the reliability of smooth pursuit analysis for the screening and monitoring of neurodegenerative disorders.
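Summary: Missing data is a relevant issue in time series, especially in biomedical sequences such as smooth pursuit eye movements, which often contain gaps due to eye blinks and track losses that complicate the analysis and extraction of meaningful biomarkers. The paper proposes an imputation framework that uses Self-Attention-based Imputation networks for time series and then refines the imputed data with a custom-made autoencoder tailored to smooth pursuit eye movement sequences. Using 5,504 sequences from 172 Parkinsonian patients and healthy controls, the approach significantly improves reconstruction accuracy over other state-of-the-art techniques, substantially reducing mean absolute error, mean relative error, and root mean square error while preserving the signal's frequency domain characteristics, and it remains robust when large intervals of data are missing.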

Title: ORAN-GUIDE: RAG-Driven Prompt Learning for LLM-Augmented Reinforcement Learning in O-RAN Network Slicing

Authors: Fatemeh Lotfi, Hossein Rajoli, Fatemeh Afghah
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00576
Pdf URL: https://arxiv.org/pdf/2506.00576
Copy Paste: [[2506.00576]] ORAN-GUIDE: RAG-Driven Prompt Learning for LLM-Augmented Reinforcement Learning in O-RAN Network Slicing(https://arxiv.org/abs/2506.00576)
Keywords: generation
Abstract: Advanced wireless networks must support highly dynamic and heterogeneous service demands. Open Radio Access Network (O-RAN) architecture enables this flexibility by adopting modular, disaggregated components, such as the RAN Intelligent Controller (RIC), Centralized Unit (CU), and Distributed Unit (DU), that can support intelligent control via machine learning (ML). While deep reinforcement learning (DRL) is a powerful tool for managing dynamic resource allocation and slicing, it often struggles to process raw, unstructured input like RF features, QoS metrics, and traffic trends. These limitations hinder policy generalization and decision efficiency in partially observable and evolving environments. To address this, we propose ORAN-GUIDE, a dual-LLM framework that enhances multi-agent RL (MARL) with task-relevant, semantically enriched state representations. The architecture employs a domain-specific language model, ORANSight, pretrained on O-RAN control and configuration data, to generate structured, context-aware prompts. These prompts are fused with learnable tokens and passed to a frozen GPT-based encoder that outputs high-level semantic representations for DRL agents. This design adopts a retrieval-augmented generation (RAG)-style pipeline tailored for technical decision-making in wireless systems. Experimental results show that ORAN-GUIDE improves sample efficiency, policy convergence, and performance generalization over standard MARL and single-LLM baselines.

Title: Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Authors: Danfeng li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00596
Pdf URL: https://arxiv.org/pdf/2506.00596
Copy Paste: [[2506.00596]] Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control(https://arxiv.org/abs/2506.00596)
Keywords: generation
Abstract: Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
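The idea behind the Attribute Isolation Attention Mask, as the abstract describes it, can be sketched as a boolean mask over image tokens. This is a rough sketch and not the Seg2Any implementation. It assumes each image token carries a single entity id taken from the flattened segmentation mask; the helper name `isolation_mask` and the toy layout are hypothetical, and the treatment of background or overlapping regions is not specified in the abstract.

```python
def isolation_mask(entity_ids):
    """allowed[i][j] is True when image token i may attend to image token j.

    Assumption: tokens sharing an entity id form one group and attend
    only within that group during image self-attention.
    """
    n = len(entity_ids)
    return [[entity_ids[i] == entity_ids[j] for j in range(n)] for i in range(n)]

# Toy 2x3 segmentation map flattened row by row: two entities, ids 1 and 2.
ids = [1, 1, 2,
       1, 2, 2]
mask = isolation_mask(ids)
```

In an attention layer, positions where the mask is False would typically have their scores set to negative infinity before the softmax, so that attributes of one entity cannot leak into another.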

Title: SatDreamer360: Geometry Consistent Street-View Video Generation from Satellite Imagery

Authors: Xianghui Ze, Beiyi Zhu, Zhenbo Song, Jianfeng Lu, Yujiao Shi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00600
Pdf URL: https://arxiv.org/pdf/2506.00600
Copy Paste: [[2506.00600]] SatDreamer360: Geometry Consistent Street-View Video Generation from Satellite Imagery(https://arxiv.org/abs/2506.00600)
Keywords: generation
Abstract: Generating continuous ground-level video from satellite imagery is a challenging task with significant potential for applications in simulation, autonomous navigation, and digital twin cities. Existing approaches primarily focus on synthesizing individual ground-view images, often relying on auxiliary inputs like height maps or handcrafted projections, and fall short in producing temporally consistent sequences. In this paper, we propose SatDreamer360, a novel framework that generates geometrically and temporally consistent ground-view video from a single satellite image and a predefined trajectory. To bridge the large viewpoint gap, we introduce a compact tri-plane representation that encodes scene geometry directly from the satellite image. A ray-based pixel attention mechanism retrieves view-dependent features from the tri-plane, enabling accurate cross-view correspondence without requiring additional geometric priors. To ensure multi-frame consistency, we propose an epipolar-constrained temporal attention module that aligns features across frames using the known relative poses along the trajectory. To support evaluation, we introduce VIGOR++, a large-scale dataset for cross-view video generation, with dense trajectory annotations and high-quality ground-view sequences. Extensive experiments demonstrate that SatDreamer360 achieves superior performance in fidelity, coherence, and geometric alignment across diverse urban scenes.

Title: ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education

Authors: Ruiming Min, Minghao Liu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00605
Pdf URL: https://arxiv.org/pdf/2506.00605
Copy Paste: [[2506.00605]] ABCDEFGH: An Adaptation-Based Convolutional Neural Network-CycleGAN Disease-Courses Evolution Framework Using Generative Models in Health Education(https://arxiv.org/abs/2506.00605)
Keywords: generative
Abstract: With the advancement of modern medicine and the development of technologies such as MRI, CT, and cellular analysis, it has become increasingly critical for clinicians to accurately interpret various diagnostic images. However, modern medical education often faces challenges due to limited access to high-quality teaching materials, stemming from privacy concerns and a shortage of educational resources (Balogh et al., 2015). In this context, image data generated by machine learning models, particularly generative models, presents a promising solution. These models can create diverse and comparable imaging datasets without compromising patient privacy, thereby supporting modern medical education. In this study, we explore the use of convolutional neural networks (CNNs) and CycleGAN (Zhu et al., 2017) for generating synthetic medical images. The source code is available at this https URL.

Title: Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Authors: Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00633
Pdf URL: https://arxiv.org/pdf/2506.00633
Copy Paste: [[2506.00633]] Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining(https://arxiv.org/abs/2506.00633)
Keywords: super-resolution, generation, generative
Abstract: Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.

Title: Video Signature: In-generation Watermarking for Latent Video Diffusion Models

Authors: Yu Huang, Junhao Chen, Qi Zheng, Hanqian Li, Shuliang Liu, Xuming Hu
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2506.00652
Pdf URL: https://arxiv.org/pdf/2506.00652
Copy Paste: [[2506.00652]] Video Signature: In-generation Watermarking for Latent Video Diffusion Models(https://arxiv.org/abs/2506.00652)
Keywords: generation
Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, but existing methods for video generation mainly follow a post-generation paradigm, which introduces additional computational overhead and often fails to effectively balance the trade-off between video quality and watermark extraction. To address these issues, we propose Video Signature (VIDSIG), an in-generation watermarking method for latent video diffusion models, which enables implicit and adaptive watermark integration during generation. Specifically, we achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VIDSIG achieves the best overall performance in watermark extraction, visual quality, and generation efficiency. It also demonstrates strong robustness against both spatial and temporal tampering, highlighting its practicality in real-world scenarios.

Title: Differential Privacy for Deep Learning in Medicine

Authors: Marziyeh Mohammadi, Mohsen Vejdanihemmat, Mahshad Lotfinia, Mirabela Rusu, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00660
Pdf URL: https://arxiv.org/pdf/2506.00660
Copy Paste: [[2506.00660]] Differential Privacy for Deep Learning in Medicine(https://arxiv.org/abs/2506.00660)
Keywords: generative
Abstract: Differential privacy (DP) is a key technique for protecting sensitive patient data in medical deep learning (DL). As clinical models grow more data-dependent, balancing privacy with utility and fairness has become a critical challenge. This scoping review synthesizes recent developments in applying DP to medical DL, with a particular focus on DP-SGD and alternative mechanisms across centralized and federated settings. Using a structured search strategy, we identified 74 studies published up to March 2025. Our analysis spans diverse data modalities, training setups, and downstream tasks, and highlights the tradeoffs between privacy guarantees, model accuracy, and subgroup fairness. We find that while DP, especially at strong privacy budgets, can preserve performance in well-structured imaging tasks, severe degradation often occurs under strict privacy, particularly in underrepresented or complex modalities. Furthermore, privacy-induced performance gaps disproportionately affect demographic subgroups, with fairness impacts varying by data type and task. A small subset of studies explicitly addresses these tradeoffs through subgroup analysis or fairness metrics, but most omit them entirely. Beyond DP-SGD, emerging approaches leverage alternative mechanisms, generative models, and hybrid federated designs, though reporting remains inconsistent. We conclude by outlining key gaps in fairness auditing, standardization, and evaluation protocols, offering guidance for future work toward equitable and clinically robust privacy-preserving DL systems in medicine.
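For readers unfamiliar with DP-SGD, the mechanism the review focuses on, a single update step can be sketched as below: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. This is a bare-bones sketch in plain Python and is not taken from the review. The function name `dp_sgd_step` and the toy gradients are hypothetical, and privacy accounting (tracking the resulting privacy budget) is omitted entirely.

```python
import math
import random

def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per-example gradients, sum, add noise, average."""
    dim = len(weights)
    summed = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale the gradient down only if its L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / (norm + 1e-12))
        for k in range(dim):
            summed[k] += g[k] * scale
    batch = len(per_example_grads)
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noisy_mean = [
        (summed[k] + rng.gauss(0.0, noise_multiplier * clip_norm)) / batch
        for k in range(dim)
    ]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
# noise_multiplier=0.0 disables the noise so the clipping effect is visible.
new_w = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                    noise_multiplier=0.0, rng=rng)
```

Clipping bounds how much any single patient record can move the model, and a larger `noise_multiplier` gives stronger privacy at the cost of noisier updates, which is the tradeoff the review examines.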

Title: SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning

Authors: Saad Hossain, Samanvay Vajpayee, Sirisha Rambhatla
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00676
Pdf URL: https://arxiv.org/pdf/2506.00676
Copy Paste: [[2506.00676]] SafeTuneBed: A Toolkit for Benchmarking LLM Safety Alignment in Fine-Tuning(https://arxiv.org/abs/2506.00676)
Keywords: generation
Abstract: As large language models (LLMs) become ubiquitous, parameter-efficient fine-tuning methods and safety-first defenses have proliferated rapidly. However, the number of approaches and their recent increase have resulted in diverse evaluations (varied datasets, metrics, and inconsistent threat settings), making it difficult to fairly compare safety, utility, and robustness across methods. To address this, we introduce SafeTuneBed, a benchmark and toolkit unifying fine-tuning and defense evaluation. SafeTuneBed (i) curates a diverse repository of multiple fine-tuning datasets spanning sentiment analysis, question-answering, multi-step reasoning, and open-ended instruction tasks, and allows for the generation of harmful-variant splits; (ii) enables integration of state-of-the-art defenses, including alignment-stage immunization, in-training safeguards, and post-tuning repair; and (iii) provides evaluators for safety (attack success rate, refusal consistency) and utility. Built on Python-first, dataclass-driven configs and plugins, SafeTuneBed requires minimal additional code to specify any fine-tuning regime, defense method, and metric suite, while ensuring end-to-end reproducibility. We showcase its value by benchmarking representative defenses across varied poisoning scenarios and tasks. By standardizing data, code, and metrics, SafeTuneBed is the first focused toolkit of its kind to accelerate rigorous and comparable research in safe LLM fine-tuning. Code is available at: this https URL

Title: Concept-Centric Token Interpretation for Vector-Quantized Generative Models

Authors: Tianze Yang, Yucheng Shi, Mengnan Du, Xuansheng Wu, Qiaoyu Tan, Jin Sun, Ninghao Liu
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.00698
Pdf URL: https://arxiv.org/pdf/2506.00698
Copy Paste: [[2506.00698]] Concept-Centric Token Interpretation for Vector-Quantized Generative Models(https://arxiv.org/abs/2506.00698)
Keywords: generation, generative
Abstract: Vector-Quantized Generative Models (VQGMs) have emerged as powerful tools for image generation. However, the key component of VQGMs -- the codebook of discrete tokens -- is still not well understood, e.g., which tokens are critical to generate an image of a certain concept? This paper introduces Concept-Oriented Token Explanation (CORTEX), a novel approach for interpreting VQGMs by identifying concept-specific token combinations. Our framework employs two methods: (1) a sample-level explanation method that analyzes token importance scores in individual images, and (2) a codebook-level explanation method that explores the entire codebook to find globally relevant tokens. Experimental results demonstrate CORTEX's efficacy in providing clear explanations of token usage in the generative process, outperforming baselines across multiple pretrained VQGMs. Besides enhancing VQGM transparency, CORTEX is useful in applications such as targeted image editing and shortcut feature detection. Our code is available at this https URL.

Title: RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models

Authors: Valter Hudovernik, Minkai Xu, Juntong Shi, Lovro Šubelj, Stefano Ermon, Erik Štrumbelj, Jure Leskovec
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00710
Pdf URL: https://arxiv.org/pdf/2506.00710
Copy Paste: [[2506.00710]] RelDiff: Relational Data Generative Modeling with Graph-Based Diffusion Models(https://arxiv.org/abs/2506.00710)
Keywords: generation, generative
Abstract: Real-world databases are predominantly relational, comprising multiple interlinked tables that contain complex structural and statistical dependencies. Learning generative models on relational data has shown great promise in generating synthetic data and imputing missing values. However, existing methods often struggle to capture this complexity, typically reducing relational data to conditionally generated flat tables and imposing limiting structural assumptions. To address these limitations, we introduce RelDiff, a novel diffusion generative model that synthesizes complete relational databases by explicitly modeling their foreign key graph structure. RelDiff combines a joint graph-conditioned diffusion process across all tables for attribute synthesis, and a $2K+$SBM graph generator based on the Stochastic Block Model for structure generation. The decomposition of graph structure and relational attributes ensures both high fidelity and referential integrity, both of which are crucial aspects of synthetic relational database generation. Experiments on 11 benchmark datasets demonstrate that RelDiff consistently outperforms prior methods in producing realistic and coherent synthetic relational databases. Code is available at this https URL.

Title: ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary

Authors: Zeqi Gu, Yin Cui, Zhaoshuo Li, Fangyin Wei, Yunhao Ge, Jinwei Gu, Ming-Yu Liu, Abe Davis, Yifan Ding
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00742
Pdf URL: https://arxiv.org/pdf/2506.00742
Copy Paste: [[2506.00742]] ArtiScene: Language-Driven Artistic 3D Scene Generation Through Image Intermediary(https://arxiv.org/abs/2506.00742)
Keywords: generation
Abstract: Designing 3D scenes is traditionally a challenging task that demands both artistic expertise and proficiency with complex software. Recent advances in text-to-3D generation have greatly simplified this process by letting users create scenes based on simple text descriptions. However, as these methods generally require extra training or in-context learning, their performance is often hindered by the limited availability of high-quality 3D data. In contrast, modern text-to-image models learned from web-scale images can generate scenes with diverse, reliable spatial layouts and consistent, visually appealing styles. Our key insight is that instead of learning directly from 3D scenes, we can leverage generated 2D images as an intermediary to guide 3D synthesis. In light of this, we introduce ArtiScene, a training-free automated pipeline for scene design that integrates the flexibility of free-form text-to-image generation with the diversity and reliability of 2D intermediary layouts. First, we generate 2D images from a scene description, then extract the shape and appearance of objects to create 3D models. These models are assembled into the final scene using geometry, position, and pose information derived from the same intermediary image. Being generalizable to a wide range of scenes and styles, ArtiScene outperforms state-of-the-art benchmarks by a large margin in layout and aesthetic quality on quantitative metrics. It also averages a 74.89% winning rate in extensive user studies and 95.07% in GPT-4o evaluation. Project page: this https URL

Title: Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space

Authors: Zitao Chen, Yinjun Jia, Zitong Tian, Wei-Ying Ma, Yanyan Lan
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00771
Pdf URL: https://arxiv.org/pdf/2506.00771
Copy Paste: [[2506.00771]] Manipulating 3D Molecules in a Fixed-Dimensional SE(3)-Equivariant Latent Space(https://arxiv.org/abs/2506.00771)
Keywords: generation
Abstract: Medicinal chemists often optimize drugs by considering their 3D structures and designing structurally distinct molecules that retain key features, such as shapes, pharmacophores, or chemical properties. Previous deep learning approaches address this through supervised tasks like molecule inpainting or property-guided optimization. In this work, we propose a flexible zero-shot molecule manipulation method by navigating in a shared latent space of 3D molecules. We introduce a Variational AutoEncoder (VAE) for 3D molecules, named MolFLAE, which learns a fixed-dimensional, SE(3)-equivariant latent space independent of atom counts. MolFLAE encodes 3D molecules using an SE(3)-equivariant neural network into a fixed number of latent nodes, distinguished by learned embeddings. The latent space is regularized, and molecular structures are reconstructed via a Bayesian Flow Network (BFN) conditioned on the encoder's latent output. MolFLAE achieves competitive performance on standard unconditional 3D molecule generation benchmarks. Moreover, the latent space of MolFLAE enables zero-shot molecule manipulation, including atom number editing, structure reconstruction, and coordinated latent interpolation for both structure and properties. We further demonstrate our approach on a drug optimization task for the human glucocorticoid receptor, generating molecules with improved hydrophilicity while preserving key interactions, under computational evaluations. These results highlight the flexibility, robustness, and real-world utility of our method, opening new avenues for molecule editing and optimization.

Title: Aiding Medical Diagnosis through Image Synthesis and Classification

Authors: Kanishk Choudhary
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00786
Pdf URL: https://arxiv.org/pdf/2506.00786
Copy Paste: [[2506.00786]] Aiding Medical Diagnosis through Image Synthesis and Classification(https://arxiv.org/abs/2506.00786)
Keywords: generation, generative
Abstract: Medical professionals, especially those in training, often depend on visual reference materials to support an accurate diagnosis and develop pattern recognition skills. However, existing resources may lack the diversity and accessibility needed for broad and effective clinical learning. This paper presents a system designed to generate realistic medical images from textual descriptions and validate their accuracy through a classification model. A pretrained stable diffusion model was fine-tuned using Low-Rank Adaptation (LoRA) on the PathMNIST dataset, consisting of nine colorectal histopathology tissue types. The generative model was trained multiple times using different training parameter configurations, guided by domain-specific prompts to capture meaningful features. To ensure quality control, a ResNet-18 classification model was trained on the same dataset, achieving 99.76% accuracy in detecting the correct label of a colorectal histopathological medical image. Generated images were then filtered using the trained classifier and an iterative process, where inaccurate outputs were discarded and regenerated until they were correctly classified. The highest-performing version of the generative model from experimentation achieved an F1 score of 0.6727, with precision and recall scores of 0.6817 and 0.7111, respectively. Some types of tissue, such as adipose tissue and lymphocytes, reached perfect classification scores, while others proved more challenging due to structural complexity. The resulting self-validating approach demonstrates a reliable method for synthesizing domain-specific medical images because of high accuracy in both the generation and classification portions of the system, with potential applications in both diagnostic support and clinical education. Future work includes improving prompt-specific accuracy and extending the system to other areas of medical imaging.
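The iterative filtering step described in the abstract (discard and regenerate until the classifier agrees with the requested label) can be sketched as a simple retry loop. This is an illustrative sketch only. `generate` and `classify` are hypothetical callables standing in for the fine-tuned diffusion model and the ResNet-18 classifier, the `max_attempts` cap is an added assumption so the loop always terminates, and the toy stand-ins below do not produce real images.

```python
import random

def generate_validated(label, generate, classify, max_attempts=10):
    """Regenerate until the classifier agrees with the requested label.

    Returns (image, attempts_used), or (None, max_attempts) if no
    generated sample passed the classifier check.
    """
    for attempt in range(1, max_attempts + 1):
        image = generate(label)
        if classify(image) == label:
            return image, attempt
    return None, max_attempts

# Toy stand-ins: an "image" is just a dict recording which tissue it depicts.
rng = random.Random(0)
LABELS = ["adipose", "lymphocytes", "mucus"]

def toy_generate(label):
    # Produces the requested tissue about 70% of the time, otherwise a random one.
    depicted = label if rng.random() < 0.7 else rng.choice(LABELS)
    return {"depicted": depicted}

def toy_classify(image):
    return image["depicted"]

image, attempts = generate_validated("mucus", toy_generate, toy_classify)
```

A loop like this only guarantees that accepted images satisfy the classifier, so the quality of the filter is bounded by the classifier's own accuracy.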

Title: HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models

Authors: Songtao Jiang, Yan Zhang, Yeying Jin, Zhihang Tang, Yangyang Wu, Yang Feng, Jian Wu, Zuozhu Liu
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.00805
Pdf URL: https://arxiv.org/pdf/2506.00805
Copy Paste: [[2506.00805]] HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models(https://arxiv.org/abs/2506.00805)
Keywords: generation
Abstract: Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging relative quality in dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries.
摘要：医学视觉语言模型（MED-VLM）在各种任务中都取得了成功，但是大多数现有方法忽略了模式错位问题，这些问题可能导致临床环境中的不信任响应。在本文中，我们提出了层次结构的自我对比性奖励（HSCR），一种新颖的方法解决了Med-vlm对齐中两个关键挑战：1）具有成本效益的高质量偏好数据； 2）捕获细微差别和上下文感知的偏好，以改善对齐方式。 HSCR首先利用Med-vlms的固有能力以更高的采样概率生成分配的响应。通过分析视觉令牌辍学后的输出logit变化，我们确定了诱导未对准并获得隐式对齐奖励函数的模态耦合令牌。该功能在解码过程中将令牌替代用幻觉替代，从而产生高质量的分配数据。此外，HSCR引入了多层次的偏好优化策略，该策略通过纳入细微的隐式偏好，超越了传统的相邻级别优化，从而利用了分配数据中的相对质量来捕获微妙的一致性提示，以获得更精确的和上下文意识到的优化。跨多个医疗任务的广泛实验，包括MED-VQA，医疗图像字幕和以下指导，表明HSCR不仅可以提高零射击性能，而且还可以显着提高2,000个培训条目的模态对齐和可信度。

Title: QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration

Authors: Jiatong Li, Libo Zhu, Haotong Qin, Jingkai Wang, Linghe Kong, Guihai Chen, Yulun Zhang, Xiaokang Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00820
Pdf URL: https://arxiv.org/pdf/2506.00820
Copy Paste: [[2506.00820]] QuantFace: Low-Bit Post-Training Quantization for One-Step Diffusion Face Restoration(https://arxiv.org/abs/2506.00820)
Keywords: restoration
Abstract: Diffusion models have been achieving remarkable performance in face restoration. However, the heavy computations of diffusion models make it difficult to deploy them on devices like smartphones. In this work, we propose QuantFace, a novel low-bit quantization for one-step diffusion face restoration models, where the full-precision (i.e., 32-bit) weights and activations are quantized to 4 to 6 bits. We first analyze the data distribution within activations and find that they are highly variant. To preserve the original data information, we employ rotation-scaling channel balancing. Furthermore, we propose Quantization-Distillation Low-Rank Adaptation (QD-LoRA) that jointly optimizes for quantization and distillation performance. Finally, we propose an adaptive bit-width allocation strategy. We formulate such a strategy as an integer programming problem, which combines quantization error and perceptual metrics to find a satisfactory resource allocation. Extensive experiments on the synthetic and real-world datasets demonstrate the effectiveness of QuantFace under 6-bit and 4-bit. QuantFace achieves significant advantages over recent leading low-bit quantization methods for face restoration. The code is available at this https URL.
摘要：扩散模型在面部恢复方面已经取得了显着的性能。但是，扩散模型的大量计算使得它们很难在智能手机等设备上部署。在这项工作中，我们提出了Quantface，这是一种用于一步扩散面部恢复模型的新型低位量化，其中完整精确（\ ie，32位）的权重和激活量化为4 $ \ sim $ 6位。我们首先分析激活中的数据分布，并发现它们是高度变异的。为了保留原始数据信息，我们采用旋转尺度频道平衡。此外，我们提出了量化缩减低级适应（QD-lora），该适应性（QD-LORA）共同优化了量化和蒸馏性能。最后，我们提出了一种自适应的位宽度分配策略。我们将这种策略制定为整数编程问题，该策略结合了量化错误和感知指标，以找到令人满意的资源分配。关于合成和现实世界数据集的广泛实验证明了量化在6位和4位下的有效性。与最近领先的低位量化方法相比，Quantface具有显着优势。该代码可在此HTTPS URL上找到。
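
To make the bit-width trade-off in the QuantFace abstract concrete, here is a minimal sketch of plain symmetric uniform quantization in Python. It is not the authors' method: rotation-scaling channel balancing, QD-LoRA, and the integer-programming bit allocation are not reproduced. The function names `fake_quantize` and `mse` and the sample weights are hypothetical, chosen for illustration only.

```python
def fake_quantize(values, bits):
    """Quantize to signed `bits`-bit integers with one shared scale, then dequantize.

    Assumption: symmetric per-tensor quantization, used here only as a
    generic baseline and not as the scheme proposed in the paper.
    """
    qmax = 2 ** (bits - 1) - 1            # largest representable integer level
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)               # nothing to quantize
    scale = max_abs / qmax                # real-valued width of one integer step
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # round, then clamp
        out.append(q * scale)             # map back to the real line
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical weights standing in for one layer of a network.
weights = [0.9, -0.31, 0.07, 0.52, -0.88, 0.013]
err4 = mse(weights, fake_quantize(weights, 4))
err6 = mse(weights, fake_quantize(weights, 6))
```

On these sample values the 6-bit reconstruction has a smaller error than the 4-bit one, which is the cost that a bit-width allocation strategy has to weigh against the memory saved.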

Title: SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

Authors: Zhengcong Fei, Hao Jiang, Di Qiu, Baoxuan Gu, Youqiang Zhang, Jiahua Wang, Jialin Bai, Debang Li, Mingyuan Fan, Guibin Chen, Yahui Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00830
Pdf URL: https://arxiv.org/pdf/2506.00830
Copy Paste: [[2506.00830]] SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers(https://arxiv.org/abs/2506.00830)
Keywords: generation
Abstract: The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remain underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
摘要：由多模式输入（包括文本，图像和视频）指导的音频条件会说话的肖像的生成和编辑仍在探索中。在本文中，我们介绍了Skyreels-Audio，这是一个统一的框架，用于综合高保真和时间连贯的肖像视频。我们的框架建立在预处理的视频扩散变压器的基础上，支持无限长度的生成和编辑，同时通过多模式输入实现了多种可控的调节。我们采用混合课程学习策略来逐步使音频与面部运动保持一致，从而可以对长视频序列进行细粒度的多模式控制。为了增强本地面部连贯性，我们引入了面膜蒙版损失和音频引导的无分类指导机制。滑动窗口的denoising方法进一步融合了跨时间段的潜在表示，从而确保了长时间持续时间和不同身份的视觉保真度和时间一致性。更重要的是，我们构建了一个专用的数据管道，用于策划由同步音频，视频和文本描述组成的高质量的三胞胎。全面的基准评估表明，Skyreels-Audio在唇部同步的准确性，身份一致性和现实的面部动力学方面取得了出色的性能，尤其是在复杂且具有挑战性的条件下。

Title: FourierFlow: Frequency-aware Flow Matching for Generative Turbulence Modeling

Authors: Haixin Wang, Jiashu Pan, Hao Wu, Fan Zhang, Tailin Wu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.00862
Pdf URL: https://arxiv.org/pdf/2506.00862
Copy Paste: [[2506.00862]] FourierFlow: Frequency-aware Flow Matching for Generative Turbulence Modeling(https://arxiv.org/abs/2506.00862)
Keywords: generative
Abstract: Modeling complex fluid systems, especially turbulence governed by partial differential equations (PDEs), remains a fundamental challenge in science and engineering. Recently, diffusion-based generative models have gained attention as a powerful approach for these tasks, owing to their capacity to capture long-range dependencies and recover hierarchical structures. However, we present both empirical and theoretical evidence showing that generative models struggle with significant spectral bias and common-mode noise when generating high-fidelity turbulent flows. Here we propose FourierFlow, a novel generative turbulence modeling framework that enhances the frequency-aware learning by both implicitly and explicitly mitigating spectral bias and common-mode noise. FourierFlow comprises three key innovations. Firstly, we adopt a dual-branch backbone architecture, consisting of a salient flow attention branch with local-global awareness to focus on sensitive turbulence areas. Secondly, we introduce a frequency-guided Fourier mixing branch, which is integrated via an adaptive fusion strategy to explicitly mitigate spectral bias in the generative model. Thirdly, we leverage the high-frequency modeling capabilities of the masked auto-encoder pre-training and implicitly align the features of the generative model toward high-frequency components. We validate the effectiveness of FourierFlow on three canonical turbulent flow scenarios, demonstrating superior performance compared to state-of-the-art methods. Furthermore, we show that our model exhibits strong generalization capabilities in challenging settings such as out-of-distribution domains, long-term temporal extrapolation, and robustness to noisy inputs. The code can be found at this https URL.
摘要：建模复杂的流体系统，尤其是由部分微分方程（PDE）控制的湍流，仍然是科学和工程学的基本挑战。最近，基于扩散的生成模型已成为这些任务的有力方法，因为它们捕获了远程依赖性和恢复层次结构的能力。但是，我们介绍了经验和理论证据，表明生成模型在产生高保真湍流时具有明显的光谱偏差和共同模式噪声。在这里，我们提出了一种新颖的生成湍流建模框架，通过隐式和明确缓解光谱偏差和共同模式噪声来增强频率感知的学习。傅立叶流包括三个关键创新。首先，我们采用双分支主链体系结构，由一个显着的流动注意力分支和局部 - 全球意识的关注来关注敏感的湍流区域。其次，我们引入了一个频率引导的傅立叶混合分支，该分支通过自适应融合策略进行了集成，以明确减轻生成模型中的光谱偏差。第三，我们利用屏蔽自动编码器预训练的高频建模能力，并隐式将生成模型的特征与高频组件相提并论。我们验证了三种规范湍流场景上傅里叶的有效性，与最先进的方法相比表明了表现出色的性能。此外，我们表明，我们的模型在挑战性的环境中表现出强大的概括能力，例如分布域，长期时间外推和对嘈杂输入的鲁棒性。该代码可以在此HTTPS URL上找到。

Title: Local Manifold Approximation and Projection for Manifold-Aware Diffusion Planning

Authors: Kyowoon Lee, Jaesik Choi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00867
Pdf URL: https://arxiv.org/pdf/2506.00867
Copy Paste: [[2506.00867]] Local Manifold Approximation and Projection for Manifold-Aware Diffusion Planning(https://arxiv.org/abs/2506.00867)
Keywords: generation, generative
Abstract: Recent advances in diffusion-based generative modeling have demonstrated significant promise in tackling long-horizon, sparse-reward tasks by leveraging offline datasets. While these approaches have achieved promising results, their reliability remains inconsistent due to the inherent stochastic risk of producing infeasible trajectories, limiting their applicability in safety-critical applications. We identify that the primary cause of these failures is inaccurate guidance during the sampling procedure, and demonstrate the existence of manifold deviation by deriving a lower bound on the guidance gap. To address this challenge, we propose Local Manifold Approximation and Projection (LoMAP), a training-free method that projects the guided sample onto a low-rank subspace approximated from offline datasets, preventing infeasible trajectory generation. We validate our approach on standard offline reinforcement learning benchmarks that involve challenging long-horizon planning. Furthermore, we show that, as a standalone module, LoMAP can be incorporated into the hierarchical diffusion planner, providing further performance enhancements.
摘要：基于扩散的生成建模的最新进展表明，通过利用离线数据集来应对长途，稀疏的奖励任务，这表明了巨大的希望。尽管这些方法取得了令人鼓舞的结果，但由于产生不可行的轨迹的固有随机风险，它们的可靠性仍然不一致，从而限制了它们在安全至关重要的应用中的适用性。我们确定这些失败的主要原因是在抽样过程中不准确的指导，并通过在指导差距上得出下限来证明歧管偏差的存在。为了应对这一挑战，我们提出了局部流形近似和投影（LOMAP），这是一种无训练的方法，该方法将引导样本投射到从离线数据集近似的低级子空间上，从而防止了不可行的轨迹产生。我们验证了涉及挑战长马计划的标准离线增强学习基准的方法。此外，我们表明，作为独立的模块，LOMAP可以纳入层次扩散计划者，从而提供进一步的性能提高。
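
The projection step described in the LoMAP abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the local subspace is built from the k nearest dataset points with Gram-Schmidt orthogonalization (swapped in here for whatever low-rank approximation the paper uses), and the names `project_onto_local_subspace`, `k`, and `rank` are hypothetical.

```python
import math


def project_onto_local_subspace(x, data, k=3, rank=2, tol=1e-9):
    """Project x onto an affine subspace spanned by its k nearest data points.

    Assumption: the subspace is the span of centered neighbors, truncated to
    `rank` orthonormal directions. The paper's construction may differ.
    """
    neighbors = sorted(data, key=lambda p: math.dist(p, x))[:k]
    dim = len(x)
    mean = [sum(p[i] for p in neighbors) / len(neighbors) for i in range(dim)]

    basis = []
    for p in neighbors:
        v = [p[i] - mean[i] for i in range(dim)]
        for b in basis:                       # remove components already covered
            c = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > tol:                        # skip linearly dependent neighbors
            basis.append([vi / norm for vi in v])
        if len(basis) == rank:
            break

    residual = [x[i] - mean[i] for i in range(dim)]
    projected = list(mean)
    for b in basis:                           # add back the in-subspace part of x
        c = sum(ri * bi for ri, bi in zip(residual, b))
        projected = [pi + c * bi for pi, bi in zip(projected, b)]
    return projected


# Hypothetical offline data lying on the plane z = 0, and an off-plane sample.
dataset = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
guided_sample = (0.4, 0.3, 0.5)
on_manifold = project_onto_local_subspace(guided_sample, dataset)
```

In this toy setting the off-plane component of the sample is removed while its in-plane coordinates are kept, which mirrors the stated goal of pulling a guided sample back toward the region covered by the offline data.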

Title: Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection

Authors: Yue Zhou, Xinan He, KaiQing Lin, Bin Fan, Feng Ding, Bin Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00874
Pdf URL: https://arxiv.org/pdf/2506.00874
Copy Paste: [[2506.00874]] Breaking Latent Prior Bias in Detectors for Generalizable AIGC Image Detection(https://arxiv.org/abs/2506.00874)
Keywords: generative
Abstract: Current AIGC detectors often achieve near-perfect accuracy on images produced by the same generator used for training but struggle to generalize to outputs from unseen generators. We trace this failure in part to latent prior bias: detectors learn shortcuts tied to patterns stemming from the initial noise vector rather than learning robust generative artifacts. To address this, we propose On-Manifold Adversarial Training (OMAT): by optimizing the initial latent noise of diffusion models under fixed conditioning, we generate on-manifold adversarial examples that remain on the generator's output manifold, unlike pixel-space attacks, which introduce off-manifold perturbations that the generator itself cannot reproduce and that can obscure the true discriminative artifacts. To test against state-of-the-art generative models, we introduce GenImage++, a test-only benchmark of outputs from advanced generators (Flux.1, SD3) with extended prompts and diverse styles. We apply our adversarial-training paradigm to ResNet50 and CLIP baselines and evaluate across existing AIGC forensic benchmarks and recent challenge datasets. Extensive experiments show that adversarially trained detectors significantly improve cross-generator performance without any network redesign. Our findings on latent-prior bias offer valuable insights for future dataset construction and detector evaluation, guiding the development of more robust and generalizable AIGC forensic methodologies.
摘要：当前的AIGC探测器通常可以在用于训练的同一发电机生成的图像上实现几乎完美的精度，但要努力推广到看不见的发电机的输出。我们追踪了这种失败，部分原因是潜在的先验偏见：检测器学习与最初噪声矢量的模式相关的捷径，而不是学习强大的生成伪像。为了解决这个问题，我们提出了在固定条件下的扩散模型的初始潜在噪声（OMAT）：我们生成在生成歧管歧管不像的像素空间攻击上，这些示例在固定条件下产生了对生成量的偏见，从而引入了生成器本身的散发，并且可以使其毫不掩饰地降级。为了针对最新的生成模型进行测试，我们引入了Genimage ++，Genimage ++是来自高级发电机（Flux.1，sd3）输出的仅测试基准，并具有扩展的提示和各种样式。我们将对抗性训练范式应用于Resnet50和剪辑基线，并在现有的AIGC法医基准和最近的挑战数据集中评估。广泛的实验表明，对抗训练的探测器可显着提高跨发射器性能，而无需任何网络重新设计。我们对潜在偏见的发现为未来的数据集构建和探测器评估提供了宝贵的见解，从而指导开发更健壮和可推广的AIGC法医方法。

Title: State-Covering Trajectory Stitching for Diffusion Planners

Authors: Kyowoon Lee, Jaesik Choi
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00895
Pdf URL: https://arxiv.org/pdf/2506.00895
Copy Paste: [[2506.00895]] State-Covering Trajectory Stitching for Diffusion Planners(https://arxiv.org/abs/2506.00895)
Keywords: generative
Abstract: Diffusion-based generative models are emerging as powerful tools for long-horizon planning in reinforcement learning (RL), particularly with offline datasets. However, their performance is fundamentally limited by the quality and diversity of training data. This often restricts their generalization to tasks outside their training distribution or longer planning horizons. To overcome this challenge, we propose State-Covering Trajectory Stitching (SCoTS), a novel reward-free trajectory augmentation method that incrementally stitches together short trajectory segments, systematically generating diverse and extended trajectories. SCoTS first learns a temporal distance-preserving latent representation that captures the underlying temporal structure of the environment, then iteratively stitches trajectory segments guided by directional exploration and novelty to effectively cover and expand this latent space. We demonstrate that SCoTS significantly improves the performance and generalization capabilities of diffusion planners on offline goal-conditioned benchmarks requiring stitching and long-horizon reasoning. Furthermore, augmented trajectories generated by SCoTS significantly improve the performance of widely used offline goal-conditioned RL algorithms across diverse environments.
摘要：基于扩散的生成模型正在成为增强学习（RL）的长途计划的强大工具，尤其是在离线数据集中。但是，他们的性能从根本上受到培训数据的质量和多样性的限制。这通常会限制他们对培训分配或更长计划范围之外的任务的概括。为了克服这一挑战，我们提出了一种覆盖状态的轨迹缝线（SCOTS），这是一种新型的无奖励轨迹增强方法，可逐步缝合短轨迹段，系统地产生多样化和扩展的轨迹。苏格兰人首先学习了一个时间远距离的潜在表示，该表示捕获了环境的潜在时间结构，然后迭代缝合轨迹段，以方向探索和新颖性为指导，以有效覆盖和扩展该潜在空间。我们证明，苏格兰人可以显着提高扩散计划者在离线目标条件基准的基准上的性能和概括能力，需要缝线和长途推理。此外，苏格兰人产生的增强轨迹显着提高了不同环境中广泛使用的离线目标条件RL算法的性能。

Title: DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation

Authors: Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00908
Pdf URL: https://arxiv.org/pdf/2506.00908
Copy Paste: [[2506.00908]] DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation(https://arxiv.org/abs/2506.00908)
Keywords: generation
Abstract: Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.
摘要：尽管最近取得了进展，但大多数现有的虚拟尝试方法仍然很难同时解决两个核心挑战：将服装图像与目标人体准确地对齐，并保留细粒的服装纹理和模式。在本文中，我们提出了DS-Vton，这是一个双尺度的虚拟尝试框架，明确地解开了这些目标以进行更有效的建模。 DS-VTON由两个阶段组成：第一阶段产生低分辨率的尝试结果，以捕获服装和身体之间的语义对应关系，其中减少的细节促进了稳健的结构比对。第二阶段引入了一个残留的引导扩散过程，该过程通过完善两个量表之间的残留物来重建高分辨率输出，重点是质地保真度。此外，我们的方法采用了完全无面膜的生成范式，从而消除了对人类解析图或分割面罩的依赖。通过利用嵌入预处理扩散模型中的语义先验，这种设计更有效地保留了该人的外观和几何一致性。广泛的实验表明，DS-VTON在多个标准虚拟的虚拟试验基准中实现了结构对齐和纹理保存的最新性能。

Title: 3D Skeleton-Based Action Recognition: A Review

Authors: Mengyuan Liu, Hong Liu, Qianshuo Hu, Bin Ren, Junsong Yuan, Jiaying Lin, Jiajun Wen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00915
Pdf URL: https://arxiv.org/pdf/2506.00915
Copy Paste: [[2506.00915]] 3D Skeleton-Based Action Recognition: A Review(https://arxiv.org/abs/2506.00915)
Keywords: generative
Abstract: With the inherent advantages of skeleton representation, 3D skeleton-based action recognition has become a prominent topic in the field of computer vision. However, previous reviews have predominantly adopted a model-oriented perspective, often neglecting the fundamental steps involved in skeleton-based action recognition. This oversight tends to ignore key components of skeleton-based action recognition beyond model design and has hindered deeper, more intrinsic understanding of the task. To bridge this gap, our review aims to address these limitations by presenting a comprehensive, task-oriented framework for understanding skeleton-based action recognition. We begin by decomposing the task into a series of sub-tasks, placing particular emphasis on preprocessing steps such as modality derivation and data augmentation. The subsequent discussion delves into critical sub-tasks, including feature extraction and spatio-temporal modeling techniques. Beyond foundational action recognition networks, recently advanced frameworks such as hybrid architectures, Mamba models, large language models (LLMs), and generative models have also been highlighted. Finally, a comprehensive overview of public 3D skeleton datasets is presented, accompanied by an analysis of state-of-the-art algorithms evaluated on these benchmarks. By integrating task-oriented discussions, comprehensive examinations of sub-tasks, and an emphasis on the latest advancements, our review provides a fundamental and accessible structured roadmap for understanding and advancing the field of 3D skeleton-based action recognition.
摘要：凭借骨架表示的固有优势，基于3D骨架的动作识别已成为计算机视觉领域的重要主题。但是，先前的评论主要采用了面向模型的观点，通常忽略了基于骨架的动作识别所涉及的基本步骤。这种疏忽倾向于忽略基于骨架的动作识别的关键组成部分，超出了模型设计，并阻碍了对任务的更深入，更内在的理解。为了弥合这一差距，我们的评论旨在通过提出一个全面的，面向任务的框架来理解基于骨架的动作识别，以解决这些局限性。我们首先将任务分解为一系列子任务，特别强调诸如模态派生和数据增强之类的预处理步骤。随后的讨论涉足关键子任务，包括特征提取和时空建模技术。除了基本的行动识别网络外，还突出了最近高级的框架，例如混合体系结构，MAMBA模型，大语言模型（LLMS）和生成模型。最后，提出了对公共3D骨架数据集的全面概述，并伴随着对这些基准测试的最新算法的分析。通过整合以任务为导向的讨论，对子任务的全面考试以及对最新进步的重视，我们的评论提供了一个基本且易于访问的结构性路线图，以理解和推进基于3D骨架的行动识别领域。

Title: Deformable registration and generative modelling of aortic anatomies by auto-decoders and neural ODEs

Authors: Riccardo Tenderini, Luca Pegolotti, Fanwei Kong, Stefano Pagani, Francesco Regazzoni, Alison L. Marsden, Simone Deparis
Subjects: cs.CV, math.NA
Abstract URL: https://arxiv.org/abs/2506.00947
Pdf URL: https://arxiv.org/pdf/2506.00947
Copy Paste: [[2506.00947]] Deformable registration and generative modelling of aortic anatomies by auto-decoders and neural ODEs(https://arxiv.org/abs/2506.00947)
Keywords: generation, generative
Abstract: This work introduces AD-SVFD, a deep learning model for the deformable registration of vascular shapes to a pre-defined reference and for the generation of synthetic anatomies. AD-SVFD operates by representing each geometry as a weighted point cloud and models ambient space deformations as solutions at unit time of ODEs, whose time-independent right-hand sides are expressed through artificial neural networks. The model parameters are optimized by minimizing the Chamfer Distance between the deformed and reference point clouds, while backward integration of the ODE defines the inverse transformation. A distinctive feature of AD-SVFD is its auto-decoder structure, which enables generalization across shape cohorts and favors efficient weight sharing. In particular, each anatomy is associated with a low-dimensional code that acts as a self-conditioning field and that is jointly optimized with the network parameters during training. At inference, only the latent codes are fine-tuned, substantially reducing computational overheads. Furthermore, the use of implicit shape representations enables generative applications: new anatomies can be synthesized by suitably sampling from the latent space and applying the corresponding inverse transformations to the reference geometry. Numerical experiments, conducted on healthy aortic anatomies, showcase the high-quality results of AD-SVFD, which yields extremely accurate approximations at competitive computational costs.
摘要：这项工作介绍了AD-SVFD，这是一个深度学习模型，用于将血管形状的可变形注册到预定义的参考和生成合成解剖结构。 AD-SVFD通过将每个几何形状表示为加权点云，并将环境空间变形作为ODES单位时间的解决方案建模，而环境空间变形作为解决方案，其时间独立于时间的右侧是通过人工神经网络表达的。通过最小化变形点云和参考点云之间的倒角距离来优化模型参数，而ODE的向后积分定义了反变形。 AD-SVFD的一个独特功能是其自动二十码结构，它可以跨形状同类群体的概括，并有利于有效的重量共享。特别是，每种解剖结构都与低维密码相关联，该代码充当自我调节领域，并且在训练过程中与网络参数共同优化。在推断时，只有潜在代码是微调的，大大降低了计算开销。此外，使用隐式形状表示可以实现生成应用：可以通过从潜在空间进行适当采样并将相应的逆变换应用于参考几何形状来合成新的解剖。在健康主动脉解剖学上进行的数值实验展示了AD-SVFD的高质量结果，该结果以竞争性计算成本产生了极为准确的近似值。
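
The Chamfer Distance named in the AD-SVFD abstract as the training objective can be illustrated with a small Python sketch. Conventions vary (squared or unsquared distances, sums or means, optional point weights), so the version below is an assumption: an unweighted, symmetric, mean-of-squared-nearest-distances form, whereas the abstract mentions weighted point clouds. The function name `chamfer_distance` is hypothetical.

```python
import math


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point clouds (lists of tuples).

    Assumption: each direction averages the squared distance from a point to
    its nearest neighbor in the other cloud; point weights are ignored.
    """
    def one_sided(source, target):
        return sum(
            min(math.dist(p, q) ** 2 for q in target) for p in source
        ) / len(source)

    return one_sided(cloud_a, cloud_b) + one_sided(cloud_b, cloud_a)


# Two hypothetical 2D clouds: a segment and the same segment shifted up by 1.
reference = [(0.0, 0.0), (1.0, 0.0)]
deformed = [(0.0, 1.0), (1.0, 1.0)]
distance = chamfer_distance(deformed, reference)
```

The brute-force nearest-neighbor search costs O(|A| * |B|), which is fine for a sketch; large anatomies would normally call for a spatial index or a batched GPU implementation.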

Title: TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction

Authors: Yiyao Huang, Zhedong Zheng, Yu Ziwei, Yaxiong Wang, Tze Ho Elden Tse, Angela Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00953
Pdf URL: https://arxiv.org/pdf/2506.00953
Copy Paste: [[2506.00953]] TIGeR: Text-Instructed Generation and Refinement for Template-Free Hand-Object Interaction(https://arxiv.org/abs/2506.00953)
Keywords: generation
Abstract: Pre-defined 3D object templates are widely used in 3D reconstruction of hand-object interactions. However, they often require substantial manual efforts to capture or source, and inherently restrict the adaptability of models to unconstrained interaction scenarios, e.g., heavily-occluded objects. To overcome this bottleneck, we propose a new Text-Instructed Generation and Refinement (TIGeR) framework, harnessing the power of intuitive text-driven priors to steer the object shape refinement and pose estimation. We use a two-stage framework: a text-instructed prior generation and vision-guided refinement. As the name implies, we first leverage off-the-shelf models to generate shape priors according to the text description without tedious 3D crafting. Considering the geometric gap between the synthesized prototype and the real object interacted with the hand, we further calibrate the synthesized prototype via 2D-3D collaborative attention. TIGeR achieves competitive performance, i.e., 1.979 and 5.468 object Chamfer distance on the widely-used Dex-YCB and Obman datasets, respectively, surpassing existing template-free methods. Notably, the proposed framework shows robustness to occlusion, while maintaining compatibility with heterogeneous prior sources, e.g., retrieved hand-crafted prototypes, in practical deployment scenarios.
摘要：预定义的3D对象模板被广泛用于手动相互作用的3D重建。但是，他们通常需要大量的手动努力来捕获或源，并固有地限制了模型对不受约束的相互作用方案的适应性，例如重度的对象。为了克服这种瓶颈，我们提出了一个新的文本教学生成和改进（Tiger）框架，利用直观的文本驱动先验的力量来引导对象形状的改进和姿势估计。我们使用两个阶段的框架：文本指导的先前一代和视觉指导的改进。顾名思义，我们首先利用现成的模型根据文本描述而无需乏味的3D制作来生成形状先验。考虑到综合原型与手相互作用的真实对象之间的几何差距，我们通过2D-3D协作关注进一步校准合成的原型。 Tiger分别在广泛使用的DEX-YCB和OBMAN数据集上实现了竞争性能，即1.979和5.468对象倒角距离，超过了现有的无模板方法。值得注意的是，所提出的框架显示出牢固的遮挡性，同时在实际部署方案中保持了与先前来源（例如，检索的手工制作的原型）的兼容性。

Title: Camera Trajectory Generation: A Comprehensive Survey of Methods, Metrics, and Future Directions

Authors: Zahra Dehghanian, Pouya Ardekhani, Amir Vahedi, Hamid Beigy, Hamid R. Rabiee
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.00974
Pdf URL: https://arxiv.org/pdf/2506.00974
Copy Paste: [[2506.00974]] Camera Trajectory Generation: A Comprehensive Survey of Methods, Metrics, and Future Directions(https://arxiv.org/abs/2506.00974)
Keywords: generation
Abstract: Camera trajectory generation is a cornerstone in computer graphics, robotics, virtual reality, and cinematography, enabling seamless and adaptive camera movements that enhance visual storytelling and immersive experiences. Despite its growing prominence, the field lacks a systematic and unified survey that consolidates essential knowledge and advancements in this domain. This paper addresses this gap by providing the first comprehensive review of the field, ranging from foundational definitions to advanced methodologies. We introduce the different approaches to camera representation and present an in-depth review of available camera trajectory generation models, starting with rule-based approaches and progressing through optimization-based techniques, machine learning advancements, and hybrid methods that integrate multiple strategies. Additionally, we gather and analyze the metrics and datasets commonly used for evaluating camera trajectory systems, offering insights into how these tools measure performance, aesthetic quality, and practical applicability. Finally, we highlight existing limitations, critical gaps in current research, and promising opportunities for investment and innovation in the field. This paper not only serves as a foundational resource for researchers entering the field but also paves the way for advancing adaptive, efficient, and creative camera trajectory systems across diverse applications.
摘要：相机轨迹的生成是计算机图形，机器人技术，虚拟现实和摄影的基石，可实现无缝和自适应的摄像头动作，从而增强视觉讲故事和沉浸式体验。尽管该领域越来越重要，但仍缺乏系统的统一调查，可以巩固该领域的基本知识和进步。本文通过提供对该领域的首次全面审查来解决这一差距，从基础定义到高级方法论。我们介绍了相机表示的不同方法，并对可用的相机轨迹生成模型进行了深入的评论，从基于规则的方法开始，并通过基于优化的技术，机器学习进步以及整合多种策略的混合方法进行进步。此外，我们收集和分析通常用于评估相机轨迹系统的指标和数据集，提供有关这些工具如何衡量性能，美学质量和实际适用性的见解。最后，我们重点介绍了现有的局限性，当前研究中的关键差距以及该领域的投资和创新机会。本文不仅是进入该领域的研究人员的基本资源，而且为在不同应用程序中推进适应性，高效和创造性的摄像头轨迹系统铺平了道路。

Title: Quantization-based Bounds on the Wasserstein Metric

Authors: Jonathan Bobrutsky, Amit Moscovich
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.00976
Pdf URL: https://arxiv.org/pdf/2506.00976
Copy Paste: [[2506.00976]] Quantization-based Bounds on the Wasserstein Metric(https://arxiv.org/abs/2506.00976)
Keywords: generative
Abstract: The Wasserstein metric has become increasingly important in many machine learning applications such as generative modeling, image retrieval and domain adaptation. Despite its appeal, it is often too costly to compute. This has motivated approximation methods like entropy-regularized optimal transport, downsampling, and subsampling, which trade accuracy for computational efficiency. In this paper, we consider the challenge of computing efficient approximations to the Wasserstein metric that also serve as strict upper or lower bounds. Focusing on discrete measures on regular grids, our approach involves formulating and exactly solving a Kantorovich problem on a coarse grid using a quantized measure and specially designed cost matrix, followed by an upscaling and correction stage. This is done either in the primal or dual space to obtain valid upper and lower bounds on the Wasserstein metric of the full-resolution inputs. We evaluate our methods on the DOTmark optimal transport images benchmark, demonstrating a 10x-100x speedup compared to entropy-regularized OT while keeping the approximation error below 2%.
摘要：Wasserstein指标在许多机器学习应用中变得越来越重要，例如生成建模，图像检索和域的适应性。尽管它具有吸引力，但计算通常太昂贵了。这激发了近似方法，例如熵登记的最佳运输，下采样和亚采样，这些方法的计算效率准确性。在本文中，我们考虑了对Wasserstein度量的有效近似值的挑战，该度量也是严格的上限或下限。我们的方法着眼于常规网格的离散措施，涉及使用量化的度量和特殊设计的成本矩阵在粗网格上进行配方，并精确地解决了坎托维奇问题，然后进行了升级和校正阶段。这是在原始空间或双重空间中完成的，以在全分辨率输入的Wasserstein度量上获得有效的上限和下限。我们在Dotmark最佳传输图像基准上评估了我们的方法，与熵登记的OT相比，在将近似误差保持在2％以下的同时，证明了10x-100x的速度。

Title: IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Authors: Wayne Zhang, Changjiang Jiang, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.00979
Pdf URL: https://arxiv.org/pdf/2506.00979
Copy Paste: [[2506.00979]] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection(https://arxiv.org/abs/2506.00979)
Keywords: generative
Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) in visual domains has resulted in highly realistic synthetic images and videos, driven by sophisticated generative frameworks such as diffusion-based architectures. While these breakthroughs open substantial opportunities, they simultaneously raise critical concerns about content authenticity and integrity. Many current AIGC detection methods operate as black-box binary classifiers, which offer limited interpretability, and no approach supports detecting both images and videos in a unified framework. This dual limitation compromises model transparency, reduces trustworthiness, and hinders practical deployment. To address these challenges, we introduce IVY-FAKE, a novel, unified, and large-scale dataset specifically designed for explainable multimodal AIGC detection. Unlike prior benchmarks, which suffer from fragmented modality coverage and sparse annotations, IVY-FAKE contains over 150,000 richly annotated training samples (images and videos) and 18,700 evaluation examples, each accompanied by detailed natural-language reasoning beyond simple binary labels. Building on this, we propose Ivy Explainable Detector (IVY-XDETECTOR), a unified AIGC detection and explainable architecture that jointly performs explainable detection for both image and video content. Our unified vision-language model achieves state-of-the-art performance across multiple image and video detection benchmarks, highlighting the significant advancements enabled by our dataset and modeling framework. Our data is publicly available at this https URL.
摘要：视觉域中人工智能生成的内容（AIGC）的快速发展导致了由复杂的生成框架（例如基于扩散的架构）驱动的高度逼真的合成图像和视频。尽管这些突破开辟了大量机会，但它们同时引起了人们对内容真实性和完整性的关键关注。许多当前的AIGC检测方法作为黑框二进制分类器的运行，这些分类器具有有限的解释性，没有方法支持在统一框架中检测图像和视频。这种双重限制会损害模型透明度，降低可信赖性，并阻碍实际部署。为了应对这些挑战，我们引入了Ivy-Fake，这是一种专门设计用于可解释的多模式AIGC检测的新颖，统一和大型数据集。与以前的基准分散的基准覆盖率覆盖和稀疏注释不同，Ivy-Fake包含超过15万个注释的训练样本（图像和视频）和18,700个评估示例，每个示例都有详细的自然语言推理。在此基础上，我们提出了Ivy可解释的检测器（IVY-XDETECTOR），这是一种统一的AIGC检测和可解释的架构，共同为图像和视频内容执行可解释的检测。我们的统一视觉模型在多个图像和视频检测基准中实现了最新的性能，突出了我们的数据集和建模框架实现的重大进步。我们的数据在此HTTPS URL上公开可用。

Title: GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs

Authors: Xiaorong Zhu, Ziheng Jia, Jiarui Wang, Xiangyu Zhao, Haodong Duan, Xiongkuo Min, Jia Wang, Zicheng Zhang, Guangtao Zhai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00991
Pdf URL: https://arxiv.org/pdf/2506.00991
Copy Paste: [[2506.00991]] GOBench: Benchmarking Geometric Optics Generation and Understanding of MLLMs(https://arxiv.org/abs/2506.00991)
Keywords: generation, generative
Abstract: The rapid evolution of Multi-modality Large Language Models (MLLMs) is driving significant advancements in visual understanding and generation. Nevertheless, a comprehensive assessment of their capabilities, concerning the fine-grained physical principles especially in geometric optics, remains underexplored. To address this gap, we introduce GOBench, the first benchmark to systematically evaluate MLLMs' ability across two tasks: 1) Generating Optically Authentic Imagery and 2) Understanding Underlying Optical Phenomena. We curate high-quality prompts of geometric optical scenarios and use MLLMs to construct the GOBench-Gen-1k dataset. We then organize subjective experiments to assess the generated imagery based on Optical Authenticity, Aesthetic Quality, and Instruction Fidelity, revealing MLLMs' generation flaws that violate optical principles. For the understanding task, we apply crafted evaluation instructions to test the optical understanding ability of eleven prominent MLLMs. The experimental results demonstrate that current models face significant challenges in both optical generation and understanding. The top-performing generative model, GPT-4o-Image, cannot perfectly complete all generation tasks, and the best-performing MLLM model, Gemini-2.5Pro, attains a mere 37.35% accuracy in optical understanding.
摘要：多模式大型语言模型（MLLM）的快速发展正在推动视觉理解和发电方面的重大进步。然而，对其能力的全面评估，涉及细粒度的物理原理，尤其是在几何光学方面，仍然没有得到充实的态度。为了解决这一差距，我们引入了Gobench，这是第一个系统地评估MLLM在两个任务中的能力的基准：1）生成光学真实的图像和2）理解基本的光学现象。我们策划了几何光学方案的高质量提示，并使用MLLM构建GoBench-1K此HTTP URL，然后组织主观实验，以基于光学真实性，美学质量和教学忠诚度来评估产生的图像，并揭示了MLLMS的一代违法缺陷，违反了光学原理。为了理解任务，我们应用精心设计的评估说明来测试11个突出MLLM的光学理解能力。实验结果表明，当前模型在光学产生和理解中都面临着重大挑战。表现最佳的生成模型GPT-4O图像无法完美地完成所有生成任务，并且表现最佳的MLLM模型Gemini-2.5pro仅达到37.35 \％的光学理解精度。

Title: Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models

Authors: Kinam Kim, Junha Hyung, Jaegul Choo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00996
Pdf URL: https://arxiv.org/pdf/2506.00996
Copy Paste: [[2506.00996]] Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models(https://arxiv.org/abs/2506.00996)
Keywords: generation
Abstract: Recent advances in text-to-video diffusion models have enabled high-quality video synthesis, but controllable generation remains challenging, particularly under limited data and compute. Existing fine-tuning methods for conditional generation often rely on external encoders or architectural modifications, which demand large datasets and are typically restricted to spatially aligned conditioning, limiting flexibility and scalability. In this work, we introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient and versatile approach for adapting pretrained video diffusion models to diverse conditional generation tasks. Our key idea is to concatenate condition and target frames along the temporal axis and insert intermediate buffer frames with progressively increasing noise levels. These buffer frames enable smooth transitions, aligning the fine-tuning process with the pretrained model's temporal dynamics. TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples. We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B. Extensive experiments show that TIC-FT outperforms existing baselines in both condition fidelity and visual quality, while remaining highly efficient in both training and inference. For additional results, visit this https URL
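
The frame layout described in this abstract (condition frames, then buffer frames with progressively increasing noise, then target frames, all concatenated along the temporal axis) can be sketched as below. This is a minimal illustration, not the authors' implementation: the linear buffer schedule, the choice to fill buffer frames with copies of the last condition frame, the linear interpolation used for noising, and the names `noise_levels` and `build_sequence` are all assumptions.

```python
import numpy as np


def noise_levels(n_cond: int, n_buffer: int, n_target: int, target_level: float) -> np.ndarray:
    """Per-frame noise level: 0 for condition frames, a ramp for buffer frames,
    and target_level for target frames (linear ramp is an assumption)."""
    cond = np.zeros(n_cond)
    # Buffer levels lie strictly between 0 and target_level and increase monotonically.
    buffer = np.linspace(0.0, target_level, n_buffer + 2)[1:-1]
    target = np.full(n_target, target_level)
    return np.concatenate([cond, buffer, target])


def build_sequence(cond, target, n_buffer, target_level, rng):
    """Concatenate condition, buffer, and target frames along the temporal axis.

    cond has shape (Tc, H, W, C) and target has shape (Tt, H, W, C).
    """
    # Assumption: buffer frames start as copies of the last condition frame.
    buffer = np.repeat(cond[-1:], n_buffer, axis=0)
    clean = np.concatenate([cond, buffer, target], axis=0)
    levels = noise_levels(len(cond), n_buffer, len(target), target_level)
    eps = rng.standard_normal(clean.shape)
    s = levels.reshape(-1, 1, 1, 1)
    # Assumption: simple linear interpolation between clean frame and noise.
    noisy = (1.0 - s) * clean + s * eps
    return noisy, levels


rng = np.random.default_rng(0)
cond_frames = np.zeros((2, 4, 4, 3))
target_frames = np.ones((2, 4, 4, 3))
sequence, levels = build_sequence(cond_frames, target_frames, 3, 0.8, rng)
print(sequence.shape, levels)
```

Because the condition frames sit at noise level 0, they pass through unchanged, while the buffer frames bridge the gap to the noisier target frames.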

Title: Pseudo-Labeling Driven Refinement of Benchmark Object Detection Datasets via Analysis of Learning Patterns

Authors: Min Je Kim, Muhammad Munsif, Altaf Hussain, Hikmat Yar, Sung Wook Baik
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.00997
Pdf URL: https://arxiv.org/pdf/2506.00997
Copy Paste: [[2506.00997]] Pseudo-Labeling Driven Refinement of Benchmark Object Detection Datasets via Analysis of Learning Patterns(https://arxiv.org/abs/2506.00997)
Keywords: generation
Abstract: Benchmark object detection (OD) datasets play a pivotal role in advancing computer vision applications such as autonomous driving and surveillance, as well as in training and evaluating deep learning-based state-of-the-art detection models. Among them, MS-COCO has become a standard benchmark due to its diverse object categories and complex scenes. However, despite its wide adoption, MS-COCO suffers from various annotation issues, including missing labels, incorrect class assignments, inaccurate bounding boxes, duplicate labels, and group labeling inconsistencies. These errors not only hinder model training but also degrade the reliability and generalization of OD models. To address these challenges, we propose a comprehensive refinement framework and present MJ-COCO, a newly re-annotated version of MS-COCO. Our approach begins with loss and gradient-based error detection to identify potentially mislabeled or hard-to-learn samples. Next, we apply a four-stage pseudo-labeling refinement process: (1) bounding box generation using invertible transformations, (2) IoU-based duplicate removal and confidence merging, (3) class consistency verification via an expert object recognizer, and (4) spatial adjustment based on object region activation map analysis. This integrated pipeline enables scalable and accurate correction of annotation errors without manual re-labeling. Extensive experiments were conducted across four validation datasets: MS-COCO, Sama COCO, Objects365, and PASCAL VOC. Models trained on MJ-COCO consistently outperformed those trained on MS-COCO, achieving improvements in Average Precision (AP) and APS metrics. MJ-COCO also demonstrated significant gains in annotation coverage: for example, the number of small object annotations increased by more than 200,000 compared to MS-COCO.
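
Stage (2) of the pipeline above, IoU-based duplicate removal, can be illustrated with a small sketch. The abstract does not give the threshold or the merging rule, so the 0.5 threshold, the "keep the highest-confidence box per duplicate group" rule, and the function names `iou` and `merge_duplicates` are assumptions made for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Label = Tuple[Box, str, float]           # (box, class name, confidence)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_duplicates(labels: List[Label], iou_threshold: float = 0.5) -> List[Label]:
    """Greedy duplicate removal: visit labels from most to least confident and
    drop any label that overlaps an already-kept label of the same class."""
    kept: List[Label] = []
    for box, cls, conf in sorted(labels, key=lambda item: item[2], reverse=True):
        is_duplicate = any(
            cls == kept_cls and iou(box, kept_box) >= iou_threshold
            for kept_box, kept_cls, _ in kept
        )
        if not is_duplicate:
            kept.append((box, cls, conf))
    return kept


pseudo_labels = [
    ((0, 0, 10, 10), "cat", 0.9),
    ((1, 1, 10, 10), "cat", 0.6),    # overlaps the first cat box
    ((0, 0, 10, 10), "dog", 0.5),    # same box, different class
    ((20, 20, 30, 30), "cat", 0.4),  # far away
]
print(merge_duplicates(pseudo_labels))
```

Comparing only boxes of the same class keeps legitimately co-located objects of different classes, which is one plausible reading of the duplicate-label problem the abstract describes.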

Title: Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution

Authors: Shijun Shi, Jing Xu, Lijing Lu, Zhihang Li, Kai Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01037
Pdf URL: https://arxiv.org/pdf/2506.01037
Copy Paste: [[2506.01037]] Self-supervised ControlNet with Spatio-Temporal Mamba for Real-world Video Super-resolution(https://arxiv.org/abs/2506.01037)
Keywords: super-resolution
Abstract: Existing diffusion-based video super-resolution (VSR) methods are susceptible to introducing complex degradations and noticeable artifacts into high-resolution videos due to their inherent randomness. In this paper, we propose a noise-robust real-world VSR framework by incorporating self-supervised learning and Mamba into pre-trained latent diffusion models. To ensure content consistency across adjacent frames, we enhance the diffusion model with a global spatio-temporal attention mechanism using the Video State-Space block with a 3D Selective Scan module, which reinforces coherence at an affordable computational cost. To further reduce artifacts in generated details, we introduce a self-supervised ControlNet that leverages HR features as guidance and employs contrastive learning to extract degradation-insensitive features from LR videos. Finally, a three-stage training strategy based on a mixture of HR-LR videos is proposed to stabilize VSR training. The proposed Self-supervised ControlNet with Spatio-Temporal Continuous Mamba based VSR algorithm achieves better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.

Title: DeepVerse: 4D Autoregressive Video Generation as a World Model

Authors: Junyi Chen, Haoyi Zhu, Xianglong He, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Zhoujie Fu, Jiangmiao Pang, Tong He
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01103
Pdf URL: https://arxiv.org/pdf/2506.01103
Copy Paste: [[2506.01103]] DeepVerse: 4D Autoregressive Video Generation as a World Model(https://arxiv.org/abs/2506.01103)
Keywords: generation
Abstract: World models serve as essential building blocks toward Artificial General Intelligence (AGI), enabling intelligent agents to predict future states and plan actions by simulating complex physical interactions. However, existing interactive models primarily predict visual observations, thereby neglecting crucial hidden states like geometric structures and spatial coherence. This leads to rapid error accumulation and temporal inconsistency. To address these limitations, we introduce DeepVerse, a novel 4D interactive world model explicitly incorporating geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences and achieve substantial improvements in prediction accuracy, visual realism, and scene rationality. Furthermore, our method provides an effective solution for geometry-aware memory retrieval, effectively preserving long-term spatial consistency. We validate the effectiveness of DeepVerse across diverse scenarios, establishing its capacity for high-fidelity, long-horizon predictions grounded in geometry-aware dynamics.

Title: Reconsidering LLM Uncertainty Estimation Methods in the Wild

Authors: Yavuz Bakman, Duygu Nur Yaldiz, Sungmin Kang, Tuo Zhang, Baturalp Buyukates, Salman Avestimehr, Sai Praneeth Karimireddy
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01114
Pdf URL: https://arxiv.org/pdf/2506.01114
Copy Paste: [[2506.01114]] Reconsidering LLM Uncertainty Estimation Methods in the Wild(https://arxiv.org/abs/2506.01114)
Keywords: generation
Abstract: Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: this https URL.

Title: Revolutionizing Radiology Workflow with Factual and Efficient CXR Report Generation

Authors: Pimchanok Sukjai, Apiradee Boonmee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01118
Pdf URL: https://arxiv.org/pdf/2506.01118
Copy Paste: [[2506.01118]] Revolutionizing Radiology Workflow with Factual and Efficient CXR Report Generation(https://arxiv.org/abs/2506.01118)
Keywords: generation
Abstract: The escalating demand for medical image interpretation underscores the critical need for advanced artificial intelligence solutions to enhance the efficiency and accuracy of radiological diagnoses. This paper introduces CXR-PathFinder, a novel Large Language Model (LLM)-centric foundation model specifically engineered for automated chest X-ray (CXR) report generation. We propose a unique training paradigm, Clinician-Guided Adversarial Fine-Tuning (CGAFT), which meticulously integrates expert clinical feedback into an adversarial learning framework to mitigate factual inconsistencies and improve diagnostic precision. Complementing this, our Knowledge Graph Augmentation Module (KGAM) acts as an inference-time safeguard, dynamically verifying generated medical statements against authoritative knowledge bases to minimize hallucinations and ensure standardized terminology. Leveraging a comprehensive dataset of millions of paired CXR images and expert reports, our experiments demonstrate that CXR-PathFinder significantly outperforms existing state-of-the-art medical vision-language models across various quantitative metrics, including clinical accuracy (Macro F1 (14): 46.5, Micro F1 (14): 59.5). Furthermore, blinded human evaluation by board-certified radiologists confirms CXR-PathFinder's superior clinical utility, completeness, and accuracy, establishing its potential as a reliable and efficient aid for radiological practice. The developed method effectively balances high diagnostic fidelity with computational efficiency, providing a robust solution for automated medical report generation.

Title: Neuro-Symbolic Generative Diffusion Models for Physically Grounded, Robust, and Safe Generation

Authors: Jacob K. Christopher, Michael Cardei, Jinhao Liang, Ferdinando Fioretto
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01121
Pdf URL: https://arxiv.org/pdf/2506.01121
Copy Paste: [[2506.01121]] Neuro-Symbolic Generative Diffusion Models for Physically Grounded, Robust, and Safe Generation(https://arxiv.org/abs/2506.01121)
Keywords: generation, generative
Abstract: Despite the remarkable generative capabilities of diffusion models, their integration into safety-critical or scientifically rigorous applications remains hindered by the need to ensure compliance with stringent physical, structural, and operational constraints. To address this challenge, this paper introduces Neuro-Symbolic Diffusion (NSD), a novel framework that interleaves diffusion steps with symbolic optimization, enabling the generation of certifiably consistent samples under user-defined functional and logic constraints. This key feature is provided for both standard and discrete diffusion models, enabling, for the first time, the generation of both continuous (e.g., images and trajectories) and discrete (e.g., molecular structures and natural language) outputs that comply with constraints. This ability is demonstrated on tasks spanning three key challenges: (1) Safety, in the context of non-toxic molecular generation and collision-free trajectory optimization; (2) Data scarcity, in domains such as drug discovery and materials engineering; and (3) Out-of-domain generalization, where enforcing symbolic constraints allows adaptation beyond the training distribution.

Title: FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation

Authors: Ariel Shaulov, Itay Hazan, Lior Wolf, Hila Chefer
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01144
Pdf URL: https://arxiv.org/pdf/2506.01144
Copy Paste: [[2506.01144]] FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation(https://arxiv.org/abs/2506.01144)
Keywords: generation
Abstract: Text-to-video diffusion models are notoriously limited in their ability to model temporal aspects such as motion, physics, and dynamic interactions. Existing approaches address this limitation by retraining the model or introducing external conditioning signals to enforce temporal consistency. In this work, we explore whether a meaningful temporal representation can be extracted directly from the predictions of a pre-trained model without any additional training or auxiliary inputs. We introduce FlowMo, a novel training-free guidance method that enhances motion coherence using only the model's own predictions in each diffusion step. FlowMo first derives an appearance-debiased temporal representation by measuring the distance between latents corresponding to consecutive frames. This highlights the implicit temporal structure predicted by the model. It then estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling. Extensive experiments across multiple text-to-video models demonstrate that FlowMo significantly improves motion coherence without sacrificing visual quality or prompt alignment, offering an effective plug-and-play solution for enhancing the temporal fidelity of pre-trained video diffusion models.
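
The two measurements named in this abstract (distances between consecutive-frame latents, then patch-wise variance over time) can be sketched with NumPy. This is an illustrative reading, not FlowMo's actual code: the element-wise absolute difference as the distance, the average pooling into square patches, the patch size, and the name `temporal_variance_score` are assumptions, and the guidance step that would differentiate this score is omitted.

```python
import numpy as np


def temporal_variance_score(latents: np.ndarray, patch: int = 2) -> float:
    """Mean patch-wise variance over time of consecutive-frame latent differences.

    latents has shape (T, C, H, W). Lower values mean the frame-to-frame change
    of each patch is more uniform over time.
    """
    # Distance between latents of consecutive frames (assumed: absolute difference).
    diffs = np.abs(latents[1:] - latents[:-1])          # (T-1, C, H, W)
    t, c, h, w = diffs.shape
    hp, wp = h // patch, w // patch
    diffs = diffs[:, :, : hp * patch, : wp * patch]     # crop to a multiple of patch
    # Average-pool each (patch x patch) window.
    patches = diffs.reshape(t, c, hp, patch, wp, patch).mean(axis=(3, 5))
    # Variance across the temporal dimension, averaged over channels and patches.
    return float(patches.var(axis=0).mean())


steady = np.arange(5, dtype=float).reshape(5, 1, 1, 1) * np.ones((5, 2, 4, 4))
jitter = np.array([0, 1, 0, 3, 0], dtype=float).reshape(5, 1, 1, 1) * np.ones((5, 2, 4, 4))
print(temporal_variance_score(steady), temporal_variance_score(jitter))
```

In this toy input, the steadily changing clip scores zero while the jittery clip scores higher, which matches the intuition that the guidance pushes sampling toward lower temporal variance.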

Title: Earley-Driven Dynamic Pruning for Efficient Structured Decoding

Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.01151
Pdf URL: https://arxiv.org/pdf/2506.01151
Copy Paste: [[2506.01151]] Earley-Driven Dynamic Pruning for Efficient Structured Decoding(https://arxiv.org/abs/2506.01151)
Keywords: generation
Abstract: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring that their outputs conform to strict structural or grammatical constraints, which is critical in function calls and domain-specific language (DSL) generation, remains challenging. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose ZapFormat, a novel dynamic pruning strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only consistently maintains high-precision compliant outputs but also achieves significant improvements in inference speed, up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at this https URL.
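
The token logits mask mentioned in this abstract can be sketched as follows. This toy does not use the Earley algorithm or Formatron's API: the grammar check is replaced by a trivial bracket-balance prefix test, and `is_valid_prefix`, `mask_logits`, and the four-token vocabulary are hypothetical names chosen for illustration.

```python
import math
from typing import Callable, List


def is_valid_prefix(text: str) -> bool:
    """Toy stand-in for a grammar check: a ']' may never appear before its '['."""
    depth = 0
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                return False
    return True


def mask_logits(
    logits: List[float],
    vocab: List[str],
    prefix: str,
    is_valid: Callable[[str], bool] = is_valid_prefix,
) -> List[float]:
    """Set the logit of every token that would break the format to -inf."""
    return [
        logit if is_valid(prefix + token) else -math.inf
        for logit, token in zip(logits, vocab)
    ]


vocab = ["[", "]", "1", "]]"]
logits = [0.1, 2.0, 0.3, 0.5]
print(mask_logits(logits, vocab, prefix=""))
print(mask_logits(logits, vocab, prefix="["))
```

Note that this sketch checks every vocabulary token at every step, which is exactly the per-step overhead the abstract says existing engines incur and that pruning and state caching aim to reduce.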

Title: FORT: Forward-Only Regression Training of Normalizing Flows

Authors: Danyal Rehman, Oscar Davis, Jiarui Lu, Jian Tang, Michael Bronstein, Yoshua Bengio, Alexander Tong, Avishek Joey Bose
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01158
Pdf URL: https://arxiv.org/pdf/2506.01158
Copy Paste: [[2506.01158]] FORT: Forward-Only Regression Training of Normalizing Flows(https://arxiv.org/abs/2506.01158)
Keywords: generation, generative
Abstract: Simulation-free training frameworks have been at the forefront of the generative modelling revolution in continuous spaces, leading to neural dynamical systems that encompass modern large-scale diffusion and flow matching models. Despite the scalability of training, the generation of high-quality samples and their corresponding likelihood under the model requires expensive numerical simulation -- inhibiting adoption in numerous scientific applications such as equilibrium sampling of molecular systems. In this paper, we revisit classical normalizing flows as one-step generative models with exact likelihoods and propose a novel, scalable training objective that does not require computing the expensive change of variable formula used in conventional maximum likelihood training. We propose Forward-Only Regression Training (FORT), a simple $\ell_2$-regression objective that maps prior samples under our flow to specifically chosen targets. We demonstrate that FORT supports a wide class of targets, such as optimal transport targets and targets from pre-trained continuous-time normalizing flows (CNF). We further demonstrate that by using CNF targets, our one-step flows allow for larger-scale training that exceeds the performance and stability of maximum likelihood training, while unlocking a broader class of architectures that were previously challenging to train. Empirically, we elucidate that our trained flows can perform equilibrium conformation sampling in Cartesian coordinates of alanine dipeptide, alanine tripeptide, and alanine tetrapeptide.
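
Read literally, the $\ell_2$-regression objective described in this abstract has the following generic form. The notation is an assumption introduced here for illustration, not taken from the paper: $f_\theta$ denotes the one-step flow, $p_0$ the prior, and $x^\ast(z)$ the chosen target for prior sample $z$ (for example an optimal transport target or the output of a pre-trained CNF); any additional terms the authors use are not shown.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{z \sim p_0}
    \left[ \left\lVert f_\theta(z) - x^\ast(z) \right\rVert_2^2 \right]
```

Because the loss only evaluates $f_\theta$ in the forward direction, it involves no Jacobian determinant, which is consistent with the abstract's claim that the change-of-variables computation is avoided during training.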

Title: Bridging Quantum and Classical Computing in Drug Design: Architecture Principles for Improved Molecule Generation

Authors: Andrew Smith, Erhan Guven
Subjects: cs.LG, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2506.01177
Pdf URL: https://arxiv.org/pdf/2506.01177
Copy Paste: [[2506.01177]] Bridging Quantum and Classical Computing in Drug Design: Architecture Principles for Improved Molecule Generation(https://arxiv.org/abs/2506.01177)
Keywords: generation, generative
Abstract: Hybrid quantum-classical machine learning offers a path to leverage noisy intermediate-scale quantum (NISQ) devices for drug discovery, but optimal model architectures remain unclear. We systematically optimize the quantum-classical bridge architecture for generative adversarial networks (GANs) in molecular discovery using multi-objective Bayesian optimization. Our optimized model (BO-QGAN) significantly improves performance, achieving a 2.27-fold higher Drug Candidate Score (DCS) than prior quantum-hybrid benchmarks and 2.21-fold higher than the classical baseline, using over 60% fewer parameters. Key findings favor layering multiple (3-4) shallow (4-8 qubit) quantum circuits sequentially, while classical architecture shows less sensitivity above a minimum capacity. This work provides the first empirically grounded architectural guidelines for hybrid models, enabling more effective integration of current quantum computers into pharmaceutical research pipelines.

Title: ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Authors: Yiyang Zhou, Yangfan He, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01300
Pdf URL: https://arxiv.org/pdf/2506.01300
Copy Paste: [[2506.01300]] ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding(https://arxiv.org/abs/2506.01300)
Keywords: generation
Abstract: Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism (adjusting predictions from conservative, neutral, and aggressive viewpoints) but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications (video understanding, video reasoning enhancement, and vision-language-action model alignment) demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.

Title: Recent Developments in GNNs for Drug Discovery

Authors: Zhengyu Fang, Xiaoge Zhang, Anyin Zhao, Xiao Li, Huiyuan Chen, Jing Li
Subjects: cs.LG, q-bio.QM
Abstract URL: https://arxiv.org/abs/2506.01302
Pdf URL: https://arxiv.org/pdf/2506.01302
Copy Paste: [[2506.01302]] Recent Developments in GNNs for Drug Discovery(https://arxiv.org/abs/2506.01302)
Keywords: generation
Abstract: In this paper, we review recent developments and the role of Graph Neural Networks (GNNs) in computational drug discovery, including molecule generation, molecular property prediction, and drug-drug interaction prediction. By summarizing the most recent developments in this area, we underscore the capabilities of GNNs to comprehend intricate molecular patterns, while exploring both their current and prospective applications. We initiate our discussion by examining various molecular representations, followed by detailed discussions and categorization of existing GNN models based on their input types and downstream application tasks. We also collect a list of commonly used benchmark datasets for a variety of applications. We conclude the paper with brief discussions and summarize common trends in this important research area.

Title: $Ψ$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models

Authors: Taehoon Yoon, Yunhong Min, Kyeongmin Yeo, Minhyuk Sung
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.01320
Pdf URL: https://arxiv.org/pdf/2506.01320
Copy Paste: [[2506.01320]] $Ψ$-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models(https://arxiv.org/abs/2506.01320)
Keywords: generation, generative
Abstract: We introduce $\Psi$-Sampler, an SMC-based framework incorporating pCNL-based initial particle sampling for effective inference-time reward alignment with a score-based generative model. Inference-time reward alignment with score-based generative models has recently gained significant traction, following a broader paradigm shift from pre-training to post-training optimization. At the core of this trend is the application of Sequential Monte Carlo (SMC) to the denoising process. However, existing methods typically initialize particles from the Gaussian prior, which inadequately captures reward-relevant regions and results in reduced sampling efficiency. We demonstrate that initializing from the reward-aware posterior significantly improves alignment performance. To enable posterior sampling in high-dimensional latent spaces, we introduce the preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines dimension-robust proposals with gradient-informed dynamics. This approach enables efficient and scalable posterior sampling and consistently improves performance across various reward alignment tasks, including layout-to-image generation, quantity-aware generation, and aesthetic-preference generation, as demonstrated in our experiments.

Title: Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation

Authors: Jinjin Zhang, Qiuyu Huang, Junjie Liu, Xiefan Guo, Di Huang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01331
Pdf URL: https://arxiv.org/pdf/2506.01331
Copy Paste: [[2506.01331]] Ultra-High-Resolution Image Synthesis: Data, Method and Evaluation(https://arxiv.org/abs/2506.01331)
Keywords: generation
Abstract: Ultra-high-resolution image synthesis holds significant potential, yet remains an underexplored challenge due to the absence of standardized benchmarks and computational constraints. In this paper, we establish Aesthetic-4K, a meticulously curated dataset containing dedicated training and evaluation subsets specifically designed for comprehensive research on ultra-high-resolution image synthesis. This dataset consists of high-quality 4K images accompanied by descriptive captions generated by GPT-4o. Furthermore, we propose Diffusion-4K, an innovative framework for the direct generation of ultra-high-resolution images. Our approach incorporates the Scale Consistent Variational Auto-Encoder (SC-VAE) and Wavelet-based Latent Fine-tuning (WLF), which are designed for efficient visual token compression and the capture of intricate details in ultra-high-resolution images, thereby facilitating direct training with photorealistic 4K data. This method is applicable to various latent diffusion models and demonstrates its efficacy in synthesizing highly detailed 4K images. Additionally, we propose novel metrics, namely the GLCM Score and Compression Ratio, to assess the texture richness and fine details in local patches, in conjunction with holistic measures such as FID, Aesthetics, and CLIPScore, enabling a thorough and multifaceted evaluation of ultra-high-resolution image synthesis. Consequently, Diffusion-4K achieves impressive performance in ultra-high-resolution image synthesis, particularly when powered by state-of-the-art large-scale diffusion models (e.g., Flux-12B). The source code is publicly available at this https URL.
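The abstract does not define the GLCM Score, so the following is only a generic sketch of the underlying idea: a gray-level co-occurrence matrix (GLCM) built from horizontally adjacent pixels, summarized by its entropy as a rough texture-richness measure. The function names, the choice of 8 gray levels, and the use of entropy are assumptions for illustration, not the paper's metric.

```python
import numpy as np

def glcm(gray, levels=8):
    """Normalized horizontal co-occurrence matrix for an image with values in [0, 1]."""
    # Quantize intensities into `levels` bins (1.0 is clipped into the top bin).
    q = np.minimum((gray * levels).astype(int), levels - 1)
    left, right = q[:, :-1].ravel(), q[:, 1:].ravel()
    counts = np.zeros((levels, levels))
    np.add.at(counts, (left, right), 1)  # count each (left, right) pixel pair
    return counts / counts.sum()

def glcm_entropy(gray, levels=8):
    """Entropy (in bits) of the GLCM; higher values indicate richer local texture."""
    p = glcm(gray, levels)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

flat = np.full((32, 32), 0.5)                      # textureless patch
noise = np.random.default_rng(0).random((32, 32))  # highly textured patch
flat_entropy = glcm_entropy(flat)
noise_entropy = glcm_entropy(noise)
```

A constant patch concentrates all mass in one GLCM cell and scores zero entropy, while a noisy patch spreads mass across many cells and scores much higher.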

Title: NoiseAR: AutoRegressing Initial Noise Prior for Diffusion Models

Authors: Zeming Li, Xiangyue Liu, Xiangyu Zhang, Ping Tan, Heung-Yeung Shum
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01337
Pdf URL: https://arxiv.org/pdf/2506.01337
Copy Paste: [[2506.01337]] NoiseAR: AutoRegressing Initial Noise Prior for Diffusion Models(https://arxiv.org/abs/2506.01337)
Keywords: generation, generative
Abstract: Diffusion models have emerged as powerful generative frameworks, creating data samples by progressively denoising an initial random state. Traditionally, this initial state is sampled from a simple, fixed distribution such as an isotropic Gaussian, which inherently lacks structure and a direct mechanism for external control. While recent efforts have explored ways to introduce controllability into the diffusion process, particularly at the initialization stage, they often rely on deterministic or heuristic approaches. These methods can be suboptimal, lack expressiveness, and are difficult to scale or integrate into more sophisticated optimization frameworks. In this paper, we introduce NoiseAR, a novel method for AutoRegressive Initial Noise Prior for Diffusion Models. Instead of a static, unstructured source, NoiseAR learns to generate a dynamic and controllable prior distribution for the initial noise. We formulate the generation of the initial noise prior's parameters as an autoregressive probabilistic modeling task over spatial patches or tokens. This approach enables NoiseAR to capture complex spatial dependencies and introduce learned structure into the initial state. Crucially, NoiseAR is designed to be conditional, allowing text prompts to directly influence the learned prior, thereby achieving fine-grained control over the diffusion initialization. Our experiments demonstrate that NoiseAR can generate initial noise priors that lead to improved sample quality and enhanced consistency with conditional inputs, offering a powerful, learned alternative to traditional random initialization. A key advantage of NoiseAR is its probabilistic formulation, which naturally supports seamless integration into probabilistic frameworks like Markov Decision Processes and Reinforcement Learning. Our code will be available at this https URL.

Title: A 2-Stage Model for Vehicle Class and Orientation Detection with Photo-Realistic Image Generation

Authors: Youngmin Kim, Donghwa Kang, Hyeongboo Baek
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01338
Pdf URL: https://arxiv.org/pdf/2506.01338
Copy Paste: [[2506.01338]] A 2-Stage Model for Vehicle Class and Orientation Detection with Photo-Realistic Image Generation(https://arxiv.org/abs/2506.01338)
Keywords: generation
Abstract: We aim to detect the class and orientation of a vehicle by training a model with synthetic data. However, the distribution of the classes in the training data is imbalanced, and a model trained on synthetic images struggles to predict accurately on real-world images. We propose a two-stage detection model with photo-realistic image generation to tackle this issue. Our model takes four main steps to detect the class and orientation of the vehicle: (1) it builds a table containing the image, class, and location information of objects in the image; (2) it transforms the synthetic images into a real-world image style and merges them into the meta table; (3) it classifies vehicle class and orientation using images from the meta table; (4) finally, it detects the vehicle class and orientation by combining the pre-extracted location information and the predicted classes. With this approach, we achieved 4th place in the IEEE BigData Challenge 2022 Vehicle class and Orientation Detection (VOD).

Title: TimeGraph: Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery

Authors: Muhammad Hasan Ferdous, Emam Hossain, Md Osman Gani
Subjects: cs.LG, cs.IR, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01361
Pdf URL: https://arxiv.org/pdf/2506.01361
Copy Paste: [[2506.01361]] TimeGraph: Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery(https://arxiv.org/abs/2506.01361)
Keywords: generation
Abstract: Robust causal discovery in time series datasets depends on reliable benchmark datasets with known ground-truth causal relationships. However, such datasets remain scarce, and existing synthetic alternatives often overlook critical temporal properties inherent in real-world data, including nonstationarity driven by trends and seasonality, irregular sampling intervals, and the presence of unobserved confounders. To address these challenges, we introduce TimeGraph, a comprehensive suite of synthetic time-series benchmark datasets that systematically incorporates both linear and nonlinear dependencies while modeling key temporal characteristics such as trends, seasonal effects, and heterogeneous noise patterns. Each dataset is accompanied by a fully specified causal graph featuring varying densities and diverse noise distributions and is provided in two versions: one including unobserved confounders and one without, thereby offering extensive coverage of real-world complexity while preserving methodological neutrality. We further demonstrate the utility of TimeGraph through systematic evaluations of state-of-the-art causal discovery algorithms including PCMCI+, LPCMCI, and FGES across a diverse array of configurations and metrics. Our experiments reveal significant variations in algorithmic performance under realistic temporal conditions, underscoring the need for robust synthetic benchmarks in the fair and transparent assessment of causal discovery methods. The complete TimeGraph suite, including dataset generation scripts, evaluation metrics, and recommended experimental protocols, is freely available to facilitate reproducible research and foster community-driven advancements in time-series causal discovery.
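To make the idea of a synthetic benchmark with a known ground-truth graph concrete, here is a minimal sketch. It is not the TimeGraph generator: it assumes a single lagged linear link from `x` to `y`, a linear trend, and a sinusoidal seasonal term, and every name and coefficient below is hypothetical.

```python
import numpy as np

def make_series(n=500, lag=1, coef=0.8, seed=0):
    """Generate two series where x causes y at the given lag, plus the true graph."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    trend = 0.01 * t                        # nonstationarity from a linear trend
    season = np.sin(2 * np.pi * t / 50)     # seasonal component with period 50
    x = trend + season + rng.standard_normal(n)
    y = np.zeros(n)
    y[lag:] = coef * x[:-lag]               # lagged causal effect x -> y
    y += 0.5 * rng.standard_normal(n)       # independent noise on the effect
    graph = {("x", "y"): lag}               # ground truth: edge and its lag
    return x, y, graph

x, y, graph = make_series()
lagged_corr = np.corrcoef(x[:-1], y[1:])[0, 1]
```

Because the generating graph is returned alongside the data, a causal discovery method's output can be scored directly against `graph`.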

Title: Synthetic Data Augmentation using Pre-trained Diffusion Models for Long-tailed Food Image Classification

Authors: GaYeon Koh, Hyun-Jic Oh, Jeonghyun Noh, Won-Ki Jeong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01368
Pdf URL: https://arxiv.org/pdf/2506.01368
Copy Paste: [[2506.01368]] Synthetic Data Augmentation using Pre-trained Diffusion Models for Long-tailed Food Image Classification(https://arxiv.org/abs/2506.01368)
Keywords: generation, generative
Abstract: Deep learning-based food image classification enables precise identification of food categories, further facilitating accurate nutritional analysis. However, real-world food images often show a skewed distribution, with some food types being more prevalent than others. This class imbalance can be problematic, causing models to favor the majority (head) classes with overall performance degradation for the less common (tail) classes. Recently, synthetic data augmentation using diffusion-based generative models has emerged as a promising solution to address this issue. By generating high-quality synthetic images, these models can help uniformize the data distribution, potentially improving classification performance. However, existing approaches face challenges: fine-tuning-based methods need a uniformly distributed dataset, while pre-trained model-based approaches often overlook inter-class separation in synthetic data. In this paper, we propose a two-stage synthetic data augmentation framework, leveraging pre-trained diffusion models for long-tailed food classification. We generate a reference set conditioned by a positive prompt on the generation target and then select a class that shares similar features with the generation target as a negative prompt. Subsequently, we generate a synthetic augmentation set using positive and negative prompt conditions by a combined sampling strategy that promotes intra-class diversity and inter-class separation. We demonstrate the efficacy of the proposed method on two long-tailed food benchmark datasets, achieving superior performance compared to previous works in terms of top-1 accuracy.

Title: Incentivizing LLMs to Self-Verify Their Answers

Authors: Fuxiang Zhang, Jiacheng Xu, Chaojie Wang, Ce Cui, Yang Liu, Bo An
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01369
Pdf URL: https://arxiv.org/pdf/2506.01369
Copy Paste: [[2506.01369]] Incentivizing LLMs to Self-Verify Their Answers(https://arxiv.org/abs/2506.01369)
Keywords: generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling. Our code is available at this https URL.

Title: PointT2I: LLM-based text-to-image generation via keypoints

Authors: Taekyung Lee, Donggyu Lee, Myungjoo Kang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01370
Pdf URL: https://arxiv.org/pdf/2506.01370
Copy Paste: [[2506.01370]] PointT2I: LLM-based text-to-image generation via keypoints(https://arxiv.org/abs/2506.01370)
Keywords: generation
Abstract: Text-to-image (T2I) generation models have made significant advancements, producing high-quality images aligned with an input prompt. However, despite their ability to generate fine-grained images, T2I models still struggle to generate images accurately when the input prompt contains complex concepts, especially human pose. In this paper, we propose PointT2I, a framework that uses a large language model (LLM) to generate images that accurately correspond to the human pose described in the prompt. PointT2I consists of three components: keypoint generation, image generation, and a feedback system. The keypoint generation component uses an LLM to directly generate keypoints corresponding to a human pose, based solely on the input prompt, without external references. Subsequently, the image generation component produces images based on both the text prompt and the generated keypoints to accurately reflect the target pose. To refine the outputs of the preceding stages, we incorporate an LLM-based feedback system that assesses the semantic consistency between the generated contents and the given prompts. Our framework is the first approach to leverage an LLM for keypoint-guided image generation without any fine-tuning, producing accurate pose-aligned images based solely on textual prompts.

Title: RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes

Authors: Pou-Chun Kung, Skanda Harisha, Ram Vasudevan, Aline Eid, Katherine A. Skinner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01379
Pdf URL: https://arxiv.org/pdf/2506.01379
Copy Paste: [[2506.01379]] RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes(https://arxiv.org/abs/2506.01379)
Keywords: generation
Abstract: High-Fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.4 PSNR / 2.6x SSIM) and improved geometric reconstruction (-40% RMSE / 1.5x Accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction. A project page is available at this https URL.

Title: Playing with Transformer at 30+ FPS via Next-Frame Diffusion

Authors: Xinle Cheng, Tianyu He, Jiayi Xu, Junliang Guo, Di He, Jiang Bian
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01380
Pdf URL: https://arxiv.org/pdf/2506.01380
Copy Paste: [[2506.01380]] Playing with Transformer at 30+ FPS via Next-Frame Diffusion(https://arxiv.org/abs/2506.01380)
Keywords: generation
Abstract: Autoregressive video models offer distinct advantages over bidirectional diffusion models in creating interactive video content and supporting streaming applications with arbitrary duration. In this work, we present Next-Frame Diffusion (NFD), an autoregressive diffusion transformer that incorporates block-wise causal attention, enabling iterative sampling and efficient inference via parallel token generation within each frame. Nonetheless, achieving real-time video generation remains a significant challenge for such models, primarily due to the high computational cost associated with diffusion sampling and the hardware inefficiencies inherent to autoregressive generation. To address this, we introduce two innovations: (1) We extend consistency distillation to the video domain and adapt it specifically for video models, enabling efficient inference with few sampling steps; (2) To fully leverage parallel computation, motivated by the observation that adjacent frames often share the identical action input, we propose speculative sampling. In this approach, the model generates the next few frames using the current action input and discards the speculatively generated frames if the input action differs. Experiments on a large-scale action-conditioned video generation benchmark demonstrate that NFD beats autoregressive baselines in terms of both visual quality and sampling efficiency. For the first time, we achieve autoregressive video generation at over 30 Frames Per Second (FPS) on an A100 GPU using a 310M model.
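The control flow of speculative sampling, as described in the abstract, can be sketched as follows. This is a toy illustration rather than the authors' implementation: `generate_frames` stands in for a hypothetical model call that produces several frames in parallel under one action, and frames are represented as strings.

```python
from typing import Callable, List, Sequence

def speculative_rollout(
    generate_frames: Callable[[int, str, int], List[str]],
    actions: Sequence[str],
    lookahead: int = 3,
) -> List[str]:
    """Return one frame per action, speculating `lookahead` frames per model call."""
    frames: List[str] = []
    while len(frames) < len(actions):
        t = len(frames)
        current = actions[t]
        # Speculate that the current action repeats for the next few frames.
        batch = generate_frames(t, current, lookahead)
        for k, frame in enumerate(batch):
            if t + k >= len(actions) or actions[t + k] != current:
                break  # action changed: discard the remaining speculated frames
            frames.append(frame)
    return frames

# Hypothetical stand-in for the video model: labels each frame with index and action.
def toy_generator(start: int, action: str, n: int) -> List[str]:
    return [f"{start + k}:{action}" for k in range(n)]

actions = ["L", "L", "L", "R", "R", "L"]
frames = speculative_rollout(toy_generator, actions)
```

In this toy run the six frames are produced with three generator calls instead of six. The saving comes from runs of identical actions, which is the observation the abstract cites as motivation.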

Title: NTIRE 2025 the 2nd Restore Any Image Model (RAIM) in the Wild Challenge

Authors: Jie Liang, Radu Timofte, Qiaosi Yi, Zhengqiang Zhang, Shuaizheng Liu, Lingchen Sun, Rongyuan Wu, Xindong Zhang, Hui Zeng, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01394
Pdf URL: https://arxiv.org/pdf/2506.01394
Copy Paste: [[2506.01394]] NTIRE 2025 the 2nd Restore Any Image Model (RAIM) in the Wild Challenge(https://arxiv.org/abs/2506.01394)
Keywords: restoration, generation
Abstract: In this paper, we present a comprehensive overview of the NTIRE 2025 challenge on the 2nd Restore Any Image Model (RAIM) in the Wild. This challenge established a new benchmark for real-world image restoration, featuring diverse scenarios with and without reference ground truth. Participants were tasked with restoring real-captured images suffering from complex and unknown degradations, where both perceptual quality and fidelity were critically evaluated. The challenge comprised two tracks: (1) the low-light joint denoising and demosaicing (JDD) task, and (2) the image detail enhancement/generation task. Each track included two sub-tasks. The first sub-task involved paired data with available ground truth, enabling quantitative evaluation. The second sub-task dealt with real-world yet unpaired images, emphasizing restoration efficiency and subjective quality assessed through a comprehensive user study. In total, the challenge attracted nearly 300 registrations, with 51 teams submitting more than 600 results. The top-performing methods advanced the state of the art in image restoration and received unanimous recognition from all 20+ expert judges. The datasets used in Track 1 and Track 2 are available at this https URL and this https URL, respectively. The official challenge pages for Track 1 and Track 2 can be found at this https URL and this https URL.

Title: Self-supervised Latent Space Optimization with Nebula Variational Coding

Authors: Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari
Subjects: cs.LG, cs.IT
Abstract URL: https://arxiv.org/abs/2506.01414
Pdf URL: https://arxiv.org/pdf/2506.01414
Copy Paste: [[2506.01414]] Self-supervised Latent Space Optimization with Nebula Variational Coding(https://arxiv.org/abs/2506.01414)
Keywords: generative
Abstract: Deep learning approaches process data in a layer-by-layer way with intermediate (or latent) features. We aim to design a general solution that optimizes the latent manifolds to improve performance on classification, segmentation, completion and/or reconstruction through probabilistic models. This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called nebula anchors, that guide the latent variables to form clusters during training. To prevent the anchors from clustering among themselves, we employ the variational constraint that enforces the latent features within an anchor to form a Gaussian distribution, resulting in a generative model we refer to as Nebula Variational Coding (NVC). Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit. As a consequence, the latent variables of our variational coder form clusters which adapt to the generated semantics of the training data, e.g., the categorical labels of each sample. We demonstrate experimentally that it can be used within different architectures designed to solve different problems including text sequences, images, 3D point clouds and volumetric data, validating the advantage of our proposed method.

Title: DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing

Authors: Chenxi Xie, Minghan Li, Shuai Li, Yuhui Wu, Qiaosi Yi, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01430
Pdf URL: https://arxiv.org/pdf/2506.01430
Copy Paste: [[2506.01430]] DNAEdit: Direct Noise Alignment for Text-Guided Rectified Flow Editing(https://arxiv.org/abs/2506.01430)
Keywords: generation
Abstract: Leveraging the powerful generation capability of large-scale pretrained text-to-image models, training-free methods have demonstrated impressive image editing results. Conventional diffusion-based methods, as well as recent rectified flow (RF)-based methods, typically reverse synthesis trajectories by gradually adding noise to clean images, during which the noisy latent at the current timestep is used to approximate that at the next timestep, introducing accumulated drift and degrading reconstruction accuracy. Considering the fact that in RF the noisy latent is estimated through direct interpolation between Gaussian noise and clean images at each timestep, we propose Direct Noise Alignment (DNA), which directly refines the desired Gaussian noise in the noise domain, significantly reducing the error accumulation in previous methods. Specifically, DNA estimates the velocity field of the interpolated noised latent at each timestep and adjusts the Gaussian noise by computing the difference between the predicted and expected velocity field. We validate the effectiveness of DNA and reveal its relationship with existing RF-based inversion methods. Additionally, we introduce a Mobile Velocity Guidance (MVG) to control the target prompt-guided generation process, balancing image background preservation and target object editability. DNA and MVG collectively constitute our proposed method, namely DNAEdit. Finally, we introduce DNA-Bench, a long-prompt benchmark, to evaluate the performance of advanced image editing models. Experimental results demonstrate that our DNAEdit achieves superior performance to state-of-the-art text-guided editing methods. Codes and benchmark will be available at this https URL.

Title: DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion

Authors: Geunmin Hwang, Hyun-kyu Ko, Younghyun Kim, Seungryong Lee, Eunbyung Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01454
Pdf URL: https://arxiv.org/pdf/2506.01454
Copy Paste: [[2506.01454]] DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion(https://arxiv.org/abs/2506.01454)
Keywords: generation
Abstract: Recent advancements in diffusion models have revolutionized video generation, enabling the creation of high-quality, temporally consistent videos. However, generating high frame-rate (FPS) videos remains a significant challenge due to issues such as flickering and degradation in long sequences, particularly in fast-motion scenarios. Existing methods often suffer from computational inefficiencies and limitations in maintaining video quality over extended frames. In this paper, we present a novel, training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising, to achieve smooth, consistent video outputs without the need for additional fine-tuning. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity. The proposed method is not only computationally efficient but also adaptable to various video generation tasks, making it ideal for applications such as virtual reality, video games, and high-quality content creation.

Title: Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark

Authors: Shuyu Yang, Yilun Wang, Yaxiong Wang, Li Zhu, Zhedong Zheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01466
Pdf URL: https://arxiv.org/pdf/2506.01466
Copy Paste: [[2506.01466]] Towards Scalable Video Anomaly Retrieval: A Synthetic Video-Text Benchmark(https://arxiv.org/abs/2506.01466)
Keywords: generative
Abstract: Video anomaly retrieval aims to localize anomalous events in videos using natural language queries to facilitate public safety. However, existing datasets suffer from severe limitations: (1) data scarcity due to the long-tail nature of real-world anomalies, and (2) privacy constraints that impede large-scale collection. To address the aforementioned issues in one go, we introduce SVTA (Synthetic Video-Text Anomaly benchmark), the first large-scale dataset for cross-modal anomaly retrieval, leveraging generative models to overcome data availability challenges. Specifically, we collect and generate video descriptions via the off-the-shelf LLM (Large Language Model) covering 68 anomaly categories, e.g., throwing, stealing, and shooting. These descriptions encompass common long-tail events. We adopt these texts to guide the video generative model to produce diverse and high-quality videos. Finally, our SVTA involves 41,315 videos (1.36M frames) with paired captions, covering 30 normal activities, e.g., standing, walking, and sports, and 68 anomalous events, e.g., falling, fighting, theft, explosions, and natural disasters. We adopt three widely-used video-text retrieval baselines to comprehensively test our SVTA, revealing SVTA's challenging nature and its effectiveness in evaluating a robust cross-modal retrieval method. SVTA eliminates privacy risks associated with real-world anomaly collection while maintaining realistic scenarios. The dataset demo is available at: [this https URL].

Title: Feature-aware Hypergraph Generation via Next-Scale Prediction

Authors: Dorian Gailhard, Enzo Tartaglione, Lirida Naviner, Jhony H. Giraldo
Subjects: cs.LG, cs.DM
Abstract URL: https://arxiv.org/abs/2506.01467
Pdf URL: https://arxiv.org/pdf/2506.01467
Copy Paste: [[2506.01467]] Feature-aware Hypergraph Generation via Next-Scale Prediction(https://arxiv.org/abs/2506.01467)
Keywords: generation, generative
Abstract: Hypergraphs generalize traditional graphs by allowing hyperedges to connect multiple nodes, making them well-suited for modeling complex structures with higher-order relationships, such as 3D meshes, molecular systems, and electronic circuits. While topology is central to hypergraph structure, many real-world applications also require node and hyperedge features. Existing hypergraph generation methods focus solely on topology, often overlooking feature modeling. In this work, we introduce FAHNES (feature-aware hypergraph generation via next-scale prediction), a hierarchical approach that jointly generates hypergraph topology and features. FAHNES builds a multi-scale representation through node coarsening, then learns to reconstruct finer levels via localized expansion and refinement, guided by a new node budget mechanism that controls cluster splitting. We evaluate FAHNES on synthetic hypergraphs, 3D meshes, and molecular datasets. FAHNES achieves competitive results in reconstructing topology and features, establishing a foundation for future research in featured hypergraph generative modeling.

Title: Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation

Authors: Kaihang Pan, Yang Wu, Wendong Bu, Kai Shen, Juncheng Li, Yingting Wang, Yunfei Li, Siliang Tang, Jun Xiao, Fei Wu, Hang Zhao, Yueting Zhuang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01480
Pdf URL: https://arxiv.org/pdf/2506.01480
Copy Paste: [[2506.01480]] Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation(https://arxiv.org/abs/2506.01480)
Keywords: generation
Abstract: Recent endeavors in Multimodal Large Language Models (MLLMs) aim to unify visual comprehension and generation. However, these two capabilities remain largely independent, as if they are two separate functions encapsulated within the same model. Consequently, visual comprehension does not enhance visual generation, and the reasoning mechanisms of LLMs have not been fully integrated to revolutionize image generation. In this paper, we propose to enable the collaborative co-evolution of visual comprehension and generation, advancing image generation into an iterative introspective process. We introduce a two-stage training approach: supervised fine-tuning teaches the MLLM with the foundational ability to generate genuine CoT for visual generation, while reinforcement learning activates its full potential via an exploration-exploitation trade-off. Ultimately, we unlock the Aha moment in visual generation, advancing MLLMs from text-to-image tasks to unified image generation. Extensive experiments demonstrate that our model not only excels in text-to-image generation and image editing, but also functions as a superior image semantic evaluator with enhanced visual comprehension capabilities. Project Page: this https URL.

Title: Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?

Authors: Zijian Zhao, Dian Jin, Zijing Zhou, Xiaoyu Zhang
Subjects: cs.LG, cs.AI, cs.MM, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01482
Pdf URL: https://arxiv.org/pdf/2506.01482
Copy Paste: [[2506.01482]] Automatic Stage Lighting Control: Is it a Rule-Driven Process or Generative Task?(https://arxiv.org/abs/2506.01482)
Keywords: generative
Abstract: Stage lighting plays an essential role in live music performances, influencing the engaging experience of both musicians and audiences. Given the high costs associated with hiring or training professional lighting engineers, Automatic Stage Lighting Control (ASLC) has gained increasing attention. However, most existing approaches only classify music into limited categories and map them to predefined light patterns, resulting in formulaic and monotonous outcomes that lack rationality. To address this issue, this paper presents an end-to-end solution that directly learns from experienced lighting engineers -- Skip-BART. To the best of our knowledge, this is the first work to conceptualize ASLC as a generative task rather than merely a classification problem. Our method modifies the BART model to take audio music as input and produce light hue and value (intensity) as output, incorporating a novel skip connection mechanism to enhance the relationship between music and light within the frame. We validate our method through both quantitative analysis and a human evaluation, demonstrating that Skip-BART outperforms conventional rule-based methods across all evaluation metrics and shows only a limited gap compared to real lighting engineers. Our method yields a p-value of 0.72 in a statistical comparison based on human evaluations with human lighting engineers, suggesting that the proposed approach closely matches human lighting engineering performance. To support further research, we have made our self-collected dataset, code, and trained model parameters available at this https URL.

Title: FDSG: Forecasting Dynamic Scene Graphs

Authors: Yi Yang, Yuren Cong, Hao Cheng, Bodo Rosenhahn, Michael Ying Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01487
Pdf URL: https://arxiv.org/pdf/2506.01487
Copy Paste: [[2506.01487]] FDSG: Forecasting Dynamic Scene Graphs(https://arxiv.org/abs/2506.01487)
Keywords: generation
Abstract: Dynamic scene graph generation extends scene graph generation from images to videos by modeling entity relationships and their temporal evolution. However, existing methods either generate scene graphs from observed frames without explicitly modeling temporal dynamics, or predict only relationships while assuming static entity labels and locations. These limitations hinder effective extrapolation of both entity and relationship dynamics, restricting video scene understanding. We propose Forecasting Dynamic Scene Graphs (FDSG), a novel framework that predicts future entity labels, bounding boxes, and relationships, for unobserved frames, while also generating scene graphs for observed frames. Our scene graph forecast module leverages query decomposition and neural stochastic differential equations to model entity and relationship dynamics. A temporal aggregation module further refines predictions by integrating forecasted and observed information via cross-attention. To benchmark FDSG, we introduce Scene Graph Forecasting, a new task for full future scene graph prediction. Experiments on Action Genome show that FDSG outperforms state-of-the-art methods on dynamic scene graph generation, scene graph anticipation, and scene graph forecasting. Codes will be released upon publication.

Title: Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity

Authors: Yuya Kobayashi, Yuhta Takida, Takashi Shibuya, Yuki Mitsufuji
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01493
Pdf URL: https://arxiv.org/pdf/2506.01493
Copy Paste: [[2506.01493]] Efficiency without Compromise: CLIP-aided Text-to-Image GANs with Increased Diversity(https://arxiv.org/abs/2506.01493)
Keywords: generation, generative
Abstract: Recently, Generative Adversarial Networks (GANs) have been successfully scaled to billion-scale large text-to-image datasets. However, training such models entails a high training cost, limiting some applications and research usage. To reduce the cost, one promising direction is the incorporation of pre-trained models. The existing method of utilizing pre-trained models for a generator significantly reduced the training cost compared with the other large-scale GANs, but we found the model loses the diversity of generation for a given prompt by a large margin. To build an efficient and high-fidelity text-to-image GAN without compromise, we propose to use two specialized discriminators with Slicing Adversarial Networks (SANs) adapted for text-to-image tasks. Our proposed model, called SCAD, shows a notable enhancement in diversity for a given prompt with better sample fidelity. We also propose to use a metric called Per-Prompt Diversity (PPD) to evaluate the diversity of text-to-image models quantitatively. SCAD achieved a zero-shot FID competitive with the latest large-scale GANs at two orders of magnitude less training cost.

Title: Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment

Authors: Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01511
Pdf URL: https://arxiv.org/pdf/2506.01511
Copy Paste: [[2506.01511]] Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment(https://arxiv.org/abs/2506.01511)
Keywords: generation
Abstract: Preference alignment in diffusion models has primarily focused on benign human preferences (e.g., aesthetic). In this paper, we propose a novel perspective: framing unrestricted adversarial example generation as a problem of aligning with adversary preferences. Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success). To address this, we propose APA (Adversary Preferences Alignment), a two-stage framework that decouples conflicting preferences and optimizes each with differentiable rewards. In the first stage, APA fine-tunes LoRA to improve visual consistency using rule-based similarity reward. In the second stage, APA updates either the image latent or prompt embedding based on feedback from a substitute classifier, guided by trajectory-level and step-wise rewards. To enhance black-box transferability, we further incorporate a diffusion augmentation strategy. Experiments demonstrate that APA achieves significantly better attack transferability while maintaining high visual consistency, inspiring further research to approach adversarial attacks from an alignment perspective. Code will be available at this https URL.

Title: Beyond Diagonal Covariance: Flexible Posterior VAEs via Free-Form Injective Flows

Authors: Peter Sorrenson, Lukas Lührs, Hans Olischläger, Ullrich Köthe
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01522
Pdf URL: https://arxiv.org/pdf/2506.01522
Copy Paste: [[2506.01522]] Beyond Diagonal Covariance: Flexible Posterior VAEs via Free-Form Injective Flows(https://arxiv.org/abs/2506.01522)
Keywords: generative
Abstract: Variational Autoencoders (VAEs) are powerful generative models widely used for learning interpretable latent spaces, quantifying uncertainty, and compressing data for downstream generative tasks. VAEs typically rely on diagonal Gaussian posteriors due to computational constraints. Using arguments grounded in differential geometry, we demonstrate inherent limitations in the representational capacity of diagonal covariance VAEs, as illustrated by explicit low-dimensional examples. In response, we show that a regularized variant of the recently introduced Free-form Injective Flow (FIF) can be interpreted as a VAE featuring a highly flexible, implicitly defined posterior. Crucially, this regularization yields a posterior equivalent to a full Gaussian covariance distribution, yet maintains computational costs comparable to standard diagonal covariance VAEs. Experiments on image datasets validate our approach, demonstrating that incorporating full covariance substantially improves model likelihood.

Title: G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models

Authors: Tianjiao Zhang, Fei Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.01539
Pdf URL: https://arxiv.org/pdf/2506.01539
Copy Paste: [[2506.01539]] G4Seg: Generation for Inexact Segmentation Refinement with Diffusion Models(https://arxiv.org/abs/2506.01539)
Keywords: generation, generative
Abstract: This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Unlike traditional approaches that rely heavily on discriminative-model-based paradigms or dense visual representations derived from internal attention mechanisms, our method focuses on the intrinsic generative priors in Stable Diffusion (SD). Specifically, we exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement by establishing a semantic correspondence alignment and updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design, underscoring the potential of leveraging generation discrepancies to model dense representations and encouraging further exploration of generative approaches for solving discriminative tasks.

Title: Adaptive Destruction Processes for Diffusion Samplers

Authors: Timofei Gritsaev, Nikita Morozov, Kirill Tamogashev, Daniil Tiapkin, Sergey Samsonov, Alexey Naumov, Dmitry Vetrov, Nikolay Malkin
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01541
Pdf URL: https://arxiv.org/pdf/2506.01541
Copy Paste: [[2506.01541]] Adaptive Destruction Processes for Diffusion Samplers(https://arxiv.org/abs/2506.01541)
Keywords: generation, generative
Abstract: This paper explores the challenges and benefits of a trainable destruction process in diffusion samplers -- diffusion-based generative models trained to sample an unnormalised density without access to data samples. Contrary to the majority of work that views diffusion samplers as approximations to an underlying continuous-time model, we view diffusion models as discrete-time policies trained to produce samples in very few generation steps. We propose to trade some of the elegance of the underlying theory for flexibility in the definition of the generative and destruction policies. In particular, we decouple the generation and destruction variances, enabling both transition kernels to be learned as unconstrained Gaussian densities. We show that, when the number of steps is limited, training both generation and destruction processes results in faster convergence and improved sampling quality on various benchmarks. Through a robust ablation study, we investigate the design choices necessary to facilitate stable training. Finally, we show the scalability of our approach through experiments on GAN latent space sampling for conditional image generation.

Title: LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model

Authors: Xiaodong Wang, Zhirong Wu, Peixi Peng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01546
Pdf URL: https://arxiv.org/pdf/2506.01546
Copy Paste: [[2506.01546]] LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model(https://arxiv.org/abs/2506.01546)
Keywords: generation
Abstract: Driving world models are used to simulate futures by video generation based on the condition of the current state and actions. However, current models often suffer serious error accumulations when predicting the long-term future, which limits the practical application. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. However, these models are always trained on short video clips (high fps and short duration), and multiple roll-out generations struggle to produce consistent and reasonable long videos due to the training-inference gap. To this end, we propose several solutions to build a simple yet effective long-term driving world model. First, we hierarchically decouple world model learning into large motion learning and bidirectional continuous motion learning. Then, considering the continuity of driving scenes, we propose a simple distillation method where fine-grained video flows are self-supervised signals for coarse-grained flows. The distillation is designed to improve the coherence of infinite video generation. The coarse-grained and fine-grained modules are coordinated to generate long-term and temporally coherent videos. In the public benchmark NuScenes, compared with the state-of-the-art front-view model, our model improves FVD by 27% and reduces inference time by 85% for the video task of generating 110+ frames. More videos (including 90s duration) are available at this https URL.

Title: HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception

Authors: Wei Yao, Yunlian Sun, Hongwen Zhang, Yebin Liu, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01579
Pdf URL: https://arxiv.org/pdf/2506.01579
Copy Paste: [[2506.01579]] HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception(https://arxiv.org/abs/2506.01579)
Keywords: generation
Abstract: Generating high-fidelity full-body human interactions with dynamic objects and static scenes remains a critical challenge in computer graphics and animation. Existing methods for human-object interaction often neglect scene context, leading to implausible penetrations, while human-scene interaction approaches struggle to coordinate fine-grained manipulations with long-range navigation. To address these limitations, we propose HOSIG, a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our method decouples the task into three key components: 1) a scene-aware grasp pose generator that ensures collision-free whole-body postures with precise hand-object contact by integrating local geometry constraints, 2) a heuristic navigation algorithm that autonomously plans obstacle-avoiding paths in complex indoor environments via compressed 2D floor maps and dual-component spatial reasoning, and 3) a scene-guided motion diffusion model that generates trajectory-controlled, full-body motions with finger-level accuracy by incorporating spatial anchors and dual-space classifier-free guidance. Extensive experiments on the TRUMANS dataset demonstrate superior performance over state-of-the-art methods. Notably, our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation, advancing the frontier of embodied interaction synthesis. Codes will be available after publication. Project page: this http URL

Title: EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models

Authors: Andy Bonnetto, Haozhe Qi, Franklin Leong, Matea Tashkovska, Mahdi Rad, Solaiman Shokur, Friedhelm Hummel, Silvestro Micera, Marc Pollefeys, Alexander Mathis
Subjects: cs.CV, cs.AI, cs.LG, q-bio.OT
Abstract URL: https://arxiv.org/abs/2506.01608
Pdf URL: https://arxiv.org/pdf/2506.01608
Copy Paste: [[2506.01608]] EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models(https://arxiv.org/abs/2506.01608)
Keywords: generation
Abstract: Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens 2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at this https URL

Title: Minimal Impact ControlNet: Advancing Multi-ControlNet Integration

Authors: Shikun Sun, Min Zhou, Zixuan Wang, Xubin Li, Tiezheng Ge, Zijie Ye, Xiaoyu Qin, Junliang Xing, Bo Zheng, Jia Jia
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.01672
Pdf URL: https://arxiv.org/pdf/2506.01672
Copy Paste: [[2506.01672]] Minimal Impact ControlNet: Advancing Multi-ControlNet Integration(https://arxiv.org/abs/2506.01672)
Keywords: generation
Abstract: With the advancement of diffusion models, there is a growing demand for high-quality, controllable image generation, particularly through methods that utilize one or multiple control signals based on ControlNet. However, in current ControlNet training, each control is designed to influence all areas of an image, which can lead to conflicts when different control signals are expected to manage different parts of the image in practical applications. This issue is especially pronounced with edge-type control conditions, where regions lacking boundary information often represent low-frequency signals, referred to as silent control signals. When combining multiple ControlNets, these silent control signals can suppress the generation of textures in related areas, resulting in suboptimal outcomes. To address this problem, we propose Minimal Impact ControlNet. Our approach mitigates conflicts through three key strategies: constructing a balanced dataset, combining and injecting feature signals in a balanced manner, and addressing the asymmetry in the score function's Jacobian matrix induced by ControlNet. These improvements enhance the compatibility of control signals, allowing for freer and more harmonious generation in areas with silent control signals.

Title: VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking

Authors: Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, Limin Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01725
Pdf URL: https://arxiv.org/pdf/2506.01725
Copy Paste: [[2506.01725]] VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking(https://arxiv.org/abs/2506.01725)
Keywords: generation
Abstract: While recent advances in reinforcement learning have significantly enhanced reasoning capabilities in large language models (LLMs), these techniques remain underexplored in multi-modal LLMs for video captioning. This paper presents the first systematic investigation of GRPO-based RL post-training for video MLLMs, with the goal of enhancing video MLLMs' capability of describing actions in videos. Specifically, we develop VideoCap-R1, which is prompted to first perform structured thinking that analyzes video subjects with their attributes and actions before generating complete captions, supported by two specialized reward mechanisms: an LLM-free think scorer evaluating the structured thinking quality and an LLM-assisted caption scorer assessing the output quality. The RL training framework effectively establishes the connection between structured reasoning and comprehensive description generation, enabling the model to produce captions with more accurate actions. Our experiments demonstrate that VideoCap-R1 achieves substantial improvements over the Qwen2VL-7B baseline using limited samples (1.5k) across multiple video caption benchmarks (DREAM1K: +4.4 event F1, VDC: +4.2 Acc, CAREBENCH: +3.1 action F1, +6.9 object F1) while consistently outperforming the SFT-trained counterparts, confirming GRPO's superiority in enhancing MLLMs' captioning capabilities.

Title: STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset

Authors: Jinhong Wang, Shuo Tong, Jian liu, Dongqi Tang, Jintai Chen, Haochao Ying, Hongxia Xu, Danny Chen, Jian Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01738
Pdf URL: https://arxiv.org/pdf/2506.01738
Copy Paste: [[2506.01738]] STORM: Benchmarking Visual Rating of MLLMs with a Comprehensive Ordinal Regression Dataset(https://arxiv.org/abs/2506.01738)
Keywords: quality assessment
Abstract: Visual rating is an essential capability of artificial intelligence (AI) for multi-dimensional quantification of visual content, primarily applied in ordinal regression (OR) tasks such as image quality assessment, facial age estimation, and medical image grading. However, current multi-modal large language models (MLLMs) underperform in visual rating while also suffering from a lack of relevant datasets and benchmarks. In this work, we collect and present STORM, a data collection and benchmark for Stimulating Trustworthy Ordinal Regression Ability of MLLMs for universal visual rating. STORM encompasses 14 ordinal regression datasets across five common visual rating domains, comprising 655K image-level pairs and the corresponding carefully curated VQAs. Importantly, we also propose a coarse-to-fine processing pipeline that dynamically considers label candidates and provides interpretable thoughts, providing MLLMs with a general and trustworthy ordinal thinking paradigm. This benchmark aims to evaluate the all-in-one and zero-shot performance of MLLMs in scenarios requiring understanding of the essential common ordinal relationships of rating labels. Extensive experiments demonstrate the effectiveness of our framework and shed light on better fine-tuning strategies. The STORM dataset, benchmark, and pre-trained models are available on the following webpage to support further research in this area. Datasets and codes are released on the project page: this https URL.
摘要：视觉评分是人工智能（AI）的重要能力，用于视觉内容的多维量化，主要应用于序数回归（或）任务（或）任务，例如图像质量评估，面部年龄估计和医疗图像分级。但是，当前的多模式大型语言模型（MLLM）在这种视觉评级能力中表现不佳，同时也缺乏相关的数据集和基准。在这项工作中，我们收集并介绍Storm，这是一个数据收集和基准，用于刺激MLLM对通用视觉评级的值得信赖的序数回归能力。 Storm涵盖了五个常见的视觉评分域中的14个序数回归数据集，其中包括655k图像级对和相应的精心策划的VQA。重要的是，我们还提出了一个粗到精细的处理管道，该管道动态考虑候选标签并提供可解释的思想，为MLLM提供了一般且值得信赖的典型思维范式。该基准旨在评估MLLM在需要理解评级标签基本常见序数关系的情况下的多合一和零射击性能。广泛的实验证明了我们的框架的有效性，并阐明了更好的微调策略。在以下网页上可以使用Storm数据集，基准和预培训的模型，以支持该领域的进一步研究。数据集和代码已在项目页面上发布：此HTTPS URL。

Title: Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks

Authors: Tao Yang, Ruibin Li, Yangming Shi, Yuqi Zhang, Qide Dong, Haoran Cheng, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01758
Pdf URL: https://arxiv.org/pdf/2506.01758
Copy Paste: [[2506.01758]] Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks(https://arxiv.org/abs/2506.01758)
Keywords: generation
Abstract: Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially text-to-video (T2V) generation, while many other works focus on finetuning the pretrained T2V model for image-to-video (I2V), video-to-video (V2V), and image and video manipulation tasks. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or a few tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then employ a joint image-video learning strategy to progressively train the model from scratch. Our joint learning leads to a unified visual generation and manipulation model with improved video generation performance. In addition, we introduce depth maps as a condition to help our model better perceive the 3D space in visual generation. Two versions of our model are trained with different model sizes (8B and 2B), each of which can perform more than 10 different tasks. In particular, our 8B model demonstrates highly competitive performance in video generation tasks compared to open-source and even commercial engines. Our models and source code are available at this https URL.
摘要：扩散模型在许多视觉生成和操纵任务中都表现出了令人印象深刻的性能。许多现有的方法着重于培训特定任务的模型，尤其是文本对视频（T2V）的一代，而许多其他作品着重于对图像到视频（I2V）（I2V），视频对视频（V2V），图像和视频操作任务进行填充的T2V模型。但是，培训T2V基础的构成量很大，很大程度上需要大量的型号。此外，许多现有模型只能执行一个或几个任务。在这项工作中，我们介绍了一个统一的框架，即多样的框架，该框架利用了许多不同视觉生成和操纵任务的可用培训数据来训练这些不同任务的单个模型。具体而言，我们设计了一个轻巧的适配器来统一不同任务中的不同条件，然后采用联合图像视频学习策略来逐步从头开始训练模型。我们的联合学习导致统一的视觉生成和操纵模型，并改善了视频生成性能。此外，我们引入了深度图，以帮助我们的模型更好地感知视觉生成中的3D空间。我们的模型的两个版本都经过不同的模型大小（8B和2B）的训练，每个模型大小（8B和2B）可以执行10个以上不同的任务。特别是，我们的8B模型与开源源甚至商用发动机相比，在视频生成任务中表现出了高度竞争性的性能。我们的模型和源代码可在此HTTPS URL上找到。

Title: Federated Gaussian Mixture Models

Authors: Sophia Zhang Pettersson, Kuo-Yun Liang, Juan Carlos Andresen
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.01780
Pdf URL: https://arxiv.org/pdf/2506.01780
Copy Paste: [[2506.01780]] Federated Gaussian Mixture Models(https://arxiv.org/abs/2506.01780)
Keywords: generative
Abstract: This paper introduces FedGenGMM, a novel one-shot federated learning approach for Gaussian Mixture Models (GMM) tailored for unsupervised learning scenarios. In federated learning (FL), where multiple decentralized clients collaboratively train models without sharing raw data, significant challenges include statistical heterogeneity, high communication costs, and privacy concerns. FedGenGMM addresses these issues by allowing local GMM models, trained independently on client devices, to be aggregated through a single communication round. This approach leverages the generative property of GMMs, enabling the creation of a synthetic dataset on the server side to train a global model efficiently. Evaluation across diverse datasets covering image, tabular, and time series data demonstrates that FedGenGMM consistently achieves performance comparable to non-federated and iterative federated methods, even under significant data heterogeneity. Additionally, FedGenGMM significantly reduces communication overhead, maintains robust performance in anomaly detection tasks, and offers flexibility in local model complexities, making it particularly suitable for edge computing environments.
摘要：本文介绍了FedGengmm，这是一种针对无监督学习场景量身定制的高斯混合模型（GMM）的新型单次联合学习方法。在联邦学习（FL）中，多个分散的客户在没有共享原始数据的情况下进行培训模型，包括统计异质性，高度沟通成本和隐私问题。 FedGengmm通过允许在客户设备上独立培训的本地GMM模型来解决这些问题，并通过一次通信回合进行汇总。这种方法利用GMM的生成属性，使在服务器端创建合成数据集以有效地训练全局模型。涵盖图像，表格和时间序列数据的各种数据集的评估表明，即使在重要的数据异质性下，FedGengmm始终达到的性能与未赋予的和迭代的联合方法相当。此外，FedGengmm显着降低了沟通开销，在异常检测任务中保持了稳健的性能，并在本地模型复杂性方面具有灵活性，使其特别适合边缘计算环境。
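
The one-shot flow described in the FedGenGMM abstract above (clients fit local GMMs, send only the parameters once, and the server samples a synthetic dataset to train a global model) can be illustrated with a minimal 1-D sketch. This is not the authors' implementation: the helper names `sample_gmm` and `fit_gmm_em`, the plain EM fitter, the toy client data, and the choice of one component per client and two for the global model are all assumptions made for illustration.

```python
import math
import random


def sample_gmm(params, n, rng):
    """Draw n samples from a 1-D GMM given as a list of (weight, mean, std)."""
    weights = [w for w, _, _ in params]
    samples = []
    for _ in range(n):
        _, mean, std = rng.choices(params, weights=weights)[0]
        samples.append(rng.gauss(mean, std))
    return samples


def fit_gmm_em(data, k=2, iters=50):
    """Fit a 1-D GMM with k components using plain EM (illustrative only)."""
    lo, hi = min(data), max(data)
    spread = ((hi - lo) / (2 * k)) or 1.0
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    stds = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [
                w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                for w, m, s in zip(weights, means, stds)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)
    return sorted(zip(weights, means, stds), key=lambda c: c[1])


rng = random.Random(0)

# Hypothetical heterogeneous clients: each one only sees a single mode.
client_data = [
    [rng.gauss(-5.0, 1.0) for _ in range(300)],
    [rng.gauss(5.0, 1.0) for _ in range(300)],
]

# Single communication round: each client sends (GMM parameters, sample count).
local_models = [(fit_gmm_em(d, k=1), len(d)) for d in client_data]

# Server side: sample a synthetic dataset from the local models, then fit globally.
synthetic = []
for params, n in local_models:
    synthetic.extend(sample_gmm(params, n, rng))
global_model = fit_gmm_em(synthetic, k=2)
```

Raw client data never leaves the clients in this sketch; the server only receives mixture parameters and sample counts, which is the property the abstract attributes to the single communication round.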

Title: Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

Authors: Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
Subjects: cs.LG, cs.AI, cs.CL, cs.CV, eess.AS
Abstract URL: https://arxiv.org/abs/2506.01789
Pdf URL: https://arxiv.org/pdf/2506.01789
Copy Paste: [[2506.01789]] Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability(https://arxiv.org/abs/2506.01789)
Keywords: generation, quality assessment
Abstract: High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at this https URL.
摘要：高质量的数据集是培训和评估机器学习模型的基础，但是它们的创建尤其是在准确的人类注释范围内是一个重大挑战。许多数据集论文提交缺乏独创性，多样性或严格的质量控制，这些缺点在同行评审过程中常常被忽略。提交也经常省略有关数据集构建和属性的基本细节。尽管现有工具（例如数据表）旨在提高透明度，但它们在很大程度上具有描述性，并且不提供标准化的可测量方法来评估数据质量。同样，会议上的元数据要求促进了问责制，但不一致地执行。为了解决这些局限性，该立场论文倡导将系统的，基于标语的评估指标集成到数据集审查过程中，随着提交量的不断增长。我们还探索可扩展的，具有成本效益的方法，用于合成数据的生成，包括专用工具和LLM-AS-A-A-Gudge方法，以支持更有效的评估。作为行动呼吁，我们介绍了数据核酸元，这是一个结构化框架，用于评估人类和模型生成数据集的质量。 Datarubrics利用基于LLM的评估的最新进展，为数据集质量评估提供了可再现，可扩展性和可行的解决方案，使作者和审阅者都可以在以数据为中心的研究中维护更高的标准。我们还发布代码以支持此HTTPS URL上基于LLM的评估的可重复性。

Title: WorldExplorer: Towards Generating Fully Navigable 3D Scenes

Authors: Manuel-Andreas Schneider, Lukas Höllein, Matthias Nießner
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01799
Pdf URL: https://arxiv.org/pdf/2506.01799
Copy Paste: [[2506.01799]] WorldExplorer: Towards Generating Fully Navigable 3D Scenes(https://arxiv.org/abs/2506.01799)
Keywords: generation
Abstract: Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360 degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling, for the first time, realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
摘要：从文本中生成3D世界是计算机视觉中备受期待的目标。现有作品受到场景内部允许的探索程度的限制，即，在超越中央或全景的观点时，会产生散布和嘈杂的文物。为此，我们提出了WorldExplorer，这是一种基于自回归视频轨迹的新颖方法，该方法构建了完全可导航的3D场景，并在广泛的观点中具有一致的视觉质量。我们通过创建与360度全景的多视图一致的图像来初始化场景。然后，我们通过在迭代场景生成管道中利用视频扩散模型来扩展它。具体而言，我们沿着简短的预定义轨迹生成多个视频，这些视频深入探索场景，包括围绕对象的运动。我们的新型场景记忆在最相关的先前视图上为每个视频提供了条件，而碰撞检测机制则阻止了退化的结果，例如移入对象。最后，我们通过3D高斯脱落优化将所有生成的视图融合到统一的3D表示中。与先前的方法相比，WorldExplorer会产生高质量的场景，这些场景在大型相机运动下保持稳定，这是首次实现现实和不受限制的探索。我们认为，这标志着产生沉浸式和真正可探索的虚拟3D环境迈出的重要一步。

Title: OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation

Authors: Sen Liang, Zhentao Yu, Zhengguang Zhou, Teng Hu, Hongmei Wang, Yi Chen, Qin Lin, Yuan Zhou, Xin Li, Qinglin Lu, Zhibo Chen
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01801
Pdf URL: https://arxiv.org/pdf/2506.01801
Copy Paste: [[2506.01801]] OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation(https://arxiv.org/abs/2506.01801)
Keywords: generation
Abstract: The emergence of Diffusion Transformers (DiT) has brought significant advancements to video generation, especially in text-to-video and image-to-video tasks. Although video generation is widely applied in various fields, most existing models are limited to single scenarios and cannot perform diverse video generation and editing through dynamic content manipulation. We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations, including: object movement, object addition, mask-guided video edit, try-on, inpainting, outpainting, human animation, and controllable character video synthesis. We explore a unified dynamic content manipulation injection module, which effectively integrates the requirements of the above tasks. In addition, we design a visual-text instruction module based on LLaVA, enabling the model to effectively understand the correspondence between visual content and instructions. Furthermore, we build a comprehensive multi-task data processing system. Since there is data overlap among various tasks, this system can efficiently provide data augmentation. Using this system, we construct a multi-type, multi-scenario OmniV2V dataset and its corresponding OmniV2V-Test benchmark. Extensive experiments show that OmniV2V works as well as, and sometimes better than, the best existing open-source and commercial models for many video generation and editing tasks.
摘要：扩散变压器（DIT）的出现为视频生成带来了重大进步，尤其是在文本到视频和图像到视频任务中。尽管视频生成广泛应用于各个领域，但大多数现有模型都限于单个场景，并且无法通过动态内容操纵进行多样化的视频生成和编辑。我们提出了Omniv2v，这是一个视频模型，该模型旨在基于各种操作在不同方案上生成和编辑视频，包括：对象运动，对象添加，掩盖引导的视频编辑，try-On，in-on-On，inpherting，inpherting，ofpaining，ofpainting，ofpainting，oppainting，human Animation，human Animation和可控制的角色视频综合。我们探索一个统一的动态内容操纵注入模块，该模块有效地整合了上述任务的要求。此外，我们设计了一个基于LLAVA的视觉文本指令模块，使模型能够有效地了解视觉内容和指令之间的对应关系。此外，我们构建了一个全面的多任务数据处理系统。由于各种任务之间存在数据重叠，因此该系统可以有效地提供数据增强。使用此系统，我们构建了一个多类型的多型Scenario Omniv2v数据集及其相应的Omniv2V检测基准。广泛的实验表明，Omniv2v的工作原理，有时甚至比许多视频生成和编辑任务的现有最佳开源和商业模型更好。

Title: GSCodec Studio: A Modular Framework for Gaussian Splat Compression

Authors: Sicheng Li, Chengzhen Wu, Hao Li, Xiang Gao, Yiyi Liao, Lu Yu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.01822
Pdf URL: https://arxiv.org/pdf/2506.01822
Copy Paste: [[2506.01822]] GSCodec Studio: A Modular Framework for Gaussian Splat Compression(https://arxiv.org/abs/2506.01822)
Keywords: generation
Abstract: 3D Gaussian Splatting and its extension to 4D dynamic scenes enable photorealistic, real-time rendering from real-world captures, positioning Gaussian Splats (GS) as a promising format for next-generation immersive media. However, their high storage requirements pose significant challenges for practical use in sharing, transmission, and storage. Despite various studies exploring GS compression from different perspectives, these efforts remain scattered across separate repositories, complicating benchmarking and the integration of best practices. To address this gap, we present GSCodec Studio, a unified and modular framework for GS reconstruction, compression, and rendering. The framework incorporates a diverse set of 3D/4D GS reconstruction methods and GS compression techniques as modular components, facilitating flexible combinations and comprehensive comparisons. By integrating best practices from community research and our own explorations, GSCodec Studio supports the development of compact representation and compression solutions for static and dynamic Gaussian Splats, namely our Static and Dynamic GSCodec, achieving competitive rate-distortion performance in static and dynamic GS compression. The code for our framework is publicly available at this https URL to advance research on Gaussian Splat compression.
摘要：3D高斯裂开及其扩展到4D动态场景，可以从现实世界中捕获的光真逼真，实时渲染，将高斯夹夹（GS）定位为下一代沉浸式媒体的有希望的格式。但是，它们的高存储要求在共享，传输和存储方面面临着重大挑战。尽管从不同的角度探讨了GS压缩的各种研究，但这些努力仍散布在单独的存储库中，使基准测试和最佳实践的整合变得复杂。为了解决这一差距，我们提出了GSCODEC Studio，这是一个用于GS重建，压缩和渲染的统一和模块化框架。该框架结合了各种3D/4D GS重建方法和GS压缩技术作为模块化组件，促进了柔性组合和全面的比较。通过整合社区研究和我们自己的探索的最佳实践，GSCODEC Studio支持紧凑型表示和压缩解决方案的静态和动态高斯夹层的开发，即我们的静态和动态的GSCODEC，在静态和动态GS压缩中实现了竞争性速率率性能。我们的框架代码在此HTTPS URL上公开可用，以推动高斯夹层压缩的研究。

Title: SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Authors: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, Remi Cadene
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.01844
Pdf URL: https://arxiv.org/pdf/2506.01844
Copy Paste: [[2506.01844]] SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics(https://arxiv.org/abs/2506.01844)
Keywords: generation
Abstract: Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack that decouples perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated and real-world robotic benchmarks and release all code, pretrained models, and training data.
摘要：在大规模的多模式数据集上预测的视觉语言模型（VLM）编码丰富的视觉和语言知识，使它们成为机器人技术的坚实基础。最近的方法没有从头开始训练机器人策略，而是将VLMS调整为视觉语言动作（VLA）模型，从而实现自然语言驱动的感知和控制。但是，现有的VLA通常是巨大的 - 通常带有数十亿个参数 - 领导着高训练成本和有限的现实可部署性。此外，他们依靠学术和工业数据集，忽视了来自负担得起的机器人平台的社区收集数据的不断增长。在这项工作中，我们提出了Smolvla，这是一个小型，高效且以社区为导向的VLA，可大大降低培训和推理成本，同时保持竞争性绩效。 Smolvla设计为对单个GPU进行培训，并在消费级GPU甚至CPU上部署。为了进一步提高响应能力，我们引入了一种异步推理堆栈的脱钩感知和行动执行的行动预测，从而使较高的控制速率随着行动的产生而产生。尽管大小紧凑，但Smolvla的性能与大10倍的VLA相当。我们评估了Smolvla在一系列模拟的机器人基准和释放所有代码，验证模型和培训数据的范围内评估Smolvla。

Title: MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Authors: Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem
Subjects: cs.CV, cs.AI, cs.LG, cs.MM
Abstract URL: https://arxiv.org/abs/2506.01850
Pdf URL: https://arxiv.org/pdf/2506.01850
Copy Paste: [[2506.01850]] MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs(https://arxiv.org/abs/2506.01850)
Keywords: generation
Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle to ground fine-grained visual concepts in complex scenes. In this paper, we propose MoDA (Modulation Adapter), a lightweight yet effective module designed to refine pre-aligned visual features through instruction-guided modulation. Our approach follows the standard LLaVA training protocol, consisting of a two-stage process: (1) aligning image features to the LLM's input space via a frozen vision encoder and adapter layers, and (2) refining those features using the MoDA adapter during the instructional tuning stage. MoDA employs a Transformer-based cross-attention mechanism to generate a modulation mask over the aligned visual tokens, thereby emphasizing semantically relevant embedding dimensions based on the language instruction. The modulated features are then passed to the LLM for autoregressive language generation. Our experimental evaluation shows that MoDA improves visual grounding and generates more contextually appropriate responses, demonstrating its effectiveness as a general-purpose enhancement for image-based MLLMs.
摘要：最近，多模式大型语言模型（MLLM）通过将预验证的视觉编码器与大语言模型（LLMS）集成在一起，在跟随指导遵守任务上表现出了令人印象深刻的表现。但是，现有的方法通常难以在复杂的场景中扎根细粒度的视觉概念。在本文中，我们提出了Moda（调制适配器），这是一个轻巧但有效的模块，旨在通过指令引导的调制来完善预先对准的视觉特征。我们的方法遵循标准LLAVA训练协议，该协议由两个阶段的过程组成：（1）通过冷冻视觉编码器和适配器层对齐LLMS输入空间，以及（2）在教学调整阶段使用MODA适配器来完善这些功能。 MODA采用基于变压器的跨注意机制来在对齐的视觉令牌上生成调制面膜，从而根据语言指令强调语义相关的嵌入维度。然后，调制功能将传递给LLM以进行自回归语言。我们的实验评估表明，MODA改善了视觉接地并产生更适合上下文的响应，证明了其作为基于图像的MLLM的通用增强功能的有效性。

Title: ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

Authors: Junliang Ye, Zhengyi Wang, Ruowen Zhao, Shenghao Xie, Jun Zhu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01853
Pdf URL: https://arxiv.org/pdf/2506.01853
Copy Paste: [[2506.01853]] ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding(https://arxiv.org/abs/2506.01853)
Keywords: generation
Abstract: Recently, the powerful text-to-image capabilities of ChatGPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni, a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based training of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset. Our work provides an effective attempt at extending multimodal models with basic 3D capabilities, which contributes to future research in 3D-native AI. Project page: this https URL
摘要：最近，Chatgpt-4O的强大文本对图像功能导致对本地多模式大型语言模型的欣赏日益增长。但是，其多模式功能仍然局限于图像和文本。然而，除了图像之外，理解和生成3D内容的能力同样至关重要。为了解决这一差距，我们提出了shapellm-omni-a天然3D大型语言模型，能够理解和生成任何序列的3D资产和文本。首先，我们将3D矢量定量的变分自动编码器（VQVAE）训练，该变量自动编码器（VQVAE）将3D对象映射到离散的潜在空间中，以实现有效，准确的形状表示和重建。在3D感知的离散代币的基础上，我们创新地构建了一个名为3D-Alpaca的大规模连续培训数据集，其中包含了一代，理解和编辑，从而为未来的研究和培训提供了丰富的资源。最后，通过对3D-ALPACA数据集上的QWEN-2.5-VL-7B-INSCRUCT模型进行基于指令的培训。我们的工作提供了一种有效的尝试，以扩展具有基本3D功能的多模式模型，这有助于3D-Nenative AI的未来研究。项目页面：此HTTPS URL

Title: SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data

Authors: Yan Zhou, Bradley Malin, Murat Kantarcioglu
Subjects: cs.LG, cs.CR, stat.ML
Abstract URL: https://arxiv.org/abs/2506.01907
Pdf URL: https://arxiv.org/pdf/2506.01907
Copy Paste: [[2506.01907]] SMOTE-DP: Improving Privacy-Utility Tradeoff with Synthetic Data(https://arxiv.org/abs/2506.01907)
Keywords: generation, generative
Abstract: Privacy-preserving data publication, including synthetic data sharing, often experiences trade-offs between privacy and utility. Synthetic data is generally more effective than data anonymization in balancing this trade-off, though not without its own challenges. Synthetic data produced by generative models trained on source data may inadvertently reveal information about outliers. Techniques specifically designed for preserving privacy, such as introducing noise to satisfy differential privacy, often incur unpredictable and significant losses in utility. In this work we show that, with the right mechanism of synthetic data generation, we can achieve strong privacy protection without significant utility loss. Synthetic data generators producing contracting data patterns, such as Synthetic Minority Over-sampling Technique (SMOTE), can enhance a differentially private data generator, leveraging the strengths of both. We prove in theory and through empirical demonstration that this SMOTE-DP technique can produce synthetic data that not only ensures robust privacy protection but also maintains utility in downstream learning tasks.
摘要：隐私的数据出版物（包括综合数据共享）经常在隐私与公用事业之间经历权衡。合成数据通常比数据匿名化在平衡这一权衡方面更有效，但是，并非没有其自身的挑战。在源数据上训练的生成模型产生的合成数据可能会无意间揭示有关异常值的信息。专门为保留隐私而设计的技术，例如引入噪音以满足差异性隐私，通常会造成效用中不可预测且重大损失。在这项工作中，我们表明，借助合成数据生成的正确机制，我们可以实现强大的隐私保护而不会大大损失。产生收缩数据模式的合成数据生成器，例如合成少数群体过采样技术（SMOTE），可以增强差异性数据生成器，从而利用两者的优势。我们在理论上并通过经验证明证明了这种SMOTE-DP技术可以产生合成数据，不仅可以确保强大的隐私保护，而且可以在下游学习任务中保持效用。
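
The "contracting data patterns" that the SMOTE-DP abstract above attributes to SMOTE come from its interpolation step: each synthetic point is a convex combination of a real point and one of its nearest neighbours, so it never leaves the region spanned by the original data. The sketch below shows only that standard SMOTE step. It does not include the differentially private generator from the paper, and the function name `smote_sample`, its parameters, and the toy data are assumptions for illustration.

```python
import random


def smote_sample(points, k=3, n_new=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    base point and one of its k nearest neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        # Rank the other points by squared Euclidean distance to the base point.
        others = [p for p in points if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sample(minority)
```

Because every output is a convex combination of two inputs, all synthetic points in this example stay inside the unit square.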

Title: Elucidating the representation of images within an unconditional diffusion model denoiser

Authors: Zahra Kadkhodaie, Stéphane Mallat, Eero Simoncelli
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01912
Pdf URL: https://arxiv.org/pdf/2506.01912
Copy Paste: [[2506.01912]] Elucidating the representation of images within an unconditional diffusion model denoiser(https://arxiv.org/abs/2506.01912)
Keywords: generative
Abstract: Generative diffusion models learn probability densities over diverse image datasets by estimating the score with a neural network trained to remove noise. Despite their remarkable success in generating high-quality images, the internal mechanisms of the underlying score networks are not well understood. Here, we examine a UNet trained for denoising on the ImageNet dataset, to better understand its internal representation and computation of the score. We show that the middle block of the UNet decomposes individual images into sparse subsets of active channels, and that the vector of spatial averages of these channels can provide a nonlinear representation of the underlying clean images. We develop a novel algorithm for stochastic reconstruction of images from this representation and demonstrate that it recovers a sample from a set of images defined by a target image representation. We then study the properties of the representation and demonstrate that Euclidean distances in the latent space correspond to distances between conditional densities induced by representations as well as semantic similarities in the image space. Applying a clustering algorithm in the representation space yields groups of images that share both fine details (e.g., specialized features, textured regions, small objects), as well as global structure, but are only partially aligned with object identities. Thus, we show for the first time that a network trained solely on denoising contains a rich and accessible sparse representation of images.
摘要：生成扩散模型通过使用训练噪声的神经网络估算分数来了解各种图像数据集的概率密度。尽管在产生高质量图像方面取得了显着的成功，但基本得分网络的内部机制尚不清楚。在这里，我们检查了一个在Imagenet数据集上训练有素的UNET，以更好地了解其内部表示和分数的计算。我们表明，UNET的中间块将单个图像分解为活动通道的稀疏子集，并且这些通道的空间平均值向量可以提供基础干净图像的非线性表示。我们开发了一种新型算法，用于从该表示形式中对图像进行随机重建，并证明它从目标图像表示定义的一组图像中恢复了样本。然后，我们研究表示的特性，并证明潜在空间中的欧几里得距离对应于由表示空间引起的条件密度以及图像空间中的语义相似性之间的距离。在表示空间中应用聚类算法会产生一组既有细节（例如，专业特征，纹理区域，小物体）以及全局结构的图像组，但仅部分与对象身份相符。因此，我们首次展示了仅针对DeNoising训练的网络包含图像的丰富稀疏表示。
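
The representation described in the abstract above, a vector of per-channel spatial averages in which only a sparse subset of channels is active, can be illustrated with a toy computation. The sketch assumes hand-written activations in place of a real UNet middle block, and the function names and the activity threshold are hypothetical.

```python
def channel_spatial_means(feature_maps):
    """Average each channel over its spatial grid.

    feature_maps: list of channels, each a list of rows of floats (C x H x W).
    Returns a list of C floats.
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def active_channels(means, threshold=1e-6):
    """Indices of channels whose spatial mean exceeds the (assumed) threshold."""
    return [i for i, m in enumerate(means) if abs(m) > threshold]


# Toy 4-channel, 2x2 activations; two channels are entirely inactive.
acts = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 3.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
representation = channel_spatial_means(acts)
active = active_channels(representation)
```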

Title: TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation

Authors: Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, Cheng Zhang
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01923
Pdf URL: https://arxiv.org/pdf/2506.01923
Copy Paste: [[2506.01923]] TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation(https://arxiv.org/abs/2506.01923)
Keywords: generation
Abstract: We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels -- starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that TaxaDiffusion outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Project page: this https URL
摘要：我们提出了一种分类式化的扩散模型的分类学培训框架，以产生具有高形态学和身份准确性的细颗粒动物图像。与将每个物种视为独立类别的标准方法不同，TaxAdiflusion结合了域知识，许多物种表现出强烈的视觉相似性，而区分通常属于形状，图案和颜色的细微变化。为了利用这些关系，税收扩散逐渐培训了不同分类级别的条件扩散模型 - 从诸如班级和秩序等广泛的分类开始，通过家庭和属，最终在物种一级区分。该分层学习策略首先捕获物种与共同祖先共有的粗粒粒度形态特征，从而促进知识转移，然后再完善物种水平区分的细粒差异。结果，即使每个物种的培训样本有限，税收排化也能够准确产生。在三个细颗粒动物数据集上进行的广泛实验表明，实现现有方法的表现，实现了精细颗粒动物形象产生的优越性。项目页面：此HTTPS URL

Title: Low-Rank Head Avatar Personalization with Registers

Authors: Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Md Moniruzzaman, Chen-Ping Yu, Yi-Hsuan Tsai, Dimitris Samaras
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01935
Pdf URL: https://arxiv.org/pdf/2506.01935
Copy Paste: [[2506.01935]] Low-Rank Head Avatar Personalization with Registers(https://arxiv.org/abs/2506.01935)
Keywords: generation
Abstract: We introduce a novel method for low-rank personalization of a generic model for head avatar generation. Prior work proposes generic models that achieve high-quality face animation by leveraging large-scale datasets of multiple identities. However, such generic models usually fail to synthesize unique identity-specific details, since they learn a general domain prior. To adapt to specific subjects, we find that it is still challenging to capture high-frequency facial details via popular solutions like low-rank adaptation (LoRA). This motivates us to propose a specific architecture, a Register Module, that enhances the performance of LoRA, while requiring only a small number of parameters to adapt to an unseen identity. Our module is applied to intermediate features of a pre-trained model, storing and re-purposing information in a learnable 3D feature space. To demonstrate the efficacy of our personalization method, we collect a dataset of talking videos of individuals with distinctive facial details, such as wrinkles and tattoos. Our approach faithfully captures unseen faces, outperforming existing methods quantitatively and qualitatively. We will release the code, models, and dataset to the public.
摘要：我们介绍了一种新颖的方法，用于低排名的个性化模型，用于头像头像生成。先前的工作提出了通用模型，通过利用多个身份的大规模数据集来实现高质量的面部动画。但是，这种通用模型通常无法综合特定于身份的细节，因为它们在先前学习了一般领域。为了适应特定的主题，我们发现通过低级适应（Lora）等流行解决方案捕获高频面部细节仍然具有挑战性。这促使我们提出了一个特定的体系结构，即寄存器模块，以增强洛拉的性能，同时只需要少量参数才能适应看不见的身份。我们的模块应用于预训练模型的中间特征，将信息存储和重新填充信息在可学习的3D特征空间中。为了证明我们的个性化方法的功效，我们收集了一个会话视频的数据集，这些视频具有独特的面部细节，例如皱纹和纹身。我们的方法忠实地捕捉了看不见的面孔，以定量和定性的方式优于现有方法。我们将向公众发布代码，模型和数据集。

Title: Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Authors: Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, Dahua Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01943
Pdf URL: https://arxiv.org/pdf/2506.01943
Copy Paste: [[2506.01943]] Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control(https://arxiv.org/abs/2506.01943)
Keywords: generation
Abstract: Recent advances in video diffusion models have demonstrated strong potential for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing trajectory-based methods primarily focus on individual object motion and struggle to capture the multi-object interactions that are crucial in complex robotic manipulation. This limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core idea is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction, thereby mitigating the drawback of multi-object feature fusion present during interaction in prior work. To further ensure subject semantic consistency throughout the video, we incorporate appearance- and shape-aware latent representations for objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild evaluation, demonstrate that our method outperforms existing approaches, establishing new state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
摘要：视频扩散模型的最新进展显示出生成机器人决策数据的强大潜力，并且轨迹条件进一步实现了细粒度的控制。但是，现有的基于轨迹的方法主要集中于单个对象运动，并难以捕获复杂的机器人操纵至关重要的多对象相互作用。这种限制是由重叠区域中的多种功能纠缠产生的，这导致视觉保真度退化。为了解决这个问题，我们提出了Robomaster，这是一个新颖的框架，该框架通过协作轨迹公式建模对象间动力学。与分解对象的先前方法不同，我们的核心是将相互作用过程分解为三个子阶段：交互前，相互作用和相互作用。每个阶段都是使用主要对象的特征进行建模的，特别是在交互前和相互作用阶段和操纵对象中的机器人组，从而减轻了先前工作中交互期间存在的多对象特征融合的缺点。为了进一步确保整个视频中的主题语义一致性，我们将对象的外观和形状吸引的潜在表示。在具有挑战性的桥梁V2数据集以及野外评估上进行的广泛实验表明，我们的方法表现优于现有方法，在轨迹控制的视频生成中建立了用于机器人操作的轨迹控制视频中的新最新性能。
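
The three-stage decomposition in the RoboMaster abstract above assigns one dominant object to each phase of the interaction. A minimal sketch of that assignment follows, assuming the interaction interval is given as frame indices. The function name, the string labels, and the inclusive boundary handling are hypothetical and not taken from the paper.

```python
def dominant_object(frame, interact_start, interact_end):
    """Return (stage, dominant object) for a frame index.

    Assumes the interaction occupies frames interact_start..interact_end
    inclusive; the robotic arm dominates before and after that interval,
    and the manipulated object dominates within it.
    """
    if frame < interact_start:
        return ("pre-interaction", "robotic_arm")
    if frame <= interact_end:
        return ("interaction", "manipulated_object")
    return ("post-interaction", "robotic_arm")


# Toy 30-frame clip whose interaction spans frames 10 to 20.
schedule = [dominant_object(f, 10, 20) for f in range(30)]
```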

Title: IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout

Authors: Fei Shen, Xiaoyu Du, Yutong Gao, Jian Yu, Yushe Cao, Xing Lei, Jinhui Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.01949
Pdf URL: https://arxiv.org/pdf/2506.01949
Copy Paste: [[2506.01949]] IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout(https://arxiv.org/abs/2506.01949)
Keywords: generation
Abstract: Recent diffusion models have advanced image editing by enhancing visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to enhance editing accuracy and structural consistency. In addition, we observe that diffusion models are susceptible to initial noise and exhibit strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at this https URL.
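Summary: Recent diffusion models have advanced image editing by improving visual quality and control, supporting broad applications across creative and personalized domains. However, current image editing largely overlooks multi-object scenarios, where precise control over object categories, counts, and spatial layouts remains a significant challenge. To address this, we introduce a new task, quantity-and-layout consistent image editing (QL-Edit), which aims to enable fine-grained control of object quantity and spatial structure in complex scenes. We further propose IMAGHarmony, a structure-aware framework that incorporates harmony-aware attention (HA) to integrate multimodal semantics, explicitly modeling object counts and layouts to improve editing accuracy and structural consistency. In addition, we observe that diffusion models are sensitive to the initial noise and show strong preferences for specific noise patterns. Motivated by this, we present a preference-guided noise selection (PNS) strategy that chooses semantically aligned initial noise samples based on vision-language matching, thereby improving generation stability and layout consistency in multi-object editing. To support evaluation, we construct HarmonyBench, a comprehensive benchmark covering diverse quantity and layout control scenarios. Extensive experiments show that IMAGHarmony consistently outperforms state-of-the-art methods in structural alignment and semantic accuracy. The code and model are available at this https URL.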
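
The idea of choosing an initial noise sample by a preference score can be sketched as a best-of-N selection over seeds. This is a minimal illustration, not the paper's PNS procedure: `sample_noise`, `select_noise`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in for the vision-language matching score, which the abstract does not specify.

```python
import random
from typing import Callable, List, Sequence, Tuple

Noise = List[float]


def sample_noise(seed: int, dim: int) -> Noise:
    """Draw a reproducible Gaussian noise vector for a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def select_noise(
    seeds: Sequence[int], dim: int, score_fn: Callable[[Noise], float]
) -> Tuple[int, Noise]:
    """Return the seed and noise vector with the highest preference score."""
    if not seeds:
        raise ValueError("at least one candidate seed is required")
    best_seed = max(seeds, key=lambda s: score_fn(sample_noise(s, dim)))
    return best_seed, sample_noise(best_seed, dim)


def toy_score(noise: Noise) -> float:
    """Stand-in scorer (assumption): prefers noise whose entries sum near zero."""
    return -abs(sum(noise))
```

In a real pipeline, `score_fn` would presumably generate or preview an edit from each candidate and compare it against the prompt; here it only ranks the raw vectors so that the selection logic can be run on its own.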

Title: Dual-Process Image Generation

Authors: Grace Luo, Jonathan Granskog, Aleksander Holynski, Trevor Darrell
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2506.01955
Pdf URL: https://arxiv.org/pdf/2506.01955
Copy Paste: [[2506.01955]] Dual-Process Image Generation(https://arxiv.org/abs/2506.01955)
Keywords: generation
Abstract: Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models, or VLMs, can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: this https URL.
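Summary: Prior methods for controlling image generation are limited in their ability to be taught new tasks. In contrast, vision-language models (VLMs) can learn tasks in-context and produce the correct outputs for a given input. We propose a dual-process distillation scheme that allows feed-forward image generators to learn new tasks from deliberative VLMs. Our scheme uses a VLM to rate the generated images and backpropagates this gradient to update the weights of the image generator. Our general framework enables a wide variety of new control tasks through the same text-and-image based interface. We showcase a handful of applications of this technique for different types of control signals, such as commonsense inferences and visual prompts. With our method, users can implement multimodal controls for properties such as color palette, line weight, horizon position, and relative depth within a matter of minutes. Project page: this https URL.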
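
The step of backpropagating a rating into the generator's weights can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "generator" is a single scalar weight, the "rater" is a differentiable quadratic score standing in for a VLM, and the gradient is written out by hand rather than computed by an autodiff framework.

```python
def generate(weight: float, latent: float) -> float:
    """Toy generator (assumption): the output is the latent scaled by one weight."""
    return weight * latent


def rate(output: float, target: float) -> float:
    """Toy differentiable rater standing in for a VLM: higher is better."""
    return -(output - target) ** 2


def distill(weight: float, latent: float, target: float,
            lr: float = 0.1, steps: int = 50) -> float:
    """Gradient ascent on the rating with respect to the generator weight."""
    for _ in range(steps):
        output = generate(weight, latent)
        # Chain rule: d(rate)/d(weight) = d(rate)/d(output) * d(output)/d(weight)
        grad = -2.0 * (output - target) * latent
        weight += lr * grad
    return weight
```

Starting from `weight = 0.0` with `latent = 1.0` and `target = 0.5`, the weight moves toward 0.5 and the rating rises toward its maximum of zero.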