2025-06-05

Title: Test-Time Scaling of Diffusion Models via Noise Trajectory Search

Authors: Vignav Ramesh, Morteza Mardani
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03164
Pdf URL: https://arxiv.org/pdf/2506.03164
Copy Paste: [[2506.03164]] Test-Time Scaling of Diffusion Models via Noise Trajectory Search(https://arxiv.org/abs/2506.03164)
Keywords: generation
Abstract: The iterative and stochastic nature of diffusion models enables test-time scaling, whereby spending additional compute during denoising generates higher-fidelity samples. Increasing the number of denoising steps is the primary scaling axis, but this yields quickly diminishing returns. Instead optimizing the noise trajectory--the sequence of injected noise vectors--is promising, as the specific noise realizations critically affect sample quality; but this is challenging due to a high-dimensional search space, complex noise-outcome interactions, and costly trajectory evaluations. We address this by first casting diffusion as a Markov Decision Process (MDP) with a terminal reward, showing tree-search methods such as Monte Carlo tree search (MCTS) to be meaningful but impractical. To balance performance and efficiency, we then resort to a relaxation of MDP, where we view denoising as a sequence of independent contextual bandits. This allows us to introduce an $\epsilon$-greedy search algorithm that globally explores at extreme timesteps and locally exploits during the intermediate steps where de-mixing occurs. Experiments on EDM and Stable Diffusion reveal state-of-the-art scores for class-conditioned/text-to-image generation, exceeding baselines by up to $164\%$ and matching/exceeding MCTS performance. To our knowledge, this is the first practical method for test-time noise trajectory optimization of arbitrary (non-differentiable) rewards.
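A minimal sketch of the $\epsilon$-greedy idea described in the abstract above, under loudly stated assumptions: the denoiser, the reward, and the epsilon schedule (`toy_denoise_step`, `toy_reward`, `epsilon_schedule`) are hypothetical stand-ins, and scoring each candidate on the next state is a simplification of the bandit relaxation. It is not the authors' implementation. It only shows the shape of the search: at each step, candidates are either fresh noise draws (exploration, more likely at extreme timesteps) or perturbations of the best noise so far (exploitation, more likely at intermediate timesteps).

```python
import random

def epsilon_schedule(t, num_steps, high=0.8, low=0.1):
    """Hypothetical schedule: explore more at the first and last timesteps."""
    frac = t / max(num_steps - 1, 1)
    return high if frac < 0.25 or frac > 0.75 else low

def toy_denoise_step(x, noise, t, num_steps):
    """Hypothetical stand-in for one stochastic denoising step."""
    scale = 1.0 - (t + 1) / num_steps  # injected noise vanishes at the last step
    return [0.5 * xi + scale * ni for xi, ni in zip(x, noise)]

def toy_reward(x, target):
    """Hypothetical black-box reward: negative squared distance to a target."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def search_noise_trajectory(target, num_steps=10, num_candidates=8, seed=0):
    rng = random.Random(seed)
    dim = len(target)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in range(num_steps):
        eps = epsilon_schedule(t, num_steps)
        # Start from one fresh draw, then try to improve on it.
        best_noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        best_score = toy_reward(toy_denoise_step(x, best_noise, t, num_steps), target)
        for _ in range(num_candidates - 1):
            if rng.random() < eps:
                # Explore: draw a fresh noise vector from the prior.
                cand = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            else:
                # Exploit: perturb the best noise found so far at this step.
                cand = [n + 0.1 * rng.gauss(0.0, 1.0) for n in best_noise]
            score = toy_reward(toy_denoise_step(x, cand, t, num_steps), target)
            if score > best_score:
                best_noise, best_score = cand, score
        x = toy_denoise_step(x, best_noise, t, num_steps)
    return x, toy_reward(x, target)
```

Calling `search_noise_trajectory([1.0, -1.0])` returns the final toy sample and its reward. Because the reward is only ever called as a black box, nothing in the loop requires it to be differentiable.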

Title: PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models

Authors: Murthy L, Subarna Tripathi
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.03170
Pdf URL: https://arxiv.org/pdf/2506.03170
Copy Paste: [[2506.03170]] PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models(https://arxiv.org/abs/2506.03170)
Keywords: generation, generative
Abstract: The risk of misusing text-to-image generative models for malicious purposes, especially due to the open-source development of such models, has become a serious concern. As a risk mitigation strategy, attributing generative models with neural fingerprinting is emerging as a popular technique. A plethora of recent work aims to address neural fingerprinting. The trade-off between the attribution accuracy and generation quality of such models has been studied extensively. None of the existing methods has yet achieved $100\%$ attribution accuracy. However, any model with less than perfect accuracy is practically non-deployable. In this work, we propose an accurate method to incorporate neural fingerprinting for text-to-image diffusion models, leveraging the concepts of cyclic error correcting codes from the coding theory literature.

Title: FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution

Authors: Xiaoyi Liu, Hao Tang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03173
Pdf URL: https://arxiv.org/pdf/2506.03173
Copy Paste: [[2506.03173]] FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution(https://arxiv.org/abs/2506.03173)
Keywords: generation
Abstract: Physical intelligence -- anticipating and shaping the world from partial, multisensory observations -- is critical for next-generation world models. We propose FOLIAGE, a physics-informed multimodal world model for unbounded accretive surface growth. In its Action-Perception loop, a unified context encoder maps images, mesh connectivity, and point clouds to a shared latent state. A physics-aware predictor, conditioned on physical control actions, advances this latent state in time to align with the target latent of the surface, yielding a Modality-Agnostic Growth Embedding (MAGE) that interfaces with critic heads for downstream objectives. FOLIAGE's Accretive Graph Network (AGN) captures dynamic connectivity through Age Positional Encoding and Energy-Gated Message-Passing. Geometry-Correspondence Fusion and Cross-Patch Masking enhance MAGE's expressiveness, while Hierarchical Pooling balances global context with local dynamics. We create SURF-GARDEN, a world model learning platform comprising a Counterfactual Physics Simulator, a Multimodal Correspondence Extractor, and Evolution Tracing, which generates 7,200 diverse surface-growth sequences. SURF-BENCH, our physical-intelligence evaluation suite, evaluates six core tasks -- topology recognition, inverse material estimation, growth-stage classification, latent roll-out, cross-modal retrieval, and dense correspondence -- and four stress tests -- sensor dropout, zero-shot modality transfer, long-horizon prediction, and physics ablation -- to probe resilience. FOLIAGE outperforms specialized baselines while remaining robust across dynamic environments, establishing a new world-model based, multimodal pathway to physical intelligence.

Title: Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward

Authors: Muhammad Islam, Tao Huang, Euijoon Ahn, Usman Naseem
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03191
Pdf URL: https://arxiv.org/pdf/2506.03191
Copy Paste: [[2506.03191]] Multimodal Generative AI with Autoregressive LLMs for Human Motion Understanding and Generation: A Way Forward(https://arxiv.org/abs/2506.03191)
Keywords: generation, generative
Abstract: This paper presents an in-depth survey on the use of multimodal Generative Artificial Intelligence (GenAI) and autoregressive Large Language Models (LLMs) for human motion understanding and generation, offering insights into emerging methods, architectures, and their potential to advance realistic and versatile motion synthesis. Focusing exclusively on text and motion modalities, this research investigates how textual descriptions can guide the generation of complex, human-like motion sequences. The paper explores various generative approaches, including autoregressive models, diffusion models, Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and transformer-based models, by analyzing their strengths and limitations in terms of motion quality, computational efficiency, and adaptability. It highlights recent advances in text-conditioned motion generation, where textual inputs are used to control and refine motion outputs with greater precision. The integration of LLMs further enhances these models by enabling semantic alignment between instructions and motion, improving coherence and contextual relevance. This systematic survey underscores the transformative potential of text-to-motion GenAI and LLM architectures in applications such as healthcare, humanoids, gaming, animation, and assistive technologies, while addressing ongoing challenges in generating efficient and realistic human motion.

Title: FLEX: A Large-Scale Multi-Modal Multi-Action Dataset for Fitness Action Quality Assessment

Authors: Hao Yin, Lijun Gu, Paritosh Parmar, Lin Xu, Tianxiao Guo, Weiwei Fu, Yang Zhang, Tianyou Zheng
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03198
Pdf URL: https://arxiv.org/pdf/2506.03198
Copy Paste: [[2506.03198]] FLEX: A Large-Scale Multi-Modal Multi-Action Dataset for Fitness Action Quality Assessment(https://arxiv.org/abs/2506.03198)
Keywords: quality assessment
Abstract: With the increasing awareness of health and the growing desire for aesthetic physique, fitness has become a prevailing trend. However, the potential risks associated with fitness training, especially with weight-loaded fitness actions, cannot be overlooked. Action Quality Assessment (AQA), a technology that quantifies the quality of human action and provides feedback, holds the potential to assist fitness enthusiasts of varying skill levels in achieving better training outcomes. Nevertheless, current AQA methodologies and datasets are limited to single-view competitive sports scenarios and the RGB modality, and they lack professional assessment and guidance of fitness actions. To address this gap, we propose the FLEX dataset, the first multi-modal, multi-action, large-scale dataset that incorporates surface electromyography (sEMG) signals into AQA. FLEX utilizes high-precision MoCap to collect 20 different weight-loaded actions performed by 38 subjects across 3 different skill levels for 10 repetitions each, containing 5 different views of the RGB video, 3D pose, sEMG, and physiological information. Additionally, FLEX incorporates knowledge graphs into AQA, constructing annotation rules in the form of penalty functions that map weight-loaded actions, action keysteps, error types, and feedback. We evaluated various baseline methods on FLEX, demonstrating that multimodal data, multiview data, and fine-grained annotations significantly enhance model performance. FLEX not only advances AQA methodologies and datasets towards multi-modal and multi-action scenarios but also fosters the integration of artificial intelligence within the fitness domain. Dataset and code are available at this https URL.

Title: Channel-adaptive Cross-modal Generative Semantic Communication for Point Cloud Transmission

Authors: Wanting Yang, Zehui Xiong, Qianqian Yang, Ping Zhang, Merouane Debbah, Rahim Tafazolli
Subjects: cs.CV, cs.NI
Abstract URL: https://arxiv.org/abs/2506.03211
Pdf URL: https://arxiv.org/pdf/2506.03211
Copy Paste: [[2506.03211]] Channel-adaptive Cross-modal Generative Semantic Communication for Point Cloud Transmission(https://arxiv.org/abs/2506.03211)
Keywords: generative
Abstract: With the rapid development of autonomous driving and extended reality, efficient transmission of point clouds (PCs) has become increasingly important. In this context, we propose a novel channel-adaptive cross-modal generative semantic communication (SemCom) for PC transmission, called GenSeC-PC. GenSeC-PC employs a semantic encoder that fuses images and point clouds, where images serve as non-transmitted side information. Meanwhile, the decoder is built upon the backbone of PointDif. Such a cross-modal design not only ensures high compression efficiency but also delivers superior reconstruction performance compared to PointDif. Moreover, to ensure robust transmission and reduce system complexity, we design a streamlined and asymmetric channel-adaptive joint semantic-channel coding architecture, where only the encoder needs the feedback of average signal-to-noise ratio (SNR) and available bandwidth. In addition, rectified denoising diffusion implicit models are employed to accelerate the decoding process to the millisecond level, enabling real-time PC communication. Unlike existing methods, GenSeC-PC leverages generative priors to ensure reliable reconstruction even from noisy or incomplete source PCs. More importantly, it supports fully analog transmission, improving compression efficiency by eliminating the need for error-free side information transmission common in prior SemCom approaches. Simulation results confirm the effectiveness of cross-modal semantic extraction and dual-metric guided fine-tuning, highlighting the framework's robustness across diverse conditions, including low SNR, bandwidth limitations, varying numbers of 2D images, and previously unseen objects.

Title: DiaBlo: Diagonal Blocks Are Sufficient For Finetuning

Authors: Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang
Subjects: cs.LG, cs.AI, cs.CL, math.OC
Abstract URL: https://arxiv.org/abs/2506.03230
Pdf URL: https://arxiv.org/pdf/2506.03230
Copy Paste: [[2506.03230]] DiaBlo: Diagonal Blocks Are Sufficient For Finetuning(https://arxiv.org/abs/2506.03230)
Keywords: generation
Abstract: Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Codes are available at this https URL.
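A minimal sketch of what "updating only the diagonal blocks" of a weight matrix can look like, assuming a square frozen weight matrix and square trainable blocks that tile its diagonal. The function names (`block_diag_matvec`, `adapted_forward`, `num_trainable`) are hypothetical, and this is not the authors' code. It only illustrates the parameter structure: the update is applied block by block, with no low-rank product involved.

```python
def matvec(W, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix, given as a list of square blocks, by x."""
    out, offset = [], 0
    for block in blocks:
        size = len(block)
        chunk = x[offset:offset + size]  # each block only sees its own slice of x
        out.extend(sum(w * v for w, v in zip(row, chunk)) for row in block)
        offset += size
    return out

def adapted_forward(W_frozen, delta_blocks, x):
    """Frozen layer output plus the trainable block-diagonal update."""
    base = matvec(W_frozen, x)
    update = block_diag_matvec(delta_blocks, x)
    return [b + u for b, u in zip(base, update)]

def num_trainable(delta_blocks):
    """Number of trainable entries across all diagonal blocks."""
    return sum(len(block) * len(block[0]) for block in delta_blocks)
```

For a $d \times d$ matrix split into $N$ equal blocks, this structure has $d^2 / N$ trainable entries instead of $d^2$. In the sketch, all-zero blocks leave the frozen layer's output unchanged.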

Title: BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF

Authors: Kaiwen Duan, Hongwei Yao, Yufei Chen, Ziyun Li, Tong Qiao, Zhan Qin, Cong Wang
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03234
Pdf URL: https://arxiv.org/pdf/2506.03234
Copy Paste: [[2506.03234]] BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF(https://arxiv.org/abs/2506.03234)
Keywords: generation
Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning text-to-image (T2I) models with human preferences. However, RLHF's feedback mechanism also opens new pathways for adversaries. This paper demonstrates the feasibility of hijacking T2I models by poisoning a small fraction of preference data with natural-appearing examples. Specifically, we propose BadReward, a stealthy clean-label poisoning attack targeting the reward model in multi-modal RLHF. BadReward operates by inducing feature collisions between visually contradicted preference data instances, thereby corrupting the reward model and indirectly compromising the T2I model's integrity. Unlike existing alignment poisoning techniques focused on single (text) modality, BadReward is independent of the preference annotation process, enhancing its stealth and practical threat. Extensive experiments on popular T2I models show that BadReward can consistently guide the generation towards improper outputs, such as biased or violent imagery, for targeted concepts. Our findings underscore the amplified threat landscape for RLHF in multi-modal systems, highlighting the urgent need for robust defenses. Disclaimer. This paper contains uncensored toxic content that might be offensive or disturbing to the readers.

Title: Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas

Authors: Austin Silveria, Soham V. Govande, Daniel Y. Fu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03275
Pdf URL: https://arxiv.org/pdf/2506.03275
Copy Paste: [[2506.03275]] Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic Column-Sparse Deltas(https://arxiv.org/abs/2506.03275)
Keywords: generation
Abstract: Diffusion Transformers (DiTs) have achieved state-of-the-art performance in high-quality image and video generation but incur substantial compute cost at inference. A common observation is that DiT latent noise vectors change slowly across inference steps, which suggests that the DiT compute may be redundant across steps. In this paper, we aim to speed up inference by reducing this redundancy, without additional training. We first study how activations change between steps in two state-of-the-art open-source DiTs. We find that just 5-25% of the values in attention and MLP explain 70-90% of the change in activations across steps. This finding motivates our approach, Chipmunk, which uses dynamic sparsity at inference time to recompute only the fastest-changing intermediate activations, while caching the rest. Dynamic sparsity introduces two systems challenges: (1) sparse attention and MLP operations tend to underutilize GPU tensor cores; and (2) computing dynamic sparsity patterns at runtime and caching activations both introduce overhead. To address these challenges, Chipmunk first uses a voxel-based reordering of input tokens to introduce column-wise sparsity. We implement column-sparse kernels utilizing efficient sparse gathers from global to shared GPU memory, achieving a 9.3x speedup at 93% sparsity compared to highly-optimized dense baselines. Second, Chipmunk overlaps the computation of sparsity patterns and cache updates with other parts of the computation (e.g., second layer of the MLP) to hide the extra latency. Chipmunk achieves up to 2.16x speedup on HunyuanVideo and 1.41x on FLUX.1-dev without compromising generation quality. Furthermore, we show that Chipmunk can be stacked on top of full step caching, achieving a 3.72x speedup on HunyuanVideo, a 2.67x speedup on WAN2.1, and a 2.25x speedup on FLUX.1-dev with minimal quality impact.
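A toy sketch of the caching idea described in the abstract above: recompute only the fastest-changing activations and reuse cached values for the rest. Everything here is an assumption made for illustration: the function `sparse_delta_step` is hypothetical, the layer is modelled as an elementwise function, and positions are chosen by how much their input changed. The column-sparse kernels, token reordering, and overlap of computation described in the abstract are not modelled.

```python
def sparse_delta_step(cached_out, prev_in, new_in, fn, keep_fraction):
    """Recompute fn only where the input changed most; reuse the cache elsewhere.

    Returns the updated outputs and the sorted indices that were recomputed.
    """
    k = max(1, int(len(new_in) * keep_fraction))
    changes = [abs(n - p) for n, p in zip(new_in, prev_in)]
    # Indices of the k largest input changes between consecutive steps.
    top = sorted(range(len(new_in)), key=lambda i: changes[i], reverse=True)[:k]
    out = list(cached_out)  # start from the cached activations
    for i in top:
        out[i] = fn(new_in[i])  # refresh only the selected positions
    return out, sorted(top)
```

With `keep_fraction=0.25` on four activations, a single position is recomputed per step and the other three come from the cache. The accuracy of such a scheme depends on how concentrated the step-to-step change actually is.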

Title: Robustness in Both Domains: CLIP Needs a Robust Text Encoder

Authors: Elias Abad Rocamora, Christian Schlarmann, Naman Deep Singh, Yongtao Wu, Matthias Hein, Volkan Cevher
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.03355
Pdf URL: https://arxiv.org/pdf/2506.03355
Copy Paste: [[2506.03355]] Robustness in Both Domains: CLIP Needs a Robust Text Encoder(https://arxiv.org/abs/2506.03355)
Keywords: generation, generative
Abstract: Adversarial input attacks can cause a significant shift of CLIP embeddings. This can affect the downstream robustness of models incorporating CLIP in the pipeline, such as text-to-image generative models or large vision language models. While some efforts have been made towards making the CLIP image encoders robust, the robustness of text encoders remains unexplored. In this work, we address this gap in the literature. We propose LEAF: an efficient adversarial finetuning method for the text domain, with the ability to scale to large CLIP models. Our models significantly improve the zero-shot adversarial accuracy in the text domain, while maintaining the vision performance provided by robust image encoders. When combined with text-to-image diffusion models, we can improve the generation quality under adversarial noise. When employing our robust CLIP encoders in multimodal retrieval tasks, we improve the recall under adversarial noise over standard CLIP models. Finally, we show that robust text encoders facilitate better reconstruction of input text from its embedding via direct optimization.

Title: Adaptive Task Vectors for Large Language Models

Authors: Joonseong Kang, Soojeong Lee, Subeen Park, Sumin Park, Taero Kim, Jihee Kim, Ryunyi Lee, Kyungwoo Song
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.03426
Pdf URL: https://arxiv.org/pdf/2506.03426
Copy Paste: [[2506.03426]] Adaptive Task Vectors for Large Language Models(https://arxiv.org/abs/2506.03426)
Keywords: generation
Abstract: In-Context Learning (ICL) enables Large Language Models (LLMs) to perform tasks without parameter updates by conditioning on a few demonstrations provided in the prompt. Despite its success, ICL suffers from several limitations, including sensitivity to demonstration order, context length constraints, and computational inefficiency. To address these challenges, task vector-based approaches compress task information into a single vector. However, these methods typically construct task vectors from fixed sets of demonstrations and reuse them across input queries, without conditioning on the specific input. This limitation can lead models to struggle with effective adaptation when the input query is not well aligned with the underlying demonstrations, consequently degrading their generalization performance on unseen tasks. To overcome this limitation, we propose Adaptive Task Vectors (ATV), a simple and effective framework that dynamically generates task vectors conditioned on each input query. ATV employs a small language model to generate task vectors, which are then transformed to match the target LLM's architecture and applied to guide its output generation. In contrast to ICL and previous vector-based approaches, which rely on fixed demonstration sets and their corresponding vectors, ATV dynamically generates task vectors tailored to each specific input query and task. Consequently, ATV demonstrates strong performance and generalization capabilities, even for unseen tasks. Furthermore, we provide a theoretical analysis indicating that ATV is expressively equivalent to LoRA under equal rank budgets and more expressive than Prefix-Tuning, thereby offering formal support for its representational advantage.

Title: Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior

Authors: Yue Gong, Raul Castro Fernandez
Subjects: cs.LG, cs.CL
Abstract URL: https://arxiv.org/abs/2506.03444
Pdf URL: https://arxiv.org/pdf/2506.03444
Copy Paste: [[2506.03444]] Exploiting LLMs for Automatic Hypothesis Assessment via a Logit-Based Calibrated Prior(https://arxiv.org/abs/2506.03444)
Keywords: generation
Abstract: As hypothesis generation becomes increasingly automated, a new bottleneck has emerged: hypothesis assessment. Modern systems can surface thousands of statistical relationships -- correlations, trends, causal links -- but offer little guidance on which ones are novel, non-trivial, or worthy of expert attention. In this work, we study the complementary problem to hypothesis generation: automatic hypothesis assessment. Specifically, we ask: given a large set of statistical relationships, can we automatically assess which ones are novel and worth further exploration? We focus on correlations as they are a common entry point in exploratory data analysis that often serve as the basis for forming deeper scientific or causal hypotheses. To support automatic assessment, we propose to leverage the vast knowledge encoded in LLMs' weights to derive a prior distribution over the correlation value of a variable pair. If an LLM's prior expects the correlation value observed, then such correlation is not surprising, and vice versa. We propose the Logit-based Calibrated Prior, an LLM-elicited correlation prior that transforms the model's raw output logits into a calibrated, continuous predictive distribution over correlation values. We evaluate the prior on a benchmark of 2,096 real-world variable pairs and it achieves a sign accuracy of 78.8%, a mean absolute error of 0.26, and 95% credible interval coverage of 89.2% in predicting Pearson correlation coefficient. It also outperforms a fine-tuned RoBERTa classifier in binary correlation prediction and achieves higher precision@K in hypothesis ranking. We further show that the prior generalizes to correlations not seen during LLM pretraining, reflecting context-sensitive reasoning rather than memorization.
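A minimal sketch of how a prior over correlation values could be read off a set of logits and used to score surprise, under stated assumptions: the logits are taken to be one per equal-width bin over $[-1, 1]$, calibration is reduced to a single softmax temperature, and the result is a discretized rather than continuous distribution. The names `prior_over_bins`, `prior_mean`, and `surprisal` are hypothetical. The abstract does not specify the actual construction, so this shows only the general logits-to-prior-to-surprise flow.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an (assumed) calibration temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def prior_over_bins(logits, temperature=1.0):
    """Turn one logit per bin into bin centers on [-1, 1] and probabilities."""
    probs = softmax(logits, temperature)
    n = len(probs)
    centers = [-1.0 + (2 * i + 1) / n for i in range(n)]
    return centers, probs

def prior_mean(centers, probs):
    """Expected correlation under the discretized prior."""
    return sum(c * p for c, p in zip(centers, probs))

def surprisal(observed_r, probs):
    """Negative log prior probability of the bin containing the observed value."""
    n = len(probs)
    idx = min(int((observed_r + 1.0) / 2.0 * n), n - 1)
    return -math.log(probs[idx])
```

A high `surprisal` marks an observed correlation the prior did not expect, which is the sense in which the abstract treats a correlation as surprising. A low value marks one the prior already anticipated.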

Title: RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Authors: Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03448
Pdf URL: https://arxiv.org/pdf/2506.03448
Copy Paste: [[2506.03448]] RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions(https://arxiv.org/abs/2506.03448)
Keywords: generation
Abstract: Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce RefEdit-Bench, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce RefEdit -- an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our RefEdit, trained on only 20,000 editing triplets, outperforms the Flux/SD3 model-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We release data and checkpoint for reproducibility.

Title: CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model

Authors: Yuxuan Chen, Haipeng Xie
Subjects: cs.CV, eess.SY
Abstract URL: https://arxiv.org/abs/2506.03502
Pdf URL: https://arxiv.org/pdf/2506.03502
Copy Paste: [[2506.03502]] CHIME: Conditional Hallucination and Integrated Multi-scale Enhancement for Time Series Diffusion Model(https://arxiv.org/abs/2506.03502)
Keywords: generative
Abstract: The denoising diffusion probabilistic model has become a mainstream generative model, achieving significant success in various computer vision tasks. Recently, there has been initial exploration of applying diffusion models to time series tasks. However, existing studies still face challenges in multi-scale feature alignment and generative capabilities across different entities and long-time scales. In this paper, we propose CHIME, a conditional hallucination and integrated multi-scale enhancement framework for time series diffusion models. By employing multi-scale decomposition and adaptive integration, CHIME captures the decomposed features of time series, achieving in-domain distribution alignment between generated and original samples. In addition, we introduce a feature hallucination module in the conditional denoising process, enabling the transfer of temporal features through the training of category-independent transformation layers. Experimental results on publicly available real-world datasets demonstrate that CHIME achieves state-of-the-art performance and exhibits excellent generative generalization capabilities in few-shot scenarios.
摘要：denoising扩散概率模型已成为一种主流生成模型，在各种计算机视觉任务中取得了重大成功。最近，已经对将扩散模型应用于时间序列任务进行了初步探索。但是，现有的研究仍然面临着不同实体和长期规模的多尺度特征对齐和生成能力的挑战。在本文中，我们提出了Chime，这是时间序列扩散模型的条件幻觉和集成的多尺度增强框架。通过采用多尺度分解和自适应整合，Chime捕获了时间序列的分解特征，从而实现了生成的和原始样品之间的内域分布对齐。此外，我们在有条件的降级过程中介绍了一个特征幻觉模块，从而通过训练独立于类别的转换层来使时间特征转移。公开可用的现实世界数据集的实验结果表明，Chime可以实现最先进的性能，并在几个射击场景中具有出色的生成概括能力。

Title: DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03517
Pdf URL: https://arxiv.org/pdf/2506.03517
Copy Paste: [[2506.03517]] DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models(https://arxiv.org/abs/2506.03517)
Keywords: generation
Abstract: Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
摘要：直接偏好优化（DPO）最近已被用作文本到视频扩散模型的训练后技术。为了获得培训数据，要求注释者提供从独立噪声产生的两个视频之间提供偏好。但是，这种方法禁止细粒度的比较，我们指出，它偏向于低动作夹，因为它们通常包含更少的视觉伪像。在这项工作中，我们介绍了densedpo，这种方法通过做出三个贡献来解决这些缺点。首先，我们通过剥夺了地面真相视频的损坏的副本来为DPO创建每个视频对。这会导致与运动结构相似的对齐对，同时在局部细节上有所不同，从而有效中和运动偏置。其次，我们利用由此产生的时间对齐来在短段而不是整个剪辑上标记偏好，从而产生更密集，更精确的学习信号。 DensedPo只有三分之一的标签数据，可以大大改善Vanilla DPO的运动产生，同时以文本对齐，视觉质量和时间一致性进行匹配。最后，我们表明，使用现成的视觉语言模型（VLM）解锁自动偏好注释：GPT准确地预测了类似于特定于任务特定的微调视频奖励模型的细分级偏好，而对这些标签进行了培训的DensedPo，可以使用人类标签来实现这些标签的性能。
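
To make the segment-level labeling idea concrete, here is a minimal sketch of a DPO-style loss averaged over aligned short segments rather than whole clips. It is an illustration under stated assumptions, not the DenseDPO implementation: the function name `segment_dpo_loss`, the `beta` value, and the use of per-segment log-likelihoods are all hypothetical. For a diffusion model those log-likelihoods would have to be approximated, and the abstract does not say how.

```python
import math


def segment_dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, prefs, beta=0.1):
    """Mean DPO-style loss over temporally aligned segments of two videos.

    logp_a, logp_b         : per-segment log-likelihoods under the trained model
    ref_logp_a, ref_logp_b : the same quantities under a frozen reference model
    prefs                  : +1 if video A's segment is preferred, -1 if B's,
                             0 for a tie (tied segments are skipped)
    All inputs are hypothetical stand-ins; the abstract does not define them.
    """
    losses = []
    for la, lb, ra, rb, p in zip(logp_a, logp_b, ref_logp_a, ref_logp_b, prefs):
        if p == 0:
            continue  # no preference signal for this segment
        margin = beta * p * ((la - ra) - (lb - rb))
        # -log(sigmoid(margin)) = log(1 + exp(-margin)); guard against overflow
        if margin > -30.0:
            losses.append(math.log1p(math.exp(-margin)))
        else:
            losses.append(-margin)
    return sum(losses) / len(losses) if losses else 0.0
```

Because each segment contributes its own term, one video pair can yield several preference signals, which is the denser learning signal the abstract describes.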

Title: Path Generation and Evaluation in Video Games: A Nonparametric Statistical Approach

Authors: Daniel Campa, Mehdi Saeedi, Ian Colbert, Srinjoy Das
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03522
Pdf URL: https://arxiv.org/pdf/2506.03522
Copy Paste: [[2506.03522]] Path Generation and Evaluation in Video Games: A Nonparametric Statistical Approach(https://arxiv.org/abs/2506.03522)
Keywords: generation, generative
Abstract: Navigation path traces play a crucial role in video game design, serving as a vital resource for both enhancing player engagement and fine-tuning non-playable character behavior. Generating such paths with human-like realism can enrich the overall gaming experience, and evaluating path traces can provide game designers insights into player interactions. Despite the impressive recent advancements in deep learning-based generative modeling, the video game industry hesitates to adopt such models for path generation, often citing their complex training requirements and interpretability challenges. To address these problems, we propose a novel path generation and evaluation approach that is grounded in principled nonparametric statistics and provides precise control while offering interpretable insights. Our path generation method fuses two statistical techniques: (1) nonparametric model-free transformations that capture statistical characteristics of path traces through time; and (2) copula models that capture statistical dependencies in space. For path evaluation, we adapt a nonparametric three-sample hypothesis test designed to determine if the generated paths are overfit (mimicking the original data too closely) or underfit (diverging too far from it). We demonstrate the precision and reliability of our proposed methods with empirical analysis on two existing gaming benchmarks to showcase controlled generation of diverse navigation paths. Notably, our novel path generator can be fine-tuned with user controllable parameters to create navigation paths that exhibit varying levels of human-likeness in contrast to those produced by neural network-based agents. The code is available at this https URL.
摘要：导航路径跟踪在视频游戏设计中起着至关重要的作用，是增强玩家参与度和微调不可播放的角色行为的重要资源。使用类似人类的现实主义产生这样的路径可以丰富整体游戏体验，评估路径痕迹可以为游戏设计师的洞察力提供对玩家互动的见解。尽管在基于深度学习的生成建模方面取得了令人印象深刻的进步，但视频游戏行业犹豫不决地采用这种模型来生成路径，并以其复杂的培训要求和可解释性挑战。为了解决这些问题，我们提出了一种新颖的路径生成和评估方法，该方法基于原则上的非参数统计数据，并提供了精确的控制，同时提供了可解释的见解。我们的路径生成方法融合了两种统计技术：（1）捕获路径痕迹的统计特征的非参数模型转换；（2）捕获空间中统计依赖性的Copula模型。为了进行路径评估，我们适应了非参数三样本假设检验，旨在确定生成的路径是否过拟合（太接近模仿原始数据）或不合适（与之相差太远）。我们通过对两个现有游戏基准的经验分析来证明我们提出的方法的精度和可靠性，以展示受控生成的各种导航路径。值得注意的是，与基于神经网络的代理相比，我们的新型路径生成器可以通过可控参数进行微调，以创建具有不同水平的人类风格的导航路径。该代码可在此HTTPS URL上找到。

Title: Learning Monotonic Probabilities with a Generative Cost Model

Authors: Yongxiang Tang, Yanhua Cheng, Xiaocheng Liu, Chenchen Jiao, Yanxiang Zeng, Ning Luo, Pengjia Yuan, Xialong Liu, Peng Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03542
Pdf URL: https://arxiv.org/pdf/2506.03542
Copy Paste: [[2506.03542]] Learning Monotonic Probabilities with a Generative Cost Model(https://arxiv.org/abs/2506.03542)
Keywords: generative
Abstract: In many machine learning tasks, it is often necessary for the relationship between input and output variables to be monotonic, including both strictly monotonic and implicitly monotonic relationships. Traditional methods for maintaining monotonicity mainly rely on construction or regularization techniques, whereas this paper shows that the issue of strict monotonic probability can be viewed as a partial order between an observable revenue variable and a latent cost variable. This perspective enables us to reformulate the monotonicity challenge into modeling the latent cost variable. To tackle this, we introduce a generative network for the latent cost variable, termed the Generative Cost Model (GCM), which inherently addresses the strict monotonic problem, and propose the Implicit Generative Cost Model (IGCM) to address the implicit monotonic problem. We further validate our approach with a numerical simulation of quantile regression and conduct multiple experiments on public datasets, showing that our method significantly outperforms existing monotonic modeling techniques. The code for our experiments can be found at this https URL.
摘要：在许多机器学习任务中，输入变量和输出变量之间的关系通常是单调的，包括严格的单调和隐式单调关系。保持单调性的传统方法主要依赖于构建或正则化技术，而本文表明，严格的单调概率问题可以看作是可观察到的收入变量和潜在成本变量之间的部分订单。这种观点使我们能够将单调性挑战重新制定为建模潜在的成本变量。为了解决这个问题，我们引入了一个用于潜在成本变量的生成网络，称为生成成本模型（GCM），该网络固有地解决了严格的单调问题，并提出了隐性生成成本模型（IGCM）来解决隐式单调问题。我们通过对分位数回归的数值模拟进一步验证我们的方法，并在公共数据集上进行多个实验，这表明我们的方法显着优于现有的单调建模技术。我们的实验代码可以在此HTTPS URL上找到。
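
The reformulation in this abstract can be illustrated with a small sketch: if the positive outcome is modeled as the event "latent cost does not exceed observed revenue", then the estimated probability is non-decreasing in revenue by construction. This is a scalar toy example, not the GCM architecture. `monotone_prob` and `sample_cost` are hypothetical names, and `sample_cost` stands in for whatever generative network produces cost samples in the paper.

```python
import random


def monotone_prob(revenue, sample_cost, n_samples=1000, seed=0):
    """Monte Carlo estimate of P(cost <= revenue).

    sample_cost : callable taking a random.Random and returning one cost draw
                  (a hypothetical stand-in for a learned generative model)
    The seed is fixed so every call reuses the same cost draws, which makes
    the estimate non-decreasing in `revenue`.
    """
    rng = random.Random(seed)
    costs = [sample_cost(rng) for _ in range(n_samples)]
    return sum(c <= revenue for c in costs) / n_samples


# Usage: probabilities for increasing revenue values never decrease.
probs = [monotone_prob(r, lambda rng: rng.gauss(0.0, 1.0)) for r in (-1.0, 0.0, 1.0, 2.0)]
```

No monotonicity regularizer or constrained layer is needed here, because the ordering between revenue and cost carries the constraint.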

Title: VCDiag: Classifying Erroneous Waveforms for Failure Triage Acceleration

Authors: Minh Luu, Surya Jasper, Khoi Le, Evan Pan, Michael Quinn, Aakash Tyagi, Jiang Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03590
Pdf URL: https://arxiv.org/pdf/2506.03590
Copy Paste: [[2506.03590]] VCDiag: Classifying Erroneous Waveforms for Failure Triage Acceleration(https://arxiv.org/abs/2506.03590)
Keywords: generation
Abstract: Failure triage in design functional verification is critical but time-intensive, relying on manual specification reviews, log inspections, and waveform analyses. While machine learning (ML) has improved areas like stimulus generation and coverage closure, its application to RTL-level simulation failure triage, particularly for large designs, remains limited. VCDiag offers an efficient, adaptable approach using VCD data to classify failing waveforms and pinpoint likely failure locations. In the largest experiment, VCDiag achieves over 94% accuracy in identifying the top three most likely modules. The framework introduces a novel signal selection and statistical compression approach, achieving over 120x reduction in raw data size while preserving features essential for classification. It can also be integrated into diverse Verilog/SystemVerilog designs and testbenches.
摘要：设计功能验证中的故障分流是至关重要的，但次数很密，依赖于手动规范评论，日志检查和波形分析。尽管机器学习（ML）改善了刺激产生和覆盖范围等领域，但其应用于RTL级模拟失败分类，尤其是对于大型设计，仍然有限。 VCDIAG使用VCD数据提供了一种有效的，适应性的方法来对失败的波形进行分类，并确定了可能的故障位置。在最大的实验中，VCDIAG在识别前三名最有可能的模块时达到了超过94％的精度。该框架引入了一种新型的信号选择和统计压缩方法，可实现原始数据大小的120倍以上，同时保留分类必不可少的功能。它也可以集成到多样化的Verilog/SystemVerilog设计和测试板中。

Title: Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts

Authors: Jiaxing Zhang, Xinyi Zeng, Hao Tang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03591
Pdf URL: https://arxiv.org/pdf/2506.03591
Copy Paste: [[2506.03591]] Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts(https://arxiv.org/abs/2506.03591)
Keywords: generation
Abstract: Unified multimodal large language models (MLLMs) based on end-to-end autoregressive (AR) transformers effectively integrate both understanding and generation tasks within a single framework. However, intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges, often leading to suboptimal trade-offs and task interference. Existing solutions, such as decoupling shared visual encoders, fall short of fundamentally resolving these conflicts due to inherent AR architecture. In this paper, we propose a novel approach that decouples internal components of AR to resolve task objective conflicts. Specifically, we design UTAMoE, a Unified Task-Aware Mixture-of-Experts (MoE) framework that decouples internal AR modules via a Task-Aware MoE Layer to create task-specific optimization subpaths. To enhance task differentiation while maintaining overall coordination, we introduce a novel Two-Stage Training Strategy. Extensive experiments on multimodal benchmarks demonstrate that UTAMoE mitigates task objective conflicts, achieving state-of-the-art performance across various tasks. Visualizations and ablation studies further validate the effectiveness of our approach.
摘要：基于端到端自回归（AR）变压器的统一多模式大型语言模型（MLLM）有效地整合了一个框架内的理解和发电任务。但是，在理解中高级语义抽象与一代细节保存之间的内在任务目标冲突构成了重大挑战，通常会导致次优的权衡和任务干扰。现有的解决方案（例如解耦共享的视觉编码器）由于固有的AR架构而无法从根本上解决这些冲突。在本文中，我们提出了一种新颖的方法，该方法将解除AR内部组成部分来解决任务目标冲突。具体而言，我们设计了Utamoe，这是一种统一的任务感知的专家（MOE）框架，该框架通过任务感知的MOE层将内部AR模块解散，以创建特定于任务的优化子路口。为了增强任务差异化的同时保持整体协调，我们引入了一种新颖的两阶段培训策略。对多模式基准测试的广泛实验表明，UTAMOE减轻了任务目标冲突，从而在各种任务中实现了最先进的绩效。可视化和消融研究进一步验证了我们方法的有效性。

Title: ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning

Authors: Feng Han, Yang Jiao, Shaoxiang Chen, Junhao Xu, Jingjing Chen, Yu-Gang Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03596
Pdf URL: https://arxiv.org/pdf/2506.03596
Copy Paste: [[2506.03596]] ControlThinker: Unveiling Latent Semantics for Controllable Image Generation through Visual Reasoning(https://arxiv.org/abs/2506.03596)
Keywords: generation
Abstract: The field of controllable image generation has seen significant advancements, with various architectures improving generation layout consistency with control signals. However, contemporary methods still face challenges in bridging the semantic gap between input text prompts with sparse semantics and the target images, often over-relying on low-level control signals to infer regional details. To address this challenge, we propose ControlThinker, a novel framework that employs a "comprehend-then-generate" paradigm. First, by incentivizing the visual reasoning capability of an MLLM, latent semantics from control images are mined to enrich text prompts. This enriched semantic understanding then seamlessly aids in image generation without the need for additional complex modifications. To further tackle the uncertainty arising from the ambiguity of control images, we encourage broader exploration of reasoning trajectories and select the optimal one using a metric-based output reward model (ORM). Extensive experimental results demonstrate that ControlThinker effectively mitigates the semantic gap between raw text prompts and target images, resulting in improved visual quality and semantic consistency across a wide range of benchmarks. The code and models are available at this https URL.
摘要：可控图像生成的领域已经取得了重大进步，各种体系结构可改善与控制信号的生成布局一致性。但是，当代方法在弥合语义上的语义提示和目标图像之间的语义差距方面仍然面临挑战，通常在低级控制信号上过度延伸到推断区域细节。为了应对这一挑战，我们提出了ControlThinker，这是一个采用“理解 - 然后生成”范式的新型框架。首先，通过激励MLLM的视觉推理能力，从控制图像中挖出了潜在语义，以丰富文本提示。这种丰富的语义理解，然后无缝地帮助图像生成，而无需进行其他复杂的修改。为了进一步解决控制图像的歧义引起的不确定性，我们鼓励对推理轨迹进行更广泛的探索，并使用基于公制的输出奖励模型（ORM）选择最佳的轨迹。广泛的实验结果表明，ControlThinker有效地减轻了原始文本提示和目标图像之间的语义差距，从而改善了各种基准的视觉质量和语义一致性。代码和模型可在此HTTPS URL上找到。

Title: Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision

Authors: Tomoya Yoshida, Shuhei Kurita, Taichi Nishimura, Shinsuke Mori
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03605
Pdf URL: https://arxiv.org/pdf/2506.03605
Copy Paste: [[2506.03605]] Generating 6DoF Object Manipulation Trajectories from Action Description in Egocentric Vision(https://arxiv.org/abs/2506.03605)
Keywords: generation
Abstract: Learning to use tools or objects in common scenes, particularly handling them in various ways as instructed, is a key challenge for developing interactive robots. Training models to generate such manipulation trajectories requires a large and diverse collection of detailed manipulation demonstrations for various objects, which is nearly unfeasible to gather at scale. In this paper, we propose a framework that leverages large-scale ego- and exo-centric video datasets -- constructed globally with substantial effort -- of Exo-Ego4D to extract diverse manipulation trajectories at scale. From these extracted trajectories with the associated textual action description, we develop trajectory generation models based on visual and point cloud-based language models. In the recently proposed egocentric vision-based in-a-quality trajectory dataset of HOT3D, we confirmed that our models successfully generate valid object trajectories, establishing a training dataset and baseline models for the novel task of generating 6DoF manipulation trajectories from action descriptions in egocentric vision.
摘要：学习在常见场景中使用工具或对象，尤其是按照指示的各种方式处理它们，这是开发交互式机器人的关键挑战。生成此类操作轨迹的培训模型需要针对各种物体进行大量多样的详细操作演示，这几乎是不可行的。在本文中，我们提出了一个框架，该框架利用exo-ego4D大规模构建的大规模的自我和以外的视频数据集（全球构建），以大规模提取各种操纵轨迹。从这些提取的轨迹和相关的文本动作描述中，我们基于基于视觉和点云的语言模型开发轨迹生成模型。在最近提出的基于以上为中心的视觉的HOT3D的In-A-A-A-A-A-A-A-A-A-A-A-A-A-A-A-A-A-A-A-A-A-a-A-In质量轨迹数据集中，我们确认我们的模型成功地生成了有效的对象轨迹，建立了训练数据集和基线模型，用于从事环境视觉中从动作描述中生成6DOF操纵轨迹的新任务。

Title: Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation

Authors: Chaehun Shin, Jooyoung Choi, Johan Barthelemy, Jungbeom Lee, Sungroh Yoon
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03621
Pdf URL: https://arxiv.org/pdf/2506.03621
Copy Paste: [[2506.03621]] Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation(https://arxiv.org/abs/2506.03621)
Keywords: generation
Abstract: We present Subject Fidelity Optimization (SFO), a novel comparative learning framework for zero-shot subject-driven generation that enhances subject fidelity. Beyond supervised fine-tuning methods that rely only on positive targets and use the diffusion loss as in the pre-training stage, SFO introduces synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. For negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically generates distinctive and informative negatives by intentionally degrading visual and textual cues without expensive human annotations. Moreover, we reweight the diffusion timesteps to focus finetuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms baselines in terms of both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: this https URL
摘要：我们提出了主题保真度优化（SFO），这是一个新型的比较学习框架，用于零射门主题驱动的生成，可增强对象的保真度。除了仅依赖积极目标并使用扩散损失的监督微调方法外，SFO还引入了综合负面靶标，并明确指导该模型通过成对比较偏爱阳性而不是负面的阳性。对于负目标，我们提出了条件降低负面采样（CDN），该采样会自动产生独特的和信息性的负面因素，而无需降低视觉和文本提示而无需昂贵的人类注释。此外，我们将扩散时间段的扩散时间段重点放在中间步骤上，主题详细信息出现。广泛的实验表明，与受试者驱动的生成基准相关的主体保真度和文本对齐方式，具有CDN的SFO显着优于基准。项目页面：此HTTPS URL

Title: EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation

Authors: Cheng Zhang, Hongxia Xie, Bin Wen, Songhan Zuo, Ruoxuan Zhang, Wen-Huang Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03652
Pdf URL: https://arxiv.org/pdf/2506.03652
Copy Paste: [[2506.03652]] EmoArt: A Multidimensional Dataset for Emotion-Aware Artistic Generation(https://arxiv.org/abs/2506.03652)
Keywords: generation
Abstract: With the rapid advancement of diffusion models, text-to-image generation has achieved significant progress in image resolution, detail fidelity, and semantic alignment, particularly with models like Stable Diffusion 3.5, Stable Diffusion XL, and FLUX 1. However, generating emotionally expressive and abstract artistic images remains a major challenge, largely due to the lack of large-scale, fine-grained emotional datasets. To address this gap, we present the EmoArt Dataset -- one of the most comprehensive emotion-annotated art datasets to date. It contains 132,664 artworks across 56 painting styles (e.g., Impressionism, Expressionism, Abstract Art), offering rich stylistic and cultural diversity. Each image includes structured annotations: objective scene descriptions, five key visual attributes (brushwork, composition, color, line, light), binary arousal-valence labels, twelve emotion categories, and potential art therapy effects. Using EmoArt, we systematically evaluate popular text-to-image diffusion models for their ability to generate emotionally aligned images from text. Our work provides essential data and benchmarks for emotion-driven image synthesis and aims to advance fields such as affective computing, multimodal learning, and computational art, enabling applications in art therapy and creative design. The dataset and more details can be accessed via our project website.
摘要：随着扩散模型的快速发展，文本到图像的生成在图像解决，细节忠诚度和语义一致性方面取得了重大进展，尤其是在诸如稳定扩散3.5，稳定的扩散XL和Flux 1之类的模型中。然而，由于缺乏大量的精神，因此由于缺乏大量的表现力量，因此产生情感表现力和抽象的艺术图像仍然是一个主要的挑战。为了解决这一差距，我们介绍了EMOART数据集，这是迄今为止最全面的情感宣布数据集之一。它包含56种绘画风格（例如印象派，表现主义，抽象艺术）的132,664件艺术品，提供丰富的风格和文化多样性。每个图像都包括结构化注释：目标场景描述，五个关键的视觉属性（刷子，构图，颜色，线，光），二进制唤醒价标签，十二个情感类别以及潜在的艺术疗法效果。使用EMOART，我们系统地评估流行的文本对图像扩散模型，以使其能够从文本中产生情感上的图像。我们的工作为情绪驱动的图像综合提供了必不可少的数据和基准，并旨在推进情感计算，多模式学习和计算艺术等领域，从而在艺术疗法和创意设计中提供了应用。可以通过我们的项目网站访问数据集和更多详细信息。

Title: Out-of-Distribution Graph Models Merging

Authors: Yidi Wang, Jiawei Gu, Pei Xiaobing, Xubin Zheng, Xiao Luo, Pengyang Wang, Ziyue Qiao
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.03674
Pdf URL: https://arxiv.org/pdf/2506.03674
Copy Paste: [[2506.03674]] Out-of-Distribution Graph Models Merging(https://arxiv.org/abs/2506.03674)
Keywords: generation
Abstract: This paper studies a novel problem of out-of-distribution graph models merging, which aims to construct a generalized model from multiple graph models pre-trained on different domains with distribution discrepancy. This problem is challenging because of the difficulty in learning domain-invariant knowledge implicitly in model parameters and consolidating expertise from potentially heterogeneous GNN backbones. In this work, we propose a graph generation strategy that instantiates the mixture distribution of multiple domains. Then, we merge and fine-tune the pre-trained graph models via a MoE module and a masking mechanism for generalized adaptation. Our framework is architecture-agnostic and can operate without any source/target domain data. Both theoretical analysis and experimental results demonstrate the effectiveness of our approach in addressing the model generalization problem.
摘要：本文研究了一个新的问题的新问题，其分布图模型合并，该问题旨在构建从具有分布差异的不同领域预训练的多个图形模型的广义模型。这个问题是具有挑战性的，因为很难在模型参数中隐含地学习域不变知识，并从潜在的异质GNN骨架中巩固了专业知识。在这项工作中，我们提出了一种图形生成策略，该策略实例化了多个域的混合分布。然后，我们通过MOE模块和一种掩盖机制合并并微调预训练的图模型，以进行广义适应。我们的框架是架构 - 不平衡的，并且可以在没有任何源/目标域数据的情况下运行。理论分析和实验结果都证明了我们方法在解决模型泛化问题方面的有效性。

Title: PRJ: Perception-Retrieval-Judgement for Generated Images

Authors: Qiang Fu, Zonglei Jing, Zonghao Ying, Xiaoqian Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03683
Pdf URL: https://arxiv.org/pdf/2506.03683
Copy Paste: [[2506.03683]] PRJ: Perception-Retrieval-Judgement for Generated Images(https://arxiv.org/abs/2506.03683)
Keywords: generative
Abstract: The rapid progress of generative AI has enabled remarkable creative capabilities, yet it also raises urgent concerns regarding the safety of AI-generated visual content in real-world applications such as content moderation, platform governance, and digital media regulation. This includes unsafe material such as sexually explicit images, violent scenes, hate symbols, propaganda, and unauthorized imitations of copyrighted artworks. Existing image safety systems often rely on rigid category filters and produce binary outputs, lacking the capacity to interpret context or reason about nuanced, adversarially induced forms of harm. In addition, standard evaluation metrics (e.g., attack success rate) fail to capture the semantic severity and dynamic progression of toxicity. To address these limitations, we propose Perception-Retrieval-Judgement (PRJ), a cognitively inspired framework that models toxicity detection as a structured reasoning process. PRJ follows a three-stage design: it first transforms an image into descriptive language (perception), then retrieves external knowledge related to harm categories and traits (retrieval), and finally evaluates toxicity based on legal or normative rules (judgement). This language-centric structure enables the system to detect both explicit and implicit harms with improved interpretability and categorical granularity. In addition, we introduce a dynamic scoring mechanism based on a contextual toxicity risk matrix to quantify harmfulness across different semantic dimensions. Experiments show that PRJ surpasses existing safety checkers in detection accuracy and robustness while uniquely supporting structured category-level toxicity interpretation.
摘要：生成AI的快速进步已经实现了非凡的创造力，但它也引起了人们对AI生成的视觉内容在现实世界应用中的安全性的紧急问题，例如内容审核，平台治理和数字媒体监管。这包括不安全的材料，例如性明确的图像，暴力场景，仇恨符号，宣传以及对版权艺术品的未经授权的模仿。现有的图像安全系统通常依赖于严格的类别过滤器并产生二进制输出，缺乏解释有关细微的，对抗性诱发的伤害形式的上下文或原因的能力。此外，标准评估指标（例如，攻击成功率）无法捕获语义严重程度和毒性的动态进展。为了解决这些局限性，我们提出了一种感知 - 返回判断（PRJ），这是一个具有认知启发的框架，将毒性检测模型为结构化推理过程。 PRJ遵循三阶段的设计：它首先将图像转换为描述性语言（感知），然后检索与危害类别和特质（检索）有关的外部知识，并最终根据法律或规范规则（判断）评估毒性。以语言为中心的结构使系统能够通过改善的可解释性和分类粒度来检测显式和隐性危害。此外，我们基于上下文毒性风险矩阵介绍了动态评分机制，以量化不同语义维度的有害性。实验表明，PRJ以检测准确性和鲁棒性超过了现有的安全检查器，同时唯一支持结构化类别级别的毒性解释。

Title: Advancements in Artificial Intelligence Applications for Cardiovascular Disease Research

Authors: Yuanlin Mo, Haishan Huang, Bocheng Liang, Weibo Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03698
Pdf URL: https://arxiv.org/pdf/2506.03698
Copy Paste: [[2506.03698]] Advancements in Artificial Intelligence Applications for Cardiovascular Disease Research(https://arxiv.org/abs/2506.03698)
Keywords: generative
Abstract: Recent advancements in artificial intelligence (AI) have revolutionized cardiovascular medicine, particularly through integration with computed tomography (CT), magnetic resonance imaging (MRI), electrocardiography (ECG) and ultrasound (US). Deep learning architectures, including convolutional neural networks and generative adversarial networks, enable automated analysis of medical imaging and physiological signals, surpassing human capabilities in diagnostic accuracy and workflow efficiency. However, critical challenges persist, including the inability to validate input data accuracy, which may propagate diagnostic errors. This review highlights AI's transformative potential in precision diagnostics while underscoring the need for robust validation protocols to ensure clinical reliability. Future directions emphasize hybrid models integrating multimodal data and adaptive algorithms to refine personalized cardiovascular care.
摘要：人工智能（AI）的最新进展彻底改变了心血管医学，特别是通过与计算机断层扫描（CT），磁共振成像（MRI），心电图（ECG）和超声（US）的整合。深度学习体系结构，包括卷积神经网络和生成对抗网络，可以自动分析医学成像和生理信号，超过了人类在诊断准确性和工作流程效率方面的能力。但是，关键挑战持续存在，包括无法验证输入数据准确性，这可能会传播诊断错误。这篇评论强调了AI在精确诊断方面的变革潜力，同时强调了对可靠验证方案的需求，以确保临床可靠性。未来的方向强调集成多模式数据和自适应算法以完善个性化心血管护理的混合模型。

Title: On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity

Authors: Quentin Bertrand, Anne Gagneux, Mathurin Massias, Rémi Emonet
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03719
Pdf URL: https://arxiv.org/pdf/2506.03719
Copy Paste: [[2506.03719]] On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity(https://arxiv.org/abs/2506.03719)
Keywords: generative
Abstract: Modern deep generative models can now produce high-quality synthetic samples that are often indistinguishable from real training data. A growing body of research aims to understand why recent methods -- such as diffusion and flow matching techniques -- generalize so effectively. Among the proposed explanations are the inductive biases of deep learning architectures and the stochastic nature of the conditional flow matching loss. In this work, we rule out the latter -- the noisy nature of the loss -- as a primary contributor to generalization in flow matching. First, we empirically show that in high-dimensional settings, the stochastic and closed-form versions of the flow matching loss yield nearly equivalent losses. Then, using state-of-the-art flow matching models on standard image datasets, we demonstrate that both variants achieve comparable statistical performance, with the surprising observation that using the closed-form can even improve performance.
摘要：现代深层生成模型现在可以产生高质量的合成样品，这些样品通常与实际训练数据无法区分。越来越多的研究旨在了解为什么最近的方法（例如扩散和流匹配技术）如此有效地概括。在提出的解释中，有深度学习体系结构的感应偏见以及条件流匹配损失的随机性质。在这项工作中，我们排除了后者 - 损失的嘈杂性质 - 是流动匹配中泛化的主要贡献者。首先，我们从经验上表明，在高维设置中，流动匹配损耗的随机和闭合形式产生了几乎同等的损失。然后，使用标准图像数据集上的最先进的流量匹配模型，我们证明了这两个变体都具有可比的统计性能，令人惊讶的是，使用封闭形式甚至可以提高性能。
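
For readers unfamiliar with the closed-form target this abstract mentions, here is a minimal sketch of the exact marginal velocity field for a finite training set. It assumes the common linear path x_t = (1 - t) x_0 + t x_1 with x_0 drawn from a standard Gaussian; the paper's exact setup may differ. The function name `closed_form_velocity` is hypothetical. Under that assumption x_t given x_1 = x_i is Gaussian with mean t x_i and standard deviation (1 - t), so the marginal velocity is a softmax-weighted average of the conditional velocities (x_i - x) / (1 - t).

```python
import math


def closed_form_velocity(x, t, data):
    """Exact marginal flow-matching velocity at point x and time t in [0, 1).

    x    : list of floats, the current point
    data : list of training points, each a list of floats of the same length
    Assumes the linear path x_t = (1 - t) * x_0 + t * x_1 with x_0 ~ N(0, I).
    """
    sigma = 1.0 - t
    # Log-weights: log N(x; t * x_i, sigma^2 I), up to a shared constant.
    logw = [
        -sum((xj - t * dj) ** 2 for xj, dj in zip(x, d)) / (2.0 * sigma ** 2)
        for d in data
    ]
    m = max(logw)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in logw]
    z = sum(w)
    # Weighted average of the conditional velocities (x_i - x) / (1 - t).
    return [
        sum(wi * (d[j] - x[j]) for wi, d in zip(w, data)) / (z * sigma)
        for j in range(len(x))
    ]
```

The stochastic loss regresses onto a single sampled conditional velocity per step, while this expression averages over all training points at once. Those are the two variants the abstract compares.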

Title: SAAT: Synergistic Alternating Aggregation Transformer for Image Super-Resolution

Authors: Jianfeng Wu, Nannan Xu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.03740
Pdf URL: https://arxiv.org/pdf/2506.03740
Copy Paste: [[2506.03740]] SAAT: Synergistic Alternating Aggregation Transformer for Image Super-Resolution(https://arxiv.org/abs/2506.03740)
Keywords: super-resolution
Abstract: Single image super-resolution is a well-known downstream task which aims to restore low-resolution images into high-resolution images. At present, models based on Transformers have shone brightly in the field of super-resolution due to their ability to capture long-term dependencies in information. However, current methods typically compute self-attention in non-overlapping windows to save computational costs, and the standard self-attention computation only focuses on its results, thereby neglecting the useful information across channels and the rich spatial structural information generated in the intermediate process. Channel attention and spatial attention have, respectively, brought significant improvements to various downstream visual tasks in terms of extracting feature dependency and spatial structure relationships, but the synergistic relationship between channel and spatial attention has not been fully explored. To address these issues, we propose a novel model, the Synergistic Alternating Aggregation Transformer (SAAT), which can better utilize the potential information of features. In SAAT, we introduce the Efficient Channel & Window Synergistic Attention Group (CWSAG) and the Spatial & Window Synergistic Attention Group (SWSAG). On the one hand, CWSAG combines efficient channel attention with shifted window attention, enhancing non-local feature fusion, and producing more visually appealing results. On the other hand, SWSAG leverages spatial attention to capture rich structured feature information, thereby enabling SAAT to more effectively extract structural information. Experimental results and ablation studies demonstrate the effectiveness of SAAT in the field of super-resolution. SAAT achieves performance comparable to that of the state-of-the-art (SOTA) under the same quantity of parameters.
摘要：单图像超分辨率是一项众所周知的下游任务，旨在将低分辨率图像恢复到高分辨率图像中。目前，由于捕获信息长期依赖性的能力，基于变压器的模型在超分辨率领域中具有明亮的发光。但是，当前方法通常会在非重叠窗口中计算自我注意力以节省计算成本，而标准的自我发项计算仅着眼于其结果，从而忽略了跨渠道和中间过程中产生的丰富空间结构信息的有用信息。在提取特征依赖关系和空间结构关系方面，渠道的关注和空间注意力分别为各种下游视觉任务带来了重大改进，但是我们建议一个新的模型。协同交替的聚合变压器（SAAT）可以更好地利用功能的潜在信息。在SAAT中，我们介绍了高效的渠道和窗口协同关注组（CWSAG）和空间和窗口协同关注组（SWSAG）。一方面，CWSAG将有效的渠道注意力与转移的窗户注意力相结合，增强了非本地特征融合，并产生更吸引人的结果。另一方面，SWSAG利用空间注意力来捕获丰富的结构化特征信息，从而使SAAT能够更有效地提取该HTTP URL实验结果，而消融研究表明，SAAT在超级分辨率领域中的有效性。 SAAT在相同数量的参数下实现与最先进的（SOTA）相当的性能。

Title: Joint Video Enhancement with Deblurring, Super-Resolution, and Frame Interpolation Network

Authors: Giyong Choi, HyunWook Park
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.03892
Pdf URL: https://arxiv.org/pdf/2506.03892
Copy Paste: [[2506.03892]] Joint Video Enhancement with Deblurring, Super-Resolution, and Frame Interpolation Network(https://arxiv.org/abs/2506.03892)
Keywords: super-resolution
Abstract: Video quality is often severely degraded by multiple factors rather than a single factor. These low-quality videos can be restored to high-quality videos by sequentially performing appropriate video enhancement techniques. However, the sequential approach was inefficient and sub-optimal because most video enhancement approaches were designed without taking into account that multiple factors together degrade video quality. In this paper, we propose a new joint video enhancement method that mitigates multiple degradation factors simultaneously by resolving an integrated enhancement problem. Our proposed network, named DSFN, directly produces a high-resolution, high-frame-rate, and clear video from a low-resolution, low-frame-rate, and blurry video. In the DSFN, low-resolution and blurry input frames are enhanced by a joint deblurring and super-resolution (JDSR) module. Meanwhile, intermediate frames between input adjacent frames are interpolated by a triple-frame-based frame interpolation (TFBFI) module. The proper combination of the proposed modules of DSFN can achieve superior performance on the joint video enhancement task. Experimental results show that the proposed method outperforms other sequential state-of-the-art techniques on public datasets with a smaller network size and faster processing time.
摘要：视频质量通常会因多种因素而不是一个因素而严重降低。这些低质量的视频可以通过依次执行适当的视频增强技术来恢复到高质量的视频。但是，顺序方法效率低下且最佳选择，因为大多数视频增强方法都是设计的，而没有考虑到多个因素在一起降低视频质量。在本文中，我们提出了一种新的联合视频增强方法，该方法通过解决综合增强问题来同时减轻多个降解因子。我们提出的名为DSFN的网络直接从低分辨率，低帧速率和模糊视频中产生了高分辨率，高率速率和清晰的视频。在DSFN中，通过关节脱毛和超分辨率（JDSR）模块增强了低分辨率和模糊输入帧。同时，基于三框架的框架插值（TFBFI）模块，输入相邻帧之间的中间帧插值。 DSFN提出的模块的正确组合可以在联合视频增强任务上实现卓越的性能。实验结果表明，所提出的方法在公共数据集上的其他顺序最先进的技术的表现较小，其网络大小且处理时间更快。

Title: Lower Ricci Curvature for Hypergraphs

Authors: Shiyi Yang, Can Chen, Didong Li
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03943
Pdf URL: https://arxiv.org/pdf/2506.03943
Copy Paste: [[2506.03943]] Lower Ricci Curvature for Hypergraphs(https://arxiv.org/abs/2506.03943)
Keywords: generative
Abstract: Networks with higher-order interactions, prevalent in biological, social, and information systems, are naturally represented as hypergraphs, yet their structural complexity poses fundamental challenges for geometric characterization. While curvature-based methods offer powerful insights in graph analysis, existing extensions to hypergraphs suffer from critical trade-offs: combinatorial approaches such as Forman-Ricci curvature capture only coarse features, whereas geometric methods like Ollivier-Ricci curvature offer richer expressivity but demand costly optimal transport computations. To address these challenges, we introduce hypergraph lower Ricci curvature (HLRC), a novel curvature metric defined in closed form that achieves a principled balance between interpretability and efficiency. Evaluated across diverse synthetic and real-world hypergraph datasets, HLRC consistently reveals meaningful higher-order organization, distinguishing intra- from inter-community hyperedges, uncovering latent semantic labels, tracking temporal dynamics, and supporting robust clustering of hypergraphs based on global structure. By unifying geometric sensitivity with algorithmic simplicity, HLRC provides a versatile foundation for hypergraph analytics, with broad implications for tasks including node classification, anomaly detection, and generative modeling in complex systems.
摘要：具有高阶相互作用的网络在生物，社会和信息系统中普遍存在，自然表示为超图，但是它们的结构复杂性对几何表征构成了基本挑战。虽然基于曲率的方法在图形分析中提供了有力的见解，但现有的超图表遭受了关键的权衡措施：组合方法（例如eman-ricci曲率曲率捕获）仅捕获粗糙特征，而诸如Ollivier-Ricci曲率等几何方法则提供了更丰富的表达性，但需求昂贵，但昂贵的最佳运输计算。为了应对这些挑战，我们引入了HyperGraph Lower Ricci曲率（HLRC），这是一种以封闭形式定义的新型曲率度量，可以在解释性和效率之间达到原则平衡。 HLRC在各种综合和现实世界中的超图数据集进行了评估，始终揭示有意义的高阶组织，将内部与社区间超预交区分开，发现潜在的语义标签，跟踪时间动力学，并支持基于全球结构的超图的可靠群集。通过将几何灵敏度统一使用算法简单性，HLRC为超图分析提供了多功能基础，对包括节点分类，异常检测和复杂系统中的生成模型在内的任务具有广泛的影响。

Title: Solving Inverse Problems via Diffusion-Based Priors: An Approximation-Free Ensemble Sampling Approach

Authors: Haoxuan Chen, Yinuo Ren, Martin Renqiang Min, Lexing Ying, Zachary Izzo
Subjects: cs.LG, cs.CV, eess.IV, math.NA, stat.ML
Abstract URL: https://arxiv.org/abs/2506.03979
Pdf URL: https://arxiv.org/pdf/2506.03979
Copy Paste: [[2506.03979]] Solving Inverse Problems via Diffusion-Based Priors: An Approximation-Free Ensemble Sampling Approach(https://arxiv.org/abs/2506.03979)
Keywords: generative
Abstract: Diffusion models (DMs) have proven to be effective in modeling high-dimensional distributions, leading to their widespread adoption for representing complex priors in Bayesian inverse problems (BIPs). However, current DM-based posterior sampling methods proposed for solving common BIPs rely on heuristic approximations to the generative process. To exploit the generative capability of DMs and avoid the usage of such approximations, we propose an ensemble-based algorithm that performs posterior sampling without the use of heuristic approximations. Our algorithm is motivated by existing works that combine DM-based methods with the sequential Monte Carlo (SMC) method. By examining how the prior evolves through the diffusion process encoded by the pre-trained score function, we derive a modified partial differential equation (PDE) governing the evolution of the corresponding posterior distribution. This PDE includes a modified diffusion term and a reweighting term, which can be simulated via stochastic weighted particle methods. Theoretically, we prove that the error with respect to the true posterior distribution can be bounded in terms of the training error of the pre-trained score function and the number of particles in the ensemble. Empirically, we validate our algorithm on several inverse problems in imaging to show that our method gives more accurate reconstructions compared to existing DM-based methods.
摘要：事实证明，扩散模型（DMS）在对高维分布进行建模方面有效，从而导致其在代表贝叶斯反问题（BIPS）中代表复杂先验的广泛采用（BIPS）。但是，提出的用于解决常见BIP的当前基于DM的后验方法依赖于生成过程的启发式近似。为了利用DMS的生成能力并避免使用此类近似值，我们提出了一种基于集合的算法，该算法在不使用启发式近似值的情况下执行后验采样。我们的算法是由将基于DM的方法与顺序蒙特卡洛（SMC）方法相结合的现有作品的动机。通过检查先前如何通过预先训练的分数函数编码的扩散过程演变，我们得出了管理相应后验分布的演变的修改后的部分微分方程（PDE）。该PDE包括一个修改的扩散项和一个重新加权项，可以通过随机加权粒子方法模拟。从理论上讲，我们证明，真正的后验分布之间的误差可以根据预训练的得分函数的训练误差和集合中的粒子数量界定。从经验上讲，我们验证了成像中几个反问题的算法，以表明我们的方法与现有的基于DM的方法相比提供了更准确的重建。

Title: Optimal Spiking Brain Compression: Improving One-Shot Post-Training Pruning and Quantization for Spiking Neural Networks

Authors: Lianfeng Shi, Ao Li, Benjamin Ward-Cherrier
Subjects: cs.LG, cs.NE
Abstract URL: https://arxiv.org/abs/2506.03996
Pdf URL: https://arxiv.org/pdf/2506.03996
Copy Paste: [[2506.03996]] Optimal Spiking Brain Compression: Improving One-Shot Post-Training Pruning and Quantization for Spiking Neural Networks(https://arxiv.org/abs/2506.03996)
Keywords: generation
Abstract: Spiking Neural Networks (SNNs) have emerged as a new generation of energy-efficient neural networks suitable for implementation on neuromorphic hardware. As neuromorphic hardware has limited memory and computing resources, weight pruning and quantization have recently been explored to improve SNNs' efficiency. State-of-the-art SNN pruning/quantization methods employ multiple compression and training iterations, increasing the cost for pre-trained or very large SNNs. In this paper, we propose a new one-shot post-training pruning/quantization framework, Optimal Spiking Brain Compression (OSBC), that adapts the Optimal Brain Compression (OBC) method of [Frantar, Singh, and Alistarh, 2023] for SNNs. Rather than minimizing the loss on neuron input current as OBC does, OSBC achieves more efficient and accurate SNN compression in one pass by minimizing the loss on spiking neuron membrane potential with a small sample dataset. Our experiments on neuromorphic datasets (N-MNIST, CIFAR10-DVS, DVS128-Gesture) demonstrate that OSBC can achieve 97% sparsity through pruning with 1.41%, 10.20%, and 1.74% accuracy loss, or 4-bit symmetric quantization with 0.17%, 1.54%, and 7.71% accuracy loss, respectively. Code will be available on GitHub.
摘要：尖峰神经网络（SNN）已成为适合于神经形态硬件实现的新一代节能神经网络。由于神经形态硬件的内存和计算资源有限，最近探索了权重修剪和量化以提高SNN的效率。最先进的SNN修剪/量化方法采用多次压缩和训练迭代，增加了预训练或非常大的SNN的成本。在本文中，我们提出了一种新的单次训练后修剪/量化框架，即最佳尖峰脑压缩（OSBC），该框架将[Frantar, Singh, and Alistarh, 2023]的最佳脑压缩（OBC）方法适配于SNN。OSBC并没有像OBC那样最小化神经元输入电流上的损失，而是使用一个小样本数据集最小化尖峰神经元膜电位上的损失，从而在一次通过中实现了更高效、更准确的SNN压缩。我们在神经形态数据集（N-MNIST，CIFAR10-DVS，DVS128-Gesture）上的实验表明，OSBC可以通过修剪达到97％的稀疏度，精度损失分别为1.41％、10.20％和1.74％；或通过4位对称量化，精度损失分别为0.17％、1.54％和7.71％。代码将在GitHub上提供。
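The abstract above mentions 4-bit symmetric quantization. As a rough illustration of that term only, the following is a minimal sketch of plain round-to-nearest symmetric uniform quantization. It is not the OSBC procedure, which the abstract says minimizes a loss on membrane potential; the function name `quantize_symmetric` and the example weights are hypothetical.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest symmetric uniform quantization of a list of floats.

    Illustrative baseline only (an assumption, not the paper's method).
    """
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4 bits: integer grid is [-7, 7]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale shared by all weights
    # Map each weight to the nearest integer code, clamped to the symmetric grid.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    # Reconstruct approximate weights from the integer codes.
    dequantized = [c * scale for c in codes]
    return codes, scale, dequantized


weights = [0.7, -0.33, 0.12, 0.0]                 # hypothetical example values
codes, scale, dequantized = quantize_symmetric(weights)
print(codes, scale, dequantized)
```

Because the grid is symmetric around zero, a zero weight maps exactly to code 0, and each reconstructed weight differs from the original by at most half a quantization step.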

Title: Point Cloud Quality Assessment Using the Perceptual Clustering Weighted Graph (PCW-Graph) and Attention Fusion Network

Authors: Abdelouahed Laazoufi, Mohammed El Hassouni, Hocine Cherifi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04081
Pdf URL: https://arxiv.org/pdf/2506.04081
Copy Paste: [[2506.04081]] Point Cloud Quality Assessment Using the Perceptual Clustering Weighted Graph (PCW-Graph) and Attention Fusion Network(https://arxiv.org/abs/2506.04081)
Keywords: quality assessment
Abstract: No-Reference Point Cloud Quality Assessment (NR-PCQA) is critical for evaluating 3D content in real-world applications where reference models are unavailable.
摘要：无参考点云质量评估（NR-PCQA）对于评估参考模型不可用的现实应用程序中的3D内容至关重要。

Title: UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

Authors: Jinting Wang, Shan Yang, Li Liu
Subjects: cs.CV, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04134
Pdf URL: https://arxiv.org/pdf/2506.04134
Copy Paste: [[2506.04134]] UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation(https://arxiv.org/abs/2506.04134)
Keywords: generation
Abstract: Cued Speech (CS) enhances lipreading through hand coding, providing precise speech perception support for the hearing-impaired. The CS Video-to-Speech generation (CSV2S) task aims to convert the CS visual expressions (CS videos) of hearing-impaired individuals into comprehensible speech signals. Direct generation of speech from CS video (called single CSV2S) yields poor performance due to insufficient CS data. Current research mostly focuses on CS Recognition (CSR), which converts video content into linguistic text. Based on this, one straightforward approach to CSV2S is to combine CSR with a Text-to-Speech system. This combined architecture relies on text as an intermediate medium for stepwise cross-modal alignment, which may lead to error propagation and temporal misalignment between speech and video dynamics. To address these challenges, we propose a novel approach that directly generates speech from CS videos without relying on intermediate text. Building upon this, we propose UniCUE, the first unified framework for CSV2S, whose core innovation lies in the integration of the CSR task that provides fine-grained visual-semantic information to facilitate speech generation from CS videos. More precisely, UniCUE introduces (1) a novel fine-grained semantic alignment pool to ensure precise mapping between visual features and speech contents; (2) a VisioPhonetic adapter to bridge cross-task representations, ensuring seamless compatibility between two distinct tasks (i.e., CSV2S and CSR); and (3) a pose-aware visual processor to enhance fine-grained spatiotemporal correlations between lip and hand movements in CS video. Experiments on our newly established Chinese CS dataset (14 cuers: 8 hearing-impaired and 6 normal-hearing) show that UniCUE significantly reduces Word Error Rate by 78.3% and improves lip-speech synchronization by 32% compared to the single CSV2S.
摘要：提示语音（CS）通过手工编码增强了口头读力，为听力受损提供了精确的语音感知支持。 CS视频到语音生成（CSV2S）任务旨在将听力受损的个体的CS视觉表达（CS视频）转换为可理解的语音信号。 CS视频的直接生成语音（称为单个CSV2）由于CS数据不足而产生的性能差。当前的研究主要集中在CS识别（CSR）上，将视频内容转换为语言文本。基于此，CSV2S的一种直接方式是将CSR与文本到语音系统相结合。这种结合的架构依靠文本作为逐步交叉模式对齐的中间介质，这可能导致错误传播和语音和视频动力学之间的时间错位。为了应对这些挑战，我们提出了一种新颖的方法，该方法直接从CS视频中产生语音而不依赖中间文本。在此基础上，我们提出了UniCue，这是CSV2的第一个统一框架，其核心创新在于CSR任务的整合，该任务提供了提供精细的视觉语义信息，以促进CS视频中的语音生成。更准确地说，（1）一个新颖的细粒语义对齐池，以确保视觉特征和语音内容之间的精确映射；（2）一个视觉适配器桥梁交叉任务表示，以确保两个不同的任务之间的无缝兼容性（即CSV2S和CSR）；（3）引入了姿势感知的视觉处理器，以增强CS视频中唇和手动运动之间的细粒时空相关性。与单个CSV2相比，我们新建立的中国CS数据集（14：8听力受损和6个正常听觉）的实验表明，我们的Unicue将单词错误率显着降低了78.3％，并将Lip语音同步提高了32％。

Title: Image Editing As Programs with Diffusion Models

Authors: Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04158
Pdf URL: https://arxiv.org/pdf/2506.04158
Copy Paste: [[2506.04158]] Image Editing As Programs with Diffusion Models(https://arxiv.org/abs/2506.04158)
Keywords: generation
Abstract: While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To mitigate this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions. Code is available at this https URL.
摘要：尽管扩散模型在文本到图像生成方面取得了巨大的成功，但通过教学驱动的图像编辑，它们遇到了重大挑战。我们的研究突出了一个关键挑战：这些模型特别在结构上不一致的编辑中挣扎，涉及实质性布局变化。为了减轻这一差距，我们将图像编辑介绍为程序（IEAP），这是建立在扩散变压器（DIT）体系结构上的统一图像编辑框架。 IAP以还原镜头的方式进行了教学编辑，将复杂的编辑说明分解为原子操作序列。每个操作都是通过轻质适配器共享相同的DIT主链实现的，并且专门用于特定类型的编辑。这些操作由基于视觉语言模型（VLM）的代理编程，协作支持任意和结构上不一致的转换。通过以这种方式进行模块化和测序编辑，我可以在从简单调整到实质性结构变化的各种编辑任务中稳固地概括。广泛的实验表明，在各种编辑方案上，IAP明显优于标准基准测试的最先进方法。在这些评估中，我们的框架提供了卓越的准确性和语义保真度，尤其是对于复杂的多步骤说明。代码可在此HTTPS URL上找到。

Title: Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints

Authors: Utkarsh Utkarsh, Pengfei Cai, Alan Edelman, Rafael Gomez-Bombarelli, Christopher Vincent Rackauckas
Subjects: cs.LG, cs.AI, cs.CE, math.NA
Abstract URL: https://arxiv.org/abs/2506.04171
Pdf URL: https://arxiv.org/pdf/2506.04171
Copy Paste: [[2506.04171]] Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints(https://arxiv.org/abs/2506.04171)
Keywords: generative
Abstract: Deep generative models have recently been applied to physical systems governed by partial differential equations (PDEs), offering scalable simulation and uncertainty-aware inference. However, enforcing physical constraints, such as conservation laws (linear and nonlinear) and physical consistencies, remains challenging. Existing methods often rely on soft penalties or architectural biases that fail to guarantee hard constraints. In this work, we propose Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear constraints in pretrained flow-based generative models. PCFM continuously guides the sampling process through physics-based corrections applied to intermediate solution states, while remaining aligned with the learned flow and satisfying physical constraints. Empirically, PCFM outperforms both unconstrained and constrained baselines on a range of PDEs, including those with shocks, discontinuities, and sharp features, while ensuring exact constraint satisfaction at the final solution. Our method provides a general framework for enforcing hard constraints in both scientific and general-purpose generative models, especially in applications where constraint satisfaction is essential.
摘要：最近，深层生成模型已应用于由部分微分方程（PDE）控制的物理系统，提供可扩展的模拟和不确定性感知的推理。但是，强制执行身体限制，例如保护法（线性和非线性）和身体一致性，仍然具有挑战性。现有方法通常依赖于无法保证严格限制的软惩罚或建筑偏见。在这项工作中，我们提出了一个物理受限的流量匹配（PCFM），这是一种零摄像的推理框架，可在基于流动的生成模型中执行任意非线性约束。 PCFM通过应用于中间解决方案状态的基于物理的校正来连续指导采样过程，同时保持与学习的流量并满足物理约束。从经验上讲，PCFM在一系列PDE上均优于不受约束和约束的基线，包括那些具有冲击，不连续性和尖锐特征的基准，同时确保最终解决方案的确切约束满意度。我们的方法提供了一个通用框架，用于在科学和通用生成模型中执行硬性约束，尤其是在约束满意度至关重要的应用中。

Title: Does Prompt Design Impact Quality of Data Imputation by LLMs?

Authors: Shreenidhi Srinivasan, Lydia Manikonda
Subjects: cs.LG, cs.ET
Abstract URL: https://arxiv.org/abs/2506.04172
Pdf URL: https://arxiv.org/pdf/2506.04172
Copy Paste: [[2506.04172]] Does Prompt Design Impact Quality of Data Imputation by LLMs?(https://arxiv.org/abs/2506.04172)
Keywords: generation
Abstract: Generating realistic synthetic tabular data presents a critical challenge in machine learning. Class imbalance in the data adds another layer of complexity. This paper presents a novel token-aware data imputation method that leverages the in-context learning capabilities of large language models. This is achieved through the combination of a structured group-wise CSV-style prompting technique and the elimination of irrelevant contextual information in the input prompt. We test this approach with two class-imbalanced binary classification datasets and evaluate the effectiveness of imputation using classification-based evaluation metrics. The experimental results demonstrate that our approach significantly reduces the input prompt size while maintaining or improving imputation quality compared to our baseline prompt, especially for datasets that are relatively small. The contributions of this work are two-fold: 1) it sheds light on the importance of prompt design when leveraging LLMs for synthetic data generation and 2) it addresses a critical gap in LLM-based data imputation for class-imbalanced datasets with missing data by providing a practical solution within computational constraints. We hope that our work will foster further research and discussions about leveraging the incredible potential of LLMs and prompt engineering techniques for synthetic data generation.
摘要：生成现实的合成表格数据在机器学习中提出了一个关键的挑战。当此数据包含类不平衡问题时，它增加了另一层复杂性。本文提出了一种新颖的令牌数据插补方法，该方法利用了大型语言模型的内在学习能力。这是通过结构化组的CSV式提示技术的结合以及在输入提示中消除无关的上下文信息来实现的。我们使用两个平衡的二元分类数据集测试了这种方法，并使用基于分类的评估指标评估了插补的有效性。实验结果表明，与基线提示相比，我们的方法在保持或提高估算质量的同时大大降低了输入及时尺寸，尤其是大小相对较小的数据集。这项提出的工作的贡献是两倍 - 1）它阐明了当利用LLMS进行合成数据生成时及时设计的重要性，并且2）它解决了基于LLM的数据插入基于LLM的数据，用于类别不平衡的数据集中，通过在计算约束中提供实用解决方案，这些数据集中的数据集中的数据集中存在缺失的数据。我们希望我们的工作能够促进有关利用LLM的难以置信的潜力和促使工程技术来生成的进一步研究和讨论。

Title: OpenThoughts: Data Recipes for Reasoning Models

Authors: Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, Yichuan Deng, Sarah Pratt, Vivek Ramanujan, Jon Saad-Falcon, Jeffrey Li, Achal Dave, Alon Albalak, Kushal Arora, Blake Wulfe, Chinmay Hegde, Greg Durrett, Sewoong Oh, Mohit Bansal, Saadia Gabriel, Aditya Grover, Kai-Wei Chang, Vaishaal Shankar, Aaron Gokaslan, Mike A. Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G. Dimakis, Ludwig Schmidt
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.04178
Pdf URL: https://arxiv.org/pdf/2506.04178
Copy Paste: [[2506.04178]] OpenThoughts: Data Recipes for Reasoning Models(https://arxiv.org/abs/2506.04178)
Keywords: generation
Abstract: Reasoning models have made rapid progress on many benchmarks involving math, code, and science. Yet, there are still many open questions about the best training recipes for reasoning since state-of-the-art models often rely on proprietary datasets with little to no public information available. To address this, the goal of the OpenThoughts project is to create open-source datasets for training reasoning models. After initial explorations, our OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks such as AIME and LiveCodeBench. We then improve our dataset further by systematically investigating each step of our data generation pipeline with 1,000+ controlled experiments, which led to OpenThoughts3. Scaling the pipeline to 1.2M examples and using QwQ-32B as teacher yields our OpenThinker3-7B model, which achieves state-of-the-art results: 53% on AIME 2025, 51% on LiveCodeBench 06/24-01/25, and 54% on GPQA Diamond. All of our datasets and models are available on this https URL.
摘要：推理模型已在许多涉及数学，代码和科学的基准上取得了迅速的进步。然而，关于推理的最佳培训食谱仍然存在许多公开问题，因为最先进的模型通常依赖于几乎没有可用信息的专有数据集。为了解决这个问题，Openthight项目的目标是创建用于培训推理模型的开源数据集。经过初步探索，我们的Openthoughts2-1M数据集导致OpenthInker2-32B，这是第一个在公共推理数据上训练的模型，以匹配DeepSeek-R1-Distill-32B在AIME和LiveCodeBench等标准推理基准上。然后，我们通过系统地通过1,000多个受控实验来系统地研究数据生成管道的每个步骤，从而进一步改善数据集，从而导致了Optheeds3。将管道缩放到1.2m的示例，并使用QWQ-32B作为老师，可以产生我们的OpenthInker3-7b模型，该模型可实现最先进的结果：AIME 2025的53％，在LiveCodeBench 06/24-24-01/25上获得51％，在GPQA Diamond上为54％。我们所有的数据集和模型均可在此HTTPS URL上找到。

Title: Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector

Authors: Boyong He, Yuxiang Ji, Zhuoyue Tan, Liaoni Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04211
Pdf URL: https://arxiv.org/pdf/2506.04211
Copy Paste: [[2506.04211]] Diffusion Domain Teacher: Diffusion Guided Domain Adaptive Object Detector(https://arxiv.org/abs/2506.04211)
Keywords: generative
Abstract: Object detectors often suffer a decrease in performance due to the large domain gap between the training data (source domain) and real-world data (target domain). Diffusion-based generative models have shown remarkable abilities in generating high-quality and diverse images, suggesting their potential for extracting valuable features from various domains. To effectively leverage the cross-domain feature representation of diffusion models, in this paper, we train a detector with a frozen-weight diffusion model on the source domain, then employ it as a teacher model to generate pseudo labels on the unlabeled target domain, which are used to guide the supervised learning of the student model on the target domain. We refer to this approach as Diffusion Domain Teacher (DDT). By employing this straightforward yet potent framework, we significantly improve cross-domain object detection performance without compromising the inference speed. Our method achieves an average mAP improvement of 21.2% compared to the baseline on 6 datasets from three common cross-domain detection benchmarks (Cross-Camera, Syn2Real, Real2Artistic), surpassing the current state-of-the-art (SOTA) methods by an average of 5.7% mAP. Furthermore, extensive experiments demonstrate that our method consistently brings improvements even in more powerful and complex models, highlighting the broadly applicable and effective domain adaptation capability of our DDT. The code is available at this https URL.
摘要：由于训练数据（源域）和现实世界数据（目标域）之间的较大域间隙，对象探测器的性能通常会下降。基于扩散的生成模型在产生高质量和多样的图像方面表现出了显着的能力，这表明它们从各个领域中提取有价值的特征的潜力。为了有效利用扩散模型的跨域特征表示形式，在本文中，我们在源域上训练具有冷冻重量扩散模型的检测器，然后将其用作教师模型，以在未标记的目标域上生成伪标签，这些识别标签用于指导目标域中的学生模型在目标域上学习。我们将这种方法称为扩散域老师（DDT）。通过采用这个简单而有效的框架，我们可以显着提高跨域对象检测性能，而不会损害推理速度。与来自三个常见的跨域检测基准（跨摄像机，syn2real，syn2real，real2artistion}的6个数据集的基线相比，我们的方法的平均地图提高了21.2％，与当前的最新技术（SOTA）方法相比，超过5.7％的图表，我们的方法是一致的，既有又一致的构建，又有一个完善的效果，均超过了整个方法。我们的DDT的有效域适应能力。

Title: FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Authors: Xuanhua He, Quande Liu, Zixuan Ye, Wecai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04213
Pdf URL: https://arxiv.org/pdf/2506.04213
Copy Paste: [[2506.04213]] FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers(https://arxiv.org/abs/2506.04213)
Keywords: generation
Abstract: Fine-grained and efficient controllability of video diffusion transformers is increasingly in demand for practical applicability. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full-attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis to identify two key sources of the computation inefficiencies: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. Firstly, to address the token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full-attention. Additionally, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and 2-3 times speedup in averaged time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at this https URL.
摘要：视频扩散变压器上的细粒度和有效的可控性提高了对适用性的需求。最近，在统一条件视频生成的强大范式中出现了内在条件，通过将不同的上下文信号与嘈杂的视频潜在潜伏在长期的统一令牌序列中，可以通过全面处理，从而实现各种控制，并通过全面关注来处理它们，例如，例如，fulldit。尽管它们有效，但随着任务复杂性的增加，这些方法面临二次计算开销，从而阻碍了实际部署。在本文中，我们研究了原始的内部文化调节视频生成框架中忽略的效率瓶颈。我们从系统的分析开始，以确定计算效率低下的两个关键来源：上下文条件令牌内的固有冗余性以及整个扩散过程中上下文范围交互中的计算冗余。基于这些见解，我们提出了FullDit2，这是一个有效的视频生成和编辑任务中一般可控性的内在条件调节框架，从两个关键的角度进行了创新。首先，为了解决令牌冗余，fulldit2利用动态令牌选择机制适应性地识别重要的上下文令牌，从而减少了统一全注意的序列长度。此外，设计了一种选择性上下文缓存机制，以最大程度地减少条件令牌和视频潜在的冗余相互作用。对六种有条件的视频编辑和发电任务进行的广泛实验表明，FullDit2可实现大幅度的计算减少，并且每次扩散步骤平均时间成本的速度为2-3倍，并且视频发电质量的降级最小，甚至更高的性能。项目页面位于\ href {此https url} {this https url}。

Title: Sounding that Object: Interactive Object-Aware Image to Audio Generation

Authors: Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, Yuxuan Wang
Subjects: cs.CV, cs.LG, cs.MM, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2506.04214
Pdf URL: https://arxiv.org/pdf/2506.04214
Copy Paste: [[2506.04214]] Sounding that Object: Interactive Object-Aware Image to Audio Generation(https://arxiv.org/abs/2506.04214)
Keywords: generation
Abstract: Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds. Project page: this https URL
摘要：为复杂的视听场景生成准确的声音是具有挑战性的，尤其是在存在多个对象和声音源的情况下。在本文中，我们提出了一个{\ em Interactive对象感知音频生成}模型，该模型在图像中的用户选择的视觉对象中进行了声音生成。我们的方法将以对象为中心的学习整合到条件潜在扩散模型中，该模型学会通过多模式的注意将图像区域与相应的声音联系起来。在测试时，我们的模型采用图像分割，允许用户在{\ em Object}级别上交互生成声音。从理论上讲，我们的注意机制在功能上近似测试时间分割掩码，从而确保生成的音频与所选对象对齐。定量和定性评估表明，我们的模型表现优于基准，可以在对象及其相关声音之间更好地保持对齐。项目页面：此HTTPS URL

Title: UNIC: Unified In-Context Video Editing

Authors: Zixuan Ye, Xuanhua He, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qifeng Chen, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04216
Pdf URL: https://arxiv.org/pdf/2506.04216
Copy Paste: [[2506.04216]] UNIC: Unified In-Context Video Editing(https://arxiv.org/abs/2506.04216)
Keywords: generation, generative
Abstract: Recent advances in text-to-video generation have sparked interest in generative video editing tasks. Previous methods often rely on task-specific architectures (e.g., additional adapter modules) or dedicated customizations (e.g., DDIM inversion), which limit the integration of versatile editing conditions and the unification of various editing tasks. In this paper, we introduce UNified In-Context Video Editing (UNIC), a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner. To achieve this unification, we represent the inputs of various video editing tasks as three types of tokens: the source video tokens, the noisy video latent, and the multi-modal conditioning tokens that vary according to the specific editing task. Based on this formulation, our key insight is to integrate these three types into a single consecutive token sequence and jointly model them using the native attention operations of DiT, thereby eliminating the need for task-specific adapter designs. Nevertheless, direct task unification under this framework is challenging, leading to severe token collisions and task confusion due to the varying video lengths and diverse condition modalities across tasks. To address these, we introduce task-aware RoPE to facilitate consistent temporal positional encoding, and condition bias that enables the model to clearly distinguish between editing tasks. This allows our approach to adaptively perform different video editing tasks by referring to the source video and varying condition tokens "in context", and to support flexible task composition. To validate our method, we construct a unified video editing benchmark containing six representative video editing tasks. Results demonstrate that our unified approach achieves superior performance on each task and exhibits emergent task composition abilities.
摘要：文本到视频生成的最新进展引发了人们对生成视频编辑任务的兴趣。以前的方法通常依赖于特定于任务的体系结构（例如其他适配器模块）或专用的自定义（例如DDIM倒置），这限制了多功能编辑条件的集成以及各种编辑任务的统一。在本文中，我们介绍了统一的文化视频编辑（UNIC），这是一个简单而有效的框架，以封闭式方式将单个模型中的各种视频编辑任务统一。为了实现此统一，我们将各种视频编辑任务的输入表示为三种令牌：源视频令牌，嘈杂的视频潜伏和根据特定编辑任务而变化的多模式调节令牌。基于此公式，我们的主要见解是将这三种类型集成到单个连续的令牌序列中，并使用DIT的本机注意操作共同对其进行建模，从而消除了对特定于任务适配器设计的需求。然而，此框架下的直接任务统一具有挑战性，导致严重的令牌碰撞和任务困惑，因为视频长度的不同和各个任务的各种状况方式。为了解决这些问题，我们介绍了任务感知的绳索，以促进一致的时间位置编码，并使模型能够明确区分不同的编辑任务。这允许我们的方法通过“在上下文中”中引用源视频和变化的条件令牌，并支持灵活的任务组成，从而可以自适应地执行不同的视频编辑任务。为了验证我们的方法，我们构建了一个统一的视频编辑基准，其中包含六个代表性视频编辑任务。结果表明，我们的统一方法在每个任务上都能达到卓越的表现，并展示出紧急的任务组成能力。

Title: Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation

Authors: Tianyu Huang, Wangguandong Zheng, Tengfei Wang, Yuhao Liu, Zhenwei Wang, Junta Wu, Jie Jiang, Hui Li, Rynson W.H. Lau, Wangmeng Zuo, Chunchao Guo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04225
Pdf URL: https://arxiv.org/pdf/2506.04225
Copy Paste: [[2506.04225]] Voyager: Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation(https://arxiv.org/abs/2506.04225)
Keywords: generation
Abstract: Real-world applications like video gaming and virtual reality often demand the ability to model 3D scenes that users can explore along custom camera trajectories. While significant progress has been made in generating 3D objects from text or images, creating long-range, 3D-consistent, explorable 3D scenes remains a complex and challenging problem. In this work, we present Voyager, a novel video diffusion framework that generates world-consistent 3D point-cloud sequences from a single image with a user-defined camera path. Unlike existing approaches, Voyager achieves end-to-end scene generation and reconstruction with inherent consistency across frames, eliminating the need for 3D reconstruction pipelines (e.g., structure-from-motion or multi-view stereo). Our method integrates three key components: 1) World-Consistent Video Diffusion: A unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observation to ensure global coherence, 2) Long-Range World Exploration: An efficient world cache with point culling and an auto-regressive inference with smooth video sampling for iterative scene extension with context-aware consistency, and 3) Scalable Data Engine: A video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training data curation without manual 3D annotations. Collectively, these designs result in a clear improvement over existing methods in visual quality and geometric accuracy, with versatile applications.
摘要：视频游戏和虚拟现实之类的现实世界应用程序通常需要模拟用户可以沿着自定义相机轨迹探索的3D场景的能力。尽管从文本或图像生成3D对象方面已经取得了重大进展，但创建长程、3D一致、可探索的3D场景仍然是一个复杂且具有挑战性的问题。在这项工作中，我们提出了Voyager，这是一个新颖的视频扩散框架，从具有用户定义的相机路径的单个图像中生成世界一致的3D点云序列。与现有的方法不同，Voyager实现了端到端的场景生成和重建，并在帧之间具有固有的一致性，从而消除了对3D重建管道的需求（例如，运动恢复结构或多视图立体）。我们的方法整合了三个关键组成部分：1）世界一致的视频扩散：一个统一的架构，联合生成对齐的RGB和深度视频序列，并以已有的世界观测为条件以确保全局一致性；2）长程世界探索：带有点剔除的高效世界缓存，以及结合平滑视频采样的自回归推理，用于具有上下文感知一致性的迭代场景扩展；3）可扩展数据引擎：一个视频重建管道，可自动完成任意视频的相机姿态估计和度量深度预测，从而无需手动3D注释即可实现大规模、多样化的训练数据整理。总的来说，这些设计在视觉质量和几何精度上相比现有方法有了明显的改进，并具有多种应用。

Title: LayerFlow: A Unified Model for Layer-aware Video Generation

Authors: Sihui Ji, Hao Luo, Xi Chen, Yuanpeng Tu, Yiyang Wang, Hengshuang Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.04228
Pdf URL: https://arxiv.org/pdf/2506.04228
Copy Paste: [[2506.04228]] LayerFlow: A Unified Model for Layer-aware Video Generation(https://arxiv.org/abs/2506.04228)
Keywords: generation
Abstract: We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos for different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. To address the lack of high-quality layer-wise training videos, we design a multi-stage training strategy to accommodate static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train the content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
摘要：我们提出了LayerFlow，这是一种用于图层吸引视频生成的统一解决方案。给定的每层提示，LayerFlow为透明前景，干净的背景和混合场景生成视频。它还支持多功能变体，例如分解混合视频或生成给定前景的背景，反之亦然。从文本到视频扩散变压器开始，我们将不同层的视频作为子剪辑组织，并利用层嵌入以区分每个夹子和相应的层提示。这样，我们在一个统一的框架中无缝支持上述变体。由于缺乏高质量的层面培训视频，我们设计了一种多阶段训练策略，以适应具有高质量层注释的静态图像。具体来说，我们首先使用低质量的视频数据训练模型。然后，我们调整一个运动洛拉，使该模型与静态帧兼容。之后，我们将内容Lora训练图像数据与高质量分层图像的混合物以及拷贝性的视频数据。在推断期间，我们删除运动洛拉，从而产生带有所需层的光滑视频。