2025-06-30

Title: APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization

Authors: Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao
Subjects: cs.LG, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21655
Pdf URL: https://arxiv.org/pdf/2506.21655
Copy Paste: [[2506.21655]] APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization(https://arxiv.org/abs/2506.21655)
Keywords: generation
Abstract: Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or "overthinking" reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model's existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model's explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at this https URL.
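A minimal sketch of the asymmetric split described in this abstract, under stated assumptions. The abstract gives no formulas for DADS or STCR, so the linear difficulty-to-weight rule, its direction, the length penalty, and every name here (`Sample`, `kl_weight`, `length_penalty`, `beta_max`, `max_len`, `scale`) are hypothetical illustrations, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    correct: bool  # True -> positive group, False -> negative group
    length: int    # response length in tokens


def difficulty(samples):
    """Fraction of sampled responses for one prompt that are wrong."""
    return sum(not s.correct for s in samples) / len(samples)


def kl_weight(samples, beta_max=0.1):
    """Assumed rule (direction is a guess): easier prompts get a larger KL weight."""
    return beta_max * (1.0 - difficulty(samples))


def length_penalty(sample, max_len=1024, scale=0.5):
    """Assumed rule: only negative samples are penalised, in proportion to length."""
    if sample.correct:
        return 0.0
    return scale * min(sample.length, max_len) / max_len


group = [Sample(True, 100), Sample(False, 512), Sample(False, 2048), Sample(True, 50)]
weight_for_positives = kl_weight(group)
penalties = [length_penalty(s) for s in group]
```

The point of the sketch is only the asymmetry: positive samples see a prompt-dependent KL weight, while negative samples see a length-dependent penalty.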

Title: TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation

Authors: Hakan Çapuk, Andrew Bond, Muhammed Burak Kızıl, Emir Göçen, Erkut Erdem, Aykut Erdem
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21681
Pdf URL: https://arxiv.org/pdf/2506.21681
Copy Paste: [[2506.21681]] TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation(https://arxiv.org/abs/2506.21681)
Keywords: generation, generative
Abstract: Recent advances in image generation have led to remarkable improvements in synthesizing perspective images. However, these models still struggle with panoramic image generation due to unique challenges, including varying levels of geometric distortion and the requirement for seamless loop-consistency. To address these issues while leveraging the strengths of the existing models, we introduce TanDiT, a method that synthesizes panoramic scenes by generating grids of tangent-plane images covering the entire 360$^\circ$ view. Unlike previous methods relying on multiple diffusion branches, TanDiT utilizes a unified diffusion model trained to produce these tangent-plane images simultaneously within a single denoising iteration. Furthermore, we propose a model-agnostic post-processing step specifically designed to enhance global coherence across the generated panoramas. To accurately assess panoramic image quality, we also present two specialized metrics, TangentIS and TangentFID, and provide a comprehensive benchmark comprising captioned panoramic datasets and standardized evaluation scripts. Extensive experiments demonstrate that our method generalizes effectively beyond its training data, robustly interprets detailed and complex text prompts, and seamlessly integrates with various generative models to yield high-quality, diverse panoramic images.
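To make the tangent-plane idea concrete, the sketch below maps a point on one tangent-plane image back onto the sphere and then into an equirectangular panorama. It assumes the tangent-plane images are standard gnomonic projections, which the abstract does not state; the function names and the pixel convention are hypothetical.

```python
import math


def tangent_to_sphere(x, y, lon0, lat0):
    """Inverse gnomonic projection: tangent-plane point (x, y) -> (lon, lat) in radians.

    (lon0, lat0) is the point where the plane touches the unit sphere.
    """
    rho = math.hypot(x, y)
    if rho == 0.0:
        return lon0, lat0
    c = math.atan(rho)
    lat = math.asin(
        math.cos(c) * math.sin(lat0) + y * math.sin(c) * math.cos(lat0) / rho
    )
    lon = lon0 + math.atan2(
        x * math.sin(c),
        rho * math.cos(lat0) * math.cos(c) - y * math.sin(lat0) * math.sin(c),
    )
    return lon, lat


def sphere_to_equirect(lon, lat, width, height):
    """Map (lon, lat) to pixel coordinates in an equirectangular panorama."""
    u = (lon / (2 * math.pi) + 0.5) * width
    v = (0.5 - lat / math.pi) * height
    return u % width, v  # wrap horizontally so the panorama loops
```

Stitching a grid of tangent-plane images would apply this mapping per pixel for each plane centre; blending overlapping regions is left out of the sketch.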

Title: $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$: Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling

Authors: Denis Gudovskiy, Wenzhao Zheng, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.21714
Pdf URL: https://arxiv.org/pdf/2506.21714
Copy Paste: [[2506.21714]] $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$: Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling(https://arxiv.org/abs/2506.21714)
Keywords: generation
Abstract: Recently, continuous normalizing flows (CNFs) and diffusion models (DMs) have been studied using a unified theoretical framework. Although such models can generate high-quality data points from a noise distribution, the sampling demands multiple iterations to solve an ordinary differential equation (ODE) with high computational complexity. Most existing methods focus on reducing the number of time steps during the sampling process to improve efficiency. In this work, we explore a complementary direction in which the quality-complexity tradeoff can be dynamically controlled in terms of time steps and in the length of the neural network. We achieve this by rewiring the blocks in the transformer-based architecture to solve an inner discretized ODE w.r.t. its length. Then, we employ time- and length-wise consistency terms during flow matching training, and as a result, the sampling can be performed with an arbitrary number of time steps and transformer blocks. Unlike others, our $\textrm{ODE}_t \left(\textrm{ODE}_l \right)$ approach is solver-agnostic in the time dimension and decreases both latency and memory usage. Compared to the previous state of the art, image generation experiments on CelebA-HQ and ImageNet show a latency reduction of up to $3\times$ in the most efficient sampling mode, and an FID score improvement of up to $3.5$ points for high-quality sampling. We release our code and model weights with fully reproducible experiments.
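The nested structure named in the title can be illustrated with a toy scalar example: an outer Euler loop over time steps whose velocity is itself computed by an inner Euler loop over "blocks". Everything below is an assumption made for illustration (the inner dynamics `-h`, the way the velocity is formed, and the function names); it is not the paper's architecture or training scheme.

```python
def inner_blocks(h, num_blocks):
    """Residual blocks viewed as Euler steps of dh/dl = -h over l in [0, 1] (toy dynamics)."""
    dl = 1.0 / num_blocks
    for _ in range(num_blocks):
        h = h + dl * (-h)
    return h


def velocity(x, num_blocks):
    """Toy velocity field: displacement produced by the inner (length-wise) ODE."""
    return inner_blocks(x, num_blocks) - x


def sample(x0, num_steps, num_blocks):
    """Outer Euler solver over time; cost scales with num_steps * num_blocks."""
    dt = 1.0 / num_steps
    x = x0
    for _ in range(num_steps):
        x = x + dt * velocity(x, num_blocks)
    return x


fast = sample(1.0, num_steps=1, num_blocks=1)    # cheapest setting
finer = sample(1.0, num_steps=2, num_blocks=2)   # more steps and more blocks
```

Both knobs trade compute for accuracy independently, which is the quality-complexity control the abstract describes in terms of time steps and network length.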

Title: Elucidating and Endowing the Diffusion Training Paradigm for General Image Restoration

Authors: Xin Lu, Xueyang Fu, Jie Xiao, Zihao Fan, Yurui Zhu, Zheng-Jun Zha
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21722
Pdf URL: https://arxiv.org/pdf/2506.21722
Copy Paste: [[2506.21722]] Elucidating and Endowing the Diffusion Training Paradigm for General Image Restoration(https://arxiv.org/abs/2506.21722)
Keywords: restoration, generation, generative
Abstract: While diffusion models demonstrate strong generative capabilities in image restoration (IR) tasks, their complex architectures and iterative processes limit their practical application compared to mainstream reconstruction-based general ordinary IR networks. Existing approaches primarily focus on optimizing network architecture and diffusion paths but overlook the integration of the diffusion training paradigm within general ordinary IR frameworks. To address these challenges, this paper elucidates key principles for adapting the diffusion training paradigm to general IR training through systematic analysis of time-step dependencies, network hierarchies, noise-level relationships, and multi-restoration task correlations, proposing a new IR framework supported by diffusion-based training. To enable IR networks to simultaneously restore images and model generative representations, we introduce a series of regularization strategies that align diffusion objectives with IR tasks, improving generalization in single-task scenarios. Furthermore, recognizing that diffusion-based generation exerts varying influences across different IR tasks, we develop an incremental training paradigm and task-specific adaptors, further enhancing performance in multi-task unified IR. Experiments demonstrate that our method significantly improves the generalization of IR networks in single-task IR and achieves superior performance in multi-task unified IR. Notably, the proposed framework can be seamlessly integrated into existing general IR architectures.

Title: Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis

Authors: Chenqiu Zhao, Anup Basu
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.21731
Pdf URL: https://arxiv.org/pdf/2506.21731
Copy Paste: [[2506.21731]] Exploring Image Generation via Mutually Exclusive Probability Spaces and Local Correlation Hypothesis(https://arxiv.org/abs/2506.21731)
Keywords: generation, generative
Abstract: We propose two theoretical frameworks, the Mutually Exclusive Probability Space (MESP) and the Local Correlation Hypothesis (LCH), to explore a potential limitation in probabilistic generative models, namely that learning global distributions leads to memorization rather than generative behavior. MESP emerges from our rethinking of the Variational Autoencoder (VAE). We observe that latent variable distributions in VAE exhibit overlap, which leads to an optimization conflict between the reconstruction loss and KL-divergence loss. A lower bound based on the overlap coefficient is proposed. We refer to this phenomenon as Mutually Exclusive Probability Spaces. Based on MESP, a Binary Latent Autoencoder (BL-AE) is proposed to encode images into binary latent representations. These binary latents are used as the input to our Autoregressive Random Variable Model (ARVM), a modified autoregressive model outputting histograms. Our ARVM achieves competitive FID scores, outperforming state-of-the-art methods on standard datasets. However, such scores reflect memorization rather than generation. To address this issue, we propose the Local Correlation Hypothesis (LCH), which posits that generative capability arises from local correlations among latent variables. Comprehensive experiments and discussions are conducted to validate our frameworks.
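The overlap coefficient mentioned in this abstract is, in its standard definition, the integral of the pointwise minimum of two densities. The sketch below computes it numerically for two one-dimensional Gaussians, as a stand-in for two latent posteriors. It does not reproduce the paper's lower bound, which the abstract does not spell out; the function names and integration settings are assumptions.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def overlap_coefficient(mu1, sigma1, mu2, sigma2, n=20001):
    """Approximate OVL = integral of min(p, q) dx with the trapezoidal rule."""
    lo = min(mu1 - 8 * sigma1, mu2 - 8 * sigma2)
    hi = max(mu1 + 8 * sigma1, mu2 + 8 * sigma2)
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        f = min(gaussian_pdf(x, mu1, sigma1), gaussian_pdf(x, mu2, sigma2))
        total += f if 0 < i < n - 1 else 0.5 * f  # half weight at the endpoints
    return total * h
```

Identical distributions give an overlap near 1 and well-separated ones give an overlap near 0; for equal variances the value should match the closed form `erfc(|mu1 - mu2| / (2 * sqrt(2) * sigma))`.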

Title: M3PO: Massively Multi-Task Model-Based Policy Optimization

Authors: Aditya Narendra, Dmitry Makarov, Aleksandr Panov
Subjects: cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21782
Pdf URL: https://arxiv.org/pdf/2506.21782
Copy Paste: [[2506.21782]] M3PO: Massively Multi-Task Model-Based Policy Optimization(https://arxiv.org/abs/2506.21782)
Keywords: generative
Abstract: We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address sample inefficiency in single-task settings and poor generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on pixel-level generative models that neglect control-centric representations, while model-free methods such as PPO suffer from high sample complexity and weak exploration. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This eliminates the bias-variance trade-off in prior methods by using discrepancies between model-based and model-free value estimates to guide exploration, while maintaining stable policy updates through a trust-region optimizer. M3PO provides an efficient and robust alternative to existing model-based policy optimization approaches and achieves state-of-the-art performance across multiple benchmarks.
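One simple way to turn the disagreement between model-based and model-free value estimates into an exploration signal is an additive bonus. The abstract does not give the exact form, so the absolute-difference rule, the coefficient `beta`, and the function names below are hypothetical.

```python
def exploration_bonus(v_model_based, v_model_free, beta=0.1):
    """Assumed form: the bonus grows with disagreement between the two value estimates."""
    return beta * abs(v_model_based - v_model_free)


def shaped_reward(reward, v_model_based, v_model_free, beta=0.1):
    """Environment reward plus the disagreement-driven bonus."""
    return reward + exploration_bonus(v_model_based, v_model_free, beta)
```

States where the two estimators agree receive no bonus, so exploration concentrates where the world model and the model-free critic diverge.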

Title: CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery

Authors: Felix Holm, Gözde Ünver, Ghazal Ghazaei, Nassir Navab
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2506.21813
Pdf URL: https://arxiv.org/pdf/2506.21813
Copy Paste: [[2506.21813]] CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery(https://arxiv.org/abs/2506.21813)
Keywords: generation
Abstract: Understanding the intricate workflows of cataract surgery requires modeling complex interactions between surgical tools, anatomical structures, and procedural techniques. Existing datasets primarily address isolated aspects of surgical analysis, such as tool detection or phase segmentation, but lack comprehensive representations that capture the semantic relationships between entities over time. This paper introduces the Cataract Surgery Scene Graph (CAT-SG) dataset, the first to provide structured annotations of tool-tissue interactions, procedural variations, and temporal dependencies. By incorporating detailed semantic relations, CAT-SG offers a holistic view of surgical workflows, enabling more accurate recognition of surgical phases and techniques. Additionally, we present a novel scene graph generation model, CatSGG, which outperforms current methods in generating structured surgical representations. The CAT-SG dataset is designed to enhance AI-driven surgical training, real-time decision support, and workflow analysis, paving the way for more intelligent, context-aware systems in clinical practice.

Title: TaleForge: Interactive Multimodal System for Personalized Story Creation

Authors: Minh-Loi Nguyen, Quang-Khai Le, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21832
Pdf URL: https://arxiv.org/pdf/2506.21832
Copy Paste: [[2506.21832]] TaleForge: Interactive Multimodal System for Personalized Story Creation(https://arxiv.org/abs/2506.21832)
Keywords: generation
Abstract: Storytelling is a deeply personal and creative process, yet existing methods often treat users as passive consumers, offering generic plots with limited personalization. This undermines engagement and immersion, especially where individual style or appearance is crucial. We introduce TaleForge, a personalized story-generation system that integrates large language models (LLMs) and text-to-image diffusion to embed users' facial images within both narratives and illustrations. TaleForge features three interconnected modules: Story Generation, where LLMs create narratives and character descriptions from user prompts; Personalized Image Generation, merging users' faces and outfit choices into character illustrations; and Background Generation, creating scene backdrops that incorporate personalized characters. A user study demonstrated heightened engagement and ownership when individuals appeared as protagonists. Participants praised the system's real-time previews and intuitive controls, though they requested finer narrative editing tools. TaleForge advances multimodal storytelling by aligning personalized text and imagery to create immersive, user-centric experiences.

Title: GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles

Authors: Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.21839
Pdf URL: https://arxiv.org/pdf/2506.21839
Copy Paste: [[2506.21839]] GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles(https://arxiv.org/abs/2506.21839)
Keywords: generation
Abstract: We challenge text-to-image models with generating escape room puzzle images that are visually appealing, logically solid, and intellectually stimulating. While base image models struggle with spatial relationships and affordance reasoning, we propose a hierarchical multi-agent framework that decomposes this task into structured stages: functional design, symbolic scene graph reasoning, layout synthesis, and local image editing. Specialized agents collaborate through iterative feedback to ensure the scene is visually coherent and functionally solvable. Experiments show that agent collaboration improves output quality in terms of solvability, shortcut avoidance, and affordance clarity, while maintaining visual quality.

Title: Generating Attribute-Aware Human Motions from Textual Prompt

Authors: Xinghan Wang, Kun Xu, Fei Li, Cao Sheng, Jiazhong Yu, Yadong Mu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2506.21912
Pdf URL: https://arxiv.org/pdf/2506.21912
Copy Paste: [[2506.21912]] Generating Attribute-Aware Human Motions from Textual Prompt(https://arxiv.org/abs/2506.21912)
Keywords: generation
Abstract: Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes (such as age, gender, weight, and height) which are key factors shaping human motion patterns. This work represents a pilot exploration for bridging this gap. We conceptualize each motion as comprising both attribute information and action semantics, where textual descriptions align exclusively with action semantics. To achieve this, a new framework inspired by Structural Causal Models is proposed to decouple action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model is capable of generating realistic, attribute-aware motion aligned with the user's text and attribute inputs. For evaluation, we introduce HumanAttr, a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware text-to-motion generation. Extensive experiments on the new dataset validate our model's effectiveness.

Title: Quality Assessment and Distortion-aware Saliency Prediction for AI-Generated Omnidirectional Images

Authors: Liu Yang, Huiyu Duan, Jiarui Wang, Jing Liu, Menghan Hu, Xiongkuo Min, Guangtao Zhai, Patrick Le Callet
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.21925
Pdf URL: https://arxiv.org/pdf/2506.21925
Copy Paste: [[2506.21925]] Quality Assessment and Distortion-aware Saliency Prediction for AI-Generated Omnidirectional Images(https://arxiv.org/abs/2506.21925)
Keywords: quality assessment
Abstract: With the rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques, AI-generated images (AIGIs) have attracted widespread attention, among which AI-generated omnidirectional images (AIGODIs) hold significant potential for Virtual Reality (VR) and Augmented Reality (AR) applications. AI-generated omnidirectional images exhibit unique quality issues; however, research on the quality assessment and optimization of AI-generated omnidirectional images is still lacking. To this end, this work first studies the quality assessment and distortion-aware saliency prediction problems for AIGODIs, and further presents a corresponding optimization process. Specifically, we first establish a comprehensive database to reflect human feedback for AI-generated omnidirectional images, termed OHF2024, which includes both subjective quality ratings evaluated from three perspectives and distortion-aware salient regions. Based on the constructed OHF2024 database, we propose two models with shared encoders based on the BLIP-2 model to evaluate the human visual experience and predict distortion-aware saliency for AI-generated omnidirectional images, named BLIP2OIQA and BLIP2OISal, respectively. Finally, based on the proposed models, we present an automatic optimization process that utilizes the predicted visual experience scores and distortion regions to further enhance the visual quality of an AI-generated omnidirectional image. Extensive experiments show that our BLIP2OIQA model and BLIP2OISal model achieve state-of-the-art (SOTA) results in the human visual experience evaluation task and the distortion-aware saliency prediction task for AI-generated omnidirectional images, and can be effectively used in the optimization process. The database and code will be released at this https URL to facilitate future research.

Title: Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications

Authors: Yangyang Wan, Haotian Wang, Xuhui Yu, Jiageng Chen, Xinyu Fan, Zuyuan He
Subjects: cs.LG, physics.app-ph, physics.optics
Abstract URL: https://arxiv.org/abs/2506.21952
Pdf URL: https://arxiv.org/pdf/2506.21952
Copy Paste: [[2506.21952]] Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications(https://arxiv.org/abs/2506.21952)
Keywords: generation, generative
Abstract: Distributed acoustic sensing (DAS) has attracted considerable attention across various fields, and artificial intelligence (AI) technology plays an important role in DAS applications for event recognition and denoising. Existing AI models require real-world data (RWD), whether labeled or not, for training, which conflicts with the limited event data available in real-world scenarios. Here, a physics-informed DAS neural network paradigm is proposed that does not need real-world event data for training. By physically modeling target events and the constraints of the real world and the DAS system, physical functions are derived to train a generative network that generates DAS event data. A DAS debackground net is then trained on the generated DAS event data to eliminate background noise in DAS data. The effectiveness of the proposed paradigm is verified in an event identification application based on a public dataset of DAS spatiotemporal data and in a belt conveyor fault monitoring application based on DAS time-frequency data, where it achieves comparable or better performance than data-driven networks trained with RWD. Owing to the introduction of physical information and the capability of background noise removal, the paradigm generalizes within the same application across different sites. A fault diagnosis accuracy of 91.8% is achieved in the belt conveyor field with networks transferred from a simulation test site, without any fault event data from the test site or the field for training. The proposed paradigm is a prospective solution to the significant obstacles of data acquisition and intense noise in practical DAS applications, and a way to explore more potential fields for DAS.

Title: Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement

Authors: Hao Jiang, Yongxiang Tang, Yanxiang Zeng, Pengjia Yuan, Yanhua Cheng, Teng Sha, Xialong Liu, Peng Jiang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2506.21956
Pdf URL: https://arxiv.org/pdf/2506.21956
Copy Paste: [[2506.21956]] Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement(https://arxiv.org/abs/2506.21956)
Keywords: generative
Abstract: In the realm of online advertising, advertisers partake in ad auctions to obtain advertising slots, frequently taking advantage of auto-bidding tools provided by demand-side platforms. To improve the automation of these bidding systems, we adopt generative models, namely the Decision Transformer (DT), to tackle the difficulties inherent in automated bidding. Applying the Decision Transformer to the auto-bidding task enables a unified approach to sequential modeling, which efficiently overcomes short-sightedness by capturing long-term dependencies between past bidding actions and user behavior. Nevertheless, conventional DT has certain drawbacks: (1) DT necessitates a preset return-to-go (RTG) value before generating actions, which is not inherently produced; (2) the policy learned by DT is restricted by its training data, which consists of mixed-quality trajectories. To address these challenges, we introduce the R* Decision Transformer (R* DT), developed in a three-step process: (1) R DT: Similar to traditional DT, R DT stores actions based on state and RTG value, and also memorizes the RTG for a given state using the training set; (2) R^ DT: We forecast the highest value (within the training set) of RTG for a given state, deriving a suboptimal policy based on the current state and the forecasted supreme RTG value; (3) R* DT: Based on R^ DT, we generate trajectories and select those with high rewards (using a simulator) to augment our training dataset. This data enhancement has been shown to improve the RTG of trajectories in the training data and gradually leads the suboptimal policy towards optimality. Comprehensive tests on a publicly available bidding dataset validate the R* DT's efficacy and highlight its superiority when dealing with mixed-quality trajectories.
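The three-step process above can be sketched on toy data. Return-to-go is the usual suffix sum of rewards; the per-state maximum is implemented here as a lookup table over discrete states, whereas the abstract describes a forecast, so treat that as a simplifying assumption. The function names and the `min_return` threshold are hypothetical.

```python
from collections import defaultdict


def returns_to_go(rewards):
    """RTG at step t is the sum of rewards from t to the end of the trajectory."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def max_rtg_per_state(trajectories):
    """Step 2 stand-in: highest RTG observed in the training set for each state."""
    best = defaultdict(lambda: float("-inf"))
    for states, rewards in trajectories:
        for s, g in zip(states, returns_to_go(rewards)):
            best[s] = max(best[s], g)
    return dict(best)


def augment(dataset, generated, min_return):
    """Step 3: keep only generated trajectories whose total reward is high enough."""
    return dataset + [(s, r) for s, r in generated if sum(r) >= min_return]


data = [(["a", "b"], [1.0, 1.0]), (["a", "c"], [0.0, 5.0])]
targets = max_rtg_per_state(data)
bigger = augment(data, [(["a"], [10.0]), (["b"], [0.5])], min_return=2.0)
```

Conditioning on `targets[state]` instead of a hand-picked RTG removes the need for a preset value, and repeating the augmentation step shifts the dataset toward higher-return trajectories.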

Title: SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model

Authors: Shuhan Tan, John Lambert, Hong Jeon, Sakshum Kulshrestha, Yijing Bai, Jing Luo, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang
Subjects: cs.LG, cs.AI, cs.CV, cs.MA, cs.RO
Abstract URL: https://arxiv.org/abs/2506.21976
Pdf URL: https://arxiv.org/pdf/2506.21976
Copy Paste: [[2506.21976]] SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model(https://arxiv.org/abs/2506.21976)
Keywords: generation, generative
Abstract: The goal of traffic simulation is to augment a potentially limited amount of manually-driven miles that is available for testing and validation, with a much larger amount of simulated synthetic miles. The culmination of this vision would be a generative simulated city, where given a map of the city and an autonomous vehicle (AV) software stack, the simulator can seamlessly simulate the trip from point A to point B by populating the city around the AV and controlling all aspects of the scene, from animating the dynamic agents (e.g., vehicles, pedestrians) to controlling the traffic light states. We refer to this vision as CitySim, which requires an agglomeration of simulation technologies: scene generation to populate the initial scene, agent behavior modeling to animate the scene, occlusion reasoning, dynamic scene generation to seamlessly spawn and remove agents, and environment simulation for factors such as traffic lights. While some key technologies have been separately studied in various works, others such as dynamic scene generation and environment simulation have received less attention in the research community. We propose SceneDiffuser++, the first end-to-end generative world model trained on a single loss function capable of point A-to-B simulation on a city scale integrating all the requirements above. We demonstrate the city-scale traffic simulation capability of SceneDiffuser++ and study its superior realism under long simulation conditions. We evaluate the simulation quality on an augmented version of the Waymo Open Motion Dataset (WOMD) with larger map regions to support trip-level simulation.

Title: RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation

Authors: Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, Abhinav Valada
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22007
Pdf URL: https://arxiv.org/pdf/2506.22007
Copy Paste: [[2506.22007]] RoboEnvision: A Long-Horizon Video Generation Model for Multi-Task Robot Manipulation(https://arxiv.org/abs/2506.22007)
Keywords: generation
Abstract: We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulation in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) We first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each pair of generated keyframes, producing the long-horizon video. 2) We propose a semantics-preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.

Title: Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field

Authors: Hong Nie, Fuyuan Cao, Lu Chen, Fengxin Chen, Yuefeng Zou, Jun Yu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22044
Pdf URL: https://arxiv.org/pdf/2506.22044
Copy Paste: [[2506.22044]] Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field(https://arxiv.org/abs/2506.22044)
Keywords: generative
Abstract: Reconstruction and rendering-based talking head synthesis methods achieve high-quality results with strong identity preservation but are limited by their dependence on identity-specific models. Each new identity requires training from scratch, incurring high computational costs and reduced scalability compared to generative model-based approaches. To overcome this limitation, we propose FIAG, a novel 3D speaking head synthesis framework that enables efficient identity-specific adaptation using only a small amount of training footage. FIAG incorporates Global Gaussian Field, which supports the representation of multiple identities within a shared field, and Universal Motion Field, which captures the common motion dynamics across diverse identities. Benefiting from the shared facial structure information encoded in the Global Gaussian Field and the general motion priors learned in the motion field, our framework enables rapid adaptation from canonical identity representations to specific ones with minimal data. Extensive comparative and ablation experiments demonstrate that our method outperforms existing state-of-the-art approaches, validating both the effectiveness and generalizability of the proposed framework. Code is available at: this https URL.

Title: MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation

Authors: Dechao Meng, Steven Xiao, Xindi Zhang, Guangyuan Wang, Peng Zhang, Qi Wang, Bang Zhang, Liefeng Bo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22065
Pdf URL: https://arxiv.org/pdf/2506.22065
Copy Paste: [[2506.22065]] MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation(https://arxiv.org/abs/2506.22065)
Keywords: generation
Abstract: Audio-driven portrait animation, which synthesizes realistic videos from reference images using audio signals, faces significant challenges in real-time generation of high-fidelity, temporally coherent animations. While recent diffusion-based methods improve generation quality by integrating audio into denoising processes, their reliance on frame-by-frame UNet architectures introduces prohibitive latency and struggles with temporal consistency. This paper introduces MirrorMe, a real-time, controllable framework built on the LTX video model, a diffusion transformer that compresses video spatially and temporally for efficient latent space denoising. To address LTX's trade-offs between compression and semantic fidelity, we propose three innovations: 1. A reference identity injection mechanism via VAE-encoded image concatenation and self-attention, ensuring identity consistency; 2. A causal audio encoder and adapter tailored to LTX's temporal structure, enabling precise audio-expression synchronization; and 3. A progressive training strategy combining close-up facial training, half-body synthesis with facial masking, and hand pose integration for enhanced gesture control. Extensive experiments on the EMTD Benchmark demonstrate MirrorMe's state-of-the-art performance in fidelity, lip-sync accuracy, and temporal stability.

Title: EAMamba: Efficient All-Around Vision State Space Model for Image Restoration

Authors: Yu-Cheng Lin, Yu-Syuan Xu, Hao-Wei Chen, Hsien-Kai Kuo, Chun-Yi Lee
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22246
Pdf URL: https://arxiv.org/pdf/2506.22246
Copy Paste: [[2506.22246]] EAMamba: Efficient All-Around Vision State Space Model for Image Restoration(https://arxiv.org/abs/2506.22246)
Keywords: restoration
Abstract: Image restoration is a key task in low-level computer vision that aims to reconstruct high-quality images from degraded inputs. The emergence of Vision Mamba, which draws inspiration from the advanced state space model Mamba, marks a significant advancement in this field. Vision Mamba demonstrates excellence in modeling long-range dependencies with linear complexity, a crucial advantage for image restoration tasks. Despite its strengths, Vision Mamba encounters challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and local pixel forgetting. To address these limitations, this study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. MHSSM efficiently aggregates multiple scanning sequences, which avoids increases in computational complexity and parameter count. The all-around scanning strategy implements multiple patterns to capture holistic information and resolves the local pixel forgetting issue. Our experimental evaluations validate these innovations across several restoration tasks, including super resolution, denoising, deblurring, and dehazing. The results validate that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.

Title: COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication

Authors: Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt
Subjects: cs.CV, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22274
Pdf URL: https://arxiv.org/pdf/2506.22274
Copy Paste: [[2506.22274]] COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication(https://arxiv.org/abs/2506.22274)
Keywords: generation
Abstract: Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCO) dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at this https URL.

Title: RoomCraft: Controllable and Complete 3D Indoor Scene Generation

Authors: Mengqi Zhou, Xipeng Wang, Yuxi Wang, Zhaoxiang Zhang
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22291
Pdf URL: https://arxiv.org/pdf/2506.22291
Copy Paste: [[2506.22291]] RoomCraft: Controllable and Complete 3D Indoor Scene Generation(https://arxiv.org/abs/2506.22291)
Keywords: generation
Abstract: Generating realistic 3D indoor scenes from user inputs remains a challenging problem in computer vision and graphics, requiring careful balance of geometric consistency, spatial relationships, and visual realism. While neural generation methods often produce repetitive elements due to limited global spatial reasoning, procedural approaches can leverage constraints for controllable generation but struggle with multi-constraint scenarios. When constraints become numerous, object collisions frequently occur, forcing the removal of furniture items and compromising layout completeness. To address these limitations, we propose RoomCraft, a multi-stage pipeline that converts real images, sketches, or text descriptions into coherent 3D indoor scenes. Our approach combines a scene generation pipeline with a constraint-driven optimization framework. The pipeline first extracts high-level scene information from user inputs and organizes it into a structured format containing room type, furniture items, and spatial relations. It then constructs a spatial relationship network to represent furniture arrangements and generates an optimized placement sequence using a heuristic-based depth-first search (HDFS) algorithm to ensure layout coherence. To handle complex multi-constraint scenarios, we introduce a unified constraint representation that processes both formal specifications and natural language inputs, enabling flexible constraint-oriented adjustments through a comprehensive action space design. Additionally, we propose a Conflict-Aware Positioning Strategy (CAPS) that dynamically adjusts placement weights to minimize furniture collisions and ensure layout completeness. Extensive experiments demonstrate that RoomCraft significantly outperforms existing methods in generating realistic, semantically coherent, and visually appealing room layouts across diverse input modalities.
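The placement-sequence step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which heuristic the HDFS algorithm uses, so this sketch uses a hypothetical per-item priority score, and the relation graph, item names, and the `placement_order` function are invented for the example rather than taken from RoomCraft.

```python
# Hypothetical sketch of a heuristic-guided depth-first traversal over a
# furniture relation graph. The priority heuristic is an assumption; the
# abstract does not specify the heuristic used by RoomCraft's HDFS.

def placement_order(relations, priority, roots):
    """Return a placement sequence in which each anchor item comes before
    the items that depend on it, visiting higher-priority items first."""
    order = []
    seen = set()

    def visit(item):
        if item in seen:
            return
        seen.add(item)
        order.append(item)
        # Depth-first: fully place everything that depends on this item
        # before moving on, highest priority first.
        for child in sorted(relations.get(item, []), key=lambda c: -priority[c]):
            visit(child)

    for root in sorted(roots, key=lambda r: -priority[r]):
        visit(root)
    return order


# Illustrative relation graph: key = anchor item, value = items placed relative to it.
RELATIONS = {"bed": ["nightstand", "rug"], "desk": ["chair"], "nightstand": ["lamp"]}
PRIORITY = {"bed": 5, "desk": 4, "rug": 3, "nightstand": 2, "chair": 2, "lamp": 1}
ORDER = placement_order(RELATIONS, PRIORITY, ["desk", "bed"])
print(ORDER)
```

Placing anchors before their dependents means every relative constraint (for example, a lamp on a nightstand) refers to an item that already has a position, which is one plausible reading of why a depth-first order helps layout coherence.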

Title: Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling

Authors: Erkan Turan, Aristotelis Siozopoulos, Maks Ovsjanikov
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2506.22304
Pdf URL: https://arxiv.org/pdf/2506.22304
Copy Paste: [[2506.22304]] Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling(https://arxiv.org/abs/2506.22304)
Keywords: generative
Abstract: Conditional Flow Matching (CFM) offers a simulation-free framework for training continuous-time generative models, bridging diffusion and flow-based approaches. However, sampling from CFM still relies on numerically solving non-linear ODEs which can be computationally expensive and difficult to interpret. Recent alternatives address sampling speed via trajectory straightening, mini-batch coupling or distillation. However, these methods typically do not shed light on the underlying structure of the generative process. In this work, we propose to accelerate CFM and introduce an interpretable representation of its dynamics by integrating Koopman operator theory, which models non-linear flows as linear evolution in a learned space of observables. We introduce a decoder-free Koopman-CFM architecture that learns an embedding where the generative dynamics become linear, enabling closed-form, one-step sampling via matrix exponentiation. This results in significant speedups over traditional CFM as demonstrated on controlled 2D datasets and real-world benchmarks, MNIST, Fashion-MNIST (F-MNIST), and the Toronto Face Dataset (TFD). Unlike previous methods, our approach leads to a well-structured Koopman generator, whose spectral properties, eigenvalues, and eigenfunctions offer principled tools for analyzing generative behavior such as temporal scaling, mode stability, and decomposition in Koopman latent space. By combining sampling efficiency with analytical structure, Koopman-enhanced flow matching offers a potential step toward fast and interpretable generative modeling.
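The idea of one-step sampling via matrix exponentiation can be sketched on a toy problem. This is only an illustration under stated assumptions: the generator below is a hand-picked 2x2 rotation generator rather than a learned Koopman generator, the observable space is taken to be the state itself (no learned embedding), and all function names are hypothetical. For linear dynamics dz/dt = Lz, the state at t = 1 is exp(L) z(0), so a single matrix-vector product replaces many ODE solver steps.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_vec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def mat_exp(m, terms=30):
    """Matrix exponential via a truncated Taylor series (fine for small norms)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, m)]  # m^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def one_step_sample(generator, z0):
    """Closed-form solution at t = 1 of dz/dt = generator @ z."""
    return mat_vec(mat_exp(generator), z0)

def euler_sample(generator, z0, steps):
    """Baseline: integrate the same ODE from t = 0 to t = 1 with explicit Euler."""
    z, dt = list(z0), 1.0 / steps
    for _ in range(steps):
        dz = mat_vec(generator, z)
        z = [zi + dt * di for zi, di in zip(z, dz)]
    return z

# Toy generator (an assumption, not a learned operator): rotation by 90 degrees.
THETA = math.pi / 2
GENERATOR = [[0.0, -THETA], [THETA, 0.0]]
Z_CLOSED = one_step_sample(GENERATOR, [1.0, 0.0])
Z_EULER = euler_sample(GENERATOR, [1.0, 0.0], steps=1000)
print(Z_CLOSED, Z_EULER)
```

In this toy case the closed-form sample matches the exact rotation to numerical precision in one step, while the Euler baseline needs many steps and still carries a small discretization error.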

Title: A Deep Learning framework for building damage assessment using VHR SAR and geospatial data: demonstration on the 2023 Turkiye Earthquake

Authors: Luigi Russo, Deodato Tapete, Silvia Liberata Ullo, Paolo Gamba
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22338
Pdf URL: https://arxiv.org/pdf/2506.22338
Copy Paste: [[2506.22338]] A Deep Learning framework for building damage assessment using VHR SAR and geospatial data: demonstration on the 2023 Turkiye Earthquake(https://arxiv.org/abs/2506.22338)
Keywords: generation
Abstract: Building damage identification shortly after a disaster is crucial for guiding emergency response and recovery efforts. Although optical satellite imagery is commonly used for disaster mapping, its effectiveness is often hampered by cloud cover or the absence of pre-event acquisitions. To overcome these challenges, we introduce a novel multimodal deep learning (DL) framework for detecting building damage using single-date very high resolution (VHR) Synthetic Aperture Radar (SAR) imagery from the Italian Space Agency (ASI) COSMO SkyMed (CSK) constellation, complemented by auxiliary geospatial data. Our method integrates SAR image patches, OpenStreetMap (OSM) building footprints, digital surface model (DSM) data, and structural and exposure attributes from the Global Earthquake Model (GEM) to improve detection accuracy and contextual interpretation. Unlike existing approaches that depend on pre- and post-event imagery, our model utilizes only post-event data, facilitating rapid deployment in critical scenarios. The framework's effectiveness is demonstrated using a new dataset from the 2023 earthquake in Turkey, covering multiple cities with diverse urban settings. Results highlight that incorporating geospatial features significantly enhances detection performance and generalizability to previously unseen areas. By combining SAR imagery with detailed vulnerability and exposure information, our approach provides reliable and rapid building damage assessments without depending on the availability of pre-event data. Moreover, the automated and scalable data generation process ensures the framework's applicability across diverse disaster-affected regions, underscoring its potential to support effective disaster management and recovery efforts. Code and data will be made available upon acceptance of the paper.

Title: Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems

Authors: Abdulmomen Ghalkha, Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2506.22374
Pdf URL: https://arxiv.org/pdf/2506.22374
Copy Paste: [[2506.22374]] Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems(https://arxiv.org/abs/2506.22374)
Keywords: generation
Abstract: In large-scale communication systems, increasingly complex scenarios require more intelligent collaboration among edge devices collecting various multimodal sensory data to achieve a more comprehensive understanding of the environment and improve decision-making accuracy. However, conventional federated learning (FL) algorithms typically consider unimodal datasets, require identical model architectures, and fail to leverage the rich information embedded in multimodal data, limiting their applicability to real-world scenarios with diverse modalities and varying client capabilities. To address this issue, we propose Sheaf-DMFL, a novel decentralized multimodal learning framework leveraging sheaf theory to enhance collaboration among devices with diverse modalities. Specifically, each client has a set of local feature encoders for its different modalities, whose outputs are concatenated before passing through a task-specific layer. While encoders for the same modality are trained collaboratively across clients, we capture the intrinsic correlations among clients' task-specific layers using a sheaf-based structure. To further enhance learning capability, we propose an enhanced algorithm named Sheaf-DMFL-Att, which tailors the attention mechanism within each client to capture correlations among different modalities. A rigorous convergence analysis of Sheaf-DMFL-Att is provided, establishing its theoretical guarantees. Extensive simulations conducted on real-world link blockage prediction and mmWave beamforming scenarios demonstrate the superiority of the proposed algorithms in such heterogeneous wireless communication systems.

Title: Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment

Authors: Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2506.22385
Pdf URL: https://arxiv.org/pdf/2506.22385
Copy Paste: [[2506.22385]] Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment(https://arxiv.org/abs/2506.22385)
Keywords: generation, generative
Abstract: Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning, that is, the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). For solving the classification task, we propose the Chain of Counterfactual Thought framework, utilizing counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goals. Additionally, we introduce a novel benchmark dataset, with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting the effectiveness of our proposed method in enhancing the dynamic reasoning capabilities of VLMMs.

Title: Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy

Authors: Yuhao Liu, Tengfei Wang, Fang Liu, Zhenwei Wang, Rynson W.H. Lau
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2506.22432
Pdf URL: https://arxiv.org/pdf/2506.22432
Copy Paste: [[2506.22432]] Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy(https://arxiv.org/abs/2506.22432)
Keywords: generative
Abstract: Recent advances in deep generative modeling have unlocked unprecedented opportunities for video synthesis. In real-world applications, however, users often seek tools to faithfully realize their creative editing intentions with precise and consistent control. Despite the progress achieved by existing methods, ensuring fine-grained alignment with user intentions remains an open and challenging problem. In this work, we present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing. Shape-for-Motion achieves this by converting the target object in the input video to a time-consistent mesh, i.e., a 3D proxy, allowing edits to be performed directly on the proxy and then inferred back to the video frames. To simplify the editing process, we design a novel Dual-Propagation Strategy that allows users to perform edits on the 3D mesh of a single frame, and the edits are then automatically propagated to the 3D meshes of the other frames. The 3D meshes for different frames are further projected onto the 2D space to produce the edited geometry and texture renderings, which serve as inputs to a decoupled video diffusion model for generating edited results. Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition. Our approach marks a key step toward high-quality, controllable video editing workflows. Extensive experiments demonstrate the superiority and effectiveness of our approach. Project page: this https URL