2025-03-14

Title: Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference

Authors: Songlin Yang, Tao Yang, Bo Hu
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09646
Pdf URL: https://arxiv.org/pdf/2503.09646
Copy Paste: [[2503.09646]] Inductive Spatio-Temporal Kriging with Physics-Guided Increment Training Strategy for Air Quality Inference(https://arxiv.org/abs/2503.09646)
Keywords: generation
Abstract: The deployment of sensors for air quality monitoring is constrained by high costs, leading to inadequate network coverage and data deficits in some areas. Utilizing existing observations, spatio-temporal kriging is a method for estimating air quality at unobserved locations during a specific period. Inductive spatio-temporal kriging with increment training strategy has demonstrated its effectiveness using virtual nodes to simulate unobserved nodes. However, a disparity between virtual and real nodes persists, complicating the application of learning patterns derived from virtual nodes to actual unobserved ones. To address these limitations, this paper presents a Physics-Guided Increment Training Strategy (PGITS). Specifically, we design a dynamic graph generation module to incorporate the advection and diffusion processes of airborne particles as physical knowledge into the graph structure, dynamically adjusting the adjacency matrix to reflect physical interactions between nodes. By using physics principles as a bridge between virtual and real nodes, this strategy ensures the features of virtual nodes and their pseudo labels are closer to actual nodes. Consequently, the learned patterns of virtual nodes can be applied to actual unobserved nodes for effective kriging.
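The abstract above says physical knowledge (advection and diffusion) is folded into the adjacency matrix. The sketch below illustrates one way such a matrix could be built; it is not the PGITS module. The function name `physics_adjacency`, the Gaussian distance kernel, the `sigma` parameter, and the downwind bonus term are all assumptions made for illustration.

```python
import numpy as np

def physics_adjacency(coords, wind, sigma=1.0):
    """Hypothetical physics-informed adjacency (illustrative, not the paper's formula).

    coords: (n, 2) station positions; wind: (2,) wind vector.
    adj[i, j] is the assumed influence of node i on node j.
    """
    diff = coords[None, :, :] - coords[:, None, :]      # diff[i, j] = coords[j] - coords[i]
    dist = np.linalg.norm(diff, axis=-1)
    # Diffusion term (assumed): influence decays with squared distance.
    diffusion = np.exp(-(dist ** 2) / (sigma ** 2))
    # Advection term (assumed): wind component along the i -> j direction,
    # clipped at zero so only downwind neighbours get a bonus.
    unit = diff / np.maximum(dist, 1e-12)[..., None]
    advection = np.maximum(unit @ wind, 0.0)
    adj = diffusion * (1.0 + advection)
    np.fill_diagonal(adj, 0.0)                           # no self-loops
    return adj

coords = np.array([[0.0, 0.0], [1.0, 0.0]])
wind = np.array([1.0, 0.0])                              # blowing from node 0 toward node 1
adj = physics_adjacency(coords, wind)
```

Because the wind vector changes over time, recomputing `adj` at each time step gives a dynamic graph in which downwind edges are weighted more heavily than upwind ones.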

Title: CoRe^2: Collect, Reflect and Refine to Generate Better and Faster

Authors: Shitong Shao, Zikai Zhou, Dian Xie, Yuetong Fang, Tian Ye, Lichen Bai, Zeke Xie
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09662
Pdf URL: https://arxiv.org/pdf/2503.09662
Copy Paste: [[2503.09662]] CoRe^2: Collect, Reflect and Refine to Generate Better and Faster(https://arxiv.org/abs/2503.09662)
Keywords: generative
Abstract: Making text-to-image (T2I) generative models sample both quickly and well is a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, nearly all inference methods have been unable to ensure stable performance on both diffusion models (DMs) and visual autoregressive models (ARMs) simultaneously. In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then uses the collected data to train a weak model that reflects the easy-to-learn contents while halving the number of function evaluations during inference. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, Drawbench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, while achieving a 5.64s time saving using this http URL. Code is released at this https URL.
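The two guidance rules named in the CoRe^2 abstract can be written as simple extrapolations. The sketch below shows the standard classifier-free guidance formula and a guess at what weak-to-strong guidance might look like in the same form. The function names and the exact form of `weak_to_strong` are assumptions, not the paper's definitions.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def weak_to_strong(weak_out, strong_out, scale):
    """Assumed form of weak-to-strong guidance: push the strong model's
    output further away from the weak model's output."""
    return strong_out + scale * (strong_out - weak_out)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 3.0])
guided = cfg(eps_uncond, eps_cond, scale=2.0)

weak = np.array([0.5, 2.0])
strong = np.array([1.0, 3.0])
refined = weak_to_strong(weak, strong, scale=1.0)
```

With `scale=1.0`, `cfg` returns the conditional prediction unchanged; larger values amplify the difference between the two predictions.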

Title: Accelerating Diffusion Sampling via Exploiting Local Transition Coherence

Authors: Shangwen Zhu, Han Zhang, Zhantao Yang, Qianyu Peng, Zhao Pu, Huangji Wang, Fan Cheng
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09675
Pdf URL: https://arxiv.org/pdf/2503.09675
Copy Paste: [[2503.09675]] Accelerating Diffusion Sampling via Exploiting Local Transition Coherence(https://arxiv.org/abs/2503.09675)
Keywords: generation
Abstract: Text-based diffusion models have made significant breakthroughs in generating high-quality images and videos from textual descriptions. However, the lengthy sampling time of the denoising process remains a significant bottleneck in practical applications. Previous methods either ignore the statistical relationships between adjacent steps or rely on attention or feature similarity between them, which often only works with specific network structures. To address this issue, we discover a new statistical relationship in the transition operator between adjacent steps, focusing on the relationship of the outputs from the network. This relationship does not impose any requirements on the network structure. Based on this observation, we propose a novel training-free acceleration method called LTC-Accel, which uses the identified relationship to estimate the current transition operator based on adjacent steps. Because it makes no specific assumptions about the network structure, LTC-Accel is applicable to almost all diffusion-based methods and orthogonal to almost all existing acceleration techniques, making it easy to combine with them. Experimental results demonstrate that LTC-Accel significantly speeds up sampling in text-to-image and text-to-video synthesis while maintaining competitive sample quality. Specifically, LTC-Accel achieves a speedup of 1.67-fold in Stable Diffusion v2 and a speedup of 1.55-fold in video generation models. When combined with distillation models, LTC-Accel achieves a remarkable 10-fold speedup in video generation, allowing real-time generation at more than 16 FPS.

Title: Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain

Authors: Yuanmin Huang, Mi Zhang, Zhaoxiang Wang, Wenxuan Li, Min Yang
Subjects: cs.LG, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2503.09712
Pdf URL: https://arxiv.org/pdf/2503.09712
Copy Paste: [[2503.09712]] Revisiting Backdoor Attacks on Time Series Classification in the Frequency Domain(https://arxiv.org/abs/2503.09712)
Keywords: generation, generative
Abstract: Time series classification (TSC) is a cornerstone of modern web applications, powering tasks such as financial data analysis, network traffic monitoring, and user behavior analysis. In recent years, deep neural networks (DNNs) have greatly enhanced the performance of TSC models in these critical domains. However, DNNs are vulnerable to backdoor attacks, where attackers can covertly implant triggers into models to induce malicious outcomes. Existing backdoor attacks targeting DNN-based TSC models remain elementary. In particular, early methods borrow trigger designs from computer vision, which are ineffective for time series data. More recent approaches utilize generative models for trigger generation, but at the cost of significant computational complexity. In this work, we analyze the limitations of existing attacks and introduce an enhanced method, FreqBack. Drawing inspiration from the fact that DNN models inherently capture frequency domain features in time series data, we identify that improper perturbations in the frequency domain are the root cause of ineffective attacks. To address this, we propose to generate triggers both effectively and efficiently, guided by frequency analysis. FreqBack exhibits substantial performance across five models and eight datasets, achieving an impressive attack success rate of over 90%, while maintaining less than a 3% drop in model accuracy on clean data.
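To make "perturbation in the frequency domain" concrete, the sketch below adds energy to a single frequency bin of a time series and transforms back. It is a generic illustration of the idea, not the FreqBack trigger generator. The function name `add_frequency_perturbation` and its parameters are hypothetical.

```python
import numpy as np

def add_frequency_perturbation(series, freq_bin, amplitude):
    """Add a cosine of the given amplitude at one frequency bin
    (0 < freq_bin < len(series) / 2) by editing the real FFT spectrum."""
    n = len(series)
    spectrum = np.fft.rfft(series)
    # A real offset of amplitude * n / 2 in bin k corresponds to
    # amplitude * cos(2 * pi * k * t / n) in the time domain.
    spectrum[freq_bin] += amplitude * n / 2
    return np.fft.irfft(spectrum, n=n)

clean = np.zeros(64)
perturbed = add_frequency_perturbation(clean, freq_bin=5, amplitude=0.1)
```

Editing the spectrum directly makes it easy to control which frequency bands a perturbation touches, which is the kind of control the abstract argues earlier trigger designs lacked.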

Title: I2V3D: Controllable image-to-video generation with 3D guidance

Authors: Zhiyuan Zhang, Dongdong Chen, Jing Liao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09733
Pdf URL: https://arxiv.org/pdf/2503.09733
Copy Paste: [[2503.09733]] I2V3D: Controllable image-to-video generation with 3D guidance(https://arxiv.org/abs/2503.09733)
Keywords: generation, generative
Abstract: We present I2V3D, a novel framework for animating static images into dynamic videos with precise 3D control, leveraging the strengths of both 3D geometry guidance and advanced generative models. Our approach combines the precision of a computer graphics pipeline, enabling accurate control over elements such as camera movement, object rotation, and character animation, with the visual fidelity of generative AI to produce high-quality videos from coarsely rendered inputs. To support animations with any initial start point and extended sequences, we adopt a two-stage generation process guided by 3D geometry: 1) 3D-Guided Keyframe Generation, where a customized image diffusion model refines rendered keyframes to ensure consistency and quality, and 2) 3D-Guided Video Interpolation, a training-free approach that generates smooth, high-quality video frames between keyframes using bidirectional guidance. Experimental results highlight the effectiveness of our framework in producing controllable, high-quality animations from single input images by harmonizing 3D geometry with generative models. The code for our framework will be publicly released.

Title: SASNet: Spatially-Adaptive Sinusoidal Neural Networks

Authors: Haoan Feng, Diana Aldana, Tiago Novello, Leila De Floriani
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09750
Pdf URL: https://arxiv.org/pdf/2503.09750
Copy Paste: [[2503.09750]] SASNet: Spatially-Adaptive Sinusoidal Neural Networks(https://arxiv.org/abs/2503.09750)
Keywords: super-resolution
Abstract: Sinusoidal neural networks (SNNs) have emerged as powerful implicit neural representations (INRs) for low-dimensional signals in computer vision and graphics. They enable high-frequency signal reconstruction and smooth manifold modeling; however, they often suffer from spectral bias, training instability, and overfitting. To address these challenges, we propose SASNet, Spatially-Adaptive SNNs that robustly enhance the capacity of compact INRs to fit detailed signals. SASNet integrates a frequency embedding layer to control frequency components and mitigate spectral bias, along with jointly optimized, spatially-adaptive masks that localize neuron influence, reducing network redundancy and improving convergence stability. Robust to hyperparameter selection, SASNet faithfully reconstructs high-frequency signals without overfitting low-frequency regions. Our experiments show that SASNet outperforms state-of-the-art INRs, achieving strong fitting accuracy, super-resolution capability, and noise suppression, without sacrificing model compactness.

Title: A PyTorch-Enabled Tool for Synthetic Event Camera Data Generation and Algorithm Development

Authors: Joseph L. Greene, Adrish Kar, Ignacio Galindo, Elijah Quiles, Elliott Chen, Matthew Anderson
Subjects: cs.CV, physics.optics
Abstract URL: https://arxiv.org/abs/2503.09754
Pdf URL: https://arxiv.org/pdf/2503.09754
Copy Paste: [[2503.09754]] A PyTorch-Enabled Tool for Synthetic Event Camera Data Generation and Algorithm Development(https://arxiv.org/abs/2503.09754)
Keywords: generation
Abstract: Event cameras, also called neuromorphic cameras, offer a novel encoding of natural scenes by asynchronously reporting significant changes in brightness, known as events, with improved dynamic range, temporal resolution and lower data bandwidth when compared to conventional cameras. However, their adoption in domain-specific research tasks is hindered in part by limited commercial availability, lack of existing datasets, and challenges related to predicting the impact of their nonlinear optical encoding, unique noise model and tensor-based data processing requirements. To address these challenges, we introduce Synthetic Events for Neural Processing and Integration (SENPI) in Python, a PyTorch-based library for simulating and processing event camera data. SENPI includes a differentiable digital twin that converts intensity-based data into event representations, allowing for evaluation of event camera performance while handling the non-smooth and nonlinear nature of the forward model. The library also supports modules for event-based I/O, manipulation, filtering and visualization, creating efficient and scalable workflows for both synthetic and real event-based data. We demonstrate SENPI's ability to produce realistic event-based data by comparing synthetic outputs to real event camera data and use these results to draw conclusions on the properties and utility of event-based perception. Additionally, we showcase SENPI's use in exploring event camera behavior under varying noise conditions and optimizing event contrast threshold for improved encoding under target conditions. Ultimately, SENPI aims to lower the barrier to entry for researchers by providing an accessible tool for event data generation and algorithmic development, making it a valuable resource for advancing research in neuromorphic vision systems.
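The conversion from intensity frames to events that the abstract describes can be illustrated with the textbook contrast-threshold model: a pixel fires when its log intensity has changed by at least a threshold since its last event. The sketch below uses NumPy rather than PyTorch and does not use the SENPI API; `frames_to_events`, `threshold`, and `eps` are hypothetical names, and noise and differentiability are left out.

```python
import numpy as np

def frames_to_events(frames, threshold=0.2, eps=1e-6):
    """Simplified event model: emit (t, y, x, polarity) when the log
    intensity at a pixel differs from its reference by >= threshold.
    Emits at most one event per pixel per frame (a simplification)."""
    log_ref = np.log(frames[0] + eps)        # per-pixel reference level
    events = []
    for t in range(1, len(frames)):
        log_cur = np.log(frames[t] + eps)
        diff = log_cur - log_ref
        fired = np.abs(diff) >= threshold
        for y, x in zip(*np.nonzero(fired)):
            events.append((t, int(y), int(x), 1 if diff[y, x] > 0 else -1))
        # Pixels that fired reset their reference to the current level.
        log_ref = np.where(fired, log_cur, log_ref)
    return events

frames = np.ones((3, 2, 2))
frames[1, 0, 0] = 2.0    # brightening at pixel (0, 0) in frame 1
frames[2, 0, 0] = 2.0    # stays bright, so no new event
frames[2, 1, 1] = 0.5    # darkening at pixel (1, 1) in frame 2
events = frames_to_events(frames)
```

Sweeping `threshold` in a loop like this is the simplest form of the contrast-threshold exploration mentioned in the abstract: lower values produce more events, higher values produce fewer.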

Title: BiasConnect: Investigating Bias Interactions in Text-to-Image Models

Authors: Pushkar Shukla, Aditya Chinchure, Emily Diana, Alexander Tolbert, Kartik Hosanagar, Vineeth N. Balasubramanian, Leonid Sigal, Matthew A. Turk
Subjects: cs.CV, cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2503.09763
Pdf URL: https://arxiv.org/pdf/2503.09763
Copy Paste: [[2503.09763]] BiasConnect: Investigating Bias Interactions in Text-to-Image Models(https://arxiv.org/abs/2503.09763)
Keywords: generative
Abstract: The biases exhibited by Text-to-Image (TTI) models are often treated as if they are independent, but in reality, they may be deeply interrelated. Addressing bias along one dimension, such as ethnicity or age, can inadvertently influence another dimension, like gender, either mitigating or exacerbating existing disparities. Understanding these interdependencies is crucial for designing fairer generative models, yet measuring such effects quantitatively remains a challenge. In this paper, we aim to address these questions by introducing BiasConnect, a novel tool designed to analyze and quantify bias interactions in TTI models. Our approach leverages a counterfactual-based framework to generate pairwise causal graphs that reveal the underlying structure of bias interactions for the given text prompt. Additionally, our method provides empirical estimates that indicate how other bias dimensions shift toward or away from an ideal distribution when a given bias is modified. Our estimates have a strong correlation (+0.69) with the interdependency observations post bias mitigation. We demonstrate the utility of BiasConnect for selecting optimal bias mitigation axes, comparing different TTI models on the dependencies they learn, and understanding the amplification of intersectional societal biases in TTI models.

Title: Temporal Difference Flows

Authors: Jesse Farebrother, Matteo Pirotta, Andrea Tirinzoni, Rémi Munos, Alessandro Lazaric, Ahmed Touati
Subjects: cs.LG, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2503.09817
Pdf URL: https://arxiv.org/pdf/2503.09817
Copy Paste: [[2503.09817]] Temporal Difference Flows(https://arxiv.org/abs/2503.09817)
Keywords: generative
Abstract: Predictive models of the future are fundamental for an agent's ability to reason and plan. A common strategy learns a world model and unrolls it step-by-step at inference, where small errors can rapidly compound. Geometric Horizon Models (GHMs) offer a compelling alternative by directly making predictions of future states, avoiding cumulative inference errors. While GHMs can be conveniently learned by a generative analog to temporal difference (TD) learning, existing methods are negatively affected by bootstrapping predictions at train time and struggle to generate high-quality predictions at long horizons. This paper introduces Temporal Difference Flows (TD-Flow), which leverages the structure of a novel Bellman equation on probability paths alongside flow-matching techniques to learn accurate GHMs at over 5x the horizon length of prior methods. Theoretically, we establish a new convergence result and primarily attribute TD-Flow's efficacy to reduced gradient variance during training. We further show that similar arguments can be extended to diffusion-based methods. Empirically, we validate TD-Flow across a diverse set of domains on both generative metrics and downstream tasks including policy evaluation. Moreover, integrating TD-Flow with recent behavior foundation models for planning over pre-trained policies demonstrates substantial performance gains, underscoring its promise for long-horizon decision-making.
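A geometric horizon model predicts a future state whose look-ahead distance follows a geometric distribution set by the discount factor. The sketch below shows how a training target of that kind could be drawn from a recorded trajectory. It illustrates the sampling scheme only, not TD-Flow itself; `sample_geometric_horizon_state` is a hypothetical helper, and clipping at the end of the trajectory is an assumption.

```python
import numpy as np

def sample_geometric_horizon_state(trajectory, t, gamma, rng):
    """Draw a future state s_{t+k} with k ~ Geometric(1 - gamma),
    i.e. P(k) = (1 - gamma) * gamma**(k - 1) for k = 1, 2, ..."""
    k = rng.geometric(1.0 - gamma)
    idx = min(t + k, len(trajectory) - 1)   # assumed: clip at trajectory end
    return trajectory[idx]

rng = np.random.default_rng(0)
trajectory = np.arange(100)                  # toy states: state i is the integer i
future = sample_geometric_horizon_state(trajectory, t=10, gamma=0.9, rng=rng)

# The mean look-ahead is 1 / (1 - gamma), so gamma = 0.9 looks about 10 steps ahead.
horizons = rng.geometric(1.0 - 0.9, size=100_000)
mean_horizon = horizons.mean()
```

Pushing `gamma` toward 1 lengthens the effective horizon, which is the regime where the abstract says bootstrapped methods struggle.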

Title: Resolution Invariant Autoencoder

Authors: Ashay Patel, Michela Antonelli, Sebastien Ourselin, M. Jorge Cardoso
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.09828
Pdf URL: https://arxiv.org/pdf/2503.09828
Copy Paste: [[2503.09828]] Resolution Invariant Autoencoder(https://arxiv.org/abs/2503.09828)
Keywords: super-resolution, generative
Abstract: Deep learning has significantly advanced medical imaging analysis, yet variations in image resolution remain an overlooked challenge. Most methods address this by resampling images, leading to either information loss or computational inefficiencies. While solutions exist for specific tasks, no unified approach has been proposed. We introduce a resolution-invariant autoencoder that adapts spatial resizing at each layer in the network via a learned variable resizing process, replacing fixed spatial down/upsampling at the traditional factor of 2. This ensures a consistent latent space resolution, regardless of input or output resolution. Our model enables various downstream tasks to be performed on an image latent whilst maintaining performance across different resolutions, overcoming the shortfalls of traditional methods. We demonstrate its effectiveness in uncertainty-aware super-resolution, classification, and generative modelling tasks and show how our method outperforms conventional baselines with minimal performance loss across resolutions.

Title: Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation

Authors: Feng Zhou, Pu Cao, Yiyang Ma, Lu Yang, Jianqin Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09830
Pdf URL: https://arxiv.org/pdf/2503.09830
Copy Paste: [[2503.09830]] Exploring Position Encoding in Diffusion U-Net for Training-free High-resolution Image Generation(https://arxiv.org/abs/2503.09830)
Keywords: generation, generative
Abstract: Denoising higher-resolution latents via a pre-trained U-Net leads to repetitive and disordered image patterns. Although recent studies make efforts to improve generative quality by aligning the denoising process across original and higher resolutions, the root cause of suboptimal generation remains underexplored. Through comprehensive analysis of position encoding in U-Net, we attribute it to inconsistent position encoding, caused by the inadequate propagation of position information from zero-padding to latent features in convolution layers as resolution increases. To address this issue, we propose a novel training-free approach, introducing a Progressive Boundary Complement (PBC) method. This method creates dynamic virtual image boundaries inside the feature map to enhance position information propagation, enabling high-quality and rich-content high-resolution image synthesis. Extensive experiments demonstrate the superiority of our method.

Title: On the Limitations of Vision-Language Models in Understanding Image Transforms

Authors: Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
Subjects: cs.CV, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.09837
Pdf URL: https://arxiv.org/pdf/2503.09837
Copy Paste: [[2503.09837]] On the Limitations of Vision-Language Models in Understanding Image Transforms(https://arxiv.org/abs/2503.09837)
Keywords: generation
Abstract: Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.

Title: LuciBot: Automated Robot Policy Learning from Generated Videos

Authors: Xiaowen Qiu, Yian Wang, Jiting Cai, Zhehuan Chen, Chunru Lin, Tsun-Hsuan Wang, Chuang Gan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09871
Pdf URL: https://arxiv.org/pdf/2503.09871
Copy Paste: [[2503.09871]] LuciBot: Automated Robot Policy Learning from Generated Videos(https://arxiv.org/abs/2503.09871)
Keywords: generation
Abstract: Automatically generating training supervision for embodied tasks is crucial, as manual designing is tedious and not scalable. While prior works use large language models (LLMs) or vision-language models (VLMs) to generate rewards, these approaches are largely limited to simple tasks with well-defined rewards, such as pick-and-place. This limitation arises because LLMs struggle to interpret complex scenes compressed into text or code due to their restricted input modality, while VLM-based rewards, though better at visual perception, remain limited by their less expressive output modality. To address these challenges, we leverage the imagination capability of general-purpose video generation models. Given an initial simulation frame and a textual task description, the video generation model produces a video demonstrating task completion with correct semantics. We then extract rich supervisory signals from the generated video, including 6D object pose sequences, 2D segmentations, and estimated depth, to facilitate task learning in simulation. Our approach significantly improves supervision quality for complex embodied tasks, enabling large-scale training in simulators.

Title: Inter-environmental world modeling for continuous and compositional dynamics

Authors: Kohei Hayashi, Masanori Koyama, Julian Jorge Andrade Guerreiro
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09911
Pdf URL: https://arxiv.org/pdf/2503.09911
Copy Paste: [[2503.09911]] Inter-environmental world modeling for continuous and compositional dynamics(https://arxiv.org/abs/2503.09911)
Keywords: generative
Abstract: Various world model frameworks are being developed today based on autoregressive frameworks that rely on discrete representations of actions and observations, and these frameworks are succeeding in constructing interactive generative models for the target environment of interest. Meanwhile, humans demonstrate remarkable generalization abilities to combine experiences in multiple environments to mentally simulate and learn to control agents in diverse environments. Inspired by this human capability, we introduce World modeling through Lie Action (WLA), an unsupervised framework that learns continuous latent action representations to simulate across environments. WLA learns a control interface with high controllability and predictive ability by simultaneously modeling the dynamics of multiple environments using Lie group theory and object-centric autoencoder. On synthetic benchmark and real-world datasets, we demonstrate that WLA can be trained using only video frames and, with minimal or no action labels, can quickly adapt to new environments with novel action sets.

Title: Type Information-Assisted Self-Supervised Knowledge Graph Denoising

Authors: Jiaqi Sun, Yujia Zheng, Xinshuai Dong, Haoyue Dai, Kun Zhang
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.09916
Pdf URL: https://arxiv.org/pdf/2503.09916
Copy Paste: [[2503.09916]] Type Information-Assisted Self-Supervised Knowledge Graph Denoising(https://arxiv.org/abs/2503.09916)
Keywords: generation
Abstract: Knowledge graphs serve as critical resources supporting intelligent systems, but they can be noisy due to imperfect automatic generation processes. Existing approaches to noise detection often rely on external facts, logical rule constraints, or structural embeddings. These methods are often challenged by imperfect entity alignment, flexible knowledge graph construction, and overfitting on structures. In this paper, we propose to exploit the consistency between entity and relation type information for noise detection, resulting in a novel self-supervised knowledge graph denoising method that avoids those problems. We formalize type inconsistency noise as triples that deviate from the majority with respect to type-dependent reasoning along the topological structure. Specifically, we first extract a compact representation of a given knowledge graph via an encoder that models the type dependencies of triples. Then, the decoder reconstructs the original input knowledge graph based on the compact representation. It is worth noting that our proposal has the potential to address the problems of knowledge graph compression and completion, although this is not our focus. For the specific task of noise detection, the discrepancy between the reconstruction results and the input knowledge graph provides an opportunity for denoising, which is facilitated by the type consistency embedded in our method. Experimental validation demonstrates the effectiveness of our approach in detecting potential noise in real-world data.

Title: VideoMerge: Towards Training-free Long Video Generation

Authors: Siyang Zhang, Harry Yang, Ser-Nam Lim
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09926
Pdf URL: https://arxiv.org/pdf/2503.09926
Copy Paste: [[2503.09926]] VideoMerge: Towards Training-free Long Video Generation(https://arxiv.org/abs/2503.09926)
Keywords: generation
Abstract: Long video generation remains a challenging and compelling topic in computer vision. Diffusion-based models, among the various approaches to video generation, have achieved state-of-the-art quality with their iterative denoising procedures. However, the intrinsic complexity of the video domain renders the training of such diffusion models exceedingly expensive in terms of both data curation and computational resources. Moreover, these models typically operate on a fixed noise tensor that represents the video, resulting in predetermined spatial and temporal dimensions. Although several high-quality open-source pretrained video diffusion models, jointly trained on images and videos of varying lengths and resolutions, are available, it is generally not recommended to specify a video length at inference that was not included in the training set. Consequently, these models are not readily adaptable to the direct generation of longer videos by merely increasing the specified video length. In addition to feasibility challenges, long-video generation also encounters quality issues. The domain of long videos is inherently more complex than that of short videos: extended durations introduce greater variability and necessitate long-range temporal consistency, thereby increasing the overall difficulty of the task. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos generated by a pretrained text-to-video diffusion model. Our approach preserves the model's original expressiveness and consistency while allowing for extended duration and dynamic variation as specified by the user. By leveraging the strengths of pretrained models, our method addresses challenges related to smoothness, consistency, and dynamic content through orthogonal strategies that operate collaboratively to achieve superior quality.
摘要：在计算机视觉中，长期视频生成仍然是一个具有挑战性和引人入胜的话题。在视频生成的各种方法中，基于扩散的模型通过迭代降解程序实现了最先进的质量。但是，视频域的内在复杂性使这种扩散模型的训练在数据策展和计算资源方面非常昂贵。此外，这些模型通常在代表视频的固定噪声张量上操作，从而导致预定的空间和时间尺寸。尽管可以使用几种高质量的开源视频扩散模型，但可以使用不同长度和分辨率的图像和视频进行联合培训，但通常不建议在训练集中指定未包含的推理时指定视频长度。因此，这些模型不容易通过增加指定的视频长度来适应更长的视频的直接生成。除了可行性挑战外，长期发电还遇到了质量问题。长视频的领域本质上比简短视频的范围更为复杂：延长的持续时间引入了更大的可变性并需要长时间的时间一致性，从而增加了任务的整体难度。我们提出了视频，这是一种无训练的方法，可以无缝地适应通过验证的文本对视频扩散模型产生的简短视频。我们的方法保留了模型的原始表现力和一致性，同时允许用户指定的延长持续时间和动态变化。通过利用验证模型的优势，我们的方法通过正交策略来解决与平稳性，一致性和动态内容相关的挑战，这些策略可协作以实现卓越的质量。

Title: PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation

Authors: Sen Wang, Dongliang Zhou, Liang Xie, Chao Xu, Ye Yan, Erwei Yin
Subjects: cs.CV, cs.MM, cs.RO
Abstract URL: https://arxiv.org/abs/2503.09938
Pdf URL: https://arxiv.org/pdf/2503.09938
Copy Paste: [[2503.09938]] PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation(https://arxiv.org/abs/2503.09938)
Keywords: generation
Abstract: Vision-and-language navigation (VLN) tasks require agents to navigate three-dimensional environments guided by natural language instructions, offering substantial potential for diverse applications. However, the scarcity of training data impedes progress in this field. This paper introduces PanoGen++, a novel framework that addresses this limitation by generating varied and pertinent panoramic environments for VLN tasks. PanoGen++ incorporates pre-trained diffusion models with domain-specific fine-tuning, employing parameter-efficient techniques such as low-rank adaptation to minimize computational costs. We investigate two settings for environment generation: masked image inpainting and recursive image outpainting. The former maximizes novel environment creation by inpainting masked regions based on textual descriptions, while the latter facilitates agents' learning of spatial relationships within panoramas. Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance enhancements: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter enhancement in goal progress on the CVDN validation unseen set. PanoGen++ augments the diversity and relevance of training environments, resulting in improved generalization and efficacy in VLN tasks.
摘要：视觉和语言导航（VLN）任务要求代理在以自然语言指导为指导的三维环境中导航，从而为各种应用提供了巨大的潜力。但是，培训数据的稀缺性阻碍了该领域的进步。本文介绍了Panogen ++，这是一个新颖的框架，该框架通过为VLN任务生成各种和相关的全景环境来解决此限制。 Panogen ++将预先训练的扩散模型与域特异性微调合并，采用参数效率高效技术，例如低级适应，以最大程度地减少计算成本。我们研究了环境生成的两个设置：掩盖图像介绍和递归图像支出。前者通过基于文本描述来介绍掩盖区域，从而最大程度地创造了新颖的环境，而后者则有助于代理商在全景中学习空间关系。 Empirical evaluations on room-to-room (R2R), room-for-room (R4R), and cooperative vision-and-dialog navigation (CVDN) datasets reveal significant performance enhancements: a 2.44% increase in success rate on the R2R test leaderboard, a 0.63% improvement on the R4R validation unseen set, and a 0.75-meter enhancement in goal progress on the CVDN validation unseen set. Panogen ++增强了培训环境的多样性和相关性，从而提高了VLN任务的概括和功效。
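
The PanoGen++ abstract above mentions low-rank adaptation as its parameter-efficient fine-tuning technique. The sketch below illustrates only the generic idea: a frozen weight matrix W is adjusted by a trainable low-rank product B·A. It is not the authors' implementation; the matrix sizes, the rank, the `scale` factor, and all function names are assumptions chosen for illustration.

```python
# Generic low-rank adaptation (LoRA) sketch in pure Python.
# All names and sizes here are hypothetical; the abstract does not
# specify the rank, the adapted layers, or any scaling factor.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W stays frozen, only B and A are trained."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

def trainable_params(d_out, d_in, rank):
    """Number of trainable values in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

# Frozen 3x3 weight and a rank-1 update.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
b = [[1.0], [0.0], [2.0]]   # shape 3 x 1
a = [[0.5, 0.0, -1.0]]      # shape 1 x 3
adapted = lora_weight(w, b, a)

# For a 1024 x 1024 layer, rank 8 trains 16384 values instead of 1048576.
full = 1024 * 1024
low_rank = trainable_params(1024, 1024, 8)
```

The saving comes from the factorization: the update has d_out × d_in entries but is described by only rank × (d_out + d_in) trainable values.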

Title: Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction

Authors: Xiaobo Xia, Xiaofeng Liu, Jiale Liu, Kuai Fang, Lu Lu, Samet Oymak, William S. Currie, Tongliang Liu
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.09947
Pdf URL: https://arxiv.org/pdf/2503.09947
Copy Paste: [[2503.09947]] Identifying Trustworthiness Challenges in Deep Learning Models for Continental-Scale Water Quality Prediction(https://arxiv.org/abs/2503.09947)
Keywords: generation
Abstract: Water quality is foundational to environmental sustainability, ecosystem resilience, and public health. Deep learning models, particularly Long Short-Term Memory (LSTM) networks, offer transformative potential for large-scale water quality prediction and scientific insights generation. However, their widespread adoption in high-stakes decision-making, such as pollution mitigation and equitable resource allocation, is prevented by unresolved trustworthiness challenges including fairness, uncertainty, interpretability, robustness, generalizability, and reproducibility. In this work, we present the first comprehensive evaluation of trustworthiness in a continental-scale multi-task LSTM model predicting 20 water quality variables (encompassing physical/chemical processes, geochemical weathering, and nutrient cycling) across 482 U.S. basins. Our investigation uncovers systematic patterns of model performance disparities linked to basin characteristics, the inherent complexity of biogeochemical processes, and variable predictability, emphasizing critical performance fairness concerns. We further propose methodological frameworks for quantitatively evaluating critical aspects of trustworthiness, including uncertainty, interpretability, and robustness, identifying key limitations that could challenge reliable real-world deployment. This work serves as a timely call to action for advancing trustworthy data-driven methods for water resources management and provides a pathway to offering critical insights for researchers, decision-makers, and practitioners seeking to leverage artificial intelligence (AI) responsibly in environmental management.
摘要：水质是环境可持续性，生态系统弹性和公共卫生的基础。深度学习模型，尤其是长期记忆（LSTM）网络，为大规模水质预测和科学见解的产生提供了变革潜力。但是，他们在高风险决策中的广泛采用，例如缓解污染和公平资源分配，这是由于未解决的可信赖性挑战所阻止的，包括公平，不确定性，不确定性，可解释性，可靠性，可概括性和可重复性。在这项工作中，我们在482个美国盆地的大陆尺度多任务LSTM模型中介绍了大陆规模的多任务LSTM模型中的首次全面评估，该模型可预测20种水质变量（包括物理/化学过程，地球化学风化和营养循环）。我们的调查发现了与盆地特征，生物地球化学过程的固有复杂性以及可变可预测性相关的模型性能差异的系统模式，并强调了关键的性能公平关注。我们进一步提出了方法论框架，以定量评估可信赖性的关键方面，包括不确定性，可解释性和鲁棒性，确定可能挑战可靠现实世界部署的关键局限性。这项工作是及时呼吁推进水资源管理的可信赖数据驱动的方法，并为研究人员，决策者和从业人员提供了寻求在环境管理中负责任地利用人工智能（AI）的关键见解的途径。

Title: UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Authors: Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09949
Pdf URL: https://arxiv.org/pdf/2503.09949
Copy Paste: [[2503.09949]] UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?(https://arxiv.org/abs/2503.09949)
Keywords: generative
Abstract: With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 16 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation. The code is available at this https URL.
摘要：随着视频生成模型（VGM）的快速增长，为AI生成的视频（AIGVS）开发可靠且全面的自动指标至关重要。现有方法要么使用针对其他任务进行了优化的现成模型，要么依靠人类评估数据来培训专业评估者。这些方法限制在特定的评估方面，并且由于对更元素和更全面评估的需求不断增长，因此难以扩展。为了解决这个问题，这项工作调查了使用多模式大语言模型（MLLM）作为AIGVS的统一评估者的可行性，利用其强大的视觉感知和语言理解能力。为了评估统一AIGV评估中自动指标的性能，我们介绍了一个名为uve bench的基准。 Uve Bench收集了由最先进的VGM产生的视频，并在15个评估方面提供了成对的人类偏好注释。使用UVE板台，我们广泛评估16个MLLM。我们的经验结果表明，尽管高级MLLM（例如QWEN2VL-72B和InternVL2.5-78B）仍然落后于人类评估者，但它们在统一的AIGV评估中表现出了有希望的能力，显着超过了现有的专业评估方法。此外，我们对影响MLLM驱动的评估人员的性能的关键设计选择进行了深入的分析，为未来的AIGV评估研究提供了宝贵的见解。该代码可在此HTTPS URL上找到。

Title: Exploring Mutual Empowerment Between Wireless Networks and RL-based LLMs: A Survey

Authors: Yu Qiao, Phuong-Nam Tran, Ji Su Yoon, Loc X. Nguyen, Choong Seon Hong
Subjects: cs.LG, cs.AI, cs.CV, cs.ET
Abstract URL: https://arxiv.org/abs/2503.09956
Pdf URL: https://arxiv.org/pdf/2503.09956
Copy Paste: [[2503.09956]] Exploring Mutual Empowerment Between Wireless Networks and RL-based LLMs: A Survey(https://arxiv.org/abs/2503.09956)
Keywords: generation
Abstract: Reinforcement learning (RL)-based large language models (LLMs), such as ChatGPT, DeepSeek, and Grok-3, have gained significant attention for their exceptional capabilities in natural language processing and multimodal data understanding. Meanwhile, the rapid expansion of information services has driven the growing need for intelligent, efficient, and adaptable wireless networks. Wireless networks require the empowerment of RL-based LLMs while these models also benefit from wireless networks to broaden their application scenarios. Specifically, RL-based LLMs can enhance wireless communication systems through intelligent resource allocation, adaptive network optimization, and real-time decision-making. Conversely, wireless networks provide a vital infrastructure for the efficient training, deployment, and distributed inference of RL-based LLMs, especially in decentralized and edge computing environments. This mutual empowerment highlights the need for a deeper exploration of the interplay between these two domains. We first review recent advancements in wireless communications, highlighting the associated challenges and potential solutions. We then discuss the progress of RL-based LLMs, focusing on key technologies for LLM training, challenges, and potential solutions. Subsequently, we explore the mutual empowerment between these two fields, highlighting key motivations, open challenges, and potential solutions. Finally, we provide insights into future directions, applications, and their societal impact to further explore this intersection, paving the way for next-generation intelligent communication systems. Overall, this survey provides a comprehensive overview of the relationship between RL-based LLMs and wireless networks, offering a vision where these domains empower each other to drive innovations.
摘要：加强学习（RL）基于大型语言模型（LLM），例如Chatgpt，DeepSeek和Grok-3，对它们在自然语言处理和多模式数据理解中的非凡能力引起了极大的关注。同时，信息服务的快速扩展促使人们对智能，高效和适应性无线网络的需求不断增长。无线网络需要增强基于RL的LLM的能力，而这些模型也受益于无线网络，以扩大其应用程序方案。具体而言，基于RL的LLM可以通过智能资源分配，自适应网络优化和实时决策来增强无线通信系统。相反，无线网络为基于RL的LLM的有效培训，部署和分布推理提供了重要的基础架构，尤其是在分散和边缘计算环境中。这种相互的授权强调了对这两个领域之间相互作用的更深入探索的必要性。我们首先回顾了无线通信方面的最新进展，突出了相关的挑战和潜在解决方案。然后，我们讨论基于RL的LLM的进展，重点是用于LLM培训，挑战和潜在解决方案的关键技术。随后，我们探索了这两个领域之间的相互授权，突出了关键动机，开放挑战和潜在的解决方案。最后，我们为未来的方向，应用程序及其社会影响提供了见解，以进一步探索这一交叉点，为下一代智能通信系统铺平了道路。总体而言，这项调查提供了有关基于RL的LLM和无线网络之间关系的全面概述，从而提供了一个愿景，这些域名互相授权彼此以推动创新。

Title: Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes

Authors: JunYong Choi, Min-Cheol Sagong, SeokYeong Lee, Seung-Won Jung, Ig-Jae Kim, Junghyun Cho
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.09993
Pdf URL: https://arxiv.org/pdf/2503.09993
Copy Paste: [[2503.09993]] Channel-wise Noise Scheduled Diffusion for Inverse Rendering in Indoor Scenes(https://arxiv.org/abs/2503.09993)
Keywords: generative
Abstract: We propose a diffusion-based inverse rendering framework that decomposes a single RGB image into geometry, material, and lighting. Inverse rendering is inherently ill-posed, making it difficult to predict a single accurate solution. To address this challenge, recent generative model-based methods aim to present a range of possible solutions. However, finding a single accurate solution and generating diverse solutions can be conflicting. In this paper, we propose a channel-wise noise scheduling approach that allows a single diffusion model architecture to achieve two conflicting objectives. The resulting two diffusion models, trained with different channel-wise noise schedules, can predict a single highly accurate solution and present multiple possible solutions. The experimental results demonstrate the superiority of our two models in terms of both diversity and accuracy, which translates to enhanced performance in downstream applications such as object insertion and material editing.
摘要：我们提出了一个基于扩散的反渲染框架，将单个RGB图像分解为几何，材料和照明。逆渲染本质上是不符合的，因此很难预测单个精确的解决方案。为了应对这一挑战，最近基于生成的模型方法旨在提出一系列可能的解决方案。但是，找到单个准确的解决方案并生成多种解决方案可能是矛盾的。在本文中，我们提出了一种通过渠道的噪声调度方法，该方法允许单个扩散模型体系结构实现两个相互矛盾的目标。由此产生的两个扩散模型，通过不同的频道噪声表训练，可以预测一个高度精确的解决方案，并提供多种可能的解决方案。实验结果证明了我们两个模型在多样性和准确性方面具有优势，这转化为在下游应用中的性能增强，例如对象插入和材料编辑。

Title: Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models

Authors: Sina Malakouti, Adriana Kovashka
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10037
Pdf URL: https://arxiv.org/pdf/2503.10037
Copy Paste: [[2503.10037]] Investigating and Improving Counter-Stereotypical Action Relation in Text-to-Image Diffusion Models(https://arxiv.org/abs/2503.10037)
Keywords: generation
Abstract: Text-to-image diffusion models consistently fail at generating counter-stereotypical action relationships (e.g., "mouse chasing cat"), defaulting to frequent stereotypes even when explicitly prompted otherwise. Through systematic investigation, we discover this limitation stems from distributional biases rather than inherent model constraints. Our key insight reveals that while models fail on rare compositions when their inversions are common, they can successfully generate similar intermediate compositions (e.g., "mouse chasing boy"). To test this hypothesis, we develop a Role-Bridging Decomposition framework that leverages these intermediates to gradually teach rare relationships without architectural modifications. We introduce ActionBench, a comprehensive benchmark specifically designed to evaluate action-based relationship generation across stereotypical and counter-stereotypical configurations. Our experiments validate that intermediate compositions indeed facilitate counter-stereotypical generation, with both automatic metrics and human evaluations showing significant improvements over existing approaches. This work not only identifies fundamental biases in current text-to-image systems but demonstrates a promising direction for addressing them through compositional reasoning.
摘要：文本对图像扩散模型始终无法生成反型动作关系（例如，“鼠标追逐猫”），即使在明确提示的情况下，默认为频繁的刻板印象也是如此。通过系统的研究，我们发现了这种限制源于分布偏见而不是固有的模型约束。我们的主要见解表明，虽然模型在罕见成分很常见时会失败，但它们可以成功生成相似的中间组成（例如，“鼠标追逐男孩”）。为了检验这一假设，我们开发了一个构成角色的分解框架，该框架利用这些中间体在没有建筑修改的情况下逐渐教授罕见的关系。我们介绍了ActionBench，这是一种综合基准，专门旨在评估跨刻板印象和反式配置的基于动作的关系的产生。我们的实验验证了中间组成的确确实有助于反疾病的产生，自动指标和人类评估都显示出对现有方法的显着改善。这项工作不仅确定了当前文本到图像系统中的基本偏见，而且还展示了通过构图推理来解决它们的有希望的方向。

Title: FourierSR: A Fourier Token-based Plugin for Efficient Image Super-Resolution

Authors: Wenjie Li, Heng Guo, Yuefeng Hou, Zhanyu Ma
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10043
Pdf URL: https://arxiv.org/pdf/2503.10043
Copy Paste: [[2503.10043]] FourierSR: A Fourier Token-based Plugin for Efficient Image Super-Resolution(https://arxiv.org/abs/2503.10043)
Keywords: super-resolution
Abstract: Image super-resolution (SR) aims to recover high-resolution images from low-resolution images, where improving SR efficiency is a high-profile challenge. However, commonly used units in SR, like convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling convolution theorem through token mix, we propose a Fourier token-based plugin called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token mix technologies when applied as plug-ins. Furthermore, compared to convolutions and window-based Transformers, our FourierSR only utilizes Fourier transform and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that our FourierSR as a plug-and-play unit brings an average PSNR gain of 0.34dB for existing efficient SR methods on Manga109 test set at the scale of x4, while the average increase in the number of Params and FLOPs is only 0.6% and 1.5% of original sizes. We will release our codes upon acceptance.
摘要：图像超分辨率（SR）旨在将低分辨率图像恢复到高分辨率图像中，而改善SR效率是一个备受瞩目的挑战。但是，SR中常用的单元，例如卷积和基于窗户的变压器，具有有限的接收场，因此在极限有限的计算成本下将其应用于改善SR的挑战。为了解决此问题，我们通过令牌混合物通过对卷积定理进行建模启发，我们建议一个名为FourierSr的基于傅立叶令牌的插件，以均匀地改进SR，从而避免将现有令牌混合技术的不稳定性或效率低效率应用于插件时。此外，与卷积和基于Windows的变压器相比，我们的FouriersR仅利用傅立叶变换和乘法操作，在拥有全球接受场的同时大大降低了复杂性。实验结果表明，我们作为插件单元的FourierSR在X4尺度上的现有有效的SR方法的平均PSNR增益为0.34dB，而参数和拖鞋的平均数量仅为原始尺寸的0.6％和1.5％。我们将在接受后发布我们的代码。
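
The FourierSR abstract above builds on the convolution theorem: a transform, an element-wise multiplication, and an inverse transform together act like a convolution whose kernel spans the whole input, which is where the global receptive field comes from. The sketch below checks that identity on a tiny 1-D sequence with a naive DFT. It is only an illustration of the theorem, not the FourierSR module; the token values, the filter, and all function names are made up.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_conv(x, h):
    """Direct circular convolution: every output position mixes all inputs."""
    n = len(x)
    return [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]

def fourier_mix(x, h):
    """Same mixing via transform, element-wise product, inverse transform."""
    spectrum_x, spectrum_h = dft(x), dft(h)
    product = [a * b for a, b in zip(spectrum_x, spectrum_h)]
    return [v.real for v in dft(product, inverse=True)]

tokens = [1.0, 2.0, 0.0, -1.0]    # hypothetical 1-D token sequence
kernel = [0.5, 0.25, 0.0, 0.25]   # hypothetical global filter
direct = circular_conv(tokens, kernel)
mixed = fourier_mix(tokens, kernel)

# The two results agree up to floating-point error.
max_error = max(abs(a - b) for a, b in zip(direct, mixed))
```

A practical implementation would use a fast Fourier transform rather than this quadratic loop; the naive version is kept here so the example depends only on the standard library.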

Title: Compute Optimal Scaling of Skills: Knowledge vs Reasoning

Authors: Nicholas Roberts, Niladri Chatterji, Sharan Narang, Mike Lewis, Dieuwke Hupkes
Subjects: cs.LG, cs.AI, cs.CL
Abstract URL: https://arxiv.org/abs/2503.10061
Pdf URL: https://arxiv.org/pdf/2503.10061
Copy Paste: [[2503.10061]] Compute Optimal Scaling of Skills: Knowledge vs Reasoning(https://arxiv.org/abs/2503.10061)
Keywords: generation
Abstract: Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as 'compute-optimally' trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: $\textbf{scaling laws are skill-dependent}$. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, $\textbf{knowledge and code exhibit fundamental differences in scaling behaviour}$. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that $\textbf{a misspecified validation set can impact compute-optimal parameter count by nearly 50%,}$ depending on its skill composition.
摘要：扩展定律是LLM开发管道的关键组成部分，最著名的是预测培训决策的一种方式，例如“计算最佳的”交易参数计数和数据集大小，以及其他关键决策的最新列表。在这项工作中，我们询问计算最佳缩放行为是否可以与技能有关。特别是，我们检查了基于知识和推理的技能，例如基于知识的质量质量质量检查和代码生成，并以肯定的方式回答了这个问题：$ \ textbf {缩放定律是技能依赖性} $。接下来，要了解与技能相关的缩放率是否是预处理Datamix的人工制品，我们进行了广泛的消融不同的数据amix，并发现当纠正数据amix差异时，$ \ textbf {知识和代码在缩放行为中表现出基本差异} $。最后，我们分析了我们的发现如何使用验证集与标准的计算最佳缩放相关联，并发现$ \ textbf {误解的验证集可能会影响compute-oftimal参数计数近50％，} $根据其技能组成。

Title: VMBench: A Benchmark for Perception-Aligned Video Motion Generation

Authors: Xinrang Ling, Chen Zhu, Meiqi Wu, Hangyu Li, Xiaokun Feng, Cundian Yang, Aiming Hao, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10076
Pdf URL: https://arxiv.org/pdf/2503.10076
Copy Paste: [[2503.10076]] VMBench: A Benchmark for Perception-Aligned Video Motion Generation(https://arxiv.org/abs/2503.10076)
Keywords: generation
Abstract: Video generation has advanced rapidly, improving evaluation methods, yet assessing a video's motion remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perceptions; 2) the existing motion prompts are limited. Based on these findings, we introduce VMBench--a comprehensive Video Motion Benchmark that has perception-aligned motion metrics and features the most diverse types of motion. VMBench has several appealing properties: 1) Perception-Driven Motion Evaluation Metrics: we identify five dimensions based on human perception in motion video assessment and develop fine-grained evaluation metrics, providing deeper insights into models' strengths and weaknesses in motion quality. 2) Meta-Guided Motion Prompt Generation: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. 3) Human-Aligned Validation Mechanism: we provide human preference annotations to validate our benchmarks, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. This is the first time that the quality of motion in videos has been evaluated from the perspective of human perception alignment. Additionally, we will soon release VMBench at this https URL, setting a new standard for evaluating and advancing motion generation models.
摘要：视频生成迅速发展，改善了评估方法，但是评估视频的动作仍然是一个主要挑战。具体而言，有两个关键问题：1）当前的运动指标并不完全与人类的看法保持一致； 2）现有的运动提示是有限的。基于这些发现，我们引入了VMBench，这是一种具有感知一致的运动指标并具有最多样化类型的运动的全面视频运动基准。 VMBENCH具有多种吸引人的特性：1）感知驱动的运动评估指标，我们根据人类在运动视频评估中的人类感知确定五个维度，并开发出细粒度的评估指标，从而更深入地了解模型在运动质量中的优势和弱点。 2）元引导的运动及时生成，一种提取元信息的结构化方法，使用LLMS生成多样化的运动提示，并通过人类AI验证来完善它们，从而产生了一个多级别提示库，涵盖了六个关键的动态场景维度。 3）人类对准的验证机制，我们提供了人类偏好注释来验证我们的基准，我们的指标平均提高了Spearman与基线方法的相关性35.3％。这是第一次从人类感知一致性的角度评估视频中的运动质量。此外，我们将很快在此HTTPS URL上发布VMBENCH，为评估和前进运动模型设定了新的标准。
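
The VMBench abstract above reports its alignment with human judgment as Spearman's correlation. As a reminder of what that statistic measures, the sketch below computes it as the Pearson correlation of rank-transformed scores, with tied values sharing an average rank. The score lists are invented for illustration and are not data from the paper; the function names are likewise hypothetical.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(a), ranks(b))

# Invented example: an automatic motion metric versus human ratings.
metric_scores = [0.2, 0.5, 0.9, 0.4]
human_scores = [1, 3, 4, 2]
rho = spearman(metric_scores, human_scores)
```

Because only the ordering matters, a metric can score on any scale and still reach a correlation of 1 as long as it ranks the videos the same way the annotators do.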

Title: Image Quality Assessment: From Human to Machine Preference

Authors: Chunyi Li, Yuan Tian, Xiaoyue Ling, Zicheng Zhang, Haodong Duan, Haoning Wu, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Guo Lu, Weisi Lin, Guangtao Zhai
Subjects: cs.CV, cs.MM, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10078
Pdf URL: https://arxiv.org/pdf/2503.10078
Copy Paste: [[2503.10078]] Image Quality Assessment: From Human to Machine Preference(https://arxiv.org/abs/2503.10078)
Keywords: quality assessment
Abstract: Image Quality Assessment (IQA) based on human subjective preferences has undergone extensive research in the past decades. However, with the development of communication protocols, the visual data consumption volume of machines has gradually surpassed that of humans. For machines, the preference depends on downstream tasks such as segmentation and detection, rather than visual appeal. Considering the huge gap between human and machine visual systems, this paper proposes the topic: Image Quality Assessment for Machine Vision for the first time. Specifically, we (1) defined the subjective preferences of machines, including downstream tasks, test models, and evaluation metrics; (2) established the Machine Preference Database (MPD), which contains 2.25M fine-grained annotations and 30k reference/distorted image pair instances; (3) verified the performance of mainstream IQA algorithms on MPD. Experiments show that current IQA metrics are human-centric and cannot accurately characterize machine preferences. We sincerely hope that MPD can promote the evolution of IQA from human to machine preferences. Project page is on: this https URL.
摘要：在过去的几十年中，基于人类主观偏好的图像质量评估（IQA）进行了广泛的研究。但是，随着通信协议的开发，机器的视觉数据消耗量逐渐超过了人类。对于机器，偏好取决于下游任务，例如分割和检测，而不是视觉吸引力。考虑到人类和机器视觉系统之间的巨大差距，本文提出了这一主题：第一次对机器视觉的图像质量评估。具体而言，我们（1）定义了机器的主观偏好，包括下游任务，测试模型和评估指标；（2）建立了机器偏好数据库（MPD），其中包含2.25m的细粒注释和30k参考/失真的图像对实例；（3）验证了MPD上主流IQA算法的性能。实验表明，当前的IQA指标是以人为本的，无法准确表征机器偏好。我们衷心希望MPD能够促进IQA从人类到机器偏好的发展。项目页面已打开：此HTTPS URL。

Title: AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption

Authors: Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-eui Yoon
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10081
Pdf URL: https://arxiv.org/pdf/2503.10081
Copy Paste: [[2503.10081]] AdvPaint: Protecting Images from Inpainting Manipulation via Adversarial Attention Disruption(https://arxiv.org/abs/2503.10081)
Keywords: generation, generative
Abstract: The outstanding capability of diffusion models in generating high-quality images poses significant threats when misused by adversaries. In particular, we assume malicious adversaries exploiting diffusion models for inpainting tasks, such as replacing a specific region with a celebrity. While existing methods for protecting images from manipulation in diffusion-based generative models have primarily focused on image-to-image and text-to-image tasks, the challenge of preventing unauthorized inpainting has been rarely addressed, often resulting in suboptimal protection performance. To mitigate inpainting abuses, we propose ADVPAINT, a novel defensive framework that generates adversarial perturbations that effectively disrupt the adversary's inpainting tasks. ADVPAINT targets the self- and cross-attention blocks in a target diffusion inpainting model to distract semantic understanding and prompt interactions during image generation. ADVPAINT also employs a two-stage perturbation strategy, dividing the perturbation region based on an enlarged bounding box around the object, enhancing robustness across diverse masks of varying shapes and sizes. Our experimental results demonstrate that ADVPAINT's perturbations are highly effective in disrupting the adversary's inpainting tasks, outperforming existing methods; ADVPAINT attains over a 100-point increase in FID and substantial decreases in precision.
摘要：扩散模型产生高质量图像的出色能力在被对手滥用时构成了重大威胁。特别是，我们假设利用扩散模型来用于介绍任务的恶意对手，例如用名人代替特定地区。虽然现有的用于保护图像免受基于扩散的生成模型操作的方法主要集中在图像到图像和文本对象任务上，但很少解决防止未经授权介绍的挑战，通常会导致次优保护性能。为了减轻滥用侵害，我们提出了Adv -Paint，这是一个新颖的防御框架，产生对抗性扰动，有效地破坏了对手的介入任务。 advaint针对目标扩散镶嵌模型中的自我和交叉注意区块，以分散图像生成过程中语义理解和迅速相互作用。 ADVPAINT还采用了两阶段的扰动策略，根据对象周围的扩大边界框来划分扰动区域，从而增强了各种形状和尺寸的不同面膜的稳健性。我们的实验结果表明，ADVPAINT的扰动在破坏对手的介入任务方面非常有效，表现优于现有方法。 ADVPAINT的FID提高了100分，精确度大幅下降。

Title: Semantic Latent Motion for Portrait Video Generation

Authors: Qiyuan Zhang, Chenyu Wu, Wenzhang Sun, Huaize Liu, Donglin Di, Wei Chen, Changqing Zou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10096
Pdf URL: https://arxiv.org/pdf/2503.10096
Copy Paste: [[2503.10096]] Semantic Latent Motion for Portrait Video Generation(https://arxiv.org/abs/2503.10096)
Keywords: generation, generative
Abstract: Recent advancements in portrait video generation have been noteworthy. However, existing methods rely heavily on human priors and pre-trained generation models, which may introduce unrealistic motion and lead to inefficient inference. To address these challenges, we propose Semantic Latent Motion (SeMo), a compact and expressive motion representation. Leveraging this representation, our approach achieves both high-quality visual results and efficient inference. SeMo follows an effective three-step framework: Abstraction, Reasoning, and Generation. First, in the Abstraction step, we use a carefully designed Mask Motion Encoder to compress the subject's motion state into a compact and abstract latent motion (1D token). Second, in the Reasoning step, long-term modeling and efficient reasoning are performed in this latent space to generate motion sequences. Finally, in the Generation step, the motion dynamics serve as conditional information to guide the generation model in synthesizing realistic transitions from reference frames to target frames. Thanks to the compact and descriptive nature of Semantic Latent Motion, our method enables real-time video generation with highly realistic motion. User studies demonstrate that our approach surpasses state-of-the-art models with an 81% win rate in realism. Extensive experiments further highlight its strong compression capability, reconstruction quality, and generative potential. Moreover, its fully self-supervised nature suggests promising applications in broader video generation tasks.
摘要：肖像视频产生的最新进展值得注意。但是，现有方法在很大程度上依赖于人类先验和预训练的生成模型，这可能会引入不现实的运动并导致推理效率低下。为了应对这些挑战，我们提出了语义潜在运动（SEMO），这是一种紧凑而表达的运动表示。利用这一表示形式，我们的方法既可以实现高质量的视觉结果和有效的推断。 SEMO遵循一个有效的三步框架：抽象，推理和发电。首先，在抽象步骤中，我们使用经过精心设计的面膜运动编码器将受试者的运动状态压缩为紧凑而抽象的潜在运动（1D令牌）。其次，在推理步骤中，在此潜在空间中进行长期建模和有效的推理以生成运动序列。最后，在生成步骤中，运动动力学是有条件信息，可以指导生成模型，以综合从参考帧到目标帧的现实过渡。由于语义潜在运动的紧凑和描述性质，我们的方法可以通过高度逼真的运动实时视频生成。用户研究表明，我们的方法超过了现实主义中胜率81％的最先进模型。广泛的实验进一步强调了其强大的压缩能力，重建质量和生成潜力。此外，其完全自我监督的本质提出了在更广泛的视频生成任务中的有希望的应用。

Title: Dream-IF: Dynamic Relative EnhAnceMent for Image Fusion

Authors: Xingxin Xu, Bing Cao, Yinan Xia, Pengfei Zhu, Qinghua Hu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10109
Pdf URL: https://arxiv.org/pdf/2503.10109
Copy Paste: [[2503.10109]] Dream-IF: Dynamic Relative EnhAnceMent for Image Fusion(https://arxiv.org/abs/2503.10109)
Keywords: restoration
Abstract: Image fusion aims to integrate comprehensive information from images acquired through multiple sources. However, images captured by diverse sensors often encounter various degradations that can negatively affect fusion quality. Traditional fusion methods generally treat image enhancement and fusion as separate processes, overlooking the inherent correlation between them; notably, the dominant regions in one modality of a fused image often indicate areas where the other modality might benefit from enhancement. Inspired by this observation, we introduce the concept of dominant regions for image enhancement and present a Dynamic Relative EnhAnceMent framework for Image Fusion (Dream-IF). This framework quantifies the relative dominance of each modality across different layers and leverages this information to facilitate reciprocal cross-modal enhancement. By integrating the relative dominance derived from image fusion, our approach supports not only image restoration but also a broader range of image enhancement applications. Furthermore, we employ prompt-based encoding to capture degradation-specific details, which dynamically steer the restoration process and promote coordinated enhancement in both multi-modal image fusion and image enhancement scenarios. Extensive experimental results demonstrate that Dream-IF consistently outperforms its counterparts.

Title: MoEdit: On Learning Quantity Perception for Multi-object Image Editing

Authors: Yanfeng Li, Kahou Chan, Yue Sun, Chantong Lam, Tong Tong, Zitong Yu, Keren Fu, Xiaohong Liu, Tao Tan
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10112
Pdf URL: https://arxiv.org/pdf/2503.10112
Copy Paste: [[2503.10112]] MoEdit: On Learning Quantity Perception for Multi-object Image Editing(https://arxiv.org/abs/2503.10112)
Keywords: generation
Abstract: Multi-object images are prevalent in various real-world scenarios, including augmented reality, advertisement design, and medical imaging. Efficient and precise editing of these images is critical for these applications. With the advent of Stable Diffusion (SD), high-quality image generation and editing have entered a new era. However, existing methods often struggle to consider each object both individually and as part of the whole image during editing, both of which are crucial for ensuring consistent quantity perception, resulting in suboptimal perceptual performance. To address these challenges, we propose MoEdit, an auxiliary-free multi-object image editing framework. MoEdit facilitates high-quality multi-object image editing in terms of style transfer, object reinvention, and background regeneration, while ensuring consistent quantity perception between inputs and outputs, even with a large number of objects. To achieve this, we introduce the Feature Compensation (FeCom) module, which ensures the distinction and separability of each object attribute by minimizing the in-between interlacing. Additionally, we present the Quantity Attention (QTTN) module, which perceives and preserves quantity consistency through effective control in editing, without relying on auxiliary tools. By leveraging the SD model, MoEdit enables customized preservation and modification of specific concepts in inputs with high quality. Experimental results demonstrate that our MoEdit achieves State-Of-The-Art (SOTA) performance in multi-object image editing. Data and code will be available at this https URL.

Title: Hybrid Agents for Image Restoration

Authors: Bingchen Li, Xin Li, Yiting Lu, Zhibo Chen
Subjects: cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2503.10120
Pdf URL: https://arxiv.org/pdf/2503.10120
Copy Paste: [[2503.10120]] Hybrid Agents for Image Restoration(https://arxiv.org/abs/2503.10120)
Keywords: restoration
Abstract: Existing Image Restoration (IR) studies typically focus on task-specific or universal modes individually, relying on the mode selection of users and lacking the cooperation between multiple task-specific/universal restoration modes. This leads to insufficient interaction for non-expert users and limits their restoration capability for complicated real-world applications. In this work, we present HybridAgent, intending to incorporate multiple restoration modes into a unified image restoration model and achieve intelligent and efficient user interaction through our proposed hybrid agents. Concretely, we propose the hybrid rule of fast, slow, and feedback restoration agents. Here, the slow restoration agent optimizes the powerful multimodal large language model (MLLM) with our proposed instruction-tuning dataset to identify degradations within images with ambiguous user prompts and invokes proper restoration tools accordingly. The fast restoration agent is designed based on a lightweight large language model (LLM) via in-context learning to understand the user prompts with simple and clear requirements, which can obviate the unnecessary time/resource costs of MLLM. Moreover, we introduce the mixed distortion removal mode for our HybridAgents, which is crucial but has not been addressed in previous agent-based works. It can effectively prevent the error propagation of step-by-step image restoration and largely improve the efficiency of the agent system. We validate the effectiveness of HybridAgent on both synthetic and real-world IR tasks.

Title: Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation

Authors: Yi Wu, Lingting Zhu, Lei Liu, Wandi Qiao, Ziqiang Li, Lequan Yu, Bin Li
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10125
Pdf URL: https://arxiv.org/pdf/2503.10125
Copy Paste: [[2503.10125]] Proxy-Tuning: Tailoring Multimodal Autoregressive Models for Subject-Driven Image Generation(https://arxiv.org/abs/2503.10125)
Keywords: generation
Abstract: Multimodal autoregressive (AR) models, based on next-token prediction and transformer architecture, have demonstrated remarkable capabilities in various multimodal tasks including text-to-image (T2I) generation. Despite their strong performance in general T2I tasks, our research reveals that these models initially struggle with subject-driven image generation compared to dominant diffusion models. To address this limitation, we introduce Proxy-Tuning, leveraging diffusion models to enhance AR models' capabilities in subject-specific image generation. Our method reveals a striking weak-to-strong phenomenon: fine-tuned AR models consistently outperform their diffusion model supervisors in both subject fidelity and prompt adherence. We analyze this performance shift and identify scenarios where AR models excel, particularly in multi-subject compositions and contextual understanding. This work not only demonstrates impressive results in subject-driven AR image generation, but also unveils the potential of weak-to-strong generalization in the image generation domain, contributing to a deeper understanding of different architectures' strengths and limitations.

Title: PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Authors: Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10127
Pdf URL: https://arxiv.org/pdf/2503.10127
Copy Paste: [[2503.10127]] PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models(https://arxiv.org/abs/2503.10127)
Keywords: generation
Abstract: In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image as two separate models, PlanGen jointly models the two tasks into one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding box coordinates, which provides significant advantages over the previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitasking training related to layout, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly expanded to layout-guided image manipulation thanks to the well-designed modeling, with teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of our PlanGen in multiple layout-related tasks, showing its great potential. Code is available at: this https URL.

Title: Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding

Authors: Shunqi Mao, Chaoyi Zhang, Weidong Cai
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10183
Pdf URL: https://arxiv.org/pdf/2503.10183
Copy Paste: [[2503.10183]] Through the Magnifying Glass: Adaptive Perception Magnification for Hallucination-Free VLM Decoding(https://arxiv.org/abs/2503.10183)
Keywords: generation
Abstract: Existing vision-language models (VLMs) often suffer from visual hallucination, where the generated responses contain inaccuracies that are not grounded in the visual input. Efforts to address this issue without model finetuning primarily mitigate hallucination by reducing biases contrastively or amplifying the weights of visual embedding during decoding. However, these approaches improve visual perception at the cost of impairing the language reasoning capability. In this work, we propose the Perception Magnifier (PM), a novel visual decoding method that iteratively isolates relevant visual tokens based on attention and magnifies the corresponding regions, spurring the model to concentrate on fine-grained visual details during decoding. Specifically, by magnifying critical regions while preserving the structural and contextual information at each decoding step, PM allows the VLM to enhance its scrutiny of the visual input, hence producing more accurate and faithful responses. Extensive experimental results demonstrate that PM not only achieves superior hallucination mitigation but also enhances language generation while preserving strong reasoning capability. Code is available at this https URL.

Title: Probability-Flow ODE in Infinite-Dimensional Function Spaces

Authors: Kunwoo Na, Junghyun Lee, Se-Young Yun, Sungbin Lim
Subjects: cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2503.10219
Pdf URL: https://arxiv.org/pdf/2503.10219
Copy Paste: [[2503.10219]] Probability-Flow ODE in Infinite-Dimensional Function Spaces(https://arxiv.org/abs/2503.10219)
Keywords: generation
Abstract: Recent advances in infinite-dimensional diffusion models have demonstrated their effectiveness and scalability in function generation tasks where the underlying structure is inherently infinite-dimensional. To accelerate inference in such models, we derive, for the first time, an analog of the probability-flow ODE (PF-ODE) in infinite-dimensional function spaces. Leveraging this newly formulated PF-ODE, we reduce the number of function evaluations while maintaining sample quality in function generation tasks, including applications to PDEs.
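
For orientation, the familiar finite-dimensional probability-flow ODE from score-based diffusion models is written out below. The symbols (drift f, diffusion coefficient g, Wiener process w, marginal density p_t) are standard notation assumed here, not taken from the abstract, and the paper's infinite-dimensional analog is not reproduced and may take a different form.

```latex
% Forward diffusion (finite-dimensional case, assumed notation):
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w

% Probability-flow ODE with the same marginals p_t as the SDE above:
\frac{\mathrm{d}x}{\mathrm{d}t} = f(x, t) - \frac{1}{2}\, g(t)^{2}\, \nabla_{x} \log p_{t}(x)
```

Because the ODE is deterministic, it can be integrated with a standard solver and a coarser time grid, which is the usual route to fewer function evaluations.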

Title: Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA

Authors: Zhixuan Li, Hyunse Yoon, Sanghoon Lee, Weisi Lin
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10225
Pdf URL: https://arxiv.org/pdf/2503.10225
Copy Paste: [[2503.10225]] Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA(https://arxiv.org/abs/2503.10225)
Keywords: generation
Abstract: Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset will be publicly released.

Title: ROODI: Reconstructing Occluded Objects with Denoising Inpainters

Authors: Yeonjin Chang, Erqun Dong, Seunghyeon Seo, Nojun Kwak, Kwang Moo Yi
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10256
Pdf URL: https://arxiv.org/pdf/2503.10256
Copy Paste: [[2503.10256]] ROODI: Reconstructing Occluded Objects with Denoising Inpainters(https://arxiv.org/abs/2503.10256)
Keywords: generative
Abstract: While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remain far from being solved. We propose a novel object extraction method based on two key principles: (1) being object-centric by pruning irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we analyze the local structure of primitives using K-nearest neighbors, and retain only relevant ones. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.
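
The pruning idea mentioned in the abstract (inspecting the local structure of primitives through their K-nearest neighbors) can be illustrated with a generic outlier filter. This is only a minimal sketch under stated assumptions: the function name `knn_prune`, the mean-distance criterion, and the `std_ratio` threshold are hypothetical choices for illustration, not the criterion used by ROODI.

```python
import math
import random
import statistics

def knn_prune(points, k=8, std_ratio=2.0):
    """Return one keep/drop flag per point (True means keep)."""
    mean_knn = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Average distance to the k nearest neighbours.
        mean_knn.append(sum(dists[:k]) / k)
    # Hypothetical rule: drop points whose neighbourhood is unusually sparse.
    threshold = statistics.fmean(mean_knn) + std_ratio * statistics.pstdev(mean_knn)
    return [d <= threshold for d in mean_knn]

random.seed(0)
# Dense blob standing in for the primitive centers of one object.
cluster = [tuple(random.gauss(0.0, 0.1) for _ in range(3)) for _ in range(200)]
# One isolated stray primitive far from the object.
floater = (5.0, 5.0, 5.0)
centers = cluster + [floater]

keep = knn_prune(centers, k=8)
pruned = [c for c, flag in zip(centers, keep) if flag]
```

The brute-force distance computation is quadratic in the number of points; a spatial index would be the natural replacement for real scenes.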

Title: KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception

Authors: Yunpeng Qu, Kun Yuan, Qizhi Xie, Ming Sun, Chao Zhou, Jian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10259
Pdf URL: https://arxiv.org/pdf/2503.10259
Copy Paste: [[2503.10259]] KVQ: Boosting Video Quality Assessment via Saliency-guided Local Perception(https://arxiv.org/abs/2503.10259)
Keywords: quality assessment
Abstract: Video Quality Assessment (VQA), which intends to predict the perceptual quality of videos, has attracted increasing attention. Due to factors like motion blur or specific distortions, the quality of different regions in a video varies. Recognizing the region-wise local quality within a video is beneficial for assessing global quality and can guide us in adopting fine-grained enhancement or transcoding strategies. Due to the heavy cost of annotating region-wise quality, the lack of ground truth constraints from relevant datasets further complicates the utilization of local perception. Inspired by the Human Visual System (HVS) that links global quality to the local texture of different regions and their visual saliency, we propose a Kaleidoscope Video Quality Assessment (KVQ) framework, which aims to effectively assess both saliency and local texture, thereby facilitating the assessment of global quality. Our framework extracts visual saliency and allocates attention using Fusion-Window Attention (FWA) while incorporating a Local Perception Constraint (LPC) to mitigate the reliance of regional texture perception on neighboring areas. KVQ obtains significant improvements across multiple scenarios on five VQA benchmarks compared to SOTA methods. Furthermore, to assess local perception, we establish a new Local Perception Visual Quality (LPVQ) dataset with region-wise annotations. Experimental results demonstrate the capability of KVQ in perceiving local distortions. KVQ models and the LPVQ dataset will be available at this https URL.

Title: MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

Authors: Zebin He, Mingxin Yang, Shuhui Yang, Yixuan Tang, Tao Wang, Kaihao Zhang, Guanying Chen, Yuhong Liu, Jie Jiang, Chunchao Guo, Wenhan Luo
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10289
Pdf URL: https://arxiv.org/pdf/2503.10289
Copy Paste: [[2503.10289]] MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion(https://arxiv.org/abs/2503.10289)
Keywords: generation
Abstract: Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.

Title: Towards Fast, Memory-based and Data-Efficient Vision-Language Policy

Authors: Haoxuan Li, Sixu Yan, Yuhan Li, Xinggang Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10322
Pdf URL: https://arxiv.org/pdf/2503.10322
Copy Paste: [[2503.10322]] Towards Fast, Memory-based and Data-Efficient Vision-Language Policy(https://arxiv.org/abs/2503.10322)
Keywords: generation
Abstract: Vision Language Models (VLMs) pretrained on Internet-scale vision-language data have demonstrated the potential to transfer their knowledge to robotic learning. However, the existing paradigm encounters three critical challenges: (1) expensive inference cost resulting from large-scale model parameters, (2) frequent domain shifts caused by mismatched data modalities, and (3) limited capacity to handle past or future experiences. In this work, we propose LiteVLP, a lightweight, memory-based, and general-purpose vision-language policy generation model. LiteVLP is built upon a pre-trained 1B-parameter VLM and fine-tuned on a tiny-scale and conversation-style robotic dataset. Through extensive experiments, we demonstrate that LiteVLP outperforms state-of-the-art vision-language policies on VIMA-Bench, with minimal training time. Furthermore, LiteVLP exhibits superior inference speed while maintaining exceptionally high accuracy. In long-horizon manipulation tasks, LiteVLP also shows remarkable memory ability, outperforming the best-performing baseline model by 18.8%. These results highlight LiteVLP as a promising model for integrating the intelligence of VLMs into robotic learning.

Title: IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification

Authors: Yuhao Wang, Yongfeng Lv, Pingping Zhang, Huchuan Lu
Subjects: cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2503.10324
Pdf URL: https://arxiv.org/pdf/2503.10324
Copy Paste: [[2503.10324]] IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification(https://arxiv.org/abs/2503.10324)
Keywords: generation
Abstract: Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary information from various modalities. However, existing methods focus on fusing heterogeneous visual features, neglecting the potential benefits of text-based semantic information. To address this issue, we first construct three text-enhanced multi-modal object ReID benchmarks. To be specific, we propose a standardized multi-modal caption generation pipeline for structured and concise text annotations with Multi-modal Large Language Models (MLLMs). Besides, current methods often directly aggregate multi-modal information without selecting representative local features, leading to redundancy and high complexity. To address the above issues, we introduce IDEA, a novel feature learning framework comprising the Inverted Multi-modal Feature Extractor (IMFE) and Cooperative Deformable Aggregation (CDA). The IMFE utilizes Modal Prefixes and an InverseNet to integrate multi-modal information with semantic guidance from inverted text. The CDA adaptively generates sampling positions, enabling the model to focus on the interplay between global features and discriminative local features. With the constructed benchmarks and the proposed modules, our framework can generate more robust multi-modal features under complex scenarios. Extensive experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.

Title: Generative Binary Memory: Pseudo-Replay Class-Incremental Learning on Binarized Embeddings

Authors: Yanis Basso-Bert, Anca Molnos, Romain Lemaire, William Guicquero, Antoine Dupret
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10333
Pdf URL: https://arxiv.org/pdf/2503.10333
Copy Paste: [[2503.10333]] Generative Binary Memory: Pseudo-Replay Class-Incremental Learning on Binarized Embeddings(https://arxiv.org/abs/2503.10333)
Keywords: generative
Abstract: In dynamic environments where new concepts continuously emerge, Deep Neural Networks (DNNs) must adapt by learning new classes while retaining previously acquired ones. This challenge is addressed by Class-Incremental Learning (CIL). This paper introduces Generative Binary Memory (GBM), a novel CIL pseudo-replay approach which generates synthetic binary pseudo-exemplars. Relying on Bernoulli Mixture Models (BMMs), GBM effectively models the multi-modal characteristics of class distributions, in a latent, binary space. With a specifically-designed feature binarizer, our approach applies to any conventional DNN. GBM also natively supports Binary Neural Networks (BNNs) for highly-constrained model sizes in embedded systems. The experimental results demonstrate that GBM achieves higher than state-of-the-art average accuracy on CIFAR100 (+2.9%) and TinyImageNet (+1.5%) for a ResNet-18 equipped with our binarizer. GBM also outperforms emerging CIL methods for BNNs, with +3.1% in final accuracy and x4.7 memory reduction, on CORE50.
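
To make the pseudo-replay idea concrete, the sketch below draws binary vectors from a Bernoulli mixture, which is the generic way a Bernoulli Mixture Model produces samples. It is an illustration under assumptions: the function `sample_bmm`, the two-component toy parameters, and the 4-bit embedding size are hypothetical, and fitting the mixture (for example with EM) and the paper's feature binarizer are not shown.

```python
import random

def sample_bmm(weights, probs, n_samples, rng):
    """Draw binary vectors from a Bernoulli mixture.

    weights: mixing proportions, one per component.
    probs:   per component, the probability that each bit equals 1.
    """
    samples, components = [], []
    for _ in range(n_samples):
        # Pick a mixture component according to the mixing weights.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Flip one biased coin per bit using that component's probabilities.
        bits = [1 if rng.random() < p else 0 for p in probs[k]]
        samples.append(bits)
        components.append(k)
    return samples, components

rng = random.Random(0)
# Toy class model with two modes over a 4-bit binary embedding (assumed values).
weights = [0.5, 0.5]
probs = [[0.95, 0.95, 0.05, 0.05],
         [0.05, 0.05, 0.95, 0.95]]
pseudo_exemplars, components = sample_bmm(weights, probs, 1000, rng)
```

Only the mixture parameters need to be stored per class, so the memory cost is independent of how many pseudo-exemplars are later drawn.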

Title: DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image

Authors: Qi Zhao, Zhan Ma, Pan Zhou
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10342
Pdf URL: https://arxiv.org/pdf/2503.10342
Copy Paste: [[2503.10342]] DreamInsert: Zero-Shot Image-to-Video Object Insertion from A Single Image(https://arxiv.org/abs/2503.10342)
Keywords: generative
Abstract: Recent developments in generative diffusion models have turned many dreams into realities. For video object insertion, existing methods typically require additional information, such as a reference video or a 3D asset of the object, to generate the synthetic motion. However, inserting an object from a single reference photo into a target background video remains an uncharted area due to the lack of unseen motion information. We propose DreamInsert, which achieves Image-to-Video Object Insertion in a training-free manner for the first time. By taking the trajectory of the object into consideration, DreamInsert can predict the unseen object movement, fuse it harmoniously with the background video, and generate the desired video seamlessly. More significantly, DreamInsert is both simple and effective, achieving zero-shot insertion without end-to-end training or additional fine-tuning on well-designed image-video data pairs. We demonstrated the effectiveness of DreamInsert through a variety of experiments. Leveraging this capability, we present the first results for Image-to-Video object insertion in a training-free manner, paving the way for exciting new directions in future content creation and synthesis. The code will be released soon.

Title: Enhancing Facial Privacy Protection via Weakening Diffusion Purification

Authors: Ali Salar, Qing Liu, Yingli Tian, Guoying Zhao
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10350
Pdf URL: https://arxiv.org/pdf/2503.10350
Copy Paste: [[2503.10350]] Enhancing Facial Privacy Protection via Weakening Diffusion Purification(https://arxiv.org/abs/2503.10350)
Keywords: generation
Abstract: The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance.

Title: ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation

Authors: Zirun Guo, Tao Jin
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10358
Pdf URL: https://arxiv.org/pdf/2503.10358
Copy Paste: [[2503.10358]] ConceptGuard: Continual Personalized Text-to-Image Generation with Forgetting and Confusion Mitigation(https://arxiv.org/abs/2503.10358)
Keywords: generation
Abstract: Diffusion customization methods have achieved impressive results with only a minimal number of user-provided images. However, existing approaches customize concepts collectively, whereas real-world applications often require sequential concept integration. This sequential nature can lead to catastrophic forgetting, where previously learned concepts are lost. In this paper, we investigate concept forgetting and concept confusion in continual customization. To tackle these challenges, we present ConceptGuard, a comprehensive approach that combines shift embedding, concept-binding prompts, and memory preservation regularization, supplemented by a priority queue that can adaptively update the importance and occurrence order of different concepts. These strategies can dynamically update, unbind, and learn the relationship of the previous concepts, thus alleviating concept forgetting and confusion. Through comprehensive experiments, we show that our approach outperforms all the baseline methods consistently and significantly in both quantitative and qualitative analyses.

Title: Piece it Together: Part-Based Concepting with IP-Priors

Authors: Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10365
Pdf URL: https://arxiv.org/pdf/2503.10365
Copy Paste: [[2503.10365]] Piece it Together: Part-Based Concepting with IP-Priors(https://arxiv.org/abs/2503.10365)
Keywords: generation, generative
Abstract: Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept (such as a uniquely structured wing or a specific hairstyle), serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.

Title: Probabilistic Forecasting via Autoregressive Flow Matching

Authors: Ahmed El-Gazzar, Marcel van Gerven
Subjects: cs.LG
Abstract URL: https://arxiv.org/abs/2503.10375
Pdf URL: https://arxiv.org/pdf/2503.10375
Copy Paste: [[2503.10375]] Probabilistic Forecasting via Autoregressive Flow Matching(https://arxiv.org/abs/2503.10375)
Keywords: generative
Abstract: In this work, we propose FlowTime, a generative model for probabilistic forecasting of multivariate timeseries data. Given historical measurements and optional future covariates, we formulate forecasting as sampling from a learned conditional distribution over future trajectories. Specifically, we decompose the joint distribution of future observations into a sequence of conditional densities, each modeled via a shared flow that transforms a simple base distribution into the next observation distribution, conditioned on observed covariates. To achieve this, we leverage the flow matching (FM) framework, enabling scalable and simulation-free learning of these transformations. By combining this factorization with the FM objective, FlowTime retains the benefits of autoregressive models -- including strong extrapolation performance, compact model size, and well-calibrated uncertainty estimates -- while also capturing complex multi-modal conditional distributions, as seen in modern transport-based generative models. We demonstrate the effectiveness of FlowTime on multiple dynamical systems and real-world forecasting tasks.
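The FlowTime abstract describes factorizing the future trajectory into per-step conditionals, each learned with flow matching. The following is a minimal sketch of how training examples for such a setup could be built, not the authors' implementation. It assumes a linear interpolation path between a Gaussian base sample and the next observation (one common flow matching choice; the abstract does not state which path is used). The function names `flow_matching_pair` and `autoregressive_pairs` are hypothetical.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one flow-matching regression example for the target sample x1.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I).
    """
    x0 = rng.standard_normal(x1.shape)   # sample from the simple base distribution
    t = rng.uniform()                    # time drawn uniformly from [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the path between x0 and x1
    velocity = x1 - x0                   # regression target for the velocity network
    return x_t, t, velocity

def autoregressive_pairs(series, rng):
    """For each step k >= 1, pair the history series[:k] with an example for series[k]."""
    examples = []
    for k in range(1, len(series)):
        x_t, t, velocity = flow_matching_pair(series[k], rng)
        examples.append({"history": series[:k], "x_t": x_t, "t": t, "target": velocity})
    return examples

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    series = np.arange(12.0).reshape(4, 3)   # toy multivariate series: 4 steps, 3 variables
    for example in autoregressive_pairs(series, rng):
        print(example["history"].shape, round(example["t"], 3))
```

A shared network would then regress `target` from `(x_t, t, history)`; sharing one network across steps is what keeps the model compact in this reading.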

Title: CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance

Authors: Yufan Deng, Xun Guo, Yizhi Wang, Jacob Zhiyuan Fang, Angtian Wang, Shenghai Yuan, Yiding Yang, Bo Liu, Haibin Huang, Chongyang Ma
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10391
Pdf URL: https://arxiv.org/pdf/2503.10391
Copy Paste: [[2503.10391]] CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance(https://arxiv.org/abs/2503.10391)
Keywords: generation, generative
Abstract: Video generation has witnessed remarkable progress with the advent of deep generative models, particularly diffusion models. While existing methods excel in generating high-quality videos from text prompts or single images, personalized multi-subject video generation remains a largely unexplored challenge. This task involves synthesizing videos that incorporate multiple distinct subjects, each defined by separate reference images, while ensuring temporal and spatial consistency. Current approaches primarily rely on mapping subject images to keywords in text prompts, which introduces ambiguity and limits their ability to model subject relationships effectively. In this paper, we propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. By leveraging MLLM to interpret subject relationships, our method facilitates scalability, enabling the use of large and diverse datasets for training. Furthermore, our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation. Through extensive evaluations, we demonstrate that our approach significantly improves subject consistency and overall video coherence, paving the way for advanced applications in storytelling, interactive media, and personalized video generation.

Title: Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders

Authors: Jingyu Guo, Sensen Gao, Jia-Wang Bian, Wanhu Sun, Heliang Zheng, Rongfei Jia, Mingming Gong
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10403
Pdf URL: https://arxiv.org/pdf/2503.10403
Copy Paste: [[2503.10403]] Hyper3D: Efficient 3D Representation via Hybrid Triplane and Octree Feature for Enhanced 3D Shape Variational Auto-Encoders(https://arxiv.org/abs/2503.10403)
Keywords: generation
Abstract: Recent 3D content generation pipelines often leverage Variational Autoencoders (VAEs) to encode shapes into compact latent representations, facilitating diffusion-based generation. Efficiently compressing 3D shapes while preserving intricate geometric details remains a key challenge. Existing 3D shape VAEs often employ uniform point sampling and 1D/2D latent representations, such as vector sets or triplanes, leading to significant geometric detail loss due to inadequate surface coverage and the absence of explicit 3D representations in the latent space. Although recent work explores 3D latent representations, their large scale hinders high-resolution encoding and efficient training. Given these challenges, we introduce Hyper3D, which enhances VAE reconstruction through efficient 3D representation that integrates hybrid triplane and octree features. First, we adopt an octree-based feature representation to embed mesh information into the network, mitigating the limitations of uniform point sampling in capturing geometric distributions along the mesh surface. Furthermore, we propose a hybrid latent space representation that integrates a high-resolution triplane with a low-resolution 3D grid. This design not only compensates for the lack of explicit 3D representations but also leverages a triplane to preserve high-resolution details. Experimental results demonstrate that Hyper3D outperforms traditional representations by reconstructing 3D shapes with higher fidelity and finer details, making it well-suited for 3D generation pipelines.
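The hybrid latent described in the Hyper3D abstract (a high-resolution triplane combined with a low-resolution 3D grid) can be illustrated with a toy feature lookup. This is only a sketch under stated assumptions: nearest-neighbor sampling, summation of the three plane features, and concatenation with the grid feature are illustrative choices, and the function name `query_hybrid` is hypothetical. The abstract does not specify how the two representations are sampled or fused.

```python
import numpy as np

def query_hybrid(point, planes, grid):
    """Look up a feature vector for a 3D point with coordinates in [0, 1).

    planes: dict with keys "xy", "xz", "yz", each an array of shape (R, R, C).
    grid:   array of shape (r, r, r, C_grid), with r much smaller than R.
    """
    x, y, z = point
    R = planes["xy"].shape[0]
    r = grid.shape[0]
    # Nearest-neighbor cell indices on the high-resolution planes.
    i, j, k = (min(int(c * R), R - 1) for c in (x, y, z))
    # Nearest-neighbor cell indices on the low-resolution 3D grid.
    gi, gj, gk = (min(int(c * r), r - 1) for c in (x, y, z))
    # Triplane feature: sum of the three axis-aligned projections.
    tri = planes["xy"][i, j] + planes["xz"][i, k] + planes["yz"][j, k]
    # Append the explicit 3D feature from the coarse grid.
    return np.concatenate([tri, grid[gi, gj, gk]])

if __name__ == "__main__":
    planes = {name: np.ones((4, 4, 2)) for name in ("xy", "xz", "yz")}
    grid = np.full((2, 2, 2, 3), 2.0)
    print(query_hybrid((0.9, 0.1, 0.5), planes, grid))
```

The point of the split is memory: three planes cost O(R^2) each while the grid costs O(r^3), so detail can live on the planes and explicit 3D structure on a small grid.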

Title: RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models

Authors: Yijing Lin, Mengqi Huang, Shuhan Zhuang, Zhendong Mao
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10406
Pdf URL: https://arxiv.org/pdf/2503.10406
Copy Paste: [[2503.10406]] RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models(https://arxiv.org/abs/2503.10406)
Keywords: generation
Abstract: Unifying diverse image generation tasks within a single framework remains a fundamental challenge in visual generation. While large language models (LLMs) achieve unification through task-agnostic data and generation, existing visual generation models fail to meet these principles. Current approaches either rely on per-task datasets and large-scale training or adapt pre-trained image models with task-specific modifications, limiting their generalizability. In this work, we explore video models as a foundation for unified image generation, leveraging their inherent ability to model temporal correlations. We introduce RealGeneral, a novel framework that reformulates image generation as a conditional frame prediction task, analogous to in-context learning in LLMs. To bridge the gap between video models and condition-image pairs, we propose (1) a Unified Conditional Embedding module for multi-modal alignment and (2) a Unified Stream DiT Block with decoupled adaptive LayerNorm and attention mask to mitigate cross-modal interference. RealGeneral demonstrates effectiveness in multiple important visual generation tasks, e.g., it achieves a 14.5% improvement in subject similarity for customized generation and a 10% enhancement in image quality for the canny-to-image task.

Title: Learning Disease State from Noisy Ordinal Disease Progression Labels

Authors: Gustav Schmidt, Holger Heidrich, Philipp Berens, Sarah Müller
Subjects: cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10440
Pdf URL: https://arxiv.org/pdf/2503.10440
Copy Paste: [[2503.10440]] Learning Disease State from Noisy Ordinal Disease Progression Labels(https://arxiv.org/abs/2503.10440)
Keywords: generation
Abstract: Learning from noisy ordinal labels is a key challenge in medical imaging. In this work, we ask whether ordinal disease progression labels (better, worse, or stable) can be used to learn a representation that allows disease state to be classified. For neovascular age-related macular degeneration (nAMD), we cast the problem of modeling disease progression between medical visits as a classification task with ordinal ranks. To enhance generalization, we tailor our model to the problem setting by (1) independent image encoding, (2) antisymmetric logit space equivariance, and (3) ordinal scale awareness. In addition, we address label noise by learning an uncertainty estimate for loss re-weighting. Our approach learns an interpretable disease representation enabling strong few-shot performance for the related task of nAMD activity classification from single images, despite being trained only on image pairs with ordinal disease progression labels.
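One way to read "independent image encoding" together with an antisymmetric pairwise output is that each visit is mapped to its own scalar severity score and the pair is compared through the score difference, so swapping the two visits flips the sign. The toy sketch below illustrates only that property. It is an assumption about the general idea, not the paper's model; `progression_label`, the scalar scores, and the `margin` threshold are all hypothetical.

```python
def progression_label(score_first, score_second, margin=0.5):
    """Classify progression between two visits from independently computed scores.

    Assumption: a higher score means a more severe disease state.
    The difference is antisymmetric: swapping the visits negates it.
    """
    diff = score_second - score_first
    if diff > margin:
        return "worse"
    if diff < -margin:
        return "better"
    return "stable"

if __name__ == "__main__":
    print(progression_label(1.0, 2.0))   # second visit more severe
    print(progression_label(2.0, 1.0))   # same pair in reverse order
    print(progression_label(1.0, 1.2))   # difference within the margin
```

Because each score depends on a single image, the same score can in principle be reused for single-image tasks, which matches the abstract's few-shot claim in spirit.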

Title: Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion

Authors: Evgeniia Vu, Andrei Boiarov, Dmitry Vetrov
Subjects: cs.LG, cs.CV, cs.HC
Abstract URL: https://arxiv.org/abs/2503.10488
Pdf URL: https://arxiv.org/pdf/2503.10488
Copy Paste: [[2503.10488]] Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion(https://arxiv.org/abs/2503.10488)
Keywords: generation
Abstract: Generating co-speech gestures in real time requires both temporal coherence and efficient sampling. We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation that extends rolling diffusion models with structured progressive noise scheduling, enabling seamless long-sequence motion synthesis while preserving realism and diversity. We further propose Rolling Diffusion Ladder Acceleration (RDLA), a new approach that restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously. This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup with high visual fidelity and temporal coherence. We evaluate our approach on ZEGGS and BEAT, strong benchmarks for real-world applicability. Our framework is universally applicable to any diffusion-based gesture generation model, transforming it into a streaming approach. Applied to three state-of-the-art methods, it consistently outperforms them, demonstrating its effectiveness as a generalizable and efficient solution for real-time, high-fidelity co-speech gesture synthesis.
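The difference between a progressive rolling schedule and a stepwise ladder can be shown with integer noise levels over a sliding window of frames. This is a toy sketch under an assumed discretization (frame i in a window of W frames carries noise level i + 1, and the ladder rounds levels up to multiples of a step size). The paper's actual schedule may differ, and all function names here are hypothetical.

```python
def rolling_levels(window):
    """Progressive schedule: each frame in the window has its own noise level."""
    return [i + 1 for i in range(window)]

def ladder_levels(window, step):
    """Ladder schedule: groups of `step` consecutive frames share one noise level."""
    return [((i // step) + 1) * step for i in range(window)]

def frames_at_lowest_level(levels):
    """Frames that reach the clean state together in the next denoising pass."""
    return levels.count(min(levels))

if __name__ == "__main__":
    print(rolling_levels(8))        # [1, 2, 3, 4, 5, 6, 7, 8]
    print(ladder_levels(8, 2))      # [2, 2, 4, 4, 6, 6, 8, 8]
    print(frames_at_lowest_level(ladder_levels(8, 2)))   # 2
```

With a step of 2, two frames leave the window per pass instead of one, which is consistent with the "up to 2x" speedup quoted in the abstract, provided each pass removes `step` noise levels at once.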

Title: Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction

Authors: Yuhan Wang, Cheng Liu, Daou Zhang, Weichao Wu
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10508
Pdf URL: https://arxiv.org/pdf/2503.10508
Copy Paste: [[2503.10508]] Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction(https://arxiv.org/abs/2503.10508)
Keywords: generative
Abstract: In the domain of Image Anomaly Detection (IAD), existing methods frequently lack fine-grained, interpretable semantic information. This deficiency often leads to the detection of anomalous entities or actions that are susceptible to machine illusions and lack sufficient explanation. In this thesis, we propose a novel approach to anomaly detection, termed Hoi2Anomaly, which aims to achieve precise discrimination and localization of anomalies. First, we construct a multi-modal instruction tuning dataset comprising human-object interaction (HOI) pairs in anomalous scenarios. Second, we train an HOI extractor in threat scenarios to localize and match anomalous actions and entities. Finally, explanatory content is generated for the detected anomalous HOI by fine-tuning the visual language pretraining (VLP) framework. The experimental results demonstrate that Hoi2Anomaly surpasses existing generative approaches in terms of precision and explainability. We will release Hoi2Anomaly for the advancement of the field of anomaly detection.

Title: Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression

Authors: Hooman Shahrokhi, Devjeet Raj Roy, Yan Yan, Venera Arnaoudova, Janaradhan Rao Doppa
Subjects: cs.LG, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10512
Pdf URL: https://arxiv.org/pdf/2503.10512
Copy Paste: [[2503.10512]] Conformal Prediction Sets for Deep Generative Models via Reduction to Conformal Regression(https://arxiv.org/abs/2503.10512)
Keywords: generation, generative
Abstract: We consider the problem of generating valid and small prediction sets by sampling outputs (e.g., software code and natural language text) from a black-box deep generative model for a given input (e.g., textual prompt). The validity of a prediction set is determined by a user-defined binary admissibility function depending on the target application. For example, a code generation application may require at least one program in the set to pass all test cases. To address this problem, we develop a simple and effective conformal inference algorithm referred to as Generative Prediction Sets (GPS). Given a set of calibration examples and black-box access to a deep generative model, GPS can generate prediction sets with provable guarantees. The key insight behind GPS is to exploit the inherent structure in the distribution over the minimum number of samples needed to obtain an admissible output, and to develop a simple conformal regression approach over that minimum number of samples. Experiments on multiple datasets for code and math word problems using different large language models demonstrate the efficacy of GPS over state-of-the-art methods.
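The reduction to conformal regression can be illustrated with standard one-sided split conformal calibration: a regressor predicts how many samples an input needs, and a margin computed on calibration data is added so that the sampled set contains an admissible output with probability at least 1 - alpha. The sketch below shows that generic recipe, not the GPS algorithm itself. The function names and the toy calibration numbers are hypothetical, and exchangeability of calibration and test inputs is assumed.

```python
import math

def conformal_margin(predicted, observed, alpha):
    """One-sided split-conformal margin for under-prediction.

    predicted: predicted minimum sample counts on the calibration set.
    observed:  true minimum sample counts needed on the same inputs.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest residual.
    """
    scores = sorted(o - p for p, o in zip(predicted, observed))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        raise ValueError("too few calibration examples for this alpha")
    return scores[rank - 1]

def samples_to_draw(predicted_k, margin):
    """Number of outputs to sample for a new input (at least one)."""
    return max(1, math.ceil(predicted_k + margin))

if __name__ == "__main__":
    predicted = [2, 3, 1, 4, 2, 3, 2]    # toy regressor outputs (made up)
    observed = [3, 3, 2, 6, 2, 5, 1]     # toy ground-truth counts (made up)
    margin = conformal_margin(predicted, observed, alpha=0.25)
    print(margin, samples_to_draw(3, margin))
```

A better regressor yields smaller residuals and therefore smaller prediction sets, while the coverage guarantee comes from the quantile step alone.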

Title: PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models

Authors: Zilu Guo, Hongbin Lin, Zhihao Yuan, Chaoda Zheng, Pengshuo Qiu, Dongzhi Jiang, Renrui Zhang, Chun-Mei Feng, Zhen Li
Subjects: cs.CV, cs.AI
Abstract URL: https://arxiv.org/abs/2503.10529
Pdf URL: https://arxiv.org/pdf/2503.10529
Copy Paste: [[2503.10529]] PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models(https://arxiv.org/abs/2503.10529)
Keywords: generation, generative
Abstract: 3D Multimodal Large Language Models (MLLMs) have recently made substantial advancements. However, their potential remains untapped, primarily due to the limited quantity and suboptimal quality of 3D datasets. Current approaches attempt to transfer knowledge from 2D MLLMs to expand 3D instruction data, but still face modality and domain gaps. To this end, we introduce PiSA-Engine (Point-Self-Augmented-Engine), a new framework for generating instruction point-language datasets enriched with 3D spatial semantics. We observe that existing 3D MLLMs offer a comprehensive understanding of point clouds for annotation, while 2D MLLMs excel at cross-validation by providing complementary information. By integrating holistic 2D and 3D insights from off-the-shelf MLLMs, PiSA-Engine enables a continuous cycle of high-quality data generation. We select PointLLM as the baseline and adopt this co-evolution training framework to develop an enhanced 3D MLLM, termed PointLLM-PiSA. Additionally, we identify limitations in previous 3D benchmarks, which often feature coarse language captions and insufficient category diversity, resulting in inaccurate evaluations. To address this gap, we further introduce PiSA-Bench, a comprehensive 3D benchmark covering six key aspects with detailed and diverse labels. Experimental results demonstrate PointLLM-PiSA's state-of-the-art performance in zero-shot 3D object captioning and generative classification on our PiSA-Bench, scoring 46.45% (+8.33%) and 63.75% (+16.25%), respectively. We will release the code, datasets, and benchmark.

Title: MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup

Authors: Youngjin Kwon, Xiao Zhang
Subjects: cs.CV, cs.CR
Abstract URL: https://arxiv.org/abs/2503.10549
Pdf URL: https://arxiv.org/pdf/2503.10549
Copy Paste: [[2503.10549]] MASQUE: A Text-Guided Diffusion-Based Framework for Localized and Customized Adversarial Makeup(https://arxiv.org/abs/2503.10549)
Keywords: generative
Abstract: As facial recognition is increasingly adopted for government and commercial services, its potential misuse has raised serious concerns about privacy and civil rights. To counteract this, various anti-facial recognition techniques have been proposed for privacy protection by adversarially perturbing face images, among which generative makeup-based approaches are the most popular. However, these methods, designed primarily to impersonate specific target identities, can only achieve weak dodging success rates while increasing the risk of targeted abuse. In addition, they often introduce global visual artifacts or lack the adaptability to accommodate diverse makeup prompts, compromising user satisfaction. To address the above limitations, we develop MASQUE, a novel diffusion-based framework that generates localized adversarial makeups guided by user-defined text prompts. Built upon precise null-text inversion, customized cross-attention fusion with masking, and a pairwise adversarial guidance mechanism using images of the same individual, MASQUE achieves robust dodging performance without requiring any external identity. Comprehensive evaluations on open-source facial recognition models and commercial APIs demonstrate that MASQUE significantly improves dodging success rates over all baselines, along with higher perceptual fidelity and stronger adaptability to various text makeup prompts.

Title: Autoregressive Image Generation with Randomized Parallel Decoding

Authors: Haopeng Li, Jinyue Yang, Guoqi Li, Huan Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10568
Pdf URL: https://arxiv.org/pdf/2503.10568
Copy Paste: [[2503.10568]] Autoregressive Image Generation with Randomized Parallel Decoding(https://arxiv.org/abs/2503.10568)
Keywords: generation
Abstract: We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation, addressing the inherent limitations of conventional raster-order approaches, which hinder inference efficiency and zero-shot generalization due to their sequential, predefined token generation order. Our key insight is that effective random-order modeling necessitates explicit guidance for determining the position of the next predicted token. To this end, we propose a novel guided decoding framework that decouples positional guidance from content representation, encoding them separately as queries and key-value pairs. By directly incorporating this guidance into the causal attention mechanism, our approach enables fully random-order training and generation, eliminating the need for bidirectional attention. Consequently, ARPG readily generalizes to zero-shot tasks such as image inpainting, outpainting, and resolution expansion. Furthermore, it supports parallel inference by concurrently processing multiple queries using a shared KV cache. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput while reducing memory consumption by over 75% compared to representative recent autoregressive models at a similar scale.

Title: Long Context Tuning for Video Generation

Authors: Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, Lu Jiang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10589
Pdf URL: https://arxiv.org/pdf/2503.10589
Copy Paste: [[2503.10589]] Long Context Tuning for Video Generation(https://arxiv.org/abs/2503.10589)
Keywords: generation
Abstract: Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate that single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation.

Title: CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

Authors: Hao He, Ceyuan Yang, Shanchuan Lin, Yinghao Xu, Meng Wei, Liangke Gui, Qi Zhao, Gordon Wetzstein, Lu Jiang, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10592
Pdf URL: https://arxiv.org/pdf/2503.10592
Copy Paste: [[2503.10592]] CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models(https://arxiv.org/abs/2503.10592)
Keywords: generation, generative
Abstract: This paper introduces CameraCtrl II, a framework that enables large-scale dynamic scene exploration through a camera-controlled video diffusion model. Previous camera-conditioned video generative models suffer from diminished video dynamics and a limited range of viewpoints when generating videos with large camera movement. We take an approach that progressively expands the generation of dynamic scenes -- first enhancing dynamic content within an individual video clip, then extending this capability to create seamless explorations across broad viewpoint ranges. Specifically, we construct a dataset featuring a large degree of dynamics with camera parameter annotations for training while designing a lightweight camera injection module and training scheme to preserve dynamics of the pretrained models. Building on these improved single-clip techniques, we enable extended scene exploration by allowing users to iteratively specify camera trajectories for generating coherent video sequences. Experiments across diverse scenarios demonstrate that CameraCtrl II enables camera-controlled dynamic scene synthesis with substantially wider spatial exploration than previous approaches.

Title: MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction

Authors: Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li, Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Haoqian Wang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10604
Pdf URL: https://arxiv.org/pdf/2503.10604
Copy Paste: [[2503.10604]] MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction(https://arxiv.org/abs/2503.10604)
Keywords: generation
Abstract: Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods exhibit substantial performance deterioration under significant viewpoint deviations from training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, providing comprehensive supervision signals to refine 3DGS representations and make rendering more robust under extreme viewpoint changes. Experiments on the Waymo Open Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
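Summary: Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) for autonomous driving. Critical limitations remain, however: reconstruction-based methods degrade substantially when the viewpoint deviates significantly from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a multi-modal diffusion model with Gaussian Splatting (GS) for urban scene reconstruction. MuDG uses aggregated LiDAR point clouds together with RGB and geometric priors to condition a multi-modal video diffusion model, which synthesizes photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization and provides comprehensive supervision signals for refining 3DGS representations, making rendering more robust under extreme viewpoint changes. Experiments on the Waymo Open Dataset show that MuDG outperforms existing methods in both reconstruction and synthesis quality.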

Title: DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation

Authors: Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10618
Pdf URL: https://arxiv.org/pdf/2503.10618
Copy Paste: [[2503.10618]] DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation(https://arxiv.org/abs/2503.10618)
Keywords: generation
Abstract: In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures, including PixArt-style and MMDiT variants, and compare them with a standard DiT variant which directly processes concatenated text and noise inputs. Surprisingly, our findings reveal that the performance of standard DiT is comparable to that of those specialized models, while demonstrating superior parameter-efficiency, especially when scaled up. Leveraging a layer-wise parameter sharing strategy, we achieve a further reduction of 66% in model size compared to an MMDiT architecture, with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive, surpassing most existing models despite its compact size.
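Summary: In this work, we empirically study Diffusion Transformers (DiTs) for text-to-image generation, focusing on architectural choices, text-conditioning strategies, and training protocols. We evaluate a range of DiT-based architectures, including PixArt-style and MMDiT variants, and compare them with a standard DiT variant that directly processes concatenated text and noise inputs. Surprisingly, we find that the standard DiT performs comparably to these specialized models while being more parameter-efficient, especially when scaled up. Using a layer-wise parameter sharing strategy, we reduce model size by a further 66% relative to an MMDiT architecture with minimal performance impact. Building on an in-depth analysis of critical components such as text encoders and Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With supervised and reward fine-tuning, DiT-Air achieves state-of-the-art performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly competitive and surpasses most existing models despite its compact size.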

Title: DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding

Authors: Ayesha Ishaq, Jean Lahoud, Ketan More, Omkar Thawakar, Ritesh Thawkar, Dinura Dissanayake, Noor Ahsan, Yuhao Li, Fahad Shahbaz Khan, Hisham Cholakkal, Ivan Laptev, Rao Muhammad Anwer, Salman Khan
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10621
Pdf URL: https://arxiv.org/pdf/2503.10621
Copy Paste: [[2503.10621]] DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding(https://arxiv.org/abs/2503.10621)
Keywords: generation
Abstract: While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at this https URL.
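Summary: Although large multimodal models (LMMs) perform strongly across many Visual Question Answering (VQA) tasks, some challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions are made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Common VQA benchmarks, however, tend to focus on the accuracy of the final answer and overlook the reasoning process that produces accurate responses. Existing methods also lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark designed specifically to advance step-wise visual reasoning for autonomous driving. The benchmark contains over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model fine-tuned on our reasoning dataset, which shows robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on the proposed dataset, systematically comparing their reasoning capabilities on autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy and a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at this https URL.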

Title: Transformers without Normalization

Authors: Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Subjects: cs.LG, cs.AI, cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10622
Pdf URL: https://arxiv.org/pdf/2503.10622
Copy Paste: [[2503.10622]] Transformers without Normalization(https://arxiv.org/abs/2503.10622)
Keywords: generation
Abstract: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
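Summary: Normalization layers are ubiquitous in modern neural networks and have long been regarded as essential. This work shows that Transformers without normalization can achieve the same or better performance using a very simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. With DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, from recognition to generation, from supervised to self-supervised learning, and from computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks and offer new insights into their role in deep networks.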
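
Note: The element-wise operation stated in the abstract, DyT(x) = tanh(alpha * x), can be illustrated with a minimal sketch. It assumes alpha is a plain Python float supplied by the caller; how alpha is initialized or trained, and any further parameters the authors may use, are not given in the abstract and are not modeled here. The function name and the sample values are illustrative only.

```python
import math


def dyt(xs, alpha):
    """Apply tanh(alpha * x) to every element of xs (a list of floats).

    alpha is treated as a fixed scalar here, which is an assumption
    made for illustration.
    """
    return [math.tanh(alpha * x) for x in xs]


# Large-magnitude inputs are squashed toward -1 or 1, while inputs near
# zero pass through almost linearly, giving the S-shaped mapping.
activations = [-4.0, -0.5, 0.0, 0.5, 4.0]
squashed = dyt(activations, alpha=0.5)
```

Unlike a normalization layer, this operation needs no statistics computed over the elements of the input; each element is transformed independently.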

Title: NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models

Authors: Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael Black
Subjects: cs.CV, cs.AI, cs.LG, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10626
Pdf URL: https://arxiv.org/pdf/2503.10626
Copy Paste: [[2503.10626]] NIL: No-data Imitation Learning by Leveraging Pre-trained Video Diffusion Models(https://arxiv.org/abs/2503.10626)
Keywords: generation, generative
Abstract: Acquiring physically plausible motor skills across diverse and unconventional morphologies (including humanoid robots, quadrupeds, and animals) is essential for advancing character simulation and robotics. Traditional methods, such as reinforcement learning (RL), are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but relies heavily on high-quality expert demonstrations, which are difficult to obtain for non-human morphologies. Video diffusion models, on the other hand, are capable of generating realistic videos of various morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach for skill acquisition that learns 3D motor skills from 2D-generated videos, with generalization capability to unconventional and non-human forms. Specifically, we guide the imitation learning process by leveraging vision transformers for video-based comparisons, calculating the pair-wise distance between video embeddings. Along with video-encoding distance, we also use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. In humanoid robot locomotion tasks, we demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of leveraging generative video models for physically plausible skill learning with diverse morphologies, effectively replacing data collection with data generation for imitation learning.
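Summary: Acquiring physically plausible motor skills across diverse and unconventional morphologies, including humanoid robots, quadrupeds, and animals, is essential for advancing character simulation and robotics. Traditional methods such as reinforcement learning (RL) are task- and body-specific, require extensive reward function engineering, and do not generalize well. Imitation learning offers an alternative but depends heavily on high-quality expert demonstrations, which are hard to obtain for non-human morphologies. Video diffusion models, by contrast, can generate realistic videos of many morphologies, from humans to ants. Leveraging this capability, we propose a data-independent approach to skill acquisition that learns 3D motor skills from generated 2D videos and generalizes to unconventional and non-human forms. Specifically, we guide the imitation learning process by using vision transformers for video-based comparison, computing the pair-wise distance between video embeddings. In addition to the video-encoding distance, we use a computed similarity between segmented video frames as a guidance reward. We validate our method on locomotion tasks involving unique body configurations. On humanoid robot locomotion tasks, we show that 'No-data Imitation Learning' (NIL) outperforms baselines trained on 3D motion-capture data. Our results highlight the potential of generative video models for physically plausible skill learning across diverse morphologies, effectively replacing data collection with data generation for imitation learning.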
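
Note: The two guidance signals named in the abstract, a distance between video embeddings and a similarity between segmented frames, can be combined into a reward as in the rough sketch below. The abstract does not specify the metrics or how they are combined, so everything here is an assumption: Euclidean distance for embeddings, intersection-over-union (IoU) for binary segmentation masks, and a weighted sum controlled by a hypothetical `iou_weight` parameter. Embeddings and masks are passed in as plain lists; no vision transformer or segmentation model is run.

```python
import math


def embedding_distance(a, b):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as flat lists of 0/1 (assumed metric)."""
    inter = sum(1 for p, q in zip(mask_a, mask_b) if p and q)
    union = sum(1 for p, q in zip(mask_a, mask_b) if p or q)
    return inter / union if union else 1.0


def guidance_reward(ref_emb, sim_emb, ref_masks, sim_masks, iou_weight=1.0):
    """Higher when the simulated rollout resembles the generated reference.

    ref_* come from the generated video, sim_* from the rendered rollout.
    """
    ious = [mask_iou(a, b) for a, b in zip(ref_masks, sim_masks)]
    mean_iou = sum(ious) / len(ious)
    return -embedding_distance(ref_emb, sim_emb) + iou_weight * mean_iou


# A rollout identical to the reference versus one that differs from it.
same = guidance_reward([0.0, 0.0], [0.0, 0.0], [[1, 1, 0, 0]], [[1, 1, 0, 0]])
diff = guidance_reward([0.0, 0.0], [3.0, 4.0], [[1, 1, 0, 0]], [[1, 0, 0, 0]])
```

In this toy setup the identical rollout scores 1.0 and the differing one scores -4.5, so a policy maximizing the reward is pushed toward the reference motion.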

Title: HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Authors: Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, Chengkai Hou, Mengdi Zhao, KC alex Zhou, Pheng-Ann Heng, Shanghang Zhang
Subjects: cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2503.10631
Pdf URL: https://arxiv.org/pdf/2503.10631
Copy Paste: [[2503.10631]] HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model(https://arxiv.org/abs/2503.10631)
Keywords: generation
Abstract: Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
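Summary: Recent advances in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, which enable robots to perform generalized manipulation. Although existing autoregressive VLA methods draw on large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods add a diffusion head to predict continuous actions but rely solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, we propose a collaborative training recipe that injects diffusion modeling directly into next-token prediction. With this recipe, we find that the two forms of action prediction not only reinforce each other but also perform differently across tasks. We therefore design a collaborative action ensemble mechanism that adaptively fuses the two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.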

Title: The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation

Authors: Ho Kei Cheng, Alexander Schwing
Subjects: cs.LG, cs.CV
Abstract URL: https://arxiv.org/abs/2503.10636
Pdf URL: https://arxiv.org/pdf/2503.10636
Copy Paste: [[2503.10636]] The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation(https://arxiv.org/abs/2503.10636)
Keywords: generation
Abstract: Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior, and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to subpar performance. To bridge this gap, we propose conditional optimal transport (C$^2$OT), which adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions in 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code is available at this https URL.
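Summary: Minibatch optimal transport coupling straightens paths in unconditional flow matching. This makes inference computationally cheaper, because fewer integration steps and simpler numerical solvers suffice when solving the ordinary differential equation at test time. In the conditional setting, however, minibatch optimal transport falls short. The default optimal transport mapping disregards conditions, which results in a conditionally skewed prior distribution during training. At test time, by contrast, we have no access to the skewed prior and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to subpar performance. To bridge it, we propose conditional optimal transport (C^2OT), which adds a conditional weighting term to the cost matrix when computing the optimal transport assignment. Experiments show that this simple fix works with both discrete and continuous conditions on 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall than existing baselines across different function evaluation budgets. Code is available at this https URL.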
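
Note: The idea of adding a condition-dependent term to the optimal transport cost matrix can be illustrated with a toy sketch. The abstract does not give the form of the weighting term, so the version below is an assumption: a squared Euclidean cost plus `weight` times a 0/1 mismatch indicator for discrete conditions. It is also assumed that each noise sample carries the condition of the data sample it was originally drawn with. The assignment is found by brute force over permutations, which is only practical for tiny batches.

```python
from itertools import permutations


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ot_assignment(noise, data, noise_cond, data_cond, weight):
    """Return a tuple perm such that noise[i] is paired with data[perm[i]].

    cost(i, j) = ||noise[i] - data[j]||^2 + weight * [cond_i != cond_j]
    The mismatch penalty is an illustrative choice, not the paper's formula.
    """
    n = len(noise)

    def cost(i, j):
        mismatch = 0.0 if noise_cond[i] == data_cond[j] else 1.0
        return squared_distance(noise[i], data[j]) + weight * mismatch

    # Exhaustive search over all pairings: O(n!), for illustration only.
    return min(
        permutations(range(n)),
        key=lambda perm: sum(cost(i, perm[i]) for i in range(n)),
    )


noise = [[0.0], [1.0]]
data = [[1.1], [0.1]]
conds = ["a", "b"]

# weight=0 ignores conditions and pairs by distance alone.
plain = ot_assignment(noise, data, conds, conds, weight=0.0)
# A large weight keeps each noise sample with a same-condition data sample.
conditional = ot_assignment(noise, data, conds, conds, weight=10.0)
```

Here `plain` is `(1, 0)`, a cross-condition swap that shortens paths, while `conditional` is `(0, 1)`, which respects the conditions. For realistic batch sizes a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would replace the brute-force search.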

Title: Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective

Authors: Xiaoming Zhao, Alexander G. Schwing
Subjects: cs.CV, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2503.10638
Pdf URL: https://arxiv.org/pdf/2503.10638
Copy Paste: [[2503.10638]] Studying Classifier(-Free) Guidance From a Classifier-Centric Perspective(https://arxiv.org/abs/2503.10638)
Keywords: generation
Abstract: Classifier-free guidance has become a staple for conditional generation with denoising diffusion models. However, a comprehensive understanding of classifier-free guidance is still missing. In this work, we carry out an empirical study to provide a fresh perspective on classifier-free guidance. Concretely, instead of solely focusing on classifier-free guidance, we trace back to the root, i.e., classifier guidance, pinpoint the key assumption for the derivation, and conduct a systematic study to understand the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, i.e., areas where conditional information is usually entangled and is hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built upon flow-matching to shrink the gap between the learned distribution for a pre-trained denoising diffusion model and the real data distribution, mainly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.
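Summary: Classifier-free guidance has become a staple of conditional generation with denoising diffusion models, yet a comprehensive understanding of it is still missing. In this work, we carry out an empirical study to offer a fresh perspective on classifier-free guidance. Concretely, rather than focusing on classifier-free guidance alone, we trace back to its root, classifier guidance, pinpoint the key assumption behind the derivation, and conduct a systematic study of the role of the classifier. We find that both classifier guidance and classifier-free guidance achieve conditional generation by pushing the denoising diffusion trajectories away from decision boundaries, that is, regions where conditional information is usually entangled and hard to learn. Based on this classifier-centric understanding, we propose a generic postprocessing step built on flow matching that shrinks the gap between the distribution learned by a pre-trained denoising diffusion model and the real data distribution, mainly around the decision boundaries. Experiments on various datasets verify the effectiveness of the proposed approach.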
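
Note: For readers unfamiliar with classifier-free guidance, the sketch below shows one commonly used form of the guided noise prediction, in which the conditional prediction is extrapolated away from the unconditional one. This formula is not taken from the abstract, and other parameterizations of the guidance scale exist. The predictions are passed in as plain lists of floats; no diffusion model is run, and the function name and values are illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions.

    guided = eps_uncond + scale * (eps_cond - eps_uncond)
    scale = 0 gives the unconditional prediction, scale = 1 the
    conditional one, and scale > 1 extrapolates past the conditional one.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_u = [0.0, 2.0]
eps_c = [1.0, 4.0]
guided = classifier_free_guidance(eps_u, eps_c, scale=3.0)
```

With `scale=3.0` the result is `[3.0, 8.0]`, three times as far from the unconditional prediction as the conditional prediction is.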

Title: GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Authors: Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li
Subjects: cs.CV
Abstract URL: https://arxiv.org/abs/2503.10639
Pdf URL: https://arxiv.org/pdf/2503.10639
Copy Paste: [[2503.10639]] GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing(https://arxiv.org/abs/2503.10639)
Keywords: generation
Abstract: Current image generation and editing methods primarily process textual prompts as direct inputs without reasoning about visual composition and explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains capturing semantic-spatial relationships. To leverage the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show our GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. Additionally, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.
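Summary: Current image generation and editing methods mainly process textual prompts as direct inputs, without reasoning about visual composition or explicit operations. We present Generation Chain-of-Thought (GoT), a novel paradigm that performs generation and editing through an explicit language reasoning process before outputting images. This approach turns conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements. We define the formulation of GoT and construct large-scale GoT datasets containing over 9M samples with detailed reasoning chains that capture semantic-spatial relationships. To exploit the advantages of GoT, we implement a unified framework that integrates Qwen2.5-VL for reasoning chain generation with an end-to-end diffusion model enhanced by our novel Semantic-Spatial Guidance Module. Experiments show that the GoT framework achieves excellent performance on both generation and editing tasks, with significant improvements over baselines. In addition, our approach enables interactive visual generation, allowing users to explicitly modify reasoning steps for precise image adjustments. GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that align better with human intent. To facilitate future research, we make our datasets, code, and pretrained models publicly available at this https URL.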